Recent advancements in computational methods have facilitated large-scale sampling of protein structures, leading to breakthroughs in protein structure prediction and enabling de novo protein design. Establishing methods to identify candidate structures that can lead to native folds or designable structures remains a challenge, since few existing metrics capture high-level structural features such as architectures, folds, and conformity to conserved structural motifs. Convolutional Neural Networks (CNNs) have been successfully used in semantic segmentation, a subfield of image classification in which a class label is predicted for every pixel. Here, we apply semantic segmentation to protein structures as a novel strategy for fold identification and structural quality assessment. We represent protein structures as 2D α-carbon distance matrices (“contact maps”) and train a CNN that assigns each residue in a multi-domain protein to one of 38 architecture classes designated by the CATH database. Our model achieves a per-residue accuracy of 90.8% on the test set (95.0% average accuracy over all classes; 87.8% average within-structure accuracy). A distinguishing feature of our classifier is that it encodes sequence-agnostic residue environments from the PDB and can express structural quality as quantitative probabilities. We demonstrate that individual class probabilities can serve both as a metric that indicates the degree to which a randomly generated structure assumes a specific fold and as a metric that highlights non-conforming regions of a protein belonging to a known class. These capabilities yield a powerful tool for guiding structural sampling for both structure prediction and design.
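As a minimal illustration of the input representation described above, the following Python sketch builds a pairwise α-carbon distance matrix from a list of 3D coordinates. The function name `ca_distance_matrix` and the example coordinates are hypothetical and chosen only for illustration; this is not the authors' preprocessing code, and any cropping, padding, or normalization applied before the CNN is not shown.

```python
import math


def ca_distance_matrix(ca_coords):
    """Return the symmetric L x L matrix of pairwise C-alpha distances.

    ca_coords: sequence of (x, y, z) tuples, one per residue, in angstroms.
    """
    n = len(ca_coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            # Distances are symmetric, so fill both triangles at once.
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Hypothetical C-alpha coordinates for three residues (not real PDB data).
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
dmat = ca_distance_matrix(coords)
```

Because the matrix depends only on inter-residue distances, it is invariant to rotation and translation of the structure, and it contains no amino acid identities, which is consistent with the sequence-agnostic framing of the classifier.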
Recent computational advances have allowed researchers to predict the structures of many proteins from their amino acid sequences and to design new sequences that fold into predefined structures. However, these tasks are often challenging because they require selecting a small subset of promising structural models from a large pool of stochastically generated ones. Here, we describe a novel approach to protein model selection that uses 2D image classification techniques to evaluate 3D protein models. Our method can be used to select structures based on the fold that they adopt and to identify regions of low structural quality. These capabilities yield a powerful tool for both protein design and structure prediction.