Near Native dataset of comparative protein structure models



General Description of the dataset

This database contains a total of 152 near-native protein structure models generated by comparative modeling. The set was obtained after a careful classification and selection of models from a large database composed of 3,375 “good models”. These “good models” were calculated by large-scale comparative modeling of the protein chains representative of the Protein Data Bank (PDB, Berman et al., 2002) and have been used in previous studies (Sánchez and Sali, 1998; Melo et al., 2002; Melo and Marti-Renom, 2006). The near-native models were derived to assess the performance of knowledge-based potentials in discriminating between native and near-native protein structures (see Figure 1).

The models are at least 100 amino acids long, have at least 90% equivalent α-carbons with their corresponding native structures, a target chain coverage of at least 90%, and a global root mean square deviation (RMSD) of less than 3.0 Å for all α-carbons. All models were built for monomeric target proteins. The models in this set represent 78 distinct folds. Therefore, even though not all the models were built for proteins representing different folds, the set is not strongly biased toward any particular fold, since a large fraction of the models have a distinct fold. In terms of a more general classification, such as the one defined by the composition and arrangement of secondary structure elements, the set is also fairly well balanced: 26% of the models contain only alpha helices as secondary structure elements, 29% contain only beta sheets, 17% contain alpha and beta, and 28% contain alpha plus beta secondary structure elements.

Building the set

This set of near-native protein models was selected from an existing set of 3,375 models with a correct fold that has been described previously (Sánchez and Sali 1998; Melo et al. 2002). This original set of 3,375 models was built by comparative modeling of representative chains of the Protein Data Bank (PDB) (Berman et al. 2002). The models were built based on the correct templates and mostly correct alignments between the target sequences and the template structures. The models were obtained by applying MODPIPE to 1,085 representative chains of the PDB (Sánchez and Sali 1998). These representative sequences corresponded to the protein chains in the PDB that shared <30% sequence identity or differed in length by more than 30 residues. The templates for comparative modeling, selected by MODPIPE, were 1,637 PDB chains with <80% sequence identity to each other or a length difference of more than 30 residues. Each target sequence was aligned separately with each one of the 1,637 known structures using the program ALIGN, which implements local sequence alignment by dynamic programming (Altschul 1998). Only the target-template alignments with a significance score higher than 22 nats (corresponding approximately to a PSI-BLAST e-value of 10⁻⁴) were used, resulting in 3,993 models. Models with <30% structural overlap with the actual experimentally determined structure were eliminated. Structural overlap was defined as the fraction of equivalent atoms upon least-squares superposition of the two structures with a 3.5 Å cutoff. This procedure also removed models based on correct templates that had a poor alignment and models based on templates that had large domain or rigid-body movements with respect to the target structure. The final set contained 3,375 correct models (Melo et al. 2002).
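The structural overlap criterion can be illustrated with a minimal Python sketch. It assumes the two structures have already been superposed by least squares and that atoms are paired one-to-one, and it takes the number of paired atoms as the denominator; the function name, the pairing, and the choice of denominator are assumptions made for illustration, not details taken from the original procedure.

```python
import math


def structural_overlap(model_xyz, native_xyz, cutoff=3.5):
    """Fraction of paired atoms within `cutoff` angstroms of each other.

    Assumes both coordinate lists are already superposed and that
    model_xyz[i] corresponds to native_xyz[i] (hypothetical pairing).
    """
    if len(model_xyz) != len(native_xyz):
        raise ValueError("coordinate lists must be paired one-to-one")
    if not model_xyz:
        return 0.0
    close = sum(
        1 for a, b in zip(model_xyz, native_xyz) if math.dist(a, b) <= cutoff
    )
    return close / len(model_xyz)


# Toy coordinates: the pairwise distances are 0, 1, 3 and 7 angstroms.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 3.0), (3.0, 0.0, 0.0)]

overlap = structural_overlap(model, native)
keep_model = overlap >= 0.30  # models with <30% overlap were eliminated
```

In this toy case three of the four pairs fall within 3.5 Å, so the overlap is 0.75 and the model would be kept.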

The set of 3,375 correct models was initially filtered by updating the target and template structures to those currently available at the PDB and checking the sequence alignments originally used to build the models. A total of 132 models presented inconsistencies between the target sequence in the original alignment used to build the model and the current target sequence available at the PDB. These 132 models were removed, resulting in a total of 3,243 entries (Fig. 1). Then, a second filter was applied, and only those protein models with a length larger than 100 residues, at least 90% equivalent α-carbons with their corresponding native structures, a target chain coverage equal to or larger than 90%, and a global root mean square deviation (RMSD) of less than 3.0 Å for all α-carbons were selected. The final set contains a total of 152 models, which are defined as near-native models. All protein models, along with their corresponding three-dimensional superpositions, can be downloaded from the table below.
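The second filter can be written as a simple predicate over per-model statistics. The sketch below is only illustrative: the record type, its field names, and the example values are hypothetical, and the length test follows the wording of the paragraph above ("larger than 100 residues"), whereas the general description states the same criterion as at least 100 amino acids.

```python
from dataclasses import dataclass


@dataclass
class ModelRecord:
    """Hypothetical per-model summary; field names are assumptions."""
    name: str
    length: int                    # number of residues in the model
    equivalent_ca_fraction: float  # fraction of equivalent alpha-carbons
    target_coverage: float         # fraction of the target chain covered
    ca_rmsd: float                 # global RMSD over all alpha-carbons, in angstroms


def is_near_native(m: ModelRecord) -> bool:
    """Apply the four selection criteria described in the text."""
    return (
        m.length > 100
        and m.equivalent_ca_fraction >= 0.90
        and m.target_coverage >= 0.90
        and m.ca_rmsd < 3.0
    )


# Made-up records, for illustration only.
candidates = [
    ModelRecord("model_a", 150, 0.95, 0.97, 1.8),  # passes every criterion
    ModelRecord("model_b", 80, 0.98, 0.99, 1.2),   # too short
    ModelRecord("model_c", 220, 0.85, 0.95, 2.1),  # too few equivalent alpha-carbons
    ModelRecord("model_d", 180, 0.93, 0.92, 3.4),  # RMSD too high
]

near_native = [m.name for m in candidates if is_near_native(m)]
```

Applied to the 3,243 consistent entries, a filter of this kind is what reduces the collection to the 152 near-native models.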


Figure 1: Flowchart of the procedure used to build the set of near-native protein models. (A) Flowchart of the procedure used to generate the set of 152 near-native models. The source protein model set originally consisted of 3,375 models with the correct fold, which were built in a previous study using MODPIPE (the original set is available here). After applying several filtering criteria, a subset of near-native protein models was produced. For details see Methods. (B) Distributions of some model features in the final set of 152 near-native models.


| File name | File format | Number of elements | General description |
| | plain text | 518 PDB file names | A list of the experimental protein structures for calculating the statistical potentials. |
| | plain text | 152 model names | A list of models. |
| | PDB format (plain text) | 152 files | PDB files of model structures. |
| | PDB format (plain text) | 152 files | PDB files of target structures. |
| | PDB format (plain text) | 152 files | PDB files of each model superposed with its corresponding native structure. |
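Because the structures are distributed in PDB format, α-carbon coordinates can be read directly from the fixed-column ATOM records. The following Python sketch is one way to do so and to compute an RMSD over paired α-carbons that are already superposed; the function names are hypothetical, and how the model and the native structure are distinguished inside a superposition file is not specified here, so pairing the two coordinate lists is left to the reader.

```python
import math


def read_ca_atoms(pdb_lines):
    """Return (chain, residue number, (x, y, z)) for each alpha-carbon.

    Uses the fixed columns of PDB ATOM records: atom name in columns 13-16,
    chain in 22, residue number in 23-26, and x, y, z in 31-38, 39-46, 47-54.
    HETATM records (for example calcium ions, also named CA) are skipped.
    """
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM  "):
            continue
        if line[12:16].strip() != "CA":
            continue
        chain = line[21]
        resseq = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((chain, resseq, xyz))
    return atoms


def ca_rmsd(xyz_a, xyz_b):
    """RMSD over paired coordinates that are assumed to be superposed already."""
    if len(xyz_a) != len(xyz_b) or not xyz_a:
        raise ValueError("need two non-empty, equally long coordinate lists")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(xyz_a, xyz_b))
    return math.sqrt(total / len(xyz_a))
```

A typical use would be to call `read_ca_atoms` on the lines of a model file and of its target file, match residues by chain and residue number, and pass the matched coordinates to `ca_rmsd`.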





Altschul, S. (1998)
Generalized affine gap costs for protein sequence alignment.
Proteins 32, 88-96.

Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J.D., and Zardecki, C. (2002)
The Protein Data Bank.
Acta Crystallogr D Biol Crystallogr. 58, 899-907.

Melo, F. and Marti-Renom, M. (2006)
Accuracy of sequence alignment and fold assessment using reduced amino acid alphabets.
Proteins 63(4), 986-995.

Melo, F., Sánchez, R., and Sali, A. (2002)
Statistical potentials for fold assessment.
Protein Science 11, 430-448.

Sali, A., Fiser, A., Sánchez, R., Marti-Renom, M.A., Jerkovic, B., Badretdinov, A., Melo, F., Overington, J., and Feyfant, E. (2001)
MODELLER, A Protein Structure Modeling Program, Release 6v0.

Sánchez, R. and Sali, A. (1998)
Large-scale protein structure modeling of the Saccharomyces cerevisiae genome.
Proc. Natl. Acad. Sci. USA 95, 13597-13602.