Accurate and Highly Accurate datasets of protein structure models

 

 

General Description of the dataset

This database contains a total of 55 highly accurate and 57 accurate protein structure models. Both sets were obtained after a careful classification and selection of models from a large database composed of 3,375 “good models”. These “good models” were calculated by large-scale comparative modelling of the protein chains representative of the Protein Data Bank (PDB, Berman et al., 2002) and have been used in previous studies (Sánchez and Sali, 1998; Melo et al., 2002; Melo and Marti-Renom, 2006). The accurate and highly accurate sets of models were derived to assess the performance of knowledge-based potentials in the detection of structural errors of small magnitude (see Figure 1). The first set (class A) contains 55 highly accurate protein models. The second set (class B) contains 57 accurate protein models. The highly accurate protein models from the class A set have more than 95% equivalent α-carbons with their corresponding native structures and a total root mean square deviation (RMSD) of less than 1.1 Å for all α-carbons. The accurate protein models from the class B set have more than 90% equivalent α-carbons with their corresponding native structures and a total RMSD of more than 1.5 Å for all α-carbons.

 

Building the sets

The 57 accurate and 55 highly accurate protein models were selected from an existing set of 3,375 models with a correct fold that has been described previously (Sánchez and Sali 1998; Melo et al. 2002). This original set of 3,375 models was built by comparative modelling of representative chains of the Protein Data Bank (PDB) (Berman et al. 2002). The models were built based on the correct templates and mostly correct alignments between the target sequences and the template structures. The models were obtained by applying MODPIPE to 1,085 representative chains of the PDB (Sánchez and Sali 1998). These representative sequences corresponded to the protein chains in the PDB that shared <30% sequence identity or differed by more than 30 residues in length. The templates for comparative modelling, selected by MODPIPE, were 1,637 PDB chains with <80% sequence identity to each other or a difference of more than 30 residues in length. Each target sequence was aligned separately with each one of the 1,637 known structures using the program ALIGN, which implements local sequence alignment by dynamic programming (Altschul 1998). Only the target-template alignments with a significance score higher than 22 nats (corresponding approximately to a PSI-BLAST e-value of 10⁻⁴) were used, resulting in 3,993 models. Models with <30% structural overlap with the actual experimentally determined structure were eliminated. Structural overlap was defined as the fraction of equivalent Cα atoms upon least-squares superposition of the two structures with a 3.5 Å cut-off. This procedure also removed models based on correct templates that had a poor alignment and models based on templates that had large domain or rigid-body movements with respect to the target structure. The final set contained 3,375 correct models (Melo et al. 2002). These models are available here.
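The structural overlap criterion described above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset. It assumes that the two structures have already been superposed by least squares and that the Cα atoms are paired residue by residue; the function names and the choice of the number of paired residues as the denominator are assumptions made for illustration.

```python
import math


def structural_overlap(model_ca, native_ca, cutoff=3.5):
    """Fraction of paired C-alpha atoms lying within `cutoff` angstroms.

    Both arguments are equal-length sequences of (x, y, z) tuples that are
    assumed to be already superposed and paired residue by residue.
    The denominator (number of pairs) is an assumption of this sketch.
    """
    if len(model_ca) != len(native_ca):
        raise ValueError("coordinate lists must have the same length")
    if not model_ca:
        return 0.0
    close = sum(
        1 for m, n in zip(model_ca, native_ca) if math.dist(m, n) <= cutoff
    )
    return close / len(model_ca)


def passes_overlap_filter(overlap, minimum=0.30):
    """Models with less than 30% structural overlap are eliminated."""
    return overlap >= minimum


# Toy coordinates: the pair distances are 0, 1, 4 and 5 angstroms.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (10.0, 0.0, 4.0), (0.0, 0.0, 0.0)]
overlap = structural_overlap(model, native)
print(overlap, passes_overlap_filter(overlap))
```

In the toy example two of the four pairs fall within 3.5 Å, so the overlap is 0.5 and the model would be kept.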

The set of 3,375 correct models was initially filtered by updating the target and template structures to those currently available at the PDB and checking the sequence alignments originally used to build the models. A total of 132 models presented inconsistencies between the target sequence in the original alignment used to build the model and the current target sequence available at the PDB. These 132 models were removed, resulting in a total of 3,243 entries (see Figure 1). A second filter was then applied to select only those protein models longer than 100 residues for which more than 90% of the residues could be modelled. Finally, as explained above, two independent filters were applied to select the models belonging to the class A and class B sets (see Figure 1). All models in both sets were built for monomeric target proteins. PDB files of each model superposed with its corresponding native structure can be downloaded from the links available in the table below.
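The length, coverage and class filters can be summarised in a short Python sketch. It uses the thresholds quoted in this document (length above 100 residues, more than 90% of residues modelled, and the class A and class B criteria from the General Description). The `ModelStats` record, its field names and the `assign_class` function are hypothetical and are not part of the distributed files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelStats:
    """Hypothetical per-model summary; field names are illustrative only."""
    name: str
    length: int                    # number of residues
    modelled_fraction: float       # fraction of residues that were modelled
    equivalent_ca_fraction: float  # fraction of equivalent C-alpha atoms
    ca_rmsd: float                 # RMSD over all C-alpha atoms, in angstroms


def assign_class(m: ModelStats) -> Optional[str]:
    """Return "A", "B" or None according to the criteria quoted in the text."""
    # Second filter: longer than 100 residues, more than 90% modelled.
    if m.length <= 100 or m.modelled_fraction <= 0.90:
        return None
    # Class A: more than 95% equivalent C-alpha atoms and RMSD below 1.1 A.
    if m.equivalent_ca_fraction > 0.95 and m.ca_rmsd < 1.1:
        return "A"
    # Class B: more than 90% equivalent C-alpha atoms and RMSD above 1.5 A,
    # exactly as the criterion is stated in the General Description.
    if m.equivalent_ca_fraction > 0.90 and m.ca_rmsd > 1.5:
        return "B"
    return None


examples = [
    ModelStats("toy1", 150, 0.98, 0.97, 0.8),
    ModelStats("toy2", 220, 0.95, 0.92, 1.9),
    ModelStats("toy3", 80, 0.99, 0.99, 0.5),
]
for m in examples:
    print(m.name, assign_class(m))
```

The three toy records are invented values chosen to exercise each branch; they do not correspond to entries in the dataset.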

 

Identification and definition of incorrectly modelled residues

To identify the residues that contained structural errors, all accurate and highly accurate protein models were optimally superposed on their corresponding native target structures using MODELLER release 7v7 (Sali et al. 2001). Residues that had an RMSD larger than 1.8 Å for all main chain atoms and a total sidechain RMSD larger than 3.5 Å were defined as wrongly modelled. Otherwise, the residues were defined as correctly modelled. This binary classification of structural quality for each residue is called the ‘real classification’. Under this classification scheme, 201 residues (1.95%) are defined as wrongly modelled in the 55 models belonging to the class A set, which contains a total of 10,295 residues. In the 57 models belonging to the class B set, 1,257 residues (11.73%) are defined as wrongly modelled, out of a total of 10,714 residues. Though arbitrary, the RMSD cutoffs used to define wrongly modelled residues were chosen after visual inspection of all protein models following optimal superposition with their corresponding native structures. This definition of error clearly correlates with the observed structural deviation. All protein models, along with their corresponding error definitions and scripts for the RasMol software (Sayle and Milner-White 1995; Bernstein 2000) to visualize them in three dimensions, can be downloaded from the table below.
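The per-residue rule and the reported percentages can be checked with a small Python sketch. The function names are illustrative assumptions; the cutoffs (1.8 Å for main chain atoms, 3.5 Å for the sidechain) and the residue counts are those given in the text.

```python
def is_wrongly_modelled(backbone_rmsd, sidechain_rmsd,
                        backbone_cutoff=1.8, sidechain_cutoff=3.5):
    """Apply the rule as stated in the text: both RMSD values must exceed
    their cutoff (in angstroms) for the residue to count as wrongly modelled."""
    return backbone_rmsd > backbone_cutoff and sidechain_rmsd > sidechain_cutoff


def error_percentage(n_wrong, n_total):
    """Percentage of wrongly modelled residues in a set."""
    return 100.0 * n_wrong / n_total


# Residue counts quoted in the text for the two sets.
print(round(error_percentage(201, 10295), 2))   # class A
print(round(error_percentage(1257, 10714), 2))  # class B
```

Rounded to two decimals, the two calls reproduce the 1.95% and 11.73% figures quoted above.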


Figure 1: Flow chart of the procedure used to build the sets of comparative protein models. The source protein model set originally consisted of 3,375 models with the correct fold, which were built in a previous study using MODPIPE (Sánchez and Sali 1998). After applying several filtering criteria, two subsets of models were produced.

 

 

Raw data

 

Models class A

| File name | Type | File format | Number of elements | General description |
|---|---|---|---|---|
| classA.list | list | plain text | 55 model names | A list of class A models |
| classA.mdl.tar.gz | coordinates | PDB format (plain text) | 55 files | The PDB files of the models (“mdl”) class A. |
| classA.spd.tar.gz | coordinates | PDB format (plain text) | 55 files | The PDB files of each model class A superposed (“spd”) with their corresponding native structure. |
| errors_A.tar.gz | data file | plain text | 55 files | Four column file. 1) residue number, 2) backbone cRMSD, 3) sidechain cRMSD, 4) incorrectly modelled residue (1) or correctly modelled residue (0). |
| scripts_A.tar.gz | RasMol script | plain text | 55 files | RasMol scripts to visualize those residues wrongly modelled. |

Models class B

| File name | Type | File format | Number of elements | General description |
|---|---|---|---|---|
| classB.list | list | plain text | 57 model names | A list of class B models |
| classB.mdl.tar.gz | coordinates | PDB format (plain text) | 57 files | The PDB files of all models (“mdl”) class B. |
| classB.spd.tar.gz | coordinates | PDB format (plain text) | 57 files | The PDB files of each model class B superposed (“spd”) with their corresponding native structure. |
| errors_B.tar.gz | data file | plain text | 57 files | Four column file. 1) residue number, 2) backbone cRMSD, 3) sidechain cRMSD, 4) incorrectly modelled residue (1) or correctly modelled residue (0). |
| scripts_B.tar.gz | RasMol script | plain text | 57 files | RasMol scripts to visualize those residues wrongly modelled. |
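As an illustration of how the four-column error files might be read, the following Python sketch parses lines of that form. It assumes whitespace-separated columns in the order given in the table (residue number, backbone cRMSD, sidechain cRMSD, 0/1 flag); the exact delimiter, the presence of header lines and the function name are assumptions, and the sample lines are invented rather than taken from the archives.

```python
def parse_error_lines(lines):
    """Parse four-column error records into tuples.

    Each record is (residue_id, backbone_crmsd, sidechain_crmsd, is_wrong).
    The residue identifier is kept as a string; lines that do not have four
    fields or whose numeric columns cannot be parsed are skipped.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        residue_id, backbone, sidechain, flag = fields
        try:
            records.append(
                (residue_id, float(backbone), float(sidechain), flag == "1")
            )
        except ValueError:
            # Non-numeric line (for example a header): ignore it.
            continue
    return records


# Invented sample lines in the assumed layout.
sample = [
    "1   0.35  0.80  0",
    "2   2.10  4.25  1",
    "3   0.50  1.10  0",
]
records = parse_error_lines(sample)
n_wrong = sum(1 for r in records if r[3])
print(len(records), n_wrong)
```

To read a real file, the same function can be applied to an open file handle, for example `parse_error_lines(open(path))`, where `path` points to one of the files extracted from the errors archive.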

 

 

 

References

 

 

Altschul, S. (1998)
Generalized affine gap costs for protein sequence alignment.
Proteins 32, 88-96.

Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J.D., and Zardecki, C. (2002)
The Protein Data Bank.
Acta Crystallogr D Biol Crystallogr. 58, 899-907.

Melo, F. and Marti-Renom, M. (2006)
Accuracy of sequence alignment and fold assessment using reduced amino acid alphabets.
Proteins 63(4), 986-995.

Melo, F., Sánchez, R., and Sali, A. (2002)
Statistical potentials for fold assessment.
Protein Science 11, 430-448.

Sali, A., Fiser, A., Sánchez, R., Marti-Renom, M.A., Jerkovic, B., Badretdinov, A., Melo, F., Overington, J., and Feyfant, E. (2001)
MODELLER, A Protein Structure Modeling Program, Release 6v0.
http://salilab.org/modeller/.

Sánchez, R. and Sali, A. (1998)
Large-scale protein structure modeling of the Saccharomyces cerevisiae genome.
Proc. Natl. Acad. Sci. USA 95, 13597-13602.

Sayle, R., and Milner-White, E.J. (1995)
RasMol: Biomolecular graphics for all.
TIBS 20, 374.