Near-native dataset of comparative protein structure models
General
Description of the dataset
This database contains a total of 152 near-native protein structure
models generated by comparative modeling. The set was obtained after a
careful classification and selection of models from a large database composed
of 3,375 “good models”. These “good models” were calculated by large-scale
comparative modeling of the protein chains representative of the Protein Data
Bank (PDB, Berman et al., 2002) and have been used in previous studies (Sánchez and Sali, 1998; Melo et
al., 2002; Melo and Marti-Renom, 2006). The models in this set
were derived to assess the performance of knowledge-based potentials in the discrimination
between native and near-native protein structures (see Figure 1). They
are 100 amino acids or longer, have at least 90% equivalent α-carbons
with their corresponding native structures, a target chain coverage of 90% or
more, and a global root mean square deviation (RMSD) of less
than 3.0 Å for all α-carbons. All models were built for monomeric target proteins. The models in this set represent
78 distinct folds. Therefore, even though not all the models were built for
proteins representing different folds, they are not strongly biased toward any
particular fold, since a large fraction of them have a distinct fold. In terms of
a more general classification, such as the one defined by the composition and
arrangement of secondary structure elements, the set is a fairly good and
only weakly biased representation: 26% of the models contain only alpha
helices as secondary structure elements, 29% contain only beta sheets, 17%
contain alpha and beta, and 28% contain alpha plus beta
secondary structure elements.
Building the set
This set of near-native protein models was selected from an
existing set of 3,375 models with a correct fold that has been described
previously (Sánchez and Sali 1998; Melo et al. 2002).
This original set of 3,375 models was built by the comparative modeling of
representative chains of the Protein Data Bank (PDB) (Berman et al. 2002). The
models were built based on the correct templates and mostly correct alignments
between the target sequences and the template structures. The models were
obtained by applying MODPIPE to 1,085 representative chains of the PDB (Sánchez
and Sali 1998). These representative sequences
corresponded to the protein chains in the PDB that shared less than 30% sequence
identity or differed in length by more than 30 residues. The templates for comparative
modeling, selected by MODPIPE, were 1,637 PDB chains with less than 80% sequence
identity to each other or more than 30 residues difference in length. Each
target sequence was aligned separately with each of the 1,637 known
structures using the program ALIGN, which implements local sequence alignment by
dynamic programming (Altschul 1998). Only the target-template alignments with a significance
score higher than 22 nats (corresponding
approximately to a PSI-BLAST e-value of 10⁻⁴) were used, resulting in
3,993 models. Models with less than 30% structural overlap with the actual
experimentally determined structure were eliminated. Structural overlap was
defined as the fraction of equivalent Cα
atoms upon least-squares superposition of the two structures with a 3.5 Å cutoff. This procedure also removed models based on correct
templates that had a poor alignment and models based on templates that had
large domain or rigid body movements with respect to the target structure. The
final set contained 3,375 correct models (Melo et al. 2002).
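The structural overlap criterion can be illustrated with a minimal sketch. It assumes that the model and native Cα coordinates have already been superposed by least squares and are paired by index, that a pair counts as equivalent when its distance is at most 3.5 Å, and that the fraction is taken over the number of paired atoms. The function name `structural_overlap` and the toy coordinates are hypothetical and are not part of MODPIPE or of the original procedure.

```python
import math

def structural_overlap(model_ca, native_ca, cutoff=3.5):
    """Fraction of paired C-alpha atoms lying within `cutoff` angstroms.

    Assumption: both coordinate lists are already superposed and the
    i-th model atom corresponds to the i-th native atom.
    """
    if len(model_ca) != len(native_ca):
        raise ValueError("coordinate lists must be paired one-to-one")
    if not model_ca:
        return 0.0
    # Count pairs whose distance does not exceed the cutoff.
    close = sum(
        1 for m, n in zip(model_ca, native_ca) if math.dist(m, n) <= cutoff
    )
    return close / len(model_ca)

# Toy coordinates (hypothetical): pair distances are 1, 0, 10 and 5 angstroms.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
native = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

overlap = structural_overlap(model, native)
keep_model = overlap >= 0.30  # models below 30% overlap were eliminated
print(overlap, keep_model)
```

In this toy case two of the four pairs fall within the cutoff, so the overlap is 0.5 and the model would pass the 30% threshold.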
The set of 3,375
correct models was initially filtered by updating the target and template
structures to those currently available at the PDB and checking the sequence alignments
originally used to build the models. A total of 132 models presented
inconsistencies between the target sequence in the original alignment used to
build the model and the current target sequence available at the PDB. These 132
models were removed, leaving a total of 3,243 entries (Fig. 1). Then
a second filter was applied, and only those protein models longer
than 100 residues, with at least 90% equivalent α-carbons with their
corresponding native structures, a target chain coverage of 90% or
more, and a global root mean square deviation (RMSD) of less than 3.0 Å
for all α-carbons were selected. The final set contains a total of 152
models, which are defined as near-native models. All protein models, along with their
corresponding three-dimensional superpositions, can be
downloaded from the table below.
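The second filter can be written as a simple predicate over per-model statistics. This is a sketch only: the `ModelStats` record, its field names, and the example values are hypothetical, and the statistics are assumed to have been computed beforehand. The length threshold follows the wording of this section (longer than 100 residues); the general description states it as 100 or more, so the boundary case is exposed as a parameter.

```python
from dataclasses import dataclass

@dataclass
class ModelStats:
    """Hypothetical per-model summary; all values assumed precomputed."""
    name: str
    length: int                    # number of residues in the model
    equivalent_ca_fraction: float  # fraction of equivalent alpha-carbons
    target_coverage: float         # fraction of the target chain covered
    ca_rmsd: float                 # global RMSD over all alpha-carbons (angstroms)

def is_near_native(m, min_length_exclusive=100):
    """Apply the four near-native criteria described in the text."""
    return (
        m.length > min_length_exclusive
        and m.equivalent_ca_fraction >= 0.90
        and m.target_coverage >= 0.90
        and m.ca_rmsd < 3.0
    )

# Hypothetical example records, not taken from the dataset.
candidates = [
    ModelStats("model_a", 150, 0.95, 0.97, 1.8),  # passes all criteria
    ModelStats("model_b", 90, 0.95, 0.97, 1.8),   # too short
    ModelStats("model_c", 150, 0.85, 0.97, 1.8),  # too few equivalent alpha-carbons
    ModelStats("model_d", 150, 0.95, 0.97, 3.0),  # RMSD not below 3.0
]

near_native = [m.name for m in candidates if is_near_native(m)]
print(near_native)
```

Only `model_a` satisfies all four conditions in this toy list.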
Figure 1: Flow chart of the procedure used to
build the set of near-native protein models. (A) Flowchart of the procedure used to generate the set of 152
near-native models. The source protein model set originally consisted of
3,375 models with the correct fold, which were built in a previous study using
MODPIPE (the original set is available here).
After applying several filtering criteria, a subset of near-native protein
models was produced. For details see Methods. (B) Distributions of some model
features in the final set of 152 near-native models.
File name | Type | File format | Number of elements | General description |
| list | plain text | 518 PDB file names | A list of the experimental protein structures for calculating the statistical potentials. |
| list | plain text | 152 model names | A list of models. |
| coordinates | PDB format (plain text) | 152 files | PDB files of model structures. |
| coordinates | PDB format (plain text) | 152 files | PDB files of target structures. |
| coordinates | PDB format (plain text) | 152 files | PDB files of each model superposed with its corresponding native structure. |
References
Altschul, S. (1998)
Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J.D., and Zardecki, C. (2002)
Melo, F. and Marti-Renom, M. (2006)
Melo, F., Sánchez, R., and Sali, A. (2002)
Sali, A., Fiser, A., Sánchez, R., Marti-Renom, M.A., Jerkovic, B., Badretdinov, A., Melo, F., Overington, J., and Feyfant, E. (2001). MODELLER, A Protein Structure Modeling Program, Release 6v0.
Sánchez, R. and Sali, A. (1998)