ValTrendsDB

Help

What is ValTrendsDB?

ValTrendsDB is a web database of relationships discovered between various structure and ligand factors. The relationships stored in the database were discovered by analysing large amounts of freely available (macro)molecular data from prominent databases. These data encompass all macromolecular structures and associated metadata from the Protein Data Bank database, as well as validation information on structures and their ligands. ValTrendsDB covers only macromolecules whose structures have been obtained using X-ray crystallography.

Users can explore relationships via plots of factor pairs that were generated and assessed during the analysis. Alternatively, users can draw plots of factor pair relationships with fully custom settings, or explore the value distribution of a single factor. The whole database, as well as all source data, can also be downloaded.

The current version of ValTrendsDB is based on data from 17th February 2016. The database is updated once a year, or on request.

Table of contents

Description of ValTrendsDB pages

Explore relationships

On page Explore relationships, you can explore (non)existing relationships between pairs of factors. You can select the first and the second factor from a set of tiles that are located to the left and right on the page below the plot. When you first visit this page, two factors will already be preselected - their tiles, along with the tiles of groups they belong to, will be striped green.

If you are interested in a different version of analysis data than the latest one, you can select your desired version using the version picker that is located below the plot. Note, however, that selecting a different version will reset the currently selected pair of visualized factors.

To choose a different factor for an axis of the plot, click on its tile (or click on the tile of its group, and then on the tile of the requested factor that appears above it). If the newly selected factor or factor group is incompatible with the factor or factor group selected for the other axis of the plot (its tile is then filled with a lighter shade of gray), the selection for the other axis is cancelled and the compatible factors you can choose from are painted solid green.

The plot at the top of page Explore relationships is an information-rich visualisation tool made specifically for ValTrendsDB. The X axis and the primary Y axis are each labeled with the factor group of the factor visualized on them, along with the minimum and maximum values of that factor. The secondary Y axis shows the number of PDB structures in each interval as a blue bar plot, and is labeled with the total number of PDB structures considered in the plot. The X axis shows interval borders as well as box plot-like floating bars whose lower border, center bar and upper border mark the lower quartile, median and upper quartile (the center bar may be missing, in which case the median is equal either to the lower or to the upper quartile). A whole floating bar can be reduced to a single line symbol, in which case all three values it represents are equal to the value of the symbol. Average values of the factor visualized on the primary Y axis are shown as red dots above the relevant intervals.
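
The per-interval values that such a plot displays can be sketched in a few lines of Python. This is only an illustration, not the code behind ValTrendsDB: it assumes that the floating bars summarise the Y-axis factor values of the structures in one interval, and the function name `summarise_interval` and the choice of the "inclusive" quartile method are assumptions.

```python
from statistics import mean, quantiles


def summarise_interval(y_values):
    """Summary shown for one X-axis interval (hypothetical helper).

    Needs at least two values; the quartile method is an assumption.
    """
    q1, median, q3 = quantiles(y_values, n=4, method="inclusive")
    return {
        "count": len(y_values),    # height of the bar on the secondary Y axis
        "lower_quartile": q1,      # lower border of the floating bar
        "median": median,          # center bar
        "upper_quartile": q3,      # upper border of the floating bar
        "mean": mean(y_values),    # dot above the interval
    }


# Illustrative values only, not taken from the database.
print(summarise_interval([1, 2, 3, 4, 5]))
```

When the lower quartile, median and upper quartile coincide, the three borders collapse into the single line symbol described above.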

The information content of the plot can be reduced using two switches below the data version picker. The left switch hides the box plots, while the right switch hides the underlying bar plot that shows the number of structures in each interval.

Custom visualization

Page Custom visualization lets you visualize relationships between pairs of factors using your custom settings. The plot that you can draw has the same features as the plot on page Explore relationships. However, you have much more control over the plotting process. Plots created on this page also differ slightly from the ones on page Explore relationships: instead of a semi-manual classification of data into manually refined intervals, page Custom visualization splits the data of the X axis factor automatically using an algorithm.

First, you can choose a factor for the X axis, as well as a factor for the Y axis. The plot will be drawn with default settings that have been determined during the process of statistical validation. If you cannot select a pair of factor groups or factors, it means that such a combination was not considered during the analysis.

Then, you can enable the custom plot settings section. Using that, you can:

Statistics

Feature-rich plots that are offered to you on other pages of ValTrendsDB offer insight into relationships between pairs of relevant factors. However, they do not enable enough insight into value distribution of individual factors. This insight is provided on page Statistics, where you can draw plots that show number of occurrences of either each value, or a range of values of a factor (in the case of either a continuous factor, or a discrete factor with too many distinct values).

To draw such a plot, simply choose a factor from the relevant drop-down menu. The X axis of these plots represents factor values (distinct values, or ranges of values). The Y axis represents the number of occurrences and has a logarithmic scale, so that rarely occurring values remain visible next to very frequent ones.

You can also download additional results of relationship evaluation, analysis and statistical validation from this page, such as color coded tables of relationship strength of relevant pairs of factors.

Data download

All data that are relevant to this database can be downloaded from page Data download. That includes raw data from both sources, as well as the complete dataset of ValTrendsDB with merged entries for all PDB structures. Additional results of statistical processing and validation can be downloaded from this page as well.

Scientific introduction

Large molecules of biological origin (or biomacromolecules in short) have been the focus of research for a long time. This is no surprise, since they constitute every living organism. The advancement of methods for structure determination has facilitated the acquisition of new structures, which are mostly stored in publicly available databases, the most prominent being the Protein Data Bank (PDB).

In recent years, it has been discovered that some published structures contain serious errors that in several cases led to retraction of the relevant papers[1-3]. The scientific community reacted by developing software tools designed to validate structures, both old and new. The first tools to be developed were made for validation of standard residues of biomacromolecules (their basic building blocks, e.g. amino acids in the case of proteins)[4-7]. These programs scrutinize chemically relevant properties of wild-type residues (e.g. bond lengths, bond angles, atom clashes, electron density). Another problem to solve was the validation of ligands, since they reside in complexes with biomacromolecules and have been found to be incorrect on several occasions. Some software tools use the same approach to ligand validation as the one they use to validate standard residues[8-10]. Other programs (mainly MotiveValidator[11]) use the validation of annotation approach, where they compare a wild-type ligand with its reference quality specimen.

Validation of structures has reached such maturity that it is often either required to run validation on every structure before submitting it to a database, or the submission system runs it automatically. The latter is the case for the PDB database. Validation reports[12] for all structures in this database that have been obtained using X-ray crystallography are available for download. These reports also contain ligand validation information, although ligands are validated similarly to standard residues. More extensive results of ligand validation for all PDB structures are present in the ValidatorDB[13] database, where they were generated by MotiveValidator.

The combination of validation data from the two above-mentioned sources with structures and their metadata from the PDB database holds significant potential for the discovery of scientifically significant trends using methods of statistical analysis. It should be possible to elucidate possible relationships between structure resolution and quality, between structure age and quality, or between structure quality and ligand quality. We have created ValTrendsDB to present the results of such an analysis that we have performed. 1852 meaningful pairs of 88 factors have been considered. The set of factors contains structure metadata factors (e.g. year of release, resolution, ligand count, residue count), structure quality factors, and ligand quality factors. Their potential relationships have been assessed using a statistical correlation analysis that returned a Spearman coefficient value for each factor pair. The overall result of the analysis is presented to the users of ValTrendsDB mainly via plots of factor pairs, as well as via correlation tables. Additionally, ValTrendsDB enables users to draw factor pair plots with custom settings. Miscellaneous statistical information about the dataset (e.g. value distribution of each factor) is provided as well.

References

  1. Kleywegt, G.J. (2009) On vital aid: the why, what and how of validation. Acta Crystallogr. D. Biol. Crystallogr., 65, 134–139.
  2. Matthews, B.W. (2007) Five retracted structure reports: inverted or incorrect? Protein Sci., 16, 1013–1016.
  3. Johnston, C.A., Kimple, A.J., Giguère, P.M. and Siderovski, D.P. (2008) Structure of the parathyroid hormone receptor C terminus bound to the G-protein dimer Gbeta1gamma2. Structure, 16, 1086–1094.
  4. Hooft, R.W., Vriend, G., Sander, C. and Abola, E.E. (1996) Errors in protein structures. Nature, 381, 272.
  5. Laskowski, R.A., MacArthur, M.W., Moss, D.S. and Thornton, J.M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr., 26, 283–291.
  6. Chen, V.B., Arendall, W.B., Headd, J.J., Keedy, D.A., Immormino, R.M., Kapral, G.J., Murray, L.W., Richardson, J.S. and Richardson, D.C. (2010) MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D. Biol. Crystallogr., 66, 12–21.
  7. Kleywegt, G.J. and Jones, T.A. (1996) Efficient rebuilding of protein structures. Acta Crystallogr. D. Biol. Crystallogr., 52, 829–832.
  8. Kleywegt, G.J. and Harris, M.R. (2007) ValLigURL: a server for ligand-structure comparison and validation. Acta Crystallogr. D. Biol. Crystallogr., 63, 935–938.
  9. Bruno, I.J., Cole, J.C., Kessler, M., Luo, J., Motherwell, W.D.S., Purkis, L.H., Smith, B.R., Taylor, R., Cooper, R.I., Harris, S.E. et al. (2004) Retrieval of crystallographically-derived molecular geometry information. J. Chem. Inf. Comput. Sci., 44, 2133–2144.
  10. Debreczeni, J.É. and Emsley, P. (2012) Handling ligands with Coot. Acta Crystallogr. D. Biol. Crystallogr., 68, 425–430.
  11. Vařeková, R.S., Jaiswal, D., Sehnal, D., Ionescu, C.-M., Geidl, S., Pravda, L., Horský, V., Wimmerová, M. and Koča, J. (2014) MotiveValidator: interactive web-based validation of ligand and residue structure in biomolecular complexes. Nucleic Acids Res., 42, W227–W233.
  12. Read, R.J., Adams, P.D., Arendall, W.B., Brunger, A.T., Emsley, P., Joosten, R.P., Kleywegt, G.J., Krissinel, E.B., Lütteke, T., Otwinowski, Z., Perrakis, A., Richardson, J.S., Sheffler, W.H., Smith, J.L., Tickle, I.J., Vriend, G. and Zwart, P.H. (2011) A New Generation of Crystallographic Validation Tools for the Protein Data Bank. Structure, 19, 1395–1412.
  13. Sehnal, D., Vařeková, R.S., Pravda, L., Ionescu, C.-M., Geidl, S., Horský, V., Jaiswal, D., Wimmerová, M. and Koča, J. (2015) ValidatorDB: database of up-to-date validation results for ligands and non-standard residues from the Protein Data Bank. Nucleic Acids Res., 43, D369–D375.

Terminology

Structure
A structure of a biomacromolecule is the three-dimensional arrangement of atoms in such molecules. Structure of a biomacromolecule can be determined by numerous methods, e.g. X-ray crystallography or NMR spectroscopy. Determined structures are stored in databases, the largest being the PDB. Since the structure is what represents real-life biomacromolecules in silico, they are referred to as structures across all ValTrendsDB pages.
It is also worth noting that for the purposes of ValTrendsDB, only structures determined using X-ray crystallography are considered.
Biopolymer
Biopolymer refers to a polymeric molecule of biological origin in general. Biopolymers stored in the PDB database are proteins, nucleic acids and polysaccharides. They consist of standard residues named amino acids, nucleotides and monosaccharides respectively.
For the purpose of ValTrendsDB, the name biopolymer refers to parts of a PDB structure that are built from standard residues, and are not ligands or water molecules.
Chain
A chain is a single biopolymer molecule. Larger biomacromolecular complexes can consist of more than one chain. The spatial arrangement of chains determines the quaternary structure of such a complex.
Assembly
Assembly is an arrangement of units that form a biomacromolecular complex together. Some assemblies consist of only one chain and no ligands, while other assemblies are made of hundreds of biopolymers and ligands. Each entry of the PDB database has exactly one preferred assembly.
Residue
Residue, as in standard residue, is a building block from which biopolymers are composed. It is not a ligand.
Ligand
Ligands are molecules that interact with biomacromolecules. Those interactions enable biomacromolecules to fulfill their roles in living organisms. Ligands are rarely covalently bound to biomacromolecules - non-covalent interactions are much more common. From the point of view of the PDB database, everything in a PDB structure that is not a standard residue is a ligand.
In ValTrendsDB, water molecules are considered ligands by most factors that deal with ligands and ligand data, with the exception of factors based on MotiveValidator output and factors where water molecules are explicitly omitted (e.g. the average ligand size in structure, disregarding water ligands factor).
Ligand validated by MotiveValidator
ValidatorDB contains validation results, computed by MotiveValidator, for only a subset of ligands stored in structures in the PDB. Specifically, it considers only ligands that are nontrivial (i.e. ligands that contain at least 7 atoms of an element other than hydrogen). Derivatives of standard residues are omitted as well[link].
Factor
For the purpose of ValTrendsDB, factor is a metric of a certain property of a PDB structure. Quantified properties range from various interpretations of simple atom count to complex quality metrics of biopolymers and their ligands.
Factor group
Factor groups have been established to group together similar factors for user convenience, since e.g. on page Explore relationships it is easier to pick from fewer factor groups first than to pick from all the factors offered at the same time.
Z-score
Z-score is defined as the difference between an observed value and either the expected or the average value, divided by the standard deviation of either the expected or the average value[link]. Z-scores are used in PDB validation reports.
Relationship
In the context of ValTrendsDB, if there is a relationship between a pair of factors, then there is a degree of linear correlation between their values. This correlation has been statistically quantified during the statistical analysis step.
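
The Z-score entry above can be illustrated with a short Python sketch. It is a minimal example, not taken from the PDB validation pipeline: the function name `z_score`, the use of the population standard deviation and the sample values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_score(observed, reference_values):
    """Z-score of `observed` against the mean and standard deviation
    of `reference_values` (population standard deviation is an assumption)."""
    mu = mean(reference_values)
    sigma = pstdev(reference_values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; Z-score is undefined")
    return (observed - mu) / sigma


# Hypothetical reference values, for illustration only.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
print(z_score(5.0, reference))  # positive: the observed value lies above the average
print(z_score(3.0, reference))  # 0.0: the observed value equals the average
```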

Factor overview

88 factors in total have been considered. Among them are structure metadata factors along with structure size factors, structure quality factors, and ligand quality factors. Because of their total count, factors have been split into groups for user convenience.

Structure metadata factors

Atom count factor group contains factors that sum all atoms of parts of a PDB structure (structure itself, non-water ligands of a structure, water molecules of a structure). Parts of PDB structures considered are different for nearly every factor in this group (see the name of each factor).

Average ligand size factor group contains factors whose values are computed as the ratio of the total number of ligand atoms in a PDB structure to the total number of ligands in that PDB structure. The set of considered ligands is different for each factor (see its name), but both parts of the fraction draw from the same set of ligands.

Chiral carbon count in ligands factor represents sum of chiral carbon atoms across all ligands in a PDB structure.

Ligand count factor group contains factors whose values are counts of ligands in a PDB structure. The set of considered ligands is different for each factor (see its name).

Molecular weight factor group contains factors whose values are sums of the weights of particular parts of a PDB structure. The unit used here is the kilodalton [kDa].

Preferred structure assembly factor group contains factors that deal with metadata of the preferred structure assembly of each PDB structure. Explanation of structure assembly can be found here. Three types of factors can be found in this group. Weight factors give the total molecular weight of selected parts of the preferred structure assembly. The weight unit used here is the kilodalton [kDa]. Molecule count factors give the total number of molecules of selected types that make up the preferred structure assembly. Flexibility ratio factor shows how flexible the ligands in the preferred structure assembly are. It is computed as the ratio of rotatable bonds to all bonds of all ligands in the preferred structure assembly. The higher it is, the more flexible the ligands are.
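
The flexibility ratio described above can be sketched as follows. This is an illustrative reading of the definition, not the ValTrendsDB implementation: the function name `flexibility_ratio` and the input layout (one pair of bond counts per ligand) are assumptions.

```python
def flexibility_ratio(ligands):
    """Ratio of rotatable bonds to all bonds, pooled over all ligands.

    `ligands` is a list of (rotatable_bond_count, total_bond_count) pairs,
    one pair per ligand of the preferred assembly (hypothetical layout).
    Returns None when the ligands contain no bonds at all.
    """
    rotatable = sum(r for r, _ in ligands)
    total = sum(t for _, t in ligands)
    return rotatable / total if total else None


# Two hypothetical ligands with 10 bonds each, 2 and 3 of them rotatable.
print(flexibility_ratio([(2, 10), (3, 10)]))  # 0.25
```

The bond counts are pooled across ligands before dividing, which follows the wording "all bonds of all ligands"; averaging per-ligand ratios instead would weight small ligands more heavily.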

Ratio of single bonds in ligands factor value is the ratio of the number of all single bonds of all ligands in a PDB structure to the number of all bonds in all ligands in said PDB structure.

Residue count factor group contains factors whose values represent the total number of standard residues in a PDB structure. Some factors in this group add the number of relevant ligands in a PDB structure to their value (which ligands are relevant for each factor is clear from its name).

Structure resolution factor represents highest resolution of a PDB structure in Ångströms. Formally, it is the smallest value of the interplanar spacings for the reflection data to be used in the refinements[link].

Year of release factor represents the year when a PDB structure was published in the PDB database.

Structure quality factors

Average RSR of residues in structure factor represents the average deviation size of standard residue structure in real space from its atomic model[link].

Clashscore factor represents the number of atom clashes (i.e. pairs of atoms that are unusually close to each other) in a structure. Formally, it is expressed as the number of clashes per thousand atoms of a PDB structure[link]. Ligand atoms are considered as well as atoms of standard residues. Two variants of the clashscore factor have been considered.
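
As a small worked example of the "clashes per thousand atoms" definition above, consider the following sketch. The function name `clashscore` and the sample numbers are hypothetical; how clashes are detected is outside the scope of this example.

```python
def clashscore(clash_count, atom_count):
    """Number of atom clashes per 1000 atoms of a structure."""
    if atom_count <= 0:
        raise ValueError("atom_count must be positive")
    return 1000.0 * clash_count / atom_count


# A hypothetical structure with 4000 atoms and 12 detected clashes.
print(clashscore(12, 4000))  # 3.0
```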

Ramachandran outliers factor represents percentage of standard residues in a PDB structure that are identified as Ramachandran outliers. A standard residue is identified as a Ramachandran outlier if the combination of backbone φ-ψ torsion angle values is unusual[link]. Two variants of the Ramachandran outliers factor have been considered.

Rfree factor is a refinement statistic of a PDB structure model. It measures the disagreement between observed structure factor amplitudes and those calculated from the model, using only reflections that were not used during model refinement. A lower value is usually better[link]. A value that is too low may point to overfitting of the model, though.

RMSZ factor group contains factors that quantify deviation of bond angles and bond lengths in standard residues of a PDB structure. It is calculated for individual standard residues, then averaged for each chain, and then - in the case of two factors from this factor group - averaged over the whole structure. Scores of factors from this factor group are expected to lie between 0 and 1[link].
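
The aggregation order described above (per residue, then per chain, then per structure) can be sketched as follows. This is only an illustration under stated assumptions: RMSZ is taken to be the root mean square of the Z-scores of the individual bond lengths or bond angles of a residue, both averaging steps are assumed to be unweighted, and the function names and input layout are hypothetical.

```python
from math import sqrt
from statistics import mean


def rmsz(z_scores):
    """Root mean square of a list of Z-scores (assumed RMSZ definition)."""
    return sqrt(mean([z * z for z in z_scores]))


def structure_rmsz(chains):
    """Average RMSZ over a structure.

    `chains` maps a chain ID to a list of residues; each residue is a list
    of Z-scores of its bond lengths (or bond angles). Residue values are
    averaged per chain, then chain averages are averaged over the structure.
    """
    chain_means = [mean([rmsz(residue) for residue in residues])
                   for residues in chains.values()]
    return mean(chain_means)


# Hypothetical Z-scores for two chains, for illustration only.
chains = {
    "A": [[0.5, -0.5], [1.0, 1.0]],
    "B": [[0.0, 2.0]],
}
print(structure_rmsz(chains))
```

Note that averaging chain averages gives every chain the same weight regardless of its length; whether ValTrendsDB weights chains this way is not stated in the text.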

RSCC factor group contains factors that determine how well the calculated electron density map matches the electron density map that has been computed from experimental data. They are alternatives to the RSR factor family. Standard residue qualifies as an outlier if its RSCC value is below 0.8[link]. Two variants of RSCC factors have been considered.

RSRZ factor group contains factors that quantify the percentage of standard residues in a PDB structure that qualify as real-space R-value (RSR) outliers. RSR is a measure of fit quality between the atomic model of a standard residue and its data in real space. A standard residue qualifies as an outlier if the Z-score of its RSR value (RSRZ) is above 2[link]. Two variants of the RSRZ outlier percentage factor have been considered.

Rvalue factor is a refinement statistic of a PDB structure model. It measures the disagreement between observed structure factor amplitudes and those calculated from the model. A lower value is usually better[link]. A value that is too low may point to overfitting of the model, though.
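
The outlier percentage can be sketched as below. The example assumes that a per-residue Z-score of RSR is already available and that the threshold of 2 is applied to that Z-score; the function name `outlier_percentage` and the sample values are hypothetical.

```python
def outlier_percentage(z_scores, threshold=2.0):
    """Percentage of residues whose RSR Z-score is above `threshold`.

    Returns None for an empty input.
    """
    if not z_scores:
        return None
    outliers = sum(1 for z in z_scores if z > threshold)
    return 100.0 * outliers / len(z_scores)


# Hypothetical per-residue Z-scores; only 2.5 exceeds the threshold.
print(outlier_percentage([0.3, 2.5, -1.0, 1.9]))  # 25.0
```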

Sidechain outliers factor represents the percentage of standard residues in a PDB structure with sidechains whose torsion angle combination is considered to be an outlier, i.e. is not a preferred combination[link]. Two variants of the Sidechain outliers factor have been considered.

Ligand quality factors

Average RSR of ligands in structure factor represents the average deviation size of ligand molecule structure in real space from its atomic model[link].

Chiral quality of ligands in structure factor group contains factors that quantify the relative amount of chiral carbon atoms with incorrect configuration.

Combined quality of ligands in structure factor group contains factors that quantify both topological and chiral problems of ligands in a PDB structure (see links for details).

LLDF factor group quantifies the Local Ligand Density Fit (LLDF) of ligands in a PDB structure. LLDF is a Z-score computed as a statistical comparison of the RSR of a ligand to the RSR values of neighboring standard residues that are present within 5 Å of the ligand in question. If there are no standard residues within 5 Å of a ligand, LLDF cannot be computed for such a ligand. A ligand is considered to be a negative quality outlier if its LLDF value is greater than 2[link]. Two variants (plus two more) of LLDF factors have been considered.

RMSZ factor group contains factors that quantify deviation of bond angles and bond lengths in ligands of a PDB structure. It is calculated for individual ligands, then averaged over the whole molecule. Scores of factors from this factor group are expected to lie between 0 and 1[link].
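
The LLDF description above can be made concrete with a short sketch. It assumes that the "statistical comparison" is a plain Z-score of the ligand RSR against the mean and standard deviation of the neighboring residues' RSR values, and that the population standard deviation is used; the function name `lldf` and the sample values are hypothetical. Selecting the residues within 5 Å is not shown.

```python
from statistics import mean, pstdev


def lldf(ligand_rsr, neighbour_rsrs):
    """Z-score of a ligand's RSR against the RSR values of nearby residues.

    Returns None when it cannot be computed: no neighbouring residues, or
    neighbours whose RSR values are all identical (zero spread).
    """
    if not neighbour_rsrs:
        return None
    sigma = pstdev(neighbour_rsrs)  # population standard deviation (assumption)
    if sigma == 0:
        return None
    return (ligand_rsr - mean(neighbour_rsrs)) / sigma


# Hypothetical RSR values, for illustration only.
value = lldf(0.20, [0.10, 0.12, 0.14])
print(value, value > 2)  # a value above 2 marks the ligand as an outlier
```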

RSCC factor group contains factors that determine how well the calculated electron density map matches the electron density map that has been computed from experimental data. They are alternatives to the RSR factor family. Ligand qualifies as an outlier if its RSCC value is below 0.8[link]. Two variants (plus two more) of RSCC factors have been considered.

Topological quality of ligands in structure factor group contains factors that quantify the relative amount of atoms that are either missing or redundant in ligands of a PDB structure. Atoms that are not present in a ligand for chemically valid reasons (e.g. atoms that were lost when a covalent bond was formed) are not quantified by factors from this factor group.

Data acquisition, processing and analysis

Data acquisition and processing

Data for the analysis have been obtained from four sources:

The data from all four sources have then been parsed and merged together using the 4-character PDB ID as the primary key. The parsing task has been handled by our above mentioned custom tool. After parsing, several clean-up steps have been executed so that the data were relevant and useful for the purposes of this analysis:

Complete raw data, as well as the parsed data, are available for download on the Download page.
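
Merging parsed records on the PDB ID, as described above, can be sketched like this. It is not the custom tool mentioned in the text: the function name `merge_by_pdb_id`, the field names and the decision to keep only entries present in every source (an inner join) are assumptions, and the IDs and values are made up for illustration.

```python
def merge_by_pdb_id(*sources):
    """Merge per-structure records from several sources on the PDB ID.

    Each source maps a 4-character PDB ID to a dict of fields. Only IDs
    present in every source are kept (inner join, an assumption).
    """
    if not sources:
        return {}
    common = set(sources[0])
    for source in sources[1:]:
        common &= set(source)
    merged = {}
    for pdb_id in sorted(common):
        record = {}
        for source in sources:
            record.update(source[pdb_id])  # later sources win on name clashes
        merged[pdb_id] = record
    return merged


# Hypothetical records from two sources.
metadata = {"1abc": {"resolution": 1.8}, "2xyz": {"resolution": 2.4}}
validation = {"1abc": {"clashscore": 3.0}}
print(merge_by_pdb_id(metadata, validation))
```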

Data analysis

The first step in analysing the data at hand was to estimate the existence, type and strength of relationships between pairs of various factors. Such pairs were chosen based on their scientific significance and the interest of this group of authors. During the estimation process, the set of prospective factors was expanded several times after several interesting strong relationships between pairs of factors had been successfully estimated.

The next step was to validate the existence of estimated relationships using methods of statistical analysis, namely using correlation analysis. It has been carried out in four consecutive steps:

  • Interval determination

    63 factors out of the total 88 have had their values split into non-equidistant intervals. These factors will appear on the X axis of the final plots that you can view on pages Explore relationships and Custom visualization.

    • The intervals created here are not equidistant, because the data distribution of the vast majority of factors is extremely skewed (as can be seen on page Statistics). If classic equidistant intervals were used (as was the case during the estimation process), the number of structures classified in the intervals would differ by as many as 3 orders of magnitude (10s of structures in one interval, 10000s of structures in other intervals). Such a discrepancy in structure counts would seriously impair the soundness of any claims made on the basis of data distributed in such a way.
    • Each interval contains at least 97 structures. Most factors have been divided into 88 - 100 intervals.

  • Box plot plotting

    Box plots have been plotted for each interval of each factor that will appear on the X axis. This type of plot gives insight into the value distribution within every single interval. Using box plots, it has been determined that the value variability in edge intervals (the first and the last one or two intervals of every plot) of some factors is too high, so their influence is out of proportion to the number of values they contain relative to the size of the dataset. Such problematic edge intervals, along with the structures sorted into them, have been omitted from the statistical validation step of the analysis.

  • Histogram plotting

    Histograms have been plotted for factors that were to appear on the Y axis of the final plots. They were used to assess whether the values of those factors were reasonably bounded from below as well as from above. All factors fulfilled this criterion.

  • Correlation analysis

    Finally, the correlation analysis has been carried out. Weighted averages of values of both paired factors have been computed for every interval. Weighted averages have been selected instead of arithmetic averages to limit further possible data skew. An explanation of the weighted average (formally weighted arithmetic mean) can be found here. Each interval has been split into 10 equidistant subintervals, and each value sorted into a subinterval is weighted by the number of structures in that subinterval. Then, the Spearman correlation coefficient has been computed for every factor pair. Its absolute value was interpreted to determine the presence and strength of a relationship using established guidelines, while its sign was interpreted to determine whether the relationship is direct or indirect. The meaning of Spearman coefficient value intervals has been established in this way:

    • [0.7; 1.0]: strong linear relationship
    • [0.3; 0.7): weak linear relationship
    • [0.0; 0.3): no linear relationship
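
The steps listed above can be sketched end to end in plain Python. This is a simplified illustration, not the code used to build ValTrendsDB: all function names are hypothetical, intervals are cut by equal count rather than refined manually, equal X values may fall into neighboring intervals, edge intervals are not removed, and the coefficient is interpreted by its absolute value.

```python
from math import sqrt
from statistics import mean


def equal_count_intervals(pairs, min_count):
    """Sort (x, y) pairs by x and cut them into consecutive, roughly equally
    populated intervals of at least `min_count` pairs (if enough data exist)."""
    ordered = sorted(pairs)
    n = max(1, len(ordered) // min_count)
    size, extra = divmod(len(ordered), n)
    intervals, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        intervals.append(ordered[start:end])
        start = end
    return intervals


def weighted_mean(values, n_sub=10):
    """Mean in which each value is weighted by the number of values that
    share its equidistant subinterval."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return float(lo)
    width = (hi - lo) / n_sub
    idx = [min(int((v - lo) / width), n_sub - 1) for v in values]
    counts = [idx.count(i) for i in range(n_sub)]
    total = sum(counts[i] for i in idx)
    return sum(counts[i] * v for i, v in zip(idx, values)) / total


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the ranks.
    Undefined (division by zero) if either input is constant."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))


def classify(rho):
    """Strength from the absolute value, direction from the sign."""
    if abs(rho) >= 0.7:
        strength = "strong"
    elif abs(rho) >= 0.3:
        strength = "weak"
    else:
        return ("none", None)
    return (strength, "direct" if rho > 0 else "indirect")


def analyse(pairs, min_count):
    intervals = equal_count_intervals(pairs, min_count)
    xs = [weighted_mean([x for x, _ in iv]) for iv in intervals]
    ys = [weighted_mean([y for _, y in iv]) for iv in intervals]
    rho = spearman(xs, ys)
    return rho, classify(rho)


# Synthetic data for illustration: y grows monotonically with x.
pairs = [(x, x * x) for x in range(1, 41)]
print(analyse(pairs, min_count=5))
```

A production analysis would more likely rely on an established statistics library for the rank correlation; the hand-written version here only keeps the example free of external dependencies.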