Difference between revisions of "PatternQuery:Principles"

From WebChem Wiki

Revision as of 12:15, 16 December 2014

A new paradigm for describing structural fragments in molecules

This text describes the basic principles of the MotiveQuery language for describing structural fragments in molecules. First, it contrasts two approaches to identifying molecular fragments: the imperative approach and the declarative one taken by the MotiveQuery language. Next, the basic principles of the language are illustrated with several examples and then described more formally. Finally, the original example is revisited and explained in terms of the newly introduced concepts.

Example, Part One

Goal: Find all HET residues in a protein.

Let’s assume we have loaded a protein stored in a PDB or mmCIF file with correctly annotated HET groups.

An imperative approach is characterized by explicitly stating the steps that need to be performed in order to achieve a particular goal. In contrast, a declarative approach states the goal we would like to achieve, leaving the individual steps as an "implementation detail".

Using the imperative approach, we would do something like this:

# Collect every residue that is annotated as a HET group.
result = []
for residue in molecule.Residues():
  if residue.IsHet():
    result.append(residue)

In the declarative approach, our code would look like this:

HetResidues()

Now, let’s extend our example to "all HET residues and atoms within 4 Å of them".

In the imperative approach we would need to do something along the following lines:

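# Step 1: collect the HET residues.
temp = []
for residue in molecule.Residues():
  if residue.IsHet():
    temp.append(residue)

# Step 2: build a spatial lookup over all atoms of the molecule.
neighborhoodLookup = NeighborhoodLookup(molecule.Atoms())
result = []

# Step 3: extend each residue with the atoms within 4.0 Å of it.
for residue in temp:
  surroundings = neighborhoodLookup.Find(residue.Atoms(), 4.0)
  result.append(union(residue, surroundings))

# result now holds one fragment per HET residue.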

Declaratively, our code would be just:

HetResidues().AmbientAtoms(4.0)

Internally, the function AmbientAtoms() might run code similar to the imperative version. However, what is important is that this complexity is hidden from the user when the declarative approach is used.

Now, let’s extend our example even further: "all HET residues and atoms within 4 Å of them, where the entire resulting fragment (the residue together with its surroundings) contains at least one calcium atom".

We will not bother the reader with the imperative version; implementing the condition "at least one calcium atom" is rather tedious. Using the declarative approach, however, the description of the fragment becomes just:

HetResidues()
  .AmbientAtoms(4.0)
  .Filter(lambda m: m.Count(Atoms('Ca')) >= 1)

Basic Principles of the Language

As we've seen in the example above, it is very easy to compose our ideas about the final shape of the fragment we are interested in. The way this works is that the input molecule is decomposed into a stream of fragments. These streams can then be modified and combined into new streams, which can be modified and combined again.

As an example, take the query Atoms('Ca'). The MotiveQuery language extracts all calcium atoms from the input molecule and represents them as a stream of sets containing one atom each, as illustrated in the image below:

PatternQuery-Principles Atoms(Ca).png
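
The idea of a stream of single-atom sets can be sketched in Python. This is only an illustration under stated assumptions: the toy molecule, its coordinates, and the `atoms` function are hypothetical stand-ins, not the actual implementation of Atoms().

```python
# Toy molecule: (element, (x, y, z)) pairs; the coordinates are made up.
molecule = [
    ("Ca", (0.0, 0.0, 0.0)),
    ("O", (2.4, 0.0, 0.0)),
    ("Ca", (9.0, 0.0, 0.0)),
]

def atoms(molecule, element=None):
    """Yield one single-atom fragment per matching atom (all atoms if element is None)."""
    for atom in molecule:
        if element is None or atom[0] == element:
            yield frozenset([atom])

stream = list(atoms(molecule, "Ca"))
print(len(stream))                        # 2 fragments
print(all(len(f) == 1 for f in stream))   # True: each holds exactly one atom
```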

Now, each element of this stream can be modified, for example to include all atoms within 4 Å of the original calcium atom. The result is a stream of sets of atoms, where each set contains the original Ca atom and the atoms within the given radius. This is expressed by the query Atoms('Ca').AmbientAtoms(4) and is illustrated in the image below:

PatternQuery-Principles Atoms(Ca) surr.png
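
A minimal Python sketch of such a modifier follows. It assumes a brute-force distance check and an inclusive radius; both are choices made for illustration, and the function names are hypothetical rather than taken from the language's implementation.

```python
from math import dist

# Toy molecule: (element, (x, y, z)) pairs; the coordinates are made up.
molecule = [
    ("Ca", (0.0, 0.0, 0.0)),
    ("O", (2.4, 0.0, 0.0)),
    ("N", (0.0, 3.5, 0.0)),
    ("C", (8.0, 0.0, 0.0)),
]

def atoms(molecule, element):
    """Generator step: one single-atom fragment per matching atom."""
    return [frozenset([a]) for a in molecule if a[0] == element]

def ambient_atoms(stream, molecule, radius):
    """Map each fragment to itself plus all atoms within `radius` of any of its atoms."""
    return [
        frozenset(a for a in molecule if any(dist(a[1], b[1]) <= radius for b in frag))
        for frag in stream
    ]

surroundings = ambient_atoms(atoms(molecule, "Ca"), molecule, 4.0)
print([sorted(a[0] for a in frag) for frag in surroundings])  # [['Ca', 'N', 'O']]
```

The stream keeps its length: every input fragment is replaced by exactly one enlarged fragment.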

In the next step, we might wish to keep only those fragments that contain at least 6 atoms. This is achieved by looking at each fragment, counting its atoms, and throwing away the fragments that do not meet the criterion. Written as a query, this is Atoms('Ca').AmbientAtoms(4).Filter(lambda m: m.Count(Atoms()) >= 6). In graphical form:

PatternQuery-Principles Atoms(Ca) surr filt.png
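
The filtering step itself is small enough to sketch directly. In this hypothetical Python version, atoms are plain integer ids and the fragment size stands in for m.Count(Atoms()):

```python
def filter_fragments(stream, predicate):
    """Keep only the fragments for which `predicate` returns True."""
    return [frag for frag in stream if predicate(frag)]

# Three made-up fragments with 7, 3 and 6 atoms (atoms are just integer ids here).
stream = [frozenset(range(7)), frozenset(range(10, 13)), frozenset(range(20, 26))]

kept = filter_fragments(stream, lambda m: len(m) >= 6)
print([len(f) for f in kept])  # [7, 6]
```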

The previous filter query also demonstrates another interesting concept of the language: the ability to identify fragments within fragments, which is what the expression m.Count(Atoms()) does. The Atoms() query is executed for each fragment of the input sequence produced by the expression Atoms('Ca').AmbientAtoms(4) and creates a new sequence of fragments that each contain a single atom. The Count function then returns the number of fragments produced by its argument. Consequently, the query Atoms() inside the Count function can be replaced by any function that also produces a sequence of fragments, for example Rings().
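
To make the "fragments within fragments" idea concrete, here is a hedged Python sketch in which a query is simply a function from a fragment to a list of sub-fragments. The names `atoms`, `residues`, and `count` are hypothetical, and `residues` is used as the alternative generator because ring detection would not fit in a short example.

```python
# A fragment is a frozenset of (element, residue label) pairs; the data is made up.
fragment = frozenset([("Ca", "CA 1"), ("O", "ASP 2"), ("C", "ASP 2"), ("O", "HOH 3")])

def atoms(element=None):
    """Return a query: fragment -> list of single-atom fragments."""
    def run(frag):
        return [frozenset([a]) for a in frag if element is None or a[0] == element]
    return run

def residues():
    """Return a query: fragment -> one sub-fragment per residue label."""
    def run(frag):
        groups = {}
        for a in frag:
            groups.setdefault(a[1], set()).add(a)
        return [frozenset(g) for g in groups.values()]
    return run

def count(frag, query):
    """Run `query` inside `frag` and count the fragments it produces."""
    return len(query(frag))

print(count(fragment, atoms()))      # 4
print(count(fragment, atoms("O")))   # 2
print(count(fragment, residues()))   # 3
```

Because `count` only relies on its argument returning a list of fragments, any generator with that shape can be plugged in.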

Finally, streams of fragments can be combined. For example, let’s say we want to find all pairs of calcium atoms that are no further than 4 Å from each other. This can be achieved using the query Near(4, Atoms('Ca'), Atoms('Ca')). The query takes two identical streams of calcium atoms as input and, for each pair of atoms, determines whether they are within 4 Å of each other. For each pair that satisfies this condition, a new fragment is created from the two atoms. Therefore, the result of the above Near() query is a stream of sets of atoms (fragments) that each contain two calcium atoms no further than 4 Å apart:

PatternQuery-Principles-Near.png
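
A combinator of this kind might look as follows in Python. This is a sketch under assumptions, not the behavior of the real Near(): it treats the distance limit as inclusive, considers two fragments near when any pair of their atoms is close enough, and drops mirrored duplicates by collecting results in a set.

```python
from math import dist

calcium = [  # made-up labels and coordinates in angstroms
    ("Ca1", (0.0, 0.0, 0.0)),
    ("Ca2", (3.0, 0.0, 0.0)),
    ("Ca3", (20.0, 0.0, 0.0)),
]
stream = [frozenset([a]) for a in calcium]

def near(max_distance, left, right):
    """Merge one fragment from each stream when some pair of their atoms is close enough."""
    result = set()
    for f in left:
        for g in right:
            if f != g and any(dist(a[1], b[1]) <= max_distance for a in f for b in g):
                result.add(f | g)   # the set removes the mirrored (g, f) duplicate
    return result

pairs = near(4.0, stream, stream)
print([sorted(a[0] for a in p) for p in pairs])  # [['Ca1', 'Ca2']]
```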

With the basic types of queries outlined in the previous paragraphs, the sky's the limit. Due to the composable nature of the language, if a new type of motif emerges, only a single function needs to be added to the language for it to work with all its other parts. As an example, assume we didn’t know that proteins had a secondary structure called “sheet”, and we had just discovered it together with a fancy algorithm to identify these "sheets". Now we would be interested in how this new type of protein substructure interacts with other parts of the molecule. All that would be needed is to add a function called Sheets() to the language, and we would immediately be able to analyze and filter its neighborhood using the functions AmbientAtoms() and Filter().
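
The following Python sketch illustrates why one new generator is enough. Everything in it is hypothetical: atoms carry a made-up sheet label, and `sheets`, `ambient_atoms`, and `filter_fragments` are toy stand-ins for Sheets(), AmbientAtoms(), and Filter().

```python
from math import dist

# Toy atoms: (id, sheet label or None, position); all values are made up.
molecule = [
    (1, "A", (0.0, 0.0, 0.0)),
    (2, "A", (1.5, 0.0, 0.0)),
    (3, None, (3.0, 0.0, 0.0)),
    (4, None, (30.0, 0.0, 0.0)),
]

# Existing building blocks, written without any knowledge of sheets.
def ambient_atoms(stream, radius):
    return [frozenset(a for a in molecule if any(dist(a[2], b[2]) <= radius for b in f))
            for f in stream]

def filter_fragments(stream, predicate):
    return [f for f in stream if predicate(f)]

# The only new code needed for the new motif type: a generator.
def sheets():
    labels = sorted({a[1] for a in molecule if a[1] is not None})
    return [frozenset(a for a in molecule if a[1] == label) for label in labels]

result = filter_fragments(ambient_atoms(sheets(), 4.0), lambda m: len(m) >= 3)
print([sorted(a[0] for a in f) for f in result])  # [[1, 2, 3]]
```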


Basic Principles, More Formally this Time

There are two basic data structures that the language is built upon. These are:

  • Fragment. A fragment is simply an arbitrary set of atoms.
  • Fragment Sequence. A sequence of fragments. In mathematical terms, it can be understood as a "set of fragments", which is another way of saying "set of sets of atoms".

And on these data structures, there are three basic types of queries:

  • Generator queries. Generator queries, as the name suggests, generate sequences of fragments from the original input. They are the tool that transforms the input molecule into a stream of fragments that can be later modified or combined. Examples of these queries include Atoms(), Residues(), and RegularMotifs().
  • Modifier queries. These queries operate on individual fragments and modify them or throw them away. Examples include AmbientAtoms(), ConnectedResidues(), and Filter().
  • Combinator queries. Combinator queries take two or more sequences of fragments as input and combine them into a single new sequence that satisfies the given criteria. Examples include Near(), Cluster(), and Star().
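
One way to picture the two data structures and the three query types is as Python type aliases. The aliases and the three small functions below are assumptions made for illustration; in particular, `union_pairs` is a deliberately trivial combinator and does not correspond to Near(), Cluster(), or Star().

```python
from typing import Callable, FrozenSet, List, Tuple

Atom = Tuple[str, int]                 # hypothetical: (element, serial number)
Fragment = FrozenSet[Atom]             # an arbitrary set of atoms
FragmentSeq = List[Fragment]           # a sequence of fragments

Generator = Callable[[Fragment], FragmentSeq]                    # input -> fragments
Modifier = Callable[[FragmentSeq], FragmentSeq]                  # change or drop fragments
Combinator = Callable[[FragmentSeq, FragmentSeq], FragmentSeq]   # merge two sequences

def atoms(element: str) -> Generator:
    return lambda molecule: [frozenset([a]) for a in sorted(molecule) if a[0] == element]

def keep_if(predicate: Callable[[Fragment], bool]) -> Modifier:
    return lambda seq: [f for f in seq if predicate(f)]

def union_pairs(left: FragmentSeq, right: FragmentSeq) -> FragmentSeq:
    return [f | g for f in left for g in right]

molecule: Fragment = frozenset([("Ca", 1), ("O", 2), ("O", 3)])
oxygens = atoms("O")(molecule)
calcium = keep_if(lambda f: len(f) == 1)(atoms("Ca")(molecule))
combined = union_pairs(calcium, oxygens)
print(len(oxygens), len(calcium), len(combined))  # 2 1 2
```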

Example, Revised

Now that we know the basic building blocks of the language, let’s go back to our original example, this time with a radius of 3 Å, and analyze it:

HetResidues()
  .AmbientAtoms(3.0)
  .Filter(lambda m: m.Count(Atoms('Ca')) >= 1)

This corresponds to the following process.

  1. A generator query HetResidues() is executed that produces a sequence of fragments that are composed of atoms corresponding to HET residues.
  2. Next, the original sequence is modified by adding to each fragment the atoms within 3 Å of any of its original atoms.
  3. Finally, each fragment in the modified sequence is examined: all “calcium atom fragments” are identified and counted. Only those fragments that contain at least 1 Ca atom are kept.
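
The three steps can be traced end to end with a toy Python model. This is a sketch, not the language's implementation: the classes `Atom`, `Fragment`, and `Stream`, the molecule data, and the inclusive brute-force distance check are all assumptions. The function names mirror the query only for readability.

```python
from math import dist

class Atom:
    def __init__(self, id, element, residue, is_het, pos):
        self.id, self.element, self.residue = id, element, residue
        self.is_het, self.pos = is_het, pos

class Fragment(frozenset):
    def Count(self, query):
        """Run a nested query inside this fragment and count its results."""
        return len(query(self))

def Atoms(element):
    """Nested generator: fragment -> single-atom fragments of the given element."""
    return lambda frag: [Fragment([a]) for a in frag if a.element == element]

class Stream:
    def __init__(self, molecule, fragments):
        self.molecule, self.fragments = molecule, fragments

    def AmbientAtoms(self, radius):
        # Step 2: grow each fragment by the atoms within `radius` of any of its atoms.
        grown = [Fragment(a for a in self.molecule
                          if any(dist(a.pos, b.pos) <= radius for b in f))
                 for f in self.fragments]
        return Stream(self.molecule, grown)

    def Filter(self, predicate):
        # Step 3: keep only the fragments that satisfy the predicate.
        return Stream(self.molecule, [f for f in self.fragments if predicate(f)])

def HetResidues(molecule):
    # Step 1: one fragment per HET residue.
    groups = {}
    for a in molecule:
        if a.is_het:
            groups.setdefault(a.residue, []).append(a)
    return Stream(molecule, [Fragment(g) for g in groups.values()])

molecule = [  # made-up data
    Atom(1, "C", "LIG 1", True, (0.0, 0.0, 0.0)),
    Atom(2, "O", "LIG 1", True, (2.0, 0.0, 0.0)),
    Atom(3, "Ca", "CA 2", True, (4.5, 0.0, 0.0)),
    Atom(4, "Zn", "ZN 3", True, (50.0, 0.0, 0.0)),
]
result = (HetResidues(molecule)
          .AmbientAtoms(3.0)
          .Filter(lambda m: m.Count(Atoms('Ca')) >= 1))
print([sorted(a.id for a in f) for f in result.fragments])  # [[1, 2, 3], [2, 3]]
```

In this toy data the zinc residue is dropped in the last step because no calcium atom lies within 3 Å of it.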

HetResidues()

PatternQuery-Principles-HetResidues.png

.AmbientAtoms(3)

PatternQuery-Principles-AmbientAtoms.png

.Filter(lambda m: m.Count(Atoms('Ca')) >= 1)

PatternQuery-Principles-AmbientAtoms-filter.png