Ever since we’ve been able to sequence proteins, their three-dimensional structures have received tremendous experimental attention. Thanks to new methods and technological advances, determining these structures has become steadily more accurate over time.
The problem, however, lies in the fact that the discovery of new protein structures has not kept pace with the rate at which new sequences are being produced. As a result, the gap between the number of known sequences and the number of identified three-dimensional structures keeps growing.
Given sufficient accuracy, a possible solution is the computational prediction of protein structures. Methods such as homology modelling, fold recognition and novel modelling can be used to fill this gap. However, regardless of which method is used, and with the rapid rise in the amount of sequence data, the underlying problem remains the lack of a single knowledge base that allows a rapid and powerful scan over the universe of protein sequences. All publicly available data currently sits in various databases across many different sources. Moving from one source to another is not, and certainly must not be, the biggest challenge in this process.
In this post, I aim to show how a Knowledge Graph can accelerate the protein structure prediction process by allowing you to:
- query for insights over one single, comprehensive and interconnected dataset of protein sequences, and
- search for and produce a shortlist of sequences to be passed on to the next computational component in the prediction process.
All Data In One Knowledge Graph
The image below illustrates what I think the model of a knowledge graph in this domain of protein sequences and structures could look like.
This Grakn knowledge graph plays the role of a single knowledge base that contains all relevant data pulled in from various sources, such as UniProt and PDB. The data could also be pulled in by running BLAST with Grakn.
Migrating data to Grakn: To learn how data in CSV, JSON and XML formats can be migrated to a Grakn Knowledge Graph, have a look at the comprehensive and step-by-step Migration Guide.
Query For Insights
Now that all relevant data is represented (as shown above) in a Grakn knowledge graph, we can ask the following questions over this dataset, each of which maps to a query.
- What are the structures of the following sequence?
- Which sequences have the structure with PDB id of "2RHC"?
- The following sequence has no known structure. What are the structures of other sequences that are at least 80% identical to this particular sequence?
The queries in this post are written in Graql, the query language for Grakn. Its expressivity is what makes it human-readable and intuitive. In simple terms, Graql is a query language that can be understood and written by anyone, not just programmers.
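To make the three questions above concrete without a running Grakn instance, the sketch below answers them over a tiny in-memory stand-in for the graph. It is plain Python rather than Graql, and everything in it is an assumption made for illustration: the placeholder sequence labels (SEQ_A and so on), the PDB ID 1ABC, the function names, and the convention that identicality is stored as a fraction between 0 and 1.

```python
# Toy stand-in for the knowledge graph. All records are made up for
# illustration; real data would come from sources such as UniProt and PDB.

# sequence-structure-mapping: (sequence, PDB ID of the mapped structure)
SEQUENCE_STRUCTURE = [
    ("SEQ_A", "2RHC"),
    ("SEQ_B", "2RHC"),
    ("SEQ_B", "1ABC"),  # hypothetical PDB ID
]

# sequence-sequence-alignment: (target, matched, identicality as a fraction)
ALIGNMENTS = [
    ("SEQ_C", "SEQ_A", 0.85),
    ("SEQ_C", "SEQ_B", 0.60),
]


def structures_of(sequence):
    """PDB IDs of the structures mapped to the given sequence."""
    return sorted({pdb for seq, pdb in SEQUENCE_STRUCTURE if seq == sequence})


def sequences_with_structure(pdb_id):
    """Sequences mapped to the structure with the given PDB ID."""
    return sorted({seq for seq, pdb in SEQUENCE_STRUCTURE if pdb == pdb_id})


def structures_of_similar(sequence, min_identicality=0.8):
    """PDB IDs of structures mapped to sequences that are aligned with
    `sequence` at or above the identicality threshold."""
    similar = {
        matched
        for target, matched, identicality in ALIGNMENTS
        if target == sequence and identicality >= min_identicality
    }
    return sorted({pdb for seq, pdb in SEQUENCE_STRUCTURE if seq in similar})


print(structures_of("SEQ_B"))
print(sequences_with_structure("2RHC"))
print(structures_of_similar("SEQ_C"))
```

In this toy data SEQ_C has no structure of its own, yet the third function still returns 2RHC through SEQ_A, its 85% identical match. The alignment is treated as directed (target to matched), which is a simplification chosen for the sketch.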
Extending the Knowledge Graph
As we decide to pull in more relevant data sources into the Grakn knowledge graph, the model can evolve and be extended with minimal effort.
Below I’ve included the code that defines the model that I illustrated earlier in this post. If we were to extend this model and introduce the protein sequence function with a mapping relationship to protein sequence structure, we could do so by extending the model (aka. schema) like so:
sequence-sequence-alignment sub relationship,
  has positivity;

sequence-structure-mapping sub relationship,
  relates mapping-sequence;

structure-function-mapping sub relationship,
  relates mapped-function;

sequence sub attribute datatype string,
  plays mapping-sequence;

structure sub entity,
  has pdb-id;

function sub attribute datatype string,
  plays mapped-function;

identicality sub attribute datatype double;
positivity sub attribute datatype double;
pdb-id sub attribute datatype string;
The lines that introduce function and structure-function-mapping are the extra code that we need to add. Nothing else needs to change. The extended model of the knowledge graph now looks like this.
Given the new relationship structure-function-mapping and the previous relationship sequence-structure-mapping, we can make use of Grakn’s automated reasoning capability to make an inference, resulting in new knowledge: an implied mapping between a sequence and a function. The implied-sequence-function-mapping rule tells Grakn that if:
- there is a sequence, and
- there is a structure, and
- there is a function, and
- the sequence and the structure have a mapping relationship, and
- the structure and the function have a mapping relationship,
then Grakn should consider the sequence and the function to have a mapping relationship.
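The logic of this rule can be sketched in a few lines of plain Python. This is not Graql and says nothing about how Grakn evaluates rules internally; it only mirrors the when/then reasoning described above. The function name `infer_sequence_function`, the placeholder sequence labels and the PDB ID 1ABC are assumptions made for illustration.

```python
def infer_sequence_function(sequence_structure, structure_function):
    """Mirror the rule: if a sequence maps to a structure and that structure
    maps to a function, treat the sequence as mapped to the function."""
    # Index functions by structure so each structure is looked up once.
    functions_by_structure = {}
    for structure, function in structure_function:
        functions_by_structure.setdefault(structure, set()).add(function)

    return {
        (sequence, function)
        for sequence, structure in sequence_structure
        for function in functions_by_structure.get(structure, ())
    }


# Made-up facts for illustration only.
sequence_structure = [("SEQ_A", "2RHC"), ("SEQ_B", "1ABC")]
structure_function = [("2RHC", "enzyme")]

implied = infer_sequence_function(sequence_structure, structure_function)
print(sorted(implied))
```

Only SEQ_A ends up with an implied function here, because 1ABC has no function mapped to it in the toy data.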
With these additions to the schema, we can now ask the following questions:
- Which sequences have the function “enzyme”?
- Which functions are mapped either directly to the following sequence or indirectly via an aligned sequence that is at least 80% identical to the given sequence?
The sequence: MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
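As a rough illustration of what these two questions ask of the graph, here is a plain-Python sketch over made-up records. It is not Graql. The sequence labels, the function names other than "enzyme", and the choice to store identicality as a fraction are all assumptions; the sequence-to-function pairs stand for mappings that are either stated directly or implied by the rule.

```python
# Made-up records for illustration only.

# (sequence, function) pairs, whether stated directly or implied by a rule
SEQUENCE_FUNCTION = [
    ("SEQ_A", "enzyme"),
    ("SEQ_B", "transporter"),  # hypothetical function name
    ("SEQ_C", "receptor"),     # hypothetical function name
]

# sequence-sequence-alignment: (target, matched, identicality as a fraction)
ALIGNMENTS = [
    ("SEQ_C", "SEQ_A", 0.85),
    ("SEQ_C", "SEQ_B", 0.60),
]


def sequences_with_function(function):
    """Sequences mapped to the given function."""
    return sorted({seq for seq, fun in SEQUENCE_FUNCTION if fun == function})


def functions_of(sequence, min_identicality=0.8):
    """Functions mapped to `sequence` itself, or to any sequence aligned
    with it at or above the identicality threshold."""
    candidates = {sequence} | {
        matched
        for target, matched, identicality in ALIGNMENTS
        if target == sequence and identicality >= min_identicality
    }
    return sorted({fun for seq, fun in SEQUENCE_FUNCTION if seq in candidates})


print(sequences_with_function("enzyme"))
print(functions_of("SEQ_C"))
```

With the 80% threshold, SEQ_C picks up "enzyme" from SEQ_A in addition to its own function, while the 60% match SEQ_B is ignored.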
What you saw above is simply one example of how a Grakn knowledge graph can be extended to infer new knowledge. Rules can be written in any research domain to:
- inject biological facts
- infer based on new findings (hypotheses)
- enforce constraints
It’s entirely up to you how you choose to make your knowledge graph more intelligent by writing rules tailored to your own work.
The Opportunities Are Endless!
Grakn is about modelling intelligent knowledge graphs in an intelligent way. We believe simplicity to be a cornerstone of intelligence. Hence, the query language — Graql. What you can model and query with a Grakn knowledge graph is only limited by your will and imagination.
For Your Inspiration
Have an idea?
I’d love to hear how you see Grakn being applied in your field. Whether it’s a question, comment or feedback, get in touch with me on Grakn’s Slack community, shoot me an email at email@example.com or tweet me at @SaffariSoroush.