Protein Structure & Function Prediction Powered by a Grakn Knowledge Graph

Soroush Saffari
Nov 3, 2018

Ever since we’ve been able to sequence proteins, their three-dimensional structures have received tremendous experimental attention. Thanks to new methods and technological advances, determining these structures has become steadily more accurate over time.

The problem, however, lies in the fact that the discovery of new protein structures has not kept pace with the rate at which new sequences are being produced. As a result, the gap between the number of known sequences and the number of identified three-dimensional structures keeps growing.

Given sufficient accuracy, computational prediction of protein structures is a possible solution. Methods such as homology modelling, fold recognition and novel modelling can be used to fill this gap. However, whichever method is used, and with the amount of sequence data rising rapidly, the underlying problem remains the lack of one single knowledge base that allows a rapid and powerful scan over the universe of protein sequences. All publicly available data currently sits in various databases across many different sources. Moving from one source to another is not, and certainly must not be, the biggest challenge in this process.

In this post, I aim to show how a Knowledge Graph can accelerate the protein structure prediction process by allowing you to:

  • query for insights over one single, comprehensive and interconnected dataset of protein sequences, and
  • search and produce a shortlisted set of sequences to be passed on to the next computational component in the prediction process.

All Data In One Knowledge Graph

The image below illustrates what I think the model of a knowledge graph in this domain of protein sequence structure could look like.

The model of this Grakn knowledge graph: the red node is an entity, green nodes are relationships and blue nodes are attributes.

This Grakn knowledge graph plays the role of a single knowledge base that contains all relevant data pulled in from various sources, such as UniProt and PDB. Data could also be pulled in by running BLAST with Grakn.

Migrating data to Grakn: To learn how data in CSV, JSON and XML formats can be migrated to a Grakn Knowledge Graph, have a look at the comprehensive and step-by-step Migration Guide.
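
As a rough illustration of what such a migration starts from, the sketch below parses a small CSV of sequence-to-structure mappings into plain Python records and drops repeated facts. The column names (`sequence`, `pdb_id`), the helper `read_mappings` and the sample rows are assumptions made for this example. It is not the Migration Guide's code and it does not talk to Grakn.

```python
import csv
import io
from typing import NamedTuple


class SequenceStructureRow(NamedTuple):
    """One row linking a protein sequence to a structure id."""
    sequence: str
    pdb_id: str


def read_mappings(csv_text: str) -> list[SequenceStructureRow]:
    """Parse CSV text with hypothetical `sequence` and `pdb_id` columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    seen = set()
    for record in reader:
        # Normalise case and whitespace so the same fact compares equal.
        row = SequenceStructureRow(
            sequence=record["sequence"].strip().upper(),
            pdb_id=record["pdb_id"].strip().upper(),
        )
        # Different sources often repeat the same fact; keep it once.
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return rows


# Made-up sample data standing in for an export from one of the sources.
SAMPLE_CSV = """sequence,pdb_id
MNVGTAHSEV,STRUCT_1
mnvgtahsev,struct_1
GIWLSYVLAI,STRUCT_2
"""

mappings = read_mappings(SAMPLE_CSV)
for m in mappings:
    print(m.sequence, "->", m.pdb_id)
```

Normalising before de-duplicating matters once several sources feed the same knowledge base, since the same sequence may arrive in different letter case or with stray whitespace.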

Query For Insights

Now that we have all relevant data represented (as shown above) in a Grakn knowledge graph, we can go ahead and ask the following questions over this dataset. Under each question, I’ve included the relevant query.

  • What are the structures of the following sequence?
MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
  • Which sequences have the structure with PDB id of "2RHC"?
  • The following sequence has no known structure. What are the structures of other sequences that are at least 80% identical to this particular sequence?
MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY

The code you saw above is Graql, the query language for Grakn. Its expressivity is what makes it human-readable and intuitive. In simple terms, Graql is a query language that can be understood and written by anyone, not just programmers.
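
For readers who want a feel for the three questions without a running Grakn server, here is a minimal plain-Python sketch (not Graql) of the same lookups over an in-memory stand-in for the model. The class `ProteinGraph`, its method names, the short sequence labels and the sample mappings are all hypothetical. Storing identicality as a fraction between 0 and 1 is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinGraph:
    """Tiny in-memory stand-in for the model described in this post."""
    # (sequence, pdb_id) pairs, mirroring sequence-structure-mapping
    structure_mappings: set = field(default_factory=set)
    # (target, matched, identicality) triples, mirroring sequence-sequence-alignment
    alignments: list = field(default_factory=list)

    def structures_of(self, sequence):
        """Question 1: which structures are mapped to this sequence?"""
        return {pdb for seq, pdb in self.structure_mappings if seq == sequence}

    def sequences_with(self, pdb_id):
        """Question 2: which sequences are mapped to this structure?"""
        return {seq for seq, pdb in self.structure_mappings if pdb == pdb_id}

    def structures_via_alignment(self, sequence, min_identicality=0.8):
        """Question 3: structures of sequences aligned closely enough."""
        found = set()
        for target, matched, identicality in self.alignments:
            # Only follow alignments where the given sequence is the target.
            if target == sequence and identicality >= min_identicality:
                found |= self.structures_of(matched)
        return found


# Made-up data: SEQ_D has no structure of its own but aligns well with SEQ_A.
graph = ProteinGraph(
    structure_mappings={("SEQ_A", "2RHC"), ("SEQ_B", "2RHC"), ("SEQ_C", "STRUCT_X")},
    alignments=[("SEQ_D", "SEQ_A", 0.92), ("SEQ_D", "SEQ_C", 0.55)],
)

print(sorted(graph.structures_of("SEQ_A")))
print(sorted(graph.sequences_with("2RHC")))
print(sorted(graph.structures_via_alignment("SEQ_D")))
```

In a knowledge graph each of these becomes a single declarative query over one interconnected dataset, rather than a hand-written loop per question.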

Extending the Knowledge Graph

As we decide to pull in more relevant data sources into the Grakn knowledge graph, the model can evolve and be extended with minimal effort.

An Example

Below is the code that defines the model I illustrated earlier in this post. If we wanted to extend this model by introducing the protein sequence function, with a mapping relationship to the protein sequence structure, we could do so by extending the model (a.k.a. the schema) like so:

define

sequence-sequence-alignment sub relationship,
  relates target-sequence,
  relates matched-sequence,
  has identicality,
  has positivity;

sequence-structure-mapping sub relationship,
  relates mapped-structure,
  relates mapping-sequence;

# added for the extension: links a structure to a function
structure-function-mapping sub relationship,
  relates mapping-structure,
  relates mapped-function;

sequence sub attribute datatype string,
  plays target-sequence,
  plays matched-sequence,
  plays mapping-sequence;

structure sub entity,
  plays mapped-structure,
  # added for the extension
  plays mapping-structure,
  has pdb-id;

# added for the extension
function sub attribute datatype string,
  plays mapped-function;

identicality sub attribute datatype double;
positivity sub attribute datatype double;
pdb-id sub attribute datatype string;

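The lines above that define the structure-function-mapping relationship, the mapping-structure role played by structure, and the function attribute are the extra code we need to add. Nothing else needs to change. The extended model of the knowledge graph now looks like this.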

The extended model: ‘function’ is added as an attribute and mapped with ‘structure’ (directly) and with ‘sequence’ (via inference)

Given the new relationship structure-function-mapping and the previous relationship sequence-structure-mapping, we can make use of Grakn’s automated reasoning capability to make an inference, resulting in new knowledge — the implied sequence-function-mapping relationship.

The implied-sequence-function-mapping rule above is telling Grakn that:

when:

  • there is a sequence, and
  • there is a structure, and
  • there is a function, and
  • the sequence and the structure have a mapping relationship, and
  • the structure and the function have a mapping relationship,

then:

  • consider the sequence and the function to have a mapping relationship.
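
The when/then logic listed above amounts to joining the two mappings on the structure they share. The following plain-Python sketch shows that join. It is only an illustration of the logic, not the Graql rule itself and not how Grakn's reasoner is implemented. The function name `infer_sequence_function`, the labels and the sample data are hypothetical.

```python
def infer_sequence_function(seq_structure, structure_function):
    """Derive (sequence, function) pairs from the two existing mappings.

    seq_structure: iterable of (sequence, structure) pairs
    structure_function: iterable of (structure, function) pairs
    """
    inferred = set()
    for sequence, structure in seq_structure:
        for mapped_structure, function in structure_function:
            # "when": the sequence maps to a structure, and that same
            # structure maps to a function
            if structure == mapped_structure:
                # "then": treat the sequence and the function as mapped
                inferred.add((sequence, function))
    return inferred


# Made-up facts: two sequences share one structure, a third has its own.
SEQ_STRUCTURE = {
    ("SEQ_A", "STRUCT_1"),
    ("SEQ_B", "STRUCT_1"),
    ("SEQ_C", "STRUCT_2"),
}
STRUCTURE_FUNCTION = {
    ("STRUCT_1", "enzyme"),
}

inferred = infer_sequence_function(SEQ_STRUCTURE, STRUCTURE_FUNCTION)
for sequence, function in sorted(inferred):
    print(sequence, "is mapped to function", function)
```

Note that SEQ_C yields nothing here, because its structure has no function mapped to it. The sketch computes every derived pair up front, which is simply the easiest way to show the logic in a few lines.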

With these additions to the schema, we can now ask the following questions:

  • Which sequences have the function “enzyme”?
  • Which functions are mapped either directly to the following sequence or indirectly via an aligned sequence that is at least 80% identical to the given sequence?
The sequence: MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY
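
To make these two questions concrete, here is a hedged plain-Python sketch of the lookups over in-memory data. It is not Graql. All function names, labels and sample facts are hypothetical, and the 0.8 threshold assumes that identicality is stored as a fraction rather than a percentage.

```python
MIN_IDENTICALITY = 0.8  # assumption: identicality is a fraction in [0, 1]


def functions_of(sequence, seq_structure, structure_function):
    """Functions reachable from a sequence through its structures."""
    structures = {st for seq, st in seq_structure if seq == sequence}
    return {fn for st, fn in structure_function if st in structures}


def sequences_with_function(function, seq_structure, structure_function):
    """First question: which sequences have the given function?"""
    structures = {st for st, fn in structure_function if fn == function}
    return {seq for seq, st in seq_structure if st in structures}


def functions_direct_or_aligned(sequence, seq_structure, structure_function,
                                alignments, min_identicality=MIN_IDENTICALITY):
    """Second question: functions mapped directly or via a close alignment."""
    candidates = {sequence}
    for target, matched, identicality in alignments:
        # Follow only alignments that start at the given sequence
        # and meet the identicality threshold.
        if target == sequence and identicality >= min_identicality:
            candidates.add(matched)
    functions = set()
    for candidate in candidates:
        functions |= functions_of(candidate, seq_structure, structure_function)
    return functions


# Made-up facts: SEQ_Q has no structure; it aligns well only with SEQ_A.
SEQ_STRUCTURE = {("SEQ_A", "STRUCT_1"), ("SEQ_B", "STRUCT_2")}
STRUCTURE_FUNCTION = {("STRUCT_1", "enzyme"), ("STRUCT_2", "transport")}
ALIGNMENTS = [("SEQ_Q", "SEQ_A", 0.85), ("SEQ_Q", "SEQ_B", 0.40)]

enzyme_sequences = sequences_with_function(
    "enzyme", SEQ_STRUCTURE, STRUCTURE_FUNCTION)
query_functions = functions_direct_or_aligned(
    "SEQ_Q", SEQ_STRUCTURE, STRUCTURE_FUNCTION, ALIGNMENTS)

print(sorted(enzyme_sequences))
print(sorted(query_functions))
```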

Grakn Rules

What you saw above is simply one example of how a Grakn knowledge graph can be extended to infer new knowledge. Rules can be written in any research domain to:

  • inject biological facts
  • infer based on new findings (hypotheses)
  • enforce constraints

It’s entirely up to you how you choose to make your knowledge graph more intelligent by writing rules tailored to your own work.

The Opportunities Are Endless!

Grakn is about modelling intelligent knowledge graphs in an intelligent way. We believe simplicity to be a cornerstone of intelligence. Hence, the query language — Graql. What you can model and query with a Grakn knowledge graph is only limited by your will and imagination.

See an example of the thought process behind modelling a dataset in Grakn. Read about the Schema Concepts and Rules. Go through examples of Graql queries or see how to write your own.

For Your Inspiration

Have an idea?

I’d love to hear how you see Grakn applying to your field. Whether it’s a question, a comment or feedback, get in touch with me on Grakn’s Slack community, shoot me an email at soroush@grakn.ai or tweet me at @SaffariSoroush.

Vaticle

Creators of TypeDB and TypeQL