Using a Grakn Knowledge Graph for Sequence Alignment Analysis

Soroush Saffari
Published in Vaticle
Oct 16, 2018

The sequencing of proteins and DNA has arguably been one of the most significant advances in biology in the last 50 years. It has enabled researchers to produce sequence alignments that can be analysed precisely to discover meaningful evolutionary relationships.

The Basic Local Alignment Search Tool (BLAST) has become the go-to tool for many bioinformaticians who routinely search for sequence alignments as part of their research workflow. The flexibility of its search algorithms and the reliability of their output are widely regarded as the main reasons for BLAST’s ever-growing popularity.

There are many articles that delve deep into explaining the capabilities and performance of various BLAST algorithms. In this post, however, we aim to focus on the workflow that involves using BLAST. This article attempts to show how this workflow can be significantly improved to:

  • reduce the time the researcher spends on pre-processing the data, and
  • represent and visualise the BLAST output in a more efficient way that enables more in-depth and interconnected analysis.

The Typical Workflow

The diagram below illustrates the typical workflow of using BLAST in a research environment.

The typical BLAST workflow

As we can see here, for each target sequence, the researcher has to:

  1. run a BLAST search,
  2. go through the BLAST Report to identify the relevant alignments,
  3. transfer each identified alignment to one single spreadsheet, and
  4. use that as a base for analysis.

Imagine repeating the above for 1000 target sequences. If there is one word that could describe the insanity of this process, it would be “tedious”. I doubt that any bioinformatician sees their skills and intelligence utilised in a process that is heavily slowed down by such an enormous overhead.

Furthermore, the flat representation of data in a spreadsheet or the BLAST report as seen below, or any table-based view for that matter, is in no way the best approach to represent such complex and interconnected data.

A typical BLAST report

It’s clear that this workflow is far from ideal. An optimised alternative would be one that:

  1. automates the manual process of running BLAST searches, and
  2. automatically pulls the output of each BLAST search into one single knowledge graph that represents the data in its true nature — complex and interconnected.

An Optimised Workflow

The diagram below illustrates a workflow that requires minimal effort to pre-process and prepare multiple BLAST outputs for analysis and discovery.

An optimised workflow

As shown in the diagram above, Grakn’s Knowledge Graph plays the role of a single, central knowledge base that contains the results from all BLAST searches. Here is how the process takes place:

  1. A set of proteins is imported into the knowledge graph.
  2. A Graql query is written to extract the target sequences from the knowledge graph.
  3. A BLAST search runs for each of the target sequences, via the NCBI BLAST API.
  4. As soon as each BLAST search completes, the alignments and all their interconnected data are imported back into the knowledge graph.
  5. The analysis begins by querying the knowledge graph to filter and provide the relevant alignments found by multiple BLAST searches.
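
The steps above can be sketched as a small orchestration loop. This is a minimal sketch under stated assumptions, not the code from the repository: `fetch_targets`, `run_blast` and `store_alignments` are hypothetical callables standing in for the Graql query, the NCBI BLAST request and the import step respectively.

```python
from typing import Callable, Dict, Iterable, List

Alignment = Dict[str, object]  # hypothetical record shape for one BLAST hit


def run_pipeline(
    fetch_targets: Callable[[], Iterable[str]],
    run_blast: Callable[[str], List[Alignment]],
    store_alignments: Callable[[str, List[Alignment]], None],
) -> int:
    """Run one BLAST search per target sequence and store every result.

    All three callables are placeholders (assumptions), not Grakn or NCBI APIs.
    Returns the total number of alignments stored.
    """
    stored = 0
    for sequence in fetch_targets():            # step 2: targets come from the graph
        alignments = run_blast(sequence)        # step 3: one search per target
        store_alignments(sequence, alignments)  # step 4: import once the search completes
        stored += len(alignments)
    return stored


if __name__ == "__main__":
    # Stand-ins so the sketch runs without a database or network access.
    graph: Dict[str, List[Alignment]] = {}
    total = run_pipeline(
        fetch_targets=lambda: ["MNVGTAHSEV", "MKTAYIAKQR"],
        run_blast=lambda seq: [{"matched": seq[::-1]}],
        store_alignments=lambda seq, hits: graph.setdefault(seq, []).extend(hits),
    )
    print(total, "alignments stored for", len(graph), "targets")
```

Passing the three steps in as callables keeps the loop independent of any particular client library, so each part can be swapped or tested in isolation.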

By now you might have already guessed it: a workflow that relies so heavily on automation is only feasible by writing code. That is true, but at the end of the day what matters is how much effort we have to put into it, and whether the outcome is worth it.

To answer this question, we built an example knowledge graph for protein sequences, and set up the automated components around it.

More technical details about this example, as well as full instructions for running it on your machine, are available in the github.com > graknlabs > biograkn > blast repository.

The Ontology of Protein Sequence Alignments

Before we look at some example queries, it’s important to understand how the knowledge graph is modelled: a representation that truly describes a protein dataset for sequence alignments.

Red nodes are entities, green nodes are relationships and blue nodes are attributes

The visualised model above translates to:

  • A protein: a) has a name, a sequence and an identifier, b) is owned by a species, c) is stored in a database
  • The sequence of a protein: a) can have an alignment with another protein sequence, b) is stored in a database
  • An alignment between two sequences: a) implies an alignment between the proteins that own the aligned sequences, b) has identicality, positivity, gaps, midline and an identifier (the BLAST Reference)
  • A Species: has a name
  • A Database: has a name
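
To make the alignment attributes concrete, the sketch below mirrors the model as plain Python data classes and derives identicality and positivity from a protein BLAST midline, in which an identical residue is printed as its letter, a conservative substitution as `+`, and anything else as a space. The class and field names are illustrative assumptions, not the schema used in the repository. Folding the sequence into the protein, expressing both scores as fractions of the alignment length, and keeping the midline at full alignment length (trailing spaces included) are simplifying assumptions as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    identifier: str
    name: str
    sequence: str
    species: str   # name of the owning species
    database: str  # name of the database that stores the protein


@dataclass(frozen=True)
class Alignment:
    blast_reference: str  # identifier of the alignment
    target: Protein
    matched: Protein
    midline: str          # BLAST midline for the aligned region
    gaps: int

    @property
    def identicality(self) -> float:
        """Fraction of aligned positions where both residues are identical."""
        if not self.midline:
            return 0.0
        identical = sum(1 for ch in self.midline if ch.isalpha())
        return identical / len(self.midline)

    @property
    def positivity(self) -> float:
        """Fraction of aligned positions that are identical or marked '+'."""
        if not self.midline:
            return 0.0
        positive = sum(1 for ch in self.midline if ch.isalpha() or ch == "+")
        return positive / len(self.midline)
```

In a knowledge graph these would be separate, connected concepts rather than nested objects, which is what lets a query traverse from an alignment to a species or a database in a single step.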

Analysing Alignments in a Knowledge Graph

What follows is a small set of example queries, meant purely to demonstrate Graql, Grakn’s query language.

Filter Based on the Alignment’s Attributes

Below is the sequence for the target protein ORM1-like protein 3.

MNVGTAHSEVNPNTRVMNSRGIWLSYVLAIGLLHIVLLSIPFVSVPVVWTLTNLIHNMGMYIFLHTVKGTPFETPDQGKARLLTHWEQMDYGVQFTASRKFLTITPIVLYFLTSFYTKYDQIHFVLNTVSLMSVLIPKLPQLHGVRIFGINKY

We’d like to get sequence alignments with identicality of at least 90% and positivity of at least 85%.
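
Outside Graql, the same thresholds amount to a simple predicate over alignment records. The sketch below assumes each alignment is a dictionary with `identicality` and `positivity` stored as fractions between 0 and 1. Both the record shape and the scale are assumptions, and the sample values are made up for illustration.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]  # hypothetical shape of one alignment record


def filter_by_scores(
    alignments: Iterable[Record],
    min_identicality: float = 0.90,
    min_positivity: float = 0.85,
) -> List[Record]:
    """Keep alignments that meet both thresholds (inclusive)."""
    return [
        a
        for a in alignments
        if a["identicality"] >= min_identicality and a["positivity"] >= min_positivity
    ]


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        {"blast_reference": "hit-1", "identicality": 0.97, "positivity": 0.99},
        {"blast_reference": "hit-2", "identicality": 0.91, "positivity": 0.80},
        {"blast_reference": "hit-3", "identicality": 0.62, "positivity": 0.88},
    ]
    for hit in filter_by_scores(sample):
        print(hit["blast_reference"])
```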

Query and Visualise

Purple nodes are proteins (target at top | matched at bottom), green nodes are protein sequences, blue nodes are attributes of the alignment and red node is the database that contains the matched protein

Query and Get Results

Partial Results

Filter Based on the Species of Matched Sequences

Given the same protein sequence as shown above, we’d like to get only the sequence alignments, where the matched sequence belongs to the species Fukomys damarensis.
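
With flat records, this filter requires a join: each alignment must first be linked to its matched protein, and that protein to its species. The sketch below shows that lookup in plain Python. The `matched_id` and `species` field names and the two-table layout are assumptions made for illustration, not the structure used in the repository.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]  # hypothetical record shape


def matched_in_species(
    alignments: Iterable[Record],
    proteins: Dict[str, Record],
    species: str,
) -> List[Record]:
    """Keep alignments whose matched protein belongs to the given species.

    `alignments` carry a 'matched_id'; `proteins` maps that id to a record
    with a 'species' name. Alignments with an unknown protein id are skipped.
    """
    kept = []
    for alignment in alignments:
        protein = proteins.get(alignment["matched_id"])
        if protein is not None and protein["species"] == species:
            kept.append(alignment)
    return kept
```

In the knowledge graph the same hop from alignment to protein to species is part of the query pattern itself, so no separate lookup table is needed.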

Query and Visualise

Purple nodes are proteins (target at top | matched at bottom), green nodes are protein sequences, blue nodes are attributes of the alignment and red node is the species that has the matched protein

Query and Get Results

Partial Results

Filter Based on a Particular Subset of the Matched Sequences

Given the same protein sequence as above, we’d like to get only the sequence alignments whose matched sequence contains the subsequence KF.
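
As a plain-Python analogue, the check is a substring test on the matched sequence. The sketch below also reports where the motif occurs, which goes slightly beyond the filter itself. The `matched_sequence` field name is an assumption made for illustration.

```python
from typing import Any, Dict, Iterable, List


def motif_positions(sequence: str, motif: str) -> List[int]:
    """Return the 0-based start index of every occurrence, overlaps included."""
    if not motif:
        return []
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def containing_motif(
    alignments: Iterable[Dict[str, Any]], motif: str = "KF"
) -> List[Dict[str, Any]]:
    """Keep alignments whose matched sequence contains the motif."""
    return [a for a in alignments if motif in a["matched_sequence"]]
```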

Query and Visualise

Purple nodes are proteins (target at top | matched at bottom), green nodes are the protein sequences and blue nodes are attributes of the alignments

Query and Get Results

Partial Results

The Code

The queries shown above are written in Graql, the query language of Grakn, the knowledge graph. Graql’s expressivity is what makes it so readable. In simple terms, Graql is a language that can be understood and written by anyone, not just programmers.

In this optimised workflow, Graql is used for analysis. The part of the code that automates the workflow consists of two files:

  1. migrate.py: reads a .fasta file containing 12 proteins that relate to asthma (exported from UniProt) and inserts each protein along with its sequence into the knowledge graph.
  2. blast.py: 1) extracts the target sequences from the knowledge graph, 2) runs a BLAST search for each sequence, and 3) imports the result from each BLAST search back into the knowledge graph.
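
The first step of such a migration is reading the .fasta file. The sketch below is a minimal standard-library FASTA reader, not the contents of migrate.py. It yields one (header, sequence) pair per record and leaves the insertion into the knowledge graph out entirely. The sample header in the usage example is made up.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for each record in FASTA-formatted lines.

    A record starts with a '>' header line; the sequence may span several
    lines, which are joined. Blank lines are ignored, as is any sequence
    text that appears before the first header.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Made-up record; a real run would iterate over an open file instead.
    example = [">example|0000|EXAMPLE Example protein", "MNVGTAHSEV", "NPNTRVMNSR"]
    for header, sequence in parse_fasta(example):
        print(header, len(sequence))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly without loading the whole file into memory.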

To run this example on your local machine, follow these instructions.

Tip of the Iceberg

This example only touches the surface of how a knowledge graph can revolutionise workflows that rely on bioinformatic analytical tools, such as BLAST. The schema presented in this article can be extended and others can be modelled to represent complex concepts and rules. This enables writing powerful and intuitive Graql queries for obtaining valuable insights over complex datasets.

Grakn’s Knowledge Graph has the potential to complement the vast majority of biological research domains. These include analysing DNA sequence alignments, exploring genomics, drug discovery, disease networks, neuro-informatics, and even investigating protein structure, function, properties and classifications.

We want to hear from you :)

Share your unique experience of using BLAST and how you think this optimised workflow relates (or not) to the way you use it. As part of this or other workflows, what other publicly available analytical tools and programs do you normally use? How do you see a central knowledge base, in the form of a knowledge graph, adding value in your field?

Talk to us and discuss your ideas with the Grakn community 🙂

If you have any questions or comments about this work, please send me an email at soroush@grakn.ai or tweet me @SaffariSoroush.
