Modelling & Migrating Big Biological Data with Grakn

Syed Irtaza Raza
Jan 28, 2019 · 3 min read

Welcome to the multi-omics age of biological big data, where data about a myriad of biological systems, all interacting with one another, is being collected. The data may represent genomes, proteomes, epigenomes, metabolomes, transcriptomes, biological pathways, diseases, drugs, published articles, imaging, and even medical and clinical data. There is no doubt about the immense value of this rapidly growing body of data.

Biological data is being accumulated by organisations all over the world. Some examples include: NCBI, NIH, 1000 Genomes, Ensembl Genomes, Gene Expression Omnibus, Gene Ontology, Global Biotic Interactions, Human Genome Diversity Project, KEGG, Reactome, MIT Cancer Genomics Data, NCI, Protein Data Bank, DrugBank, DisGeNET, The Drug Gene Interaction Database, PubChem Project, The Cancer Genome Atlas, and UniProt.

However, dealing with these terabytes and petabytes of data comes with its own challenges: how to accumulate data from heterogeneous sources, how to handle the complexity of their interactions, and how to standardise the structure of the complex data. The solution? Grakn.

I’ve been working on various projects that use some of the above data sources, and I have adopted the following two-step methodology to cope with the variety of formats (e.g. CSV, JSON, XML) and the disparate nature of the data. (Inspired by the migration tutorial here)

  1. First, model the data into a single structured architecture — a schema.
  2. Then, migrate the data into a single place — a knowledge graph.
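
Before either step, it can help to read every source into the same in-memory shape. The following is a minimal sketch of that idea using only the Python standard library. The function names and the `record_tag` parameter are hypothetical, and the XML reader assumes flat records whose child elements hold plain text.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def rows_from_csv(text, delimiter=","):
    """Parse delimited text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def rows_from_json(text):
    """Parse JSON text; a single object is wrapped into a one-item list."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def rows_from_xml(text, record_tag):
    """Turn each <record_tag> element into a dict of child tag -> text.

    Assumes flat records: nested elements and attributes are ignored.
    """
    root = ET.fromstring(text)
    return [
        {child.tag: (child.text or "").strip() for child in record}
        for record in root.iter(record_tag)
    ]
```

Once every source yields a list of dictionaries, the modelling and migration steps no longer need to care which file format a record came from.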

1. Modelling

As an example, take two datasets. The first is the “disease_names” dataset from NCBI ClinVar, which lists diseases.

sample data

The second is the “curated_gene_disease_associations” dataset from DisGeNET, which contains relationships between diseases and genes.

sample data

The Grakn schema for the above data would look like the following:
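
A minimal sketch of such a schema, held in a Python string so that a client can send it to Grakn later. Every type, role, and attribute name here is an assumption chosen for illustration. The syntax follows Graql 1.5 and later, where the keyword is `relation`; earlier releases used `relationship`. The small `defined_types` helper is also hypothetical and only lists what the schema declares.

```python
# Hypothetical Graql schema for the two datasets; all names are illustrative.
SCHEMA = """
define

disease sub entity,
    has disease-id,
    has disease-name,
    plays associated-disease;

gene sub entity,
    has gene-id,
    has gene-symbol,
    plays associated-gene;

gene-disease-association sub relation,
    relates associated-gene,
    relates associated-disease,
    has association-score;

disease-id sub attribute, datatype string;
disease-name sub attribute, datatype string;
gene-id sub attribute, datatype string;
gene-symbol sub attribute, datatype string;
association-score sub attribute, datatype double;
"""


def defined_types(schema):
    """Return the labels introduced by `sub` statements, in source order."""
    return [line.split()[0] for line in schema.splitlines() if " sub " in line]
```

The design idea is that diseases and genes are entities, each association row becomes a relation between one gene and one disease, and the score lives on the relation rather than on either entity.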

We can visualise the schema using Workbase, Grakn’s knowledge graph IDE:

For detailed documentation of the Grakn schema, refer to the documentation here. Another useful example of schema modelling is available here.

2. Migrating

For a detailed tutorial on migrating data, refer to the following documentation:

  1. Node.js example
  2. Java example
  3. Python example
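
The core of such a migration is a set of template functions that turn one parsed row into one Graql query. Here is a rough Python sketch of that pattern. The column names (`geneId`, `geneSymbol`, `diseaseId`, `diseaseName`, `score`) are assumed from the DisGeNET file, and the type, role, and attribute names (`gene`, `disease`, `gene-disease-association`, and so on) are hypothetical, so adjust both to your own schema and data.

```python
def quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def gene_query(row):
    """Insert query for one gene (hypothetical type and attribute names)."""
    return (
        f'insert $g isa gene, '
        f'has gene-id {quote(row["geneId"])}, '
        f'has gene-symbol {quote(row["geneSymbol"])};'
    )


def disease_query(row):
    """Insert query for one disease (hypothetical names)."""
    return (
        f'insert $d isa disease, '
        f'has disease-id {quote(row["diseaseId"])}, '
        f'has disease-name {quote(row["diseaseName"])};'
    )


def association_query(row):
    """Match an existing gene and disease, then insert the relation."""
    return (
        f'match $g isa gene, has gene-id {quote(row["geneId"])}; '
        f'$d isa disease, has disease-id {quote(row["diseaseId"])}; '
        f'insert (associated-gene: $g, associated-disease: $d) '
        f'isa gene-disease-association, '
        f'has association-score {float(row["score"])};'
    )


def build_queries(rows):
    """Emit each entity once, then the relations that link them."""
    queries, seen = [], set()
    for row in rows:
        for key, template in (("geneId", gene_query), ("diseaseId", disease_query)):
            if (key, row[key]) not in seen:
                seen.add((key, row[key]))
                queries.append(template(row))
    queries.extend(association_query(row) for row in rows)
    return queries
```

Each resulting string can then be run inside a write transaction with the Grakn client of your choice, as the tutorials above show. Entities are emitted before relations so that every `match` clause finds the gene and disease it refers to.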

If you have any questions or comments, or would like to collaborate, please shoot me an email at syed@grakn.ai or reach out through LinkedIn. You can also talk to us and discuss your ideas with the Grakn community.

Thank you. :)
