Welcome to the multi-omics age of biological big data, where data about a myriad of interacting biological systems is being collected. The data may represent genomes, proteomes, epigenomes, metabolomes, transcriptomes, biological pathways, diseases, drugs, published articles, imaging, and even medical and clinical data. There is no doubt about the immense value of this rapidly growing body of data.
Biological data is being accumulated all over the world by various organisations and projects. Some examples include: NCBI, NIH, 1000 Genomes, Ensembl Genomes, Gene Expression Omnibus, Gene Ontology, Global Biotic Interactions, Human Genome Diversity Project, KEGG, Reactome, MIT Cancer Genomics Data, NCI, Protein Data Bank, DrugBank, DisGeNET, The Drug Gene Interaction Database, PubChem Project, The Cancer Genome Atlas, and UniProt.
However, dealing with these terabytes and petabytes of data comes with its own challenges: how to accumulate data from heterogeneous sources, how to handle the complexity of their interactions, and how to standardise the structure of the complex data. The solution? Grakn.
I’ve been working on various projects utilising some of the above data sources, and I arrived at the following methodology to cater for the various formats (e.g. CSV, JSON, XML) and the disparateness of the data (inspired by the migration tutorial here):
- First, model the data into a single structured architecture — a schema.
- Then, migrate the data into a single place — a knowledge graph.
Let’s look at how this can be done using the following example datasets:
The first is the “disease_names” dataset from NCBI ClinVar, which lists diseases.
The second is the “curated_gene_disease_associations” dataset from DisGeNET, which contains relationships between diseases and genes.
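Before either dataset can be migrated, its rows have to be read into a uniform in-memory shape. As a minimal sketch, assuming both files are tab-separated with a header row (an assumption, so check the actual downloads), the following Java turns such text into one map per row, keyed by column name. The class name `TsvRows` and the sample column names are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TsvRows {

    // Parses tab-separated text whose first line is a header
    // into one ordered map (column name -> cell value) per data row.
    public static List<Map<String, String>> parse(String tsv) {
        String[] lines = tsv.split("\\R"); // split on any line break
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.length == 0 || lines[0].isEmpty()) {
            return rows; // no header, nothing to parse
        }
        String[] header = lines[0].split("\t", -1);
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = lines[i].split("\t", -1); // -1 keeps trailing empty cells
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length; c++) {
                // Missing trailing cells are treated as empty strings.
                row.put(header[c], c < cells.length ? cells[c] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample; the real files have more columns.
        String sample = "diseaseId\tdiseaseName\nD1\tExample disease\n";
        System.out.println(parse(sample));
    }
}
```

Keying each row by column name rather than position means the later migration step does not break if a source file reorders its columns.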
The Grakn schema for the above data would look like the following:
We can visualise the schema using Grakn’s knowledge graph IDE, Workbase:
To migrate the data instances, we can use the following Java migration script:
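As a rough sketch of the templating step in such a script, the Java below turns one parsed row into a Graql insert query string. The type, attribute and role names (`disease`, `gene`, `disease-id`, `disease-name`, `gene-id`, `gene-disease-association`, `associated-gene`, `associated-disease`) are hypothetical and would need to match your own schema. The backslash escaping of string values is also an assumption. The code only builds query strings; opening a Grakn session and executing the queries is deliberately left out, since the client API differs between versions.

```java
public class GraqlTemplates {

    // Wraps a value in double quotes, escaping backslashes and quotes.
    // Assumption: Graql string literals accept backslash escapes.
    public static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds an insert query for one disease (hypothetical schema names).
    public static String diseaseInsert(String diseaseId, String diseaseName) {
        return "insert $d isa disease, has disease-id " + quote(diseaseId)
                + ", has disease-name " + quote(diseaseName) + ";";
    }

    // Builds a match-insert query linking an existing gene and an existing
    // disease through a relation (hypothetical schema names).
    public static String associationInsert(String geneId, String diseaseId) {
        return "match $g isa gene, has gene-id " + quote(geneId) + "; "
                + "$d isa disease, has disease-id " + quote(diseaseId) + "; "
                + "insert (associated-gene: $g, associated-disease: $d) "
                + "isa gene-disease-association;";
    }

    public static void main(String[] args) {
        // Placeholder identifiers, not real records.
        System.out.println(diseaseInsert("D1", "Example disease"));
        System.out.println(associationInsert("G1", "D1"));
    }
}
```

The association query matches on identifiers rather than inserting new entities, so diseases and genes need to be loaded before their relationships.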
For a detailed tutorial on migrating data, you can refer to the following documentation:
If you have any questions or comments, or would like to collaborate, please shoot me an email at firstname.lastname@example.org or reach out through LinkedIn. You can also talk to us and discuss your ideas with the Grakn community.
Thank you. :)