Developing and implementing a data warehouse to support genomic analyses
Here we discuss how to implement and model a data warehouse in order to support genomic analyses treating them as big data input.
We need to observe the data warehouse UML:
As we see we have:
- 1 Fact table
- 3 Dimension tables
This is the final structure we will obtain after querying our RDBMS.
We need to prepare our tools:
Yes! We don’t need anything else for achieve our goal!
Let’s move on…
With this script you can create your first data warehouse!
You don’t know how a data warehouse works?! Check this…
Data warehouse creation sample script
- At first we have the drop statement. Be carful if you don’t want to re-upload each time data.
- Then we initialise the dimension_tables…
- …and then the fact_table
- Choose the length of your element columns
- And here we are! First milestone reached!
After the creation of our data warehouse we need to populate the db with some data.
And here it is a simple ETL flow that allows you to get data from excel files and to put into the data warehouse
This flow grants you to upload simple gVCF file data into your data warehouse.
gVCF how it works?
Is a standar de-facto in genomics to describe variants in patients.
With this kind of file format you can easily retrieve data you need and put it into the various tables.
My ETL flow has a pre-processing step that deletes the header and get only the data needed for our purpose.
The entire ETL flow need from 30min to ~1h to upload 1.2GB of data (genome or exome).
Transforming a DAG dataset into a tree
All we know that modern SAIKU plugin does not support DAG (Direct acyclic graphs) due to their structure. All we can do is to transform a DAG into a tree (Yes we lose some information but we have a usable data set!).
First which dataset has a DAG data structure? It’s HPO database we are going to integrate!.
Human phenotype ontology it’s a great database containing all the terms referring to a particular disease, from general view point to the specific term.
Source: http://human-phenotype-ontology.github.io/ (It’s free an open source)
Let’s move on to the implementation
This is a simple DAG graph, note that a member could have more than one parent!
Then we we will do is to cut the leaf where the difference in number is lower.
For example we will cut 3–10, 7–11, 11–2 etc…
With this query we select all the terms which are mostly similar.
Then with this recursive query we will construct our tree term where each term is highly related to it’s predecessor and avoiding duplicates on the last column (Phenotype) which is our Primary key!
This is done because each disease has a phenotype term, most SNP are associated to a phenotype term and every phenotype term is associated to a SNP.
And here we have obtained our dt_term somewhat in a easy way!
If you have a better way to do this do not hesitate to tell me!
This dimension table is the only one that we need to pay more attention because of it’s data source. The best feature of data warehouse is that is integrated, we can integrate every kind of data source inside it!.
Creating a SAIKU analysis and a dashboard with Pentaho
This is the funniest and easiest part of the project.
There are a lot of video tutorial on how to develop a dashboard via Pentaho so here i will not teach you those things but i will show you what this kind of data warehouse could produce at the end.
This is a simple dashboard (yes toooo simple!) in which you can describe all the percent of your variants found.
Here instead it is a a kind of graph you can find in SAIKU plugin.
Those are powerful tools that could bring rich and meaningful information simply by some graphs.
Where you can find more infos
You can find more info about this project at:
- https://github.com/xCloudx8/datawarehouse_for_genomes (Main repo of the project)
- https://github.com/xCloudx8/datawarehouse_for_genomes/blob/master/data_warehouse_thesis.pdf (File thesis where everything is explained better with a deeper informations of all steps.)
BSD 3-clause “New” or “Revised” License
A permissive license similar to the BSD 2-Clause License, but with a 3rd clause that prohibits others from using the name of the project or its contributors to promote derived products without written consent.
Hope you enjoyed…=)