Insight into complex scientific data using a graph data store
The intent of this story is to demonstrate the value of capturing complex scientific data and metadata in a graph data store and how this can provide advanced insight into these data. I will demonstrate this using a publicly available dataset and describe how this generalizes to other scientific data.
In order to fully understand a scientific dataset, it is important to understand the context in which the data was acquired. This knowledge is often captured in the form of notes in a lab notebook, an Excel spreadsheet or some other external file (Matlab, Text, CSV, Code, etc.). Unfortunately, sometimes this information resides solely in the collective memory of the investigators who performed the experiment. Obviously, we can do better than that.
Ideally, this “metadata” is captured alongside scientific data files in a way that gives scientists instant insight into all relevant information about the dataset. Datasets should be self-descriptive and stored in a way that is easily queryable, allowing the investigator, data scientist or clinician to rapidly view the data, test hypotheses or run analyses. In addition, the dataset should be stored in a way that facilitates data transfer and sharing with collaborators, and that ultimately allows the data to be made publicly available without additional friction.
Capturing scientific data and metadata in such a meaningful way is far from trivial. Data in the neurosciences spans a wide variety of modalities and formats, and there are many different scientific workflows within the community. Intuitively, a database seems ideally suited to capturing scientific metadata. However, few laboratories use one, despite the clear benefits it provides (e.g. standardization, querying and access control).
One reason for the limited use of databases is that they require specialized knowledge to create and maintain over time, and few scientific labs have the resources or expertise to support this. Another reason is that a database typically requires a static schema, which doesn’t match the dynamic reality of the scientific workflow. This suggests that a user-friendly database solution that doesn’t require continuous management or specialized expertise would be a meaningful resource for the scientific community, increasing productivity in the lab and fostering collaboration.
Enter graph databases…
A graph database is particularly well-suited for the complex metadata that we typically see in scientific datasets. Graph databases differ from relational databases in that relationships between objects are treated as first-class citizens. Data in a graph database is represented by a set of nodes and edges: nodes represent things, and edges represent relationships between those things.
Example of a graph-based scientific metadata schema
An example scientific dataset (see figure below) is represented by objects of type: Experiment, Trial, Stimulation parameters, Electrode, Subject, Protocol. Each object can have multiple properties associated with it, such as a ‘name’ for an experiment, and an ‘age’ for a subject. In addition, various types of relationships exist between these objects. Here, Subjects ‘participate’ in an Experiment. Trials ‘belong-to’ an Experiment and Electrodes are ‘used’ during an Experiment and are ‘associated’ with stimulation parameters. Finally, we can associate scientific data files (e.g. EEGs, MRIs, Videos, etc.) with any of these objects. In the figure below, scientific data-files are associated with Trial objects in the dataset.
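To make the schema above concrete, here is a minimal sketch in plain Python. It is not the Blackfynn API; the `Node` and `Edge` classes, the record names, the property values and the file name are all hypothetical and only illustrate how objects, properties and relationships fit together.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                  # object type, e.g. "Experiment"
    name: str                                  # hypothetical identifier
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: Node
    relation: str                              # relationship type
    target: Node


# Hypothetical records for each object type in the example schema
experiment = Node("Experiment", "exp-01", {"name": "Stimulation study"})
subject = Node("Subject", "subj-01", {"age": 34})
trial = Node("Trial", "trial-01")
electrode = Node("Electrode", "elec-01")
stim = Node("StimulationParameters", "stim-01", {"frequency_hz": 50})

edges = [
    Edge(subject, "participates-in", experiment),
    Edge(trial, "belongs-to", experiment),
    Edge(electrode, "used-in", experiment),
    Edge(electrode, "associated-with", stim),
]

# Data files can hang off any record; here a (made-up) file on the Trial
trial.properties["files"] = ["trial-01.edf"]


def related(node, edges):
    """Return (relation, other record name) pairs for edges touching `node`."""
    pairs = []
    for edge in edges:
        if edge.source is node:
            pairs.append((edge.relation, edge.target.name))
        elif edge.target is node:
            pairs.append((edge.relation, edge.source.name))
    return pairs


experiment_links = related(experiment, edges)
print(experiment_links)
```

Adding a new object type or relationship here means adding another `Node` or `Edge`; nothing about the existing records has to change, which is the flexibility the graph model is meant to provide.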
Organizing scientific data and metadata in this manner provides an intuitive way to interact with the data. Graph databases provide a transparent way to model data and the flexibility to expand the schema in an ever-changing scientific data environment. These characteristics make a graph database an ideal data management solution for the sciences. Adoption could lead to improved data quality, better retention of knowledge and provenance, and an increased ability of the scientific community to collaborate by sharing data that is inherently meaningful.
Hands-on with a real scientific dataset on the Blackfynn platform
To demonstrate the above, I will use an existing public dataset to show how you can leverage the Blackfynn platform to manage scientific data as a graph. Blackfynn provides a high-performance, intuitive interface (Web and API) for scientific data management and uses Amazon Neptune on the backend to store data as a graph (https://aws.amazon.com/neptune/).
Hetionet is a complex scientific dataset developed by Daniel Himmelstein and Sergio Baranzini¹ and is structured as a graph. The project website describes the dataset as: “Hetionet is a network of biology, disease, and pharmacology. Knowledge from millions of biomedical studies over the last half century have been encoded into a single hetnet. Version 1.0 contains 47,031 nodes of 11 types and 2,250,197 relationships of 24 types”. A schematic describing the structure of the dataset is provided below. The entire dataset is available in JSON format on GitHub: https://github.com/dhimmel/hetionet
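As a rough sketch of how a graph export like this can be inspected before import, the snippet below tallies node and relationship types from a JSON document. The layout (`nodes` with a `kind` field, `edges` with `source_id`, `target_id` and `kind`) is an assumption about the export format, and the four records are made-up placeholders, not real Hetionet entries.

```python
import json
from collections import Counter

# Tiny stand-in document; field names are assumed and records are placeholders
sample = json.loads("""
{
  "nodes": [
    {"kind": "Disease", "identifier": "D1", "name": "migraine"},
    {"kind": "Compound", "identifier": "C1", "name": "compound-A"},
    {"kind": "Gene", "identifier": 1, "name": "GENE1"},
    {"kind": "Gene", "identifier": 2, "name": "GENE2"}
  ],
  "edges": [
    {"source_id": ["Compound", "C1"], "target_id": ["Disease", "D1"], "kind": "palliates"},
    {"source_id": ["Compound", "C1"], "target_id": ["Gene", 1], "kind": "upregulates"},
    {"source_id": ["Disease", "D1"], "target_id": ["Gene", 2], "kind": "associates"}
  ]
}
""")


def summarize(hetnet):
    """Count how many nodes and edges exist for each type."""
    node_kinds = Counter(node["kind"] for node in hetnet["nodes"])
    edge_kinds = Counter(edge["kind"] for edge in hetnet["edges"])
    return node_kinds, edge_kinds


node_kinds, edge_kinds = summarize(sample)
print(len(sample["nodes"]), "nodes of", len(node_kinds), "types")
print(len(sample["edges"]), "relationships of", len(edge_kinds), "types")
```

Run against the full export, the same two counters should reproduce the node and relationship totals quoted above, provided the assumed field names match the file.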
Project website: het.io (Hetnets, heterogeneous networks, in biomedicine)
I used Python and the Blackfynn Python client to create a script to import the dataset into the Blackfynn platform. You can find the script here: import_hetio_code. The script creates the models, the records and all the edges between the records. Once imported, the structure of the dataset is derived from the links between the data records. It is easy to see that records of type “Gene” are heavily connected to many of the other records in the platform.
In a graph, relationships (edges) between records can have a specific relationship type and direction. It is therefore possible to see which genes are upregulated or downregulated by various compounds. In fact, it is entirely possible to add more data to these relationships, which opens up a plethora of possibilities to enrich a dataset (e.g. you could attach a level of confidence to a relationship).
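The idea of typed, directed relationships that carry their own data can be sketched as follows. This is plain Python rather than the platform's API; the `Relationship` class, the compound and gene names, and the `confidence` values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Relationship:
    source: str                                # record the edge starts from
    kind: str                                  # relationship type
    target: str                                # record the edge points to
    properties: dict = field(default_factory=dict)


# Made-up relationships with a hypothetical "confidence" property
relationships = [
    Relationship("compound-A", "upregulates", "GENE1", {"confidence": 0.9}),
    Relationship("compound-A", "downregulates", "GENE2", {"confidence": 0.4}),
    Relationship("compound-B", "downregulates", "GENE1", {"confidence": 0.8}),
]


def targets_of(source, kind, relationships, min_confidence=0.0):
    """Targets reached from `source` via `kind`, filtered by edge confidence."""
    return [
        rel.target
        for rel in relationships
        if rel.source == source
        and rel.kind == kind
        and rel.properties.get("confidence", 0.0) >= min_confidence
    ]


upregulated = targets_of("compound-A", "upregulates", relationships)
confident_down = targets_of(
    "compound-A", "downregulates", relationships, min_confidence=0.5
)
print(upregulated, confident_down)
```

Because the confidence lives on the edge rather than on either record, the same gene can be linked to different compounds with different levels of certainty.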
Looking at the Disease record for “Alzheimer’s disease”, we can see how this record relates to the rest of the dataset. The record has a number of properties associated with it, and it is related to 275 Gene records, 3 other Disease records, 9 Compound records, 20 Anatomy records and 44 Symptom records. It is easy to traverse the graph and see the specifics of the related records.
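A per-type summary like the one above amounts to counting a record's neighbors by type. The sketch below shows that calculation on a handful of placeholder edges; the record names other than the disease are invented and the counts bear no relation to the real dataset.

```python
from collections import Counter

# Each record is a (type, name) pair; edges are unordered pairs of records.
# All neighbors here are placeholders, not real Hetionet entries.
alzheimers = ("Disease", "Alzheimer's disease")
edges = [
    (alzheimers, ("Gene", "GENE1")),
    (alzheimers, ("Gene", "GENE2")),
    (("Compound", "compound-A"), alzheimers),
    (alzheimers, ("Symptom", "symptom-X")),
    (("Compound", "compound-A"), ("Gene", "GENE1")),  # does not touch the disease
]


def neighbor_counts(record, edges):
    """Count the records linked to `record`, grouped by their type."""
    counts = Counter()
    for left, right in edges:
        if left == record:
            counts[right[0]] += 1
        elif right == record:
            counts[left[0]] += 1
    return counts


summary = neighbor_counts(alzheimers, edges)
print(dict(summary))
```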
Organizing data as a graph provides intuitive insight into the structure of the data, but the benefits of a graph database become most apparent when we start to interrogate the data. For example, it is easy to find all drug compounds that are used to palliate migraine. The database returns all records of type “Compound” that are connected to a record of type “Disease” whose “name” property has the value “migraine”.
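The logic of such a query can be sketched in a few lines of plain Python. This is not the platform's query interface; `find_sources`, the record ids and the compound names are hypothetical, and a real graph database would express the same pattern in its own query language.

```python
# Hypothetical records keyed by id; only "migraine" comes from the example above
records = {
    "c1": {"type": "Compound", "name": "compound-A"},
    "c2": {"type": "Compound", "name": "compound-B"},
    "d1": {"type": "Disease", "name": "migraine"},
    "d2": {"type": "Disease", "name": "disease-X"},
}

# Directed edges as (source id, relationship type, target id)
edges = [
    ("c1", "palliates", "d1"),
    ("c2", "treats", "d1"),
    ("c2", "palliates", "d2"),
]


def find_sources(records, edges, source_type, relation, target_type, **target_props):
    """Names of `source_type` records linked via `relation` to a matching target."""
    matches = []
    for source_id, kind, target_id in edges:
        source, target = records[source_id], records[target_id]
        if (
            kind == relation
            and source["type"] == source_type
            and target["type"] == target_type
            and all(target.get(key) == value for key, value in target_props.items())
        ):
            matches.append(source["name"])
    return matches


palliators = find_sources(
    records, edges, "Compound", "palliates", "Disease", name="migraine"
)
print(palliators)
```

The query is a pattern over types, a relationship and a property value, so swapping in a different disease name or relationship type needs no change to how the data is stored.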
These examples provide a glimpse into the possibilities of graph-based data management. The graph database provides rich capabilities for constructing advanced queries, which are increasingly available through Blackfynn. As the platform matures, the functionality and tools for gaining insight into the data will continue to expand, enabling users with different backgrounds and levels of expertise to leverage the significant benefits of an enterprise-ready, HIPAA-compliant, graph-based data management solution.
To learn more about Blackfynn, visit blackfynn.com.
¹ Himmelstein DS, Baranzini SE. Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes. PLOS Computational Biology. 2015. doi:10.1371/journal.pcbi.1004259