Using a Knowledge Graph for Precision Medicine

Syed Irtaza Raza
Vaticle
Published Feb 21, 2019 · 10 min read

One of the biggest challenges in our current state of medicine is to provide relevant, personalised, and precise diagnoses and treatments. Rather than treating all patients in the same way, the goal is to fully take into account a person’s demographics and genetic profile while treating or diagnosing them.

In a nutshell, the problem is that a large number of drugs and treatments prescribed to patients do not treat the individual patient, but rather the generic disease. This, in addition to the fact that not all treatments affect every patient in the same way, is something doctors are well aware of. Yet for decades, the strategy of trial and error has still been used to a large extent to treat and diagnose patients. Not the most reassuring of thoughts.

However, there is hope. Due to recent advancements in bioinformatics and genomic sequencing, precision medicine (also known as personalised medicine) is slowly becoming a reality rather than a dream.

Precision medicine is a nascent field focused on disease prevention and treatment while taking into account an individual’s variability in genes, environment, and lifestyle. It aims to integrate vast data sets to build predictive and preventive models of complex diseases. Precision medicine may simply mean prescribing the right drug to the right patient, at the right dose and at the right time. Or, it may entail matching patients to the therapeutic treatments that are relevant to their individual biological make-up.

The benefits are significant: precision medicine is getting us much closer to, for example, curing cancer, and to predicting and preventing chronic diseases. In doing so, it may also save the health industry trillions of dollars.

Though promising, there are still many challenges facing precision medicine. Some of these are related to the handling of the data which is needed to personalise medicine. In this article, I’ll look at these problems and propose how they can be solved using a TypeDB knowledge graph.

What Are Some Challenges in Precision Medicine?

When approaching precision medicine, we are faced with a myriad of challenges — from obtaining correct and relevant data sources, all the way to validating our treatments and diagnoses. With this in mind, I have been particularly interested in how precision medicine data should be handled, and observed the following challenges:

1. Integrating heterogeneous biological data is difficult
The first challenge starts with the raw data. This data, which pertains to biological concepts and relationships, is scattered all over the world and is produced at an unprecedented rate, by a multitude of institutions, in various formats and standards. This is also known as the big data disruption paradigm, where high throughput data pipelines are creating bottlenecks to the analysis and processing of that data. It is extremely difficult to ingest and integrate these multi-format and disparate data sets.

2. Normalising raw complex biological data is difficult
The second challenge stems from the fact that the raw data contained in these data sets has no structure. This lack of structure makes it difficult to maintain and assure the integrity, accuracy and consistency of this data. It also causes a lack of control over the validity of data when integrating such heterogeneous data sources.

3. Discovering new and valuable insights is difficult
Finally, due to the heterogeneous nature of the biological data necessary for precision medicine, it becomes extremely tedious to generate or investigate insights in a scalable way. Of course, one valuable insight might be discovered manually for a single instance of a disease or patient, but such an approach is impossible to scale across thousands of patients. Moreover, in many cases, a manual approach may actually be impossible. What do we do then?

What Solutions Address These Challenges?

With this in mind, we need to find potential solutions to address these challenges. Based on my research, I suggest the following:

1. Integrating heterogeneous biological data into one single database
To solve the disparateness of heterogeneous data, we need a method to easily accumulate patient profiles and biological data into one collection. In other words, we need a knowledge graph.

2. Normalising biological data using a contextual structure
To enable the intelligent analysis and integration of such data — while maintaining data integrity — we need to impose an explicit structure on the concepts contained in that data. This will not only help to contextualise the concepts themselves, but also the relationships between them. This translates to having some sort of high-level data model to encompass the various types of data and consolidate their presence in the knowledge graph. This will also allow us to validate the data at the time of ingestion.

3. Discovering new insights using automated reasoning
In order to extract or infer as much information as possible from our knowledge graph, we need some sort of automated reasoning tool to propagate our domain expertise throughout the entirety of the data. This will enable us to ask questions of our knowledge graph and get the right answers — while other traditional methods would fail.

Having identified a template for solving the challenges listed above, I wondered whether any one technology out there encompassed all three points.

Well, to my luck, TypeDB solves all of these.

If you’re unfamiliar with this technology, TypeDB is an intelligent database in the form of a knowledge graph which organises complex networks of data. It contains a knowledge representation system based on hyper-graphs, enabling the modelling of every complex biological relationship. This knowledge representation system is then interpreted by an automated reasoning engine, which performs reasoning in real-time. The software gets exposed to the user in the form of a flexible and easily understood query language — TypeQL.

Building a Precision Medicine Knowledge Graph

But how do we actually go about building a precision medicine knowledge graph using TypeDB?

Identifying the Right Data
Every data driven system needs to start with the data itself. As such, the first step is to identify the right types of data sources we need in order to go about personalising medicine.

There are two types of data we need. First, the data related to the personalised aspect, which may include the demographic and medical profile of the patients we plan on examining. These may include:

  1. Genes
  2. Variants
  3. Medical history
  4. Age
  5. Gender
  6. Ethnicity
  7. Electronic health records

In order to investigate how the patient’s data correlates with medicine, we also need other types of biomedical data:

  1. Diseases
  2. Drugs
  3. Clinical trials
  4. Medical literature

Once we have identified the raw data we want for our application, we need to find reliable sources to retrieve the data from. The following lists some of the sources where one can download raw data pertaining to precision medicine:

  1. NCBI
  2. DisGeNET
  3. OncoKB
  4. DrugBank
  5. PubMed
  6. ClinicalTrials.gov
  7. ClinGen
  8. PharmGKB
  9. ClinVar
  10. Drugs@FDA

Normalising Raw Data
Now that we have raw data, we need to tackle the second problem — that of normalisation. To this end, TypeDB utilises the entity-relationship model to group each concept into either an entity, attribute, or relationship. This means that all we have to do is map each concept to a schema concept type, and recognise the relationships between them. Let us look at an example to demonstrate how we would go about doing this:

First, let us assume we have the following two raw data sets representing instances of people in one data set, and diseases in another:

In the “People” dataset, we can see that the fourth column refers to “disease”, giving us a diagnosis about the diseases a person has. In other words, the raw data shows a relationship between a person, who is linked with a disease through a diagnosis:

This structure can be represented using TypeQL as follows:

link to code

We recognised person and disease as entities, each having certain attributes. Then, we defined a relationship between the disease and the person, which we called diagnosis, where the role-players in the relationship are the patient and the diagnosed-disease.

In order to constrain the attributes of each concept we also need to denote what data type they adhere to. In this case, we defined four attributes: one attribute with datatype double, and three of string type.
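As a rough sketch of the schema just described, the snippet below holds a TypeQL definition in a Python string and lists the types it declares. The syntax follows TypeDB 2.x (`owns`, `value`), which is an assumption, and so are the attribute labels `name`, `gender`, `disease-type` and `age`. Only `person`, `disease`, `diagnosis`, `patient` and `diagnosed-disease` come from the description above.

```python
import re

# Hypothetical schema in TypeDB 2.x syntax (assumed). The attribute labels are
# illustrative guesses: one double (age) and three strings.
SCHEMA = """
define

name sub attribute, value string;
gender sub attribute, value string;
disease-type sub attribute, value string;
age sub attribute, value double;

person sub entity,
    owns name,
    owns gender,
    owns age,
    plays diagnosis:patient;

disease sub entity,
    owns name,
    owns disease-type,
    plays diagnosis:diagnosed-disease;

diagnosis sub relation,
    relates patient,
    relates diagnosed-disease;
"""


def declared_types(schema: str) -> dict:
    """Map every type label declared with `sub` to its parent type."""
    pairs = re.findall(r"^\s*([\w-]+)\s+sub\s+(\w+)", schema, re.MULTILINE)
    return dict(pairs)


if __name__ == "__main__":
    for label, parent in declared_types(SCHEMA).items():
        print(f"{label}: {parent}")
```

Listing the declared types this way is only a quick sanity check that the schema text contains the two entities, the one relationship and the four attributes. TypeDB itself performs the real validation when the schema is loaded.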

Migrating the Raw Data into TypeDB
Now that we have the data, and a structure imposed on this data, the next step is to migrate this into TypeDB. Please note there are many different ways to do migration, but here I would like to specifically touch on how we would go about using Java, NodeJS and Python.

For this, we can easily use any of these languages to read/parse the raw data file and iterate over each entry within those files. The image below depicts how to insert a single instance of a disease with a name and type into TypeDB using any of these three languages:
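A minimal Python sketch of that read-and-iterate step is shown here. It assumes a CSV file with `name` and `type` columns and a `disease-type` attribute label, both hypothetical, and it only builds the TypeQL insert strings. The escaping rules in `quote` are likewise an assumption.

```python
import csv
import io


def quote(value: str) -> str:
    """Escape backslashes and double quotes (assumed TypeQL string rules)."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def disease_insert_query(name: str, disease_type: str) -> str:
    """Build a TypeQL insert statement for a single disease."""
    return (
        f'insert $d isa disease, has name "{quote(name)}", '
        f'has disease-type "{quote(disease_type)}";'
    )


def queries_from_csv(text: str) -> list:
    """Turn every row of a CSV with `name` and `type` columns into a query."""
    reader = csv.DictReader(io.StringIO(text))
    return [disease_insert_query(row["name"], row["type"]) for row in reader]


if __name__ == "__main__":
    sample = "name,type\nmelanoma,cancer\n"  # made-up row for illustration
    for query in queries_from_csv(sample):
        print(query)
```

Each generated string would then be executed inside a write transaction of a TypeDB client and committed. Those client calls are left out because they differ between client versions.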

To learn more about migrating data into TypeDB, make sure to read this article.

Discovering a New Relevant Therapy for a Patient
After migration, we can get started on discovering new insights. Discovering insights refers to finding new data that may be valuable to what we are trying to accomplish. In order to do that, we need to first look or ask for something. In other words, we start with a question of the kind that precision medicine sets out to answer. These questions can range from whether a particular individual’s profile is susceptible to any disease to which treatments are relevant to an individual’s biological make-up.

Let us look at an example, and see how our precision medicine knowledge graph may provide an answer to it:

Question: Given a person suffering from melanoma, what clinical trial could be recommended to her/him?

Answer:

The answer that TypeDB returns is a relationship called personalised-therapy, which connects a person and a clinical-trial. The clinical trial returned carries the information about the trial that we ingested into the graph, including its title (GSK1120212 vs Chemotherapy in Advanced or Metastatic BRAF V600E/K Mutation-positive Melanoma), intervention type (drug), and a URL to the trial page.
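To make the question concrete, here is a hedged sketch of how such a query might be phrased. It assumes TypeDB 2.x `match ... get` syntax and a `name` attribute on disease. The role names of personalised-therapy are not given in this article, so the role players are listed without roles.

```python
def therapy_query(disease_name: str) -> str:
    """Build a TypeQL query for trials recommended to patients with a disease."""
    return (
        "match\n"
        f'  $d isa disease, has name "{disease_name}";\n'
        "  $p isa person;\n"
        "  (patient: $p, diagnosed-disease: $d) isa diagnosis;\n"
        "  $t isa clinical-trial;\n"
        # Role names are unknown here, so the role players are given bare.
        "  ($p, $t) isa personalised-therapy;\n"
        "get $p, $t;"
    )


if __name__ == "__main__":
    print(therapy_query("melanoma"))
```

The query asks for every person diagnosed with the named disease together with any clinical trial linked to that person through a personalised-therapy relationship.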

The response to this query can also be visualised in TypeDB Studio (Knowledge graph IDE):

Even though TypeDB gives us a correct answer to our question, this data was never actually ingested into TypeDB: no connections exist between persons and clinical trials. So, how did we get this relevant answer?

In short, TypeDB’s automated reasoner created this answer for us through automated reasoning. As this type of reasoning is fully explainable, we can interpret any inferred concept to understand how it was inferred or created. Below you can see what this explanation looks like in Studio. In the next section, I will dive deeper into how we created the logic and rules that allowed TypeDB to infer these relationships.

Forming Rules to Propagate Reasoning Over the Graph

Rule 1: personalised-patient-therapy. Creates the personalised therapy relationship.

The inferred relationship between a person and a clinical-trial was called a personalised-therapy, as is shown in rule 1: personalised-patient-therapy. This gets created when two other conditions (represented by relationships) are met:

Rule 2: trial-participant-eligibility. The person is eligible for the clinical trial.

First, the person must be eligible to take part in the clinical-trial (eligible-trial-participant). This is evaluated by another rule (rule 2: trial-participant-eligibility), which takes into consideration the patient’s age, gender and diagnosis, and compares them with the clinical-trial.

Rule 3: trial-participant-relevance. The person is relevant to the clinical trial.

Second, the person must be relevant to the trial. This is evaluated by rule 3, trial-participant-relevance, which checks that the title of the clinical trial contains the gene and variant symbols belonging to the person. This would indicate relevance between the individual person and the clinical-trial.
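The logic of the three rules can be mimicked in plain Python. This is not how TypeDB rules are written or evaluated; it is only a sketch of the conditions. The `Person` and `ClinicalTrial` fields, the age bounds, and the `"all"` gender value are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    age: int
    gender: str
    diagnosis: str
    gene: str
    variant: str


@dataclass(frozen=True)
class ClinicalTrial:
    title: str
    condition: str
    min_age: int
    max_age: int
    gender: str  # "all", or one specific gender (hypothetical encoding)


def is_eligible(person: Person, trial: ClinicalTrial) -> bool:
    """Rule 2: age, gender and diagnosis must fit the trial."""
    return (
        trial.min_age <= person.age <= trial.max_age
        and trial.gender in ("all", person.gender)
        and person.diagnosis == trial.condition
    )


def is_relevant(person: Person, trial: ClinicalTrial) -> bool:
    """Rule 3: the trial title mentions the person's gene and variant."""
    return person.gene in trial.title and person.variant in trial.title


def personalised_therapies(people: list, trials: list) -> list:
    """Rule 1: pair a person with a trial when rules 2 and 3 both hold."""
    return [
        (person, trial)
        for person in people
        for trial in trials
        if is_eligible(person, trial) and is_relevant(person, trial)
    ]
```

With the trial title quoted earlier, a melanoma patient carrying BRAF V600E would be paired with the trial, provided the made-up eligibility fields also match. A patient with a different gene and variant would be eligible but not relevant, and so would not be paired.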

It should be noted that the rules above are a demonstration of how automated reasoning can be used to create a high-level abstraction over complex insights which scale through the data. They by no means compare to the full knowledge of how a biologist would go about achieving such an insight.

How Do All the Pieces Fit Together in One Architecture?
Now, let us take a step back, and look at how all the components of building a precision medicine knowledge graph piece together.

We start with the data which can come from multiple sources and in various formats. That raw data is used to create a schema (high level data model) to enforce a structure on the raw data. Once that is done, we use one of TypeDB’s clients to migrate the instances of data into TypeDB, making sure every insertion adheres to the schema. TypeDB stores this in its knowledge representation system, which can be queried for insights to discover complex relationships or even test out hypotheses. These insights may already be in the graph, or can even be created at the time of query by the reasoning engine; for example, discovering personalised diagnoses and therapies.

Conclusion

So we know that precision medicine is extremely promising in revolutionising global health care. We understand that there are barriers and bottlenecks associated with achieving it. I hope to have shown that TypeDB can help to bring us many steps closer to having precision medicine as the standard for health care. In summary, TypeDB helps to solve the three key challenges in precision medicine when it comes to handling data: integrating heterogeneous biological data into a single database, normalising that data using a contextual structure, and discovering new insights using automated reasoning.

If you have any questions, comments or would like to collaborate, please shoot us an email at community@vaticle.com. You can also talk to us and discuss your ideas with the TypeDB community.

Grakn Labs was rebranded in 2021 to Vaticle; Grakn is now TypeDB, and Graql is now TypeQL.
