Drug Discovery Knowledge Graphs

Syed Irtaza Raza · Published in Vaticle · Jun 7, 2019

Combinatorial chemistry has produced a huge number of chemical libraries and data banks containing prospective drugs. Despite all of this progress, the fundamental problem remains: how do we take advantage of this data to identify which compounds are viable drug candidates? Traditional methodologies fail to provide a solution.

Knowledge graphs, however, provide a framework which can make drug discovery far more efficient, effective, and approachable. This technology can model biological knowledge with the complexity it has at its core. With concepts such as hyper-relationships, type hierarchies, automated reasoning, and analytics, we can finally model, represent, and query biological knowledge at an unprecedented scale.

But before we delve into the methodology of creating a drug discovery knowledge graph, let us look at what methodologies exist today, and why new techniques are required.

Drug Discovery and its Shortcomings

Drug discovery, as the name suggests, is the process by which new medications are discovered. It involves a wide range of scientific disciplines, including biology, chemistry, and pharmacology.

Historically, drugs were discovered either serendipitously, or through identifying the active ingredient in traditional remedies. More recently, chemical libraries of synthetic small molecules, natural products, or extracts were screened in intact cells or whole organisms to identify substances which have a desirable therapeutic effect in a process known as classical/forward pharmacology or phenotypic drug discovery.

The sequencing of the human genome changed things drastically, as it allowed the rapid cloning and synthesis of large quantities of purified proteins. It has since become common practice to use high-throughput screening of large compound libraries against isolated biological targets which are hypothesised to be disease-modifying, in a process known as reverse pharmacology, or target-based drug discovery. Hits from these screens are then tested in cells, and then in animals, for efficacy.

While the aforementioned processes may seem systematically effective, they are neither as efficient nor as economical as desired. Discovering a single drug takes, on average, $2 billion and almost 14 years. Furthermore, drug discovery carries a failure rate of over 90%.

With billions of dollars being spent on finding new drugs, and very low success rates, it has become essential to develop newer and more intelligent ways of doing so. To illustrate this need, the diagram above depicts the trend in the return on pharmaceutical investments.

What are the Challenges in Data-Driven Drug Discovery?

While studying the various approaches that leverage data in drug discovery, I found several barriers to finding the insights that would accelerate and assist the industry. These can be summarised in the following three points:

  1. Integration: Difficulty in ingesting and integrating complex networks of biological data
    The first challenge starts with the raw data. This data, which pertains to biological concepts and relationships, is scattered all over the world and is produced at an unprecedented rate by a multitude of institutions in various formats and standards. This is also known as the big data disruption paradigm, where high throughput data pipelines are creating bottlenecks to the analysis and processing of that data. It is extremely difficult to ingest and integrate these multi-format and disparate data sets.
  2. Normalisation: Difficulty in contextualising relations within biomedical data
    The second challenge stems from the fact that the raw data contained in these data sets have no structure. This lack of structure makes it difficult to maintain and assure integrity, accuracy, and consistency over the data. It also causes a lack of control over the validity of the data when integrating such heterogeneous data sources. Taken together, this makes it hard to contextualise or understand the relationships contained within the data.
  3. Discovery: Difficulty in investigating insights over a magnitude of data in a scalable way
    Finally, due to the magnitude of the data, it becomes extremely tedious to generate or investigate insights in a scalable way. Valuable insights can, of course, be discovered manually for single data instances, but such an approach cannot scale across millions of data points. In most cases, doing this manually is simply not practical. What do we do then?

What Solutions can Address These Challenges?

With these problems in mind, we need potential solutions that address those challenges. Based on my research, the below is what I suggest:

  1. Integration: a flexible, unifying schema under which disparate, multi-format data sources can be ingested and integrated
  2. Normalisation: a model that gives structure to the raw data, so that the relationships within it can be contextualised and its integrity maintained
  3. Discovery: automated reasoning and scalable analytics that surface insights across the entire body of data

Having identified a template for the solutions to the previously listed challenges, I wondered whether there was any one technology out there that encompassed all three points.

Well, to my luck, TypeDB solves all of these.

If you’re unfamiliar with this technology, TypeDB is an intelligent database in the form of a knowledge graph which organises complex networks of data. It contains a knowledge representation system based on hyper-graphs, enabling the modelling of even the most complex biological relationships. This knowledge representation system is interpreted by an automated reasoning engine, which performs reasoning in real time. The software is exposed to the user through a flexible and easily understood query language: TypeQL.

Building a Drug Discovery Knowledge Graph

But how do we actually go about building a drug discovery knowledge graph using TypeDB?

Identifying the Right Data

Studying the various types and instances of data, I found it beneficial to leverage and navigate the complex relationships between compounds, genes, diseases, and proteins.

To augment the above data types, we can also incorporate other kinds of data which may enrich our knowledge graph and provide more powerful insights. This may include biological pathway data or even text-mined medical literature (which can be used to connect relevant research studies and give confidence to our insights).

Once we have the raw data we want for our application, we need to find reliable sources where we can retrieve this data from. The following list exhibits some of the sources that may be used to download the raw data pertaining to drug discovery:

  1. Gene Ontology
  2. NCBI
  3. ClinVar
  4. CTD
  5. DisGeNET
  6. Gene Expression Omnibus
  7. IntAct
  8. PubMed

Normalising Raw Data
Now that we have raw data, we need to tackle the second problem: that of normalisation. To this end, TypeDB utilises the entity-relationship model to group each concept into either an entity, attribute, or relationship. This means that all we have to do is map each concept to a schema concept type, and recognise the relationships between them. Let us look at an example to demonstrate how we would go about doing this.

First, let us assume we have the following two raw data sets representing instances of genes in one data set, and diseases in another:

If we were to visualise the above two concepts and the relations between them, it would look like the following:

This structure can be represented using TypeQL as follows:
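A minimal sketch of this schema, assuming TypeDB 2.x schema syntax:

```
define

gene sub entity,
    owns symbol,
    plays gene-disease-association:associated-gene;

disease sub entity,
    owns name,
    plays gene-disease-association:diagnosed-disease;

gene-disease-association sub relation,
    relates associated-gene,
    relates diagnosed-disease;

symbol sub attribute, value string;
name sub attribute, value string;
```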

We recognised a gene and a disease as entities, having a symbol and a name respectively. Then, we defined a relationship between the gene and the disease which we called gene-disease-association, where the role-players in the relationship are the associated-gene and the diagnosed-disease.

In order to constrain the attributes of each concept, we also need to denote which data type they adhere to. In this case, we defined them as string types.

Migrating the Raw Data into TypeDB
Now that we have the data from our sources, as well as a structure imposed on this data, the next step is to migrate it into TypeDB. While there are many different ways to perform migration, I will specifically touch on how we would go about it using Java, NodeJS, and Python.

For this, we can easily use any of these languages to read/parse the raw data file and iterate over each entry within those files. The image below depicts how to insert a single instance of a gene with a name and symbol into TypeDB, using any of these three languages:
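Whichever client language we use, the client opens a write transaction and submits a TypeQL insert statement. As a rough sketch (assuming the schema above; the symbol "BRAF" is only an illustrative value), the query each client would run looks like this:

```
# "BRAF" is a hypothetical example value for the symbol attribute
insert $g isa gene, has symbol "BRAF";
```

The client then commits the transaction; only the surrounding session and transaction handling differs slightly between the Java, NodeJS, and Python clients.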

You can learn more about migrating data into TypeDB in this article.

Discover Insights that Scale

After migration, we can start to discover new insights. Discovering insights refers to finding new data that may be valuable to what we are trying to accomplish. In order to do that, we first need to ask for something. In other words, we start with a question: the kind of question we want answered in drug discovery.

Let us look at an example, and see how our drug discovery knowledge graph may provide an answer:

Question: What are the potential targets of the disease Melanoma?
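As a sketch, this question might be phrased in TypeQL as follows (assuming a protein entity type and the protein-target-identification relation described later in this section; the exact names are assumptions):

```
match
    $d isa disease, has name "Melanoma";
    $p isa protein;
    ($d, $p) isa protein-target-identification;
get $p;
```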

Answer: The answer returned is a list of proteins which are potential targets. For proteins to be identified as viable targets, they must be related to the disease in some way. Let us say that, in this case, they were part of a protein-protein interaction pathway:

The question that then arises is — how did TypeDB recognise the proteins that were not directly related to the disease?

That comes down to the power of automated reasoning inherent in TypeDB, in the form of rules, which infer that all proteins falling in a pathway leading to a disease can be identified as drug targets. Two rules encoded in the knowledge graph enabled this.
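A sketch of the first rule, assuming TypeDB 2.x rule syntax and a hypothetical interacting-protein role on the protein-protein-association relation:

```
define

rule transitive-protein-association:
    when {
        # two associations that chain through a shared protein
        ($p1, $p2) isa protein-protein-association;
        ($p2, $p3) isa protein-protein-association;
    } then {
        # infer a direct association between the two ends of the chain
        (interacting-protein: $p1, interacting-protein: $p3) isa protein-protein-association;
    };
```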

First, we used a rule that forms protein-protein-association relations between proteins that are transitively connected to one another. The TypeQL sketch above illustrates this rule, while the following diagram displays how it works:

The dotted relation on the top is the inferred protein-protein-association relation that was created by TypeDB. This rule applies to all proteins that are in a pathway and linked to each other.

The second rule follows a similar pattern to form the protein-target-identification relation between the disease under observation and the potential target proteins.
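A sketch of this second rule, with hypothetical role names targeted-disease and identified-target:

```
define

rule indirect-protein-target:
    when {
        # a protein already identified as a direct target of the disease
        (targeted-disease: $d, identified-target: $p1) isa protein-target-identification;
        # another protein associated with that direct target
        ($p1, $p2) isa protein-protein-association;
    } then {
        # the associated protein becomes a potential target too
        (targeted-disease: $d, identified-target: $p2) isa protein-target-identification;
    };
```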

The sketch above captures this rule: we take the protein which is directly related to the disease, along with the proteins which are related to that directly related protein. A visual representation of this rule is as follows:

The dotted relations on the top are the inferred protein-protein-association relations which were created due to the logic of the first rule, while the ones on the bottom are the inferred protein-target-identification relations created by the second rule.

As much as we may want it to be, biology is not as simple as assumed above. Pathways are not linear in nature, but reveal much more complex sets of relations. Proteins that take part in disease pathways are also involved in a variety of normal cell functions. A slightly more realistic model would look as follows:

Question: Given this model, what are the potential targets for the disease Melanoma that do not take part in any normal cell function pathways?

This question can be translated into TypeQL with the help of negation, which enables us to ask for concepts that do not adhere to certain statements; in this case, it enables us to retrieve the proteins that are not related to any normal cell function.
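A sketch of such a query, assuming a hypothetical cell-function entity type and cell-function-participation relation to represent the normal cell function pathways:

```
match
    $d isa disease, has name "Melanoma";
    $p isa protein;
    ($d, $p) isa protein-target-identification;
    not {
        # exclude proteins taking part in any normal cell function
        $f isa cell-function;
        ($p, $f) isa cell-function-participation;
    };
get $p;
```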

Answer: What is returned is the protein (or proteins) that does not take part in any normal cell function.

But what if all proteins are responsible for some cell function? Can we get some sort of measure to guide us towards an insight that may help our research?

Question: What are the potential targets for the disease Melanoma and what are the number of occurrences of those proteins in normal cell functions?
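One way to sketch this is with TypeQL’s grouped aggregation, reusing the hypothetical cell-function-participation relation from the previous query:

```
# count, for each target protein, its occurrences in normal cell functions
match
    $d isa disease, has name "Melanoma";
    $p isa protein;
    ($d, $p) isa protein-target-identification;
    ($p, $f) isa cell-function-participation;
get $p, $f;
group $p; count;
```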

Distributed analytics is a set of scalable algorithms that allows you to perform computation over large amounts of data in a distributed fashion. The TypeQL query above is an example of the kind of query that helps us answer this question.

Answer:

Here, we get back exactly what we asked for, allowing us to shortlist the target proteins that we want to prioritise in our research, and expediting our search for viable compounds capable of treating a disease with minimal effects on normal physiology.

How do all the Pieces Fit Together in one Architecture?
Now, let us take a step back and look at how all of the components of building a drug discovery knowledge graph fit together.

We start with the data, which can come from multiple sources and in various formats. From this raw data we create a schema (a high-level data model) that enforces structure on it. Once that is done, we use one of TypeDB’s clients to migrate the instances of data into TypeDB, making sure every insertion adheres to the schema. TypeDB stores this in its knowledge representation system, which can then be queried to discover complex relationships or even test hypotheses. These insights may already be in the graph, or they can be created at query time by the reasoning engine.

Conclusion

So, we know that data-driven drug discovery holds great promise for revolutionising global health care. We understand that there are barriers and bottlenecks to achieving it. I hope to have shown that TypeDB can bring us many steps closer to efficiently, effectively, and economically discovering drugs that can treat and cure patients with diseases for which medications are still lacking. In summary, TypeDB helps to solve the three key challenges in drug discovery when it comes to handling data:

  1. Integration: its schema and clients make it possible to ingest and integrate complex, multi-format networks of biological data
  2. Normalisation: its entity-relationship model gives structure and context to the relations within biomedical data
  3. Discovery: its automated reasoning and analytics make it possible to investigate insights over a magnitude of data in a scalable way

You can check out our open source BioGrakn GitHub repo to explore TypeDB and its capabilities in the biology space.

If you have any questions, comments or would like to collaborate, please shoot me an email at community@vaticle.com. You can also talk to us and discuss your ideas with the TypeDB community.

Grakn Labs was rebranded in 2021 to Vaticle; Grakn is now TypeDB, and Graql is now TypeQL.
