Enable Intelligent Query with Biological NLP and Knowledge Graphs
Biology is a domain in which a huge amount of information is encoded in written form. We demonstrate the automatic construction of a knowledge graph from scientific text, so that we can query scientific content according to the biological relationships it conveys. This article provides a minimal technical demonstration of a model which, applied at scale, could increase the speed and clarity with which we bring existing information to bear on biological questions. We use the Luscient API to extract mechanistic relationships from biomedical text, and Grakn to store and reveal the connections that emerge from these relationships.
You can reproduce the steps in this article with instructions and code from the accompanying repository.
Text to Graph
Mechanistic Information from Text
This section demonstrates the use of our API to extract mechanistic biological relationships from text.
We will take the following three sentences for this example (truncated in parts for simplicity):
You can use a text substrate of your choice by following the instructions here.
Parts and Interactions
Our system produces the following results from the above text. The components are annotated with named entities (purple) and directional terms (arrows; blue). We employ these annotations in the following step.
We use the valenced terms (e.g., ‘induction’, ‘increased’, ‘reduces’) and named entities in the above structures, along with the voice of the sentence, to cast these relationships as assertions that link changes in the ‘drive’ of biological concepts. We thereby generate the following representation:
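As a rough illustration of this casting step (and not the Luscient implementation), the sketch below maps a handful of valenced terms to a direction and combines them with entities that have already been identified. The `TERM_VALENCE` table and the `cast_assertion` helper are hypothetical, and voice handling is reduced to a single flag that swaps the roles of subject and object.

```python
# Hypothetical table of valenced terms. A real system would need far more
# coverage and context sensitivity than a flat lookup.
TERM_VALENCE = {
    "induces": "UP",
    "induced": "UP",
    "increased": "UP",
    "reduces": "DOWN",
    "reduced": "DOWN",
}


def cast_assertion(subject, term, obj, passive=False):
    """Return ((agent, 'UP'), (target, valence)) for one extracted relation.

    Simplifying assumption: the antecedent is always an increase in the
    agent. In the passive voice the grammatical subject is the target,
    so subject and object swap roles.
    """
    agent, target = (obj, subject) if passive else (subject, obj)
    return (agent, "UP"), (target, TERM_VALENCE[term])


# "B. fragilis toxin induces spermine oxidase" (active voice)
active = cast_assertion("B. fragilis toxin", "induces", "spermine oxidase")

# "spermine oxidase is induced by B. fragilis toxin" (passive voice)
passive = cast_assertion(
    "spermine oxidase", "induced", "B. fragilis toxin", passive=True
)
```

Both calls yield the same assertion: an upward change in B. fragilis toxin triggers an upward change in spermine oxidase.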
Implementing the Graph
We now turn to Grakn to store and query this information.
The schema for this example is visualised below:
- ‘driven-concept’ represents the upward or downward change in the ‘drive’ of a biological concept. The name of the biological concept and the direction of the drive change (i.e., ‘UP’ or ‘DOWN’) are given by the attributes ‘name’ and ‘valence’, respectively.
- ‘triggering-relationship’ represents a functional relationship between two such drive change events, where a manipulation in one biological entity or concept triggers another. The provenance of each relationship is given by the attributes ‘source-text’ (the text from which it was extracted), ‘source-name’ (e.g., PubMed Central), and ‘source-id’ (e.g., PMC3174648).
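For readers who think more readily in code than in schema diagrams, the two types above can be mirrored in plain Python. This is only an analogue of the data model, not Graql and not the schema file itself. The role names `antecedent` and `consequent` are assumptions, and the field values in the usage example are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrivenConcept:
    name: str     # biological concept, e.g. "spermine oxidase"
    valence: str  # direction of the drive change: "UP" or "DOWN"

    def __post_init__(self):
        if self.valence not in ("UP", "DOWN"):
            raise ValueError(f"valence must be 'UP' or 'DOWN', got {self.valence!r}")


@dataclass(frozen=True)
class TriggeringRelationship:
    antecedent: DrivenConcept  # the change that does the triggering (assumed role name)
    consequent: DrivenConcept  # the change that is triggered (assumed role name)
    source_text: str           # text the relationship was extracted from
    source_name: str           # e.g. "PubMed Central"
    source_id: str             # e.g. "PMC3174648"


# Placeholder provenance values, for illustration only.
rel = TriggeringRelationship(
    antecedent=DrivenConcept("B. fragilis toxin", "UP"),
    consequent=DrivenConcept("spermine oxidase", "UP"),
    source_text="<sentence the relationship was extracted from>",
    source_name="PubMed Central",
    source_id="PMC3174648",
)
```

Keeping provenance on the relationship rather than on the concept means the same concept can be supported by assertions from many different sources.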
Loading the Information
We now instantiate the network of relationships within Grakn:
(The script used to insert the data is found here.)
Let’s inspect one of our relationships — the relationship between B. fragilis and spermine oxidase for instance. (From here on we’ll use Grakn’s query language, Graql, to interact with the graph.)
Inferring New Relationships
In our above sample text, we read that:
- B. fragilis toxin induces spermine oxidase.
- Spermine oxidase itself leads to several effects (i.e., ↑ reactive oxygen species, ↑ DNA damage, ↑ cancer).
We therefore want our system to be able to tell us that increasing the drive of B. fragilis toxin might, by extension, also trigger these effects.
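The deduction we are after is transitive: if A triggers B and B triggers C, then A may trigger C. A minimal plain-Python sketch of that idea follows. It is independent of Grakn and says nothing about how Grakn evaluates rules; the `infer_triggers` helper is hypothetical, and the pairs are taken from the sample assertions above.

```python
def infer_triggers(explicit):
    """Close a set of (antecedent, consequent) pairs under transitivity.

    Repeatedly applies "if A triggers B and B triggers C, then A triggers C"
    until no new pair appears. Returns only the inferred pairs.
    """
    known = set(explicit)
    changed = True
    while changed:
        changed = False
        for a, b in list(known):
            for c, d in list(known):
                if b == c and a != d and (a, d) not in known:
                    known.add((a, d))
                    changed = True
    return known - set(explicit)


explicit = {
    (("B. fragilis", "UP"), ("spermine oxidase", "UP")),
    (("spermine oxidase", "UP"), ("reactive oxygen species", "UP")),
    (("spermine oxidase", "UP"), ("DNA damage", "UP")),
    (("spermine oxidase", "UP"), ("cancer", "UP")),
}
inferred = infer_triggers(explicit)
```

Here `inferred` contains three new pairs, linking an increase in B. fragilis to increases in reactive oxygen species, DNA damage, and cancer.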
To allow Grakn to carry out this form of deduction, we define the rule:
Our system will now be able to fill in the gaps. For example, it can draw the line between B. fragilis and DNA damage — a relationship that does not explicitly appear in the underlying data:
We can now use Grakn to ask questions that require traversing an arbitrary number of connections between concepts. We demonstrate three classes of question:
- What might be the consequences of increasing or decreasing the drive of X?
- What sequences of changes could bring about a given outcome Y?
- What sets of observations are consistent with, or might ‘explain’ observation Z?
Consequences of a Change
We specify an initial change (↑ B. fragilis) and get a list of potential consequent changes.
Parsing the response, we present the results in line with the supporting assertions and text from which they were derived. Each chain of reasoning starts with our specified input: ↑ B. fragilis.
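The same question can be pictured as a forward traversal over the assertions. The sketch below is a plain-Python analogue, not the Graql query or the response parser used here: it walks breadth-first from the input change and keeps one chain of reasoning per consequence. The `consequences` and `show` helpers are hypothetical.

```python
from collections import deque


def consequences(edges, start):
    """Map each change reachable from `start` to one chain leading to it."""
    successors = {}
    for antecedent, consequent in edges:
        successors.setdefault(antecedent, []).append(consequent)

    chains = {start: [start]}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in successors.get(current, []):
            if nxt not in chains:  # first visit gives a shortest chain
                chains[nxt] = chains[current] + [nxt]
                queue.append(nxt)
    del chains[start]  # the input itself is not a consequence
    return chains


def show(chain):
    """Render a chain as 'UP a -> UP b -> ...'."""
    return " -> ".join(f"{valence} {name}" for name, valence in chain)


edges = [
    (("B. fragilis", "UP"), ("spermine oxidase", "UP")),
    (("spermine oxidase", "UP"), ("reactive oxygen species", "UP")),
    (("spermine oxidase", "UP"), ("DNA damage", "UP")),
    (("spermine oxidase", "UP"), ("cancer", "UP")),
]
chains = consequences(edges, ("B. fragilis", "UP"))
print(show(chains[("cancer", "UP")]))
# prints: UP B. fragilis -> UP spermine oxidase -> UP cancer
```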
Paths to an Outcome
We can also traverse backwards through relationships, specifying a consequent change for which we want to find the potential antecedent changes. Here we look for sequences of changes that might lead to ↑ reactive oxygen species.
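Traversing backwards amounts to indexing the assertions by their consequent rather than their antecedent. A minimal sketch under that assumption is given below; the `paths_to` helper is hypothetical and stands in for the Graql query, using the sample assertions from earlier.

```python
def paths_to(edges, outcome):
    """List every acyclic chain of changes that ends at `outcome`."""
    predecessors = {}
    for antecedent, consequent in edges:
        predecessors.setdefault(consequent, []).append(antecedent)

    found = []

    def walk_back(chain):
        for prev in predecessors.get(chain[0], []):
            if prev not in chain:  # do not loop through cycles
                longer = [prev] + chain
                found.append(longer)
                walk_back(longer)

    walk_back([outcome])
    return found


edges = [
    (("B. fragilis", "UP"), ("spermine oxidase", "UP")),
    (("spermine oxidase", "UP"), ("reactive oxygen species", "UP")),
    (("spermine oxidase", "UP"), ("DNA damage", "UP")),
    (("spermine oxidase", "UP"), ("cancer", "UP")),
]
paths = paths_to(edges, ("reactive oxygen species", "UP"))
```

With these four assertions, `paths` holds two chains: one starting at ↑ spermine oxidase and a longer one starting at ↑ B. fragilis.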
Explain an Observation
Let’s suppose we’ve observed a positive correlation between B. fragilis and cancer. What sequences of changes do we have in our system that are consistent with this, or might provide a mechanistic ‘explanation’?
We specify both the antecedent and consequent change and let the system find the paths between the two.
The result shows that ↑ B. fragilis may be linked to ↑ cancer through an intermediary ↑ spermine oxidase.
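Fixing both ends turns the question into path finding between two changes. The plain-Python sketch below enumerates every acyclic path from the antecedent to the consequent. It is an analogue of the query rather than the query itself, and the `explain` helper is hypothetical.

```python
def explain(edges, antecedent, consequent):
    """Return all acyclic paths of changes from `antecedent` to `consequent`."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)

    paths = []

    def extend(path):
        if path[-1] == consequent:
            paths.append(path)
            return
        for nxt in successors.get(path[-1], []):
            if nxt not in path:  # keep paths free of repeated changes
                extend(path + [nxt])

    extend([antecedent])
    return paths


edges = [
    (("B. fragilis", "UP"), ("spermine oxidase", "UP")),
    (("spermine oxidase", "UP"), ("reactive oxygen species", "UP")),
    (("spermine oxidase", "UP"), ("DNA damage", "UP")),
    (("spermine oxidase", "UP"), ("cancer", "UP")),
]
explanations = explain(edges, ("B. fragilis", "UP"), ("cancer", "UP"))
```

On the sample assertions this returns a single path, running through ↑ spermine oxidase. An empty list would mean the graph holds no mechanistic chain consistent with the observation.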
We have presented an approach that combines information extraction techniques with graph database technology, enabling us to ask questions of scientific content at the level of the mechanistic relationships encoded within it. The approach can link together separate assertions, which may derive from many separate articles, in order to answer queries of biological consequence and explanation.
This is one example of how we can move beyond the current standard of keyword-based search towards a deeper type of search driven by an understanding of biological parts and interactions. We believe this shift will improve our ability to realise connections and see further in our quest to understand biology.
Author email: email@example.com
Author LinkedIn: https://www.linkedin.com/in/nick-morley-32181110b
Project website: www.luscient.io
Project email: firstname.lastname@example.org
Thanks to Beni Bienz and Tim Daly for their insightful contributions, Marco Scoppetta and Soroush Saffari for their brilliance, Tomas Sabat for his initiative and support, and to Paul Bradley and Gordon Baxter, who inspired this work.