Creating Clinical Knowledge Graph by Spark NLP & Neo4j
The first end-to-end clinical knowledge graph creation using Spark NLP and Neo4j.
In this article, we will build a Knowledge Graph (KG) using Spark NLP Relation Extraction (RE) models and Neo4j. Graph data representation has become pervasive over the last decade, since its momentous feature is capturing connected relationships based on context.
We will not focus on how to create a Spark NLP RE pipeline and its details. The main point of this article is how to connect Spark NLP with Neo4j to create a KG. To achieve this goal, we will generate KGs by exporting the results of three different Spark NLP RE models to a Neo4j graph database.
1. Spark NLP — A Short Introduction
Spark NLP is an open-source NLP library built on top of Apache Spark and Spark ML. It provides a single unified solution for all NLP needs with an easy-to-use API that integrates with ML pipelines. John Snow Labs, an award-winning data analytics company, leads and sponsors the development of the Spark NLP library.
The library covers many common NLP tasks such as tokenization, stemming, lemmatization, part-of-speech (POS), named entity recognition (NER), etc. The full list of annotators, pipelines, and concepts is described in the online reference and you can find cards of pre-trained models and pipelines in the models hub.
Spark NLP has crossed 1 million downloads per month and 8 million total downloads since the beginning of its journey. You can find details of Spark NLP in Introduction to Spark NLP: Foundations and Basic Components, and you can check out the release notes of each version, including examples, and the PyPI page. You can also visit the Spark NLP workshop repository to try some NLP tasks.
2. Neo4j
The big data movement has increased the importance of graph databases. Storing and visualizing data together with its relationships has become popular in recent years. Social media companies use graph databases to track relationships and boost their recommendation systems. Relational databases are at least one step behind graph databases when it comes to real-time and big data analytics. In business and data analytics, the trend toward graph databases is growing rapidly as organizations seek to leverage data relationships.
Neo4j is a graph database that provides trusted and advanced tools to developers and data scientists. It is available as a fully managed cloud service or self-hosted. Neo4j Sandbox, Desktop, and Aura are the options for beginners.
3. Relation Extraction Models
Relation extraction is the task of predicting semantic relationships from text. Relationships usually occur between named entities (NER chunks) of certain types, such as Person, Location, Organization, etc. RE is the core component for building a relation KG. It is also essential for NLP applications like sentiment analysis, question answering, and summarization.
By the way, it is time to give some brief information about NER for beginners. NER stands for Named Entity Recognition, one of the core NLP tasks. In my opinion, data annotation and NER are the brain and heart of the NLP body, managing every other task and subtask. Recognizing named entities is basically a classification of tokens: NER tries to locate tokens and classify them into pre-defined categories such as persons, locations (countries, cities, counties, etc.), organizations, hospitals, medical centers, medical codes, measurements, units, monetary values, percentages, quantities, etc. We use NER outputs to feed downstream tasks that answer real-world questions:
- Which hospital and department was the patient admitted to?
- Which clinical tests have been applied to the patient?
- What are the test results?
- Which medication/procedure has been started?
Clinical RE plays a key role in clinical NLP tasks to extract information from healthcare reports. It is used for multiple purposes, such as detecting temporal relationships between clinical events, drug-drug interactions, and relations between medical problems and treatments or medications. We will not go deeper into the importance of RE in medical studies, as it is beyond the scope of this article.
You can find detailed information about Spark NLP NER and RE models at this link.
4. Spark NLP Relation Extraction KG using Neo4j
This article is the first one that shows how to create a KG using Spark NLP and Neo4j. The rest of the blog post will show you how to create a basic KG using Spark NLP, Neo4j, and a Colab notebook, respectively.
4.1. Creating Neo4j Sandbox
The Neo4j Sandbox is especially appropriate for users who are new to Neo4j. You don’t need to download or install anything, so it is a way to get a project up and running easily and rapidly. You can collaborate with your team members by inviting them. You can extend a sandbox for 7 days, so it can stay alive for up to 10 days in total.
For the purposes of this blog post, you have to sign up for Neo4j Sandbox. After signing in, you will create a blank sandbox as shown below.
You can find very fruitful example projects there to explore Neo4j’s capabilities on your own. The blank sandbox will be created while your coffee is still brewing. When it is ready, you can find your connection details, which will be used to connect to the sandbox from the Colab notebook.
4.2. Creating RE Pipeline using Spark NLP
We will use the licensed version of Spark NLP for healthcare applications. Before proceeding, you can get your free 30-day secret key by filling out the trial form.
Colab Setup
You can set up a Colab session as shown below. First, you will upload your secret key file to use Spark NLP JSL (the licensed version). Secondly, all required installations will be managed by the “jsl_colab_setup.sh” bash script according to the license key parameters.
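A minimal sketch of that setup follows. The license file name, its keys (e.g. SECRET, JSL_VERSION), and the setup script location are assumptions here; they come from your John Snow Labs license e-mail and the Spark NLP workshop repository, so adjust them to your own files.

import json
import os
from google.colab import files

# Upload the license key JSON file received from John Snow Labs
license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Expose the key/value pairs (e.g. SECRET, SPARK_NLP_LICENSE, JSL_VERSION)
# as environment variables so the setup script can read them
os.environ.update(license_keys)

# The bash script (downloaded from the Spark NLP workshop repository)
# installs PySpark, Spark NLP, and Spark NLP JSL in the licensed versions
!bash jsl_colab_setup.sh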
Last but not least, we import the corresponding libraries and start the Spark NLP session, as shown below. This notebook was prepared using version 3.2.2 of both Spark NLP and Spark NLP JSL.
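A sketch of that step; the secret value is read from the uploaded license keys:

import sparknlp
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with the licensed (JSL) libraries attached
spark = sparknlp_jsl.start(license_keys['SECRET'])

print("Spark NLP version:", sparknlp.version())          # e.g. 3.2.2
print("Spark NLP JSL version:", sparknlp_jsl.version())  # e.g. 3.2.2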
Pipeline
As discussed above, we will use three RE pipelines to get relation predictions from clinical text: temporal events, clinical relations, and posology, respectively. We will go through the first one in this article. You can explore the others in the notebook.
Let’s briefly explain the steps of the pipeline, sketched below. First, we read the document and parse it into sentences. Then, we split each sentence into tokens, compute token embeddings, and do POS tagging. Next, we detect named entities with the pretrained “ner_events_admission_clinical” model. After that, we merge the NER chunks (concatenating B- and I-tagged tokens). Finally, we run the dependency parser and the pretrained relation extraction model to extract the relations between NER chunks. You can find the three RE pipelines explained in this article, and more, in this notebook.
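In this sketch, the NER model name comes from the article; the embeddings, POS, dependency parser, and temporal-events RE model names are my assumptions for the usual clinical pretrained models, so verify them on the Models Hub before running.

documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

events_ner_tagger = MedicalNerModel.pretrained("ner_events_admission_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

# Merges B- and I-tagged tokens into single NER chunks
ner_chunker = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

events_re_model = RelationExtractionModel.pretrained("re_temporal_events_clinical", "en", "clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)

pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, words_embedder, pos_tagger,
                            events_ner_tagger, ner_chunker, dependency_parser, events_re_model])

# All stages are pretrained, so we fit on an empty DataFrame
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))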
To get faster inference at runtime, we will use a light pipeline, as sketched below. For more information about the Spark NLP LightPipeline, you should read this respectable article.
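A sketch of that step, assuming a hypothetical example sentence; the metadata keys (chunk1, entity1, etc.) are the standard fields emitted by Spark NLP RE annotators:

import pandas as pd
from sparknlp.base import LightPipeline

light_model = LightPipeline(model)

# Hypothetical clinical snippet for illustration
text = "She was admitted to the hospital on 01/30/2021 and started on IV fluids for dehydration."

annotations = light_model.fullAnnotate(text)[0]

# Flatten the relation annotations into a chunk1/chunk2/relation table
rows = []
for rel in annotations["relations"]:
    rows.append({
        "relation": rel.result,
        "entity1": rel.metadata["entity1"],
        "chunk1": rel.metadata["chunk1"],
        "entity2": rel.metadata["entity2"],
        "chunk2": rel.metadata["chunk2"],
    })

rel_df = pd.DataFrame(rows)
rel_df.head()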
As you see, the relation between NER chunks (the chunk1 & chunk2 columns) is stated in the relation column. In addition, this Pandas DataFrame also shows the types of the chunks in the related entity columns.
Until now, we have signed up for Neo4j Sandbox, created a blank sandbox, saved the free trial keys for the Spark NLP JSL version, set up a Colab session, created an RE pipeline, defined a light pipeline to get faster inference, and got the results. An emerging question at this point is: how can we use these results? First, we could feed them into downstream pipelines. Second, we could create a KG and try to get insights from them. For this study, we have chosen the second option. Therefore, we will create a KG from the RE results with Neo4j.
4.3. Neo4j Connection
The Neo4j connection class is provided by the Neo4j developer team. We will use it to connect to the blank Neo4j sandbox; this is where the sandbox’s Bolt URL, username, and password come in. Then, we will use two helper functions: the first updates data in the Neo4j sandbox in batch mode, and the second creates nodes and the relationships between them. I have used and modified the helper classes from this Medium post. Batch-mode loading is useful when the data exceeds 50K rows and helps prevent timeouts.
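The sketch below is adapted from that post, using the official neo4j Python driver; the batch size and the insert_data function name are my own choices:

from neo4j import GraphDatabase

class Neo4jConnection:
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)

    def close(self):
        if self.__driver is not None:
            self.__driver.close()

    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session, response = None, None
        try:
            session = self.__driver.session(database=db) if db is not None else self.__driver.session()
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally:
            if session is not None:
                session.close()
        return response

def insert_data(conn, query, rel_df, batch_size=500):
    # Send the DataFrame to Neo4j in batches to prevent timeouts on large loads
    total = 0
    for start in range(0, len(rel_df), batch_size):
        batch = rel_df[start:start + batch_size]
        conn.query(query, parameters={'rows': batch.to_dict('records')})
        total += len(batch)
    return total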
Great! Now we are ready to take some action with the Neo4j Sandbox, as shown below. First, create a connection and delete all existing nodes and relationships. Then, we should create constraints on nodes before populating the database; constraints set up indexing and avoid duplicate nodes. After the constraints are asserted, it is time to load the NER chunks and relationships. We will use the “add_ners_rels” function for this purpose.
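A sketch of those steps; the sandbox credentials are placeholders, the NER node label is my own choice, and the constraint syntax matches Neo4j 4.x (it changed in 5.x):

# Bolt URL, username, and password come from the sandbox connection details
conn = Neo4jConnection(uri="bolt://<sandbox-ip>:7687", user="neo4j", pwd="<password>")

# Start from a clean slate: remove all existing nodes and relationships
conn.query("MATCH (n) DETACH DELETE n")

# A uniqueness constraint also creates an index and avoids duplicate nodes
conn.query("CREATE CONSTRAINT ners IF NOT EXISTS ON (n:NER) ASSERT n.name IS UNIQUE")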
Let’s dive into the “add_ners_rels” function. When you call the function, you pass the relation DataFrame as a parameter. In the function, the command
UNWIND $rows AS row
takes each line of the DataFrame as a row, and you can use the column names of the DataFrame to create a node or to set a property. Hence, we create NER nodes from the “chunk1” and “chunk2” columns and set their types according to the “entity1” and “entity2” columns, respectively. Lastly, we create relationships between the NER nodes using the “relation” column.
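Here is a sketch of the whole function under those assumptions; the NER label and the RELATED relationship type are illustrative choices, and it relies on the insert_data helper defined above:

def add_ners_rels(rel_df):
    # Each DataFrame row becomes two NER nodes and one relationship between them
    query = '''
        UNWIND $rows AS row
        MERGE (n1:NER {name: row.chunk1})
          ON CREATE SET n1.type = row.entity1
        MERGE (n2:NER {name: row.chunk2})
          ON CREATE SET n2.type = row.entity2
        MERGE (n1)-[:RELATED {relation: row.relation}]->(n2)
        RETURN count(*) AS total
    '''
    return insert_data(conn, query, rel_df)

add_ners_rels(rel_df)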
4.4. KG Queries
Time to query and check out the results! Neo4j’s graph query language is Cypher. It allows users to store and retrieve data from the Neo4j graph database. Cypher is basically an SQL-inspired query language that provides a visual and logical way to match patterns of nodes and relationships. You can visit the Cypher webpage for details and online training.
The first query retrieves all nodes and relationships in the graph and saves them to a Pandas DataFrame. Voilà! We can get all nodes and relationships by running the query from Colab and check the visualization in the Neo4j Sandbox. As you see below, the notebook query (Q1) returns a list, and we convert it to a dictionary to save it as a DataFrame. The corresponding query (Q2) is for sandbox usage, to get a visualization of the result. By the way, you can run the same query (Q1) in the sandbox and check the results in table and text format. You can also play with the tools of the sandbox (top-right) to save the results in .png, .svg, .csv, and .json formats.
In addition, for the rest of the article, you will find the notebook queries on the left side and the corresponding visualizations on the right side. You can also find the one-to-one match of each notebook query to its sandbox query (for visualization) on the first line of the sandbox picture. To get the same results as the notebook, copy-paste the corresponding Cypher query into the sandbox.
Query in the Notebook (Q1):
MATCH (n1)-[r]-(n2)
RETURN n1.name, n1.type, r.relation, n2.name, n2.type

Query in the Sandbox (Q2):
MATCH (n1)-[r]-(n2)
RETURN n1, r, n2
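Running Q1 from the notebook and converting the returned records into a Pandas DataFrame might look like this (a sketch using the connection class above; Record.data() turns each neo4j record into a dictionary):

import pandas as pd

q1 = """
MATCH (n1)-[r]-(n2)
RETURN n1.name, n1.type, r.relation, n2.name, n2.type
"""

result_df = pd.DataFrame([record.data() for record in conn.query(q1)])
result_df.head()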
When we check out the query results, it is obvious that we can derive all the relationships between the corresponding NER chunks. These KGs are the insight layer on top of the associated data, so we can reason over the enriched data and use it confidently for complex decision-making.
Let’s check out the “DATE”-related information in the RE results. First, we filter the first node by setting its type to ‘DATE’ and retrieve everything connected to it.
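A sketch of that query, using the name/type node properties set during loading:

q_date = """
MATCH (n1 {type: 'DATE'})-[r]-(n2)
RETURN n1.name, n1.type, r.relation, n2.name, n2.type
"""

date_df = pd.DataFrame([record.data() for record in conn.query(q_date)])
date_df.head()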
Here are the other RE models’ results as a sneak peek! You can find the details in the notebook.
As a final example, let’s find the “PROBLEM-TEST” relations in the graph. So, we will figure out which tests were applied for a given problem and what their results were.
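A sketch of that query; the PROBLEM and TEST entity types come from the clinical NER model’s label set:

q_problem_test = """
MATCH (n1 {type: 'PROBLEM'})-[r]-(n2 {type: 'TEST'})
RETURN n1.name, r.relation, n2.name
"""

problem_test_df = pd.DataFrame([record.data() for record in conn.query(q_problem_test)])
problem_test_df.head()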
5. Conclusion
That’s all for now. In this article, we talked about how you can create a KG using RE models with Spark NLP. I hope you enjoyed it! We hope you will start playing with Spark NLP to create KGs. Stay tuned to Spark NLP, and don’t forget to follow our page.
Here are some useful links for beginners.
Introduction to Spark NLP: Foundations and Basic Components (Part-I)
Introduction to Spark NLP: Installation and Getting Started (Part-II)