Use LLMs to Turn CSVs into Knowledge Graphs: A Case in Healthcare

Jun 22, 2024

Recently I read a post presenting neo4j-runway. According to its GitHub page, “Neo4j Runway is a Python library that simplifies the process of migrating your relational data into a graph. It provides tools that abstract communication with OpenAI to run discovery on your data and generate a data model, as well as tools to generate ingestion code and load your data into a Neo4j instance”. In other words, you upload a CSV and the LLM identifies the nodes and relationships and automatically generates a Knowledge Graph.

Knowledge Graphs in healthcare represent a powerful tool for organizing and analyzing complex medical data. These graphs structure information in a way that makes it easier to understand relationships between different entities, such as diseases, treatments, patients, and healthcare providers.

KGs in healthcare offer several useful applications:

  • Integration of Diverse Data Sources: Knowledge graphs can integrate data from various sources such as electronic health records (EHRs), medical research papers, clinical trial results, genomic data, and patient histories.
  • Improving Clinical Decision Support: By linking symptoms, diagnoses, treatments, and outcomes, knowledge graphs can enhance clinical decision support systems (CDSS), because they draw on a vast amount of interconnected medical knowledge, potentially improving diagnostic accuracy and treatment effectiveness. This is the application I explore in this article.
  • Personalized Medicine: Knowledge Graphs enable the development of personalized treatment plans by correlating patient-specific data with broader medical knowledge. This includes understanding the relationships between genetic information, disease mechanisms, and therapeutic responses, leading to more tailored healthcare interventions.
  • Drug Discovery and Development: In pharmaceutical research, knowledge graphs can accelerate drug discovery by identifying potential drug targets and understanding the biological pathways involved in diseases.
  • Public Health and Epidemiology: Knowledge graphs are useful in public health for tracking disease outbreaks, understanding epidemiological trends, and planning interventions, as they can integrate data from various public health databases, social media, and other sources to provide real-time insights into public health threats.

Neo4j Runway is an open source library created by Alex Gilmore. You can find the repo here and a blog describing the library here.

Currently, the library supports only OpenAI LLMs for parsing the CSVs. It offers the following features:

  • Data Discovery: Leverage OpenAI LLMs to extract meaningful insights from your data.
  • Graph Data Modeling: Use OpenAI and the Instructor Python library to develop accurate graph data models.
  • Code Generation: Create ingestion code tailored to your preferred data loading method.
  • Data Ingestion: Utilize Runway’s built-in PyIngest implementation, a widely-used Neo4j ingestion tool, to load your data.
  • No Cypher Writing: The LLM does all the work, so you do not need to write Cypher yourself.

Here, besides letting the LLM handle the whole conversion from CSV to Knowledge Graph, I also use LangChain’s GraphCypherQAChain as a final step. It generates a Cypher query from the prompt, so we can query the graph without writing a single line of Cypher (the SQL-like language for querying a Neo4j graph database).

The library’s GitHub page has a financial example, but I wanted to test whether it works in a healthcare setting. Starting from a very simple dataset on Kaggle (Disease Symptoms and Patient Profile Dataset), with only 10 columns (Disease, Fever, Cough, Fatigue, Difficulty Breathing, Age, Gender, Blood Pressure, Cholesterol Level and Outcome Variable), I’d like to give a medical report to the LLM and get diagnostic hypotheses back.

Let’s go straight to the code. First, the libraries:

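# Shell: install the system and Python dependencies
sudo apt install python3-pydot graphviz
pip install neo4j-runway

# Python: imports
import os
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from neo4j_runway import Discovery, GraphDataModeler, IngestionGenerator, LLM, PyIngest
from IPython.display import display, Markdown, Image

Load Environment Variables: you can read my other article on how to create an instance in Neo4j Aura and authenticate.

load_dotenv()
# os.getenv() expects the variable NAME; keep the actual values in your .env file.
# The names below are placeholders, use whatever names your .env defines.
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')    # e.g. sk-openaiapikeyhere
NEO4J_URL = os.getenv('NEO4J_URL')              # e.g. neo4j+s://your.databases.neo4j.io
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')

# Runway's LLM wrapper, passed to Discovery and GraphDataModeler below
# (assumed here to pick up OPENAI_API_KEY from the environment)
llm = LLM()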
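A missing variable makes os.getenv return None silently, and the failure only shows up later when a connection is attempted. A minimal sketch of an early check follows; the helper name require_env and the placeholder values are my own assumptions, not part of neo4j-runway or python-dotenv.

```python
import os
from typing import Mapping


def require_env(names: list[str], env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the requested variables, or raise naming the ones that are missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Demonstration with an explicit mapping instead of the real environment
# (placeholder values, not real credentials).
fake_env = {"OPENAI_API_KEY": "sk-placeholder", "NEO4J_URL": "neo4j+s://example"}
settings = require_env(["OPENAI_API_KEY", "NEO4J_URL"], env=fake_env)
print(settings["NEO4J_URL"])
```

In the notebook you would call it right after load_dotenv(), without the env argument, so that it reads the real environment.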

Now, let’s load the medical data. Download the CSV from the Kaggle site and load it into the Jupyter notebook. This is a very simple dataset, but it is enough to test the concept.

disease_df = pd.read_csv('/home/user/Disease_symptom.csv')
disease_df

For instance, we can list all the diseases that cause difficulty breathing, which is useful not only for selecting nodes in the graph but also for developing a diagnostic hypothesis:

disease_df[disease_df['Difficulty Breathing']=='Yes']
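To turn that filter into an actual list of candidate diseases, the same idea can be combined with a second condition and reduced to unique names. The sketch below uses a few made-up rows (not taken from the Kaggle file) so that it runs on its own; the column names are the ones from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; the real data comes from the Kaggle CSV.
sample = pd.DataFrame({
    "Disease": ["Asthma", "Influenza", "Asthma", "Eczema"],
    "Difficulty Breathing": ["Yes", "No", "Yes", "No"],
    "Outcome Variable": ["Positive", "Positive", "Negative", "Negative"],
})

# Keep rows with difficulty breathing AND a positive outcome,
# then collapse them to a sorted list of distinct disease names.
mask = (sample["Difficulty Breathing"] == "Yes") & (sample["Outcome Variable"] == "Positive")
diseases = sorted(sample.loc[mask, "Disease"].unique())
print(diseases)
```

Replacing sample with disease_df gives the same list for the full dataset.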

Let’s move on. The library expects every column to be a string, even the integer ones, so we convert them all and then save the CSV:

disease_df.columns = disease_df.columns.str.strip()
for i in disease_df.columns:
    disease_df[i] = disease_df[i].astype(str)
disease_df.to_csv('/home/user/disease_prepared.csv', index=False)

Next, we describe the data for the LLM, including the possible values of each field:

DATA_DESCRIPTION = {
'Disease': 'The name of the disease or medical condition.',
'Fever': 'Indicates whether the patient has a fever (Yes/No).',
'Cough': 'Indicates whether the patient has a cough (Yes/No).',
'Fatigue': 'Indicates whether the patient experiences fatigue (Yes/No).',
'Difficulty Breathing': 'Indicates whether the patient has difficulty breathing (Yes/No).',
'Age': 'The age of the patient in years.',
'Gender': 'The gender of the patient (Male/Female).',
'Blood Pressure': 'The blood pressure level of the patient (Normal/High).',
'Cholesterol Level': 'The cholesterol level of the patient (Normal/High).',
'Outcome Variable': 'The outcome variable indicating the result of the diagnosis or assessment for the specific disease (Positive/Negative).'
}
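Since the descriptions are keyed by column name, a typo or a stray space in a header means the LLM gets no description for that column. Here is a small sketch of a consistency check, using only the standard library; check_descriptions is a hypothetical helper and the sample CSV text is invented for the demonstration.

```python
import csv
import io


def check_descriptions(csv_text: str, descriptions: dict[str, str]) -> tuple[set[str], set[str]]:
    """Compare stripped CSV header names with the description keys.

    Returns (columns without a description, descriptions without a column).
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = {name.strip() for name in header}
    keys = set(descriptions)
    return columns - keys, keys - columns


# Invented sample: " Cough" has a leading space and no description,
# while "Age" is described but absent from the header.
sample_csv = "Disease,Fever, Cough\nInfluenza,Yes,No\n"
sample_descriptions = {
    "Disease": "The name of the disease or medical condition.",
    "Fever": "Indicates whether the patient has a fever (Yes/No).",
    "Age": "The age of the patient in years.",
}
undocumented, unmatched = check_descriptions(sample_csv, sample_descriptions)
print(undocumented, unmatched)
```

Both sets should be empty before the descriptions are handed to the discovery step.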

The next step is to ask the LLM to analyze the tabular data and identify the elements that matter for generating a graph data model.

disc = Discovery(llm=llm, user_input=DATA_DESCRIPTION, data=disease_df)
disc.run()

This will generate a Markdown output of the data analysis:

Great. Now, let’s create the initial model:

# instantiate graph data modeler 
gdm = GraphDataModeler(llm=llm, discovery=disc)

# generate model
gdm.create_initial_model()

# visualize the data model
gdm.current_model.visualize()

Here, my focus is on the disease, so we will reorder some relationships.

gdm.iterate_model(user_corrections='''
Let's think step by step. Please make the following updates to the data model:
1. Remove the relationships between Patient and Disease, between Patient and Symptom and between Patient and Outcome.
2. Change the Patient node into Demographics.
3. Create a relationship HAS_DEMOGRAPHICS from Disease to Demographics.
4. Create a relationship HAS_SYMPTOM from Disease to Symptom. If the Symptom value is No, remove this relationship.
5. Create a relationship HAS_LAB from Disease to HealthIndicator.
6. Create a relationship HAS_OUTCOME from Disease to Outcome.
''')
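The model I am asking for is a star with Disease at the center. Because the LLM may or may not apply every correction, it helps to write the target shape down and compare it with the diagram. The sketch below is plain Python, independent of Runway's own model classes; the Relationship dataclass and the two helper functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    type: str
    source: str
    target: str


# Target model described in the corrections above.
NODE_LABELS = {"Disease", "Demographics", "Symptom", "HealthIndicator", "Outcome"}

RELATIONSHIPS = [
    Relationship("HAS_DEMOGRAPHICS", "Disease", "Demographics"),
    Relationship("HAS_SYMPTOM", "Disease", "Symptom"),
    Relationship("HAS_LAB", "Disease", "HealthIndicator"),
    Relationship("HAS_OUTCOME", "Disease", "Outcome"),
]


def dangling_relationships(labels, relationships):
    """Return relationships whose source or target label is not a known node."""
    return [r for r in relationships if r.source not in labels or r.target not in labels]


def hub_labels(relationships):
    """Return the set of labels that relationships start from."""
    return {r.source for r in relationships}


print(dangling_relationships(NODE_LABELS, RELATIONSHIPS))
print(hub_labels(RELATIONSHIPS))
```

If the visualized model still shows a relationship starting from a Patient node, the corrections were not fully applied and another iterate_model call is needed.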

from IPython.display import Image, display
gdm.current_model.visualize().render('output', format='png')
# Load and display the image with a specific width
img = Image('output.png', width=1200) # Adjust the width as needed
display(img)

Now we can generate the Cypher code and the YAML file to load the data into Neo4j. If you are just testing, or running this for the second time, you may want to reset the instance to blank first (erasing everything).

# instantiate ingestion generator
gen = IngestionGenerator(data_model=gdm.current_model,
                         username="neo4j",
                         password='yourneo4jpasswordhere',
                         uri='neo4j+s://123654888.databases.neo4j.io',
                         database="neo4j",
                         csv_dir="/home/user/",
                         csv_name="disease_prepared.csv")

# create ingestion YAML
pyingest_yaml = gen.generate_pyingest_yaml_string()

# save local copy of YAML
gen.generate_pyingest_yaml_file(file_name="disease_prepared")

Everything is ready. Let’s load data into the instance:

PyIngest(yaml_string=pyingest_yaml, dataframe=disease_df)

Go to your Neo4j Aura instance, click Open, enter your password, and run this Cypher query:

MATCH (n)
WHERE n:Demographics OR n:Disease OR n:Symptom OR n:Outcome OR n:HealthIndicator
OPTIONAL MATCH (n)-[r]->(m)
RETURN n, r, m

Press CTRL + ENTER and you will get this:

Inspecting the nodes and relationships, we have a huge interconnection of symptoms, health indicators and demographics:

Let’s look at Diabetes: since no filters are applied, both men and women appear, along with all the LAB, DEMOGRAPHIC and OUTCOME possibilities.

MATCH (n:Disease {name: 'Diabetes'})
OPTIONAL MATCH (n)-[r]->(m)
RETURN n, r, m

Or maybe all the diseases that present high blood pressure at clinical examination:

// Match the Disease nodes
MATCH (d:Disease)
// Match HAS_LAB relationships from Disease nodes to Lab nodes
MATCH (d)-[r:HAS_LAB]->(l)
MATCH (d)-[r2:HAS_OUTCOME]->(o)
// Ensure the Lab nodes have the bloodPressure property set to 'High'
WHERE l.bloodPressure = 'High' AND o.result='Positive'
RETURN d, properties(d) AS disease_properties, r, properties(r) AS relationship_properties, l, properties(l) AS lab_properties
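When the same kind of question is asked from Python, it is safer to pass values as query parameters than to paste them into the Cypher string. Below is a minimal sketch that builds the query text and the parameter dictionary. The function name diseases_by_lab is hypothetical, and the property names are the ones used in the query above; I am assuming the HAS_LAB target carries the HealthIndicator label, as requested in the model corrections.

```python
ALLOWED_LAB_PROPERTIES = {"bloodPressure", "cholesterolLevel"}


def diseases_by_lab(filters: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Build a Cypher query plus parameters for positive cases matching the lab filters."""
    # Property names cannot be parameterized in Cypher, so they are checked
    # against an allow-list before being placed in the query text.
    unknown = set(filters) - ALLOWED_LAB_PROPERTIES
    if unknown:
        raise ValueError(f"Unknown lab properties: {sorted(unknown)}")
    conditions = [f"l.{prop} = ${prop}" for prop in sorted(filters)]
    conditions.append("o.result = 'Positive'")
    query = (
        "MATCH (d:Disease)-[:HAS_LAB]->(l:HealthIndicator) "
        "MATCH (d)-[:HAS_OUTCOME]->(o:Outcome) "
        f"WHERE {' AND '.join(conditions)} "
        "RETURN DISTINCT d.name AS disease"
    )
    return query, dict(filters)


query, params = diseases_by_lab({"bloodPressure": "High"})
print(query)
print(params)
```

The pair can then be handed to whatever client runs the query, for example the Neo4jGraph object created later in this article via kg.query(query, params=params).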

Now my goal is clear: I want to submit a medical report to an LLM (in this case Google’s Gemini-1.5-Flash) so that it automatically creates the Cypher query via LangChain (GraphCypherQAChain) and returns the diseases a patient may have, given the symptoms, health indicators, etc. Let’s do this:

import warnings
import json
import textwrap
from langchain_community.graphs import Neo4jGraph

with warnings.catch_warnings():
    warnings.simplefilter('ignore')

NEO4J_USERNAME = "neo4j"
NEO4J_DATABASE = 'neo4j'
NEO4J_URI = 'neo4j+s://1236547.databases.neo4j.io'
NEO4J_PASSWORD = 'yourneo4jdatabasepasswordhere'

Connect to the Knowledge Graph in the instance and fetch its schema, which lists the node properties and the relationship properties.

kg = Neo4jGraph(
url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

kg.refresh_schema()
print(textwrap.fill(kg.schema, 60))
schema=kg.schema

Let’s initialize Vertex AI Gemini-1.5-Flash:

import vertexai
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import GraphCypherQAChain
from langchain.llms import VertexAI

# Initialize Vertex AI
vertexai.init(project="your-project", location="us-west4")

# model_name selects the Gemini model
llm = VertexAI(model_name="gemini-1.5-flash")

Now, the hard part: writing detailed instructions so that Gemini-1.5-Flash automatically generates a Cypher query that gets us the outcome we need from the graph database. We need a MASTER prompt for that! 😂 It combines chain-of-thought (CoT) with few-shot examples.

prompt_template = """
Let's think step by step:

Step 1: Task:
Generate an effective and concise Cypher statement with less than 256 characters to query a graph database.
Do not comment the code.

Step 2: Get to know the database schema: {schema}

Step 3: Instructions:
- In the cypher query, ONLY USE the provided relationship types and properties that appear in the schema AND in the user question.
- In the cypher query, do not use any other relationship types or properties in the user's question that are not contained in the provided schema.
- Regarding Age, NEVER work with the age itself. For example: 24 years old, use interval: more than 20 years old.
- USE ONLY ONE statement for Age, always use 'greater than', never 'less than' or 'equal'.
- DO NOT USE property keys that are not in the database.

Step 4: Examples:
Here are a few examples of generated Cypher statements for particular questions:

4.1 Which diseases present high blood pressure?
MATCH (d:Disease)
MATCH (d)-[r:HAS_LAB]->(l)
WHERE l.bloodPressure = 'High'
RETURN d.name

4.2 Which diseases present indicators as high blood pressure?
// Match the Disease nodes
MATCH (d:Disease)
// Match HAS_LAB relationships from Disease nodes to Lab nodes
MATCH (d)-[r:HAS_LAB]->(l)
MATCH (d)-[r2:HAS_OUTCOME]->(o)
// Ensure the Lab nodes have the bloodPressure property set to 'High'
WHERE l.bloodPressure = 'High' AND o.result='Positive'
RETURN d, properties(d) AS disease_properties, r, properties(r) AS relationship_properties, l, properties(l) AS lab_properties

4.3 What is the name of a disease of the elderly where the patient presents high blood pressure, high cholesterol, fever, fatigue
MATCH (d:Disease)
MATCH (d)-[r1:HAS_LAB]->(lab)
MATCH (d)-[r2:HAS_SYMPTOM]->(symptom)
MATCH (d)-[r3:HAS_DEMOGRAPHICS]->(demo)
WHERE lab.bloodPressure = 'High' AND lab.cholesterolLevel = 'High' AND symptom.fever = 'Yes' AND symptom.fatigue = 'Yes' AND TOINTEGER(demo.age) >40
RETURN d.name

4.4 What disease gives you fever, fatigue, no cough, no shortness of breath in people with high cholesterol?
MATCH (d:Disease)-[r:HAS_SYMPTOM]->(s:Symptom)
WHERE s.fever = 'Yes' AND s.fatigue = 'Yes' AND s.difficultyBreathing = 'No' AND s.cough = 'No'
MATCH (d:Disease)-[r1:HAS_LAB]->(lab:HealthIndicator)
MATCH (d)-[r2:HAS_OUTCOME]->(o:Outcome)
WHERE lab.cholesterolLevel='High' AND o.result='Positive'
RETURN d, properties(d) AS disease_properties, r, properties(r) AS relationship_properties


Step 5. These are the values allowed for each entity:
- Fever: Indicates whether the patient has a fever (Yes/No).
- Cough: Indicates whether the patient has a cough (Yes/No).
- Fatigue: Indicates whether the patient experiences fatigue (Yes/No).
- Difficulty Breathing: Indicates whether the patient has difficulty breathing (Yes/No).
- Age: The age of the patient in years.
- Gender: The gender of the patient (Male/Female).
- Blood Pressure: The blood pressure level of the patient (Normal/High).
- Cholesterol Level: The cholesterol level of the patient (Normal/High).
- Outcome Variable: The outcome variable indicating the result of the diagnosis or assessment for the specific disease (Positive/Negative).

Step 6. Answer the question {question}."""
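One detail worth checking in a prompt like this: with the default template format, anything in single curly braces is treated as a placeholder to fill, so a few-shot example containing a Cypher property map such as {name: 'Diabetes'} would need its braces doubled. The sketch below uses only Python's standard string.Formatter to list the placeholders of a template; the short demo_template is an invented stand-in for the full prompt, and template_fields is a hypothetical helper.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the placeholder names that str.format would try to fill."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# Invented miniature prompt: the Cypher property map uses doubled braces
# so that it survives formatting as literal text.
demo_template = (
    "Schema: {schema}\n"
    "Example: MATCH (d:Disease {{name: 'Diabetes'}}) RETURN d.name\n"
    "Question: {question}"
)

fields = template_fields(demo_template)
rendered = demo_template.format(
    schema="(:Disease)-[:HAS_LAB]->(:HealthIndicator)",
    question="Which diseases present high blood pressure?",
)
print(fields)
print(rendered)
```

Running template_fields(prompt_template) on the prompt above should return exactly schema and question; any other name points to an unescaped brace.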

We set up the GraphCypherQAChain …

cypher_prompt = PromptTemplate(
    input_variables=["schema","question"],
    template=prompt_template
)

cypherChain = GraphCypherQAChain.from_llm(
    VertexAI(model_name="gemini-1.5-flash", temperature=0.1),  # name the model explicitly, otherwise the wrapper's default is used
    graph=kg,
    verbose=True,
    cypher_prompt=cypher_prompt,
    top_k=10 # this can be adjusted also
)

… and submit the medical report:

cypherChain.run("""
Patient Information:
Jane Doe, a 58-year-old female, was admitted on June 15, 2024.

Chief Complaint and History of Present Illness:
Jane reported a high fever up to 104°F, body pain, and a rash,
starting five days prior to admission.

Past Medical History:
Jane has no significant past medical history and no known allergies.

Physical Examination:
Jane's temperature was 102.8°F, heart rate 110 bpm, blood pressure 100/70 mmHg, and respiratory rate 20 breaths
per minute. No petechiae or purpura were noted.

What disease may she have?""")

The output: here Gemini-1.5-Flash generates the Cypher to query the graph database, the result comes back as JSON, and the LLM interprets it and returns a readable response:

This result does not consider the knowledge base of Gemini-1.5-Flash, but only the Knowledge Graph it is querying. Imagine if we had a beautiful dataset with 300 features!
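Since the answer depends entirely on the generated Cypher, it can be worth checking that statement against the rules stated in the prompt before trusting the result. The following is a rough sketch of such a check, based on regular expressions and not on a real Cypher parser; check_cypher and its constants are hypothetical names, and the relationship types are the four defined in the data model.

```python
import re

ALLOWED_RELATIONSHIPS = {"HAS_DEMOGRAPHICS", "HAS_SYMPTOM", "HAS_LAB", "HAS_OUTCOME"}
# Clauses that modify the graph; a read-only question should never need them.
WRITE_KEYWORDS = re.compile(r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE)
# Captures the first type in patterns such as [r:HAS_LAB] or [:HAS_LAB].
REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def check_cypher(cypher: str, max_length: int = 256) -> list[str]:
    """Return a list of problems found in a generated Cypher statement."""
    problems = []
    if len(cypher) >= max_length:
        problems.append("too long")
    if WRITE_KEYWORDS.search(cypher):
        problems.append("contains a write clause")
    unknown = set(REL_TYPE.findall(cypher)) - ALLOWED_RELATIONSHIPS
    if unknown:
        problems.append(f"unknown relationship types: {sorted(unknown)}")
    return problems


good = "MATCH (d:Disease)-[r:HAS_LAB]->(l) WHERE l.bloodPressure = 'High' RETURN d.name"
bad = "MATCH (d:Disease)-[:TREATED_BY]->(x) DETACH DELETE d"
print(check_cypher(good))
print(check_cypher(bad))
```

This is a heuristic only: it would flag a keyword that appears inside a string literal and it ignores alternative types written as [:A|B]. The generated statement is printed when verbose=True, so it can be copied into the check by hand.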

Note that top_k in the GraphCypherQAChain can be set to 1 or to any other value. The last query matches 77 diseases with these symptoms, but with top_k set to 1 only one of them is passed on to the answer:
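As far as I understand the chain, top_k does not change the Cypher query itself: the query still runs in full, and only the first top_k rows of its result are handed to the LLM as context. A tiny sketch of that truncation, with placeholder rows standing in for the real result (the 77 comes from the query above, the row contents are invented):

```python
def limit_context(rows: list[dict], top_k: int = 10) -> list[dict]:
    """Keep only the first top_k result rows, as the context passed to the LLM."""
    return rows[:top_k]


# Placeholder rows; the real ones would be the records returned by Neo4j.
rows = [{"d.name": f"Disease {i}"} for i in range(77)]

print(len(limit_context(rows, top_k=1)))
print(len(limit_context(rows)))
```

Because the rows are cut in the order the database returns them, adding an ORDER BY or a DISTINCT to the few-shot examples is one way to influence which diseases survive a small top_k.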

The current neo4j-runway project is in beta and has the following limitations:

  • Single CSV input only for data model generation
  • Nodes may only have a single label
  • Only uniqueness and node / relationship key constraints are supported
  • Relationships may not have uniqueness constraints
  • CSV columns that refer to the same node property are not supported in model generation
  • Only OpenAI models may be used at this time
  • The modified PyIngest function included with Runway only supports loading a local Pandas DataFrame or CSVs

Acknowledgements

Google ML Developer Programs and Google Cloud Champion Innovators Program supported this work by providing Google Cloud Credits.

🔗 https://developers.google.com/machine-learning

🔗 https://cloud.google.com/innovators/champions?hl=en

Rubens Zimbres

Written by Rubens Zimbres

I’m an ML Engineer and Google Developer Expert in ML and GCP. I love studying NLP algos, LLMs and Cloud Infra. CompTIA Security +. PhD. www.rubenszimbres.phd
