Modeling Patient Journeys with Neo4j

Matt Holford
Neo4j Developer Blog
Jun 3, 2020

This article is co-written by Ravi Anthapu and Matt Holford. We are both engineers at Neo4j, Inc.

We do a lot of what we call “event-based” modeling at Neo4j. Graph databases are especially well suited to this kind of modeling. It enables our customers to quickly draw new and meaningful conclusions from massive amounts of data. But what is “event-based modeling” and why are graph databases so good at it? To answer these questions, we will walk through modeling events in the healthcare domain and see what sorts of insights we can draw from our model.

Journeys

When we model events, we look at the actions of individuals. Over the course of time, an individual participates in a number of events. The events themselves may vary according to the domain we are looking at; they may range in importance from the mundane to the earth-shaking. The point is that these events, when taken in sequence, tell us a “story” about that individual in our domain. This connected series of events is also commonly called a “journey”, as the individual travels step by step from one event to the next.

Although the story of one individual can be interesting in its own right, the true power of the event-based model is revealed when we start to aggregate the journeys of many individuals. For example, we can start to detect common patterns of behavior. This in turn makes it easier for us to pick out “outliers”, individuals exhibiting unusual behavior. We can also begin to group individuals based upon having similar behavior patterns. These groups, or communities, are useful in predicting future behavior, as it is likely that an individual will behave as other members of her community did when confronted with a similar situation.

Patient Journeys

In the medical domain, we can think of a person’s life as a “Patient Journey”, filled with medical events, starting with birth and culminating, eventually, in death. All interactions with the medical establishment during a person’s lifetime form steps along the path. These can include doctor visits, hospitalizations, prescriptions, diagnoses and medical procedures. If we aggregate a large number of patients’ journeys, we can begin to observe trends in disease and patterns in treatment. For example, we could see what conditions emerge as side effects of taking a medication, or what medical procedures follow diagnosis with a disease and how effective these procedures are.

Sankey diagram showing procedures within 90 days of Pre-eclampsia diagnosis

How can we get data for our model? Unless you are an insider at a biomedical organization, such as a hospital or research facility, it will be quite difficult to get hold of real patient data. While anonymization of patient data is possible, even anonymized data can require significant clearance and expense. Luckily, there is a third option: synthetic patient data. The Synthea project generates medical records on demand based upon statistical distributions. Modular in design, it is quite customizable with respect to disease prevalence, treatment options and patient demographics. Though not perfect (we notice various discrepancies from time to time), the generated patient data is remarkably plausible. Synthea is widely used and actively maintained. For example, as we write this, several efforts are underway to incorporate COVID-19 records. You can find instructions on how to generate Synthea data for the model on our GitHub repository.

Data Model — First Version

People and Places

Let’s start modeling this domain. First off, we can create a Patient node. Properties should include demographic information such as first name, last name, race, gender, birth date and death date (when applicable). To capture the Patient’s location, it makes sense to model a distinct Address node and link it to the Patient via a HAS_ADDRESS relationship. This design gives us a couple of advantages. First, some patients (e.g. family members) may live at the same address. Second, other entities such as doctors and hospitals have addresses as well. The Address node allows us to readily match entities that share an address. Additionally, the Address node can exploit Neo4j’s built-in geospatial capabilities by representing the latitude and longitude of the Address using the Point type. Then we can quickly compute the distances between Addresses and thus find Patients close to each other or to other landmarks.
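To make the distance idea concrete outside the database, here is a minimal Python sketch. It assumes a haversine (great-circle) calculation over latitude and longitude, which is a common way to measure distance between geographic points. The `Address` class, the `EARTH_RADIUS_M` constant and both function names are hypothetical and are not part of our model or of any Neo4j API.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Dict, List

# Mean Earth radius in metres; an assumption of this sketch.
EARTH_RADIUS_M = 6_371_000


@dataclass(frozen=True)
class Address:
    """Hypothetical stand-in for an Address node with a point location."""
    street: str
    latitude: float
    longitude: float


def haversine_m(a: Address, b: Address) -> float:
    """Great-circle distance between two addresses, in metres."""
    lat1, lon1 = radians(a.latitude), radians(a.longitude)
    lat2, lon2 = radians(b.latitude), radians(b.longitude)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def patients_within(patient_addresses: Dict[str, Address],
                    landmark: Address, radius_m: float) -> List[str]:
    """Ids of patients whose address lies within radius_m of the landmark."""
    return sorted(
        patient_id
        for patient_id, address in patient_addresses.items()
        if haversine_m(address, landmark) <= radius_m
    )
```

Because several patients can point at the same `Address` value, the same lookup also covers the shared-household case mentioned above.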

Providers (doctors) and Organizations (hospitals, clinics, etc) can be modeled the same way. We keep things like name and speciality as properties on Provider and Organization nodes and move location out into a separate node via a HAS_ADDRESS relationship. There is also a connection between Providers and the Organizations for whom they work. So, we create a BELONGS_TO relationship between Providers and Organizations.

Let’s also model Payers: insurance providers or others who foot the bill for medical encounters. A patient’s insurance can change over the years and we will want to capture that in our graph. Thus, we create INSURANCE_START and INSURANCE_END relationships between the Patient and Payer nodes. A relationship property can be used to indicate the times at which coverage started or ended.

Encounters and Events

We can treat each interaction with the medical community as a separate Encounter. Each Encounter is given a date property and is associated with a Provider and a Payer. So, for each Encounter node, we create HAS_PAYER and HAS_PROVIDER relationships connecting it to the relevant parties. We also create a link from the Patient via the HAS_ENCOUNTER relationship. The Encounter node should also be linked to the events that occurred or commenced within its span. We model six types of event nodes (Condition, Drug, Procedure, CarePlan, Allergy and Observation), each of which can occur zero or more times as part of an Encounter. Properties of these nodes would include a name and a formal identification code (e.g. from SNOMED or ICD-9). The relationships connecting them to the Encounter would be called, naturally enough, HAS_CONDITION, HAS_DRUG, etc.

Let’s consider one more subtle point. Some medical events can have start dates as well as end dates. For example, a condition may be cured or disappear after a while; a patient may stop taking a drug or following a care plan. Other medical events have only a single date on which they were performed. For example, a medical procedure is performed at a particular time and not over a span of dates. For all events, we use a startDate property to capture the date on which the event started or was performed. For those events with an end date, we do something a bit different. Whilst we could also model endDate as a property, what if we wanted to capture patient journeys following the end of something? We may want to see what happens after patients stop taking a medication or after an illness goes away. For this reason, we treat the end as a separate Encounter and connect it to the original Encounter via a HAS_END relationship. The end Encounter is also connected to the event itself, e.g. by HAS_CONDITION.
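The end-as-Encounter idea can be sketched in plain Python. This is an illustration only, not Neo4j code: the `Event` and `Encounter` classes and the `record_end` and `encounters_after` helpers are hypothetical names, and the sketch simplifies by allowing one end per Encounter.

```python
import datetime as dt
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    kind: str          # e.g. "Condition", "Drug", "CarePlan"
    code: str          # formal identification code
    description: str


@dataclass
class Encounter:
    date: dt.date
    events: List[Event] = field(default_factory=list)  # HAS_CONDITION, HAS_DRUG, ...
    end: Optional["Encounter"] = None                   # HAS_END (simplified: one per Encounter)


def record_end(start: Encounter, event: Event, end_date: dt.date) -> Encounter:
    """Create the end Encounter, link it from the start and attach the same event."""
    if event not in start.events:
        raise ValueError("event does not belong to the start encounter")
    end = Encounter(date=end_date, events=[event])
    start.end = end
    return end


def encounters_after(encounters: List[Encounter], end: Encounter) -> List[Encounter]:
    """Encounters that happen after an end Encounter, in date order."""
    later = [e for e in encounters if e.date > end.date]
    return sorted(later, key=lambda e: e.date)
```

Since the end is an Encounter in its own right, "what happened after the patient stopped this drug" becomes an ordinary question about the Encounters that follow it.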

There. Now we have a complete first pass at a Patient Journey model.

Using the Data Model

To follow a single patient’s journey, we could find that Patient by their ID and traverse the HAS_ENCOUNTER relationships. From there, we can hop to whatever event node by navigating the HAS_CONDITION (or HAS_DRUG, etc) relationship. If we wanted to capture the journeys of all patients after being diagnosed with a disease, we could start from the Condition node representing the disease, get all Encounters where the disease was diagnosed, get the Patients attached to those Encounters and proceed with the journey for each patient as above.

To form a journey, we want to present the patient’s Encounters in date order; but here is where we start to run into problems. To find the patient’s first Encounter, we need to sort all of the Encounters connected to the Patient by date. Since we have created an index on the date property for Encounters, this does not seem so bad. However, we must do a similar sort to obtain the next Encounter and again for the Encounter after that. This may not have too negative an effect when finding the journey of a single patient, barring one who is especially sickly and/or long-lived, but the impact accumulates when we aggregate across thousands or even millions of patients. We could prevent multiple sorts by sorting once and keeping the full list in memory, but here again we run into resource problems when we scale up the number of patients.
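The cost of finding each next Encounter by date can be illustrated with a small Python sketch. It assumes every Encounter of a patient has a distinct date, and it counts how many Encounters are examined, which is only a rough stand-in for database work and not a measurement of Neo4j db hits. The function names are hypothetical.

```python
import datetime as dt
from typing import List, Optional, Tuple

Encounter = Tuple[str, dt.date]  # (encounter id, date)


def next_encounter(encounters: List[Encounter], after: dt.date) -> Optional[Encounter]:
    """Earliest encounter strictly later than `after`; scans the whole list."""
    later = [e for e in encounters if e[1] > after]
    return min(later, key=lambda e: e[1]) if later else None


def journey_by_rescanning(encounters: List[Encounter]) -> Tuple[List[Encounter], int]:
    """Build the journey step by step and count how many encounters were examined."""
    journey: List[Encounter] = []
    examined = 0
    # Finding the first encounter already means looking at all of them.
    current = min(encounters, key=lambda e: e[1]) if encounters else None
    examined += len(encounters)
    while current is not None:
        journey.append(current)
        # Each further step looks at every encounter of the patient again.
        examined += len(encounters)
        current = next_encounter(encounters, current[1])
    return journey, examined
```

With n Encounters this approach examines n * (n + 1) entries, so the work grows quadratically per patient. Sorting once avoids the repetition but requires holding each patient's full list in memory.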

Data Model — Improvements

Hmm, how can we do better? Can graph technology lead us to a better solution? Let’s start by thinking about the notion of a “journey”. If we were walking along a clearly marked path, a yellow brick road for example, it would be immediately obvious what our next step should be: the next brick. We wouldn’t need to consult a map to see where to go; the direction to go is right before our eyes. Native graph databases such as Neo4j store data in a manner compatible with this metaphor. Each node holds a direct reference to its relationships, which are chained together as a linked list, so moving to a connected node does not require an index lookup.

The NEXT Relationship

We can exploit this powerful data structure by connecting each Encounter to the next chronological Encounter for that Patient. We call this relationship, naturally enough, NEXT, and here is where the magic of graph databases comes into play. We can compute the whole patient journey by simply hopping along the NEXT relationship from Encounter to Encounter. From each Encounter, it’s still just a quick hop to find whatever Condition, Drug, etc. occurred during that Encounter.
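As a rough illustration of the idea, the following Python sketch sorts a patient's Encounters once, links each to its successor, and then walks the chain. The `next` attribute stands in for the NEXT relationship; the class and function names are hypothetical and this is not how the relationships are created in Neo4j itself.

```python
import datetime as dt
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Encounter:
    id: str
    date: dt.date
    events: List[str] = field(default_factory=list)
    next: Optional["Encounter"] = None  # stands in for the NEXT relationship


def link_next(encounters: Iterable[Encounter]) -> Optional[Encounter]:
    """Sort once at load time, chain the encounters and return the first one."""
    ordered = sorted(encounters, key=lambda e: e.date)
    for current, following in zip(ordered, ordered[1:]):
        current.next = following
    return ordered[0] if ordered else None


def journey(first: Optional[Encounter]) -> Iterator[Encounter]:
    """Walk the chain: each step is a single hop, with no sorting or searching."""
    current = first
    while current is not None:
        yield current
        current = current.next
```

The sort happens once, when the data is loaded. Every later journey query pays only for the hops it actually makes.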

With the NEXT relationship, the model looks like this. Now let us see if our thought process is correct.

Performance Comparison

We can run a few Cypher queries to indicate the performance characteristics of the two models. In the first model, the query to get the journey of a single patient would look something like this:

MATCH (p:Patient {id: $patientId})-[:HAS_ENCOUNTER]->(e)
WITH e ORDER BY e.date
MATCH (e)-[:HAS_CONDITION|HAS_DRUG|HAS_CARE_PLAN|HAS_ALLERGY|HAS_PROCEDURE]->(x)
OPTIONAL MATCH (e)-[:HAS_END]->(end)
RETURN labels(x)[0] AS eventType, x.description AS name,
       e.date AS startDate, coalesce(end.date, 'NA') AS endDate

Profiling this query on our test instance of a million patients, we found that it took 3 ms and 334 db hits. Not bad. In the second model, the query would resemble:

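MATCH (p:Patient {id: $patientId})-[:HAS_ENCOUNTER]->(e)
// Keep only the first Encounter: the one with no incoming NEXT relationship
WHERE apoc.node.degree.in(e, 'NEXT') = 0
WITH e
// NEXT* matches one or more hops, so e2 never binds to the first Encounter itself;
// NEXT*0.. would also return the events of that first Encounter
MATCH (e)-[:NEXT*]->(e2)-[:HAS_CONDITION|HAS_DRUG|HAS_CARE_PLAN|HAS_ALLERGY|HAS_PROCEDURE]->(x)
OPTIONAL MATCH (e2)-[:HAS_END]->(end)
RETURN labels(x)[0] AS eventType, x.description AS name,
       e2.date AS startDate, coalesce(end.date, 'NA') AS endDate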

This query also takes just a couple of milliseconds and spends 390 db hits under profile.

Hmm. This suggests that using the NEXT relationship comes with a small cost here. The Patient node is directly connected to every one of its Encounter nodes, so for a single patient, fetching them all and sorting them is already an efficient option.

Let’s take a look at this data from a different perspective. When we are trying to analyze a Condition, we would not start from a single Patient node. We want to look at the journeys of all patients after diagnosis with a Condition. Let’s look at such a query in the first model:

MATCH (c:Condition {code: $code})<-[:HAS_CONDITION]-(encounter)
WITH encounter LIMIT 1
MATCH (encounter)<-[:HAS_ENCOUNTER]-(patient)
WITH patient, encounter
MATCH (patient)-[:HAS_ENCOUNTER]->(e)
WHERE encounter.date <= e.date < (encounter.date + duration('P90D'))
WITH e ORDER BY e.date
MATCH (e)-[:HAS_CONDITION|HAS_DRUG|HAS_CARE_PLAN|HAS_ALLERGY|HAS_PROCEDURE]->(x)
OPTIONAL MATCH (e)-[:HAS_END]->(end)
RETURN labels(x)[0] AS eventType, x.description AS name,
       e.date AS startDate, coalesce(end.date, 'NA') AS endDate

Note that for these purposes we have reduced the number of patients to just one. On our test server, this completes in 2 ms and consumes 428 db hits. Adapting this query to the second model, we end up with something like this:

MATCH (c:Condition {code: '271737000'})<-[:HAS_CONDITION]-(e)
WITH e LIMIT 1
// NEXT* starts one hop after e, so the diagnosis Encounter itself is not returned;
// NEXT*0.. would include it, matching the <= comparison in the first model
MATCH (e)-[:NEXT*]->(e2)-[:HAS_CONDITION|HAS_DRUG|HAS_CARE_PLAN|HAS_ALLERGY|HAS_PROCEDURE]->(x)
WHERE e2.date < (e.date + duration('P90D'))
OPTIONAL MATCH (e2)-[:HAS_END]->(end)
RETURN labels(x)[0] AS eventType, x.description AS name,
       e2.date AS startDate, coalesce(end.date, 'NA') AS endDate

This query also takes 2 ms but spends only 412 db hits, so it seems our thought process is a valid one. The query using NEXT is not only easier to understand; it also performs better. The saving from using NEXT instead of sorting (16 db hits) may appear small with a patient population of just one, but when multiplied across thousands or even millions of patients it makes a huge difference.

Using the NEXT relationship also allows us to form queries that require back and forth traversal between Encounters.

MATCH (a:Allergy)<-[:HAS_ALLERGY]-(e)-[:HAS_END]->(e2),
(e)<-[:HAS_ENCOUNTER]-(patient)
WITH patient, a, e, e2 LIMIT 10
MATCH p=(e)-[:NEXT*]->(e2)
WITH patient, a, nodes(p) AS nodes
UNWIND nodes AS tempe
MATCH (tempe)-[:HAS_DRUG]->(d)
RETURN patient.firstName AS firstName, patient.lastName AS lastName,
a.description AS Allergy, collect(d.description) AS drugs

In this query we find Encounters in which an Allergy ceased to be present. We then collect the drugs taken by patients between the onset of the allergy and its relief. Across the patient population, this hints at which drugs may be associated with successful treatment of an allergy. The query above takes 4527 db hits. If we attempted such a query with our sort-reliant model, it would look like:

MATCH (a:Allergy)<-[:HAS_ALLERGY]-(e)-[:HAS_END]->(e2),
(e)<-[:HAS_ENCOUNTER]-(patient)
WITH patient, a, e, e2 LIMIT 10
MATCH (patient)-[:HAS_ENCOUNTER]->(encounter)
WHERE e.date <= encounter.date <= e2.date
WITH patient, a, encounter ORDER BY encounter.date
WITH patient, a, collect(encounter) AS nodes
UNWIND nodes AS tempe
MATCH (tempe)-[:HAS_DRUG]->(d)
RETURN patient.firstName AS firstName, patient.lastName AS lastName,
a.description AS Allergy, collect(d.description) AS drugs

This query consumes 7509 db hits, about 66% more than the NEXT version! As you can see, the more we need to traverse along Encounter nodes, the more performance benefit we gain from the NEXT relationship.
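To put the three comparisons side by side, here is a small Python calculation over the db-hit figures quoted in this article. The helper names and the short query labels are ours, introduced only for this illustration.

```python
from typing import List, Tuple


def percent_change(baseline: float, value: float) -> float:
    """Change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100


# (query, db hits with sorting, db hits with NEXT), as reported in the text
DB_HITS: List[Tuple[str, int, int]] = [
    ("single patient journey", 334, 390),
    ("90 days after a diagnosis", 428, 412),
    ("drugs taken before an allergy ended", 7509, 4527),
]


def summarize(rows: List[Tuple[str, int, int]]) -> List[Tuple[str, float]]:
    """Percent change in db hits when moving from sorting to NEXT, per query."""
    return [
        (name, round(percent_change(sort_hits, next_hits), 1))
        for name, sort_hits, next_hits in rows
    ]


for name, change in summarize(DB_HITS):
    print(f"{name}: NEXT uses {change:+.1f}% db hits relative to sorting")
```

On these figures, NEXT costs about 17% more db hits for the single-patient query, but saves about 4% on the diagnosis query and about 40% on the allergy query.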

Diagram of an individual patient journey

Next Steps

Now we have our complete model for patient journeys! On our GitHub, you can find Cypher scripts to load data into Neo4j from Synthea. For ingestion, we use a Python utility, pyingest, which leverages batching and optimizations from the Pandas library to enable rapid ingestion into the graph. We have found that it outperforms Neo4j’s built-in CSV loader while consuming fewer server resources.

In our next blog post, we will look at how we can use Neo4j’s Java API to quickly calculate patient journeys starting from events of interest.
