Graph Data Modeling: Categorical Variables
Property graphs provide a lot of flexibility in data modeling. The question I see most often about graph data modeling is whether to make some piece of data a property, a label, or a node.
The image above shows all three examples; an employee name is a property, City is a node label, and there are three nodes. But behind the hardest questions about how to model specific things are usually categorical variables, so I thought I’d put together a short description of what this challenge is about and how to think it through.
As we go through this, try to keep in mind that data modeling is part art, part science, and requires experience in a domain. So we’ll talk about general principles and trade-offs, but there’s no substitute for thinking your problem through for yourself.
Categorical Variables
First, what’s a categorical variable? (It is also often called Reference Data or Reference Master Data.)
A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values.
Examples might include the state a person lives in, the “Type” of an object (is a flight international or domestic?), or a person’s gender.
Examples of things that are not categorical variables:
- A temperature measurement, which could be any floating point value (i.e. “continuous variables”)
- Strings, like names and addresses
These variables typically don’t have a naturally limited number of possibilities, or fixed set of selections. So 9 times out of 10, they’re easier to work with and people end up just making them properties of a node in a graph.
Scenario: Census Data
To illustrate our modeling thinking, and give an example, we’ll take a simple scenario. Say we need to design a database for the census that stores person information, along with the person’s citizenship, self-identified gender, and job (a code that stores their occupation, if they’re employed).
Options!
Categorical variables get the most questions in graph data modeling because there are three different ways you can model them. Let’s take just gender as an example:
- As a label that indicates boolean existence of a category, for example (:Person:Female)
- As a property value: (:Person { gender: 'Female' })
- As a separate node: (:Person)-[:GENDER]->(:Gender { name: 'female' })
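As a rough sketch, the three shapes could be created like this in Cypher. The id and name values are made up for illustration, and a real model would pick only one of the three options.

```cypher
// Option 1: gender as a label
CREATE (:Person:Female { id: 1, name: 'Alice' });

// Option 2: gender as a property value
CREATE (:Person { id: 2, name: 'Beth', gender: 'Female' });

// Option 3: gender as a separate node, shared by everyone in the category
MERGE (g:Gender { name: 'female' })
CREATE (:Person { id: 3, name: 'Carol' })-[:GENDER]->(g);
```

Note the MERGE in the third option: the category node should exist once and be reused, rather than being created again for every person.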
Cardinality
Before we get into a deep dive on the options, there’s one other thing to consider about our variables, which is their cardinality, or number of options in the category. In the gender variable we have a small handful of options depending on how you model it. For citizenship we have a medium number, but a hard upper limit: there aren’t going to be 5,000 countries next year. Finally, there are probably a few thousand job codes and more will be added for sure with time. So we might say that gender is low cardinality, citizenship is medium, and job is high cardinality.
Change Velocity
In addition to how many values there are, there’s also the question of how often they change. Gender changes rarely or not at all; citizenship does change occasionally; and a job, or something like a postal code, may change many times over a person’s lifetime.
So what do we do?
All three approaches express basically the same semantics. So which should you use? Let’s look at the options in depth, while keeping in mind that:
Data modeling is part art and part science; there are no hard and fast right answers, there’s only what works well for your use case.
With this article I’m hoping to teach by example, but you can only improve with practice.
Labels
When to consider: Labels provide for fast lookups in Neo4j, but they only really express booleans, i.e. the presence or absence of a category such as male or female. Labeling a node (:Person:Male) is almost equivalent to giving it a property called male with a value of true, because a label can only be present or absent. Labels tend to work great for low cardinality categorical variables, and when the categories can overlap. For example, a person can be both a friend and an enemy, and someone with multiple citizenships can easily be expressed as (:Person:American:German). Labels are also typically used as a way of partitioning graphs.
When to avoid: Labels are a bad choice for medium or high-cardinality categorical variables like postal codes. Sure, we can make a label like 23226, but it’s going to be unwieldy really fast if you have a data model with thousands of labels. They’re also a bad option when overlapping labels could create confusing semantics. For example if the gender options include “Female” and “Did Not Report”, combining those two labels would make the semantics of your data unclear.
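To make the overlapping case concrete, here is a small sketch of how labels stack. It assumes a Person with id 5 already exists and reuses the American and German labels from the example above.

```cypher
// Add a second citizenship to an existing person; labels simply stack.
MATCH (p:Person { id: 5 })
SET p:German;

// Count people who hold both citizenships.
MATCH (p:Person:American:German)
RETURN count(p);

// Removing a category is just as direct.
MATCH (p:Person { id: 5 })
REMOVE p:German;
```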
Property Approach
When to consider: Properties will never steer you wrong; they’re probably the easiest go-to option. They’re flexible, indexable, and easy to use. They support high-cardinality categories with ease, and they work well when the data changes frequently. They can also be targeted by constraints, so if you need a category to be unique, or you need it to be present and never null, then a property is a great choice.
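As a sketch of what “indexable” and “targeted by constraints” can look like, the statements below assume Neo4j 5 syntax (older versions use different keywords). The index and constraint names are hypothetical.

```cypher
// Index the category so lookups by value do not scan every Person.
CREATE INDEX person_gender IF NOT EXISTS
FOR (p:Person) ON (p.gender);

// Uniqueness constraint on an identifying property.
CREATE CONSTRAINT person_id_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.id IS UNIQUE;

// Existence constraint: gender must be present and never null.
// (Property existence constraints are an Enterprise Edition feature.)
CREATE CONSTRAINT person_gender_exists IF NOT EXISTS
FOR (p:Person) REQUIRE p.gender IS NOT NULL;
```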
When to avoid: When categories overlap or multiple apply, a property will often need to be an array, which brings with it a number of other problems like array sorting and uniqueness. Properties are also a poor choice when you need to look up other nodes that share that property as part of a regular query pattern. For example, you don’t want to write queries like this:
MATCH (p:Person { id: 5 })
WITH p
MATCH (othersInSamePostalCode:Person { postalCode: p.postalCode })
RETURN othersInSamePostalCode;
Separate Node
When to consider: Separate nodes are ideal when you need to look up other nodes that share a property value, or when the cardinality of the categorical variable is very high. For example, if a predicate on one of your queries is to filter people by shared occupation, that might argue for making job a separate node that you can “navigate through” to ease your queries. For example we might write:
MATCH (p:Person { id: 5 })
WITH p
MATCH (p)-[:HAS]->(j:Job)<-[:HAS]-(other:Person)
RETURN count(other);
This would make it very easy and performant to count how many other people share the same occupation as person #5.
Another form of finding nodes with something in common would start with the job.
// Find the number of accountants in a given postal code.
MATCH (j:Job { name: "Accountant" })-[:HAS]-(p:Person { postalCode: 23226 })
RETURN count(p);
Finally, a separate node is ideal when you might want to capture other metadata about the category later. For example, our “Job Code” is a category, but later on we might want to include a definition of that job code, or provide details about a licensing board for that job. If you’re thinking that something might be a complex object, and not just a category, then a separate node is probably a good option.
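Here is a minimal sketch of that kind of later enrichment. The definition property, the LicensingBoard label, the LICENSED_BY relationship, and the values shown are all hypothetical, chosen only to illustrate the idea.

```cypher
// Enrich the category node without touching any Person.
MATCH (j:Job { name: "Accountant" })
SET j.definition = 'Prepares and examines financial records'
// Hypothetical: link the job to the board that licenses it.
MERGE (b:LicensingBoard { name: 'Example Board of Accountancy' })
MERGE (j)-[:LICENSED_BY]->(b);
```

Had job been a plain property on each person, the same change would have meant repeating the metadata on every matching Person node.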
When to avoid: Separate nodes can risk becoming “supernodes” when they are too densely connected. For example, imagine a census graph with 200 million people in it; if we modeled gender as a separate node, then the “Male” node would be expected to have close to 100 million relationships! That’s a supernode for sure, and will slow down queries that access it. There is often a relationship between variable cardinality and selectivity; the fewer options a variable can take on (gender only has a small handful) the larger number of relationships it would have in a large data set. On the other hand, the more options a variable has, the less likely you are to end up with a supernode. “Job” probably has thousands of possible codes, so each individual job node ends up with far fewer relationships.
You’d be in trouble if you tried to run a query like the following on a big dataset, because you’d be “navigating through a supernode” (the :Gender node).
MATCH (p:Person { id: 5 })-[:GENDER]->(:Gender)<-[:GENDER]-(p2:Person)-[:JOB]->(:Job { name: "Accountant" })
RETURN p2;
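If gender were a property instead, one possible rewrite anchors on the far more selective :Job node and only compares the gender value at the end. This is a sketch under that assumption; the planner ultimately decides where to start, so it is worth checking the plan with PROFILE on real data.

```cypher
MATCH (p:Person { id: 5 })
// Start from the selective Job node rather than the dense Gender node.
MATCH (:Job { name: "Accountant" })<-[:JOB]-(p2:Person)
WHERE p2.gender = p.gender AND p2 <> p
RETURN p2;
```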
Decision Time & Conclusions
We’re not going to consider a sample query workload here, so always keep in mind that the decisions suggested below are for teaching purposes; data models ultimately exist to answer the questions you want to ask of them.
Always stay flexible to design your data model in a way that makes sense for the queries you need to ask.
With that though, I’d generally opt to make:
- Person Citizenship a label; because people can have overlapping citizenships, they act as good graph partitions, they are booleans (you’re either a German citizen or you’re not) and because Citizenship has low enough cardinality.
- Person gender a property; because it’s dense and non-overlapping (bad choice for a label) and because it would make supernodes all over the place if we made it a separate node.
- Job a separate node; because it’s very high cardinality, and because we might want to assert more data about jobs separately on that node.
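Put together, those three choices might look like the sketch below. The code property and its value are hypothetical, and the :HAS relationship type follows the earlier job queries.

```cypher
// Job: a separate node, keyed by a hypothetical code property.
MERGE (j:Job { code: 'J-1001' })
  ON CREATE SET j.name = 'Accountant'
// Person: citizenships as labels, gender as a property.
CREATE (p:Person:American:German { id: 5, gender: 'Female' })
CREATE (p)-[:HAS]->(j);
```

A query against that shape can then use all three mechanisms at once:

```cypher
// How many German citizens of each gender work as accountants?
MATCH (:Job { name: 'Accountant' })<-[:HAS]-(p:Person:German)
RETURN p.gender, count(p) AS people;
```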
This article is part of a series; if you found it useful, consider reading the others on labels, relationships, super nodes, and categorical variables.