Graph Data Modeling: All About Keys

David Allen
Neo4j Developer Blog
Jun 29, 2020

In two previous articles, we covered aspects of graph data modeling such as categorical variables and how relationships work. In this article, let’s address how to identify things in your graph with keys.

What’s a Key?

A graph key is a property, or set of properties, that helps you identify a node or relationship in a graph.

Keys are frequently used as the starting point for a graph traversal, or as a condition that constrains which nodes or relationships a query matches.

What do we want out of a key?

Before we get into different options for keys in Neo4j, let’s list the attributes of what makes for a really great database key.

  • Authority: We have to know who is responsible for the key. Someone has to issue it, and someone has to say which is which. That could be the database itself, you the application developer, or maybe some external authority (for example, your local DMV if you use driver’s license numbers as a key).
  • Stability: Data is constantly changing around us. We want stable identifiers so that older systems & code can refer to newer data.
  • Uniqueness Context: Of course keys have to be unique, but they’re unique within a certain context. A driver’s license ID is unique only within a certain state in the USA (it could be reused elsewhere).
  • Opacity: Does the key carry an internal meaning? Could a person guess the next key by looking at a previous one? If you saw a key like “2020-06-19-abc”, you might infer that the key value is a date with a suffix, which means the key is not opaque.

Do These

  • Have an authority for your IDs: either you, or a third party you can live with that publishes sensible rules about how it issues them. Whichever it is, know who it is.
  • Use a single property, and put a unique property constraint on it. The constraint ensures your keys cannot be duplicated, and it also creates an index for fast lookups.
  • Have as wide a uniqueness context as possible, to keep you from tripping on duplicates.
  • Use opaque identifiers. It’s enough for a key to be unique; don’t try to overload it with extra meaning.
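
As a minimal sketch of the “single property plus unique constraint” advice, the statements below show roughly what this could look like. The `Thing` label, the `id` property, and the key value are placeholders, and the constraint syntax is the form accepted by Neo4j 4.0 and 4.1; later releases use a `CREATE CONSTRAINT ... FOR ... REQUIRE ...` form instead, so check the documentation for your version.

```cypher
// Assumed label and property names; constraint syntax as in Neo4j 4.0/4.1.
CREATE CONSTRAINT ON (t:Thing) ASSERT t.id IS UNIQUE;

// The constraint is backed by an index, so a lookup by key can use it.
MATCH (t:Thing { id: "some-key-value" })
RETURN t;
```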

Don’t Do These

Don’t Use Smart Keys

Smart keys are usually compound values which encode information into a key. Imagine we had an ordering system and we identified a customer order as 2020-06-19-VA-9912. It might seem convenient that we’ve encoded the order date (2020-06-19), the state of the order (Virginia) and the order number (9912) into a single key. In practice though, smart keys usually end up as a disaster, for several reasons:

  • If the date of the order was entered incorrectly, the ID for the record needs to change (?!?!).
  • It encourages other developers to try to “parse the ID” to extract information, which is a pain, and could give the wrong result (if the order number changed!).

The key thing to notice about smart keys is that they always have low opacity; that’s the point of them.
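
One way to avoid a smart key, sketched here under assumptions, is to keep the key opaque and store the facts it would have encoded as ordinary properties. The `Order` label, the property names, and the `$orderId` parameter are all hypothetical, and `apoc.create.uuid()` requires the APOC library to be installed.

```cypher
// Hypothetical Order node: the key is opaque, the facts live in properties.
CREATE (o:Order {
  id: apoc.create.uuid(),
  orderDate: date("2020-06-19"),
  state: "VA",
  orderNumber: 9912
});

// Correcting a mis-entered date changes a property; the key stays the same.
MATCH (o:Order { id: $orderId })
SET o.orderDate = date("2020-06-18");
```

Nobody has to parse the ID to find the order date, and fixing the date never invalidates references to the order.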

Don’t Use Compound Keys in Graphs

In relational databases, it’s typical to define compound keys of two or more attributes, but in my view that never makes sense in a graph. A common reason to use a compound key is a dependency between columns. For example, maybe your customer code + state code together is what uniquely identifies a record. But since graphs let you have as many nodes as you want, this Cypher code:

MATCH (r:Record { ccode: "X", scode: "Y" })

Will usually be worse than this:

MATCH (:A { scode: "Y" })-[:LINKED_TO]->(r:Record { ccode: "X" })

The point is that in most cases, a good data model can eliminate the need for a compound key.
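
To make the second pattern concrete, here is a rough sketch of how such data could be written so that the customer code is unique only beneath its parent node. It reuses the placeholder labels and properties from the queries above; the extra `id` property and the use of APOC are assumptions added for illustration.

```cypher
// Find or create the node that carries the state code.
MERGE (a:A { scode: "Y" })
// Find or create the record with this customer code under that node.
// Because `a` is already bound, only records linked to it are considered.
MERGE (a)-[:LINKED_TO]->(r:Record { ccode: "X" })
  ON CREATE SET r.id = apoc.create.uuid()
RETURN r;
```

The record still gets its own single-property key, while the dependency on the state code is expressed as a relationship rather than as a second key column.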

What are your options in Neo4j?

The Neo4j Internal ID

Every node and relationship gets its own “internal” identifier which you can access with the id() function.

[Figure: the internal ID of a node]

The advantage of these IDs is that they’re always guaranteed to be there for you. And lookup by ID is very fast in Neo4j because of the way the graph storage in memory works. But internal node IDs (in my view) make for very bad application identifiers, for a number of reasons:

  • They get reused. They’re guaranteed to be unique in a graph, but if you delete node 25 and keep creating data, you may have a different node 25 later on.
  • They don’t track between databases. If you dump all of your data from system A and load the same data into a different system B, you won’t necessarily have the same IDs. So they’re not useful for connecting data in different databases, for example with Neo4j Fabric.

Basically, the only guarantee you get is that they are unique within that graph (not across the DBMS). While these IDs are opaque, the authority for the identifier is the database (not your application), and the uniqueness context is scoped to a single graph on a single system only.
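
For illustration, the queries below read internal IDs next to an application-level key and then look a node up by internal ID. The `Thing` label and its `id` property are placeholders. The `id()` function is the one available in the Neo4j 4.x line; newer releases deprecate it in favor of `elementId()`.

```cypher
// Compare the database-issued ID with your own key.
MATCH (n:Thing)
RETURN id(n) AS internalId, n.id AS applicationKey
LIMIT 5;

// Lookup by internal ID works, but after deletes or a dump and reload
// the value 25 may refer to a different node, or to none at all.
MATCH (n)
WHERE id(n) = 25
RETURN n;
```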

Globally Unique IDs

Using APOC’s built-in UUID function, you can create them on the fly like this:

CREATE (m:Thing { id: apoc.create.uuid() });

[Figure: an APOC-generated UUID]

These are quite good, because you are the authority and manage them yourself. They are stable and never need to change. They are, for practical purposes, unique across all contexts, and they’re very opaque. They are 128-bit numbers that are pseudo-randomly generated. Practically speaking, you don’t have to worry about collisions: the space is so large that if you generate 103 trillion identifiers this way (and we’re pretty sure you’re going to be under that), your chances of a collision are still one in a billion. Good enough.
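
The one-in-a-billion figure can be checked with the usual birthday-bound approximation. The sketch below assumes version 4 UUIDs, in which 122 of the 128 bits are random; that assumption is mine, not something stated above, but it reproduces the number quoted.

```latex
% Birthday-bound approximation for n random identifiers drawn from N values.
% Assumption: version 4 UUIDs, where 122 of the 128 bits are random.
\[
  p_{\text{collision}} \approx \frac{n^{2}}{2N}
  = \frac{(1.03 \times 10^{14})^{2}}{2 \cdot 2^{122}}
  \approx \frac{1.06 \times 10^{28}}{1.06 \times 10^{37}}
  \approx 10^{-9}
\]
```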

They come with downsides though.

  • They’re big and clunky, and contain more data than is needed for an identifier, which means they take up space when you have billions of them.
  • They’re hostile to human readability, which can matter if your IDs end up in URLs, which is pretty common.

Somebody Else’s IDs

Let’s face it, usually our source data is coming from somewhere else. If we’re importing tweets from Twitter into a graph, all of those tweets have existing IDs. And so often, a good approach will be to adopt someone else’s ID scheme that came with your data import.

It’s tough to say what the pros of this approach are, because it will depend on what the identifier is. The best we can do is go back to those principles we’re looking for (opacity, uniqueness, etc.) and evaluate an ID against those.

We can talk about specific negatives of adopting someone else’s identifiers though:

  • Authority: You aren’t it. Which means you’re trusting some element of your data’s durability to that outside authority’s IDs. Is this a problem? Depends on your situation. Maybe, maybe not.
  • Mix-ability: Your graph might have one feeder source right now (Twitter, for example). What happens when you start importing other sources? If you bring in Facebook posts, will you have two identifiers, or will the ID depend on the source? This gets ugly quickly.

As a general recommendation: always store any upstream identifier that you can get your hands on. But don’t use it as your identifier. Use it for correlation with your upstream system. There’s nothing wrong with choosing your own ID in addition to storing a remote identifier.
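
A minimal sketch of that recommendation for the tweet import might look like this. The `Tweet` label, the `twitterId` property name, and the `$tweetId` parameter are hypothetical, and `apoc.create.uuid()` assumes APOC is installed.

```cypher
// $tweetId holds the upstream (Twitter) identifier for the imported tweet.
MERGE (t:Tweet { twitterId: $tweetId })
  // Issue our own opaque key the first time we see this tweet.
  ON CREATE SET t.id = apoc.create.uuid()
RETURN t.id AS id, t.twitterId AS upstreamId;
```

The upstream ID is used to correlate re-imports with the source system, while everything inside the graph refers to `t.id`.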

Auto-Incrementing Numbers

A common approach is to use an auto-incrementing number. Neo4j doesn’t support this straight out of the box, but it’s common to find it in other libraries, and it’s a common technique in the relational world. It’s usually not the best approach though, because:

  • If each node label gets its own “incrementer”, the ID isn’t really unique to the graph, only to the label. This makes your key implicitly id + label, not just the ID. This is the weakest “scope of uniqueness” you can choose.
  • It has the same potential reuse weaknesses as the Neo4j internal ID.

That being said, this approach is still opaque (good), controlled by you (good), and compact/storage efficient.
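
Since Neo4j has no built-in auto-increment, one way people approximate it is with a counter node. The sketch below is only an illustration: the `Counter` and `Thing` labels are hypothetical, and it makes no attempt to handle concurrent writers, which would need more care (for example, a unique constraint on the counter’s name).

```cypher
// A single counter shared by every label, so IDs are unique graph-wide.
MERGE (c:Counter { name: "global" })
  ON CREATE SET c.value = 1
  ON MATCH SET c.value = c.value + 1
WITH c.value AS nextId
CREATE (t:Thing { id: nextId })
RETURN t.id;
```

Using one shared counter rather than one per label avoids the id + label problem described above, at the cost of making that counter node a point of contention.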

Relationship Identifiers

As of Neo4j 4.1.0, the database does not have regular b-tree relationship property indexes (it does support full-text indexes on relationship properties though). This has important consequences: it’s not possible to look up individual relationships quickly by an ID, because the database simply doesn’t store things that way. The way you find relationships is by looking up one (or both) of the incident nodes, like so:

MATCH (a:Person { id: 1 })-[r:KNOWS]->(b:Person { id: 2 })
RETURN r;

In this scenario, we’re effectively using the “from” and “to” nodes as the relationship key. The id() function still exists for relationships, and they all have internal Neo4j IDs, but we typically never need to assign property IDs to relationships. Not only are they locatable in this other way, but without property indexes, lookup by key wouldn’t be an efficient way to go anyway.
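
The same idea applies when writing. As a small sketch, reusing the `Person` nodes and `KNOWS` type from the query above, a MERGE between two nodes found by key treats the endpoints plus the relationship type as the identity of the relationship:

```cypher
// Locate both endpoints by their keys.
MATCH (a:Person { id: 1 }), (b:Person { id: 2 })
// Creates the relationship only if a KNOWS from a to b does not exist yet.
MERGE (a)-[r:KNOWS]->(b)
RETURN r;
```

Running this twice leaves a single KNOWS relationship between the two people, with no relationship ID property required.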

This article is part of a series; if you found it useful, consider reading the others on labels, relationships, super nodes, and categorical variables.
