Graph Data Modeling: All About Relationships

Published in the Neo4j Developer Blog, Oct 23, 2019

In my last article on graph data modeling, we talked about categorical variables, and how to choose whether to model something as a node, property, or label.

With that out of the way, it’s time to go deep on relationships: what they are, what they mean in a domain, and how to use them.

We’ll use a social network site as the running example, because it makes an easy-to-understand graph. People create “friend” relationships among their accounts, “post” content, and “like” content.

Relationships are verbs!

The simplest way to think about relationships is to write declarative sentences about our domain. Write some true facts, and isolate the “nouns” and the “verbs”. Example:

A person posts an article
A person friends another person
A person has an interest

In this simplified view of the domain, all of your nouns are nodes, and all of your verbs are relationships. The relationship type is the third-person singular form of the verb. This implies a graph that looks like (:Person)-[:POSTS]->(:Article) and so on.
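As a rough sketch, the three sentences above could be written down in Cypher like this. The property values, the :Interest label, and the relationship type names FRIENDS and HAS are illustrative assumptions, not a fixed schema.

```cypher
// One node per noun, one relationship per verb.
CREATE (david:Person { name: "David" })
CREATE (sarah:Person { name: "Sarah" })
CREATE (article:Article { title: "My first post" })
CREATE (guitar:Interest { name: "Guitar" })
CREATE (david)-[:POSTS]->(article)   // A person posts an article
CREATE (david)-[:FRIENDS]->(sarah)   // A person friends another person
CREATE (david)-[:HAS]->(guitar)      // A person has an interest
```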

If you’ve ever seen RDF, this approach should strike you as pretty similar to the RDF triple concept of subject-predicate-object, where the relationship is basically the predicate of the sentence. Grammatically, the first noun in the sentence is the subject, and the second noun is the object. So boiling your data model down into these sets of 3 elements is a common graph approach (in RDF-land they actually call them triples!)

Under the simple explanation, the task is to decompose your domain into a large batch of simple declarative sentences. This gives you a pile of nodes and relationships to work with. You then have most of your model, and mostly have to make naming decisions.

This is nice…but also too simple. Let’s go deeper and talk about what relationships do in a model, and what they really mean aside from the simple comparison to verbs in a sentence.

Relationships normalize data

Data Normalization is often a topic that’s discussed for relational databases, but it applies to graphs as well. What’s data normalization all about?

“The goal of data normalization is to reduce and even eliminate data redundancy, an important consideration for application developers because it is incredibly difficult to store objects in a relational database that maintains the same information in several places.” (Reference)

When we factor a piece of data (let’s say a Person’s primary interest) and put it into a separate node, linked by a relationship, one of the things we’re implicitly doing is storing that interest once, instead of with every single node. Here’s a simple example.

Denormalized:

CREATE (p1:Person { name: "David", interest: ["Guitar"] })
CREATE (p2:Person { name: "Sarah", interest: ["Guitar"] })

Normalized:

CREATE (s:Interest { name: "Guitar" })
CREATE (p1:Person { name: "David" })
CREATE (p1)-[:HAS]->(s)
CREATE (p2:Person { name: "Sarah" })
CREATE (p2)-[:HAS]->(s)

Notice that in the normalized example, “Guitar” is stored once, and in the denormalized example, it’s stored twice. The fact that we could link the Person nodes to the separate Interest allowed us to normalize our data.

Going back to that data normalization article, the advantages of doing it this way are higher cohesion and less data duplication. What if we wanted to change the name of the interest from “Guitar” to “Six String Guitar”? With the normalized example, we only have to update one string and we’re done. In the denormalized example, we might have to update millions of nodes! That’s where low cohesion and duplication start to hurt.
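To make the difference concrete, here is a minimal sketch of the rename against each of the two models from the earlier examples. The two statements are independent of each other; run whichever matches the model you have.

```cypher
// Normalized: the name lives on one Interest node, so one update is enough.
MATCH (i:Interest { name: "Guitar" })
SET i.name = "Six String Guitar";

// Denormalized: every Person that carries the value has to be rewritten.
MATCH (p:Person)
WHERE "Guitar" IN p.interest
SET p.interest = [x IN p.interest |
  CASE WHEN x = "Guitar" THEN "Six String Guitar" ELSE x END];
```

The first statement touches a single node no matter how many people share the interest. The second has to scan and rewrite every matching Person.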

How do you know if your data is normalized “enough”? How far should you go with this? This raises the topic of “Normal Forms”, which is always taught in data modeling for relational databases. Just as traditional relational databases have 1st, 2nd, and 3rd normal forms, plus things like Boyce-Codd Normal Form, graph models can apply the same ideas.

All of the theory that applies here translates directly to graphs, but other sources usually write it in terms of tables. So where this image talks about “columns”, we can translate that to “node property” and we’re pretty much still good. The full translation of how to apply normal forms to graph data modeling will have to wait for another article, but you should check out these techniques, because they have direct relevance to graphs. The definitions of what it means for one property to “depend” on another, and so forth, can be carried over directly.

Know your relationships’ semantics

When we think about what relationships actually mean in a model, there are several different kinds. When modeling it’s useful to know which of these is your desired relationship, because it constrains how many instances of the relationship can exist, and what it actually means in the real world.

A relationship has a domain (the thing it’s coming from) and a range (the thing it’s going to). We can think of a relationship as a function that maps one node to another.

Tip: for each relationship in your domain, figure out what kind of relationship it is in the real world, because it will help you and your model users understand what they can and can’t do with this data.

“HAS A”: this expresses a part/whole relationship, otherwise known as “composition”. For example in our social network, HAS_INTEREST is a “HAS A” relationship, because people may have many. If what you’re trying to do is create a “bag of items” you’re probably dealing with a HAS A relationship.
“IS A”: this expresses an inheritance relationship between a parent and a child. Believe it or not, these don’t come up that often in property graph modeling partially because they’re so easy to do with labels. For example I could have either (:Person:Employee { name: "John" }) or I could have (:Person { name: "John" })-[:IS_A]->(:Employee). The former comes up more often.
Functional: the relationship acts like a true function, meaning that given a single domain node, there can be only one range node. An example of a functional relationship is HAS_HOME_ADDRESS. You should probably have one of those, not 5. But a relationship like HAS_INTEREST clearly is not functional, because our users can have many different interests!
Transitive: if the relationship is true from A to B, and from B to C, then it’s also true of A to C. An example of this would be IS_RELATED_TO. My grandfather is related to my father. My father is related to me. So IS_RELATED_TO is transitive, because my grandfather is related to me. Going back to our HAS_INTEREST relationship, this one isn’t transitive: an interest has no interests of its own, so there is no chain from A to B to C to follow. (It isn’t symmetric either. I might be interested in guitar, but that doesn’t mean that it’s interested in me!)
Reflexive: this one doesn’t come up as often in property graph modeling, but it means that the relationship implies every node has one of these to itself. For example (:Person)-[:KNOWS]->(:Person) is reflexive, because all people know themselves. It turns out that IS_RELATED_TO would be reflexive too! But HAS_HOME_ADDRESS clearly is not. Note that with reflexive relationships, the target label (:Person) is going to be the same as the source label…because it’s reflexive!
Symmetric: if the relationship is true one way, it’s true the other way too. Again, in the case of (:Person)-[:KNOWS]->(:Person), it’s symmetric because if A knows B, you can be pretty sure B knows A. But you can see how a relationship like HAS_HOME_ADDRESS is not symmetric, because an address can’t have a person as its home address; that would make no sense.
Vanilla: I just made this term up. You heard it here first folks. But I’ll refer to any relationship that doesn’t have any of the above properties as “vanilla”.
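Knowing the kind of relationship also tells you how to query or check it. The queries below are a sketch only: the :Address label, the name property, the person named "David", and the upper bound of 5 hops are assumptions made for illustration.

```cypher
// Functional: flag any Person that has more than one home address.
MATCH (p:Person)-[:HAS_HOME_ADDRESS]->(a:Address)
WITH p, count(a) AS addresses
WHERE addresses > 1
RETURN p.name, addresses;

// Symmetric: store KNOWS once and match it without a direction.
MATCH (me:Person { name: "David" })-[:KNOWS]-(other:Person)
RETURN other.name;

// Transitive: follow IS_RELATED_TO across several hops (bounded here at 5).
MATCH (me:Person { name: "David" })-[:IS_RELATED_TO*1..5]-(relative:Person)
WHERE relative <> me
RETURN DISTINCT relative.name;
```

The first query acts as a data-quality check: a functional relationship should never return rows from it.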

In other graph modeling disciplines they might add others, such as asymmetric, irreflexive, and others, but we don’t typically need to consider these in property graph modeling because we can traverse relationships bi-directionally, and we’re not usually trying to use them for logical inference as in the RDF/OWL world.

Tip: A good practice is to minimize your vanilla relationships. Knowing what other type applies is useful to the semantics of your model, and useful to application constraints, so ideally you’d like to have that with all of your relationships if possible. But sometimes it isn’t! So don’t be a purist about it. Rule of thumb: if you’re not sure, it’s vanilla.

One thing that may have to wait for another article is how to decompose a vanilla relationship into a collection of more specific relationships of the types above. It’s enough for now to hint that you can do this, and it’s an interesting exercise to see how you might do it in your model.

Hint: You do it by refining what the vanilla relationship means and getting more specific. In a later section, we’ll give one concrete example of how reification can be used to refactor vanilla relationships.

Deciding on relationship properties

These relationship types impact an important downstream consideration: whether to put properties on your relationship. When we assert properties about our relationship, we’re putting metadata on it.

The most important things to know are:

Most commonly, relationships don’t have properties at all.
The next most likely scenario is that they only have administrative metadata, such as when the relationship was created, updated, who created the relationship, or a “version” integer to let us version relationships. This “administrative metadata” can be placed on any kind of relationship, because it’s generally divorced from the semantics of the relationship type, and so it can fit with any relationship type.
The next likely scenario is that the relationship property will actually be path metadata, for example a weight or a distance.
Some systems, such as Neo4j at present, cannot index relationship properties, with the exception of specialized full-text indexes. This means there’s a built-in reason to avoid them when you can, because your queries won’t necessarily get faster or more selective by placing criteria on a path that hinge on a relationship property value.
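As a sketch of the two common cases in the list above: the first statement puts administrative metadata on a relationship, and the second sums path metadata along a route. The FRIENDS type, the :City label, the ROAD type, and the distance property are hypothetical names chosen for illustration.

```cypher
// Administrative metadata: when the relationship was recorded, and a version.
MATCH (a:Person { name: "David" }), (b:Person { name: "Sarah" })
CREATE (a)-[:FRIENDS { created: datetime(), version: 1 }]->(b);

// Path metadata: add up a `distance` property along each candidate path
// and keep the shortest one.
MATCH path = (:City { name: "A" })-[:ROAD*1..4]->(:City { name: "B" })
RETURN path,
       reduce(total = 0, r IN relationships(path) | total + r.distance) AS totalDistance
ORDER BY totalDistance
LIMIT 1;
```

Neither query filters on a relationship property to find its starting point. Both begin from node properties, which is where indexes can help.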

Drake knows his property graph data modeling

Because of all this, if you find yourself wishing you could put an index on a relationship property, you should most likely think about factoring that relationship out into a node. You are probably starting to think about this relationship as a first-class object in its own right, rather than just a relation of two other things.

As always, maintain flexibility to do what’s right for your domain, but in general try to minimize properties on relationships unless you have a compelling reason you can clearly explain. In that case, that’s what they’re there for.

Reifying relationships

To “reify” means to take something abstract and make it concrete. When we think of relationships as objects themselves (for example, using a relationship as a bank transfer, “smuggling a noun into a verb” so to speak), things start to get hard for the previous reasons. So the answer is to reify the relationship or turn it into a first-class node all its own, and then simplify.

“Reifying a relationship” is also sometimes referred to as making an intermediary node, or creating a “hyper-edge”, where the node is standing in for a relationship that can itself have relationships.

An example is a bank transfer:

/* Very bad */
CREATE (a1:Account)-[:TRANSFER {
    id: 555,
    amount: 123,
    currency: 'USD',
    bank: 'Wells Fargo',
    time: '2pm'
}]->(a2:Account)

Designs like this start out seeming like they make sense, because an account transfer is a flow of money between two accounts. But this is a very bad model, because:

An account transfer is itself a thing in our domain with rich properties, not just a verb.
It lacks the opportunity to normalize data. We can’t link to a separate pre-established :Bank or :Currency node, because relationships can’t have relationships!
It’s going to paint us into a corner. When it comes time to denote that transaction 555 has been cancelled, how will we do that? It’s not extensible.
The :TRANSFER relationship is vanilla. Does it have any of the properties above that we discussed? It’s not symmetric, functional, transitive, or reflexive. It’s also not HAS_A or IS_A! It’s a mess.
You now have a bunch of unindexed properties, so looking up all USD-based transfers from Wells Fargo is going to be painful.

Better is to reify the :TRANSFER relationship into a separate node:

/* Much better */
MATCH (wf:Bank { name: "Wells Fargo" })
MATCH (c:Currency { name: "USD" })
CREATE (a1:Account)-[:INITIATES]->
   (t:BankTransfer { id: 555, time: '2pm', amount: 123 })
      -[:RECEIVES]->(a2:Account)
CREATE (t)-[:ORIGIN_BANK]->(wf)
CREATE (t)-[:CURRENCY]->(c)

Notice what happened here!

[:TRANSFER] was reified to :BankTransfer.
Data normalization was improved by reusing Bank and Currency nodes, improving cohesion and reducing redundancy.
Relationships got more numerous, but they all got simpler, and easier to explain: ORIGIN_BANK, CURRENCY are functional relationships!
By redefining the TRANSFER relationship sides to INITIATES vs. RECEIVES, it clarifies the intent of what’s going on. RECEIVES becomes a functional relationship (only one account can receive a transfer) and INITIATES arguably becomes HAS A.
Standard indexing applies, so looking up all USD-based transfers from Wells Fargo will be fast!
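Here is a sketch of what the reified model buys us in practice, using the labels and relationship types from the example above. The status and cancelledAt properties and the index name are assumptions. The CREATE INDEX form shown is the Neo4j 4.x and later syntax; older versions use CREATE INDEX ON :BankTransfer(id) instead.

```cypher
// An ordinary node index on the transfer id.
CREATE INDEX bank_transfer_id FOR (t:BankTransfer) ON (t.id);

// All USD transfers that originated at Wells Fargo.
MATCH (t:BankTransfer)-[:ORIGIN_BANK]->(:Bank { name: "Wells Fargo" }),
      (t)-[:CURRENCY]->(:Currency { name: "USD" })
RETURN t.id, t.amount, t.time;

// Cancelling transfer 555 is just one more property on the node.
MATCH (t:BankTransfer { id: 555 })
SET t.status = "CANCELLED", t.cancelledAt = datetime();
```

The lookup starts from the shared Bank and Currency nodes and walks out to the transfers, so it never has to filter on a relationship property.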

Conclusion

In this article we’ve covered data normalization, the connection between relationships and verbs, the different types of semantics a relationship can take on, and how to reify complicated relationships into simpler ones.

In teaching articles, we’re always limited by needing to show simple examples; the challenge for you as the modeler will be to apply the principles to your domain in real work, and there’s only one way to do that: practice.

Happy graph hacking!

This article is part of a series; if you found it useful, consider reading the others on labels, relationships, super nodes, and categorical variables.