Graph Modeling: All About Super Nodes

David Allen
Neo4j Developer Blog
10 min readOct 19, 2020

--

This article is the latest in a series on advanced graph modeling; the other articles in the series deal with keys, relationships, and categorical variables.

Today we’re going to talk about super nodes, what they are, the problems they cause, and what to do about them. This article is a summary and round-up of those issues and how they touch Neo4j specifically.

Super nodes! But not to the rescue

This article will cover what they are, how they happen, and what we can do about them. Let’s get started.

What’s a super node?

A super node is any node that happens to have a lot of relationships. How many relationships? Well that isn’t really defined, so let’s call it a lot. If a thing has way more relationships than other nodes, relative to what else is in the graph, it’s fair to call it a super node. How many is way too many? Well that is really a subjective thing, since everyone is running a machine with different memory configuration and query patterns.

It’s not hard to spot super nodes. If you were visualizing a graph and what you get is a “hairball”, then the densely connected nodes that seem like they’re connected to everything are probably the super nodes.

Super nodes put the hair into the hairball.

In extreme situations you can’t even visualize super nodes, since if you truly had millions of outbound relationships on a node, the visualization would just be a mess of black color with too many overlapping lines to see anything really.

How do super nodes happen?

There are two key causes: they can arise from the domain that you are modeling, or from your modeling choices for the domain.

From the domain itself: Densely connected networks

Imagine we modeled Twitter as a graph, with a relationship between accounts which follow one another. We know that popular accounts have many followers, so in that kind of graph structure, Lady Gaga’s node would have 82 million+ relationships. That’s pretty super, since the average user has 700 followers.

Lady Gaga the Super node was born this way (as a super node, not as a motorcycle)

Another example of a densely connected network might be (:Flight)-[:DEPARTED_FROM]->(:Airport). The Hartsfield-Jackson airport in Atlanta could easily do more than 200,000 flights per year. That’s a lot of -[:DEPARTED_FROM]-> relationships.

Many other business domains are rich with examples:

  • Big corporate bank accounts which are long-lived, and do hundreds or thousands of financial transactions per day (:Account)-[:TRANSFER]->(:Account) .
  • Other social networks, like those technical recruiters you may know on LinkedIn who seem to have a zillion contacts.

From your modeling choices: Categorical variable overload

In a previous post, I wrote about categorical variables and modeling them in Neo4j. A categorical variable is just a property that can take a small handful of values. The “cardinality” of the variable refers to how many different values it can have.

Example: Imagine you model the gender field a customer filled out as a separate node. This is low cardinality because the only options might be male, female, didn’t specify, and other. Now suppose you have 1 million customer accounts, then we can be sure that, (:Gender) nodes will be super nodes if you modeled your graph as (:Customer)-[:GENDER]->(:Gender).

Gender is just an example; it’s a general issue with low-cardinality variables, modeled as nodes, when you have a large graph with millions of nodes needing to reference the variable. Imagine Amazon.com took a handful of product categories (“Home and Garden”, “Electronics”, “Health & Wellness”) and linked every one of their tens of millions of products to a category node — Same situation.

Data has outgrown the model

Often you’ll have a simple graph model that works on a small dataset, but that you might have trouble scaling, either immediately for full production data, or later as your total dataset grows with time. Imagine that Customer/Gender example; not a big deal with 1,000 customers, and possibly a killer with 10 million customers. Similarly, Lady Gaga’s account wasn’t a problem for Twitter maybe in 2013, but that was a few hundred million users ago.

In modeling, note that the super node problem would come from a cardinality mismatch; if every product needs a category, and products have a cardinality of say, 20 million, and categories have a cardinality of 12, this massive mismatch creates probably as many super nodes as there are categories. The same pattern can be seen with any (:X)-[:RELATED_TO]->(:Y) where the cardinality of the X’s and Y’s are mismatched.

If your data has outgrown your model, it’s time to refactor your model.

Why are super nodes bad?

The trouble is how fast they branch out, giving us too many paths to consider. Put simply, it’s easy for the leaf of a tree to find the tree trunk. It’s harder for the tree trunk to find an individual leaf. Let’s look at some ways they hurt read performance.

Blowing up your traversals

Suppose you wrote a query like this:

MATCH (bob:TwitterUser { name: "bob" })
WITH bob
MATCH p=(bob)-[:FOLLOWS*]-(:TwitterUser { name: "karla" })
RETURN p

This asks for all of the paths between Bob and Karla. This query will be a disaster if anyone follows Lady Gaga, because in order for Neo4j to answer the query, it has to branch out to potentially 80 million other accounts. The number of paths to Karla will explode; Bob and Karla may both live in New York City, but rest assured, this query will need to look through every Lithuanian and Korean fan of Gaga (as well as all of their followers, and their followers followers, and so on) in order to return.

Listen to Boromir, he knows what’s up

Slowing down other reads

Another issue is relationship filtering; Neo4j (as of version 4.1) doesn’t support secondary indexes on relationship properties. Imagine a query to find all of the followers Gaga gained in 2020.

MATCH (gaga:TwitterUser { name: 'ladygaga' })<-[r:FOLLOWS]-(u:User)
WHERE r.date > date('2020-01-01')
RETURN u.name, u.followers

This query must consider all 80 million+ relationships, because you can’t index the date data type on a relationship. This becomes a bigger deal when you need to use relationship filtering on a path of multiple steps, where any potential step may run through a super node.

Slowing down writes

As of Neo4j 4.1, when you MERGE a relationship into the graph, it locks both the source and target node temporarily for that transaction. That’s quite bad for Lady Gaga, because we’re going to want to put new [:FOLLOWS] relationships in for her all the time when her next album drops. When we add a new relationship, she is (for a fraction of a second) locked for the rest of the world to follow, and that’s going to cause problems elsewhere among her international fanbase.

This post on DZone covers some other interesting facets of the performance impact, if you’d like to follow up with more detail.

Not Always the end of the world

There are a few situations where super nodes arguably don’t matter.

  • If the super node is at the end of a path, and you are not traversing through it, it will not have much impact.
  • The super node can be harmless if it isn’t part of a high-transaction rate query, or multi-user update.

OK, so what do we do?

It’s a little bit art and a little bit science — the best answer for you will depend on your domain, but here’s a toolbox of techniques to get the most out of your graph. These techniques range from simple to difficult.

  • Relationship directionality
  • Join hints
  • Lucene relationship indexing
  • Label and relationship segregation
  • Super node refactoring

We’ll cover each of these in detail, but really they all boil down to two classes of approaches — either help the database to consider fewer relationships when you run a query, (which makes super nodes less super) or else try to break up or eliminate the super node in the first place. Some techniques are a combination of both. And of course remember: You don’t have to choose one approach, feel free to mix and match.

Relationship directionality

Cypher lets us express paths as undirected (:User)-[:FOLLOWS]-(:User) or as directed (:User)-[:FOLLOWS]->(:User). Always use directed paths, because it cuts down on how many relationships your query has to consider.

You can exploit what you know about directionality with super nodes. Lady Gaga has 80 million followers (that is, inbound :FOLLOWS relationships) but she only follows 121,000 accounts (outbound relationships). Actually, when you think about that, it’s crazy how many people she follows, but still — there is a drastic difference between these two queries:

(:User)-[:FOLLOWS]-(:User { name: "ladygaga" }) (80 million + relationships)

(:User)<-[:FOLLOWS]-(:User { name: "ladygaga" }) (121,000 relationships)

Keanu learns about one of the super node’s key weaknesses

Join hints

In this knowledge base article, Andrew Bowman lays out how to use Cypher JOIN hints to constrain how Cypher does graph traversals. If you know that you have some super nodes in your graph, you can use this technique to make sure that Neo4j’s query evaluation doesn’t run into that traversal problem.

Rather than re-capping the entire article, here’s the key bit with a brief explanation of how this works:

MATCH (me:Person {name:'Me'})-[:FRIENDS_WITH]-(friend)-[:LIKES]->(a:Artist)<-[:LIKES]-(me)
USING JOIN on a
RETURN a, count(a) as likesInCommon

This construct USING JOIN ON a ensures that Cypher can only traverse to the :Artist node, but never through or away from it. This would be a good technique if the Artist in question was Gaga, with her 80 million fans.

For more information on this technique, check the planner hints section of the Cypher manual.

Lucene relationship indexing

In an earlier section, I said that Neo4j doesn’t support relationship property indexing. That’s almost true. There is actually one special case. It’s possible to use Lucene to index text properties on relationships.

Using this documentation here, you can call for example:

CALL db.index.fulltext.createRelationshipIndex("taggedByRelationshipIndex",["FOLLOWS"],["date"], { eventually_consistent: "true" })

Assuming that the date property on the :FOLLOWS relationship was a String, you could use this to filter relationships down (taking advantage of indexes), which in some limited instances, can help address the issue. This isn’t a silver bullet for the super node problem by itself. And yeah, it’s not great that we need to treat a date as a String in order to make it work.

Label and relationship segregation

Another way to approach the problem of super nodes is that the individual densely connected nodes (like Lady Gaga) look and act like the rest of the graph, when really they’re special cases. Most people (coughcough) on twitter don’t have 80 million followers. So we might segregate the super nodes out of the graph by labeling them differently. What if we introduced new labels?

Lady Gaga: Kinda not like the regular average node in the graph, kinda special case
  • (:TwitterUser) for the regular schmoes like me
  • (:Celebrity) or (:VerifiedUser) for the mega-accounts

This would make our traversals much easier, because we could still specify to go through all nodes if we wished, but we could exclude celebrities in our traversals easily if they were labeled differently.

The same goes for relationships — instead of labeling the nodes differently, we could label the relationships differently: -[:FOLLOWS]-> for regular user to user relationships, and -[:FAN]-> for unreciprocated relationships to a celebrity.

Notice what this does to our earlier query example:

MATCH p=(:TwitterUser { name: "bob" })-[:FOLLOWS*]-(:TwitterUser { name: "karla" })
RETURN p

If we just introduced a [:FAN] relationship type, suddenly this query looks very, very different and we don’t need to consider Lady Gaga. If for some reason we did want to include her, the change is trivial:

MATCH p=(:TwitterUser { name: "bob" })-[:FOLLOWS|FAN*]-(:TwitterUser { name: "karla" })
RETURN p

The difference is as simple as -[:FOLLOWS|FAN*]->

A drawback of this approach though is that it only works with domains that are naturally densely connected. If our problem was a categorical variable like (:Gender), you can’t get out of that problem with this approach. The reason it works is because we can segregate a small population of (:Celebrity) out of our larger graph. That won’t work when all of the (:Gender) nodes are super nodes.

Super node refactoring

Probably the toughest approach is to refactor your graph to break down a single super node into multiple nodes. You could take Lady Gaga’s single node and break her into, say, 1,000 copies of Gaga. Create a series of equivalence relationships, i.e. (gaga1:TwitterUser)-[:SAME_AS]->(gaga2:TwitterUser) and then have a strategy for distributing your applications relationships between the 1,000 “Gaga Clones”.

This is a last resort strategy.

It will only help with certain query patterns though, for several reasons:

  • Queries need to be rewritten to be able to consider the SAME_AS relationships if they need to know everything about Gaga, not just touch their local clone of that node.
  • If you needed to traverse through Gaga, you’re still back in the same situation; after traversing to all of the clones, still you end up with a total cardinality of 80 million+ depending on the query.
  • It’s more complicated to choose and implement a strategy for distributing relationships between “Gaga clones” and keeping it balanced so that one of the clones doesn’t gradually grow to become its own super node, as Gaga attracts more fans over time.

Conclusion

Wrapping all of this together, here are the key points to remember:

  • Super nodes are those that are densely connected in your graph, with a very high number of incident relationships relative to the average found in your graph.
  • Super nodes are bad, because they compromise both read and write performance.
  • Super nodes come from several different places, but most usually come from the natural complexity of your domain, and your modeling choices.
  • You’ve got a bunch of different tools in the toolbox to address the issue, and for advanced graph modeling, it’s an issue you definitely want to be aware of to get the best performance out of your graph.

Happy graph hacking!

This article is part of a series; if you found it useful, consider reading the others on labels, relationships, super nodes, and categorical variables.

--

--