How to Explain Index-free Adjacency to Your Manager

Figure 1: Explaining index-free adjacency with the direct neighbor walk metaphor — a fast physical memory lookup.

Index-free adjacency is the most important concept to understand when learning about native graph databases. Native graph databases have very different use cases from non-native graph databases, which are usually layered on top of other databases that are not optimized for fast graph traversal. Having a clear mental model of these differences is essential to writing a clear business justification for any Enterprise Knowledge Graph project. However, it is not enough for you alone to understand index-free adjacency: you have to be able to explain it to your non-technical peers and, most importantly, to your manager and anyone else who will authorize your project. We have developed a metaphor for explaining index-free adjacency that seems to work. We call it the “direct neighbor walk” metaphor. Here it goes:

Suppose you live next to a cute neighbor (let’s call her Ann), and on Pi Day you want to take her a hot apple pie to impress her. You text Ann and ask her if she likes apple pie and she says “yes!”, so you bake her a pie, and now you want to get it to her house quickly while it is still hot. Here are the steps (see Figure 1):

  1. Walk out the door
  2. Turn towards Ann’s house
  3. Walk directly over to Ann’s house and give her the hot apple pie — and she loves it! Happy ending!

That is how native graph databases look up adjacent nodes in a graph: they do a direct walk of memory. Each node stores the direct physical RAM addresses of its neighboring nodes. Figure 2 shows a “logical” diagram of this graph:

Figure 2: Logical model of two neighbors.

The key is that native graph systems don’t have to consult any other data structures or indexes to hop from a node to its neighbors. When you design your graph you explicitly add links between logically related nodes, and these links store the fixed addresses of the neighboring nodes.
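To make this concrete, here is a minimal sketch in Python of index-free adjacency as a data structure. The Node class is hypothetical and purely for illustration, not the API of any particular product: each node holds direct references to its neighbors, so a hop is just following a stored pointer.

```python
# Minimal sketch of index-free adjacency (hypothetical Node class, not any
# particular database's storage format). Each node holds direct references
# to its neighbors, so a hop is a single pointer dereference -- no index.

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct references (in a native graph: RAM addresses)

    def link(self, other):
        # When you design your graph you explicitly add links between
        # logically related nodes, in both directions.
        self.neighbors.append(other)
        other.neighbors.append(self)

you = Node("You")
ann = Node("Ann")
you.link(ann)

# The "direct neighbor walk": follow the stored reference, no central index.
for neighbor in you.neighbors:
    print(f"Delivering a hot pie to {neighbor.name}")
```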

Now let’s compare the native graph architecture with the way relational databases or other non-native graph databases might do this. This is described in Figure 3.

Figure 3: Using a central index and search to find the address of a neighbor’s house.
  1. Walk out the door
  2. Walk downtown to the DMV where they have a service that tells you how to get to your neighbor’s house
  3. When you get to the DMV you take a number
  4. When your number comes up you get called by a search agent
  5. The search agent at the DMV has a sorted list of all the people in your town (this is the index). They search the list using a binary search algorithm and finally give you the GPS coordinates of your neighbor’s house.
  6. You take these GPS coordinates, enter them into your GPS tracker and follow the directions to Ann’s house.
  7. When you get there the pie is cold. She gives you a low Net Promoter Score on Neighborhood.com. She posts: “Nice guy but it takes him forever to deliver pies”.

This story summarizes the difference between a native-graph implementation of a graph database and non-native variations that are layered on top of other databases. Native graphs take data that is logically connected via arcs or relationships and hard-wire the physical RAM addresses of these items into the node, so lookups are fast! Related nodes are sometimes even stored next to each other in memory, which maximizes the chances that the data is already in your CPU’s cache. As a result, you can “hop” between roughly a million nodes per second for each core on your server. A typical server with 16 cores can evaluate around 16 million hops per second, and an Amazon x1e.32xlarge server with 128 vCPUs around 128 million hops per second. And for those of you with NSA-scale budgets, a Cray Graph Engine with 8,192 cores will do 12.8 billion traversed edges per second (12.8 GTEPS) on the Graph500 SSSP benchmark.
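If you want a feel for what a single hop costs, here is a rough single-core experiment you can run yourself. It is a sketch in plain Python, which is interpreted and far slower than a native graph engine written in C or C++, so treat the printed number as an illustration of the idea rather than a benchmark of any product.

```python
# Rough illustration of "hops per second" on one core. Python object
# references are not raw RAM addresses, so a real native graph engine will
# be much faster, but the shape of the work is the same: each hop is one
# pointer dereference.

import time

class Node:
    __slots__ = ("next",)
    def __init__(self):
        self.next = None

# Build a ring of one million nodes, each pointing directly at the next.
N = 1_000_000
nodes = [Node() for _ in range(N)]
for i in range(N):
    nodes[i].next = nodes[(i + 1) % N]

start = time.perf_counter()
current = nodes[0]
for _ in range(N):            # one million hops
    current = current.next    # a hop = following a stored reference
elapsed = time.perf_counter() - start

print(f"{N / elapsed:,.0f} hops per second on one core (interpreted Python)")
```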

The thing to remember is that relational databases, ironically, are very slow at looking up relationships. They are sometimes called “row stores” because their primary design goal is fast row-by-row access. Why? Because early COBOL systems that read from punch cards read one card at a time, so they grouped all the fields of a record together in a row. We inherited those COBOL flat-file designs and retrofitted relationship lookups later as an afterthought. So if you have a big flat file without any relationships, access is really fast, and for simple flat files you don’t need to look up relationships with a search function at all. What is worse is that each relational JOIN resolves relationships by searching an index: a binary search over a sorted index is an O(log(n)) operation, so each lookup takes one more step every time you double the size of your database, and the more JOINs you have, the slower you get. Relational databases excel at queries on small datasets with a uniform flat structure. Native graph databases include Neo4j, ArangoDB, OrientDB (now owned by SAP) and TigerGraph.
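To make the O(log(n)) point concrete, here is a toy comparison. It greatly simplifies real JOINs, which use B-trees, hash joins and query planners, but it shows the difference in work per lookup: the indexed lookup does a search whose cost grows with the size of the table, while the graph-style hop just follows a stored reference.

```python
# Toy comparison of an index lookup vs. a direct hop. The names and data
# here are made up for illustration; real relational engines use B-trees
# and hash joins, but the per-lookup search cost has the same logarithmic
# flavor.

from bisect import bisect_left

# "Walk downtown to the DMV": a sorted index over one million names.
# Each lookup is a binary search (~20 comparisons here), and it gets one
# comparison slower every time the data doubles.
names = sorted(f"person_{i:07d}" for i in range(1_000_000))

def index_lookup(name):
    i = bisect_left(names, name)
    return i if i < len(names) and names[i] == name else None

# "Walk next door": the relationship is already stored as a direct
# reference, so there is nothing to search.
neighbors = {"you": ["ann"]}

print(index_lookup("person_0003141"))   # O(log n) work, repeated for every JOIN
print(neighbors["you"])                 # O(1) hop
```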

We don’t want to imply that there isn’t value in layering graphs on top of other databases. If you are not looking for complex patterns in large datasets, there can still be value in storing graph data in your existing infrastructure and using graph visualization tools to view its structure. If your company has under 10,000 employees, then viewing organizational charts with data from an RDBMS is fine, as long as you can write the SQL queries to do it (a topic for a future blog post). There may also be other considerations, such as security or distributed-data requirements, that make non-native graph systems more appealing.

Finally, let’s remember that there is no single ideal architecture for storing all types of graph data. Index-free adjacency also means that if you have nodes with a very large number of relationships (a.k.a. supernodes), many links will need to be updated when you make changes. For example, removing a movie star with millions of followers from a social network requires millions of link updates. Granted, this might be a rare event on a specific type of network, but it is something to consider when weighing the tradeoffs.
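As a sketch of that tradeoff (hypothetical in-memory structures, not how any particular engine stores its links), removing a supernode means touching every one of its incident links, because each follower holds a direct link back:

```python
# Hypothetical illustration of the supernode tradeoff: every neighbor stores
# a direct link back to the supernode, so removing it means one update per
# incident link.

followers = [f"fan_{i}" for i in range(1_000_000)]
graph = {"movie_star": set(followers)}
for fan in followers:
    graph[fan] = {"movie_star"}          # each fan stores a direct link back

def remove_node(graph, node):
    for neighbor in graph.pop(node):     # one update per incident link...
        graph[neighbor].discard(node)    # ...a million of them for this node

remove_node(graph, "movie_star")
```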

Does this metaphor work? Do you have other metaphors that explain index-free adjacency? Let me know in the comments.