Graph Data Model and Graph Native Storage & Processing

Published in

Noon

10 min readSep 12, 2022

Context

You don’t have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy — Jackie Stewart, Formula One racing driver.

What this means for software is, if you can feel how it works internally, you are in a better position to use it optimally.

This post is an attempt to help you have the mechanical sympathy for graph native databases.

Data Model and Data Storage

Data Model and Data Storage for databases are two separate concerns.

Fig 1 : Data abstraction at different layers of an application.

Every layer in a software abstracts the complexity of the underlying layer by exposing a data model. The shape and form in which we deal with data at a particular layer is essentially different from the way it is handled at the layer below it.

The same holds true for databases as well. They expose different kind of data models — relational, document, graph, etc using which we interact with the underlying data. Each data model has some inherent assumptions and constraints which have a profound effect on the access patterns. For example, a relational data model exposes data in the form of tables, where each table has multiple rows and columns. But, this is not how data is persisted on disk. When using a relational database, we think about the problem space represented in the form of tables and associations between them and we handle reads or mutations on the tables. At the application level we don’t have to worry about the B-tree or the data pages at the physical storage level.

Graph Data Model and Graph Native Storage

The graph data model allows us to represent our domain in the form of entities and relationships.

Fig 2 : Example of a social setup modelled as graph.

Let us take a simple example where we model a social setup where we have people and they can know the other person, they can be friends or they don’t know the other person. Also, there are some inherent properties of these type of associations between people that we are dealing with — when you know someone, that does not mean the other person will know you but, a friendship, unlike knowing someone is a mutual kind of association. So, if we have to model this as a graph, every person is considered an entity and is represented as a vertex. The associations between them are represented as edges connecting their corresponding vertices. The knows relationship is inherently a directed relationship whereas the friend relationship is bi-directional or simply undirected in nature. The absence of an edge between two people means they don’t know each other. This graph data model is represented in the diagram above.

This is the logical model — the Graph Data Model — of our problem space and when dealing with this, all our operations are defined in terms of some read/write operations on the vertices and/or edges. But, this is not how data is persisted on disk. There are different ways in which the different graph databases store the graph on disk. Here we will focus on one specific way of storing this — the graph native storage.

Being graph native means that the storage is designed to store graphs from ground up and are not an afterthought, where the graph data is fitted into some key-value stores or the relational databases. Relationships are first class citizens in a graph native database and the concept of index-free adjacency is at the heart of being graph native.

Index-free adjacency

Index-free adjacency means that you don’t need a central index to know the neighbours of a given node. Each node locally stores the reference to all their neighbours. This is a key differentiator, as it results into query performance that is independent of the total size of the graph and depends on just the size of graph you need to traverse.

So, how is this implemented ? Let’s see how Neo4j goes about this.

Physical storage of graph in Neo4j

The graph is stored in multiple store files of different types. There are separate store files for nodes, relationships, labels and properties of nodes and relationships. This way the storage of graph structure is separate from the properties of nodes and relationships.

The Node files store the nodes of the graph. Each node is represented in the store file as a fixed size record. The nodes are assigned an internal id, using which their start offset in the store file can be computed in constant time. This eliminates the need to maintain any index to locate the nodes.

Each node record, contains the ID of 1st relationship visible from that node and the ID of 1st property of that node. A very simple representation of a node record looks as in the diagram below. There are few other fields in a node record but for simplicity I have not captured them. We’ll be building on top of this representation of a node record in the sections to come, but please bear in mind that this is not a complete representation of a node record. In the diagrams to follow, I have captured only those elements of node/relationship record that help us picture the graph structure.

Fig 3 : Partial representation of a node record.

Similar to the Node store files, the Relationship store files contain fixed sized relationship records each having an internal ID using which the records can be located in the file in constant time.

Each relationship record contains the following apart from some other fields :

IDs of the source node(first node) and destination node(second node)
ID of relationship type
IDs of the prev and next relationship of the source node
IDs of the prev and next relationship of the destination node

Fig 4 : Partial representation of a relationship record.

For all these IDs, we can think of them as pointers to their corresponding records. And with the previous relationship ID and the next relationship ID for both first node and the second node, we are actually maintaining a doubly linked list of the relationship chain of the nodes. This is important to realise — every node in the graph maintains a pointer to its first relationship, and the relationships are stored as doubly linked lists.

Now that we have seen how the nodes and relationships are stored in store files, let’s see how the example in Figure 2 is physically stored.

Fig 5 : Physical storage representation of example in Fig 2

In Figure 2, we see that node A has 3 relationships :

A knows B
A knows C
A is a friend of H

So, how do we get to know this from the physical storage (Fig 5)?

The node record for A contains the pointer to its first relationship record — “A knows B”. Few things worth noting in this relationship record :

A is stored as the source node(first record). Actually the internal ID of A is stored.
B is stored as the destination node(second node).
The relation type is KNOWS — actually the corresponding ID is captured here.
Pointer to the previous relationship of A in its relationship chain is empty as this is the 1st relationship in the relationship chain.
Pointer to the next relationship in A’s relationship chain is pointing to the relationship record for “A knows C”.

In Fig5, pointer to the next record in A’s relationship chain is represented in dark red colour and the pointer to the previous record is represented in light red. This way we can visualise the relationship chain of node A and in-fact the relationship chain of all other nodes, represented as a doubly linked list with next pointers in dark colour and the previous pointers in corresponding lighter shade.

Graph Traversal

Next, let’s see how the graph native storage discussed in the last section supports graph traversals. We try to understand this by answering some traversal queries on the example covered in Fig 2 and its corresponding physical storage represented in Fig 5. But, before we get there, let us see how direction in relationships are handled in Neo4j.

Relationships and Directions

Every relationship in Neo4j has an implicit direction, but it allows you to traverse a relationship in any direction with equal efficiency. Let’s break this down.

In Fig 4, we saw that for representing a relationship connecting two nodes as a relationship record, one of the node has to be designated as the first node or the source node, and the other as the second node or the destination node. This means there is no getting around the idea of direction in a relationship —its rooted down to the level of physical storage. So when we have to deal with directionless or bi-directional relationships, the best we can do is — disregard the fact that one node is a source node and the other is a destination node, and this is what Neo4j lets you do. When creating a directed relationship, we specify the corresponding source node and the destination node, and for undirected / bi-directional relationships we randomly choose one as the source and the other as destination and ignore the direction while reading.

But, how does it allow traversing a relationship in any direction with same efficiency ?

Refer to Fig 5. We saw that all the relationships visible from a node (incoming or outgoing) are captured in its relationship chain, which is essentially a doubly linked list. So, for traversing any type of edge — incoming, outgoing, directionless — from a given node, we need to scan the node’s relationship chain. For outgoing edges from a given node, we only select those relationship records from its relationship chain where this node is at the source position (first node in the relationship record), for incoming edges we select only those records where this node is at the destination position (second node in the relationship record) and for undirected/bi-directional edges, we don’t care whether this node is at the source or destination position as long as the record is present in the node’s relationship chain.

Let us try to understand this through some specific traversal queries on our example (Fig 2 & Fig 5)

Query 1: Get all the people known by A

This involves anchoring the query at node A and scanning its relationship chain. We achieve this through the following sequence of steps :

From the internal id of A, we calculate the location of A’s node record in its corresponding node store file.
The node record of A contains the reference to the first relationship record in the relationship chain of A, which is the relationship “A knows B”. Using the relationship ID, we calculate the position of corresponding relationship record in the relationship store file. This gives us access to the relationship chain of A that contains 3 relationship records : “A knows B”, “A knows C” and “A is a friend of H”.
To get all the people known by A, we scan the relationship chain of A and select all those relationship records which satisfy the following conditions :
- A is at the source position in the relationship record. We do this because “knows” is a directed relationship and to get people known by A, we need to follow the outgoing edges from A.
- The relationship type in the relationship record is “knows”.
This gives us the relationship records “A knows B” and “A knows C”. From these relationship records, we see that the destination nodes are B and C. We collect their information from their corresponding node records using their ids from the relationship records.

Query 2: Get all the people who know B

Here, we need to find all those “knows” relationship records that have node B as the destination node. We know that all the relationships visible from B (incoming/outgoing/directionless) are present in B’s relationship chain. So, to get this we need to anchor the query at node B, get its relationship chain, and scan it to select all those relationship records where B is present as the destination node and has the relationship type - “knows”. These selected relationship records would then lead us to their corresponding source nodes giving us all the people who know B.

In the last two examples we traversed the directed “knows” relationship in both the directions with equal efficiency. That is why when modelling the “knows” relationship between two entities, we don’t need to maintain an explicit reverse “known_by” relationship.

Query 3: Get all the people known by friends of H

We need to anchor the query at node H and then we need 2 hops from there. First, following the undirected “friends” relationship from H to get its friends, and then from all the friends of H, following the “knows” relationship in forward direction. This would require the following sequence of steps :

From the relationship chain of H, select all those relationship records that have relationship type “friend”. Since this is an undirected relationship, we don’t need to worry if H is at the source or at the destination position. From all these selected relationship records, we get the node records for friends of H.
For every node record, corresponding to the friend of H, we need to repeat the steps explained in Query 1 to get all the people known by friends of H.

Summary

We represented a social setup having two simple relationship types through a graph data model. We looked at what is a graph native storage and tried to get a feel of how this data model gets stored at the physical level in a graph native manner in Neo4j, where each node maintains its own relationship chain through a doubly linked list. We also tried to understand how this physical storage supports the graph traversals and how direction of relationships come into play.

With this, we can see how the physical storage of a graph native database is fundamentally different from the B-tree or the LSM tree that are mostly used in relational or no-sql data stores.

Noon is a social learning platform and we firmly believe that learning is inherently social in nature and that is reflected across all the learning experiences that we offer to our students. Our graph infrastructure is at the heart of all the social experiences at Noon. You can read more about our graph infra here.

We are always on the lookout for enthusiastic engineers who love to solve complex engineering problems that have a meaningful impact on the lives of our students. Hit me up on LinkedIn if you are interested.