Modeling Graph Relationships in DynamoDB

Published in

Developing Koan

12 min readJul 8, 2020

Koan is centered around goals, and how those goals connect to teams and people within a company. These connections can often be complex: the same person leading a goal on one team may be working toward several other related goals. To model this, we use a graph database built on top of DynamoDB that we call the records table.

A tree view from Koan’s product, showing a hierarchy of goals within a company

A Motivating Example

Our application generally shows lists of objects with many connections to other objects in the system immediately visible. Doing this efficiently with DynamoDB using a traditional normalized table design served us poorly. In particular, consider the case of our goals list: a goal list may easily include thousands of goals, each of which may have a large set of relationships to teams, users, other goals, and interaction data elsewhere in the system. Earlier versions of Koan solved this very naively, by simply fetching all of a company’s goals from the database at once, and sending them to the client. Naturally, this became prohibitively slow, and needed a pagination model. To handle all of this with DynamoDB, we designed what we call the records table.

A screenshot of Koan’s product showing a list of team goals, with lead information, team information, etc.

The above screenshot is showing a list of a team’s goals: two Objectives, one with four Key Results and the other with three Key Results and two sub-objectives. For each objective and key result, we need to display one or more leads (the people in charge of delivering the goal) as well as one or more teams associated with the goal.

So the data we’re showing here can be visualized as:

A graph diagram showing the relationships between goals, users, and teams inside Koan.

This is just a small subset of what we’re showing, of course! Dora Campos happens to be leading two of the goals we see here, but she may also be contributing to several other goals, and may be a member of multiple teams. Goals like “10 nationwide press events” may have their own sub-objectives and key results, and so on.

Modeling this is challenging because:

We need to maintain a variety of connections between a variety of model types
We need to show multiple types and levels of associations at once in our UI
We need to show the same data in different ways: a team’s goal list may show data that heavily overlaps with a user’s goal list on their profile, which may overlap with a goal permalink, etc.
Showing a list of goals is just one of several places where we have complex relationships to display. Users can be associated with teams in several ways as well, such as being a team manager, a team member, or a read-only observer.
If there are too many goals to reasonably fetch in a single round trip from the client, we need a way to paginate the results.

So with the motivation of querying for the goals belonging to a team, in addition to those goals’ leads and teams, let’s try to model this in our database.

Modeling this with DynamoDB

To solve this, we built the records table: a single DynamoDB table capable of housing all of our model types and all of the connections between those models. In graph parlance, each item in this records table is either a “node” or an “edge.”

A node is a first-class object in the Koan universe. Examples include goals, users, teams, and so on.
An edge is a representation of the relationship between two nodes in the graph. An example is a goal membership, which is a connection between a goal and a user, team, or org. Edges are directed (i.e., an edge from A to B is not an edge from B to A).

Items in a DynamoDB table are uniquely identified by a combination of their partition key and sort key. An important characteristic of DynamoDB is that it’s well-suited for querying for specific items by ID, or for querying based on sort key within a single partition key, but not well-suited for querying across partition keys.

So how do we actually store these nodes and edges in the database if everything is organized by partition key and sort key? It turns out DynamoDB’s key structure works naturally as an adjacency list graph. A partition key and sort key pair represent an edge in the graph, with the partition key representing the source node of the edge, and the sort key representing the target of the edge.

A graph diagram showing a goal-user connection, with a tabular diagram showing the connection’s “edge” representation — An edge inside the records table

For example, in this diagram, a user U1’s membership with goal G1 can be thought of as an edge with G1 as the source and U1 as the target. In the records table, we refer to the partition key as the source, and the sort key as the target. On an edge, we store metadata about the relationship, such as the user’s role in contributing to a goal, and when a user became associated with a goal. In the database, this ends up looking like:

{
  source: 'GOAL-G1',
  target: 'GOALMEMBERSHIP-USER-U1',
  memberRole: 'LEAD', // Note that we can store arbitrary metadata on an edge,
  date: '2020-07-01',
  gsi0: '500-LEAD', // Explained below
}

Note: In practice, an ID like G1 is a UUID in our database. That means IDs in the records table are stored as UUIDs prefixed with an object's type. For example, a goal's id might be GOAL-cb421e73-43bb-4c68-bea3-be8f1f6140e8. This type prefix is necessary since we're using a single table to store multiple data types. Edges also have types: in the example above, the edge's type is GOALMEMBERSHIP. This allows us to maintain multiple relationship types between the same two node types: for example, if we were to store a "subscriber" relationship between users and goals, we might prefix the target field with GOALSUBSCRIBER, creating target values like GOALSUBSCRIBER-USER-U1.

But these are just edges. Where are nodes stored?

Nodes are represented as self-referential edges. In other words, we can think of a node’s data as an edge with the same source and target.

A diagram showing a self-referential node in a graph, depicting how node data is stored as self-referential edges. — “Nodes” in the records table are just edges from a node to itself.

For example, goal G1’s node is stored in with a partition key of GOAL-G1 and a sort key of GOAL-G1. So a goal in the records table might look like:

{
  source: 'GOAL-G1',
  target: 'GOAL-G1',
  title: 'Release Next-Generation Augmented Reality Platform',
  // ... and so on with other goal data
}

Inside DynamoDB, these nodes and edges look something like this:

A table showing the nodes and edges in the records table, noting that each DynamoDB partition contains multiple items.

Okay, so what does this do for us?

With this, we’re able to query for any number of nodes directly by performing a BatchGetItem operation on the table’s primary index (i.e., given a list of goal IDs, we can read them out directly in a single operation). But more than that, we’re now able to query for nodes based on edge relationships, such as querying for all of the goals that a user leads. To do that, we need to make use of DynamoDB’s Query API, and also its Global Secondary Index (GSI) feature.

Global Secondary Indexes

GSIs are similar to indexes in a traditional SQL table, with the limitation that they only index on a single field. GSIs have similar structure to the main ID structure of DynamoDB’s primary index, by having a partition key and sort key, with the caveat that a GSI’s ID is not unique. In our case, we maintain a GSI with a partition key set to the target field and a sort key set to the gsi0 field shown above. The gsi0 field is derived from other fields in the edge in a way specifically intended for enabling queries that we care about. In the case above, it's derived directly from the memberRole field, but with a 500- prefix that we'll explain in a bit.

Two tables depicting a DynamoDB primary index and a GSI, showing how the same data is indexed in each.

The Query API allows us to specify a single partition key and a sort key condition to retrieve items for. In our case, we’d like to get all of the goal membership edges connecting a particular user U1 to goals where U1 is a lead. To do this, we can run a query on the GSI for items where the partition key (i.e., target) is GOALMEMBERSHIP-USER-U, and the sort key (i.e., gsi0), is 500-LEAD. Running this query will give us the edges corresponding to the goals we care about. Remember that these edges don't contain actual goal data such as the goal's title, however. That data is only stored in the goal node itself, so we still need to retrieve the nodes connected to the edges we've just retrieved. We do this by performing a BatchGetItem operation on the primary index, with the goal ids that are stored as the source field on each edge. The result of this is the list of goals that the user leads.

But why the `500-` prefix?

Remember that the gsi0 field is a sort key. It's called such because the Query API allows us to query for items that are lexicographically before or after a particular sort key, or values that begin with a particular sort key prefix. We use this to allow for querying multiple goal membership levels at once. For roles like LEAD, CONTRIBUTOR, and TEAM, we store the gsi0 (naming explained at the end) field as 500-LEAD, 400-CONTRIBUTOR, and 300-TEAM (200 and 100 are currently reserved for other purposes). This allows us to perform queries like "get all of the goals user U contributes to or leads" by specifying a partition key of GOALMEMBERSHIP-USER-U and a sort key condition of "anything where the gsi0 value is >= 400-CONTRIBUTOR". Each edge type can provide its own logic for deriving the gsi0 value based on its fields.

The Query API has one other nicety: built-in pagination. Through ExclusiveStartKey and Limit parameters, we're able to fetch the first N goal edges associated with a team, and then fetch the next N when the client needs it.

We’re getting close now! We can retrieve all of a team’s goals by querying against the secondary index to retrieve membership edges and then reading from the primary index to retrieve goal nodes. But we still need to get all of the leads and teams connected to those goals… and doing so would require us to query across all of the goals we’ve retrieved. And once we do that, we’ll still need to perform a BatchGetItem operation to retrieve the user and team nodes associated with those edges — something DynamoDB isn’t great at. Remember that we can only query for specific items by ID (where an ID is a combination of both a partition key and a sort key), or query by sort key within a single partition. This puts us in a situation resembling an N+1 query.

A table showing multiple partitions of data in the database, noting that we appear to need a separate query for each goal.

While we could execute all of these queries individually as parallel requests, this still feels inefficient. Additionally, it’s costly, in the dollar sense!

Enter the edge set

To solve this, we created what we call the edge set. You can think of it as a replica of a node’s edges, stored directly on the node. The edge set is a DynamoDB String Set containing the target and gsi0 contents of the edges connected to a node. For example, since our goal has USER-U1 as a lead, the edge set contains GOALMEMBERSHIP-USER-U1-LEAD. This is a serialization of the edge's key information, which we then deserialize into a user id and its membership role immediately after reading. With edge set data from all of the other goals we've read, we can construct a BatchGetItem operation that retrieves all of the leads for the goals we've fetched, or all of the teams (or both!).

A depiction of the edge set in the records table, showing that it allows us to retrieve adjacent nodes efficiently.

Here, we store information about a goal’s edges inside its edge set. This way, once we read a set of goal nodes from the table via a BatchGetItem operation, we immediately have information about their edges via each goal’s edge set. We can then perform one more BatchGetItem operation to fetch any associated nodes we need, which in the case of our original example, is the list of leads and teams.

There’s still one problem here: consistency. The edge set is redundant and thus denormalized, which creates a risk of inconsistency: how do we ensure that the edge set is in sync with the edge items in the table? AWS released a feature to help with this back in 2018: transactions. With DynamoDB transactions, we can ensure that anytime we add an edge to the database, we’re also adding its data to the edge set. Similarly, when we delete an edge, we use a transaction to ensure that it’s also removed from the edge set.

Bringing it all together

Let’s go back to our original example. We wanted to get all of a team’s goals, along with each goal’s leads and associated teams. With the records table, we’re able to achieve this in three steps:

Query against the GSI for goal edges connecting our team to its goals. This gives us a list of goal IDs.
Perform a BatchGetItem on the primary index to retrieve the goals based on the IDs we just retrieved.
Perform a BatchGetItem on the primary index to retrieve the relevant nodes associated with the goals we just fetched. In particular, we’ll retrieve lead users and teams.

We’ve solved our initial problem, and we’ve also solved several more. For example, showing a page containing just the goals that a user leads is a matter of changing the GSI query to retrieve goal edges connected to a user with a membership role of LEAD. Finding all of the contributors to a particular goal is doable just by inspecting the goal’s edge set. As we’ve modeled more and more of our data in this way, we’ve been able to perform increasingly sophisticated queries without significant refactors when new needs come up.

This is a high-level overview of how our records table works. Managing the rules and structures of this table is a tricky operation, so to make it easy for us to work with, we developed a TypeScript-based client library around DynamoDB’s DocumentClient. The library, which is currently still closed source, provides a declarative interface for specifying all of a node’s fields and edges, as well as configuration for how to derive an edge’s GSI values, serialized edge set value, and so on.

Notes

Why is it called gsi0? We use the name gsi0 because DynamoDB allows up to 25 GSIs per table. There's nothing stopping us from creating a second GSI called gsi1 that indexes goal memberships sorted by date rather than membership level. The records table design allows for indexing multiple edge types in the same GSI, but for indexing the same edge type with a different sort key a new GSI is needed.
What happens if the edge set gets too big? A single item in a DynamoDB table cannot exceed 400KB in size. While this is pretty large, there are some cases where a node may have hundreds, or even thousands, of edges. The “subscriber” relationship we mentioned above could easily exhibit this behavior. For cases like this, our library allows certain edge types to be omitted from the edge set. While this isn’t ideal, in practice it tends to be compatible with product expectations: a list of people subscribed to a goal is unlikely to be something we’d need to show the entirety of in the UI for several goals at a time.
Why is the edge set stored as a DynamoDB Set rather than a more space-efficient format? Using a DynamoDB Set for the edge set allows us to use ADD and DELETE expressions in our queries when adding and removing edges. This means we can create a new edge on a node without needing to read the node out of the database first. (thanks to Marco Aurélio for bringing this up!)

Big thanks to RJ Zaworski, Dan Porter, Matt Tucker, and Andy Beers for help with editing this. Is this the type of problem that interests you? We’re hiring!