Modeling Graph Relationships in DynamoDB
Koan is centered around goals, and how those goals connect to teams and people within a company. These connections can often be complex: the same person leading a goal on one team may be working toward several other related goals. To model this, we use a graph database built on top of DynamoDB that we call the records table.
A Motivating Example
Our application generally shows lists of objects with many connections to other objects in the system immediately visible. Doing this efficiently with DynamoDB using a traditional normalized table design served us poorly. In particular, consider the case of our goals list: a goal list may easily include thousands of goals, each of which may have a large set of relationships to teams, users, other goals, and interaction data elsewhere in the system. Earlier versions of Koan solved this very naively, by simply fetching all of a company’s goals from the database at once, and sending them to the client. Naturally, this became prohibitively slow, and needed a pagination model. To handle all of this with DynamoDB, we designed what we call the records table.
The above screenshot is showing a list of a team’s goals: two Objectives, one with four Key Results and the other with three Key Results and two sub-objectives. For each objective and key result, we need to display one or more leads (the people in charge of delivering the goal) as well as one or more teams associated with the goal.
So the data we’re showing here can be visualized as:
This is just a small subset of what we’re showing, of course! Dora Campos happens to be leading two of the goals we see here, but she may also be contributing to several other goals, and may be a member of multiple teams. Goals like “10 nationwide press events” may have their own sub-objectives and key results, and so on.
Modeling this is challenging because:
- We need to maintain a variety of connections between a variety of model types
- We need to show multiple types and levels of associations at once in our UI
- We need to show the same data in different ways: a team’s goal list may show data that heavily overlaps with a user’s goal list on their profile, which may overlap with a goal permalink, etc.
- Showing a list of goals is just one of several places where we have complex relationships to display. Users can be associated with teams in several ways as well, such as being a team manager, a team member, or a read-only observer.
- If there are too many goals to reasonably fetch in a single round trip from the client, we need a way to paginate the results.
So with the motivation of querying for the goals belonging to a team, in addition to those goals’ leads and teams, let’s try to model this in our database.
Modeling this with DynamoDB
To solve this, we built the records table: a single DynamoDB table capable of housing all of our model types and all of the connections between those models. In graph parlance, each item in this records table is either a “node” or an “edge.”
- A node is a first-class object in the Koan universe. Examples include goals, users, teams, and so on.
- An edge is a representation of the relationship between two nodes in the graph. An example is a goal membership, which is a connection between a goal and a user, team, or org. Edges are directed (i.e., an edge from A to B is not an edge from B to A).
Items in a DynamoDB table are uniquely identified by a combination of their partition key and sort key. An important characteristic of DynamoDB is that it’s well-suited for querying for specific items by ID, or for querying based on sort key within a single partition key, but not well-suited for querying across partition keys.
So how do we actually store these nodes and edges in the database if everything is organized by partition key and sort key? It turns out DynamoDB’s key structure works naturally as an adjacency list graph. A partition key and sort key pair represent an edge in the graph, with the partition key representing the source node of the edge, and the sort key representing the target of the edge.
For example, in this diagram, a user U1’s membership with goal G1 can be thought of as an edge with G1 as the source and U1 as the target. In the records table, we refer to the partition key as the source, and the sort key as the target. On an edge, we store metadata about the relationship, such as the user’s role in contributing to a goal, and when a user became associated with a goal. In the database, this ends up looking like:
memberRole: 'LEAD', // Note that we can store arbitrary metadata on an edge,
gsi0: '500-LEAD', // Explained below
Note: In practice, an ID like
G1 is a UUID in our database. That means IDs in the records table are stored as UUIDs prefixed with an object's type. For example, a goal's id might be
GOAL-cb421e73-43bb-4c68-bea3-be8f1f6140e8. This type prefix is necessary since we're using a single table to store multiple data types. Edges also have types: in the example above, the edge's type is
GOALMEMBERSHIP. This allows us to maintain multiple relationship types between the same two node types: for example, if we were to store a "subscriber" relationship between users and goals, we might prefix the
target field with
target values like
But these are just edges. Where are nodes stored?
Nodes are represented as self-referential edges. In other words, we can think of a node’s data as an edge with the same
For example, goal G1’s node is stored in with a partition key of
GOAL-G1 and a sort key of
GOAL-G1. So a goal in the records table might look like:
title: 'Release Next-Generation Augmented Reality Platform',
// ... and so on with other goal data
Inside DynamoDB, these nodes and edges look something like this:
Okay, so what does this do for us?
With this, we’re able to query for any number of nodes directly by performing a BatchGetItem operation on the table’s primary index (i.e., given a list of goal IDs, we can read them out directly in a single operation). But more than that, we’re now able to query for nodes based on edge relationships, such as querying for all of the goals that a user leads. To do that, we need to make use of DynamoDB’s Query API, and also its Global Secondary Index (GSI) feature.
Global Secondary Indexes
GSIs are similar to indexes in a traditional SQL table, with the limitation that they only index on a single field. GSIs have similar structure to the main ID structure of DynamoDB’s primary index, by having a partition key and sort key, with the caveat that a GSI’s ID is not unique. In our case, we maintain a GSI with a partition key set to the
target field and a sort key set to the
gsi0 field shown above. The
gsi0 field is derived from other fields in the edge in a way specifically intended for enabling queries that we care about. In the case above, it's derived directly from the
memberRole field, but with a
500- prefix that we'll explain in a bit.
The Query API allows us to specify a single partition key and a sort key condition to retrieve items for. In our case, we’d like to get all of the goal membership edges connecting a particular user U1 to goals where U1 is a lead. To do this, we can run a query on the GSI for items where the partition key (i.e.,
GOALMEMBERSHIP-USER-U, and the sort key (i.e.,
500-LEAD. Running this query will give us the edges corresponding to the goals we care about. Remember that these edges don't contain actual goal data such as the goal's
title, however. That data is only stored in the goal node itself, so we still need to retrieve the nodes connected to the edges we've just retrieved. We do this by performing a BatchGetItem operation on the primary index, with the goal ids that are stored as the
source field on each edge. The result of this is the list of goals that the user leads.
But why the
Remember that the
gsi0 field is a sort key. It's called such because the Query API allows us to query for items that are lexicographically before or after a particular sort key, or values that begin with a particular sort key prefix. We use this to allow for querying multiple goal membership levels at once. For roles like
TEAM, we store the
gsi0 (naming explained at the end) field as
100 are currently reserved for other purposes). This allows us to perform queries like "get all of the goals user U contributes to or leads" by specifying a partition key of
GOALMEMBERSHIP-USER-U and a sort key condition of "anything where the
gsi0 value is
>= 400-CONTRIBUTOR". Each edge type can provide its own logic for deriving the
gsi0 value based on its fields.
The Query API has one other nicety: built-in pagination. Through
Limit parameters, we're able to fetch the first N goal edges associated with a team, and then fetch the next N when the client needs it.
We’re getting close now! We can retrieve all of a team’s goals by querying against the secondary index to retrieve membership edges and then reading from the primary index to retrieve goal nodes. But we still need to get all of the leads and teams connected to those goals… and doing so would require us to query across all of the goals we’ve retrieved. And once we do that, we’ll still need to perform a BatchGetItem operation to retrieve the user and team nodes associated with those edges — something DynamoDB isn’t great at. Remember that we can only query for specific items by ID (where an ID is a combination of both a partition key and a sort key), or query by sort key within a single partition. This puts us in a situation resembling an N+1 query.
While we could execute all of these queries individually as parallel requests, this still feels inefficient. Additionally, it’s costly, in the dollar sense!
Enter the edge set
To solve this, we created what we call the edge set. You can think of it as a replica of a node’s edges, stored directly on the node. The edge set is a DynamoDB String Set containing the
gsi0 contents of the edges connected to a node. For example, since our goal has
USER-U1 as a lead, the edge set contains
GOALMEMBERSHIP-USER-U1-LEAD. This is a serialization of the edge's key information, which we then deserialize into a user id and its membership role immediately after reading. With edge set data from all of the other goals we've read, we can construct a BatchGetItem operation that retrieves all of the leads for the goals we've fetched, or all of the teams (or both!).
Here, we store information about a goal’s edges inside its edge set. This way, once we read a set of goal nodes from the table via a BatchGetItem operation, we immediately have information about their edges via each goal’s edge set. We can then perform one more BatchGetItem operation to fetch any associated nodes we need, which in the case of our original example, is the list of leads and teams.
There’s still one problem here: consistency. The edge set is redundant and thus denormalized, which creates a risk of inconsistency: how do we ensure that the edge set is in sync with the edge items in the table? AWS released a feature to help with this back in 2018: transactions. With DynamoDB transactions, we can ensure that anytime we add an edge to the database, we’re also adding its data to the edge set. Similarly, when we delete an edge, we use a transaction to ensure that it’s also removed from the edge set.
Bringing it all together
Let’s go back to our original example. We wanted to get all of a team’s goals, along with each goal’s leads and associated teams. With the records table, we’re able to achieve this in three steps:
- Query against the GSI for goal edges connecting our team to its goals. This gives us a list of goal IDs.
- Perform a BatchGetItem on the primary index to retrieve the goals based on the IDs we just retrieved.
- Perform a BatchGetItem on the primary index to retrieve the relevant nodes associated with the goals we just fetched. In particular, we’ll retrieve lead users and teams.
We’ve solved our initial problem, and we’ve also solved several more. For example, showing a page containing just the goals that a user leads is a matter of changing the GSI query to retrieve goal edges connected to a user with a membership role of
LEAD. Finding all of the contributors to a particular goal is doable just by inspecting the goal’s edge set. As we’ve modeled more and more of our data in this way, we’ve been able to perform increasingly sophisticated queries without significant refactors when new needs come up.
This is a high-level overview of how our records table works. Managing the rules and structures of this table is a tricky operation, so to make it easy for us to work with, we developed a TypeScript-based client library around DynamoDB’s DocumentClient. The library, which is currently still closed source, provides a declarative interface for specifying all of a node’s fields and edges, as well as configuration for how to derive an edge’s GSI values, serialized edge set value, and so on.
- Why is it called
gsi0? We use the name
gsi0because DynamoDB allows up to 25 GSIs per table. There's nothing stopping us from creating a second GSI called
gsi1that indexes goal memberships sorted by date rather than membership level. The records table design allows for indexing multiple edge types in the same GSI, but for indexing the same edge type with a different sort key a new GSI is needed.
- What happens if the edge set gets too big? A single item in a DynamoDB table cannot exceed 400KB in size. While this is pretty large, there are some cases where a node may have hundreds, or even thousands, of edges. The “subscriber” relationship we mentioned above could easily exhibit this behavior. For cases like this, our library allows certain edge types to be omitted from the edge set. While this isn’t ideal, in practice it tends to be compatible with product expectations: a list of people subscribed to a goal is unlikely to be something we’d need to show the entirety of in the UI for several goals at a time.
- Why is the edge set stored as a DynamoDB Set rather than a more space-efficient format? Using a DynamoDB Set for the edge set allows us to use
DELETEexpressions in our queries when adding and removing edges. This means we can create a new edge on a node without needing to read the node out of the database first. (thanks to Marco Aurélio for bringing this up!)
Big thanks to RJ Zaworski, Dan Porter, Matt Tucker, and Andy Beers for help with editing this. Is this the type of problem that interests you? We’re hiring!