Stop Using Scan! Speed up Your DynamoDB Reads by Leveraging Global Secondary Indexes

Kajoban Kuhaparan
SSENSE-TECH
Published in
12 min readJul 8, 2022

NoSQL databases have become a popular choice for data storage in recent years. With NoSQL databases, developers enjoy greater flexibility and ease of use compared to more traditional SQL databases. With the rise of cloud computing, many cloud providers include them as part of their suite of products, with Amazon Web Services (AWS) being no exception, providing DynamoDB as a fully managed database for customers to integrate into their stack.

Many teams at SSENSE, including my own, use DynamoDB to store their domain’s data. My team was able to enjoy the versatility of DynamoDB by leveraging a flexible schema to iterate on and adapt our data models over time.

Recently, our team found success by leveraging DynamoDB’s Global Secondary Index (GSI) feature to run performant queries on our non-key attributes without having to refactor how our data was being stored. When a service needs to run a query tens of thousands of times a day, its performance is critical so as to not cause a bottleneck for the entire system. GSIs emerged as the best solution for our team, and allowed our service to remain performant, while avoiding any refactoring to our existing table and saving precious developer time.

In this article I will start with a quick introduction to DynamoDB and the read operations I’ll be comparing. Then, I’ll dive in and present the querying challenge, potential solutions, and why GSIs emerged as the preferred option. It is valuable to have a well-rounded perspective when deciding to use a certain approach so I’ll also cover pitfalls and tradeoffs of using GSIs, and end with methods to optimize GSI usage.

DynamoDB: A Quick Introduction

When creating a table in DynamoDB, a developer is required to provide a table name and primary key. The primary key can consist of one attribute (the partition key), or two attributes (the partition and sort keys, jointly referred to as the composite key). Regardless of whether a partition or composite key is chosen, every primary key must be unique in the table.

As expected, DynamoDB provides the 4 basic operations of persistent storage, or CRUD (create, read, update, delete), PutItem, GetItem, UpdateItem, and DeleteItem. Each operation requires the name of the table and the primary key of the item to be operated on.

Let’s focus on the 3 ways DynamoDB supports reading data. In addition to GetItem, DynamoDB also provides Query and Scan operations. Each behaves in a different way:

  • GetItem: Retrieves a single item using its primary key. Operation is highly efficient because it has direct access to the physical location of the item.
  • Query: Retrieves all items matching a partition key. Operation is highly efficient because it has direct access to the physical locations where the data is stored.
  • Scan: Retrieves all items in a given table. When coupled with a FilterExpression, unneeded items can be discarded.

Here’s a concise side-by-side of the differences in these operations:

Figure 1: Comparison of DynamoDB Read Operations

From the above description, we can see that GetItem is the preferred method for retrieving a single item, while Query is the preferred method for retrieving multiple items. Unless we need all the items in a table, we should avoid the usage of Scan, as it consumes a large amount of compute time by traversing the entire table.

A Querying Challenge

Given the methods DynamoDB provides for retrieving items, a challenge arises when we need to frequently retrieve items on a non-key field.

Let’s illustrate this challenge with an example. Consider that as part of your warehouse domain, you are storing order data (in this example, the items in an order). Each order has a unique OrderID, which we will use as the primary key of our table.

Now imagine that when an order is dispatched from your warehouse, you also store the DispatchDate in DDMMYY format. Note that if an order has not been dispatched yet, it will not have a DispatchDate associated with it.

Figure 2: Order Data Table

Now, suppose that on a daily basis you’d like to view a list of all the orders that have been dispatched that day. Given the methods for retrieving items noted earlier, how can we implement this?

Potential Solutions

Because we’re attempting to retrieve multiple items, we may try to use the Query operation. However, as noted earlier, the Query operation allows us to retrieve items provided a partition key. Since we want to retrieve items on the DispatchDate, which is not the partition key of the table, we would not be able to use the Query operation.

We are left with the Scan and FilterExpression operations. As noted earlier, the Scan operation would retrieve every item in our table. We would then use a FilterExpression to retrieve only the items matching our desired DispatchDate. Retrieving every item in our table would result in a large compute time. This may be satisfactory if we only need to retrieve the items once. However, since the table’s size will increase daily as new orders come in, over time our daily Scan will take longer and longer. Therefore, we need a more scalable solution.

Luckily, this is where DynamoDB’s GSI feature comes into play. Instead of incurring the performance hit of retrieving every item with the Scan operation, we can leverage a GSI to run the faster Query operation on our table and retrieve only our desired items based on their DispatchDate.

Global Secondary Indexes to the Rescue

By configuring a non-key attribute as a GSI of our table, we can treat it like a partition key and select a subset of attributes from our table, organizing them against this “new” partition key. Then, we’re able to leverage DynamoDB’s Query operation in order to retrieve our desired items faster than if we were using a Scan operation.

So, let’s set the DispatchDate attribute as a GSI of our table. The main table’s primary key attributes would also automatically be included in our new index. Since we don’t need the Items attribute for our desired Query, we can choose to not project that attribute onto our new index. The resulting index would look like this:

Figure 3: DispatchDate Global Secondary Index

We can now run the faster Query operation on the DispatchDate attribute. This is where DynamoDB’s flexibility is exemplified, as we did not have to alter our table’s key schema or update any items to do so.

Note that ORDER_5 is missing from our table. If you’re curious why, it is because DynamoDB will not include items in your GSI if they are missing the partition key in the main table. Since ORDER_5 did not have a DispatchDate associated with it, it was not propagated to the DispatchDate index.

A GSI can easily be added to an existing DynamoDB table through the AWS console. First, select the table you would like to add the GSI to, and navigate to the Indexes tab to the right of the Overview tab.

Figure 4: DynamoDB Table Overview Tab

Here, you can view all existing GSIs of your table. To add a new GSI, click on either of the Create Index buttons in the Global Secondary Indexes panel.

Figure 5: DynamoDB Indexes Tab

On the next page, you will be able to configure the GSI’s details, such as its partition key, optional sort key, and its index name.

Figure 6: Configuring GSI Details

You can also customize its read capacity, write capacity, and attribute projections (more on this in the Optimization section). You’ll also be able to view your indexes’ estimated cost.

Figure 7: Configuring GSI Read and Write Capacity Units
Figure 8: GSI Estimated Costs
Figure 9: Configuring GSI Attribute Projections

As noted by AWS, creating a GSI can take some time, especially if it is for an existing table with many items already in it. You can decrease the time taken to create a GSI by allocating more Write Capacity Units (WCUs) for your table, and you can monitor the GSI’s creation progress through the OnlineIndexPercentageProgress metric in CloudWatch.

Finally, once you press the Create Index button, you will be brought back to the Indexes tab, and will see your index in the list with the “Creating” status.

Figure 10: Newly Created GSI With Creating Status

Once it has been successfully created, its status will change to “Active”. You can now run queries against your new GSI!

Figure 11: Newly Created GSI With Active Status

Below is a sample Query on our DispatchDate GSI, using the JavaScript AWS SDK. Other than providing the additional IndexName parameter, the syntax for querying a GSI is no different than if we wanted to query the partition key of our main table.

In the above example, I handle the pagination of the Query’s response, in the case where the total response size would be more than 1 MB.

Finally, in this example we specified a single GSI, but you can have up to 20 GSIs for a given table.

Query Versus Scan Experiment…the Results may Surprise you

So far, I’ve boasted about how much faster Querying a GSI is versus doing a full Scan of your table, but haven’t shown any numbers to prove my point. So, I ran a Scan vs. Query experiment on AWS to prove how much faster Querying is than Scanning.

To set-up this experiment, I created 3 DynamoDB tables in the us-west-2 region containing 10 thousand, 100 thousand, and 1 million randomly generated items. The tables were provisioned with 10 Read Capacity Units (RCUs) each. The items themselves follow the format of the Order Data Table above, where the OrderID is a randomly generated UUID, the list of Items is a random set of clothing pieces, and the DispatchDate is a random date from the last year.

The code for comparing the Scan and Query operations was run through a Lambda function in the same region running Node v16. For each table, I chose 50 random DispatchDates, and for each date, recorded the response times for both a Scan and Query to retrieve all of the orders that were dispatched on that day.

Finally, I calculated the average response times for both types of reads, and on a logarithmic scale, plotted them against the table sizes:

Figure 12: Average Scan and Query Response Times vs. Table Size
Figure 13: Logarithmic Plot of Average Response Time vs. Table Size

From the results, it is clear that querying the GSI is the undisputed winner. As your table size grows, running a Query is exponentially faster than a Scan. For the table with 1 million records, a Query was on average 18 times faster!

As an aside, running a Scan will also result in a larger bill than running a Query. This is because when running a Scan, you will be charged for the total number of items scanned, regardless of how many are returned from the FilterExpression. Since a Scan will always read the entire table, it is clear that it will not only take longer, but also cost more than a Query.

Any large scale service that is handling data at this size will quickly create a bottleneck if it attempts to retrieve items using a Scan, and it will only get worse as the size of your table increases. If you don’t actually require every item in the table, it is more scalable to run a Query on a GSI.

If you’re curious about the code used to run this experiment, feel free to take a look at the Github repository: https://github.com/kajoban/GlobalSecondaryIndexExperiment.

Pitfalls and Tradeoffs

There are several important aspects to keep in mind when using Global Secondary Indexes:

Uniqueness: GSIs do not need to be unique. Since any non-key attribute can be configured as a GSI, DynamoDB cannot enforce the attribute’s uniqueness when it is set as the partition key of the index. If enforcing the partition key’s uniqueness is required by your application, you must handle it yourself in your code.

Querying: Only Query and Scan operations are supported. The GetItem operation requires a unique primary key to retrieve a single item, and that cannot be enforced by a GSI, so it is not available to be used. Furthermore, since it is possible to have duplicate partition and sort key pairs in a GSI, a Query specifying such a pair may not always return a single item.

Write Operations: You cannot write to a GSI using PutItem, UpdateItem, or DeleteItem, or any other write operation. This is because the GSI is a projection of your main table. It should be thought of as a “read-only” entity.

Synchronization: As DynamoDB automatically synchronizes your GSI with your base table, a write operation on your main table will eventually propagate to your GSI. Although the synchronization is usually fast, there may be a delay depending on the size of your table, and this could result in the usage of stale data if the data is read from the GSI before it has been synchronized with the main table.

Creation in Existing Table: Although creating a GSI on a table doesn’t require refactoring how your data is stored, there may be a substantial wait time before your GSI is created if your table already contains many items. This can be mitigated by temporarily increasing the Write Capacity Units (WCUs) of your table until the GSI has been created. Furthermore, you can monitor the GSI’s creation progress through the OnlineIndexPercentageProgress metric in CloudWatch.

Missing Key Attributes: DynamoDB implements a feature called a Sparse Index, which results in a GSI not automatically propagating items in your main table that are missing the key attribute of the GSI. Therefore, you should ensure that the GSI key attributes are populated in your main table before attempting to Query for those items in your GSI.

Cost: Since writes to the main table also propagate to your GSI, there is a multiplicative effect on the cost of writes to your main table. Likewise, since your GSIs are projections of your main table, this effect also applies to the cost of space required to persist these projections. We can minimize these costs by selecting only attributes we need to be a part of our GSI. I will go over this in the next section.

The Secret to Success: Optimization

Earlier, I mentioned that a GSI is a projection of your main table. This means that it includes a set of attributes that are copied from your main table onto the index. Although the main table’s partition and sort keys will always be projected onto your index, you can choose what other attributes you would also like to project. Leveraging this customization is important in order to project only what is required for your application, thus reducing the costs of both synchronizing and persisting your GSI.

Going back to the example shown above which created a GSI through the console, we should note the Attribute Projections section. DynamoDB provides us with 3 options for what attributes we’d like to project from our main table onto our GSI:

  1. KEYS_ONLY: Each item in the GSI includes the selected key attribute, as well as the main table’s partition key (and sort key values if they exist). This is the smallest possible GSI, and thus the least expensive.
  2. INCLUDE: Like with KEYS_ONLY, each item in the GSI will include the key attributes, but additionally any specified non-key attributes from the main table. Indexes which use INCLUDE will vary in size depending on how many attributes are additionally projected, and is best for when you know the maximal amount of attributes your Query should return.
  3. ALL: Each item in the GSI will include all of the attributes from the main table. This effectively duplicates the main table, and is the largest possible GSI, and thus the most expensive.

When creating a new GSI through the AWS console, by default it will be created with the ALL Projection Type. However, it is always a good idea to look carefully at what queries you expect to be running against your GSI, and if every attribute doesn’t need to be projected, only include the required attributes in order to minimize costs.

Conclusion

If you find yourself frequently using Scan for a particular attribute on your DynamoDB table, you should consider projecting that attribute onto a GSI and leveraging DynamoDB’s much faster Query operation. It is easy to extend an existing table with a GSI, without having to update any items in your table. GSIs are also very flexible in allowing you to choose which attributes of your main table you would like them to include, allowing you to minimize the costs associated with persisting and synchronizing them.

Editorial reviews by Catherine Heim, Mario Bittencourt & Nicole Tempas.

Want to work with us? Click here to see all open positions at SSENSE!

--

--