Speed up Your AWS DynamoDB Queries

David Byrne
Published in henngeblog · 8 min read · Apr 15, 2020

We had a problem: big queries that collected lots of data from AWS's DynamoDB (DDB) took a long time. We had a table of ~68,000 entries, roughly 35,000 of which sat under a single primary hash key. Querying for all the data under that hash key took somewhere in the realm of 10 seconds, and people don't like waiting that long on the internet. So I went on a journey to uncover what made these queries slow, and what could be done to improve the query times…

Long story short (I’ll try to keep the whole post short and to the point), we’re beginning to think that Amazon’s Boto3 library just doesn’t do its job very well (or at least, it’s not very fast).

In this post, I will refer often to a benchmarking script I wrote to measure how each change affected DDB query performance. The script simply performed the query mentioned above and timed how long the operation took using Python's timeit module.
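For reference, here is a minimal sketch of what that benchmark did (the table and key names are placeholders, not the ones from our staging data):

```python
import timeit

import boto3
from boto3.dynamodb.conditions import Key

TABLE_NAME = "benchmark-table"           # placeholder table name
HASH_KEY_NAME = "org_id"                 # placeholder hash key attribute
HASH_KEY_VALUE = "org-with-35000-items"  # the hash key holding ~35000 items


def query_all_items():
    """Query every item under one hash key, following pagination to the end."""
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    items = []
    kwargs = {"KeyConditionExpression": Key(HASH_KEY_NAME).eq(HASH_KEY_VALUE)}
    while True:
        response = table.query(**kwargs)
        items.extend(response["Items"])
        if "LastEvaluatedKey" not in response:
            return items
        # More than 1MB of data remains, so resume from the last evaluated key.
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]


start = timeit.default_timer()
items = query_all_items()
print(f"Fetched {len(items)} items in {timeit.default_timer() - start:.2f}s")
```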

Improving Boto3’s Performance (Kind of)

Since we were already using boto3, it seemed most appropriate to first try to improve its performance.

TIP 1: Make sure you have enough read capacity (~15 seconds → ~10 seconds)

The very first benchmark proved to be quite poor, requiring a massive 15 seconds. However, it later turned out that this was due to the table's limited read capacity (15 read capacity units). Because of how query works, you must send multiple requests to the API when the query matches a large amount of data (>1MB). I found that some of these requests were throttled when the read capacity was exceeded (normal requests took ~0.7 seconds; throttled requests lasted up to 3 seconds). After raising the read capacity to a comfortable level (100 units), no requests were throttled, and the query completed fairly consistently in around 10 seconds.
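For completeness, this is roughly how you can check (and, if necessary, raise) a table's provisioned read capacity with boto3. It assumes the table uses provisioned billing, as ours did; the name and numbers are placeholders:

```python
import boto3

client = boto3.client("dynamodb")
TABLE_NAME = "benchmark-table"  # placeholder table name

throughput = client.describe_table(TableName=TABLE_NAME)["Table"]["ProvisionedThroughput"]
print("Current read capacity:", throughput["ReadCapacityUnits"])

if throughput["ReadCapacityUnits"] < 100:
    # update_table requires both values, so keep the write capacity unchanged.
    client.update_table(
        TableName=TABLE_NAME,
        ProvisionedThroughput={
            "ReadCapacityUnits": 100,
            "WriteCapacityUnits": throughput["WriteCapacityUnits"],
        },
    )
```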

TIP 2: Store less data (~10 seconds → ~2 seconds*)

*This tip might not be applicable to many situations (if any; why would anyone be storing data they don’t need?), but there are numbers in here that might be interesting to some, and it might be interesting to compare these results to those of the next tip.

The next point of interest was the design of the table in general. To further understand Amazon's SaaS-supported black box that is DDB, I created a table with the same number of items as the original staging data, and the same primary hash key containing ~35000 items. This table kept the same schema as the original (one primary hash/sort key pair and one local secondary index) but contained only the necessary data. Because there was significantly less data in the table (~2MB, down from ~24MB in the original), it's not really surprising that the query took less time. However, it might be interesting to note that while fewer requests were required (2 rather than 12), each request took longer on average. I don't know enough about DDB's internals to say why.

Large (original) dataset vs small dataset

TIP 3: Use projections to only collect the data that you need (~10 seconds → ~3 seconds)

10 seconds is still a long time, and no one is going to be happy waiting that long. The next thing that can help improve the query time with boto3 is to apply a projection to the query (i.e. request only certain fields from the data, rather than whole items). Performing the query with a projection of just the two primary keys (hash and sort) and the primary local index attribute reduced the time required from ~10 seconds to ~3 seconds. This was a reduction from up to 20 attributes per item (though not every item had every attribute) to 3 for all. The interesting thing to note here is that, despite retrieving less data, the query needed the same number of requests. This seems to indicate that the 1MB response limit is calculated from the total size of the items being read rather than the size of the projected response: only the projected attributes appear in the response data (which would account for the lower per-request time seen below), yet the request count stayed the same.

Large (original) query vs a projected query on the same data

It's important to note here that, similar to the previous tip, this might not always be applicable. However, it does demonstrate that projections can definitely be worth it if you do not need all the data in the table at any given moment.
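For reference, applying a projection with boto3 looks something like this (the attribute names are placeholders standing in for our hash key, sort key, and local secondary index key):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("benchmark-table")  # placeholder name

# Only the three named attributes are returned for each item; everything else
# stored on the item is still read by DynamoDB but never sent back.
response = table.query(
    KeyConditionExpression=Key("org_id").eq("org-with-35000-items"),
    ProjectionExpression="org_id, sort_key, lsi_key",
)
items = response["Items"]
```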

TIP 4: Split primary hash keys into hash keys containing <1MB data (~10 seconds → ~6 seconds)

Doing this in and of itself will not provide any speedup (at least, theoretically); however, it does then allow the data to be collected in a non-sequential manner (non-sequential meaning either asynchronously or in parallel). Because the API response is always paginated and only reveals the next sort key to resume from at the end of each response, it is not possible to dynamically parallelise (de-synchronise? de-sequentialise?) the query. If the start keys are known beforehand, however, the data can be collected non-sequentially. Because this requires making fundamental changes to the way data is stored in a table, it may not be the most attractive option. In fact, I didn't even restructure the table when collecting these results: I just scraped a list of "ExclusiveStartKey"s (sort key values) from the regular query and used those to simulate the multiple-hash-key scenario (since real data/tables change over their lifetimes, this shortcut would not be possible outside of a testing environment).

After getting these request start points, I wrote a method to gather the results asynchronously using Python’s asyncio and aiodynamo modules. This method lined up the requests as mentioned earlier using aiodynamo’s client’s query method, and collected the results asynchronously. This achieved a speedup of around 4 seconds, from ~10 seconds down to ~6 seconds. Additional speedup may be possible through parallelization. However, the difficulty in setting up a database to incorporate this may prove too difficult to be worth the speedup.

Large sequential (original) query vs the same query asynchronously
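Here is a rough sketch of that non-sequential gathering. The actual benchmark used aiodynamo; to keep the sketch self-contained it substitutes plain boto3 queries run in threads via asyncio (Python 3.9+ for asyncio.to_thread), and it assumes the start keys were collected beforehand as described above:

```python
import asyncio

import boto3
from boto3.dynamodb.conditions import Key

TABLE_NAME = "benchmark-table"  # placeholder table name
KEY_CONDITION = Key("org_id").eq("org-with-35000-items")  # placeholder key


def query_page(exclusive_start_key=None):
    """Fetch a single (up to 1MB) page of results, starting from a known key."""
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    kwargs = {"KeyConditionExpression": KEY_CONDITION}
    if exclusive_start_key is not None:
        kwargs["ExclusiveStartKey"] = exclusive_start_key
    return table.query(**kwargs)["Items"]


async def gather_pages(start_keys):
    # Fire one query per known start key and collect the pages concurrently.
    pages = await asyncio.gather(
        *(asyncio.to_thread(query_page, key) for key in start_keys)
    )
    return [item for page in pages for item in page]


# start_keys = [None, key_1, key_2, ...]  # None for the first page, then the
#                                         # scraped "ExclusiveStartKey" values
# items = asyncio.run(gather_pages(start_keys))
```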

Ditching Boto3 Altogether

After some digging, we decided perhaps all this hassle isn’t worth it… Maybe boto3 is just irredeemably slow?

That was a good thought.

TIP 0: Don’t use boto3 (or botocore) (~10 seconds → ~4 seconds)

After profiling the query operation using boto3, we discovered that boto3 spends a ridiculous amount of time just checking, double-checking, and triple-checking the validity/structure of the data it receives, and parsing that data from the response.

The total time for boto3 to parse data makes up two of the three most time consuming methods

When inspecting these traces, we're most interested in the tottime field, as this measures the time spent inside each function itself, excluding calls it makes to other functions (so all tottime values added together give the total execution time).
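Reproducing such a trace is straightforward with Python's built-in profiler; this is roughly how we did it, reusing the benchmark function sketched earlier:

```python
import cProfile
import pstats

# Profile one full query and dump the stats to a file.
cProfile.run("query_all_items()", "query.prof")

# Sort by tottime to see which functions themselves consume the most time.
pstats.Stats("query.prof").sort_stats("tottime").print_stats(10)
```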

The _handle_structure and _parse_shape methods are provided by botocore to parse data from the AWS response into boto3-compatible data structures. This is largely unnecessary, though, since it is possible to predict the anticipated structure of the data and immediately parse (or at least attempt to parse) it into that structure, without the intermediate steps that boto3 enforces. To this end, we decided to try implementing our own request method to AWS. Luckily, Amazon actually provides examples of how to do this in Python (it's like they're trying to say something…), so I just copy-pasted most of the code and tweaked it to perform a query rather than create a table (and to utilise AWS sessions).
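For illustration, here is a trimmed-down sketch of a requests-based query. Unlike the actual benchmark (which copied Amazon's hand-rolled signing example), this version leans on botocore purely to sign the request, so it avoids botocore's response parsing while staying short; the region, table, and key names are placeholders:

```python
import json

import boto3  # used only to load credentials, never to perform the query
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

REGION = "ap-northeast-1"  # placeholder region
ENDPOINT = f"https://dynamodb.{REGION}.amazonaws.com/"


def raw_query(table_name, hash_key_name, hash_key_value, exclusive_start_key=None):
    """Send a DynamoDB Query over plain HTTPS and return the raw JSON body."""
    payload = {
        "TableName": table_name,
        "KeyConditionExpression": "#hk = :hv",
        "ExpressionAttributeNames": {"#hk": hash_key_name},
        "ExpressionAttributeValues": {":hv": {"S": hash_key_value}},
    }
    if exclusive_start_key:
        payload["ExclusiveStartKey"] = exclusive_start_key
    body = json.dumps(payload)

    # Build and SigV4-sign the request (the part Amazon's example does by hand).
    request = AWSRequest(
        method="POST",
        url=ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/x-amz-json-1.0",
            "X-Amz-Target": "DynamoDB_20120810.Query",
        },
    )
    credentials = boto3.Session().get_credentials().get_frozen_credentials()
    SigV4Auth(credentials, "dynamodb", REGION).add_auth(request)

    response = requests.post(ENDPOINT, data=body, headers=dict(request.headers))
    response.raise_for_status()
    # Items come back in raw DynamoDB JSON, e.g. {"org_id": {"S": "..."}}.
    return response.json()
```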

The results were incredible… Without performing any "tricks" like projections or de-sequentialisation (I'm going with that word), the time to query ~35000 items dropped from boto3's ~10 seconds down to ~4 seconds!

The same query to AWS, but using python’s requests module rather than boto3

This could be a very good case against the use of boto3. However, it really depends on your use case. If you mostly want convenience and don't want to develop your own module, boto3 is the simpler thing to reach for on a whim. However, if you need data to be quickly available to users, then it may be worth investing a (small) amount of time in developing a requests-based method for hitting AWS. Additionally, since each of the methods used to "speed up" boto3 had some sort of downside (lack of data, difficulty of implementation), they might not even be worth pursuing. As an aside, keep in mind that this data was collected solely from the point of view of the query method (and a highly controlled example of it at that), though I would assume the results transfer to the general case.

DynamoDB query throughput

What do you need to query AWS directly?

Honestly, not a lot. Beyond Amazon's example mentioned earlier, you really just need to move some things around: add variables to make the code more generic (the example is pretty much entirely hard-coded), place things into useful functions, and replace the request type (i.e. Amazon's custom method signature) with whatever request you're trying to make, and you're pretty much good to go. When it comes to query (which is what this benchmarking is based on, by the way), you will need to deal with data pagination (but you need to do that with boto3 anyway…). Lastly, if you need the response data transformed into useful models that are part of your code base, you will likely need a method to deserialise the response data into those model(s). However, this is a relatively small amount of work, and (hopefully) a custom deserialiser will be faster than boto3's generic one (as ours was).
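As an example, a minimal deserialiser for the raw DynamoDB JSON returned by the requests-based query above might look like this (it only handles the attribute types we actually stored; extend it as needed):

```python
from decimal import Decimal


def deserialise_attribute(attribute):
    """Convert one DynamoDB-typed value, e.g. {"N": "42"}, into a plain Python value."""
    ((type_tag, value),) = attribute.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [deserialise_attribute(v) for v in value]
    if type_tag == "M":
        return {k: deserialise_attribute(v) for k, v in value.items()}
    raise ValueError(f"Unhandled DynamoDB type: {type_tag}")


def deserialise_item(item):
    """Convert one item from a Query response's "Items" list into a plain dict."""
    return {name: deserialise_attribute(attr) for name, attr in item.items()}
```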

The script that was used in this benchmark query can be found here.

David is a student at the University of Sydney studying Software Engineering (graduating at the end of 2020). David was one of two HENNGE Global Interns for January 20-February 28 2020. Apply now by cracking the IT challenge here.
