Reading from DynamoDB gone wrong

Chethan
4 min read · Mar 23, 2024

A few months ago, we moved our persistence layer from S3 to DynamoDB. Within a few days of the switch, we started seeing throttling errors from DynamoDB because our system was hitting the configured capacity. If you know DynamoDB, you will know that it offers two read/write capacity modes:

  • On-demand
  • Provisioned

Since we had a rough idea of the number of reads and writes we do, we chose Provisioned mode and set a limit that we thought would be enough for our use case. Our understanding was proven wrong. When we hit the threshold and started getting throttling errors, we raised the limit. This cycle repeated twice more before we switched completely to on-demand mode, hoping that AWS would scale the capacity automatically. However, as the workload on our system increased, we noticed our costs climbing steeply.

It was around this time that we sat down to understand what was going wrong, and when we found the reason, it was a facepalm moment.

Before I share the problem, let us talk about one important concept in DynamoDB.

HashKey and RangeKey: Together, these two identify a row in a DynamoDB table, which means the combination has to be unique. The RangeKey is optional, but the HashKey must be defined when the table is created. In the absence of a RangeKey, the HashKey alone uniquely identifies a row in the table.

A simple DynamoDB table

The above screenshot shows a DynamoDB table with MusicCode as the HashKey and Artist as the RangeKey.
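To make the key rules above concrete, here is a minimal in-memory sketch. It is not DynamoDB or pynamodb code: `InMemoryTable` is a hypothetical stand-in, and the sample values (`M001`, `Coldplay`, and so on) are made up for illustration. It only shows that the (HashKey, RangeKey) pair is what identifies a row.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory stand-in for a table; not DynamoDB or pynamodb code.
Key = Tuple[str, str]  # (HashKey, RangeKey), here (MusicCode, Artist)


class InMemoryTable:
    def __init__(self) -> None:
        self._items: Dict[Key, dict] = {}

    def put(self, hash_key: str, range_key: str, attributes: dict) -> None:
        # The (hash_key, range_key) pair is the row's identity, so writing
        # the same pair again replaces the existing row.
        self._items[(hash_key, range_key)] = dict(attributes)

    def get(self, hash_key: str, range_key: str) -> Optional[dict]:
        # A lookup needs both parts of the key to pick exactly one row.
        return self._items.get((hash_key, range_key))

    def __len__(self) -> int:
        return len(self._items)


table = InMemoryTable()
table.put("M001", "Adele", {"Genre": "Pop"})
table.put("M001", "Coldplay", {"Genre": "Rock"})  # same HashKey, new RangeKey: a new row
table.put("M001", "Adele", {"Genre": "Soul"})     # same pair: replaces the first row

print(len(table))                  # prints: 2
print(table.get("M001", "Adele"))  # prints: {'Genre': 'Soul'}
```

Two rows may share a HashKey as long as their RangeKeys differ, which is why the HashKey alone is not enough to single out a row in such a table.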

Now, back to our problem. Wait, I didn't tell you what it was, right? Well, the main problem with our system was how we were reading from DynamoDB. Our application is written in Python, and we use the pynamodb library to interact with the database. The library provides three main ways to fetch data from a table:

  • Get: This is the simplest of the three. As the name suggests, it gets the row that matches the HashKey and the RangeKey (if the table has one). It returns exactly one row. For example:
def get_a_music_album(music_code: str):
    return MusicModel.get(music_code)

However, the above code doesn't work. Because our table defines a RangeKey, pynamodb expects us to provide it as well. Without it, the call fails with the error: The provided key element does not match the schema. A get needs the full key because the HashKey alone may match multiple rows, and there would be no way to know which one to return.

To fix this, we have to change our get call to something like the following.

def get_a_music_album(music_code: str, artist: str):
    return MusicModel.get(music_code, range_key=artist)
  • Query: This method fetches multiple rows from the table that share the same HashKey. It is useful when you have both a HashKey and a RangeKey defined and you want all the entries that match the HashKey. Additionally, you can use comparison operators on any other column to refine the results. For this, let us change our table a bit by adding one more row with Adele's entry.
def get_music_albums(artist: str):
    # query() matches its first argument against the HashKey, so this call
    # assumes Artist is the HashKey of the table being queried
    return MusicModel.query(artist)  # artist is equal to "Adele"

# if you want to filter on another attribute such as genre, you can write it like this:
def get_music_albums_for_artist_by_genre(artist: str, genre: str):
    """artist=Adele, genre=Pop"""
    return MusicModel.query(artist, filter_condition=(MusicModel.genre == genre))

The first call returns a result set with 2 objects, and the second returns 2 objects as well because both rows match the filter condition.

When using query, DynamoDB first fetches the rows that match the HashKey and then applies the filter, so it is a two-step process. The filter only trims what is returned; read capacity is consumed by every row fetched in the first step.

  • Scan: This method fetches all the rows from the table. You can also pass a filter to extract a subset of the rows. Like query, scan works in two steps: it first fetches all the rows and then filters them based on the provided condition.
def get_music_by_genre(genre: str):
    return MusicModel.scan(filter_condition=MusicModel.genre == genre)  # genre is "Pop"

For the above method, we get 4 objects if we pass Pop as the genre.
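The difference between the three methods is easiest to see by counting how many rows each one has to read before returning anything. The sketch below is a plain-Python simulation, not pynamodb code: the `ROWS` data, the function names, and the `rows_read` counter are all made up for illustration, and it assumes MusicCode is the HashKey and Artist is the RangeKey.

```python
from typing import Callable, List, Optional, Tuple

Row = dict

# Made-up sample rows; MusicCode is the HashKey and Artist is the RangeKey.
ROWS: List[Row] = [
    {"MusicCode": "M001", "Artist": "Adele", "Genre": "Pop"},
    {"MusicCode": "M001", "Artist": "Coldplay", "Genre": "Rock"},
    {"MusicCode": "M002", "Artist": "Adele", "Genre": "Pop"},
    {"MusicCode": "M003", "Artist": "Eminem", "Genre": "Rap"},
]

# Index by the full key so that a get touches only the matching row.
INDEX = {(row["MusicCode"], row["Artist"]): row for row in ROWS}


def get(hash_key: str, range_key: str) -> Tuple[Optional[Row], int]:
    """Return (row, rows_read) for a lookup by the full key."""
    row = INDEX.get((hash_key, range_key))
    return row, (1 if row is not None else 0)


def query(hash_key: str,
          keep: Optional[Callable[[Row], bool]] = None) -> Tuple[List[Row], int]:
    """Step 1: read every row with this HashKey. Step 2: apply the filter."""
    read = [row for row in ROWS if row["MusicCode"] == hash_key]
    matched = [row for row in read if keep is None or keep(row)]
    return matched, len(read)


def scan(keep: Optional[Callable[[Row], bool]] = None) -> Tuple[List[Row], int]:
    """Step 1: read every row in the table. Step 2: apply the filter."""
    read = list(ROWS)
    matched = [row for row in read if keep is None or keep(row)]
    return matched, len(read)


_, get_read = get("M001", "Adele")
_, query_read = query("M001", keep=lambda row: row["Genre"] == "Pop")
found, scan_read = scan(
    keep=lambda row: row["MusicCode"] == "M001" and row["Artist"] == "Adele"
)

print(get_read, query_read, scan_read)  # prints: 1 2 4
```

The scan finds the same single row as the get, but it reads the whole table to do so, and that number keeps growing as rows are added.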

You may have guessed by now what the issue was in our case.

We were using Scan to fetch data. Although we were looking for a single object by its HashKey, every call fetched all the rows from the table and applied the filter on the HashKey afterwards. As the number of rows in the table grew, every read call consumed more capacity and pushed us closer to the provisioned limit. After we moved to on-demand mode, the cost kept growing day by day.
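A rough back-of-the-envelope estimate shows how quickly this adds up. The sketch below assumes the usual DynamoDB accounting, where one read unit covers a strongly consistent read of up to 4 KB and an eventually consistent read costs half of that. The table size and item size are hypothetical numbers, not figures from our system, and the scan estimate ignores per-page rounding.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # one read unit covers up to 4 KB (strongly consistent)


def get_item_read_units(item_bytes: int, strongly_consistent: bool = True) -> float:
    """Estimate read units for fetching a single item by its key."""
    units = math.ceil(item_bytes / READ_UNIT_BYTES)
    return float(units) if strongly_consistent else units / 2


def scan_read_units(item_count: int, avg_item_bytes: int,
                    strongly_consistent: bool = True) -> float:
    """Estimate read units for scanning a whole table.

    Simplification: sums the size of all items and rounds up once,
    instead of rounding per page of results.
    """
    units = math.ceil(item_count * avg_item_bytes / READ_UNIT_BYTES)
    return float(units) if strongly_consistent else units / 2


# Hypothetical table: 100,000 rows of about 1 KB each.
print(get_item_read_units(1024))       # prints: 1.0
print(scan_read_units(100_000, 1024))  # prints: 25000.0
```

Under these assumptions, finding one row with a scan costs about 25,000 times as much as fetching it by key, and the filter does nothing to reduce that.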

I hope you learned from the mistake we made; we definitely did. Choose the method you use carefully.
