DynamoDB in Production: A compilation of lessons learned in a year

Shourya Pratap Singh
Published in FinBox
13 min read · Mar 3, 2020


An updated version of this content is published on the AWS Startups Blog.

This article covers the do’s, don’ts, and best practices for using DynamoDB for large-scale production workloads, along with tips for saving costs.

For one of FinBox’s products, DeviceConnect, we provide a credit score for customers based on enriched mobile device data. At the time of writing, we were scoring close to a million customers per month and ingesting close to 80 GB of new data every day. DeviceConnect makes heavy use of DynamoDB, and here are the lessons we learned from using DynamoDB in the product over the last year:

Lesson 1: First be sure whether you need DynamoDB

Being a fully managed NoSQL database with high availability and durability, and with no administrative burden of provisioning and managing servers, DynamoDB is pretty easy to start with. But before you even start, make sure you can answer the question: “Why DynamoDB, and is it a good choice for me?”. This article by Forrest Brazeal humorously proposes “WhynamoDB”, an AWS service to answer that very question, along with the two laws of DynamoDB:

  • Assume that a DynamoDB implementation will be harder, not easier, than using a relational database that you already know.
  • On a massive scale, DynamoDB’s usability is limited by its simplicity.

In the case of DeviceConnect, with fixed access patterns and polymorphic data, we needed a highly available NoSQL store to quickly store and retrieve enriched device data. 90% of our tech stack was already serverless (AWS Lambda), and DynamoDB integrates well with it. Also, being a small team, we didn’t want to invest much in operations, so DynamoDB being a managed service made it our choice.

Lesson 2: Know your access patterns before designing the schema

This is true for most NoSQL databases: you need to know your access patterns before designing your schema. Since DynamoDB charges based on throughput (capacity or request units consumed), efficient reading becomes pretty important.

If you know the access patterns up front, you can design schemas with appropriate partition and sort keys so that the least amount of data is read (and hence billed) every time you query. For example, adding filter expressions to a query does not reduce the cost, because the filter is applied after all the items matching the partition key and sort key range conditions have been read. Patterns like the hierarchical (composite) sort key can be used to address this problem.

Table without Hierarchical Sort Key

To understand this with an example, imagine a table called Job with several attributes, three of which are jobtype, createdat, and status, where jobtype is the partition key and createdat is the sort key. To get all pending task_1 jobs, you would query with jobtype = "task_1" and a filter expression of status = "pending". But since the filter expression is applied after all items matching the partition key are read, you end up paying for the data size of all three items shown in the diagram.

Table with Hierarchical Sort Key

But by joining status and createdat into a composite sort key, called status_create here, we can query with jobtype = "task_1" and status_create beginning with "pending". Since this is purely a key-condition range query, only the required items (2 in this case) are read, and you are charged only for their size.
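
Here is a minimal sketch of both queries using boto3, assuming the Job table from the example above and status_create values of the form "pending<createdat>"; the names are illustrative, not an exact copy of our schema:

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Job")  # hypothetical table from the example

# Filter-expression approach: every task_1 item is read (and billed),
# and the non-pending ones are only discarded after the read.
expensive = table.query(
    KeyConditionExpression=Key("jobtype").eq("task_1"),
    FilterExpression=Attr("status").eq("pending"),
)

# Composite sort key approach: the key condition alone narrows the read
# to pending items, so only those items are billed.
cheap = table.query(
    KeyConditionExpression=Key("jobtype").eq("task_1")
    & Key("status_create").begins_with("pending"),
)
```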

In this amazing talk at AWS re:Invent 2018, Rick Houlihan showed that, if you know the access patterns beforehand, a single DynamoDB table can handle the access patterns of a multi-table relational database. The AWS developer guide also mentions that most well-designed applications require only one table. Let’s look at some techniques that make this possible:

Index Overloading (Source: Trek10’s article)

The technique shown in the diagram is index overloading, in which a single attribute pk is used to store the primary keys of all the different relational tables being modeled. This is useful for modeling different sets of data, and multiple access patterns, in a single table with a small number of indexes.
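
As a rough illustration of what overloaded items can look like (the entity prefixes and the AppTable name are hypothetical, not our actual schema):

```python
import boto3
from boto3.dynamodb.conditions import Key

# One table, one generic pk/sk pair; an entity-type prefix in the key
# decides what each item represents.
items = [
    {"pk": "USER#42",    "sk": "PROFILE",          "name": "Asha"},
    {"pk": "USER#42",    "sk": "ORDER#2020-03-01", "total": 499},
    {"pk": "ORDER#9001", "sk": "ITEM#1",           "product": "SIM card"},
]

table = boto3.resource("dynamodb").Table("AppTable")  # hypothetical single table
with table.batch_writer() as batch:
    for item in items:
        batch.put_item(Item=item)

# One access pattern: everything belonging to user 42 in a single query.
user_data = table.query(KeyConditionExpression=Key("pk").eq("USER#42"))
```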

Similarly, if we only need to access a small subsection of the table, we can make use of sparse indexes. For any item in a table, DynamoDB writes a corresponding index entry only if the index sort key value is present in the item; if the sort key doesn’t appear in every table item, the index is said to be “sparse”. For example, say we have a table that stores all food orders, with user_id as the partition key and order_id as the sort key. To track currently incomplete orders, we can add an attribute is_running, defined as an alternate sort key (LSI), only to incomplete orders, and delete the attribute once the order is complete. Since only a small fraction of items carry this attribute, is_running acts as a sparse index.
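
A minimal sketch of how the attribute is added and later removed, assuming a FoodOrders table with an LSI whose sort key is is_running (all names are hypothetical):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("FoodOrders")  # hypothetical table

# While an order is incomplete, write the is_running attribute. Because it
# is the LSI sort key (declared as a number at table creation), only these
# items appear in the index.
table.put_item(Item={
    "user_id": "user-1",
    "order_id": "order-123",
    "is_running": 1,  # present only on incomplete orders
})

# Incomplete orders for a user come straight from the sparse index.
running = table.query(
    IndexName="is_running-index",  # hypothetical LSI name
    KeyConditionExpression=Key("user_id").eq("user-1"),
)

# Once the order completes, remove the attribute; DynamoDB drops the
# corresponding index entry, keeping the index sparse and cheap to query.
table.update_item(
    Key={"user_id": "user-1", "order_id": "order-123"},
    UpdateExpression="REMOVE is_running",
)
```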

Adjacency List (Source: AWS Developer Guide)

To implement many-to-many relationships, one can use the adjacency list concept from graph theory and represent each edge as an item, with the partition key denoting the source node and the sort key the target node. This is demonstrated in the diagram, where an invoice contains multiple bills and one bill can also be part of multiple invoices.
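
A small sketch of the edge items and the two lookups, assuming a hypothetical Billing table and an inverted GSI called sk-pk-index that swaps the two keys:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Billing")  # hypothetical table

# Each edge of the many-to-many graph is one item: the source node is the
# partition key and the target node is the sort key.
edges = [
    {"pk": "INVOICE#100", "sk": "INVOICE#100"},  # the node itself
    {"pk": "INVOICE#100", "sk": "BILL#7"},       # invoice 100 contains bill 7
    {"pk": "INVOICE#101", "sk": "BILL#7"},       # bill 7 also sits on invoice 101
]
with table.batch_writer() as batch:
    for edge in edges:
        batch.put_item(Item=edge)

# All bills on invoice 100: a plain query on the partition key.
bills = table.query(KeyConditionExpression=Key("pk").eq("INVOICE#100"))

# All invoices containing bill 7: query the inverted GSI (assumed to exist).
invoices = table.query(
    IndexName="sk-pk-index",
    KeyConditionExpression=Key("sk").eq("BILL#7"),
)
```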

The downside of a single-table DynamoDB design is that often only the people who designed it can understand the data just by looking at it. It requires proper design documentation explaining the choices made for each access pattern; without it, onboarding new employees can be difficult.

Source: Programmer Humor on Tumblr

Lesson 3: LSIs can only be created at table creation, and adding a GSI to a large table can be expensive

Local Secondary Indexes (LSIs) can only be defined at the time of table creation, which makes it pretty difficult to implement new access patterns later. (Adding an LSI also restricts each partition’s item collection to 10 GB.)

Global Secondary Indexes (GSIs) have the flexibility of being added later as well. But the lesson we learned is that GSI creation can be super expensive if the table is huge, so it’s better to think through access patterns beforehand.

In one of our tables with over a billion records, adding a GSI cost us close to $1000 in a single day! This happened because of the replication required to build the GSI. If this amount had been spread gradually over time, a single day’s bill wouldn’t have hurt as much. Creating a GSI later can also take a long time; in our case it took close to 24 hours.

Cost Explorer showing the cost for the day GSI was created

To reduce costs, it is always advisable to project only the required attributes when defining secondary indexes; that way only the projected attributes are replicated instead of all of them.
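
For example, a GSI with an INCLUDE projection can be added to an existing table roughly like this with boto3 (the index name, keys, capacities, and the projected priority attribute are illustrative):

```python
import boto3

client = boto3.client("dynamodb")

client.update_table(
    TableName="Job",  # hypothetical table from the earlier example
    AttributeDefinitions=[
        {"AttributeName": "status", "AttributeType": "S"},
        {"AttributeName": "createdat", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[{
        "Create": {
            "IndexName": "status-createdat-index",
            "KeySchema": [
                {"AttributeName": "status", "KeyType": "HASH"},
                {"AttributeName": "createdat", "KeyType": "RANGE"},
            ],
            # Only these non-key attributes are copied into the index,
            # keeping replication and storage costs down (table keys are
            # always projected automatically).
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["priority"],  # hypothetical attribute
            },
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 5,
                "WriteCapacityUnits": 5,
            },
        }
    }],
)
```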

To create GSIs from the start, you need to know your access patterns in advance, so spend a good amount of time discussing and designing the schema before you begin.

Lesson 4: Provide sufficient capacity for GSI

One of the big advantages of a GSI is that you specify separate capacity for it, unlike an LSI, whose capacity is shared with the main table. But if a GSI is provisioned with too little capacity, it can throttle your main table’s write requests!

Whenever the main table is updated, the GSI is updated as well. If the write rate to the main table is high and the GSI cannot keep up (due to low provisioned capacity), write requests to the main table are throttled until the GSI catches up.

Hence, the AWS Developer Guide also recommends that, to avoid potential throttling, the provisioned write capacity for a GSI should be equal to or greater than the write capacity of the base table.

Lesson 5: Beware of hot partitions!

DynamoDB hashes the partition key and maps it to a keyspace, in which different ranges point to different partitions. If your access patterns involve querying the same partition key again and again (a “hot” key), you can end up with hot partitions, which causes read/write requests to that partition to be throttled. DynamoDB does try to mitigate this to some extent with burst capacity and instant adaptive capacity.

Demonstration of burst capacity (capacity units on the y-axis, time in seconds on the x-axis)

With burst capacity, DynamoDB retains up to five minutes (300 seconds) of unused read and write capacity, which can be consumed during short bursts. As stated in the developer guide, DynamoDB can also consume burst capacity for background maintenance and other tasks without prior notice, so you should not rely on it entirely.

Initially, your provisioned capacity is distributed equally across partitions; with instant adaptive capacity, unused capacity from other partitions is made instantly available to hot partitions when required.

In late 2019, AWS also announced a feature that automatically isolates frequently accessed items based on access patterns. However, this doesn’t work for tables with DynamoDB Streams enabled.

Even with instant adaptive capacity, a single partition can never exceed 3,000 read capacity units and 1,000 write capacity units.

To avoid hot partitions, you should choose your partition key so that requests are distributed uniformly across partitions and a “hot” key never arises, but that’s not always possible with every access pattern. At FinBox, we had a few “bad” users who at times sent huge amounts of data so quickly that they ended up throttling other read/write requests. To tackle this, we followed Segment’s approach of blocking or limiting requests from certain users, as described in their blog post “The million-dollar engineering problem”. It used to be a long process of requesting the hot partition keys from AWS and identifying bad users through logs, but Amazon has since released CloudWatch Contributor Insights for DynamoDB, which lets us find those keys/items ourselves.
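
Contributor Insights can be switched on per table (or per index) with a single call; the table name below is just a placeholder:

```python
import boto3

client = boto3.client("dynamodb")

# Enable CloudWatch Contributor Insights to surface the most-accessed and
# most-throttled keys for this table.
client.update_contributor_insights(
    TableName="DeviceData",  # hypothetical table name
    ContributorInsightsAction="ENABLE",
)
```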

Lesson 6: Use backoffs or jitters while writing or reading data

There are many reasons read/write requests can get throttled, such as under-provisioned capacity or hot partitions. A general rule of thumb for handling throttles is to retry with exponential backoff and jitter when writing or reading data. Most AWS client libraries already have options for retries with exponential backoff.

Look at this article to read more about it.
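
A sketch of both options in Python: letting botocore retry with backoff for you (the "adaptive" mode needs a reasonably recent SDK), and a hand-rolled loop with full jitter. The table and item names are illustrative:

```python
import random
import time

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Option 1: let the SDK retry with exponential backoff; "adaptive" mode
# adds client-side rate limiting on top of it.
dynamodb = boto3.resource(
    "dynamodb",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)
table = dynamodb.Table("Job")  # hypothetical table

# Option 2: a hand-rolled retry loop with full jitter.
def put_with_backoff(item, max_attempts=5, base=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return table.put_item(Item=item)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            # Sleep a random duration between 0 and the capped exponential
            # backoff ("full jitter").
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("still throttled after retries")
```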

Lesson 7: Prefer provisioned capacity over on-demand

Every DynamoDB table must be in one of two capacity modes: On-Demand or Provisioned. In On-Demand mode you are charged per request unit, while in Provisioned mode you set capacity limits for reads and writes (optionally with autoscaling) and are charged for the provisioned or scaled capacity units.

Demonstrating the increase in cost after switching to on-demand from provisioned

While On-Demand sounds good, it can get pretty costly. Recently, we changed some code that required a particular table to be accessed more often, so we switched that table from Provisioned to On-Demand to observe its consumed capacity, and it cost us almost 2x! We did this to decide future capacity, but the extra cost was clearly visible within just 3 days of making the switch. Also note that switching between capacity modes is allowed only once every 24 hours. As a general rule of thumb, avoid on-demand, since most use cases have a predictable load.
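
Switching a table back to provisioned mode, for example, is a single update_table call (the capacity numbers below are placeholders, not a recommendation):

```python
import boto3

client = boto3.client("dynamodb")

# Move a table from on-demand back to provisioned capacity; remember this
# switch is only allowed once every 24 hours.
client.update_table(
    TableName="DeviceData",  # hypothetical table
    BillingMode="PROVISIONED",
    ProvisionedThroughput={
        "ReadCapacityUnits": 200,
        "WriteCapacityUnits": 100,
    },
)
```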

Lesson 8: Autoscaling has its limits

Source: AWS Database Blog

While scaling up can happen any number of times in provisioned capacity mode, scaling down is limited to 27 times a day: 4 decreases in the first hour, and 1 decrease in each subsequent 1-hour window. So if one “bad” user triggers a scale-up, it can take a long time to come back down, leaving you paying for high provisioned capacity in the meantime. In FinBox’s case, identifying such “bad” users through proper monitoring was essential to handle this situation.

An additional cost-saving strategy is scheduled autoscaling, where different policies apply at different times based on expected usage: for example, less capacity during the night and more during the day for applications used mostly in the daytime.
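
A sketch of scheduled scaling for read capacity using the Application Auto Scaling API; the table name, capacity ranges, and cron expressions are all illustrative:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target, then attach
# scheduled actions: a lower range at night, a higher one for the day.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/DeviceData",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=100,
    MaxCapacity=1000,
)

autoscaling.put_scheduled_action(
    ServiceNamespace="dynamodb",
    ResourceId="table/DeviceData",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    ScheduledActionName="night-scale-down",
    Schedule="cron(0 22 * * ? *)",  # 22:00 UTC every day
    ScalableTargetAction={"MinCapacity": 20, "MaxCapacity": 200},
)

autoscaling.put_scheduled_action(
    ServiceNamespace="dynamodb",
    ResourceId="table/DeviceData",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    ScheduledActionName="morning-scale-up",
    Schedule="cron(0 8 * * ? *)",   # 08:00 UTC every day
    ScalableTargetAction={"MinCapacity": 100, "MaxCapacity": 1000},
)
```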

Lesson 9: Be super clear with unit calculations and prefer eventual consistent reads

It is super important to be clear about how many units each request (read or write) will consume, as this directly influences the bill. Strongly consistent reads consume twice the units of eventually consistent reads, so you should prefer eventually consistent reads unless strong consistency is specifically required.

Querying is, in general, more efficient than scanning. A scan goes through the entire table (or all projected attributes, in the case of secondary indexes), so you end up paying more for it. A well-designed schema has its partition and sort keys chosen so that you never have to scan and can always query, saving costs.

It is also worth noting that you can request only a subset of attributes while querying, but this has no impact on the item size calculation and hence on the cost.

Unit calculation can get tricky depending on the DynamoDB API used. For example, BatchGetItem rounds each item up to the nearest 4 KB boundary, while Query rounds the sum of item sizes up to the nearest 4 KB boundary. As an example, suppose 2 items of 1.5 KB and 6.5 KB are fetched: BatchGetItem counts 12 KB (4 KB + 8 KB), while Query counts 8 KB.
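
The rounding rules are easy to encode, which helps when estimating bills; this small helper assumes 4 KB read units and halves the result for eventually consistent reads:

```python
import math

def batch_get_read_units(item_sizes_kb, eventually_consistent=True):
    # BatchGetItem rounds every item up to the next 4 KB individually.
    kb = sum(math.ceil(size / 4) * 4 for size in item_sizes_kb)
    units = kb / 4
    return units / 2 if eventually_consistent else units

def query_read_units(item_sizes_kb, eventually_consistent=True):
    # Query sums the item sizes first, then rounds once to the next 4 KB.
    kb = math.ceil(sum(item_sizes_kb) / 4) * 4
    units = kb / 4
    return units / 2 if eventually_consistent else units

# The 1.5 KB + 6.5 KB example from above:
print(batch_get_read_units([1.5, 6.5]))  # 12 KB read -> 1.5 units (eventual)
print(query_read_units([1.5, 6.5]))      # 8 KB read  -> 1.0 unit  (eventual)
```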

Similarly, when updating even a subset of attributes using UpdateItem, the write unit calculation is based on the larger of the complete item sizes before and after the update, irrespective of which attributes were updated.

New features and improvements land in DynamoDB every year, so stay up to date with the current limits and optimizations by referring to the DynamoDB developer guide and documentation, and use that knowledge wisely to save costs over time.

Lesson 10: Use streams and relational databases for ad-hoc queries and analytics

Sometimes you need to run ad-hoc or analytics queries over the data. In DynamoDB this can turn pretty painful, because the keys are selected for the usual access patterns, so such queries often end up scanning the table, which results in high bills.

At FinBox, our data science team continuously works on newer credit models, and the business development team also needs analytics. To fulfill these needs, we use DynamoDB Streams: as soon as data arrives, we stream it to a relational store (RDS or Redshift) and use that as the source for such queries. This gives us much more flexibility in the queries we can run, with no capacity of the main table being consumed.
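
A hedged sketch of such a stream consumer as a Lambda handler that copies new items into a PostgreSQL-flavoured RDS table; the connection string, table, and attribute names are assumptions, not our actual pipeline:

```python
import os

import psycopg2  # assumes a PostgreSQL RDS target and this driver bundled with the Lambda

# Reuse the connection across invocations; RDS_DSN is a hypothetical env var.
conn = psycopg2.connect(os.environ["RDS_DSN"])

def handler(event, context):
    """Triggered by the table's DynamoDB Stream; copies INSERTs to RDS."""
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            if record["eventName"] != "INSERT":
                continue
            image = record["dynamodb"]["NewImage"]
            # Stream images use the low-level attribute format, e.g. {"S": "..."}.
            cur.execute(
                "INSERT INTO device_events (user_id, payload) VALUES (%s, %s)",
                (image["user_id"]["S"], image["payload"]["S"]),
            )
    return {"processed": len(event["Records"])}
```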

Lesson 11: Storage can be expensive, use TTL on items whenever possible

Cost comparison for storage in USD per GB per month in the ap-south-1 region

Besides throughput costs, people often ignore the storage costs of DynamoDB. We at FinBox learned this the hard way! As the graph shows, DynamoDB storage costs can add up on top of the usual throughput costs. A good strategy to address this is to use TTL (Time To Live) attributes: when data is deleted automatically on TTL expiry, there are no charges, unlike the DeleteItem operation, which is charged based on the size of the deleted item. For the same reason, prefer deleting the whole table over deleting items individually if you ever have to delete all items from a DynamoDB table.
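
Enabling TTL and writing items with an expiry timestamp looks roughly like this (the table name, key schema, and 90-day retention are placeholders):

```python
import time

import boto3

client = boto3.client("dynamodb")

# Turn on TTL for the table, pointing it at an epoch-seconds attribute.
client.update_time_to_live(
    TableName="DeviceData",  # hypothetical table
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Items then carry their own expiry; DynamoDB deletes them for free some
# time after the timestamp passes (deletion is asynchronous).
table = boto3.resource("dynamodb").Table("DeviceData")
table.put_item(Item={
    "user_id": "user-1",                               # assumed partition key
    "created_at": int(time.time()),                    # assumed sort key
    "expires_at": int(time.time()) + 90 * 24 * 3600,   # keep for ~90 days
})
```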

At FinBox, since we stream data to relational databases, we still have the data in relational storage even after items expire via TTL; this helps us serve requests for older data (with higher latency) later on if clients ask for it. Another strategy we follow for a few of our tables is to use streams to capture the deletes caused by TTL expiry and archive those items to S3 from a Lambda invocation.
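
A sketch of that archiving Lambda: TTL deletions can be recognized in the stream because they are performed by the DynamoDB service principal. The bucket and key layout here are hypothetical:

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "finbox-archive-example"  # hypothetical bucket

def handler(event, context):
    """Triggered by the table's stream; archives TTL-expired items to S3."""
    for record in event["Records"]:
        if record["eventName"] != "REMOVE":
            continue
        # TTL deletions carry the DynamoDB service principal in userIdentity,
        # which is how they can be told apart from application deletes.
        identity = record.get("userIdentity", {})
        if identity.get("principalId") != "dynamodb.amazonaws.com":
            continue
        old_image = record["dynamodb"]["OldImage"]
        key = "ttl-archive/{}.json".format(record["dynamodb"]["SequenceNumber"])
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(old_image))
```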

Backup and Restore

Since we are discussing storage, let’s talk about backup and restore as well. A restore in DynamoDB only works by creating a new table; you cannot restore by overwriting the table the backup was made from. Compared to RDS, where backup snapshots are free up to the size of your instance (and $0.095/GB/month after the instance is terminated), DynamoDB charges $0.114/GB/month for on-demand backups, $0.228/GB/month for continuous backups, and $0.171/GB for restores in the ap-south-1 region.

If you are using DynamoDB Streams to stream the data to RDS, you also have the option of skipping DynamoDB backups (since they are costly) and setting them up for RDS instead, using those for disaster recovery. S3 can also be a good choice for backups.

Storing Large Items

Also worth pointing out: DynamoDB has a limit of 400 KB per item, which can be quite small in some cases. Common patterns for storing large items are compressing the data or storing the data in S3 and keeping only a pointer to the object in DynamoDB.
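
A rough sketch of both patterns together: compress the payload, keep it inline if it stays comfortably under the limit, otherwise push it to S3 and store only the key (the table, bucket, and the 350 KB cut-off are assumptions):

```python
import gzip
import json

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("DeviceData")  # hypothetical table (user_id as key)
BUCKET = "finbox-large-items-example"                   # hypothetical bucket

def put_large_item(user_id, payload):
    body = gzip.compress(json.dumps(payload).encode("utf-8"))
    if len(body) < 350_000:
        # Small enough after compression: store it inline as a binary attribute.
        table.put_item(Item={"user_id": user_id, "payload_gz": body})
    else:
        # Too big even compressed: park the blob in S3 and keep a pointer.
        key = "payloads/{}.json.gz".format(user_id)
        s3.put_object(Bucket=BUCKET, Key=key, Body=body)
        table.put_item(Item={"user_id": user_id, "payload_s3_key": key})
```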

Lesson 12: Save costs on local and dev environments

While developing features that use DynamoDB, developers will want to test their code. To reduce this cost to some extent, mock platforms like LocalStack can be used. LocalStack supports DynamoDB, and even DynamoDB Streams, in the free version!
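
Pointing boto3 at LocalStack is just a matter of overriding the endpoint; a minimal sketch (newer LocalStack versions expose everything on port 4566, older ones used 4569 for DynamoDB, and the credentials are dummies):

```python
import boto3

# Talk to LocalStack instead of AWS during local development.
dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:4566",
    region_name="ap-south-1",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

table = dynamodb.create_table(
    TableName="Job",
    KeySchema=[
        {"AttributeName": "jobtype", "KeyType": "HASH"},
        {"AttributeName": "status_create", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "jobtype", "AttributeType": "S"},
        {"AttributeName": "status_create", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```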

Also, unless you are doing load testing, you should always configure lower capacity for DynamoDB tables used solely in development environments.

Lesson 13: Monitor things well

Some useful metrics to monitor (a sketch of a CloudWatch alarm on throttled writes follows this list):

  • Difference between provisioned and consumed throughput: this can help you identify over-provisioning
  • Throttled reads/writes: to identify under-provisioning or hot partitions
  • System errors: these are errors thrown by AWS
  • CloudWatch contributor insights: to identify hot partitions and specific keys for which throttling occurred
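
For example, an alarm on write throttles can be created like this; the table name, threshold, and SNS topic are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert as soon as any write throttle shows up on the table.
cloudwatch.put_metric_alarm(
    AlarmName="DeviceData-write-throttles",
    Namespace="AWS/DynamoDB",
    MetricName="WriteThrottleEvents",
    Dimensions=[{"Name": "TableName", "Value": "DeviceData"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-south-1:123456789012:alerts"],  # hypothetical topic
)
```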

Lesson 14: Buy reserved capacity whenever possible

Last but not least, for predictable capacity consumption over the year, you can also buy reserved capacity by paying partially or fully upfront. This can help you save even more.

PS: We recently open-sourced a script that we use internally to generate actionable items for saving AWS costs. You can check it out here. Feel free to contribute and add more features to it.
