This is post #4 of the series aimed at exploring DynamoDB in detail. If you haven’t read the earlier posts, you can find them here.
This post presents some efficient design patterns, along with some key considerations involved in designing your read/write operations on DynamoDB tables. If you are not interested in reading through the entire blog and want to jump straight to the summary, click here.
While a good data model is a key to the performance of DynamoDB, it is not the only key factor. Design of read/write operations also plays a major role in ensuring that your services get the best performance out of DynamoDB. In this section, we’ll take a look at some of the key factors that affect the performance/cost of read/write operations.
Write consistency is not configurable in DynamoDB, but read consistency is. You have two consistency options for reads:
- Eventual consistency
- Strong consistency
Key differences between these two consistency levels are listed in the below table:
If your service can function satisfactorily without strongly consistent reads, it is better to go with eventually consistent reads, for both cost and performance reasons.
Please note, if you’re using Global tables, you will have to be content with eventual consistency globally. So, there is little benefit in being strongly consistent on individual replicas. Use eventual consistency everywhere.
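With the low-level GetItem API, consistency is chosen per request via the ConsistentRead flag. Below is a minimal sketch of the two request shapes, assuming a hypothetical "Hotels" table keyed on Hotel_ID (the boto3 call itself is left commented out):

```python
# Eventually consistent read: ConsistentRead is simply omitted (it defaults
# to False) and the read costs half the RCUs of a strongly consistent one.
eventual_read = {
    "TableName": "Hotels",                 # hypothetical table name
    "Key": {"Hotel_ID": {"N": "1"}},
}

# Strongly consistent read: opt in explicitly, per request.
strong_read = {
    **eventual_read,
    "ConsistentRead": True,
}

# import boto3
# client = boto3.client("dynamodb")
# item = client.get_item(**strong_read)["Item"]
```

The flag is per request, not per table, so a mostly-eventual workload can still make the occasional strongly consistent read where it matters.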
As mentioned in the “Partitioning” section, in my earlier post, different partition keys can go into different partitions and each partition has its own share of provisioned throughput. Consider the example of a hypothetical “Landmarks” table shown below.
Below is an example partition view of this table.
Assume you have provisioned 6 WCU for the table and, post partitioning, each partition has 1 WCU provisioned. Now, if your access pattern is to update the "Description" attribute of every landmark for each hotel ID, you might do this in a couple of ways.
Approach 1: Iterate over each hotel ID and update the "Description" attribute of each of its landmarks
A typical write path, in this case, might look like the one shown below.
As can be seen above, updating all the items of one partition key first and then moving on to the next is not the most efficient approach. Because Hotel_IDs 1 and 2 are in Partition-1, the maximum write rate on that partition is limited to 1 WCU (based on what you provisioned). So, even though you still have 5 WCUs unused, you cannot get more than 1 WCU of throughput. This is inefficient.
Approach 2: Update the attributes of different hotel IDs in parallel
A typical write path, in this case, might look like the one shown below.
As can be seen from the above figure, with this approach, because you are writing to different partitions concurrently, you can fully utilize the write capacity of the table and achieve a maximum of 6 WCU.
By merely changing your write approach, you could increase your DynamoDB throughput several times over (6 times in this case), without making any changes to your data model or increasing the provisioned throughput.
The same logic applies for reads as well.
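Approach 2 can be sketched as follows. The updates are interleaved so that consecutive writes target different hotel IDs (and therefore, with luck, different partitions), and a thread pool issues them concurrently. The write itself is stubbed out here so the pattern is visible without an AWS call; in a real service the stub body would be a `table.update_item(...)` call:

```python
from concurrent.futures import ThreadPoolExecutor

# Interleave: iterate landmarks in the outer loop so that consecutive
# updates hit different hotel IDs instead of exhausting one hotel at a time.
updates = [
    {"Hotel_ID": hotel_id, "Landmark": landmark}
    for landmark in ("Central Park", "Times Square")   # hypothetical sort keys
    for hotel_id in (1, 2, 3, 4, 5, 6)
]

def write_item(item):
    # Stub for the real write (e.g. table.update_item(Key=..., ...));
    # returns the key it would have written, for illustration.
    return (item["Hotel_ID"], item["Landmark"])

# Six workers, matching the six hotel IDs spread across partitions.
with ThreadPoolExecutor(max_workers=6) as pool:
    written = list(pool.map(write_item, updates))
```

Because neighbouring requests land on different partition keys, the writes can draw on the full table-level throughput rather than queuing behind a single partition's 1 WCU share.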
Because DynamoDB is a managed service, you have no visibility into which partition key goes into which partition. So, the best approach to writing parallel requests is to randomize the order of your partition keys as much as possible, increasing the probability of writing to different partitions.**
** DynamoDB adaptive capacity can “loan” IO provisioning across partitions, but this can take several minutes to kick in. Additionally, it wouldn’t work at all for the type of update in the above example because you would be trying to write more than provisioned, successively to different partitions. Adaptive capacity cannot really handle this as it is looking for consistent throttling against a single partition.
** DynamoDB on-demand capacity mode would allow all the writes in the example to execute with no throttling, but it is much more expensive, and at high IO rates you still hit the same problem: the IO per partition is limited to 1000 WCU or 3000 RCU even in on-demand mode.
Document store or attribute store?
With DynamoDB, there are costs to reading and writing data. These costs are packaged into RCU (Read Capacity Units) and WCU (Write Capacity Units).
1 RCU = one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half that)
1 WCU = one write per second of an item up to 1 KB
So, if you have a wide column table with a number of attributes per item, it pays to retrieve only the attributes that are required. The same applies to writes: DynamoDB gives you the option to update individual attributes of an item, so it is more efficient to update only the required attributes rather than rewriting the item altogether.
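An attribute-level update is expressed through UpdateItem's UpdateExpression. A minimal sketch of the request shape, assuming a hypothetical "Customers" table keyed on Customer_ID (the boto3 call is commented out):

```python
# Update only the "City" attribute, leaving every other attribute of the
# item untouched. Table, key, and attribute names are illustrative.
update_request = {
    "TableName": "Customers",
    "Key": {"Customer_ID": {"S": "cust-42"}},
    "UpdateExpression": "SET City = :c",      # touch just this one attribute
    "ExpressionAttributeValues": {":c": {"S": "New York"}},
}

# import boto3
# boto3.client("dynamodb").update_item(**update_request)
```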
Example Scenario: Assume your service has a JSON document (shown below) that contains customer information and you want to save this in DynamoDB for future reference.
```
{
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York"
  },
  "phoneNumbers": [
    { "number": "212 555-1234" },
    { "number": "646 555-4567" }
  ]
}
```
You can store this in DynamoDB in a couple of ways:
- You can either store this entire document as an attribute
Considering this table structure, if you want to retrieve only the first name of a given customer, you have to retrieve the entire document and parse it, in order to get the first name. This means a possibly higher latency due to the additional overhead of parsing plus the additional time spent over the wire.
- Alternatively, you can store each parameter within the JSON document as a separate attribute in DynamoDB
In this case, if you want to retrieve only the first name of the customer, you can retrieve the single attribute “First_Name”. This saves you the parsing time and less time will be spent on the network because the payload is smaller.
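With the attribute-store layout, the single-attribute read is expressed via GetItem's ProjectionExpression. A sketch of the request shape, with the same hypothetical "Customers" table as above (note this trims the payload and the parsing, though, as the note below points out, not the RCU cost):

```python
# Ask DynamoDB to return only the "First_Name" attribute of the item.
get_request = {
    "TableName": "Customers",
    "Key": {"Customer_ID": {"S": "cust-42"}},
    "ProjectionExpression": "First_Name",   # only this attribute comes back
}

# import boto3
# resp = boto3.client("dynamodb").get_item(**get_request)
# first_name = resp["Item"]["First_Name"]["S"]
```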
Note: Irrespective of whether you request the entire item or just a single attribute of it, the cost of the read operation is the same, due to the way DynamoDB reads work internally. Thanks to Nagarjuna Yendluri for pointing this out in his comment.
DynamoDB gives you the option to run multiple reads/writes, as a single batch. There are a few fundamental concepts to keep in mind while using DynamoDB batches.
- Unlike some other NoSQL datastores, DynamoDB batches are not atomic: if one of the reads/writes in a batch fails, the entire batch does not fail. Instead, the client receives information on the failed operations, so it can retry them.
- A batch is not treated as a single operation on the DynamoDB side. Each read/write within the batch consumes capacity, and succeeds or fails, as an individual request. So, even if you group 100 reads into a single batch at the client, DynamoDB still processes 100 individual read requests.
Considering the above facts, if you're wondering why use batching at all, there are a couple of reasons:
- Reads/writes in DynamoDB batches are sent and processed in parallel. This is certainly faster than individual requests sent sequentially and also saves the developer the overhead of managing thread pools and multi-threaded execution
- Reads and writes in batch operations are similar to individual reads and writes, but not identical. To keep large-scale operations fast, the batch APIs drop a few features: for example, you cannot specify conditions on individual put and delete requests within BatchWriteItem, and BatchWriteItem does not return deleted items in the response.
If your use case involves a need to run multiple read/write operations to DynamoDB, batching might be a more performant option, than individual read/write requests.
Note: There is a limit of 16 MB of payload per batch, and 25 write requests (BatchWriteItem) or 100 read requests (BatchGetItem) per batch.
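As an illustration, here is the BatchGetItem request shape for reading several items (and only one attribute of each) in a single call. The table and key names are the same hypothetical ones as earlier; the retry of unprocessed keys is sketched in comments because, as noted above, the batch is not atomic:

```python
# Up to 100 keys per BatchGetItem; here, three hypothetical customers.
keys = [{"Customer_ID": {"S": f"cust-{i}"}} for i in (1, 2, 3)]

batch_request = {
    "RequestItems": {
        "Customers": {
            "Keys": keys,
            "ProjectionExpression": "First_Name",  # trim each item's payload
        }
    }
}

# import boto3
# client = boto3.client("dynamodb")
# resp = client.batch_get_item(**batch_request)
# Any keys DynamoDB could not process come back in resp["UnprocessedKeys"];
# the caller is responsible for retrying them until that map is empty.
```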
Avoid full table scans:
DynamoDB offers two operations that can retrieve multiple items per request: Scan and Query.
To run a Scan request against a DynamoDB table, you do not need to specify any criteria (not even the Partition Key). A Scan walks the entire table item by item until it has read up to 1 MB, at which point a response is sent to the client with the scanned items and a LastEvaluatedKey pointer, which you pass back (as ExclusiveStartKey) to continue the scan from where it stopped.
Query operations are slightly different. To run a Query request against a table, you need to specify at least the Partition Key. A Query retrieves all the items belonging to a single Partition Key, up to a limit of 1 MB, beyond which you use the LastEvaluatedKey to paginate the results.
For the key differences between Query and Scan operations, refer to the below table.
Because you do not need to specify any key criteria to retrieve items, Scan requests can be an easy way to start getting at the items in a table. But this comes at a cost: as you're not specifying the Partition Key, a Scan has to walk through all the items in all the partitions. This can be a very expensive way to find what you're looking for.
Query requests are a better option for retrieving multiple items that share a Partition Key: not only does the query run against a single partition, you can also specify the sort key to narrow down the results even further. So, Query requests are expected to be much faster than Scan requests.
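The contrast shows up directly in the two request shapes. A sketch, reusing the hypothetical "Landmarks" table from earlier (Hotel_ID as partition key, Landmark as sort key):

```python
# Query: pinned to one partition key, optionally narrowed on the sort key.
query_request = {
    "TableName": "Landmarks",
    "KeyConditionExpression": "Hotel_ID = :h AND begins_with(Landmark, :l)",
    "ExpressionAttributeValues": {
        ":h": {"N": "1"},            # single partition key
        ":l": {"S": "Central"},      # sort-key prefix to narrow results
    },
}

# Scan: no key criteria at all, so every item in every partition is examined.
scan_request = {
    "TableName": "Landmarks",
}

# import boto3
# client = boto3.client("dynamodb")
# items = client.query(**query_request)["Items"]
```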
Avoid Scan operations. Use Query operations instead (along with indexes*, if required), wherever possible.
If for any reason you need to get all the items in the table, you can use Scan requests, but please note that this puts heavy stress on your provisioned throughput. So, look at mitigating measures such as rate limiting, parallel scans, and reducing the page size.
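Two of those measures, parallel scans and a smaller page size, are expressed through Scan's Segment, TotalSegments, and Limit parameters. A sketch of the per-worker request shapes (the scan loop itself is stubbed; a real worker would call client.scan(**params) and feed LastEvaluatedKey back in as ExclusiveStartKey):

```python
TOTAL_SEGMENTS = 4  # one logical segment per worker; value is illustrative

def segment_params(segment, start_key=None):
    """Build the Scan parameters for one worker's segment of the table."""
    params = {
        "TableName": "Landmarks",        # hypothetical table from earlier
        "Segment": segment,              # which slice this worker scans
        "TotalSegments": TOTAL_SEGMENTS,
        "Limit": 25,                     # small pages smooth consumed throughput
    }
    if start_key:
        # Resume from where the previous page's LastEvaluatedKey stopped.
        params["ExclusiveStartKey"] = start_key
    return params

# One request shape per segment; each could be driven by its own worker.
requests = [segment_params(s) for s in range(TOTAL_SEGMENTS)]
```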
* Please refer to my other blog post DynamoDB: Efficient Indexes, to learn more about indexes.
- If you do not need strongly consistent reads, always go with eventually consistent reads. Eventually consistent reads are faster and cost less than strongly consistent reads
- You can increase your DynamoDB throughput by several times, by parallelizing reads/writes over multiple partitions
- Use DynamoDB as an attribute store rather than as a document store. This reduces payload sizes and improves the performance of your read/write operations considerably
- Use batching, wherever you can, to parallelize requests to DynamoDB. Batching offers an optimized, parallel request execution without burdening the developer with the overhead of managing thread pools
- Avoid full table scans. Consider using Query operations along with indexes as an alternative, wherever possible
I hope this blog gave you a reasonable insight into designing faster read/write operations on DynamoDB tables.
The next and final article — part 5 of this series will focus on the key considerations involved in creating and using DynamoDB indexes.