DynamoDB: Data Modeling

Sasidhar Sekar
Expedia Group Technology
15 min read · Dec 13, 2018

This is post #3 of the series aimed at exploring DynamoDB in detail. If you haven’t read the earlier posts, you can find the links below.

This post presents some efficient data modeling techniques and some key considerations to note while creating and accessing DynamoDB tables. If you are not interested in reading through the entire blog and want to jump straight to the summary, skip ahead to the “To Summarize” section at the end of this post.

A few Q&A to start off with. In DynamoDB terms,

  • What is Data Modeling? It is the process by which users can configure how their data needs to be organized within the table.
  • Why do you need Data Modeling? The time taken to store and retrieve data to/from DynamoDB is dependent on how the data is organized. Data modeling helps you organize the data so storage and retrieval operations can be done as fast as possible.

A good Data Model is key to the performance, and consequently the scalability, of your DynamoDB table.

Understanding the need for a good data model:

Partitioning:

For the purposes of illustration, let’s consider a table named “Landmarks” with the following data structure.

DynamoDB: Sample Table for Illustration

The table lists the hotels in key cities in the UK, along with key landmarks around each hotel. Assume there are 60,000 items in the table and that every 10,000 records amount to 10 GB of data, making the total size of the table 60 GB.

When you create a DynamoDB table, it is mandatory that you choose an attribute as the Partition Key of the table. This allows DynamoDB to split the entire table data into smaller partitions, based on the Partition Key. So, if Hotel_ID is chosen as the table’s Partition Key, the table shown above will be split into partitions as shown below.

DynamoDB: Partitions on the Sample Table
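For concreteness, below is a minimal sketch of how such a table might be created with boto3, the Python AWS SDK. The composite primary key (Hotel_ID as partition key, Description as sort key) matches the get-item example that follows; the region and throughput figures are illustrative assumptions.

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="eu-west-1")  # illustrative region

    # Hotel_ID is the partition (HASH) key that drives partitioning;
    # Description is the sort (RANGE) key. Throughput values are placeholders.
    dynamodb.create_table(
        TableName="Landmarks",
        AttributeDefinitions=[
            {"AttributeName": "Hotel_ID", "AttributeType": "N"},
            {"AttributeName": "Description", "AttributeType": "S"},
        ],
        KeySchema=[
            {"AttributeName": "Hotel_ID", "KeyType": "HASH"},
            {"AttributeName": "Description", "KeyType": "RANGE"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 6000, "WriteCapacityUnits": 2000},
    )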

Now, assume you have a read operation like the one below.

aws dynamodb get-item --table-name Landmarks --key file://key.json

The arguments for get-item are stored in a JSON file named “key.json”. Note that get-item must be given the full primary key; here that is the Hotel_ID partition key plus the Description sort key. Here are the contents of that file:

{ "Hotel_ID": {"N": 1},"Description": {"S": "National Gallery"}  }

The read path for this operation will look like the one shown below.

DynamoDB: Read Path on the Sample Table

As can be seen above, DynamoDB routes the request to the exact partition that contains Hotel_ID 1 (Partition-1, in this case).

How does it do this? As mentioned above, DynamoDB splits table data into partitions based on the Partition Key. This happens in several steps:

  • Step 1: Each unique Partition Key (Hotel_ID here) is passed through a hashing algorithm
  • Step 2: The entire table data is split into several “hash ranges”, based on data size and provisioned reads/writes
  • Step 3: Each “hash range” contains data corresponding to one or more Partition Keys

When there is a read request, the following happens (a conceptual sketch of this routing follows the list):

  • Step 1: Client sends a read request to DynamoDB
  • Step 2: DynamoDB hashes the Partition Key specified in the request
  • Step 3: DynamoDB looks up the Partition that is responsible for this hash
  • Step 4: DynamoDB routes the request to that specific Partition
  • Step 5: Data is returned to the client
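DynamoDB’s actual hash function and range assignments are internal and not exposed, but conceptually the routing behaves like the sketch below. The use of MD5 and the three hash-range boundaries are illustrative assumptions, not DynamoDB’s real scheme.

    import hashlib
    from bisect import bisect_right

    # Hypothetical hash-range boundaries for a 3-partition table. DynamoDB's
    # real hash function and boundaries are internal and not exposed.
    PARTITION_UPPER_BOUNDS = [2**126, 2**127, 2**128]

    def partition_for(partition_key: str) -> int:
        # Hash the partition key, then find which range (partition) owns it.
        key_hash = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
        return bisect_right(PARTITION_UPPER_BOUNDS, key_hash)

    # The same partition key always hashes to the same value, which is why
    # DynamoDB can route a get-item request to exactly one partition.
    print(partition_for("1"))  # Hotel_ID 1 always routes to the same partition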

This is the ideal scenario: every request should hit no more than one Partition.

The ideal scenario is to have every request hitting no more than one Partition.

So, it does look like our choice of Partition Key is a good one. But, let’s consider another case.

Assume “London” is the favorite destination for the users of your service. Say requests for hotels in London account for approximately 60% of your total traffic, while the rest of the destinations see a more or less even load. Below is how your Partition usage might look.

DynamoDB: Hotspotting

This is called “Hotspotting”*. Even though you have data distributed evenly across Partitions, the workload on the table focuses most of the traffic on only one Partition while the other Partitions sit relatively unused. This is not desirable.

Why? The DynamoDB model mandates that you provision throughput at the table level. Say, for example, that when you create your table you provision 6000 RCUs and 2000 WCUs based on the expected workload**. Now, if your table contains 2 Partitions, this throughput will be split evenly between the two, as shown below.

DynamoDB: Partitioning (Credit: DynamoDB Documentation)

** This illustration and the rest of the sections in this article are with respect to the provisioned capacity mode only and do not consider the new on-demand capacity mode. Please refer to the “On-demand Capacity Mode” section for key considerations related to this mode

Assuming a total workload of 6000 4 KB reads/sec, if the workload is evenly distributed, each partition will serve 3000 reads/sec without any throttling. But if the workload is not evenly distributed (hotspotting), then one partition can get more requests than the other.

Say, for example, one partition (P1) gets 3500 reads/sec and the other (P2) gets 2500 reads/sec. Requests to P2 will be served without any issues as they are well under the provisioned throughput allocated to that partition. But, requests to P1 exceed the throughput allocated and hence will be throttled.

Note: DynamoDB has “adaptive capacity” enabled by default, which can reassign unused provisioned throughput from P2 to P1, but that will not save the day in this case, for two reasons:

  1. Even though your table has not used up 100% of the provisioned throughput, because each partition is limited to a maximum of 3000 RCUs, adaptive capacity cannot increase the provisioned throughput of P1 any further
  2. Even in cases where this hard limit of 3000 RCUs is not reached, adaptive capacity can take anywhere between 5–30 minutes to kick in. So, there is going to be throttling for a good period of time

The choice of Partition Key can lead to “hotspotting” and throttling of requests

This is exactly why a proper Data Model is key to the performance and scalability of your services.

* Refer to the “Avoid Hotspotting” section below for possible solutions to the hotspotting problem

Guidelines for designing a good Data Model:

Start with the access patterns

One of the fundamental differences between a relational database model and a NoSQL model is this: while the key goal of the relational model is to support arbitrary queries that dynamically bring out the relationships between entities, the goal of a NoSQL model is to serve a set of your most important predefined queries as fast as possible.

So, the fundamental requirement for a good Data Model with DynamoDB is to identify the queries that are most important to your service, i.e., your service’s “access patterns”.

Let’s consider the same “Landmarks” table specified earlier in this post and look at how the table’s Data Model should change based on your service’s access pattern.

Assume, your access pattern is to get all the attributes of an item, given a hotel id and a landmark’s description. For this access pattern, you can design your data model in either of the two ways, shown below.

DynamoDB: Access Patterns and Data Models

You can have “Hotel_ID” as your Hash Key and “Description” as your Range Key or vice versa.

So, which one would you choose?

  • Never choose an attribute with low cardinality. Example: If you have only 100 hotels and 10,000 unique landmarks (specified by the “Description” attribute), it is better not to choose “Hotel_ID” as your Hash/Partition Key; “Description” would be a better Hash/Partition Key
  • Look at any potential expansion to your access patterns. Example: If you foresee that you will have a new access pattern, where you will be retrieving a list of all the landmarks around a given hotel, then it is better to choose “Hotel_ID” as your Partition Key. Refer to the below diagram to see why.
DynamoDB: Table with “Hotel_ID” as Partition Key
DynamoDB: Table with “Description” as Partition Key

Never choose an attribute with low cardinality.

As can be seen above, when you partition by “Hotel_ID”, a query to retrieve all landmarks around a given Hotel_ID has to go to only one Partition. By contrast, if the table is partitioned by “Description”, the same query has to go to multiple partitions to get the same results. The scatter-gather effect observed with the latter is detrimental to query performance and hence is not recommended for the given access pattern.
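To make this concrete, with Hotel_ID as the partition key the “all landmarks around a given hotel” access pattern is a single Query against a single partition. A minimal boto3 sketch, reusing the sample table’s names (the region is an illustrative assumption):

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="eu-west-1")  # illustrative region

    # With Hotel_ID as the partition key, all landmarks for one hotel live in
    # a single partition, so one Query call retrieves them all.
    response = dynamodb.query(
        TableName="Landmarks",
        KeyConditionExpression="Hotel_ID = :h",
        ExpressionAttributeValues={":h": {"N": "1"}},
    )
    for item in response["Items"]:
        print(item["Description"]["S"])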

Avoid Hotspotting:

One of the ways to avoid “hotspotting” is to not choose a low-cardinality attribute as your table’s Partition Key. If, despite that, the nature of your workload still causes “hotspotting” in your dataset, then “Write Sharding” might be a good option for you.

Let’s continue with the same access pattern as mentioned above: get all the attributes of an item, given a hotel id and a landmark’s description. But this time, assume hotels in London receive 60% of the total traffic. This leads to “hotspotting”.

In this case, “Write Sharding” might be a good option, if:

  • you are reading only one item at a time and/or
  • you have a write-heavy workload

You can add a random number to the partition key values to distribute the items among partitions (example: 1_12345, 1_23456, etc., for the same Hotel_ID 1), or you can use a suffix calculated from something that you are querying on (example: 1_nationalgallery, 1_buckinghampalace, etc.).

Add a salt value to the partition key values to distribute the items among partitions
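Below is a minimal sketch of the calculated-suffix approach: the shard suffix is derived deterministically from the landmark description, so both the writer and any single-item reader can recompute the full key. The shard count of 10 is an assumption to tune for your workload, and note that the sharded Hotel_ID becomes a string rather than a number.

    import hashlib

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="eu-west-1")  # illustrative region
    NUM_SHARDS = 10  # assumption: tune to your workload

    def sharded_key(hotel_id: int, description: str) -> str:
        # Deterministic suffix: the same (hotel, landmark) pair always maps to
        # the same shard, so single-item reads can recompute the full key.
        suffix = int(hashlib.md5(description.encode()).hexdigest(), 16) % NUM_SHARDS
        return f"{hotel_id}_{suffix}"

    dynamodb.put_item(
        TableName="Landmarks",
        Item={
            "Hotel_ID": {"S": sharded_key(1, "National Gallery")},  # e.g. "1_3"
            "Description": {"S": "National Gallery"},
        },
    )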

Below is the data model and the corresponding partition view of the “Landmarks” table with “Write Sharding”.

DynamoDB: Write Sharding
DynamoDB: Partition View and Read Path on a Write Sharded Table

Please note, while this might be a good option if you are reading/writing one item per request, it might not be the most efficient option if you need to retrieve all the items for a given partition key. As you can see above, given that the data belonging to a single partition key is distributed across multiple partitions, read requests have to be executed against the different partitions and the results merged together at the end to send a compiled response back to the client, as the sketch below illustrates. This is not very efficient.
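A sketch of that scatter-gather read path, continuing the sharding scheme above (again, the shard count is an assumption and must match the one used on the write path):

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="eu-west-1")  # illustrative region
    NUM_SHARDS = 10  # must match the shard count used on the write path

    # Reading *all* items for Hotel_ID 1 now takes one Query per shard,
    # merged client-side: the scatter-gather overhead described above.
    all_items = []
    for shard in range(NUM_SHARDS):
        response = dynamodb.query(
            TableName="Landmarks",
            KeyConditionExpression="Hotel_ID = :h",
            ExpressionAttributeValues={":h": {"S": f"1_{shard}"}},
        )
        all_items.extend(response["Items"])
    print(f"Merged {len(all_items)} items from {NUM_SHARDS} shards")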

Expire/Archive Old Data:

The cost of a DynamoDB table can be split into 3 major components:

  • Cost of Reads
  • Cost of Writes
  • Cost of Storage

As your table grows in size over time, the storage costs keep increasing. In addition, as the volume of data in the table grows, the table is split into smaller partitions, with each partition getting a portion of the table’s provisioned RCUs/WCUs. So, if you have a table that stores, say, time-series data, and reads/writes touch only the most recent day/month/quarter/year, then it is advisable to delete/archive older (unused) data, for the following reasons.

  • You don’t want to pay the storage costs of data that you are not using
  • You don’t want a large portion of your provisioned throughput going unused. This can cost you a lot of money, for nothing!

The below diagram illustrates this effect on a sample table containing time-series data.

DynamoDB: Partition View of a Table with Timeseries data

As can be seen above, if you have large volumes of outdated/unused data in your table, you could be paying several times the cost of your actual usage.

If you have large volumes of outdated/unused data in your table, you could be paying several times the cost of your actual usage.

So, when you’re designing your data model, if you know you’re going to have data that will not be actively used after a period of time, it is recommended to do one or more of the following (a minimal TTL sketch follows the list):

  • Use TTLs to expire items after a pre-defined period of time
  • Plan regular archival tasks (to say, S3) for unused data
  • Plan regular delete tasks to clear up unused data altogether
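TTL expiry, for example, needs only a one-time table setting plus a numeric attribute on each item holding an epoch expiry timestamp. A minimal boto3 sketch; the attribute name “expires_at” and the 90-day window are assumptions.

    import time

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="eu-west-1")  # illustrative region

    # One-time setting: tell DynamoDB which attribute holds the expiry time.
    dynamodb.update_time_to_live(
        TableName="Landmarks",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )

    # Each item carries its own expiry as an epoch timestamp (90 days out here).
    # DynamoDB deletes expired items in the background at no extra cost.
    dynamodb.put_item(
        TableName="Landmarks",
        Item={
            "Hotel_ID": {"N": "1"},
            "Description": {"S": "National Gallery"},
            "expires_at": {"N": str(int(time.time()) + 90 * 24 * 3600)},
        },
    )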

Global Tables:

By default, a DynamoDB table is local to the region in which it is hosted. But if you have a use case where multiple systems in different regions write to the same table, then Global tables might be a good option for you. Two things to keep in mind when creating Global tables (a sketch of both steps follows the list):

  • You need to make a table “Global” during its creation itself. You cannot change a normal table into a global table at a later time (there is a feature request open with Amazon to enable this, though, most likely to be available sometime in 2019)
  • Enable streams on the table. DynamoDB relies on streams to synchronize the replicas. So, this is absolutely mandatory
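A minimal boto3 sketch of both steps, per the version of Global Tables that was current when this was written: create identically keyed, empty tables with streams enabled in each region, then link them. The region choices are illustrative.

    import boto3

    REGIONS = ["us-west-2", "eu-west-1"]  # illustrative replica regions

    # Step 1: create an identically keyed, empty table in every region that
    # should host a replica, with streams enabled (mandatory for Global tables).
    for region in REGIONS:
        boto3.client("dynamodb", region_name=region).create_table(
            TableName="Landmarks",
            AttributeDefinitions=[
                {"AttributeName": "Hotel_ID", "AttributeType": "N"},
                {"AttributeName": "Description", "AttributeType": "S"},
            ],
            KeySchema=[
                {"AttributeName": "Hotel_ID", "KeyType": "HASH"},
                {"AttributeName": "Description", "KeyType": "RANGE"},
            ],
            ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
            StreamSpecification={
                "StreamEnabled": True,
                "StreamViewType": "NEW_AND_OLD_IMAGES",
            },
        )

    # Step 2: link the per-region tables into one Global table.
    boto3.client("dynamodb", region_name=REGIONS[0]).create_global_table(
        GlobalTableName="Landmarks",
        ReplicationGroup=[{"RegionName": r} for r in REGIONS],
    )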

As far as reads and writes are concerned, within a given region there is no difference in performance between a Global table and a standard table. Services in a given region write only to the Global table replica in that region, and the write is then asynchronously replicated to the replicas in the other regions. For multi-region requests, however, Global tables are much faster, because each service reads from and writes to the replica in its local region.

Global table read and write performance should not be different from the performance of a standard table.

A comparison of the performance of a standard table versus a Global table in a multi-region setup is illustrated in the figures below.

DynamoDB: Standard Table
DynamoDB: Global Table

As can be seen above, for multi-region requests, Global tables can make reads/writes much faster as compared to Standard tables.

Please note, Global tables are more expensive than standard tables. The additional cost comes from the following factors.

  • While read costs are the same for Global and standard tables, writes are charged at a higher rate with Global tables. This is because writes to Global tables are not priced in terms of WCUs (Write Capacity Units) but rather in terms of rWCUs (replicated Write Capacity Units). As an example, in the us-west-2 region (Oregon), 1 WCU costs $0.00065 per hour while 1 rWCU costs $0.000975 per hour. Refer to https://aws.amazon.com/dynamodb/pricing/ for more on this
  • Global tables have more replicas than a standard table, and every write operation needs to be propagated to all the replicas. In addition, if there are concurrent writes on different replicas, additional writes are required for conflict resolution. Amazon’s recommendation states:
The provisioned replicated write capacity units (rWCUs) on every replica table should be set to the total number of rWCUs needed for application writes across all regions multiplied by two.

Referring to the above figures, for example: a single write into a Standard table requires a provisioned capacity of 1 WCU. In the case of a Global table with three replicas, the number of rWCUs required for a single write = 1 rWCU (for the write) + 2 rWCUs (for replication) = 3 rWCUs. Accommodating for conflict resolution, you would require 3 * 2 = 6 rWCUs for a single write. This can quickly add up and make Global tables several times more expensive than a standard table; the sketch after this list quantifies the gap

  • Global tables use DynamoDB streams to propagate updates to replicas. Considering that the replicas are in different regions, streaming data across regions is going to incur additional data transfer cost. Refer https://aws.amazon.com/dynamodb/pricing/ for more on this
  • Obviously, the more replicas, the higher the storage cost. So, Global tables can increase your storage cost by a factor equal to the number of replicas.
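A back-of-the-envelope calculation using the per-hour prices quoted above makes the write-cost gap concrete:

    # Back-of-the-envelope comparison using the us-west-2 prices quoted above.
    WCU_PER_HOUR = 0.00065    # standard table: $ per WCU-hour
    RWCU_PER_HOUR = 0.000975  # global table: $ per rWCU-hour

    standard_cost = 1 * WCU_PER_HOUR   # 1 WCU per write/sec on a standard table
    global_cost = 6 * RWCU_PER_HOUR    # 3 rWCUs x 2 (conflict resolution headroom)

    print(f"standard: ${standard_cost:.6f}/hr, global: ${global_cost:.6f}/hr")
    print(f"global writes cost {global_cost / standard_cost:.1f}x")  # ~9.0x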

On-demand Capacity Mode:

Amazon has recently (Nov 2018) announced a new “On-demand Capacity Mode” for unpredictable workloads. As per the announcement:

Amazon DynamoDB on-demand is a flexible new capacity mode for DynamoDB capable of serving thousands of requests per second without capacity planning.

It promises to help serve thousands of requests/second without any capacity planning. How is this possible and how different is it from the current “Provisioned Capacity mode”?

To illustrate the differences between these two modes, consider the below example of an e-commerce website.

Assumptions:

  • The website uses DynamoDB to store its product information and availability
  • Each customer request translates to a request to DynamoDB
  • On a day-to-day basis, the website (and correspondingly DynamoDB) receives 1000 strongly consistent 4 KB read requests/sec and 100 1 KB write requests/sec throughout the day
  • Using the Provisioned capacity mode, DynamoDB is provisioned with 1000 RCUs and 100 WCUs

On Black Friday, assume the website suddenly receives an unexpectedly large volume of traffic, say worth 2000 RCUs and 500 WCUs. In this case:

  • Requests to DynamoDB will be throttled because the provisioned capacity is not sufficient to serve the unexpectedly large volume of load.
  • Burst capacity, if any, can limit the damage for a short spike, but in cases like this the increase in load tends to be more persistent and can last for hours.
  • You can further reduce the downtime by configuring an autoscaling policy, but autoscaling can take up to 10 minutes to respond every time there is a trigger. This can be far too slow for many workloads, particularly those that are constantly changing

This kind of scenario can put the website’s availability in serious trouble.

In cases such as above, On-demand capacity mode can come in handy. With this mode:

  • There is no need to provision throughput ahead of time
  • You are instead billed on a pay-per-request basis.

So, in the Black Friday example above, DynamoDB will automatically adapt as the workload increases and you would not expect any downtime.

That’s fantastic, isn’t it?!
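Enabling this mode is a single setting at table creation. A minimal boto3 sketch; the “Products” table name and key are hypothetical stand-ins for the e-commerce example.

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-west-2")  # illustrative region

    # On-demand mode: no ProvisionedThroughput at all, just a billing-mode flag.
    dynamodb.create_table(
        TableName="Products",  # hypothetical table for the e-commerce example
        AttributeDefinitions=[{"AttributeName": "Product_ID", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "Product_ID", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",
    )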

As per Amazon’s documentation, below are the key differences between this new mode and the current Provisioned capacity mode.

DynamoDB: On-demand vs Provisioned Capacity Mode

As you can see above, the recommendation is to use On-demand mode only for unpredictable workloads. If On-demand is so good at dynamically handling workload increases, you might ask, why use Provisioned capacity mode at all?

The answer to that question is Cost!

DynamoDB: Cost of On-demand vs Provisioned Capacity Mode [Credit: Jack Shirazi]

The above table illustrates the difference in cost between On-demand and Provisioned capacity modes for a sample write workload. As can be seen, the cost of your writes in On-demand mode can be roughly 7 times (3240/468 ≈ 6.9) your cost in Provisioned capacity mode.

The cost of your writes in the On-demand mode can increase to nearly ~7 times your cost in the Provisioned capacity mode

So, while On-demand capacity remains an extremely useful option for unpredictable workloads in the short term, it might not be the best mode of operation for all workloads.

Poorly designed data models can still result in throttling even in On-demand capacity mode.

Please note, while On-demand capacity mode rapidly accommodates changing workloads, it still does not guarantee zero throttling. Poorly designed data models can still result in throttling even in On-demand capacity mode. Below is one example that illustrates this issue.

DynamoDB: Throttling in On-demand Capacity Mode

Currently, you cannot use both modes on a table - it is either Provisioned or On-demand capacity mode. However, you can switch between the two (note that a table can be switched between modes only once every 24 hours). So, in the example above, you would use Provisioned capacity normally, switch to On-demand mode ahead of expected spikes like Black Friday, and switch back again after the bursty period has ended, as the sketch below shows.
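Switching is a single UpdateTable call in each direction. A minimal boto3 sketch; note that switching back to provisioned mode requires specifying the throughput again.

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-west-2")  # illustrative region

    # Before an expected spike: switch the table to on-demand billing.
    dynamodb.update_table(TableName="Products", BillingMode="PAY_PER_REQUEST")

    # After the bursty period: switch back, re-specifying provisioned throughput.
    dynamodb.update_table(
        TableName="Products",
        BillingMode="PROVISIONED",
        ProvisionedThroughput={"ReadCapacityUnits": 1000, "WriteCapacityUnits": 100},
    )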

To Summarize:

  • Before designing your data model, identify and understand the access patterns/use cases the table has to serve
  • For a given access pattern, the data model should ideally ensure that any read/write operation will hit no more than one partition
  • Even though DynamoDB splits the table data evenly among its Partitions, depending on the context, the choice of Partition Key could lead to “hotspotting” and throttling of requests
  • Avoid “hotspotting” by choosing partition keys with good cardinality and adding salt values to the partition key, if needed
  • Archive, delete or expire old/unused data. If you have large volumes of outdated/unused data in your table, you could be paying several times the cost of your actual usage
  • If you are planning to access your table from different geographic regions, consider using Global tables. Within a given region, the Global table read and write performance should not be different from the performance of a standard table. But for multi-region requests, Global tables would be much faster
  • Global tables are more expensive than Standard tables due to the additional writes involved in replication, write pricing in terms of rWCU, inter-region data transfer costs, and additional storage requirements
  • While On-demand Capacity Mode is a great addition to the existing features, it might not be the best mode of operation for all workloads. Throttling is still possible in this mode.
  • There is no replacement for proper capacity planning and good data modeling

I hope this article gave you a reasonable insight into designing data models for DynamoDB tables.

The next article — part 4 of this series will focus on the guidelines to be followed for faster reads/writes in DynamoDB.
