Reviewing Amazon DynamoDB

Paul Xue
4 min read · Mar 11, 2015


The name is a misnomer on the Dynamo storage system design, and it is not the only confusing part of this NoSQL solution the company offers.

One of the most important concepts in using DynamoDB is understanding Capacity Units (CUs). The CU is the base unit of consumption; moreover, it is the unit Amazon ultimately uses to calculate how much to charge for the service. Amazon advises potential users to estimate their costs by pre-calculating their consumption levels, or, in other words, by provisioning read/write throughput on the tables stored in DynamoDB.
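
For a concrete sense of what “provisioning” means in practice, here is a minimal sketch of creating a table with provisioned throughput via boto3; the table name, key schema, and throughput numbers are placeholders for illustration, not recommendations.

```python
import boto3

# Hypothetical table: the name, key schema, and throughput numbers below are
# placeholders chosen purely for illustration.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "customer_id", "AttributeType": "S"},
        {"AttributeName": "order_date", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customer_id", "KeyType": "HASH"},   # hash key
        {"AttributeName": "order_date", "KeyType": "RANGE"},   # range key
    ],
    # This is the part you are pre-paying for: reads and writes per second.
    ProvisionedThroughput={
        "ReadCapacityUnits": 100,
        "WriteCapacityUnits": 50,
    },
)
```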

In this sense, Amazon completely missed the mark and failed to improve upon SimpleDB’s complicated pricing model. Instead of machine hours, CUs are calculated for various operations: reading from and writing to the database consume CUs depending on multiple factors, such as how the operation is performed and how much data is accessed or written. For simple operations the calculation is fairly straightforward, but as the query or table structure becomes more complex, estimating throughput becomes increasingly taxing and time consuming.

The following paragraph may be a confusing one.

For reading data from a table in the database, a task consumes different amounts of CUs based on the combined factors of individual item size, the number of items the task had to access, the consistency of the read, and finally the read style of the task. Numerically, each 4 KB read per second consumes 1 CU for a strongly consistent read and 0.5 CU for an eventually consistent read. Individual items can be larger than 4 KB and will consume additional CUs.

Examples might help.

The simplest task, accessing a single item by its hash key and range key combination, consumes ceil(item size / 4 KB) CUs for a strongly consistent read, or half of that for an eventually consistent read.
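
A minimal sketch of that rule in Python (the function name and 4,096-byte constant are just illustrative):

```python
import math

def single_read_cus(item_size_bytes, strongly_consistent=True):
    # Round the item up to the next 4 KB, then halve for eventual consistency.
    units = math.ceil(item_size_bytes / 4096.0)
    return units if strongly_consistent else units / 2.0

single_read_cus(2048)                             # 1 CU   (2 KB rounds up to 4 KB)
single_read_cus(2048, strongly_consistent=False)  # 0.5 CU (eventually consistent)
single_read_cus(9000)                             # 3 CUs  (9 KB rounds up to 12 KB)
```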

Batch item access is just a grouping of individual item accesses over a single network call. A batch of 10 item accesses consumes the same CUs as getting them individually.

For the above two read styles, each individual item size is rounded up to the next 4 KB and CUs are calculated from that. For example, 100 consistent accesses of a 2 KB item will consume:

ceil(2 KB / 4 KB) * 100 items * 1 consistent read = 100 CUs
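
The batch case as a sketch, where each item is rounded up to 4 KB individually before summing (the helper name is mine, not an AWS API):

```python
import math

def batch_read_cus(item_sizes_bytes, strongly_consistent=True):
    # Each item is rounded up to 4 KB on its own, then the units are summed.
    total = sum(math.ceil(size / 4096.0) for size in item_sizes_bytes)
    return total if strongly_consistent else total / 2.0

batch_read_cus([2048] * 100)   # 100 CUs, matching the calculation above
```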

Amazon does not disclose exactly how item sizes are calculated. Briefly, it defines the item size as the sum of the lengths of its attribute names and values. Some basic primitive types and their value sizes are listed below (a rough estimator is sketched after the list):

  • null/boolean: 1 byte
  • strings/binary: (value length) bytes
  • numeric: 4 bytes*
  • list/map: 3 bytes of overhead + sum(size of values)

* the numeric size is an experimental result.
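
Assuming the rough per-type sizes above, here is a back-of-the-envelope item size estimator. Since Amazon does not publish the exact accounting, treat the result as a ballpark only; the 4-byte numeric size in particular is the experimental figure noted above.

```python
def estimate_item_size(item):
    # Rough per-type sizes from the list above; Amazon does not publish the
    # exact rules, and the 4-byte numeric size is an experimental figure.
    def value_size(value):
        if value is None or isinstance(value, bool):
            return 1
        if isinstance(value, str):
            return len(value.encode("utf-8"))
        if isinstance(value, bytes):
            return len(value)
        if isinstance(value, (int, float)):
            return 4
        if isinstance(value, (list, tuple)):
            return 3 + sum(value_size(v) for v in value)
        if isinstance(value, dict):
            return 3 + sum(len(k) + value_size(v) for k, v in value.items())
        raise TypeError("unsupported type: %r" % type(value))

    # Item size = attribute names + attribute values.
    return sum(len(name) + value_size(value) for name, value in item.items())

estimate_item_size({"user_id": "abc123", "score": 42, "tags": ["a", "b"]})
```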

There are two additional read styles for accessing multiple consecutive items (automatically grouped by their hash key and/or range key): query and scan, which allow the user to access data in more complex ways. A query operation treats the entire task as a single read, summing all item sizes together and then rounding up to the nearest 4 KB. Taking the same 100-item example above, the CU consumption for querying the same hash key would be:

ceil[(2 KB * 100 items) / 4 KB] * 1 consistent read = 50 CUs

Therefore, this is where optimizing your item size comes in handy.
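
The same numbers as a sketch, showing why the query-level rounding halves the cost here (the helper name is illustrative):

```python
import math

def query_read_cus(item_sizes_bytes, strongly_consistent=True):
    # Query sums all matching item sizes first, then rounds up to 4 KB once.
    total = math.ceil(sum(item_sizes_bytes) / 4096.0)
    return total if strongly_consistent else total / 2.0

query_read_cus([2048] * 100)   # 50 CUs, versus 100 CUs for individual gets
```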

Scan operations traverse the entire table, projecting and filtering on user-specified expressions. By default, a single scan request reads either up to 1 MB of item data or until it exhausts the table’s read throughput, whichever comes first. Therefore, if the per-item size is less than 4 KB, a scan can consume up to 1 MB / 4 KB = 256 CUs.
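
The per-page ceiling as a quick calculation, assuming strongly consistent reads (an eventually consistent scan would be half that):

```python
# A default Scan page returns at most 1 MB of item data, so a strongly
# consistent scan can burn up to 1 MB / 4 KB = 256 CUs per page, no matter
# how few items survive the filter expression.
PAGE_LIMIT_BYTES = 1024 * 1024
MAX_CUS_PER_PAGE = PAGE_LIMIT_BYTES // 4096   # 256 (128 if eventually consistent)
```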

As for write operations, the math is a bit simpler: 1 CU per 1 KB written.
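
A sketch of the write-side math, rounding each item up to the next 1 KB:

```python
import math

def write_cus(item_size_bytes):
    # Writes are billed per 1 KB, rounded up for each item.
    return math.ceil(item_size_bytes / 1024.0)

write_cus(512)    # 1 CU
write_cus(2500)   # 3 CUs (2.5 KB rounds up to 3 KB)
```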

Things become interesting when secondary indexes enter the picture. Amazon’s solution to the inflexibility of hash-and-range-key-only querying is to create multiple copies of the same data, indexed by potentially different hash or range keys. In other words, trading space for speed. On the surface this is fine, since storage is cheap and trading storage for data access speed is a fairly standard trade-off these days; unfortunately, creating and populating a secondary index does not come free. For every write to the original table, a secondary index holding a full or projected copy of the attributes is also written. In some cases, a single write can consume twice as many CUs, doubling the initially required write provision and cost.
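
A rough sketch of that write amplification, assuming one global secondary index that projects all attributes; the helper and its arguments are illustrative, and each index’s write capacity is provisioned separately from the table’s.

```python
import math

def total_write_cus(item_size_bytes, index_projection_sizes=()):
    # One write against the base table, plus one against every secondary
    # index whose projection the item touches (full projections shown here).
    base = math.ceil(item_size_bytes / 1024.0)
    indexes = sum(math.ceil(size / 1024.0) for size in index_projection_sizes)
    return base + indexes

# A 3 KB item with one global secondary index projecting ALL attributes:
total_write_cus(3 * 1024, index_projection_sizes=[3 * 1024])   # 3 + 3 = 6 CUs
```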

In the end, DynamoDB feels like a new dog with old tricks. It offers essentially no new breakthroughs, and the only foreseeable benefits are its flexibility to plug easily into other existing AWS services (EMR, Kinesis, ElastiCache) and the managed nature of the service. The “best practices” section of the documentation is laughably naive; the suggested solutions to bursty data access and “hot hash keys” are simply not cost effective or scalable. Sticking a cache in front of the database or distributing a single hash key across partitions by concatenating a bucket suffix are not real engineering solutions; they’re a child’s solution to a grown-up’s problem.

The inaugural blog post by Amazon’s CTO back in 2012 hinted that DynamoDB was a materialization and extension of the Dynamo paper written back in 2007, improving upon the lessons learnt from SimpleDB and offering the same managed, blazingly fast, and highly available store that Amazon championed. Yet, in hindsight, DynamoDB offers nothing refreshingly new: it was not the first managed/hosted NoSQL store, it was not the first auto-sharding NoSQL store, and it was certainly not the game changer Amazon pretended it was. As much as Amazon wanted to sell the product as something completely fresh, it just did not bring the same excitement into people’s lives that Cassandra did when Facebook open sourced it back in 2008. Unfortunately for Amazon, as much as they don’t want people to compare DynamoDB with Cassandra, it happened anyway, and the non-existent excitement fizzled out into nothingness.
