DynamoDB: Why migrate to DynamoDB from Cassandra?

Sasidhar Sekar
Expedia Group Technology
9 min read · Nov 26, 2018

This is post #2 of the series aimed at exploring DynamoDB in more detail. If you haven’t read the first post, you can find it here.

In this post, we’ll look at some of the key differences between Cassandra and DynamoDB and why Cassandra users might want to migrate to DynamoDB. If you are not interested in reading through the entire blog and want to jump straight to the summary, click here.

Replica Configuration and Placement

NoSQL datastores like Cassandra and DynamoDB use multiple replicas (copies of data) to ensure high availability and durability. The number of replicas and their placement have a direct impact on how resilient your datastore is. With Cassandra, the number of replicas per ring and their placement are completely configurable: a ring can have 3 replicas, 6 replicas, or any number of replicas as necessary, and they can all be placed in a single Availability Zone (AZ), spread across two AZs, or arranged in any other possible configuration. This model puts the resiliency of the datastore firmly in the hands of the individual configuring it. So, if you do not have much experience configuring NoSQL datastores, this can be a challenge.

In contrast, with DynamoDB, Amazon makes these decisions for you. DynamoDB maintains exactly 3 replicas, placed one per AZ, and this is not configurable. This way, irrespective of the individual capabilities of the user, DynamoDB offers a certain standard of resiliency.

Cassandra vs DynamoDB: Replica Placement

Consistency

Cassandra offers several consistency levels, such as ANY, ONE, QUORUM, and ALL, for both reads and writes. Again, when to choose which is completely up to the user. In contrast, DynamoDB simplifies this to two configurable consistency levels for reads: eventual consistency and strong consistency. For writes, the consistency level is not configurable at all.

Writes are always strongly consistent: every write is synchronously written to two replicas (2 AZs) and asynchronously written to one more. This ensures that any data written is durable for the life of the datastore. For reads, if consistency is absolutely important to the use case (example: a travel wallet balance), users can choose strong consistency, where DynamoDB will ensure that the data read is certainly the latest. If not (example: a list of landmarks in an area), eventual consistency is the better option because it’s cheaper and faster.

Such simple options relieve a lot of the burden on the user.

The diagrams below show the difference between the various consistency levels*, for a Cassandra ring with 3 nodes and a replication factor of 3, as against a standard DynamoDB table**.

Cassandra vs DynamoDB: Read Consistency Levels
Cassandra vs DynamoDB: Write Consistency Levels

* The diagrams above depict the different consistency levels for a single-datacenter/region setup only. Multi-region setups and the corresponding consistency levels are out of the scope of this article
** DynamoDB internals are not visible to the public. So, the internal querying flow is only an illustration intended to explain the observed behavior
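
To make the read-side choice concrete, here is a minimal sketch using Python and boto3. The table name and key are hypothetical, but ConsistentRead is the actual flag DynamoDB exposes for choosing between the two read consistency levels:

```python
import boto3

# Hypothetical table and key, for illustration only
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("TravelWallet")

# Eventually consistent read (the default): cheaper and faster,
# but may return slightly stale data
landmarks = table.get_item(Key={"UID": "user-123"})

# Strongly consistent read: guaranteed to reflect all successful
# writes prior to the read, at twice the read-capacity cost
balance = table.get_item(
    Key={"UID": "user-123"},
    ConsistentRead=True,
)
print(balance.get("Item"))
```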

Partitions

Say, for example, you are creating a Cassandra ring to hold 10 GB of social media data. You choose to create a 3-node ring, with a replication factor of 3. Assume the data is structured as shown below and is partitioned by UID (the partition key):

DynamoDB: Sample Table

In this case, because the replication factor is 3, each replica will hold 10 GB of data. What happens when the data volume grows over time? Assume the data grows to 100 GB in six months’ time. The following questions might arise:

  • Do the nodes have 100 GB data storage space?
  • Is there an impact on the performance due to growth in data volume?
  • Do I need to add another node to the ring?

Compare and contrast this with DynamoDB, where the table is split into partitions* automatically, based on the volume of data. In DynamoDB, no single partition can hold more than 10 GB of data. If the data volume grows to 100 GB, as in the above example, DynamoDB will split the partitions repeatedly and the table will end up with at least 10 partitions. As a user, you need not worry about storage space or performance impact purely due to growth in data volume.

Note: If you provision, for example, 1000 RCU and 1000 WCU for your original partition, the 10 partitions that you end up with will share the same RCU/WCU (not necessarily equally) unless you increase it. This might not be a problem if you only have data growth, but if data growth is accompanied by usage growth, then you might want to increase the RCU/WCU provisioned on the table.

* The term “partition” means different things in Cassandra and DynamoDB. In Cassandra, it refers to the set of data belonging to a single partition key. In DynamoDB, it refers to a unit of storage and throughput limited to 10 GB of data, 3000 RCU, or 1000 WCU.
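
Based purely on the per-partition limits quoted in this article (10 GB, 3000 RCU, 1000 WCU), you can sketch a rough partition estimate. DynamoDB’s internal partitioning scheme is not public, so treat this as a back-of-the-envelope illustration, not an exact formula:

```python
import math

def estimate_partitions(data_gb: float, rcu: int, wcu: int) -> int:
    """Rough partition estimate from the per-partition limits quoted
    in this article: 10 GB of data, 3000 RCU, 1000 WCU. DynamoDB's
    internals are not public, so this is illustrative only."""
    by_size = math.ceil(data_gb / 10)
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    return max(by_size, by_throughput, 1)

# The example from this section: 100 GB of data with 1000 RCU and
# 1000 WCU provisioned ends up with at least 10 partitions,
# driven by the data volume
print(estimate_partitions(data_gb=100, rcu=1000, wcu=1000))  # 10
```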

Batching

Batch operations are atomic in Cassandra; that is, if 100 writes are grouped into a single batch and one of the writes fails, then the entire batch is marked as failed: all or nothing. With DynamoDB, there is no such guarantee. If 100 writes are grouped into a single batch and one of the writes fails, the batch operation returns information about the request that failed, so that you can diagnose the problem and retry the operation on that item.

The main goal of batch operations in DynamoDB is performance. This performance gain is achieved because:

  • By grouping multiple operations into a single batch, applications save on the network round-trip time to DynamoDB that each of those individual requests would otherwise incur
  • DynamoDB performs the individual reads or writes in parallel. Applications benefit from this parallelism without having to manage concurrency or threading.
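
As a sketch of these non-atomic semantics, here is a minimal batch write using boto3; the table name and items are hypothetical. Writes that could not be applied come back in UnprocessedItems, and retrying them is the caller’s responsibility:

```python
import time
import boto3

client = boto3.client("dynamodb")

# Hypothetical items for a table named "SocialMedia"; a single
# batch_write_item call accepts at most 25 put/delete requests
requests = [
    {"PutRequest": {"Item": {"UID": {"S": f"user-{i}"}}}}
    for i in range(25)
]

# Unlike a Cassandra batch, this is not all-or-nothing: failed
# writes are returned in UnprocessedItems, and the caller must
# retry them (here with a crude fixed backoff)
unprocessed = {"SocialMedia": requests}
while unprocessed:
    response = client.batch_write_item(RequestItems=unprocessed)
    unprocessed = response.get("UnprocessedItems", {})
    if unprocessed:
        time.sleep(1)
```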

Administration

DynamoDB is a hosted service, so the majority of the administrative work is taken care of by Amazon. In Amazon’s own words:

DynamoDB lets you offload the administrative burdens of operating and scaling a distributed database, so that you don’t have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling.

In contrast, if you want to run Cassandra in AWS, below are some of the questions you might need to answer:

  • Which EC2 instance type should I choose?
  • Should I configure my database to use ephemeral volumes, or should I use EBS?
  • Does the network link between the nodes have sufficient capacity to serve my workload?
  • If I am using a multi-region ring, should I be using a VPN?
  • How should my replicas be placed, in order to ensure maximum resiliency?
  • Which compaction strategy should I use?
  • Will tombstones be a problem?

In addition, every time there is a need to scale the cluster up, all the hardware provisioning, setup, and configuration work needs to be repeated, which makes scaling up a slow and expensive process.

Capacity Planning

Assume you have a new service that is expected to read from the data store at 12,000 reads/sec, with each read ~4 KB.

Do you know how many Cassandra nodes you would require to satisfy this workload?

There is no easy way to answer this. You have to bring up a ring with a certain number of nodes (say, 3) to start with, then run tests on the ring to identify how much more capacity you need to serve this kind of workload.

What about DynamoDB?

The answer is 4 partitions*. You need 4 partitions to run a workload that reads at 12,000 reads/sec. How do I know? Because DynamoDB offers a simpler capacity planning model, where every partition can handle no more than 3000 RCU or 1000 WCU.

This kind of model helps get capacity planning done in a fraction of the time, as compared to Cassandra.

*We’re assuming the reads are strongly consistent.
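
The arithmetic behind that answer is simple enough to write down. This sketch assumes the 4 KB read unit and the 3000-RCU-per-partition limit quoted above:

```python
import math

reads_per_sec = 12_000
item_size_kb = 4

# One RCU covers one strongly consistent read per second of an item
# up to 4 KB, so a 4 KB read costs exactly one unit
rcu_needed = reads_per_sec * math.ceil(item_size_kb / 4)

# Apply the 3000-RCU-per-partition limit quoted above
partitions = math.ceil(rcu_needed / 3000)
print(rcu_needed, partitions)  # 12000 4
```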

After reading through the key differences mentioned above, if you are still wondering whether to choose DynamoDB or Cassandra, here’s a quick summary.

Why DynamoDB?

If you do not want to be bogged down by hardware provisioning, setup, and configuration, and just want to get a high-performance, scalable, resilient data store up and running with little or no effort, then DynamoDB is for you. How?

  • Amazon takes care of resiliency by having a multi-AZ setup by default
  • You do not have to worry about the durability of your data. All writes are synchronously written to two AZs and asynchronously replicated to one more. You can even have a multi-region setup, with little or no effort.
  • You no longer have to worry about performance degradation as the data volume increases. Amazon takes care of this by dividing data automatically into partitions and letting you provision capacity on the table.
  • If you see an increase or decrease in workload on your data store, you no longer have to worry about bringing up new nodes, installing software, and getting it into the cluster. You can scale up and down with the click of a few buttons.
  • If you are migrating from another datastore to DynamoDB, you do not have to spend too much effort on understanding the corresponding DynamoDB capacity requirements. Amazon offers a simple capacity planning model, based on per-partition RCU, WCU, and storage limits, which makes capacity planning as easy as a few simple arithmetic calculations.
  • DynamoDB is closely integrated with other AWS services, including Lambda and SQS, which makes it easier if you are looking to move to a serverless architecture.
  • In addition, DynamoDB provides you the option to use DAX (DynamoDB Accelerator). If you have a read-heavy workload, considerable cost savings can be achieved using DAX.

Having said all the above, it is important to understand that there are certain scenarios where DynamoDB might not be the best option for you.

Why not?

  • Cassandra latency can be sub-millisecond if you model your data correctly and tune the requests and the system. With DynamoDB, you are more likely to see latencies in the 5-10 ms range, unless your usage pattern is a good fit for DAX
  • DynamoDB TTLs are at the item level. So, if you have a use case where you expire only certain attributes of an item after a predefined period of time, then DynamoDB might not be the best option for you
  • With DynamoDB, the maximum allowed item size is 400 KB. If you are using large blobs that are expected to exceed this limit, you might be better off looking at other alternatives. Cassandra, for example, has a maximum single-column value size of 2 GB (<1 MB recommended)
  • Batching in DynamoDB is not atomic. If you need your batch operations to be atomic, then DynamoDB might not be the best choice for you
  • Considering that DynamoDB is a hosted service, a lot of the internals are hidden away from customers. While this is good in some cases, because it relieves you of administration and lets you focus on storing and retrieving data, this might not always be the case. Example: DynamoDB access logs. As it stands, these are not easily available to customers. Under normal conditions, this might be fine. But if you observe “hotspotting” on your table and are trying to troubleshoot it, these logs become critical. At the moment, you have to log a support ticket with Amazon to get a copy, or you might have to handle this at your service level, by writing every request to DynamoDB into a log file
  • With DynamoDB, every read and write is rounded up to 4 KB and 1 KB respectively. So, if the payloads for your reads and writes are only a few bytes, you might be paying for much more than what you are using (see the sketch after this list)
  • In addition, you are paying for the storage as well. So, if you have an ever-growing volume of data and are not archiving old data, you can expect to have a lot of unused partitions, which you would have to pay for. This needs to be considered while deciding if DynamoDB is for you
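
To illustrate the rounding point above, here is a small sketch of how much capacity tiny payloads consume, assuming the 4 KB read unit and 1 KB write unit mentioned in this list:

```python
import math

def consumed_rcu(item_bytes: int, strongly_consistent: bool = True) -> float:
    """Capacity consumed by one read: size rounds up to 4 KB units."""
    units = math.ceil(item_bytes / (4 * 1024))
    return units if strongly_consistent else units / 2

def consumed_wcu(item_bytes: int) -> int:
    """Capacity consumed by one write: size rounds up to 1 KB units."""
    return math.ceil(item_bytes / 1024)

# A 100-byte payload still costs a full unit in both directions
print(consumed_rcu(100))  # 1
print(consumed_wcu(100))  # 1
```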

I hope this article gave you a glimpse of DynamoDB’s capabilities and helps you make an informed decision on whether DynamoDB is for you.

Please do note that a lot of the good things that you get out of DynamoDB depend on how good your data model is. So, Part 3 of this series will focus on the do’s and don’ts of DynamoDB Data Modeling.
