Moving to Amazon DynamoDB from hosted Cassandra: A leap towards 60% cost saving per year
Achieving better efficiency and cost savings!
GumGum moved its largest no-sql workload to Amazon DynamoDB from Apache Cassandra. In this blog I will be discussing the architecture, design decisions made in the process with their justifications and the steps taken in order to complete this switch.
Why GumGum needs a NoSQL Data Store?
The Digital Advertising facet of GumGum employs proprietary AI and Computer Vision techniques to contextually target images and pages to deliver ads. User profiling (users are tagged through browser cookies) is also used to target ads for specific users for displaying relevant online ads. To perform such targeting, we need to store the metadata of images, pages and users in large data stores that can handle extremely high traffic with very low latencies spanning multiple data centers over geographical boundaries. Most NoSql databases boasts these abilities and GumGum hosts its targeting data over a 105 node Apache Cassandra (v2.1.15) cluster running on i3.2xlarge EC2 instances across 4 data centers (US East, US West (CA), Ireland and Japan). Speaking modestly, I believe that GumGum is a pretty advanced user of Cassandra and we have tuned it to provide us with an excellent performance.
Now let’s see why this migration project was conceived. Cassandra has been adored by GumGum since 2011, but when the company achieved a moderate traffic scale, the database started becoming unstable and presented us with occasional mood swings (here I refer to long running heavy compactions, infrastructure failures, buggy software patching etc.) leading to outages. In digital advertising outages are an expensive affair which costs considerable revenue loss as well as are exhausting for engineers. Our scale has been growing every year and supporting the business with such a database became very difficult. As outages became frequent from occasional, we realized that a new data store is needed which would require less engineering specialty to maintain but at same time very easy to scale while providing the same performance benefits of the existing solution.
Making the Choice
In today’s market choosing a NoSql data-store product presents a dilemma because of the plethora of available variety.
One needs to weigh the pros and cons at a deeper level for a given use case to understand the suitability of a product. We compared several of the products available which claims to provide very high throughput based on our requirements to arrive at our choice. The most important determining factors eventually boiled down to operational cost and management.
The existing solution using Cassandra used the open sourced version. The closest solution was using the enterprise counterpart, which though promising was expensive and not fully managed. This meant that we would have to pay for the software license and host it ourselves (additional infrastructure cost) and continue maintaining it (engineering hours). In all the databases we were considering, DynamoDB was the only fully server-less solution. Serverless solution abstracts the complexity of node level understanding and makes it really easy for developers to maintain and scale the system. From the above comparison, it was easy for us to choose the product which drawn the most big smiley faces: Amazon DynamoDB 😀.
Using Apache cassandra we achieved an operation scale of ~125,000 Reads per sec and ~40,000 Writes per second while maintaining an average read latency of ~2ms and average write latency of ~4ms. Please note that this latency doesn’t include network time between application and the database cluster. It’s the latency observed from within the Cassandra nodes. The new solution must be able to meet these expectations. Trying out a product would have required us a span of at least 1+ month because of the application data lifecycle constraints. So we decided first to run a benchmarking test through YCSB. I performed multiple iterations of benchmarking with different kinds of operational loads. Below are some of the benchmark results I observed with DynamoDB which helped us decide.
Benchmarking is performed at a medium scale of 20 Million items over random preloaded items. Tests were conducted on AWS US-West-1 data center. The benchmarking was performed in a c4.2xlarge EC2 instance. Data size was ~22GB.
Observed results are shared below:
It is to be noted that the latencies indicated includes the network hop between the EC2 instance running the benchmark to the DynamoDB service. There were multiple iterations for the above loads along with other load distributions. The test data is a fraction of what we target to store but given the WCU and RCU constraints, DynamoDB showed a constant latency of ~2–3ms (from DynamoDB console).
Migrating to DynamoDB: The Switch
Migrating any database technology is not as simple as migrating any other technology of the stack. This is because of the persistence nature of the data held by the existing technology. In order to start using DynamoDB, we needed to populate it with data so as to support our traffic requests without any impact. The data stored in Cassandra has a time to live of 30 days assigned which means data older than 30 days is automatically deleted by the database.
We had two choices to make the switch happen:
1. Create long running ETL jobs to extract the data from the Cassandra table and save those data to DynamoDB table. Once DynamoDB is populated, start reading and using the data for the business as well as write all new incoming data to DynamoDB.
2. Exploit the fact that data older than 30 days are deleted and write to both existing Cassandra database as well as DynamoDB. After writing for 30 days, theoretically both databases should contain the same data because data older than 30 days is no longer valid.
We chose the second approach since it served two major advantages. This approach avoided additional development time for the temporary ETL jobs needed by strategy 1. Secondly, running the ETL jobs on Cassandra would have put additional load on the cluster which might have affected the normal bandwidth required by the business application serving ads. The disadvantage is that we had to wait 30 days to start reading the data from DynamoDB and try it for our business. The trade-off caused by the wait was acceptable for us given the simplicity of strategy 2.
After ~40 days, the table stats looks something like the below:
The TTL keeps the data from growing too large. At this point both Cassandra and DynamoDB were at sync and we could start reading the data from DynamoDB and use it while disabling reads from Cassandra. The diagram given below depicts this transition. This was the big switch completing the transition to DynamoDB smoothly.
X Datacenter Data Replication
One of the critical business requirements was data replication across our US data centers (US-East-1 and US-West-1). Cassandra supports replication by itself and when we started writing data to DynamoDB, global tables feature was not yet introduced. AWS provided the dynamodb-cross-region-library to replicate a master table to a slave table(s). We required master-master replication since we needed to be able to write to either region and read from the other. To meet this requirement, I modified the dynamodb cross region library so that it could perform master-master replication and support our business requirement of cross data center multi master replication. This change can be found here on Github. The resulting architecture achieving master-master replication is depicted in the below diagram.
AWS introduced DynamoDB global-tables at AWS reInvent 2017 which allows DynamoDB tables to be replicated across data centers and we plan to start using this feature soon.
Application latency is a key factor in digital advertising business. Once we completed the move, we observed that there was no observable increase in our application latency. From DynamoDB console, the service delivered ~2–3ms read latency and ~4–6ms write latency with negligible spikes.
Comparing the amount we spent running Cassandra (hardware and maintenance), we observed that DynamoDB reduced the cost considerably. Let’s do some math here:
80 i3.2xlarge instances would cost: 0.624000 x 24 x 365 x 80 = $437299.2 USD
When compared to this, DynamoDB cost as provided by AWS cost explorer is as follows:
Per month = ~450 x 30 = ~13500 USD
Estimated per year cost = 14100 x 12 = $162000 USD.
Both the cost of DynamoDB and Cassandra instances are based on AWS On-Demand usage, though with reservation they will get further reduced.
Now here is the fun part:
Unlike Cassandra instance provisioning, DynamoDB provisioning is not fixed and is through auto scaling. This helps in putting a check on the database resources we use for running our business. The Dynamodb cost figures indicated above includes cost for an entire day with both peak and non-peak hours and can be considered pretty close to actual running cost.
This is amazing because considering only this, the observed cost savings with DynamoDB is ~62–65% over running Cassandra. Additionally, we should not overlook the fact that Cassandra maintenance requires dedicated engineering hours and the outages are also expensive not only financially but also to the organization reputation. Since DynamoDB is a fully managed service, maintenance cost can be considered Zero (and we trust team AWS team to keep it fault tolerant with minimum outages). Factoring in these, we may ball park the savings to be around 65–70% over running Cassandra.
This was a big win for us considering the fact that there will be no additional infrastructure to run and no more maintenance activities required from an engineers end. Given the current state we look forward to migrate all our existing data from Apache Cassandra to Amazon DynamoDB in the recent future.