What is Scylla?
Scylla is an Apache Cassandra alternative NoSQL database that provides low latency and high throughput at a fraction of the cost of other competitors. They reimplemented Apache Cassandra which was written in Java, using C++ to increase performance and utilize multi-core servers.
Why do we want to move away from Cassandra?
GumGum has used Apache Cassandra for the past 6 years. We stored 3 different types of data in there: contextual metadata, behavioral data, and ad performance data. However, starting 2 years ago, we started an initiative to migrate data out of Cassandra and into a managed service database. The reason for this was threefold. First, we wanted a managed service that would be easy to scale. As the requests we sent to our Cassandra clusters increased, we experienced issues with scaling and needed to engage our DevOps team to manually add nodes to the cluster. Also since we had no Cassandra experts in house, it was difficult to manage the cluster and debug issues. We noticed that as the requests to our Cassandra clusters increased and problems related to load and capacity began to arise, we were spending valuable engineering time managing the clusters and combating the consequences from the clusters’ issues rather than focusing on product features work. Second, Apache Cassandra has not been officially supported by Datastax for years. Third, we were running version 2.1.15 and could not upgrade to 3.0 due to data type changes which would have required more complex changes in our application. Also, if we were to upgrade to 3.0, we would have lost free access to OpsCenter which is a tool provided by Datastax to monitor Cassandra clusters.
Due to all these reasons, we began the process of moving away from Cassandra. We migrated all of our contextual metadata and behavioral data to AWS’s DynamoDB. Migrating the contextual metadata and behavioral data from Cassandra to DynamoDB required some small changes but was doable. However, we were not able to move our ad performance data to DynamoDB because DynamoDB does not support counter data types which is what our ad performance data is entirely composed of. Therefore, we continued to search for Cassandra alternatives.
We considered a few different options to migrate our ad performance data including Aerospike, Scylla, Datastax Enterprise (managed version of Datastax’s proprietary Cassandra), and Redis. Aerospike, Datastax, and Redis were all viable options but way too pricey. Scylla seemed like an obvious winner as it was a good combination of price and ease of migration since it is considered as a drop in alternative to Cassandra.
The migration process was pretty clear. First, we worked with Scylla to bring up nodes in every AWS region we operate in based on an estimated load and data size. Then we took a snapshot of the data from Cassandra and loaded it on to the Scylla clusters. We made changes to our application to write to both Cassandra and Scylla but only read/use data from Cassandra. Then we started reading from Scylla to see if the cluster could handle the load and also compared the data retrieved from Scylla to the data from Cassandra to validate the data in Scylla. When we validated the data in Scylla, we finally started using the data from Scylla within our application. Once the Scylla clusters looked stable, we were ready to sever all ties with Cassandra. We removed Cassandra related code from our application as well as outside teams’ workload. Finally, we shut down our Cassandra cluster.
Difficulties / Challenges
We experienced a few challenges during the migration process. First, the POC didn’t cover the full load we would be sending to Scylla so when we directed all the read requests to Scylla, the clusters got overloaded. Also, we saw occasional latency spikes in the cluster that led to latency spikes in our application. We were notified of hot partitions that returned over 100 rows which was leading to specific nodes becoming overloaded. In addition, we observed random surges in read requests sent to Scylla from our application which would destabilize the entire cluster since the number of requests would increase by 5 times or more.
Overcoming Difficulties / Challenges
In order to combat the issue with the load being more than what we originally expected, we simply had to add more nodes to the clusters. We were able to decrease the occasional latency spikes by implementing Scylla’s workload prioritization feature for heavy load queries. This feature allows users to specify queries that can be deprioritized when the cluster is overloaded. To help nodes from being overloaded due to hot partitions, we actually modified the schemas for two tables that were responsible for returning over a hundred rows for certain keys. The cause of the surge in requests is still being investigated and the root cause is currently unknown.
As for our next steps with Scylla, we want to optimize the instance sizes we are using for our nodes. We are currently running the same instance type and size across all regions but we believe this can be changed to run less instances and prevent over provisioning in every region. Also, we are continuing our investigation on the cause of the surge in requests that our application sends to Scylla and how we can prevent this from happening.
We’re always looking for new talent! View jobs.