We switched to ScyllaDB

Published in Pygmalios Engineering, Jul 16, 2021

Our team is constantly looking for new ways to improve our products. We make sure that our services work efficiently, quickly, and reliably. The Pygmalios analytics platform collects data from various sensors installed in stores and receives thousands of events every minute. A central part of the platform is the database that processes those large amounts of data every day.

To meet those requirements, our platform was built on the Apache Cassandra NoSQL database, which gives the system scalability and high availability. As we are constantly looking for improvements, we did an analysis to map better alternatives. That analysis led us to ScyllaDB, and this article describes how we switched from Cassandra to ScyllaDB.

Cassandra vs. ScyllaDB

Scylla acts as an Apache Cassandra alternative and replacement with fully supported Cassandra migration capabilities. Scylla provides the same CQL interface and queries, the same drivers, even the same on-disk SSTable format, but with a modern architecture designed to eliminate Cassandra performance issues, limitations, and operational barriers.

To summarize the comparison of both NoSQL databases, here are the main differences we encountered in our analysis:

Cassandra is written in Java, Scylla in C++
99th-percentile latency should be 5–10x lower than with Cassandra, thanks to lockless inter-core communication and the shared-nothing approach: each thread has its own memory, CPU, and NIC buffer queues. As a C++ application, Scylla also has no garbage collector.
10x fewer nodes required for the same performance (Scylla is 10x more efficient on the same HW)
Scylla requires fewer resources, as it does not need to load the application into a JVM
Scylla offers a built-in caching feature; Cassandra needs an external solution (Redis)
Consistent performance
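To make the 99th-percentile figure concrete, here is a minimal Python sketch of how such a percentile can be computed with the nearest-rank method. The latency values are hypothetical and purely illustrative, not measurements from our clusters, and the `percentile` helper is our own naming for this example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds (illustrative only):
# 98 fast requests and 2 slow ones.
latencies_ms = [2.0] * 98 + [50.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.2f} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this made-up sample the mean and the median stay low while the 99th percentile exposes the slow requests, which is why tail latency is the number worth comparing between databases.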

The database switching process

The analysis showed that ScyllaDB offers promising advantages in efficiency and cost, so we decided to switch our development environment to ScyllaDB to thoroughly analyze its performance. We divided the migration process into three main steps. First, we prepared the pipeline to smoothly migrate all the table data from Cassandra to the ScyllaDB cluster and migrated all data in the development environment. Second, we evaluated the efficiency of the new database running in the development environment. Finally, we switched the production environment.

Data migration in the development environment

Pygmalios runs Spark with Cassandra. Spark is a batch-processing system, designed to deal with large amounts of data. The Spark worker understands how Cassandra distributes data, and together they create a powerful tool for processing massive amounts of data.

One of the main reasons companies choose to switch from Cassandra to ScyllaDB is the ease of data migration: ScyllaDB offers the same interface, so the same libraries can be used. For that reason, we only needed to create a new ScyllaDB cluster and migrate all of the data. To start the migration, we carefully prepared Spark batch processes that migrate the data from every table.
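A per-table Spark batch copy of this kind could look roughly like the Python sketch below. It is not our actual pipeline. It assumes a `SparkSession` started with the Spark Cassandra Connector on its classpath, target tables that already exist in ScyllaDB with the same schema, and that the connector accepts `spark.cassandra.connection.host` as a per-read and per-write option. The function names, keyspace, table names, and host names are all hypothetical.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def migrate_table(spark, keyspace, table, source_host, target_host):
    """Copy one table from the source cluster to the target cluster.

    `spark` is expected to be a pyspark.sql.SparkSession with the
    Spark Cassandra Connector available (assumption).
    """
    # Read the whole table from the source (Cassandra) cluster.
    df = (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .option("spark.cassandra.connection.host", source_host)
        .load()
    )
    # Append the rows to the same keyspace/table on the target (ScyllaDB) cluster.
    (
        df.write.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .option("spark.cassandra.connection.host", target_host)
        .mode("append")
        .save()
    )


def migrate_all(spark, keyspace, tables, source_host, target_host):
    """Run the copy for every table, one batch job per table."""
    for table in tables:
        migrate_table(spark, keyspace, table, source_host, target_host)
```

A call such as `migrate_all(spark, "analytics", ["events", "visits"], "cassandra.internal", "scylla.internal")` would then copy the two hypothetical tables one after another. Because ScyllaDB speaks the same CQL protocol, the same connector format string serves both the read and the write side.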

When all data had been copied to the ScyllaDB cluster, we were ready to switch our development environment to the new database. We then set up monitoring and alerting for Scylla with the Scylla Monitoring Stack. This stack is one of the benefits of Scylla, and it contains open-source tools including Prometheus and Grafana. At this point, our development environment was running on Scylla while production was still running on Cassandra.

Comparison of our metrics between ScyllaDB and Cassandra

To verify the benefits of Scylla for our services, we ran tests to measure whether Scylla performs better than Cassandra. We compared the development environment running Scylla with the production environment running Cassandra, where both environments use the same HW resources, the same database setup, and an exact copy of the data. The results are shown in the table below.

We collected metrics on resource usage, where Scylla showed its effectiveness in CPU usage and disk read/write latency. This is likely due to the C++ implementation, lockless inter-core communication, and the shared-nothing approach mentioned in the analysis above.

Database read/write latency metrics, obtained with Prometheus and Grafana, showed huge improvements with Scylla. However, when we measured the impact of Scylla on processing time in our services, it did not decrease significantly. These metrics cover the processing time of real-time batches, the night batch jobs that reprocess the day's data, and the response time of some API calls.

Latency comparison (Scylla vs. Cassandra)

The following graphs compare read/write latencies during regular daytime operation of the system and during the night batches, when all daily data are reprocessed and the load is highest.

Read latency during the night batch jobs

Write latency during the night batch jobs

Read latency during the day

Write latency during the day

Scylla on production

The tests comparing development on Scylla with production on Cassandra confirmed that Scylla offers benefits such as a modern development stack, lower hardware requirements, and lower read/write latency. Another important argument for us was automatic backups to GCS with Scylla. DataStax OpsCenter supports backups only to AWS or Azure, while our system runs on Google Cloud, so we were exposed to expensive traffic from one cloud provider to another. All these arguments led us to switch our production environment to Scylla as well.

Conclusion

Scylla is more hardware efficient than Cassandra. Pygmalios expects to grow, so the switch is an investment in lower operating costs, as we will not need to buy as many high-end nodes. However, we still have hardware commitments, so we are not cutting costs today.
Batch jobs are slightly faster with ScyllaDB, but this may be caused by Dataproc performance. The gap in read and write latencies, however, is huge.
DataStax OpsCenter backups are supported only to AWS and Azure, which causes expensive traffic from one cloud provider to another, mostly during a full database restore. We also experienced issues with Cassandra restores: we were getting a lot of timeouts, and we have to improve our service reliability.
Scylla supports a more modern developer stack than Cassandra, which fits better into our own stack. We are comfortable with it, and so far we don't need to order business support as originally planned.