
Avoid pitfalls in scaling your Cassandra cluster: lessons and remedies

Introduction

Several teams at Walmart use Cassandra heavily as one of their primary data sources. Many of them rely on the high availability and tunable consistency guarantees provided by Cassandra, besides its blazing performance. We have deployed several Cassandra-based applications in production, with reads and writes staying consistently at tens or even hundreds of thousands of operations per second.

Reaching that state has not been easy. We had to ‘un-learn’ a lot of the tricks of the RDBMS world, create schemas that looked ‘weird’ but turned out to be more performant, and consider various strategies to ‘delete’ data. In this article, we discuss some of the pitfalls we observed while scaling our Cassandra clusters and how we avoided them.

Pitfall 1: Get your partitions right

After delivering a couple of successful projects based on Cassandra, one pattern always stood out:

Of all the tunings, strategies, and tricks that can be applied to Cassandra to gain performance, designing an optimal schema had the biggest impact on the latency, stability, and availability of the cluster.

Uneven partitions are one of the biggest reasons for non-deterministic read/write performance and query timeouts. They also put uneven pressure on compaction cycles, since big partitions may take a long time to compact while smaller partitions compact quickly.

In one of our use cases we needed to maintain a parent-child relationship. It seemed reasonable to use the parent key as the partition key and the child key as the clustering key so that all children would be co-located with their parent. One design concern was that the number of children varied heavily; some parents had one child, while others could have one million! This resulted in partition hot-spotting, lots of dropped mutations, and, in general, excessive pressure on some nodes of the cluster.

Fig 1 — Uneven partitions in a parent child relation

To solve this issue, we applied the concept of sharding. We sharded the parent and introduced a shard key as part of the partition key. The shard size was fixed at 1000 rows so that the overall partition size could be kept to about 10 MB (each row had ~10 KB of data).
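The shard arithmetic can be sketched as follows. This is a minimal illustration, not our production code: it assumes each child carries a zero-based sequence number within its parent (the text does not say how children were assigned to shards), and the class and method names are hypothetical.

```java
final class ShardMath {
    // From the text: 1000 rows per shard and roughly 10 KB per row.
    static final long ROWS_PER_SHARD = 1000;
    static final long APPROX_ROW_BYTES = 10_000;

    // Assumption: each child of a parent carries a zero-based sequence number.
    static long shardKeyFor(long childSequence) {
        if (childSequence < 0) {
            throw new IllegalArgumentException("childSequence must be >= 0");
        }
        return childSequence / ROWS_PER_SHARD;
    }

    // Number of partitions a parent with childCount children is spread across.
    static long shardCount(long childCount) {
        return (childCount + ROWS_PER_SHARD - 1) / ROWS_PER_SHARD;
    }

    // Approximate size of one completely filled partition, in bytes.
    static long approxPartitionBytes() {
        return ROWS_PER_SHARD * APPROX_ROW_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(shardKeyFor(999));       // last row of the first shard
        System.out.println(shardKeyFor(1000));      // first row of the second shard
        System.out.println(shardCount(1_000_000));  // shards for a parent with one million children
        System.out.println(approxPartitionBytes()); // bytes in one full partition
    }
}
```

Under these assumptions, a parent with one child and a parent with one million children both produce partitions of bounded size; the large parent is simply spread over more of them.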

Fig 2 — Almost even sized partitions

This resulted in almost equal-sized partitions, and the load on the cluster became evenly distributed as well. As a bonus, we got data pagination for free during read queries! Each shard mapped nicely to a data page, and the nth page could now be fetched with a simple slice query as follows:

SELECT * FROM table WHERE parent_key = 'p1' AND shard_key = n
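Walking through the pages of one parent might look like the sketch below. It is an illustration under stated assumptions: shards are filled contiguously starting at 0, so the first empty shard marks the end; `fetchShard` stands in for executing the query above through a driver; and `children_by_parent` is a hypothetical table name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.LongFunction;

final class ShardPager {
    // Hypothetical table name; parent_key and shard_key are the columns from the query above.
    static final String PAGE_QUERY =
            "SELECT * FROM children_by_parent WHERE parent_key = ? AND shard_key = ?";

    // Reads shard 0, 1, 2, ... until an empty shard is returned.
    // fetchShard is a stand-in for running PAGE_QUERY for one shard key.
    static <T> List<T> readAll(LongFunction<List<T>> fetchShard) {
        List<T> all = new ArrayList<>();
        for (long shard = 0; ; shard++) {
            List<T> page = fetchShard.apply(shard);
            if (page.isEmpty()) {
                return all; // assumption: no gaps between shards
            }
            all.addAll(page);
        }
    }

    public static void main(String[] args) {
        // Fake data source with two non-empty shards, used instead of a real cluster.
        List<String> rows = ShardPager.<String>readAll(shard ->
                shard < 2 ? Arrays.asList("row-" + shard + "-a", "row-" + shard + "-b")
                          : Collections.<String>emptyList());
        System.out.println(rows);
    }
}
```

Because each page is a single-partition read, the cost of fetching page n does not depend on how many pages precede it.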

Pitfall 2: Ensure separation of concerns

Cassandra has limited support for secondary indexes. In one of our projects we needed to create a lot of secondary indexes and decided to use Solr with Cassandra to offload the index management while maintaining the data locality.

As the data size grew, the cluster started showing stability problems. We started seeing a lot of timeouts as the load on the cluster increased. On digging deeper, we discovered that one of the major culprits was that, for every mutation on a Cassandra row, Solr was fetching the entire partition into memory for indexing. Since each partition carried megabytes of data, this resulted in a lot of data being fetched into memory.

Fig 3 — Cassandra + Solr nodes ring — bad performance for wide rows

We segregated the Cassandra and Solr clusters and started writing to Solr through a Kafka topic. While the latencies of the pure Cassandra cluster improved significantly (almost zero timeouts), the Cassandra + Solr cluster continued to behave badly. We finally replaced the Cassandra + Solr combination with SolrCloud, and performance has been excellent since the switch, with throughput improving significantly!

Fig 4 — Separate Cassandra and Solr rings — better performance for wide rows

Pitfall 3: Avoid full reads/writes

Cassandra resolves mutation conflicts by using the “last write wins” strategy. If the same data is written back to a column at a later time, it is considered a mutation even if the data has not changed. Mutations incur a heavy cost on reads, as reads have to consult multiple SSTables to ensure that the last update is always returned. Mutations are compacted / discarded during the compaction process; however, that puts significant load on the cluster for no benefit when the real content is not changing.

Almost all Java object-mapper frameworks for Cassandra provide a proxy-based mechanism for tracking changes (i.e., which properties/columns need to be written back), which should most definitely be used. However, the read semantics are a bit slippery. Typically an entire row is read every time a read query is fired (see code below), which increases the cost of reads. A better approach is to read selectively wherever possible.

// writes:
connection.proxyForUpdate(MyBean.class); // the proxy keeps track of which fields were changed
connection.save(myBeanObject); // only the changed fields are written
// reads:
MyBean myBeanObject = connection.find(MyBean.class, myBeanId);
/*
this would typically read all the columns, which is cheap for row-oriented databases but costlier for Cassandra
*/
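The same idea can be sketched without any mapper framework. The class below is a hypothetical illustration, not the API of a real library: it records which columns actually changed and builds an UPDATE that names only those, plus a SELECT that names only the columns the caller needs. The table and column names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;

final class ChangeTracker {
    private final List<String> changedColumns = new ArrayList<>();

    // Record a column only if its value really changed.
    void set(String column, Object oldValue, Object newValue) {
        if (!Objects.equals(oldValue, newValue) && !changedColumns.contains(column)) {
            changedColumns.add(column);
        }
    }

    // Builds an UPDATE naming only the changed columns; empty if nothing changed.
    Optional<String> updateCql(String table, String keyColumn) {
        if (changedColumns.isEmpty()) {
            return Optional.empty(); // nothing to write, so no mutation is created
        }
        List<String> assignments = new ArrayList<>();
        for (String column : changedColumns) {
            assignments.add(column + " = ?");
        }
        return Optional.of("UPDATE " + table + " SET " + String.join(", ", assignments)
                + " WHERE " + keyColumn + " = ?");
    }

    // Builds a SELECT that reads only the requested columns instead of the whole row.
    static String selectCql(String table, List<String> columns, String keyColumn) {
        return "SELECT " + String.join(", ", columns) + " FROM " + table
                + " WHERE " + keyColumn + " = ?";
    }

    public static void main(String[] args) {
        ChangeTracker tracker = new ChangeTracker();
        tracker.set("name", "Alice", "Alice");               // unchanged, not tracked
        tracker.set("email", "a@old.test", "a@new.test");    // changed, tracked
        System.out.println(tracker.updateCql("my_bean", "id").orElse("(no write needed)"));
        System.out.println(selectCql("my_bean", Arrays.asList("email"), "id"));
    }
}
```

Skipping the write entirely when nothing changed matters as much as trimming the column list, since an identical value written again still counts as a mutation.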

Pitfall 4: Beware of collections

Cassandra supports Set, Map, and List collections as column types. A common mistake is to read the collection in the application, modify it, and then write it back, as shown in the following code:

List<Object> c = cassandraDO.getList();
c.remove(obj1);
c.add(obj2);
cassandraDO.setList(c); // BAD: re-writes the entire list.
cassandraDO.save();

This rewrites the entire collection, which creates a lot of tombstones and hurts performance. Furthermore, Sets should be preferred to Lists, as Sets (and Maps) avoid the read-before-write pattern for updates and deletes. An update or delete in a List will potentially read the entire list into memory (server-side) again, since there’s no ‘data uniqueness’.
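One way to avoid the full rewrite is to send only the difference. The sketch below assumes the column is a Set and computes which elements were added and removed, so that each can be bound to a CQL statement that appends to or subtracts from the set in place. The table name `my_table` and the columns `items` and `id` are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class SetDelta {
    // CQL that modifies a set column in place instead of overwriting it.
    // Table and column names are placeholders for illustration.
    static final String ADD_CQL = "UPDATE my_table SET items = items + ? WHERE id = ?";
    static final String REMOVE_CQL = "UPDATE my_table SET items = items - ? WHERE id = ?";

    // Elements present in 'after' but not in 'before'.
    static <T> Set<T> added(Set<T> before, Set<T> after) {
        Set<T> result = new LinkedHashSet<>(after);
        result.removeAll(before);
        return result;
    }

    // Elements present in 'before' but not in 'after'.
    static <T> Set<T> removed(Set<T> before, Set<T> after) {
        Set<T> result = new LinkedHashSet<>(before);
        result.removeAll(after);
        return result;
    }

    public static void main(String[] args) {
        Set<String> before = new LinkedHashSet<>(Arrays.asList("obj1", "obj3"));
        Set<String> after = new LinkedHashSet<>(Arrays.asList("obj2", "obj3"));
        // Bind each delta to the matching statement; skip a statement whose delta is empty.
        System.out.println(ADD_CQL + "  <- " + added(before, after));
        System.out.println(REMOVE_CQL + "  <- " + removed(before, after));
    }
}
```

Only the changed elements travel to the cluster, and untouched elements are never rewritten.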

Another issue with collections is that they are read in their entirety; there is no support for selective reads. For example, if a map collection has 10,000 entries and a query wants to read the values for 5 keys, all 10,000 entries will be read by Cassandra. This is unnecessary and wasteful.

Pitfall 5: Driver tunings

Driver tuning has a lot of impact on how queries perform. We found an excellent article by DataStax that describes a lot of settings, which resulted in better overall performance and fewer timeouts. Since we were only using the local data center, we set the remote host connections to 0. Also, setting shuffleAmongReplica to true gave us better load distribution (the downside being more local read-repairs happening on the server side).

Summary

Cassandra is a great system if used properly. Some of the key factors affecting query performance are proper partitioning of data, avoiding hot-spotting, avoiding full reads/writes, and tuning the driver properly for the expected query load. We learnt a great deal by changing our RDBMS-oriented mindset, and we have been happily using Cassandra as one of our main data sources at Walmart for the last couple of years.