The wrong way to scale up Cassandra DB when secondary indexes are in place
Cassandra is my favorite (not managed) database for many reasons: it doesn’t have a single point of failure (SPoF), supports multi-region, good for read and write ops, flexible about read and write consistency levels, scales linearly and not too complex to manage for day-to-day operations.
Like every database, you should use Cassandra based on your data access patterns, so if you need a flexible database for ad-hoc queries or adaptable enough for constantly database model changes, you should consider other options.
Cassandra is a column-oriented DB and it’s really powerful when you have your data queries already defined. Datastax, the company that supports Cassandra, recommends you should start by designing your queries and then your data model in Cassandra. Despite the fact of your columnar structure, Cassandra support many data structures as a column(s) type, such as Maps.
Cassandra basic structure
Cassandra is a primary key database, which means your data is persisted and organized around a cluster based on the hash value (the partition key) of the primary key. For tables that have more than one PK, Cassandra considers only the first part of the PK as it’s partition key. See more about composite keys here.
To be more clear, let’s get back to one of the most important characteristics of a Cassandra DB: it’s architecture and the fact that it does not have a SPoF.
A Cassandra cluster is composed by nodes (3 or more) and those nodes together composes a ring of nodes:
Each node on a Cassandra's cluster works "independently", but different nodes could store the same data accordingly the replication factor (RF) configuration configured for the cluster.
To know where (which node) your data is persisted, Cassandra uses the hash value (token) calculated through a consistent hash function using the PK column of a given table.
When you run a query, the coordinator node (normally the closest one of your application instances) will look for what nodes in a ring have your data, in that way if one node is down for some reason, another node could serve your data (RF ≥2). That's the magic about a masterless approach, where every node in a ring is equal in terms of read and writes.
This concept about PK and the replication factor is very important to understand about how to scale your Cassandra cluster when your application is under high load conditions.
Cassandra has also the concept of secondary indexes. In relational databases, you could have many indexes in a given table, the cost of have a secondary index is associated with write operations, not for read operations. In Cassandra this is not true.
Secondary indexes in Cassandra could be useful and tempting when your data model changed and you need to query based on a new column.
In that way, with a secondary index, you could run a query like that:
SELECT * FROM my_table WHERE SECONDARY_INDEX = ‘value’;
The problem about using a Secondary Index
Imagine the scenario: you are in a Blackfriday/CyberMonday and your Cassandra cluster is suffering from peak events and you need to add more nodes to scale your database, balancing better your traffic and… surviving. Fine, right?
Normally, it is a normal situation in a highly scalable application. But what about if your application is running queries using a secondary index?
Yeah, you got the point.
Remember when I said that Cassandra distributes data in a ring using the partition key? This is already happening, but the problem is when you introduce a secondary index in your query. Secondary indexes are NOT A PART of a partition key, and Cassandra knows about where your data are living through the partition key. When you run a query that uses this kind of index, what Cassandra does is looking for each node in your ring trying to satisfy your query.
During a Blackfriday, our applications were with high loads. Many and many customers wanting to benefit from the huge discounts provided by a Blackfriday event.
We took a look at our APM and all the analysis led us to our persistence, in this case a Cassandra DB. We got long periods of latency, but not for every request, just for some.
Trying to keep things back to the normal state again, our first maneuver was to add more nodes to our Cassandra cluster.
We added and we're still suffering from latency issues. The question was: why is this still happening?
We were wrong. It was a simplistic conclusion and we didn't take care for one very important detail: this behaviour was happening not in all requests, but in some of them.
If you thought about the secondary index, bingo! That was exactly the problem.
Adding nodes would never solve the problem, because the problem was not related to all the queries arriving in the database, the problem was in some and those were the real ones degrading the database performance. It was totally a Pareto thing.
Detailing the issue and how we mitigate it
At one moment before the Blackfriday event, we needed to change our data model. We regionalised our application and the customer's region started to be an important thing for us, we needed to query data based on a product OR region.
Looking back and connecting the dots, we could realize that we were very precious about the implementation as we wanted to reflect this new behaviour not only in the API Layer (new query param), but also in the way that we accessed the data in Cassandra.
And why were we so precious? Because even considering that our query time didn't increase that much, we did the change.
That implementation not only increased our query time by using a secondary index, but also generated more issues according we scaled up Cassandra’s infrastructure. As we added more nodes in our cluster, meant more nodes to look up to find the data, thus the issue was increasing exponentially.
To mitigate the issue, what we did was take back the number of nodes that we had previously and increased the replication factor for the majority of our nodes in the cluster.
We also changed our read consistency level to be less consistent. We were using *QUORUM and instead we changed to ONE. This helped us to lower the load in nodes.
As we froze our applications days before the event, we knew that we didn't have new data (write operations) and the data would be consistent in their current state.
The days after and the DB model solution
As a part of the final solution, we needed to (re)think about our database model and roll back the changes that we did as a mitigation path during the event.
Before the event we were using the product ID (PID)as the partition key, which was a good decision, since the PID has good attributes to be a PK due to its nature about being a sequential number (high cardinality), and in that way spreading the data evenly around the cluster.
About the new field “region”, we leverage Cassandra collections data type and used a Map for each region as a column in our product table.
Secondary Indexes are always a bad idea?
The short answer is no.
Explaining a bit better, there are two kinds of indexes in Cassandra: local and global indexes.
A local index as the name says is a kind of index that exists only locally, that means in a node. When you create a secondary index, Cassandra creates a new (hidden) table where the secondary becomes a primary key in this table. The visibility of this new table is in terms of a node, not a ring (cluster). That's the case of secondary indexes.
On the other hand, a global index has ring visibility through it's partition key, so Cassandra knows where you data is in a ring through that partition key.
Secondary indexes could be an alternative, when you have in a query both: primary and secondary indexes. In that case, Cassandra knows where your data resides (which node) through the partition key and then looks up the local table in the node that refers to the (local) secondary indexes.
There are also some other nuances about secondary indexes that are very well explained here, but the best practice is to avoid them by denormalizing your data model.