The wrong way to scale up Cassandra DB when secondary indexes are in place
Cassandra is my favorite (not managed) database for many reasons: it doesn’t have a single point of failure (SPoF), supports multi-region, good for read and write ops, flexible about read and write consistency levels, scales linearly and not too complex to manage for day-to-day operations.
Like every database, you should use Cassandra based on your data access patterns, so if you need a flexible database for ad-hoc queries or adaptable enough for constantly database model changes, you should consider other options.
Cassandra is a column-oriented DB and it’s really powerful when you have your data queries already defined. Datastax, the company that supports Cassandra, recommends you should start by designing your queries and then your data model in Cassandra. Despite the fact of your columnar structure, Cassandra support many data structures as a column(s) type, such as Maps.
Cassandra basic structure
Cassandra is a primary key database, which means your data is persisted and organized around a cluster based on the hash value (the partition key) of the primary key. For tables that have more than one PK, Cassandra consider only the first part of the PK as it’s partition key. See more about composite keys here.
To be more clear, let’s getting back to one of the most important characteristics of a Cassandra DB: it’s architecture and the fact that it does not have a SPoF.
A Cassandra cluster is composed by nodes (3 or more) and those nodes together composes a ring of nodes:
Each node on a Cassandra's cluster works "independently", but different nodes could store the same data accordingly the replication factor (RF) configuration configured for the cluster.
To know where (which node) your data is persisted, Cassandra uses the hash value (token) calculated through a consistent hash function using the PK column of a given table.
When you run a query, the coordinator node (normally the closest one of your application instance) will look for what nodes in a ring have your data, in that way if one node is down for some reason, other node could serve your data (RF ≥2). That's the magic about a masterless approach, where every node in a ring is equal in terms of perform read and writes.
This concept about PK and the replication factor is very important to understand about how scale your Cassandra cluster when your application is under high load conditions.
Cassandra has also the concept of secondary indexes. In relational databases, you could have many indexes in a given table, the cost about have a secondary index is associated with write operations, not for read operations. In Cassandra this is not true.
Secondary indexes in Cassandra could be useful and tempting when your data model changed and you need to query based on a new column.
In that way, with a secondary index, you could run a query like that:
SELECT * FROM my_table WHERE SECONDARY_INDEX = ‘value’;
The problem about using a Secondary Index
Imagine the scenario: you are in a Blackfriday/CyberMonday and your Cassandra cluster is suffering from peak events and you need to add more nodes to scale your database, balancing better your traffic and… surviving. Fine, right?
Normally, it is a normal situation in a highly scalable application. But what about if your application is running queries using a secondary index?
Yeah, you got the point.
Remember when I said that Cassandra distributes data in a ring using the partition key? This is already happening, but the problem is when you introduce a secondary index in your query. Secondary indexes are NOT A PART of a partition key, and Cassandra knows about where your data are living trough the partition key. When you run a query that uses this kind of index, what Cassandra does is looking for each node in your ring trying to satisfy your query.
During a Blackfriday, our applications were with high loads. Many and many customers wanting to benefit from the huge discounts provided by a Blackfriday event.
We took a look at our APM and all the analysis led us to our persistence, in this case a Cassandra DB. We got long periods of latency, but not for every request, just for some.
Trying to keep things back to the normal state again, our first maneuver was to add more nodes to our Cassandra cluster.
We added and we still suffering from latency issues. The question was: why this is still happening?
We were wrong. It was a simplistic conclusion and we didn't take care for one very important detail: this behaviour was happening not in all requests, but in some of them.
If you thought about the secondary index, bingo! That's was exactly the problem.
Adding nodes would never solve the problem, because the problem was not related to all the queries arriving in the database, the problem was in some and those were the real ones degrading the database performance. It was totally a Pareto thing.
Detailing the issue and how we mitigate it
At one moment before the Blackfriday event, we needed to change our data model. We regionalised our application and the customer's region started to be an important thing for us, we needed to query data based on a product and region.
Looking back and connecting the dots, we could realize that we were very precious about the implementation as we wanted to reflect this new behaviour not only in the API Layer (new query param), but also in the way that we accessed the data in Cassandra.
And why we were so precious? Because even considering that our query time didn't increase that much, we did the change.
That’s implementation not only increased our query time by using a secondary index, but also generated more issues according we scaled up Cassandra’s infrastructure. As we added more nodes in our cluster, meant more nodes to lookup to find the data, thus the issue was increasing exponentially.
To mitigate the issue, what we did was take back the number of nodes that we had previously and increased the replication factor for the majority of our nodes in the cluster.
We also changed our read consistency level to be less consistent. We were using *QUORUM and instead we changed to ONE. This helped us to lower the load in nodes.
As we frozen our applications days before the event, we know that we didn't have new data (write operations) and the data would be consistent in their current state.
The days after and the DB model solution
As a part of the final solution, we needed to (re)think about our database model and rolling back the changes that we did as a mitigation path during the event.
Before the event we were using the product ID (PID)as the partition key, what was a good decision, since the PID has good attributes to be a PK due to it’s nature about be a sequential number, and in that way spreading the data evenly around the cluster.
About the new field “region”, we leverage from Cassandra collections data type and used a Map for each region as a column in our product table.
Secondary Indexes are always a bad idea?
The short answer is no.
Explaining a bit better, there are two kinds of indexes in Cassandra: local and global indexes.
A local index as the name says is a kind of index that exists only locally, that means in a node. When you create a secondary index, Cassandra creates a new (hidden) table where the secondary become a primary key in this table. The visibility of this new table is in terms of a node, not a ring (cluster). That's the case of secondary indexes.
On the other hand, a global index has ring visibility through it's partition key, so Cassandra knows where you data are in a ring through that partition key.
Secondary indexes could be an alternative, when you have in a query both: primary and secondary indexes. In that case, Cassandra knows where you data resides (which node) through the partition key and then lookup the local table in the node that refers to the secondary indexes.
There are also some other some nuances about secondary indexes that are very well explained here, but the best practice is to avoid them by denormalizing your data model.