Properly testing your Cassandra cluster before you’re in production
In my early days at DataStax (circa Cassandra 1.2 era) this was rarely a major problem. Most of the customers I had then were deploying Cassandra with plenty of operational head room:
- Cheap hardware, so adding more was cheap.
- Sub terabyte nodes. So lots of nodes as more data is added, losing a node out of 100 is no big deal.
- Simple data models that were key value style lookups. So scaling out was trivial.
This meant if they got the production demand wrong they’d just add more cheap nodes and everyone was happy. Fast forward to today the style seems to be more:
- Extremely expensive enterprise hardware.
- Very dense nodes to justify the cost of the hardware, so adding new nodes takes a long time, and each node represents a larger percentage of your capacity.
- Data models that are sometimes naively lifted 1 for 1 from Oracle data model that wasn’t scaling (so probably not easy to scale out then when dropped into Cassandra).
The recent style is fundamentally more brittle and leads to problems that are systemic and more likely to take down the entire cluster as less nodes, longer times to provision and more density all harm the entire reason you went with Cassandra in the first place. So if you’re going to production with that scenario you better test a lot and test accurately. This blog aims to cover the ways to do that.
Mimic your production workload
Many people testing a Cassandra based application and cluster for the first time are coming from a decade or more of work in the relational database world, so not only is this a new database technology and API to learn, but an entirely new way of writing an application to scale on a distributed database instead of a single machine. Many folks to ease the pain of the transition from single machine to distributed will make heavy use of materialized views, secondary indexes, UDF (user defined functions) or even an additional search engine such DSE Solr. It’s important to realize that in most applications at large scale these techniques are not heavily used, and if they are allowances are made for the performance penalty in node sizing and SLAs.
In summary, this means your testing workload should match your production workload as closely as possible and never change a variable unless you can fully account for the difference it will make. Below I will discuss in detail the tradeoffs and considerations you need to make to validate your use case, cluster tunings and code.
Establish SLA targets
Decide now before you go into production what is fast and what is slow, and how often is slow allowed to occur. If you have no idea a few basic rules are good:
- Max is bad, you have no control over the worst case scenario. Still you may monitor this for extreme outliers and keep an eye on it. Do not make decisions based on this.
- 99.9% is aggressive but if you have a latency sensitive use case this is what is required. Know you will spend a lot on hardware if you need to keep this really close to the 50%.
- 95% is pretty loose but for cases where latency is not easily noticed this is a good metric to measure.
- 99% is my sweet spot and the most common.
Now that you’ve established the request percentiles if you’re still not sure about common expectations out there I can share my opinion and experiences. All the following are on clusters with more data on disk than there is ram to help eliminate buffer cache and is measuring in the 99th percentile via cfhistograms:
- On top of the line high CPU nodes with NVM flash storage: less than 10ms.
- On most sata based SSDs (a few notable exceptions): less than 150ms is common. Testing, controller etc will shift this A LOT and some hardware will be closer to NVM specs.
- On most sata based systems: 500ms is common, but this varies the most. Controller, contention, data model, tuning can shift this an order of magnitude.
Establish TPS Targets
Now that you’ve established your SLA go ahead and come up with how many TPS you expect your cluster to handle. Match this in your test plan with the same node count or at least ratio and see how the cluster behaves.
If your cluster blows the SLAs try and root cause why. Does it hit SLAs below that? What TPS level do your SLAs suffer?
Monitor the important metrics
All of these can be found from a combination of JMX mbeans or iostat/dstat. Opscenter is a tool that does both if you’re interested in DataStax this can be a good way to go.
- Monitor system hints generated. This should not be continually generating a high number. If a node is actually down or a data center link is down. Expect this to accumulate. This is by design.
- Monitor pending compactions. With STCS this should not hang out too long over 100. With LCS it’s much harder to say but if it stays over 100 it’ll likely stay over 1000. DTCS there is less field experience but I’d wager a 1000 would make me nervous.
- Monitor pending writes. Should not be over 10k for any length of time, this indicates a tuning or sizing problem.
- Monitor heap usage. Are you getting huge spikes with long pauses?
- Monitor IO wait. Anything above 5 is a sign of the CPU waiting too long on IO, this is a sign of being IO bound.
- Monitor CPU Usage. If it’s hyper-threading don’t forget that if half your threads are at 100% CPU usage that’s totally busy, even if your tool is reporting 50% usage.
- Monitor your SLAs (95%, 99%, 99.9% in my case). You established these right?
- SSTable count for STCS should not get too far over 100. For LCS monitor level 0 make sure that it does not do the same. If in any case you get over 10k sstables you’re in a for a long day. Look up CASSANDRA-11831 for future help on this issue.
- Look for nasty errors or warnings in the logs. Literally just grepping for the following will help:
GCInspector (recent 2.1–3.0 versions of Cassandra when using the G1GC this is extra chatty)
Simulate real outages
- Take a node or two offline, even more. Do your queries start failing? Do they start working without any change when you bring the required number of nodes back online?
- Take a data center offline while taking load. How long can it be offline before your system.hints build up to unmanageable levels.
- Run an expensive task that eats up CPU on a couple of nodes. This is so your prepared for the effect on your nodes if someone manages this.
- Do a stupidly expensive range queries in CQLSH while taking load and see how the cluster behaves. This is always a possibility and devs will sometimes do naive things such as:
SELECT * FROM FOO LIMIT 100000000;
Simulate real ops
- Run repair while you’re doing your load testing. You left enough overhead to complete that too right?.
- Bootstrap a new node into the cluster while you load test. You’re gonna need to do this in prod eventually.
- Online a new data center while you load test. Someday you’re probably going to have to do this, you should have some idea of what the process looks like.
- Change the schema around a bit, add new tables, remove tables, alter existing. See how the cluster behaves with all this. Generally this is never a problem, but you should test it especially during outage simulations.
- Replace a “failed” node. No how long it’ll take to bootstrap a new node to production when you’ve lost a portion of your cluster.
- Repeat the above with some outages so you KNOW how Cassandra behaves.
Run a real replication factor and at least 5 nodes
Ideally you would actually run the same number of nodes that you expect to run in production, but even if you’re planning on running only 3 nodes in production, test with 5. You may find quickly that the queries that were fast with 3 nodes are much slower with 5 (secondary indexes come to mind as a common query path that is exposed in this scenario as being slower at scale).
Replication factor should be what it is in production ( which should almost always be 3 per data center but that’s another blog post).
Mimic your production data center setup
Each data center adds addition write cost. If you are expecting to have 5 data centers but only test with 2, you’re going to find out the hard way that each write is actually going out to 4 nodes with 2 data centers and RF 3 in each (forwarding coordinator in the remote data sends the data on to the other replicas in it’s data center), but going out to 7 nodes with 5 data centers. Your writes are now almost 2 times as expensive on your coordinator, do you think it’ll be just as fast with the same cost?
Likewise mimic your network topology with ideally the same latency and the same pipe that will be used later on. Test with comparable TPS that you expect. You’ll often find these remote links can be a limiting factor for many of your use cases and expected bandwidth. Don’t be like one customer of mine and be off by a factor of 20 right before you deploy to production.
Mimic your expected per node density
First decide what this is amount is!
I can tell you from experience unless you’re using SSDs, lots of CPU, a fair amount of RAM, pushing above a TB per node is rarely a good idea. Regardless you’ll want to test it and not take my word for it.
Previous experience tells me the top end of density before people get sad is (depending on compaction strategy):
STCS w SSD: 2–5TB
STCS w HDD: 600gb–1TB
LCS w SSD: 1TB
LCS w HDD: 100gb (cause LCS shouldn’t be used with HDD)
DTCS (assuming the data model is the best fit): 3–5TB
Even at those densities though you’re paying a penalty during bootstrap, but as long as you test this and document it you have my blessing.
Now that you’ve decided your target per node density, fill that cluster up until you hit it. What happens to your SLAs as you approach your target density? What happens if you go OVER your target density? How long does bootstrap take? What happens with repair?
Cassandra stress versus home grown testing
This is a nuanced and tricky topic but we will need to start with a simple naive comparison to have a baseline:
- it’s written already
- steep learning curve
- simple data models
- good reporting
Home grown tooling:
- have to write it
- learning curve is as shallow as they come you wrote it
- can make the data model support as complex as you need it
- have to write your own reporting
Ok so now that simplification is out the way:
If you are new to Cassandra, you will probably botch the home grown tooling, so I suggest you learn Cassandra stress and do your best with that (upcoming blog post for that). After you’ve got your cluster well tuned with stress, go ahead and build the appropriate home grown tooling and compare. This can help you identify problems in your code as well.
If you’re very experienced at writing Cassandra applications, (then why are you reading this) you probably do not need stress and it won’t be worth it to master it.
If you’re just looking to smoke test a cluster that will be general purpose for several applications, go ahead and throw Cassandra stress at it with the default data model
Follow all these guidelines and you should be able to avoid an unpleasant surprise in production. Do it well enough and you may get a growth plan that you can live by for the next 5 years. If you’re really good, maybe just maybe, you can be like Crowdstrike and turn a commonly held wisdom upside down.