My Cassandra 2.0 Diagnostics Checklist (Brain Dump)
This has gotten pretty dated now and the state of the art is Cassandra 2.1 and the G1GC JVM with JDK 1.8. I’ll update this checklist at a later date, but for now it’s useful for those of you still on 2.0. It’s really a mixed bag on 2.1 as all of the JVM settings advice is suspect in 2.1.
This isn’t remotely complete, but I had a colleague ask me to do a brain dump of my process and this is by and large usually it. I’m sure this will leave more questions than answers for many of you, and I’d like to follow up this post at some point with detail of the why and the how of a lot of it so it can be more useful to beginners. Today this is in a very raw form.
Important metrics to collect
- schema of all tables
- nodetool status
- ulimit -a as user that cassandra is running as (make sure it matches our settings)
Import facts to collect
- heap usage under load, is it hitting 3/4 of MAX_HEAP ?
- Pending compactions (opscenter, will have to look up JMX metric later)
- size of writes/reads
- max partition size (histograms will say)
- tps per node
- list of queries run against the cluster
Bad things to look for in logs
- GCInspector (see parnews over 200ms and CMS )
- Out Of Memory
Typically screwed up cassandra-env.sh settings
- heap not set to 8gb
- parnew no more than 800mb (unless using tunings from https://issues.apache.org/jira/browse/CASSANDRA-8150)
Typically screwed up cassandra.yaml settings
- row cache being enabled (unless 95% read with even width rows)
- vnodes set with solr
- system_auth keyspace still with RF 1 and SimpleStrategy
- flush_writers set to crazy high level (varies by disk configuration, follow documentation advice, double digits is suspect)
- rcp_address: 0.0.0.0 (slows down certain versions of the driver)
- multithreaded_compaction: true (almost always wrong)
Typically bad things in histograms
- double hump
- long long tail
- partitions with cell count over 100k
- partitions with size over in_memory_compaction_limit (default 64mb)
Typically bad things in tpstats
- dropped (anything) especially mutations.
- blocked flush writers (if all time is in the 100s it’s usually a problem)
Typically screwed up keyspace settings
- STC compaction in use on SSD when customers have a low read SLA (or no defined one).
- Using RF less than 3 per DC
- Using RF more than 3 per DC
- Using SimpleStrategy with multiDC
Typically screwed up CF settings
- read_repair_chance and dclocal_read_repair_chance adding up to more than 0.1 is usually a bad tradeoff.
- Secondary indexes in use (on writes think write amplification, and on reads think synchronous full cluster scan).
- Is system_auth replicated correctly? And has it has repair run after this was changed? If you see auth errors in the log..the answer is probably no
Typically bad things in nodetool status
- use of racks is not even (4 in one rack and 2 in another..that’s a no no)
- use of racks is not enough to fulfil muliple of RF (if you have 2 racks of 2 and RF 3..how will that get evenly laid out?).
- if load is wildly off. may not mean anything, but go look on disk, if the cassandra data files are really imbalanced badly figure out why.
Performance tuning Cassandra for write heavy workloads
- Run cstress as a baseline
- Run cstress with appications write size this will often identify bottlenecks
- Do math on writes and desired TPS. Are the writes saturating the network? Don’t forget bits and bytes are different ☺
- Lower ParNew for lower peak 99th, now this flies in the face of what is happening with this Jira (https://issues.apache.org/jira/browse/CASSANDRA-8150), but until I’ve worked through all of that ParNew lower than 800MB is generally a good way to tradeoff throughput for smaller ParNew GCs.
- Run Fio with the following profile https://gist.github.com/tobert/10685735 (adjusted to match users system). Using these as baseline (http://tobert.org/disk-latency-graphs/).
Performance Tuning Code for write heavy workloads
Once you’ve established the system is awesome. Review queries and code.
- Things to look for BATCH keyword for bulk loading (fine for consistency, but has to take into account SLA hit if the writes are larger than BATCH can handle).
- If using batches what is the write size?
- Are you using a thrift driver and destroying one or two nodes because of bad load balancing?
- Are you using the DataStax driver, is it using the token aware policy (shuffle on with 2.0.8 ideally)?
- If using the DataStax driver, is it the latest 2.0.x Lots of useful fixes in each release, it really matters.
- Using LWT? They involve 4 round trips and so ..while they’re awesome they’re slower.