Monitoring Cassandra

Michał Łowicki
Dec 6, 2015 · 4 min read

Besides the many tremendous advantages Cassandra brings, there are some drawbacks to keep in mind. The two I'd like to focus on here are operational cost and stability.

I’ll write about the stability of the 2.1.x series (which my team uses in production), where a couple of memory leaks have already shipped in releases (CASSANDRA-9549, CASSANDRA-9681) and, lately, my favourite: CASSANDRA-7953, which causes unbounded growth of tombstones and, for us, means very active nights and heroic actions.

The quality of releases is a known issue and it’ll be addressed. For more details read the whole “Cassandra 2.2, 3.0, and beyond” message sent to
user@cassandra.apache.org by Jonathan Ellis:

After 3.0, we’ll take this even further: we will release 3.x versions monthly. Even releases will include both bugfixes and new features; odd releases will be bugfix-only. You may have heard this referred to as “tick-tock” releases, after Intel’s policy of changing process and architecture independently.

The primary goal is to improve release quality. Our current major “dot zero” releases require another five or six months to make them stable enough for production. This is directly related to how we pile features in for 9 to 12 months and release all at once. The interactions between the new features are complex and not always obvious. 2.1 was no exception, despite DataStax hiring a full time test engineering team specifically for Apache Cassandra.

It sounds promising and I’m glad that the talented people behind Cassandra are trying to push the quality of this complex system into a higher league.

Operational cost shows up in many places: observing the health of the system while running repairs, making sure compactors keep up with the influx of data so the number of SSTables stays under control, or tuning the garbage collector to avoid long GC pauses.

To make debugging easier, to make sure the system is behaving well, or to run configuration experiments, you need solid monitoring in place. I’ll describe how we handle it for Opera sync. Our team is self-contained, so we‘re responsible for everything from implementing application logic (mainly in Python), through setting up the database, to maintaining our playground and production deployments (including guard duty). Our primary monitoring tools are StatsD, Graphite and Grafana. How do we feed StatsD with data related to Cassandra?

Metrics from the application are sent either with the pure Python client or with one of the clients dedicated to a particular framework, like Django. To get metrics from other sources, Diamond is an excellent tool. It supports pluggable collectors, and the built-in set covers most basic cases: you can monitor CPU, disk usage or network-related stuff with a simple configuration.
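For illustration, here is a minimal sketch of what feeding StatsD from application code can look like, assuming the statsd package from PyPI and a StatsD daemon on its default UDP port 8125 (the metric names and prefix are made up for the example):

import statsd

# Assumes a StatsD daemon listening on the default UDP port 8125;
# the prefix keeps metrics from this service under one namespace.
client = statsd.StatsClient('localhost', 8125, prefix='sync.app')

# Count an event, e.g. a handled request.
client.incr('requests.handled')

# Record a timing in milliseconds, or wrap a block with a timer.
client.timing('db.write_ms', 42)
with client.timer('db.read'):
    pass  # replace with the actual database call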

Cassandra exposes metrics via JMX, and Diamond already has built-in integration for it through the Jolokia collector:

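# Diamond per-collector configuration; Diamond usually reads these from its
# collectors directory (e.g. /etc/diamond/collectors/), in a file named after
# the collector class.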
enabled = True
host = localhost
port = 8778
path = cassandra.jmx
mbeans = '''
org.apache.cassandra.metrics:type=Compaction,name=BytesCompacted |
org.apache.cassandra.metrics:type=Compaction,name=CompletedTasks |
org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
'''

Adding the Jolokia agent to Cassandra is handled automatically by cassandra-env.sh:

[ -e "$CASSANDRA_HOME/lib/jolokia-jvm-1.2.3-agent.jar" ] &&
JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/jolokia-jvm-1.2.3-agent.jar"

So just make sure the JAR file is there (/usr/share/cassandra/lib/jolokia-jvm-1.2.3-agent.jar in our case). After a couple of minutes at most, the new metrics should be visible in Graphite.
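To double-check that the agent actually responds, a quick sanity check can be done against Jolokia’s HTTP API. The sketch below assumes the agent’s default port 8778 (as in the Diamond config above) and that the requests library is installed; it reads one of the mbeans listed earlier:

import requests

# The Jolokia JVM agent answers HTTP on port 8778 by default.
# This reads one of the mbeans from the Diamond config above.
url = ('http://localhost:8778/jolokia/read/'
       'org.apache.cassandra.metrics:type=Compaction,name=PendingTasks')

response = requests.get(url).json()
print(response.get('value'))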

If you want to get a list of all available metrics, mx4j is a good place to start. It’s an HTML interface to JMX. Go to cassandra-env.sh to enable it: at the bottom of the file you need to uncomment two variables and put mx4j-tools.jar into the appropriate location on the box where mx4j should be used.

# To use mx4j, an HTML interface for JMX, add mx4j-tools.jar to the
# lib/ directory.
# See http://wiki.apache.org/cassandra/Operations#Monitoring_with_MX4J
# By default mx4j listens on 0.0.0.0:8081. Uncomment the following
# lines to control its listen address and port.
MX4J_ADDRESS="-Dmx4jaddress=127.0.0.1"
MX4J_PORT="-Dmx4jport=8081"

The web interface exposed by mx4j will show you a really long list of various metrics.
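If you’d rather stay on the command line, the Jolokia agent set up earlier can produce a similar inventory: its search operation returns every mbean matching a pattern. A small sketch, again assuming the default port 8778 and the requests library:

import requests

# Jolokia's "search" operation lists all mbeans matching the pattern,
# which is a quick way to find candidates for the Diamond config.
url = 'http://localhost:8778/jolokia/search/org.apache.cassandra.metrics:*'

for mbean in sorted(requests.get(url).json().get('value', [])):
    print(mbean)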

StatsD + Graphite is an excellent duo for gathering metrics, but not the best for visualising them. Fortunately there is Grafana, which is used heavily by various teams at Opera. We have dedicated dashboards for Cassandra which, besides things like disks, CPU and network, contain everything we found useful via JMX.

We don’t retrieve everything exposed via JMX, but we monitor lots of things: compaction, caches, Bloom filters, thread pools, the number of processed requests and so on.

Cassandra is definitely a tool you should consider when looking for a database. If C* fits your use case, remember to monitor it carefully from day one. It’ll save you lots of time when problems occur, as you’ll easily see a trend or the exact moment an issue started, which makes finding the culprit much more efficient.

The solutions mentioned above aren’t the only ones we use to check what is going on with Cassandra. In later posts I’ll write more about monitoring the garbage collector and gathering warnings and errors from system.log using Logstash and Kibana.

