Cassandra — keep calm and tune options!

Nowadays it is so easy to build your own database setup. You download a few hundred megs of software, put them on a disk, unpack, install, tweak an option or two in the default config with the first search hit for “ASeriousDB can’t connect why oh why”, and voilà! It’s working. Kind of. Until you actually start pushing real data into it. More than a few requests per second. And try to read it all back later. Then it turns out that you occasionally get nasty timeouts or 500-ish responses to your kind requests. You start debugging, and… you enter a totally new world where no easy answers can be googled.

Jakub Lida
Akamai Krakow Blog
4 min read · Nov 13, 2018


Let’s focus on Apache Cassandra, a highly tunable distributed NoSQL database with a low-barrier, SQL-resembling query language, CQL (Cassandra Query Language). That last property comes with a big caveat, as it may be the cause of many failures in real-life usage. I’d say: “if you want to use a hammer, understand the hammer and think like one”, but that doesn’t mean you should treat everything as if it were a nail; use it for nailish objects exclusively. If Cassandra is a hammer, a CQL query is a nail, and a SQL query is a screw. You need a screwdriver for screws; forget the hammer, or you won’t be able to pull the screw out nicely afterwards.

As Cassandra is a NoSQL database without foreign keys and relations, you have to forget about normalization. This narrows Cassandra’s usage to specific data sets, as you are forced to duplicate the data in every table where it’s needed; in fact, it’s the only good way to do it. You need to be careful, though, if you later rely on the copies staying consistent. And there we arrive at the crucial question: do you need strong consistency? What price are you willing to pay for it?
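To illustrate the point, here is a hypothetical sketch of such duplication in CQL (the table and column names are made up; the pattern is query-first modeling, one table per query):

```sql
-- Query: "all posts by a given author, newest first"
CREATE TABLE posts_by_author (
    author      text,
    posted_at   timestamp,
    post_id     uuid,
    body        text,
    PRIMARY KEY (author, posted_at)
) WITH CLUSTERING ORDER BY (posted_at DESC);

-- Query: "a single post by its id" -- the same data, laid out differently
CREATE TABLE posts_by_id (
    post_id     uuid PRIMARY KEY,
    author      text,
    posted_at   timestamp,
    body        text
);

-- Every write goes to both tables; a logged batch keeps the copies in step.
BEGIN BATCH
    INSERT INTO posts_by_author (author, posted_at, post_id, body)
        VALUES (?, ?, ?, ?);
    INSERT INTO posts_by_id (post_id, author, posted_at, body)
        VALUES (?, ?, ?, ?);
APPLY BATCH;
```

Two tables, two layouts, one set of facts; if the batch is skipped, the copies drift and you are back at the consistency question above.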

Cassandra leaves you plenty of choices. You may decide to keep strong consistency at the price of request latency, as it takes more time to ensure your data gets replicated to all the places it should. You may choose to ignore the restriction and merely hope for consistency, if you have decent network connections or store statistical data, in which single data points have little or even no importance. You may try to find the sweet spot and require data to be replicated to a majority (but not necessarily all) of the replicas, a quorum. Last but not least, when using lightweight transactions, you may go for linearizable consistency, called SERIAL here. Whatever you choose, you may periodically compare and repair your data to fix all discrepancies.
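The arithmetic behind these choices is simple enough to sketch. This is illustrative Python, not any driver API (the function names are made up): with replication factor N, a read waiting for R replicas and a write waiting for W replicas, a read is guaranteed to overlap the latest write when R + W > N.

```python
def quorum(replication_factor: int) -> int:
    """Replicas the QUORUM consistency level waits for: floor(N / 2) + 1."""
    return replication_factor // 2 + 1

def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read overlaps every write in at least one replica."""
    return read_replicas + write_replicas > replication_factor

N = 3
print(quorum(N))                                        # 2
print(is_strongly_consistent(quorum(N), quorum(N), N))  # QUORUM/QUORUM: True
print(is_strongly_consistent(1, 1, N))                  # ONE/ONE: False
```

QUORUM reads plus QUORUM writes always satisfy the inequality, which is why that combination is the usual middle ground between ONE/ONE (fast, hopeful) and ALL (slow, strict).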

And on the “seventh day” you rest and see it finally starts to do its job well. Then with increasing traffic, your CPUs burst into flames, or you get out-of-memory exceptions here and there.

For CPU overheat, you may be interested in limiting the cores involved in writing data to disk and compacting (repacking and reorganizing) it, tuning compression, and, last but not least, actually checking what your naughty users push in and later request. Be allergic to asterisks: if you see one in a SELECT query, it always means problems, sooner or later. Usually sooner.
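The knobs above live in cassandra.yaml; a hedged fragment (option names as in recent versions, and the values are placeholders to illustrate, not recommendations — check your version’s defaults):

```yaml
# Cap how many cores compaction may occupy at once.
concurrent_compactors: 2
# Throttle compaction I/O in MB/s; 0 disables throttling entirely.
compaction_throughput_mb_per_sec: 16
# Threads flushing memtables to disk.
memtable_flush_writers: 2
```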

For memory problems, you have squadrons of options at your disposal. Start with understanding the defaults, which is not that obvious. Hidden in the startup script (cassandra-env.sh) is, in essence, this equation:

MAX_HEAP_SIZE = max(min(1/2 × RAM, 1024 MB), min(1/4 × RAM, 8192 MB))

It basically means “calculate 1/2 of RAM capped at 1024 MB, calculate 1/4 of RAM capped at 8192 MB, then pick the larger of the two”. You may override it, as well as off-heap memory limits, memtable, buffer pool and internode buffer sizes, flush thresholds, and so on. When digging deeper into them, you may sometimes feel like a babe in the woods: the deeper you go, the darker it gets. Get a torch (the documentation is helpful in most cases) and don’t hesitate to experiment on your own. To be able to actually compare the results, be sure to collect all the possible metrics from Cassandra, mainly read/write latency and the current number of SSTables (just to name a few).
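The default heap rule described above can be sketched in a few lines of Python (a sketch of the rule as stated, not the actual script; the real calculation lives in cassandra-env.sh and may differ between versions):

```python
def default_max_heap_mb(system_ram_mb: int) -> int:
    """max(min(RAM/2, 1024 MB), min(RAM/4, 8192 MB))."""
    half_capped = min(system_ram_mb // 2, 1024)      # 1/2 RAM, capped at 1 GB
    quarter_capped = min(system_ram_mb // 4, 8192)   # 1/4 RAM, capped at 8 GB
    return max(half_capped, quarter_capped)

print(default_max_heap_mb(1024))    # tiny box: 512 MB
print(default_max_heap_mb(8192))    # 8 GB box: 2048 MB
print(default_max_heap_mb(65536))   # 64 GB box: capped at 8192 MB
```

Note the consequence: no matter how much RAM you add, the default heap never grows past 8 GB, which is exactly why overriding it is one of the first tuning steps.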

The golden rule is to keep calm and tune the options one by one. Remember, it is completely normal to shift the load from one element of the hardware or infrastructure to another and never find the sweet spot. If they told you there is one, they lied. It is not a spot; it is a blurry nebula of spots, moving faster than you can follow, at first. But with a good understanding of what’s going on, after eliminating single-machine bottlenecks, you may finally go and ask your manager, along with the finance department, to buy you more machines. And that may be the ultimate challenge for you. After all, there is a boss at the end of each level.
