Scaling datapoints in Cassandra collections

Jan Antala
Pygmalios Engineering
3 min read · Oct 6, 2019

The browsing module is a key part of the Pygmalios analytics product. We mount special tracking devices on shopping carts and baskets, or use person-tracking technology based on video cameras, and get coordinates and timestamps for customers in the store. We then store each shopping session in our database and use the data for advanced analytics.

Tracking device mounted on the shopping cart

This is an example of a shopping session for a single customer. We can reconstruct the whole shopping session from start to end, including all coordinates and timestamps.

Shopping Session

Cassandra provides collection types as a way to group and store data together in a column, so we created a Cassandra table with collections of coordinates and timestamps to store customer sessions. All data is partitioned by store, floor, and day of the year. But guess what can go wrong?
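As a rough sketch (the table and column names here are illustrative, not our exact production schema), the design looked something like this: one row per session, with the entire series of datapoints held in collection columns.

```
-- Illustrative sketch of the original design: one row per shopping
-- session, with the whole series of datapoints in collection columns.
CREATE TABLE sessions_by_day (
    store_id    text,
    floor_id    text,
    year_day    int,             -- day of the year (part of the partition key)
    session_id  uuid,
    xs          list<double>,    -- x coordinates, one entry per datapoint
    ys          list<double>,    -- y coordinates, same order as xs
    ts          list<timestamp>, -- timestamps, same order as coordinates
    PRIMARY KEY ((store_id, floor_id, year_day), session_id)
);
```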

Cluster failures

You may have guessed it: cluster failures. The whole Cassandra process on a node goes down some time after we start large batches to recompute data for new features. This is bad, and we have to investigate what is wrong with our Cassandra cluster under heavy load.

Cassandra instance metrics

Our table design was wrong, or at least not good. A collection is appropriate only if the amount of data it stores is bounded. If the data has unbounded growth potential, like our coordinates list, do not use collections. Instead, use a table with a compound primary key where the data is stored in the clustering columns.

Keep collections small to prevent delays during querying. Collections cannot be “sliced”: Cassandra reads a collection in its entirety, which impacts performance. Collections should therefore stay much smaller than the maximum limits.

Introducing sequences

We had to split our collections into much smaller lists with a fixed, small maximum number of items, and add a sequence number to every record so that we can still reconstruct the session. Our new schema looks like this:
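Again as an illustrative sketch (names are made up): each session is split into fixed-size chunks, and the chunk’s sequence number becomes a clustering column.

```
-- Illustrative sketch: each session is split into small fixed-size
-- chunks; seq orders the chunks so the session can be reconstructed.
CREATE TABLE sessions_by_day (
    store_id    text,
    floor_id    text,
    year_day    int,
    session_id  uuid,
    seq         int,             -- sequence number of the chunk
    xs          list<double>,
    ys          list<double>,
    ts          list<timestamp>,
    PRIMARY KEY ((store_id, floor_id, year_day), session_id, seq)
);
```

Reading a session back is a single-partition query, and unlike a collection, the clustering column can be sliced:

```
-- Rows come back ordered by seq; a seq range reads only part of a session.
SELECT seq, xs, ys, ts
  FROM sessions_by_day
 WHERE store_id = 's1' AND floor_id = 'f1' AND year_day = 279
   AND session_id = 00000000-0000-0000-0000-000000000001
   AND seq < 10;
```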

We were surprised how smoothly the batches work now. There are no cluster failures. But we can still do better if we don’t need updates for individual collection items.

Freezing collection types

Frozen

Cassandra has a special frozen keyword. A frozen value serializes multiple components into a single value, and Apache Cassandra™ treats the value of a frozen type as a blob. Non-frozen types allow updates to individual fields; a frozen collection can only be replaced as a whole, so you cannot, for example, add or remove elements in it. The entire value must be overwritten. As we don’t need collection updates, this could be a good fit for us.
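Using the hypothetical table from above, the difference in update semantics looks like this:

```
-- With a non-frozen list<timestamp>, single elements can be appended:
UPDATE sessions_by_day
   SET ts = ts + ['2019-10-06 10:00:05+0000']
 WHERE store_id = 's1' AND floor_id = 'f1' AND year_day = 279
   AND session_id = 00000000-0000-0000-0000-000000000001
   AND seq = 0;

-- With frozen<list<timestamp>>, the append above is rejected;
-- the whole list has to be written in one piece instead:
UPDATE sessions_by_day
   SET ts = ['2019-10-06 10:00:00+0000', '2019-10-06 10:00:05+0000']
 WHERE store_id = 's1' AND floor_id = 'f1' AND year_day = 279
   AND session_id = 00000000-0000-0000-0000-000000000001
   AND seq = 0;
```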

The biggest benefit is that there are no tombstones, which gives us better read/write performance and lets us scale to more clients. This is because a frozen collection is stored as a single Cassandra cell, so no tombstones are necessary for inserts.

Our final table schema looks like this:
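As before, this is a sketch with illustrative names: the same chunked layout, with the collections marked frozen.

```
-- Illustrative sketch of the final design: the chunked layout from
-- above, with frozen collections. Each list is written as a single
-- cell, so inserts produce no tombstones.
CREATE TABLE sessions_by_day (
    store_id    text,
    floor_id    text,
    year_day    int,
    session_id  uuid,
    seq         int,
    xs          frozen<list<double>>,
    ys          frozen<list<double>>,
    ts          frozen<list<timestamp>>,
    PRIMARY KEY ((store_id, floor_id, year_day), session_id, seq)
);
```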

I hope our failures help you when you design a new Cassandra table schema.
