Why: We had a lot of very useful data in our Warehouse and wanted to take advantage of those data in some of our production service to enhance the user’s experience. So we choose to server them from Cassandra for all it’s pros which I’m am not going to get into in this blog.
First stage we went about writing a spark-cassandra exporter. It’s pretty simple and only a couple of line,
This works and took around ~ 30 mins to write ~150 Million rows. But once our services went live we saw the read latencies going a bit high during the bulk insertion time.
The spark-cassandra-connector that we are using here had a few configs that can be used to tune the writes here. Tried a bunch of tuning along the line of reducing concurrent and reducing throughput_mb_per_sec. They helped a bit but still there’s a clear increase in read latency.
Cassandra has sstableloader and we thought of testing it for this case. And so changed the code to use and saw that there’s barely any notable read latency during this task (only a slight increase in the 99 percentile, caused by the IO waits).
Also if you see the networks graph, the traffic is only on “network in” as now we are generating SSTables in spark and then pushing those tables directly to cassandra. The last spike in below network graph is from SSTable method and the rest are from batched writes.
Now let’s get into how to do that in code,
- Using CQLSSTableWriter build the SSTables per partition
- We need to define the create and insert statements, but it’s easy to build that from the spark dataframe
- And stream SSTable to Cassandra script. We pick a random Cassandra server and stream the SSTable to it. Host is chosen at random for a better load balancing of network traffic.
- And finally the code that run’s it all,
- As the no. of partitions Cassandra’s suggestion is several tens of megabytes large to minimize the cost of compacting, we use max of 256 MB per SSTable. “sizeInMB” can be calculated from HDFS.
- Let say the size is 60GB, we will have 256 SSTables of size 256MB each.
- Set this config “mapreduce.output.bulkoutputformat.streamthrottlembits” to throttle traffic to Cassandra.
- SSTables has to be at-least several tens of megabytes in size to minimize the cost of compacting the partitions on the server side.
- This methods increase IO wait since it’s writing directly to Disk and not memory like in Cassandra writes. Depending on the size of data and throughput, you need a SSD with high IOPS.
We’ve been using this method in production for over 6 months now, writing around ~ 300 million rows in < 30 mins without any issue to the read latencies.
Full example code can be found here, https://github.com/therako/sparkles/blob/master/src/main/scala/util/cassandra/SSTableExporter.scala