Saving Apache Kafka Offsets in Apache Cassandra (using PySpark Cassandra)
When processing big data streaming jobs with Apache Kafka and Apache Spark, there is great utility in preventing data loss as well as preventing re-processing of copious amounts of data. In a perfect world, data would be processed exactly once, and in the event of a failure the job would resume at exactly the last successfully processed Kafka offset.
There have been strategies that leverage Zookeeper that do just this, by saving the latest processed Kafka offsets after each successful batch is processed. We took a different approach since we already have a globally distributed Cassandra cluster that we could leverage to do the same. Having the offset data stored in a Cassandra table allows us to easily query and visualize how our Spark jobs are progressing.
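A minimal offsets table for this purpose might look like the following. This is a hypothetical sketch; the original post does not give the exact schema, so the keyspace, table, and column names here are assumptions.

```sql
-- Hypothetical schema: one row per (job, topic, partition),
-- storing the last successfully processed Kafka offset.
CREATE TABLE IF NOT EXISTS streaming.kafka_offsets (
    job_name   text,
    topic      text,
    partition  int,
    offset     bigint,
    updated_at timestamp,
    PRIMARY KEY ((job_name, topic), partition)
);
```

Keying on (job_name, topic) keeps all partitions for a job/topic pair in one Cassandra partition, which makes the "read all offsets at batch start" query a single-partition read.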
At the beginning of each batch we read the offsets for the particular Kafka topic and initialize the direct stream to start where it last left off. We use the PySpark Cassandra library in order to persist and read the offsets to and from Cassandra.
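The read-and-resume step can be sketched as follows. The table and column names are the hypothetical ones from the schema above, and the helper function is an illustration of the shape of the data, not the post's actual code; the `pyspark_cassandra` `cassandraTable` read and Spark's `KafkaUtils.createDirectStream(..., fromOffsets=...)` are the real APIs this approach relies on.

```python
# Sketch, assuming an offsets table with columns topic, partition, offset
# (names are hypothetical).

def rows_to_from_offsets(rows):
    """Turn offset rows (dicts) into the {(topic, partition): offset}
    mapping the direct stream should start from."""
    return {(row["topic"], row["partition"]): row["offset"] for row in rows}

# In the actual job (requires a running Spark/Kafka/Cassandra stack):
#
#   import pyspark_cassandra
#   from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
#
#   rows = sc.cassandraTable("streaming", "kafka_offsets") \
#            .where("job_name = ? AND topic = ?", job_name, topic) \
#            .collect()
#   from_offsets = {
#       TopicAndPartition(t, p): o
#       for (t, p), o in rows_to_from_offsets(rows).items()
#   }
#   stream = KafkaUtils.createDirectStream(
#       ssc, [topic], {"metadata.broker.list": brokers},
#       fromOffsets=from_offsets)
```

If the table is empty (first run), `fromOffsets` would be an empty dict and the stream falls back to the configured starting position.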
This allows us to do minimal re-processing of data (even in the event of a job or batch failure) and keeps us from losing data (e.g., by restarting the job at the newest Kafka offset instead of the one from yesterday that we last successfully processed before our job failed).
At the end of each successful streaming batch we update our Cassandra table with the most recent processed Kafka offset.
On the start of the next batch, we will read in this updated offset and continue processing our data pipeline.
Using a strategy like this is crucial to obtaining good efficiency and provides a best effort at exactly-once semantics when processing data with Kafka and Spark.