Apache Spark Checkpointing

What does it do? How is it different from caching?

Adrian Chang
4 min read · Mar 16, 2018
Stopping is ok!

Introduction

I never really understood the point of checkpointing or caching in Spark applications until I recently had to refactor a very large Spark application that runs around 10 times a day on a multi-terabyte dataset. Sure, there are tons of blog posts and StackOverflow questions on the subject, but I’ve always felt they cover the technical details of using either one and never give an easy-to-understand, intuitive reason for why and when to use them. The key to using either is understanding that Spark maintains a history of all the transformations you apply to a DataFrame or RDD. This means that, as you can see when you run explain on either one, it carries the full history of every transformation you have applied to it.
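To make that lineage idea concrete, here is a toy model in plain Python. This is not the real Spark API, just a sketch of the concept: transformations are recorded rather than executed, an `explain`-like method shows the full recorded history, and an action replays the whole chain from the source.

```python
# Toy model of Spark-style lineage tracking (illustration only, not PySpark).
class LazyDataset:
    def __init__(self, source, lineage=None):
        self.source = source            # base data
        self.lineage = lineage or []    # ordered list of recorded transformations

    def map(self, fn):
        # Like a Spark transformation: nothing runs yet; we only
        # append the function to the recorded lineage.
        return LazyDataset(self.source, self.lineage + [fn])

    def explain(self):
        # Analogue of df.explain(): show the full transformation history.
        return [fn.__name__ for fn in self.lineage]

    def collect(self):
        # An action: replay every recorded transformation from the source.
        data = list(self.source)
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data

def double(x): return x * 2
def add_one(x): return x + 1

ds = LazyDataset([1, 2, 3]).map(double).map(add_one)
print(ds.explain())   # the full history: ['double', 'add_one']
print(ds.collect())   # [3, 5, 7]
```

Real Spark does the same bookkeeping at the level of a query plan, which is why `explain` on a DataFrame prints the entire chain of operations that produced it.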

While this lineage is what makes Spark fault tolerant, it comes at a price: when a failure does occur, the entire chain of transformations for the affected DataFrame or RDD has to be recomputed…
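Extending the toy model above shows why checkpointing helps here. Again, this is a conceptual sketch, not the real API: a `checkpoint` materializes the current result and drops the history, so any later recomputation starts from the saved data instead of replaying every step from the original source.

```python
# Toy illustration (not the real Spark API): checkpointing materializes
# the data and truncates the lineage, so recovery no longer replays
# every transformation from the beginning.
class LazyDataset:
    def __init__(self, source, lineage=None):
        self.source = source
        self.lineage = lineage or []

    def map(self, fn):
        return LazyDataset(self.source, self.lineage + [fn])

    def collect(self):
        data = list(self.source)
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data

    def checkpoint(self):
        # Materialize the current result and drop the history: the
        # checkpointed data becomes the new "source" with no lineage.
        return LazyDataset(self.collect())

long_chain = LazyDataset(range(3))
for _ in range(100):
    long_chain = long_chain.map(lambda x: x + 1)

cp = long_chain.checkpoint()
print(len(long_chain.lineage))  # 100 steps to replay on any recompute
print(len(cp.lineage))          # 0 -- recovery starts from the saved data
print(cp.collect())             # [100, 101, 102]
```

This is also the intuitive difference from caching: a cached dataset keeps its lineage (so Spark can still rebuild it from scratch if the cache is lost), whereas a checkpointed one cuts the lineage off at the saved data.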
