Flink Checkpointing

M Haseeb Asif
Big Data Processing
2 min readMay 12, 2020

--

State management comes out of the box for Flink and it is considered as the first-class citizen. While Flink abstracts the traditional state complexities for application developers, it needs to do a lot more to provide stateful fault-tolerant applications. It needs to checkpoint the state frequently and restore it in case of failures.

Application checkpointing is a common technique in computer science to make applications fault-tolerant. In this approach, we make a copy of the application state, called snapshot, at a regular interval, and store it. When the application fails, we restart our application using the last saved snapshot of the application. It helps streaming applications to continue processing where they left off instead of starting all the calculations from the beginning.

Checkpointing in a distributed system is more complex because of the dynamic and unpredictable nature of the network. Distributed Snapshots, snapshots for a distributed system, at any point in time will contain the state of all the processes (vertices) and their network connections (edges).

Flink is a distributed stream processing engine, hence it uses a distributed snapshot algorithm for checkpointing. It does leverage a variant of the famous Chandy Lamport Algorithm. Different distributed nodes in the Flink cluster process the data independently from each other and the checkpoint will need to re-align the nodes before taking a snapshot.

Flink requires a replayable data source in addition to the state backend for the…

--

--

M Haseeb Asif
Big Data Processing

Technical writer, teacher and passionate data engineer. Love to talk, write and code with Apache Spark, Flink or anything related to data