Flink Checkpointing

Published in Big Data Processing, May 12, 2020

State management comes out of the box in Flink, where state is treated as a first-class citizen. While Flink hides the traditional complexities of state handling from application developers, it has to do a lot more to make stateful applications fault-tolerant: it must checkpoint the state frequently and restore it after a failure.

Application checkpointing is a common technique in computer science for making applications fault-tolerant. In this approach, we copy the application state, called a snapshot, at regular intervals and store it. When the application fails, we restart it from the last saved snapshot. This lets streaming applications continue processing where they left off instead of redoing all the calculations from the beginning.
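To make the idea concrete, here is a minimal single-process sketch of snapshot and restore. It is not Flink code: the `RunningSum` class, the `run` helper, and the `checkpoint_every` and `crash_at` parameters are hypothetical names chosen for illustration, and the snapshot is kept in memory where a real system would write it to durable storage.

```python
import copy


class RunningSum:
    """Toy stateful application: keeps a running sum of the events it has seen."""

    def __init__(self):
        # "offset" is the index of the next event to read, "total" is the sum so far.
        self.state = {"offset": 0, "total": 0}

    def process(self, value):
        self.state["total"] += value
        self.state["offset"] += 1

    def snapshot(self):
        # Copy the state so later processing cannot modify the saved snapshot.
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)


def run(events, checkpoint_every, crash_at=None):
    """Process all events, taking a snapshot every `checkpoint_every` events.

    If `crash_at` is set, one failure is simulated just before the event at
    that index is processed; the application then restarts from the last snapshot.
    """
    app = RunningSum()
    last_snapshot = app.snapshot()
    crashed = False
    while app.state["offset"] < len(events):
        i = app.state["offset"]
        if crash_at is not None and not crashed and i == crash_at:
            crashed = True
            app = RunningSum()           # in-memory state is lost
            app.restore(last_snapshot)   # resume from the last saved snapshot
            continue
        app.process(events[i])
        if app.state["offset"] % checkpoint_every == 0:
            last_snapshot = app.snapshot()
    return app.state["total"]


events = [1, 2, 3, 4, 5, 6, 7]
result_no_failure = run(events, checkpoint_every=3)
result_with_failure = run(events, checkpoint_every=3, crash_at=5)
print(result_no_failure, result_with_failure)
```

In this sketch the restarted run only reprocesses the events after the snapshot's offset, and both runs end with the same total. It also relies on the input being readable again from a stored offset.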

Checkpointing in a distributed system is more complex because of the dynamic and unpredictable nature of the network. A distributed snapshot, that is, a snapshot of a distributed system, captures at a given point in time the state of all the processes (vertices) and of their network connections (edges).
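The following toy example suggests why the edges matter. Two hypothetical processes, `p1` and `p2`, each hold a balance, and a channel carries transfers that have been sent but not yet received. All names are made up for illustration, and this is not a snapshot algorithm, only a picture of what a consistent snapshot has to contain.

```python
from collections import deque

# Vertices: two processes, each holding a balance.
processes = {"p1": 100, "p2": 50}
# Edges: one channel from p1 to p2 holding transfers that are still in flight.
channels = {("p1", "p2"): deque()}


def send(src, dst, amount):
    """Debit the sender and put the transfer on the channel."""
    processes[src] -= amount
    channels[(src, dst)].append(amount)


def deliver(src, dst):
    """Take the oldest transfer off the channel and credit the receiver."""
    processes[dst] += channels[(src, dst)].popleft()


def snapshot_total(include_channels):
    """Total money visible in a snapshot, with or without channel contents."""
    total = sum(processes.values())
    if include_channels:
        total += sum(sum(ch) for ch in channels.values())
    return total


send("p1", "p2", 30)                      # the transfer is now in flight
vertices_only = snapshot_total(False)     # misses the in-flight transfer
with_edges = snapshot_total(True)         # accounts for it
deliver("p1", "p2")
after_delivery = snapshot_total(False)    # nothing left in flight
print(vertices_only, with_edges, after_delivery)
```

A snapshot of the process states alone, taken while the transfer is in flight, shows less money than the system actually holds. Including the channel contents gives the correct total.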

Flink is a distributed stream processing engine, so it uses a distributed snapshot algorithm for checkpointing: a variant of the well-known Chandy-Lamport algorithm. Nodes in the Flink cluster process data independently of each other, so a checkpoint needs to align them before a snapshot is taken.
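The alignment step can be sketched with a toy operator that reads from two input channels. Flink's documentation describes alignment in terms of checkpoint barriers that flow through the streams together with the records. The sketch below borrows that idea but is not Flink's implementation: the `AligningOperator` class and the `BARRIER` marker are hypothetical, and it assumes only one checkpoint is in progress at a time.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical marker separating pre- and post-checkpoint records


class AligningOperator:
    """Toy operator that sums records from several inputs and aligns barriers."""

    def __init__(self, num_inputs=2):
        self.total = 0
        self.num_inputs = num_inputs
        self.blocked = set()      # inputs whose barrier has already arrived
        self.buffered = deque()   # records held back from blocked inputs
        self.snapshots = []

    def on_element(self, channel, element):
        if element == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                self._complete_checkpoint()
        elif channel in self.blocked:
            # This record belongs after the checkpoint, so hold it back for now.
            self.buffered.append(element)
        else:
            self.total += element

    def _complete_checkpoint(self):
        # Every input has delivered its barrier: the state now reflects exactly
        # the records that arrived before the barrier on each input.
        self.snapshots.append(self.total)
        self.blocked.clear()
        while self.buffered:
            self.total += self.buffered.popleft()


op = AligningOperator()
arrivals = [
    (0, 1), (1, 10),
    (0, BARRIER),        # input 0 reaches the checkpoint first
    (0, 2),              # held back until input 1 catches up
    (1, 20),
    (1, BARRIER),        # both barriers seen: take the snapshot
    (1, 30), (0, 3),
]
for channel, element in arrivals:
    op.on_element(channel, element)
print(op.snapshots, op.total)
```

The snapshot contains only the records that preceded the barrier on each input (1, 10 and 20). The record 2, which arrived early on the faster input, is applied after the snapshot is taken.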

Flink requires a replayable data source in addition to the state backend for the…

Written by M Haseeb Asif