Apache Flink Architecture Overview

M Haseeb Asif
Big Data Processing
2 min read · Feb 15, 2020

Apache Flink provides stateful stream processing with robust fault tolerance. It is a stream processor at heart but also supports batch processing. Furthermore, it supports iterative processing for machine learning and graph analysis.

Flink abstracts the details of state management and checkpointing for fault tolerance away from the user. It uses a variant of the Chandy-Lamport algorithm for distributed snapshots to capture the state of all operators without pausing execution.
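The core idea can be sketched in a few lines: a special barrier marker is injected into the record stream, and each operator snapshots its state at the moment the barrier passes through, while records keep flowing before and after it. This is a toy single-channel simulation, not Flink's actual implementation (Flink must also align barriers across multiple input channels):

```python
BARRIER = object()  # special marker injected into the stream alongside records

class Operator:
    """A stateful operator that counts the records it has seen."""
    def __init__(self, name):
        self.name = name
        self.count = 0       # operator state
        self.snapshots = []  # completed state snapshots

    def process(self, element):
        if element is BARRIER:
            # Snapshot state the moment the barrier arrives,
            # then forward the barrier downstream.
            self.snapshots.append(self.count)
            return element
        self.count += 1
        return element

def run_pipeline(operators, stream):
    """Push each stream element through every operator in order."""
    for element in stream:
        for op in operators:
            element = op.process(element)

ops = [Operator("source"), Operator("map"), Operator("sink")]
# Barrier injected mid-stream: records "a", "b" are part of the snapshot,
# "c", "d" belong to the next checkpoint interval.
run_pipeline(ops, ["a", "b", BARRIER, "c", "d"])
print([op.snapshots for op in ops])  # → [[2], [2], [2]]
```

Because the snapshot is taken as the barrier flows past, the operators never stop processing, which is the property the paragraph above describes.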

Apache Flink has a layered software stack, similar to the OSI model. Each layer abstracts the details of the layers below it, making it easier for the end user to program applications. The Flink runtime sits on top of the deployment layer, followed by the core APIs and then the specialized libraries (FlinkML, Gelly), as shown below.

Apache Flink Component Stack

The streaming dataflow runtime, or Flink runtime, is a distributed streaming dataflow engine. It receives the job graph from the layers above and executes it on a distributed set of nodes, handling all of the complexity with the help of the JobManager and TaskManager processes.
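The division of labor can be illustrated with a toy scheduler: a "job manager" splits a job's subtasks across the available "task managers". The names and round-robin policy here are illustrative assumptions, not Flink's actual scheduling API:

```python
def schedule(tasks, task_managers):
    """Toy job-manager logic: round-robin assignment of subtasks
    to task managers (real Flink assigns tasks to slots)."""
    assignment = {tm: [] for tm in task_managers}
    for i, task in enumerate(tasks):
        tm = task_managers[i % len(task_managers)]
        assignment[tm].append(task)
    return assignment

tasks = [f"subtask-{i}" for i in range(5)]
plan = schedule(tasks, ["tm-1", "tm-2"])
print(plan)
# → {'tm-1': ['subtask-0', 'subtask-2', 'subtask-4'], 'tm-2': ['subtask-1', 'subtask-3']}
```

The JobManager's real job is of course richer (slot management, checkpoint coordination, failure recovery), but the core pattern is the same: one coordinator distributing pieces of the dataflow graph to many workers.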

Flink has two major APIs: DataSet and DataStream. The DataSet API is used for batch processing, while the DataStream API is used for streaming applications. These APIs generate a logical job graph, which is then optimized by API-specific optimizers into a physical graph. The physical graph is what is actually executed on the cluster nodes.
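A minimal sketch of the logical-to-physical step: the logical graph is a chain of named operators, and the physical plan expands each operator into parallel subtasks. The operator names and the fixed parallelism of 2 are assumptions for illustration only:

```python
def to_physical(logical_graph, parallelism):
    """Expand each logical operator into `parallelism` parallel subtasks,
    mimicking how a logical job graph becomes a physical execution graph."""
    return {op: [f"{op}[{i}]" for i in range(parallelism)]
            for op in logical_graph}

# Hypothetical logical chain for a simple streaming job.
logical_graph = ["source", "map", "keyBy", "sink"]
physical = to_physical(logical_graph, parallelism=2)
print(physical["map"])  # → ['map[0]', 'map[1]']
```

Flink's real optimizer does more than fan out subtasks (for example, it chains compatible operators into a single task), but this captures the basic relationship between the two graphs.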

Flink offers a variety of deployment options as well. A cluster is the most obvious one, since Flink is a distributed system; it can run standalone or on YARN. It can also run on a local machine for development or prototyping. Additionally, cloud platforms such as GCE or EC2 support deploying Apache Flink.

