Apache Flink Getting Started - Part 1

M Haseeb Asif
Big Data Processing
4 min read · Apr 12, 2022

This is a three-part series on getting started with Apache Flink. The first part builds the foundation: why we need Apache Flink and where it fits in the data processing ecosystem. The latter two parts will be more hands-on, with examples of processing data using batch and stream processing.

With advancements in technology, data is being generated in enormous amounts, and businesses must process this data to understand their customers and products better. Big data has many definitions, such as data that cannot fit in the memory of the processing machine or data that cannot be processed with traditional data processing techniques. Big data is also described with multiple Vs: volume, value, variety, velocity, and veracity.

Some significant examples are stock trading data, movie data, daily Google searches, data from IoT-connected devices, and the number of tweets posted daily. These use cases generate massive amounts of data that cannot be processed using traditional approaches.

Stream and batch processing

Modern data processing frameworks rely on an infrastructure that scales horizontally using commodity hardware. There are two well-known parallel processing paradigms in this category: batch processing and stream processing. Batch processing refers to performing computations on a fixed amount of data. This means that we already know the boundaries of the data and can view all the data before processing it, e.g., all the sales that happened in a week.

Stream processing is for “infinite” or unbounded data sets processed in real-time. Everyday use cases for stream processing include monitoring user activity, processing gameplay logs, and detecting fraudulent transactions.
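
To make the distinction concrete, here is a minimal plain-Java sketch (not a Flink program; the sample sales figures and the readNextEvent() source are hypothetical) contrasting a bounded, batch-style computation with an unbounded, stream-style one.

import java.util.Arrays;
import java.util.List;

public class BatchVsStream {

    public static void main(String[] args) {
        // Batch: the data set is bounded, so we can look at all of it before computing.
        List<Double> weeklySales = Arrays.asList(120.0, 80.5, 230.0); // hypothetical sample data
        double weeklyTotal = weeklySales.stream().mapToDouble(Double::doubleValue).sum();
        System.out.println("Weekly sales total: " + weeklyTotal);

        // Stream: the data is unbounded, so the result is updated as each event arrives.
        double runningTotal = 0.0;
        while (true) {
            runningTotal += readNextEvent();
            System.out.println("Running total so far: " + runningTotal);
        }
    }

    // Placeholder standing in for a real streaming source such as Kafka or a socket.
    private static double readNextEvent() {
        return Math.random() * 100;
    }
}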

Distributed data processing evolution

Big data processing has evolved a lot in recent years. MapReduce represents the first generation of distributed data processing systems. This framework processes parallelizable data and computation on a distributed infrastructure that scales horizontally. MapReduce also abstracts all the system-level complexities of the distributed system from developers and provides fault tolerance, parallelization, and data distribution. It supports batch processing. Some second-generation frameworks of distributed processing systems offered improvements to the MapReduce model. For example, Tez provided interactive programming and batch processing.

Apache Spark is considered a third-generation data processing framework, and it supports both batch and stream processing. It offers a simpler programming model based on Directed Acyclic Graphs (DAGs). Spark leverages micro-batching for streaming, which provides near real-time processing: the unbounded stream of events is divided into small chunks (batches), and the computation is triggered on each chunk. Spark also improved performance over MapReduce by up to 100x because processing happens in memory instead of writing each intermediate step's results back to disk.
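
As a rough illustration of the micro-batching idea (plain Java, not Spark's actual API; the one-second interval and the readNextEvent() source are assumptions for the sketch), events are buffered for a short interval and then processed together as one small batch:

import java.util.ArrayList;
import java.util.List;

public class MicroBatchSketch {

    public static void main(String[] args) throws InterruptedException {
        final long batchIntervalMillis = 1000; // hypothetical one-second micro-batch interval
        List<Double> buffer = new ArrayList<>();
        long batchStart = System.currentTimeMillis();

        while (true) {
            buffer.add(readNextEvent()); // keep collecting events from the unbounded stream
            if (System.currentTimeMillis() - batchStart >= batchIntervalMillis) {
                // The buffered chunk is now a small, bounded batch; trigger the computation on it.
                double batchSum = buffer.stream().mapToDouble(Double::doubleValue).sum();
                System.out.println("Processed " + buffer.size() + " events, sum = " + batchSum);
                buffer.clear();
                batchStart = System.currentTimeMillis();
            }
        }
    }

    // Placeholder standing in for a real event source.
    private static double readNextEvent() throws InterruptedException {
        Thread.sleep(10);
        return Math.random();
    }
}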

Apache Flink

Apache Flink is a fourth-generation data processing framework and one of the top Apache projects. Although Apache Flink supports both batch and stream processing, it is natively designed for stream processing. "Flink" means quick or nimble in German, fitting for a project that came out of Technical University of Berlin, my master's university.

Apache Flink's notable features include a massively scalable distributed dataflow engine, support for multiple languages, and libraries for machine learning, graph processing, and iterative processing. It is also storage and cluster agnostic. Flink manages its own memory instead of relying on the Java virtual machine's memory management, and it delivers high throughput and low latency with exactly-once processing guarantees.

Macrometa has a nice comparison of Apache Flink vs. Apache Spark.

Flink has a layered architecture with a storage layer at the base and a separate processing layer on top. The beauty of Flink is that compute and storage are independent of each other. At runtime, the Flink job execution model consists of a job manager and task managers (a master/worker setup).

Flink supports multiple languages, but we will use Java for our examples, so make sure you have Java 8 and Maven installed on your machine. You can create a new Flink project from a template using the Maven archetype as follows.

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.14.4

It allows you to customize the project settings, interactively asking you for the groupId, artifactId, and package name.
Once the project is created, you can open it with your IDE. I use IntelliJ, but you can use any IDE you prefer. You will find a couple of templated files with basic examples.
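
The generated files are just skeletons, but a minimal streaming job is enough to verify the setup. The class below is my own small sketch rather than the exact template content; running its main method from the IDE starts a local mini-cluster.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HelloFlink {

    public static void main(String[] args) throws Exception {
        // Entry point for every DataStream program; resolves to a local mini-cluster
        // when run from the IDE and to the real cluster when the job is submitted to one.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A tiny bounded source, just to have something to process.
        env.fromElements("hello", "flink")
           .map(word -> word.toUpperCase())
           .print();

        // Nothing runs until execute() is called; it builds and submits the dataflow.
        env.execute("Hello Flink");
    }
}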

Finally, Flink offers three deployment modes: local, cluster, and cloud. Local mode is used for development on your own machine and is straightforward to get started with: the dependencies packaged with your code automatically spin up a mini-cluster for the application when you run the program. Cluster mode can also run on your local machine; you start a cluster and then submit your jobs to it. Cloud offerings have everything managed, and you simply submit your job to a cluster running in the cloud.

Next time, we will use the newly created project to build batch and stream processing examples.

