Flink + Docker + Kafka

Daniel Santana
2 min read · Aug 7, 2023

Apache Flink is a powerful stream processing framework that enables real-time data processing. Docker provides an easy way to set up and experiment with Apache Flink locally. In this article, we'll guide you through running Apache Flink with Docker, demonstrate how to integrate Apache Kafka with Flink using a Dockerfile, and provide an example Flink script using Python for stream processing.

Setting Up Apache Flink with Docker

Step 1: Install Docker
If Docker is not installed on your system, you can follow the instructions in the [official documentation](https://docs.docker.com/get-docker/) to install it.
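
Once Docker is installed, you can confirm that both the CLI and the daemon are working:

docker --version
docker info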

Step 2: Run Apache Flink Container
Run the following command in your terminal to start an Apache Flink container:

docker run -d -p 8081:8081 flink:1.14.0 jobmanager

This will pull the Apache Flink image and start a JobManager container, with the Flink web dashboard accessible at `http://localhost:8081`. Note that the image's entrypoint requires a role argument (`jobmanager` or `taskmanager`), and a JobManager alone cannot execute jobs, so you'll also want at least one TaskManager, as shown below.
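
A minimal two-container session cluster looks something like this (the network and container names here are arbitrary choices for this sketch, not requirements):

docker network create flink-network

docker run -d --name jobmanager --network flink-network -p 8081:8081 \
  -e FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager" \
  flink:1.14.0 jobmanager

docker run -d --name taskmanager --network flink-network \
  -e FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager" \
  flink:1.14.0 taskmanager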

Dockerfile for Apache Kafka and Flink Integration

Step 1: Create Dockerfile
Create a `Dockerfile` in a directory of your choice with the following content:

FROM flink:1.14.0

# Put the Kafka connector on Flink's classpath. The flink-sql-connector-kafka
# jar bundles the Kafka client, so it works out of the box for Table API / SQL
# jobs, and placing it in /opt/flink/lib makes it available to every job.
RUN wget -P /opt/flink/lib https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.12/1.14.0/flink-sql-connector-kafka_2.12-1.14.0.jar

Step 2: Build and Run Docker Image
Navigate to the directory containing the `Dockerfile` and run the following commands:

docker build -t flink-kafka-integration .
docker run -d -p 8081:8081 flink-kafka-integration jobmanager

This builds the image and starts a JobManager container with the Kafka connector on the classpath; as before, add a TaskManager from the same image if you want to actually run jobs.
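
To double-check that the connector jar landed on Flink's classpath, you can list the container's `lib` directory (replace `<container_id>` with the ID or name from `docker ps`):

docker exec <container_id> ls /opt/flink/lib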

Example Flink Script using Python

Step 1: Install PyFlink
Install Apache Flink's Python API, pinning the version so it matches the 1.14.0 cluster from the previous steps:

pip install apache-flink==1.14.0
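
You can confirm the installed version matches your cluster:

pip show apache-flink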

Step 2: Python Flink Script
Here's an example script that reads JSON records from a Kafka topic, applies a simple transformation, and writes the results to another topic. Note that the `connect()`/descriptor API shown in many older tutorials was removed in Flink 1.14, so the source and sink tables are declared with SQL DDL instead:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
from pyflink.table.expressions import col

# Create the Flink streaming and table environments
env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Define a Kafka source table
t_env.execute_sql("""
    CREATE TABLE kafka_source (
        field1 STRING,
        field2 INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'input_topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Define a Kafka sink table for the results
t_env.execute_sql("""
    CREATE TABLE kafka_sink (
        field1 STRING,
        field2 INT
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'output_topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Define the processing logic: increment field2 by 1
result = t_env.from_path("kafka_source").select(
    col("field1"), (col("field2") + 1).alias("field2")
)

# Write the result to the Kafka sink. execute_insert() submits the job
# itself, so no separate env.execute() call is needed.
result.execute_insert("kafka_sink").wait()
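
To test the script locally (outside the Docker cluster), PyFlink also needs the Kafka connector jar on its own classpath. One way is to point `pipeline.jars` at the same `flink-sql-connector-kafka` jar used in the Dockerfile; the file path below is a placeholder you'd adjust to wherever you downloaded the jar, and this assumes a Kafka broker is reachable at `localhost:9092`:

# Add right after creating t_env, before the CREATE TABLE statements.
# The jar path is a placeholder; point it at your local copy of the jar.
t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///path/to/flink-sql-connector-kafka_2.12-1.14.0.jar",
)

With that in place, running the script with a plain `python` interpreter starts it on a local embedded cluster, which is handy for quick iteration before submitting it to the Dockerized cluster.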

Running Apache Flink in Docker is a quick way to experiment with stream processing without a full cluster setup. Adding the Kafka connector lets Flink consume and produce real-time data streams, and PyFlink makes it straightforward to write and deploy those jobs in Python. With the setup and example script above, you have a working starting point for stream processing and data analysis in your own projects.
