A Deep Dive into Apache Flink’s DataStream API

Concepts, Connectors, Transformations, and Industry-Standard Use Cases

Parin Patel
5 min read · Aug 9, 2024

Introduction

Apache Flink is one of the most powerful frameworks for stream processing and batch processing of large-scale data. In this blog, we’ll focus on the DataStream API, which is specifically designed for handling unbounded data streams. The DataStream API is a core component of Flink and is used to process real-time data streams in a scalable, fault-tolerant, and distributed manner.

We’ll cover everything from creating DataStreams, available connectors and sinks, transformations and operations on streams, to industry-standard examples, along with their use cases. By the end of this blog, you’ll have a comprehensive understanding of the DataStream API and how to leverage it for building robust, distributed stream processing applications.

Overview of Flink APIs

Apache Flink provides several APIs, each tailored to different types of processing tasks:

  1. DataStream API: For processing unbounded data streams (e.g., real-time event streams).
  2. DataSet API: For processing bounded datasets (batch processing). Note that the DataSet API has been deprecated in recent Flink releases in favor of running batch workloads on the DataStream API.
  3. Table API & SQL: A unified API for batch and stream processing that offers relational operations and standard SQL (a brief interop sketch follows this list).
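
As a quick, minimal sketch of the Table API & SQL item (not the focus of this post), here is one way a DataStream can be queried with SQL. This assumes the flink-table-api-java-bridge dependency is on the classpath; the column name f0 is Flink's default for an atomic type such as String.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// Turn a DataStream into a Table
DataStream<String> words = env.fromElements("flink", "datastream", "api");
Table table = tableEnv.fromDataStream(words);

// Run SQL over it and convert the result back to a DataStream
Table result = tableEnv.sqlQuery("SELECT UPPER(f0) AS word FROM " + table);
tableEnv.toDataStream(result).print();

env.execute();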

In this blog, our primary focus will be on the DataStream API, as it is the backbone for real-time data processing in Flink.

Creating DataStreams

A DataStream in Flink represents a stream of data that is continuously generated from a source, such as a message queue or a sensor network. DataStreams are processed in a distributed fashion across the Flink cluster, allowing the system to handle large-scale, high-throughput data streams efficiently. Common ways to create a DataStream include:

  • SocketStream: Reading from socket connections.
  • FileStream: Reading from files.
  • Kafka: Reading from Kafka topics.
  • Custom Source Functions: Defining your own data sources (a minimal sketch follows the example below).

Example:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Creating a DataStream from a socket
DataStream<String> socketStream = env.socketTextStream("localhost", 9999);

// Creating a DataStream from a file
DataStream<String> fileStream = env.readTextFile("path/to/file.txt");

// Creating a DataStream from a Kafka topic
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<>("topic-name", new SimpleStringSchema(), properties);
DataStream<String> kafkaStream = env.addSource(kafkaSource);
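
For the custom-source bullet above, here is a minimal, illustrative sketch using the SourceFunction interface (the simpler, legacy way to write a source; newer Flink releases point towards the unified Source API for production connectors):

// A toy source that emits an incrementing counter once per second
public class CounterSource implements SourceFunction<Long> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        while (running) {
            // Emit under the checkpoint lock so checkpoints see a consistent state
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(counter++);
            }
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

// Plugging the custom source into the environment
DataStream<Long> counterStream = env.addSource(new CounterSource());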

Connectors Available for DataStream API

Flink offers a wide range of connectors to integrate with various data sources and sinks. Some of the commonly used connectors include:

Sources:

  1. Apache Kafka: For reading from Kafka topics.
  2. Kinesis: For reading from AWS Kinesis streams.
  3. RabbitMQ: For reading from RabbitMQ queues.
  4. Files: For reading from various file formats (CSV, JSON, Avro).
  5. Socket: For reading from socket streams.
  6. JDBC: For reading from relational databases.
  7. Cassandra: For reading data from Apache Cassandra tables (a NoSQL database).

Sinks:

  1. Apache Kafka: For writing to Kafka topics.
  2. Kinesis: For writing to AWS Kinesis streams.
  3. Elasticsearch: For writing to Elasticsearch indexes.
  4. Files: For writing to various file formats (CSV, JSON, Avro).
  5. JDBC: For writing to relational databases.
  6. HDFS: For writing to Hadoop Distributed File System.
  7. Cassandra: For writing to Apache Cassandra tables.

Example of Using a Kafka Source and Sink:

// Kafka Source
FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), properties);
DataStream<String> kafkaStream = env.addSource(kafkaSource);

// Kafka Sink
FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), properties);
kafkaStream.addSink(kafkaSink);
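
FlinkKafkaConsumer and FlinkKafkaProducer are the legacy connector classes; recent Flink releases recommend the KafkaSource and KafkaSink builders instead. A roughly equivalent sketch, assuming the flink-connector-kafka dependency and the same local broker:

// Kafka source via the builder API
KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setTopics("input-topic")
    .setGroupId("test")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();
DataStream<String> stream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

// Kafka sink via the builder API
KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("output-topic")
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
    .build();
stream.sinkTo(sink);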

Transformations and Operations on Streams

Transformations in Flink allow you to modify or filter the data in your streams. These transformations are executed in a distributed manner across the cluster, ensuring scalability and fault tolerance. Some common transformations include:

  1. Map: Applies a function to each element in the stream.
  2. FlatMap: Similar to Map but can return zero, one, or more elements.
  3. Filter: Filters elements based on a predicate.
  4. KeyBy: Groups the stream by a key, enabling distributed state management.
  5. Reduce: Combines elements in the stream using a reduce function.
  6. Window: Groups elements in the stream into finite windows (for example, by time or count) so they can be aggregated together.
  7. Join: Joins two streams based on a common key.
  8. Union: Combines two or more streams into a single stream.

Example of Common Transformations:

// Mapping each element to its length
DataStream<Integer> lengths = kafkaStream.map(String::length);

// Filtering elements that are longer than 10 characters
DataStream<String> filteredStream = kafkaStream
    .filter(value -> value.length() > 10);

// Keying by a specific field and then reducing
// .returns(...) is needed because Java lambdas returning Tuple2 lose their generic
// type information to erasure, so Flink cannot infer the output type on its own
DataStream<Tuple2<String, Integer>> keyedStream = kafkaStream
    .map(value -> new Tuple2<>(value, 1))
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .keyBy(value -> value.f0)
    .reduce((value1, value2) -> new Tuple2<>(value1.f0, value1.f1 + value2.f1));

In this example, the keyBy transformation enables distributed grouping of the stream, with each group being processed independently by different parallel instances across the cluster.
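
To round out the list above, here are minimal sketches of flatMap, union, and a windowed aggregation. They reuse the kafkaStream, socketStream, and fileStream variables from the earlier examples and are illustrative only:

// FlatMap: split each line into words (zero or more outputs per input element)
DataStream<String> words = kafkaStream
    .flatMap((String line, Collector<String> out) -> {
        for (String word : line.split("\\s+")) {
            out.collect(word);
        }
    })
    .returns(Types.STRING);

// Union: merge two streams of the same type into one
DataStream<String> merged = socketStream.union(fileStream);

// Window: count words per key over 10-second tumbling processing-time windows
DataStream<Tuple2<String, Integer>> windowedCounts = words
    .map(word -> Tuple2.of(word, 1))
    .returns(Types.TUPLE(Types.STRING, Types.INT))
    .keyBy(t -> t.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .sum(1);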

Industry-Standard Example: Real-Time Fraud Detection

One of the most common use cases for the DataStream API is in fraud detection systems. These systems monitor transactions in real time and flag any suspicious activity.

Scenario: Suppose we have a stream of transactions, and we want to detect any transaction that exceeds a certain threshold within a short period.

Example Implementation:

DataStream<Transaction> transactions = // source of transactions;

// Flagging large transactions
DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getAccountId)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .process(new FraudDetectionFunction());

// FraudDetectionFunction implementation
// (a ProcessWindowFunction is attached with process(), not apply())
public class FraudDetectionFunction
        extends ProcessWindowFunction<Transaction, Alert, String, TimeWindow> {

    @Override
    public void process(String accountId, Context context,
                        Iterable<Transaction> transactions, Collector<Alert> out) {
        for (Transaction transaction : transactions) {
            if (transaction.getAmount() > 10000) {
                out.collect(new Alert(accountId, transaction.getAmount(), "Potential fraud detected"));
            }
        }
    }
}

In this example, the FraudDetectionFunction processes each account's transactions within a one-minute window; because the stream is keyed by account ID, different accounts can be handled by different parallel instances. This distributed processing ensures that even large volumes of transactions can be analyzed in real time.

Use Cases of DataStream API

The DataStream API is highly versatile and can be used in various industries for different use cases:

  1. Financial Services: Real-time fraud detection, trade monitoring, risk management.
  2. Telecommunications: Network monitoring, call detail record analysis, customer experience management.
  3. E-commerce: Real-time recommendation engines, inventory management, order processing.
  4. IoT: Sensor data processing, predictive maintenance, smart city applications.
  5. Social Media: Real-time sentiment analysis, user engagement tracking, content personalization.

Conclusion

Apache Flink’s DataStream API is an incredibly powerful tool for real-time distributed stream processing. Whether you’re working on fraud detection, real-time analytics, or IoT applications, the DataStream API provides the necessary building blocks to create robust and scalable solutions. In this blog, we’ve covered the key concepts, transformations, available connectors and sinks, and an industry-standard example to help you get started with Flink’s DataStream API.

Remember, the key to mastering Flink lies in understanding its core concepts and experimenting with different use cases. Happy streaming!

Reference

https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/dev/datastream/overview/

https://www.infoq.com/news/2023/09/confluent-apache-flink-preview/
