Apache Kafka Guide #47 Big Data Processing

Paul Ravvich
Apache Kafka At the Gates of Mastery
May 9, 2024

Hi, this is Paul, and welcome to part #47 of my Apache Kafka guide. Today we will discuss how to process Big Data using Apache Kafka.

Big Data Processing

Kafka was originally built for Big Data ingestion. A common pattern is to use “generic” connectors to move data into Kafka and then offload it to storage systems such as HDFS, Amazon S3, or Elasticsearch. In this role Kafka serves two purposes at once. It acts as a “speed layer” for real-time applications, and it feeds a “slow layer” in which data is extracted in batches into stores like HDFS and S3 for analytics.
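To make the dual role concrete, here is a minimal sketch in plain Python. It does not use a Kafka client: TopicLog is a hypothetical in-memory stand-in for a single topic partition, and the group names "realtime-app" and "batch-export" are made up for illustration. It only shows the idea that two consumer groups read the same log independently, each at its own pace.

```python
from collections import defaultdict


class TopicLog:
    """Toy, in-memory stand-in for one topic partition (not a real Kafka API)."""

    def __init__(self):
        self.records = []                # append-only log
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def append(self, record):
        self.records.append(record)

    def poll(self, group, max_records):
        """Return up to max_records unread records for the group and advance its offset."""
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for i in range(6):
    log.append({"event_id": i})

# Speed layer: polls often and handles one record at a time.
realtime_seen = [log.poll("realtime-app", max_records=1) for _ in range(3)]

# Slow layer: polls rarely and drains everything it has not read yet.
batch_seen = log.poll("batch-export", max_records=1000)

print(dict(log.offsets))
```

Reading does not remove anything from the log, so the batch group still receives all six records even though the real-time group has already consumed three of them.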

Using Kafka as the gateway to Big Data systems is a common pattern in the industry. Kafka is also often placed as an “ingestion buffer” in front of other data stores when writes need to be buffered.
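The buffering idea can be illustrated with a small simulation. This is only a sketch under simple assumptions: time moves in discrete ticks, the downstream store accepts a fixed number of records per tick, and the function name simulate_buffer and the traffic numbers are invented for the example.

```python
def simulate_buffer(arrivals_per_tick, sink_capacity_per_tick):
    """Return the backlog (records buffered but not yet stored) after each tick."""
    backlog = 0
    history = []
    for arrived in arrivals_per_tick:
        backlog += arrived                              # producers write into the buffer
        drained = min(backlog, sink_capacity_per_tick)  # the store takes what it can handle
        backlog -= drained
        history.append(backlog)
    return history


# A burst of 50 records arrives in the second tick; the store handles 20 per tick.
bursty_traffic = [5, 50, 5, 0, 0, 0]
backlog = simulate_buffer(bursty_traffic, sink_capacity_per_tick=20)
print(backlog)
```

The backlog grows during the burst and then drains back to zero, so the store never sees more than it can handle and producers are not slowed down.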

The architectural model you would aim to implement works as follows.

First, you have the data producers within your company: any kind of data source that feeds into Kafka. Downstream of Kafka sits the speed layer, which can include your own Kafka consumers as well as big data frameworks such as Spark, Storm, and Flink. These enable real-time analytics, dashboards, and alerts, and they support other applications and consumers.
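As a rough illustration of what a speed-layer consumer does, here is a sketch of a sliding-window alert in plain Python. The ErrorRateAlert class, the event shape (timestamp, level), and the thresholds are all assumptions for the example; in practice the events would come from a Kafka consumer or a framework such as Flink rather than a list.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when more than `threshold` errors fall within the last `window_seconds`."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.error_times = deque()

    def on_event(self, timestamp, level):
        """Process one event as it arrives; return True if an alert should fire."""
        if level == "ERROR":
            self.error_times.append(timestamp)
        # Drop errors that have fallen out of the window.
        while self.error_times and self.error_times[0] <= timestamp - self.window_seconds:
            self.error_times.popleft()
        return len(self.error_times) > self.threshold


alert = ErrorRateAlert(window_seconds=10, threshold=2)
events = [(0, "INFO"), (1, "ERROR"), (2, "ERROR"), (3, "ERROR"), (20, "ERROR")]
fired = [alert.on_event(ts, level) for ts, level in events]
print(fired)
```

Each event is handled the moment it arrives, which is the defining property of the speed layer: the alert fires on the third error, not at the end of a nightly job.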

However, if you do not need real-time processing and prefer batch analysis, Kafka Connect or a plain Kafka consumer comes into play. Either one can move all your data from Kafka into storage and processing systems such as Hadoop, Amazon S3, relational databases, Elasticsearch, or others. This setup supports data science, reporting, and audits, or simply serves as backup and long-term storage.
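The batch path can be sketched in a similar way. The example below is not Kafka Connect; it is a hypothetical offload_in_batches helper that writes records to local JSON-lines files in fixed-size chunks, roughly what a sink does when it flushes to HDFS or S3. The file naming scheme and the flush_size parameter are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def offload_in_batches(records, out_dir, flush_size):
    """Write records to JSON-lines files holding at most flush_size records each."""
    out_dir = Path(out_dir)
    written = []
    for start in range(0, len(records), flush_size):
        chunk = records[start:start + flush_size]
        # Name each file after the position of its first record, so a rerun
        # overwrites the same files instead of creating duplicates.
        path = out_dir / f"events-{start:010d}.jsonl"
        path.write_text("".join(json.dumps(r) + "\n" for r in chunk))
        written.append(path)
    return written


records = [{"offset": i, "value": f"event-{i}"} for i in range(7)]
with tempfile.TemporaryDirectory() as tmp:
    files = offload_in_batches(records, tmp, flush_size=3)
    file_names = [p.name for p in files]
    line_counts = [len(p.read_text().splitlines()) for p in files]

print(file_names, line_counts)
```

Fewer, larger files are usually friendlier to batch engines than one file per event, which is why the sketch groups records before writing.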

These are common ways to use Kafka, and this architecture is well established and widely deployed.
