What is Apache Kafka?

John Thuma
Jul 18, 2018 · 4 min read

Before I dig into Confluent KSQL, Apache Kafka, and Spark Streaming, let’s first take a look at what ‘streaming’ is and why it is so valuable. Streaming data is a continuous flow of lightweight messages, typically a few kilobytes each, generated by potentially many different sources such as ecommerce, telematics, trading floors, instrumentation, and much more. It has many uses, including event management and various types of analytics, and it should be processed sequentially and incrementally on a record-by-record basis. What’s the big deal? We can now exploit the value of data as it happens rather than have to wait and process batches of records over a period of time. Some data has time value just like money has time value. I once worked with a major European stock exchange that claimed that a single stock trade transaction loses 80% of its value 5 seconds after that trade occurs. What is the time value of data in your enterprise?
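To make "incrementally on a record-by-record basis" concrete, here is a minimal sketch in plain Python. It is not tied to Kafka or any streaming library; the `RunningMean` class and the sample prices are hypothetical, made up purely for illustration. The idea is that the answer is updated as each record arrives instead of waiting for the whole batch.

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Keeps an average up to date one record at a time (illustrative only)."""
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> float:
        # Incremental update: no need to store or re-read earlier records.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def batch_mean(values):
    # Batch style: wait until every record is available, then compute once.
    return sum(values) / len(values)


# Hypothetical trade prices arriving one after another.
prices = [101.0, 99.5, 100.5, 102.0]

stream = RunningMean()
running = [stream.update(p) for p in prices]  # an answer exists after every record
```

After the last record the streaming result matches what the batch computation would return, but the streaming version already had a usable value after the very first trade.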

First, let’s take a look at Apache Kafka. Apache Kafka is an open source stream processing platform written in Scala and Java. It provides a unified, low-latency, high-throughput platform for handling real-time data feeds. At its core is a massively scalable publish/subscribe message queue that acts as a distributed transaction log. Apache Kafka also provides Kafka Connect, an import/export system for linking to external systems, and Kafka Streams, a Java library for processing streaming data.
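The "message queue which acts as a log" idea can be sketched in a few lines of plain Python. This is a toy, in-memory model and not the Kafka client API; `TopicLog`, `Subscriber`, and the order names are hypothetical. It only assumes the general behavior of a log-based queue: publishers append, records get increasing offsets, and each subscriber tracks its own position.

```python
class TopicLog:
    """Toy append-only log: records are never modified, only appended."""

    def __init__(self):
        self._records = []

    def publish(self, message):
        # Append the message and return its offset (position in the log).
        self._records.append(message)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Reading does not remove anything; it just slices the log.
        return self._records[offset:offset + max_records]


class Subscriber:
    """Each subscriber remembers how far it has read, independently of others."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = TopicLog()
for event in ["order-1", "order-2", "order-3"]:
    log.publish(event)

analytics = Subscriber(log)
billing = Subscriber(log)

first = analytics.poll(max_records=2)  # analytics reads part of the log
rest = analytics.poll()                # then picks up where it left off
everything = billing.poll()            # billing still sees every record
```

Because reading never deletes records, many independent applications can consume the same feed at their own pace, which is what separates this model from a traditional queue where a message is gone once it is consumed.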

How did Apache Kafka get its name? Apache Kafka is a system optimized for writing and capturing data, so its inventors at LinkedIn (Jun Rao, Jay Kreps, and Neha Narkhede) thought that naming it after a writer made sense. A better description of Kafka would be: a system which provides a unified, high-throughput, low-latency platform for handling real-time data feeds.

GREAT QUOTE: Franz Kafka: “By believing passionately in something that still does not exist, we create it. The nonexistent is whatever we have not sufficiently desired.”

This leads us to the next part of the discussion, Spark Streaming. Apache Kafka is a message broker with superb performance, and it can redistribute data to other applications such as Spark Streaming. Spark Streaming is a complementary application to Apache Kafka and will be the topic of our next section.

Spark Streaming is an extension of the Apache Spark core API. It provides high-throughput, fault-tolerant processing of live streaming data. Data can be ingested from Apache Kafka, Flume, TCP sockets, Kinesis, and others. Data can be processed and exported to databases, filesystems (HDFS), and dashboards. You can even apply Spark’s graph and machine learning algorithms on live streams. You can write these programs using Scala, Python, or Java. Some developers are troubled by its micro-batch processing model: records are collected into small batches before they are processed, so it is not truly real-time at the level of individual records. However you define real-time, Spark Streaming might be good enough to meet your expectations and business needs. It does require very specific technical knowledge and is bound by the limitations of Apache Spark. Some Apache Spark limits: problems with small files, decompression, and partitioning; back pressure handling (I/O buffer cache requires manual cleanup); and no file management system.
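To show what micro-batching means in practice, here is a small sketch in plain Python. It does not use the Spark API; the `micro_batches` function, the one-second interval, and the sample events are assumptions made for illustration. It groups timestamped records into fixed intervals, and each group is handed over for processing as one unit.

```python
from collections import defaultdict


def micro_batches(records, interval_seconds):
    """Group (timestamp_seconds, value) records into fixed-interval batches.

    Returns a list of (batch_start, values) pairs ordered by batch start.
    """
    batches = defaultdict(list)
    for ts, value in records:
        # Every record falls into the interval that contains its timestamp.
        batch_start = int(ts // interval_seconds) * interval_seconds
        batches[batch_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.7, "d")]

batches = micro_batches(events, interval_seconds=1)
```

In this model record "a" arrives at 0.2 seconds but is not processed until its batch closes at the one-second mark. That waiting time, bounded by the batch interval, is the latency developers have in mind when they say micro-batching is not truly real-time.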

Finally, let’s discuss Confluent KSQL. KSQL is a streaming data processing engine that makes it easy to read, write, and modify data in Apache Kafka streams using a SQL-like language. With KSQL you can easily join and aggregate streams of data. SQL is simple to learn and is arguably the most widely used programming language today. Like Spark Streaming, KSQL consumes data feeds from Apache Kafka as an application. Use cases include streaming extract/transform/load (ETL), anomaly detection, and event monitoring.
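To give a feel for the kind of aggregation a streaming SQL engine performs, here is a rough sketch in plain Python. It is not KSQL and does not show how KSQL is implemented; the `tumbling_counts` function, the `clicks` data, and the 60-second window are hypothetical. The comment shows approximately what a comparable windowed query might look like, with an assumed stream name.

```python
from collections import Counter

# Roughly the kind of query this imitates (stream and column names are assumptions):
#   SELECT user_id, COUNT(*) FROM clicks
#   WINDOW TUMBLING (SIZE 60 SECONDS) GROUP BY user_id;


def tumbling_counts(events, window_seconds):
    """Count (timestamp_seconds, key) events per key in fixed, non-overlapping windows."""
    counts = Counter()
    for ts, key in events:
        # A tumbling window is identified by the time at which it starts.
        window_start = int(ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical click events: (seconds since start, user).
clicks = [(3, "alice"), (15, "bob"), (42, "alice"), (61, "alice"), (75, "bob")]

per_minute = tumbling_counts(clicks, window_seconds=60)
```

The appeal of the SQL approach is that the loop, the window arithmetic, and the state kept between records are all handled by the engine, so the query is the only thing you write.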

What is Confluent? Confluent is a company founded by the team that built Apache Kafka. It offers a variety of tools that can help your organization build highly robust and scalable streaming applications.

For more details on how Arcadia Data can jumpstart you into real-time Apache Kafka analytics take a look at the following:
