Real-time dashboards with Kafka and Spark Streaming

Tarık Yılmaz
Jan 27, 2017


Nowadays, almost every data-oriented developer or engineer, or whatever they call themselves, talks about real time, real time, and real time… Most of the time I work with batch processing tools such as Hadoop, Hive, and Spark.

But what if you need to build real-time dashboards?

Probably the most frequent answer is:

ta-da! Apache Kafka and Apache Spark Streaming

But why are we using Spark Streaming?

Apache Spark is one of the most popular distributed computing frameworks out there, and Spark Streaming is a built-in library of Apache Spark that provides real-time stream processing. Ultimately, though, Spark Streaming is a micro-batch-oriented stream processing engine. Real-time streaming, right? Micro-batches are real time, yes, as real time as we keep assuming they are, anyway. Spark Streaming reads streams from a source; more specifically, the Spark community calls these "receivers", and the received data is processed as a micro-batch job on each iteration. The batch interval must be set at some point in the development cycle. There are other alternatives such as Flink, Storm, Heron, and Samza.
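To make the micro-batch idea concrete, here is a minimal pure-Python sketch (no Spark involved, every name below is made up for illustration): events are buffered by fixed time windows, then a small "job" runs once per window, which is essentially what Spark Streaming does under the hood with its batch interval.

```python
from collections import defaultdict

def micro_batch(events, batch_interval):
    """Group (timestamp, value) events into fixed-size time windows,
    the way Spark Streaming buffers received data and then processes
    it as one small batch job per interval."""
    batches = defaultdict(list)
    for ts, value in events:
        # Each event falls into the window that contains its timestamp.
        batches[ts // batch_interval].append(value)
    # Run a tiny "job" (here: a count) per window, in window order.
    return [(window * batch_interval, len(values))
            for window, values in sorted(batches.items())]

# Events arriving at seconds 0..5, processed with a 2-second batch interval.
events = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (5, "e")]
print(micro_batch(events, batch_interval=2))
# → [(0, 2), (2, 2), (4, 1)]
```

Nothing here is "real time" in the strict sense: an event arriving at second 0 is not processed until its window closes, which is exactly the latency trade-off of the micro-batch model.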

Okay, but why do we actually need Kafka if Spark Streaming already handles the streaming?

Using a direct stream through a TCP socket may be pointless because there is no parallelism in it, while Spark is built for parallel processing, and that is a very good reason to use Kafka. Kafka enables parallel streaming through a feature called a "partition", which maps nicely onto Spark's own partitions. Another reason to use Kafka is that your traffic may be very high, or very bursty. In that case Kafka can help you handle the traffic without losing data. Sending the traffic directly to Spark Streaming or another stream processing library could be cruel, because Spark Streaming has no way to absorb all the incoming traffic through a single pipe.
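A rough sketch of how keyed partitioning gives you that parallelism (pure Python, all names hypothetical): the producer hashes each record's key to pick a partition, so records with the same key stay ordered within one partition, while different partitions can be consumed in parallel, e.g. one Spark task each. Kafka's default partitioner actually uses Murmur2; CRC32 below is just a deterministic stand-in for illustration.

```python
import zlib

def partition_for(key, num_partitions):
    """Kafka-style keyed partitioning: the same key always lands on
    the same partition (per-key ordering preserved), while distinct
    keys spread across partitions for parallel consumption.
    CRC32 stands in for Kafka's real Murmur2-based partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Fan records out over 3 partitions, as a producer would.
partitions = {p: [] for p in range(3)}
for key, value in [("user-1", 10), ("user-2", 20), ("user-1", 30)]:
    partitions[partition_for(key, 3)].append((key, value))

# Both "user-1" records end up in the same partition, in arrival order;
# each partition can then be read by a separate consumer / Spark task.
```

On the Spark side, the direct Kafka stream exposes roughly one Spark partition per Kafka partition, which is why the two "partition" concepts fit together so well.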

Making dashboards with Spark, Kafka, and Cassandra is very popular these days. I stored the data in MySQL; Cassandra would be a better option for time series data because of its design, but I chose MySQL because it is easy to install and better for the simplicity of the case. I mean, using MySQL helps you focus on the main idea rather than wrestling with installation and deployment.

You can see a simple visualization of the architecture flow below.

Here you can find an example implementation of a real-time dashboard application on my GitHub, to give you some idea about it.

I tried to make it easy to install and deploy; also, as you can see, it has a very simple code base.

Yes, this is just our tiny opinion on it…
