Processing Streaming Twitter Data using Kafka and Spark — The Plan
What is Apache Kafka?
Apache Kafka is a publish/subscribe messaging system. It is often described as a “distributed commit log” or, more recently, as a “distributed streaming platform.” Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue into a full-fledged streaming platform.
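To make the "commit log" idea a bit more concrete, here is a minimal in-memory sketch. It is not Kafka and uses no Kafka client; `ToyLog`, `publish`, and `poll` are hypothetical names made up for illustration. It only shows the core idea that records are appended in order and that each consumer tracks its own read position (offset) independently.

```python
from collections import defaultdict


class ToyLog:
    """Append-only log: a toy stand-in for a single topic partition (not real Kafka)."""

    def __init__(self):
        self._records = []                 # records are only ever appended
        self._offsets = defaultdict(int)   # next offset to read, per consumer name

    def publish(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer):
        """Return every record this consumer has not seen yet, then advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:]
        self._offsets[consumer] = len(self._records)
        return batch


log = ToyLog()
log.publish("tweet-1")
log.publish("tweet-2")
first = log.poll("analytics")    # analytics reads the two records published so far
log.publish("tweet-3")
second = log.poll("analytics")   # only the new record is returned
other = log.poll("notifier")     # a different consumer starts from the beginning
```

Because the log itself is never modified by reads, any number of consumers can process the same stream at their own pace, which is the property the rest of this series relies on.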
I recently read the book Kafka: The Definitive Guide by the creators of Kafka. It is a wonderful book for anyone who wants to start developing applications with Kafka, as well as for anyone who wants to understand the internals of a platform that is used by most of the Fortune 500 companies.
In this series, I’ll be exploring various aspects of Apache Kafka by implementing a cool data pipeline:
- We’ll start by setting up a Kafka cluster, either in the cloud or locally.
- After that, we’ll write a producer client that continuously fetches the latest tweets using the Twitter API and pushes them to Kafka.
- Then, we’ll implement an app using the Kafka Streams API that consumes the tweets from Kafka in real time and does some basic processing on them, such as counting the number of tweets per user and finding the most used words (i.e., word count).
- We’ll then venture into more advanced territory: writing our own Kafka connector that uses Twitter as a data source, and learning to use Apache NiFi to achieve the same with less effort.
- We’ll use Spark Streaming to do sentiment analysis on real-time Twitter data.
- Finally, if everything goes well, we’ll tweak our architecture and implement a notification service using Firebase and Kafka that sends a push notification to a user whose tweet has negative sentiment!
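To give a feel for the "basic processing" step in the plan above, here is a small sketch of the two aggregations (tweets per user and word count) on a handful of made-up tweets held in memory. It is plain Python rather than the Kafka Streams API, and the tweet dictionaries and the function names `tweets_per_user` and `word_counts` are assumptions for illustration only; the real tweet payload has many more fields.

```python
import re
from collections import Counter

# Made-up sample data; real tweets would arrive continuously from Kafka.
tweets = [
    {"user": "alice", "text": "Kafka is fun"},
    {"user": "bob", "text": "Spark and Kafka"},
    {"user": "alice", "text": "Streaming is fun"},
]


def tweets_per_user(tweets):
    """Count how many tweets each user has posted."""
    return Counter(t["user"] for t in tweets)


def word_counts(tweets):
    """Count lowercase word occurrences across all tweet texts."""
    counts = Counter()
    for t in tweets:
        # Crude tokenizer: runs of lowercase letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", t["text"].lower()))
    return counts


per_user = tweets_per_user(tweets)
words = word_counts(tweets)

print(per_user.most_common())
print(words.most_common(3))
```

In a streaming setting the same counts would be updated incrementally as each tweet arrives instead of being computed over a finished list, but the shape of the result (a running count per key) is the same.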