How to use Apache Kafka to transform a batch pipeline into a real-time one

If you need to cross the street, would you do it with information that is five minutes old?

The challenge we’ll solve

A review prompt on Udemy
Screenshot of the statistics for the Apache Kafka for Beginners course
Video to see the code running (for the lazy)

What is Apache Kafka?

A typical representation of a Pub/Sub system
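To make the Pub/Sub idea concrete, here is a toy, in-memory broker in plain Java (no Kafka involved): producers publish to a named topic without knowing who listens, and the broker fans each message out to every subscriber of that topic. This is only an illustration of the decoupling; Kafka adds persistence, partitioning, ordering and replayability on top of it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// A toy in-memory broker: producers publish to a named topic,
// and every subscriber of that topic receives each message.
class ToyBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    void publish(String topic, String message) {
        // The producer doesn't know who consumes: the broker fans the message out.
        for (Consumer<String> handler : subscribers.getOrDefault(topic, List.of())) {
            handler.accept(message);
        }
    }
}
```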

The reviews processing batch pipeline

Personal assumptions about the current pipeline. Look familiar?

The target architecture

Gentle reminder

The contract between two micro-services is the data itself

Target Architecture for our Real-Time Pipeline. Each color is a micro-service

1) Reviews Kafka Producer

A typical Kafka Producer
The queue has a fixed size of 100, so the producer blocks while it is full
Typical Kafka Producer properties
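For reference, a typical set of properties for an Avro-producing client looks like the following. This is a hedged sketch of a common Confluent setup, not the exact values from the repository; the `acks` and `retries` values in particular are illustrative defaults for a safe producer.

```properties
# Illustrative producer config for an Avro producer (values are assumptions)
bootstrap.servers=localhost:9092
key.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=http://localhost:8081
acks=all
retries=10
```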

Hey! (you may say). What’s your Review object?

An extract of the Review Avro Schema
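Based on the record shown in the console-consumer output further down (nullable `title` and `content`, a string `rating`, millisecond timestamps), the schema extract plausibly looks like this. Treat it as a reconstruction, not the exact schema from the repository:

```json
{
  "type": "record",
  "name": "Review",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "title", "type": ["null", "string"], "default": null},
    {"name": "content", "type": ["null", "string"], "default": null},
    {"name": "rating", "type": "string"},
    {"name": "created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "modified", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```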

Any other programming language can read the Avro bytes, and deserialize them.

Hey! (you may say). What’s the role of the schema registry then?

How the Schema Registry works.
Starting Kafka using the Confluent Distribution
$ kafka-topics --create --topic udemy-reviews --zookeeper localhost:2181 --partitions 3 --replication-factor 1
$ git clone https://github.com/simplesteph/medium-blog-kafka-udemy
$ mvn clean package
$ export COURSE_ID=1075642 # Kafka for Beginners Course
$ java -jar udemy-reviews-producer/target/uber-udemy-reviews-producer-1.0-SNAPSHOT.jar
[2017-10-19 22:59:59,535] INFO Sending review 7: {"id": 5952458, "title": "Fabulous on content and concepts", "content": "Fabulous on content and concepts", "rating": "5.0", "created": 1489516276000, "modified": 1489516276000, "user": {"id": 2548770, "title": "Punit G", "name": "Punit", "display_name": "Punit G"}, "course": {"id": 1075642, "title": "Apache Kafka Series - Learn Apache Kafka for Beginners", "url": "/apache-kafka-series-kafka-from-beginner-to-intermediate/"}} (ReviewsAvroProducerThread)
$ kafka-avro-console-consumer --topic udemy-reviews --bootstrap-server localhost:9092 --from-beginning
{"id":5952458,"title":{"string":"Fabulous on content and concepts"},"content":{"string":"Fabulous on content and concepts"},"rating":"5.0","created":1489516276000,"modified":1489516276000,"user":{"id":2548770,"title":"Punit G","name":"Punit","display_name":"Punit G"},"course":{"id":1075642,"title":"Apache Kafka Series - Learn Apache Kafka for Beginners","url":"/apache-kafka-series-kafka-from-beginner-to-intermediate/"}}

2) Fraud Detector Kafka Streams

Our Fraud Detection Micro-Service
Pretty much every Kafka Stream app
A very simple Kafka Streams topology
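To show the routing logic without standing up a broker, here is a plain-Java sketch of what the topology does: each review passes a fraud predicate and is sent to either the valid or the fraud branch. The predicate below (flagging a hard-coded user id) is purely hypothetical; the real service would use an actual fraud heuristic, such as review rate per user.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java sketch of the branching logic the Streams topology performs:
// each review is routed to "valid" or "fraud" based on a predicate.
class FraudDetector {
    record Review(long id, long userId, String content) {}

    // Hypothetical rule: in the real app this could be review rate per user,
    // duplicate content detection, etc.
    static boolean isFraud(Review review) {
        return review.userId() == 123456L;
    }

    // Split a batch of reviews the way the topology splits the stream
    // into udemy-reviews-valid and udemy-reviews-fraud.
    static Map<Boolean, List<Review>> branch(List<Review> reviews) {
        return reviews.stream().collect(Collectors.partitioningBy(FraudDetector::isFraud));
    }
}
```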
$ kafka-topics --create --topic udemy-reviews-valid --partitions 3 --replication-factor 1 --zookeeper localhost:2181
$ kafka-topics --create --topic udemy-reviews-fraud --partitions 3 --replication-factor 1 --zookeeper localhost:2181
(from the root directory)
$ mvn clean package
$ java -jar udemy-reviews-fraud/target/uber-udemy-reviews-fraud-1.0-SNAPSHOT.jar
It’s about to get real. I’m making sure I have your attention for the rest of the blog

3) Reviews Aggregator Kafka Streams

Architecture for our stateful Kafka Streams Application
Looks easy, but there’s a catch!
A simple Kafka Streams aggregation. Note the .groupByKey() call
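The state the aggregation maintains is easy to reason about once stripped of the Streams API: one entry per key (the course id, which is why `.groupByKey()` matters), each holding a running count and sum of ratings. Here is a minimal plain-Java sketch of that state machine; class and field names are mine, not the repository's.

```java
import java.util.HashMap;
import java.util.Map;

// Running statistics for one course: count of reviews and average rating.
class CourseStats {
    long countReviews = 0;
    double sumRatings = 0.0;

    void add(double rating) {
        countReviews++;
        sumRatings += rating;
    }

    double averageRating() {
        return countReviews == 0 ? 0.0 : sumRatings / countReviews;
    }
}

// Plain-Java sketch of the stateful aggregation: one state entry per key,
// updated as each review arrives (the effect of groupByKey + aggregate).
class ReviewsAggregator {
    private final Map<Long, CourseStats> statsByCourse = new HashMap<>();

    CourseStats aggregate(long courseId, double rating) {
        CourseStats stats = statsByCourse.computeIfAbsent(courseId, id -> new CourseStats());
        stats.add(rating);
        return stats;
    }
}
```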

What is each course statistics over the past 90 days?

Defining a hopping window of 90 days
The aggregation is similar to before, but now takes an additional windowing parameter.
see the source code for keepCurrentWindow(window)
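The catch with hopping windows is that a single review belongs to many overlapping windows. With a 90-day window (the advance interval is not stated in the text; one day is my assumption here), each event lands in 90 windows. The sketch below computes, in plain Java, the window start timestamps containing a given event time, mirroring how Kafka Streams assigns events to hopping windows:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of hopping-window assignment: a 90-day window that
// advances by 1 day (the advance interval is an assumption), so every
// review belongs to size/advance = 90 overlapping windows.
class HoppingWindows {
    static final long DAY_MS = 24L * 60 * 60 * 1000;
    static final long SIZE_MS = 90 * DAY_MS;
    static final long ADVANCE_MS = DAY_MS;

    // All window start timestamps whose [start, start + size) contains ts.
    static List<Long> windowStartsFor(long ts) {
        List<Long> starts = new ArrayList<>();
        // Earliest possible start, floored onto the advance grid.
        long start = Math.max(0, ts - SIZE_MS + ADVANCE_MS) / ADVANCE_MS * ADVANCE_MS;
        while (start <= ts) {
            starts.add(start);
            start += ADVANCE_MS;
        }
        return starts;
    }
}
```

Keeping only the "current" window (the `keepCurrentWindow(window)` filter mentioned above) then amounts to selecting, among these, the window whose range covers the present moment.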
$ kafka-topics --create --topic long-term-stats --partitions 3 --replication-factor 1 --zookeeper localhost:2181
$ kafka-topics --create --topic recent-stats --partitions 3 --replication-factor 1 --zookeeper localhost:2181
(from the root directory)
$ mvn clean package
$ java -jar udemy-reviews-aggregator/target/uber-udemy-reviews-aggregator-1.0-SNAPSHOT.jar
$ kafka-avro-console-consumer --topic recent-stats --bootstrap-server localhost:9092 --from-beginning
$ kafka-avro-console-consumer --topic long-term-stats --bootstrap-server localhost:9092 --from-beginning
{"course_id":1294188,"count_reviews":51,"average_rating":4.539}
{"course_id":1294188,"count_reviews":52,"average_rating":4.528}
{"course_id":1294188,"count_reviews":53,"average_rating":4.5}

4) Kafka Connect Sink — Exposing that data back to the users

The Kafka Connect Pipeline
Our configuration. Not hard to come up with when you read the docs
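A JDBC sink configuration for this kind of pipeline typically looks like the following. The connector class and option names (`auto.create`, `insert.mode`, `pk.mode`, `pk.fields`) are real Confluent JDBC Sink Connector settings, but the connector name, connection URL and credentials here are placeholders, not necessarily what the repository uses:

```json
{
  "name": "udemy-reviews-postgres-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "long-term-stats,recent-stats",
    "connection.url": "jdbc:postgresql://localhost:5432/postgres",
    "connection.user": "postgres",
    "connection.password": "postgres",
    "auto.create": "true",
    "insert.mode": "upsert",
    "pk.mode": "record_value",
    "pk.fields": "course_id"
  }
}
```

Upsert mode with `course_id` as the key means each new statistics record overwrites the previous row for that course, which is exactly what a dashboard reading the table wants.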
Some of the results we get in our PostgreSQL database

Last but not least — Final notes


Udemy Instructor, 5x AWS Certified, Kafka Evangelist, New Tech Hunter https://courses.datacumulus.com/
