ML Model Prediction on Streaming Data Using Kafka

Amany Abdelhalim
Published in The Startup · Sep 17, 2020 · 4 min read

In one of my previous posts, I took you through the steps that I performed to preprocess the Criteo dataset, which is used to predict the click-through rate on ads. I also trained ML models to predict labels on the testing data set. You can find the post here.

In this post, I will be taking you through the steps that I performed to simulate the process of ML models predicting labels on streaming data.

Architecture Diagram

When you go through the mentioned post, you will find that I used PySpark in Databricks notebooks to preprocess the Criteo data. I split the preprocessed data into a training set and a testing set. In order to export the test data to my local machine as a single Parquet file, I first saved the test set to the FileStore in one partition, as one file, using dataFrameName.coalesce(1).write.
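A minimal sketch of that export step is below; the DataFrame name and the FileStore path are assumptions, not the exact ones from my notebook:

```python
# Collapse the test set into a single partition so it is written as one
# Parquet file, then save it under the Databricks FileStore so it can be
# downloaded to the local machine.
test_df.coalesce(1) \
    .write \
    .mode("overwrite") \
    .parquet("/FileStore/criteo/test_set")
```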

I used MLeap to export my trained models as zip files. In order to use MLeap, I had to install mleap-spark from Maven, mleap from PyPI, and mlflow. Then, I copied the models to the FileStore so I could download them to my local machine.
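Roughly, the export looks like the sketch below; the model name, DataFrame name, and paths are illustrative, and importing mleap.pyspark is what adds the serializeToBundle method to fitted Spark ML models:

```python
import mleap.pyspark  # noqa: F401 -- adds serializeToBundle() to Spark ML models
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

# Serialize the fitted model to an MLeap bundle (a zip file) on the driver,
# passing a transformed DataFrame so MLeap can capture the schema.
lr_model.serializeToBundle("jar:file:/tmp/lr_model.zip", lr_model.transform(test_df))

# Copy the bundle into the FileStore so it can be downloaded to the local machine.
dbutils.fs.cp("file:/tmp/lr_model.zip", "dbfs:/FileStore/models/lr_model.zip")
```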

Make sure that the version of the mleap package installed from PyPI matches the version of the mleap-spark package installed from Maven. To find out which mleap version is installed from PyPI on Databricks, you can do the following:
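For example (one simple way to check, not necessarily the only one), query the installed package metadata from a notebook cell:

```python
# Print the version of the mleap package installed from PyPI.
import pkg_resources

print(pkg_resources.get_distribution("mleap").version)
```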

You can find the version of the Maven package by looking at its coordinates, e.g. ml.combust.mleap:mleap-spark_2.11:0.16.0, where the last part (0.16.0) is the version.

After downloading my models and the testing dataset to my local machine, I had a Docker Compose stack up and running with Kafka, ZooKeeper, Logstash, Elasticsearch, and Kibana.

I developed a producer where I used the pyarrow library to read the Parquet file that holds the test dataset. The producer then sends the label (class decision) and the features column to Kafka in a streaming fashion.
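A condensed sketch of that producer is below; the file name, topic name, column names, and broker address are assumptions, and the features column may need extra massaging depending on how the Spark vector was written to Parquet:

```python
import json
import time

import pyarrow.parquet as pq
from kafka import KafkaProducer  # kafka-python

# Read the exported test set into a pandas DataFrame via pyarrow.
test_df = pq.read_table("test_set.parquet").to_pandas()

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for _, row in test_df.iterrows():
    # A Spark ML vector saved to Parquet comes back as a struct-like object;
    # it is sent as a string here for simplicity.
    producer.send(
        "criteo_test",
        {"label": int(row["label"]), "features": str(row["features"])},
    )
    time.sleep(0.1)  # brief pause to simulate a stream rather than a bulk load

producer.flush()
```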

Output Sample Showing the Test Data Label and Features Column

I developed a consumer/producer where the consumer part:

consumes the label (class decision) and the features column from Kafka, deserializes the logistic regression model and the SVM model, converts the features column from a sparse vector to a dense vector, and uses the two models to predict a class label from the input features column.

Output Sample Showing the Expected label, Prediction, and Correct for each Model

The producer part:

writes the prediction, the original label, and whether the output was correct to Kafka. A value of 1 for correct indicates that the model’s prediction and the original label match; 0 indicates that they didn’t match. A minimal sketch of this consumer/producer loop is shown below.
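In the sketch, the topic names, broker address, and the placeholder_predict helper are assumptions; placeholder_predict stands in for scoring with the deserialized MLeap models, which is where the sparse-to-dense conversion and the real predictions happen in my code:

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # kafka-python


def placeholder_predict(features):
    """Placeholder for scoring with a deserialized model (logistic regression
    or SVM). In the real consumer, this is where the MLeap bundle is used."""
    return 0.0


consumer = KafkaConsumer(
    "criteo_test",  # topic the first producer writes to (assumed name)
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

models = {"lr": placeholder_predict, "svm": placeholder_predict}

for message in consumer:
    record = message.value
    label = record["label"]
    features = record["features"]  # convert from sparse to dense here if needed

    for name, predict in models.items():
        prediction = predict(features)
        correct = 1 if prediction == label else 0  # 1 = prediction matches the label
        producer.send(
            "predictions_" + name,
            {"label": label, "prediction": prediction, "correct": correct},
        )
```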

I developed a Logstash configuration file that reads the prediction, the original label, and the correct value from Kafka, parses the values as JSON, and writes them to Elasticsearch.
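A minimal Logstash pipeline along those lines might look like the sketch below; the topic names, host names, and index name are assumptions, not my exact configuration:

```
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["predictions_lr", "predictions_svm"]
    codec => "json"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "predictions"
  }
}
```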

I connected to Kibana at “localhost:5601/app/kibana#/discover”. I used “Visualize” to create two pie charts, one for each model (logistic regression and SVM), showing the percentage of correct versus incorrect predictions. I also created a markdown element whose title describes the two pie charts. I then created a dashboard that shows the two pie charts side by side with the markdown as the title of the dashboard. The pie charts get updated as the streaming data arrives.

Kibana Dashboard showing accuracy count for ML models on Streaming Data

I hope you enjoyed my post and found it useful. For the full code of the producer, the consumer/producer, the Logstash configuration file, and the Docker Compose file, visit my GitHub by clicking here.
