The Research Nest
Published in

The Research Nest

Creating a Real-time Data Stream Using Apache Kafka in Python

A beginner-level tutorial on setting up Kafka with python in windows 10

Use Apache Kafka with Python 🐍 in Windows 10 to stream any real-time data 📊

Once we understand how to set up this flow, we can use any data source as input and stream it and then do whatever we want.

What is Apache Kafka 🤔?

In simple terms, it is a streaming platform that is highly scalable, durable, and available. If you want to stream data in any scenario, you can consider using Kafka.

Let us do one better :P Why Kafka 🤔?

I don’t want to go too deep into it, but if you want to explore more, Confluent has an excellent article.

🖥️ Phase 1: Setting up your development environment

It is not very straightforward to use Kafka in a Windows system. You will need something called WSL 2 (Windows Subsystem For Linux). It helps you run a Linux environment on your Windows PC where you can run Kafka.

  1. If the Linux kernel update package doesn’t install properly, you can restart your system and try again. It should work then.
  2. After installing Java, follow the below steps.
  3. Use this blog post reference to install Kafka. You can download the latest version using the wget command ( wget <link>). Follow the steps in the article till the second step to start your Kafka server.
  4. Remember that all the terminals you open must be Linux terminals. You will have one terminal running for the Zookeeper server and one for Kafka. You have to keep them running. For now, we need not worry about what’s happening here.
  5. Open another new terminal and cd to the folder where Kafka is extracted to. Run the below command to create a new “topic” called data-stream. The topic name could be anything. For demonstration purposes, I will be sharing a simple example. We will be using this topic as a reference to stream our data.
$ bin/kafka-topics.sh --create --topic data-stream --bootstrap-server localhost:9092

Python setup for Kafka

  • Python is already preinstalled in the Linux environment. However, we need to install some other stuff to use Kafka in Python.
  • Open another new Linux terminal.
  • Install pip.
$ sudo apt install python3-pip
$ sudo apt install python3-venv
$ mkdir kafka-stream
$ cd kafka-stream/
$ python3 -m venv .kafka
$ source .kafka/bin/activate
$ pip install kafka-python
  • Next, run the below command in the terminal (while inside your project directory) to open your project folder in VS Code. You must notice at the bottom-left that the VS Code is connected to your Linux environment.
$ code .

🖥️ Phase 2: Streaming Data

For now, we just need to know two things about Kafka:

  1. A consumer reads the data sent by the producer. Here, we can do whatever we want with the data we keep receiving from the producer.
  • An API endpoint
  • A web socket
  1. We are defining the producer using KafkaProducer method. Using the value_serializer method, we are setting the producer to send the value as a json.
  2. We are fetching the price JSON object using the requests library. You can refer to this blog if you are unfamiliar with how to use requests.
  3. We are running an infinite while loop where the API call is made every 2 seconds, and the latest price of Bitcoin is fetched. This price is then sent to the consumer in real-time. Notice that we are using data-stream as the topic, which the consumer can identify to read this data.
  4. Similarly, instead of price, you can put any data you want. For example, you can make an API call to fetch the latest tweets under some hashtag. You can also load a dataset, an excel, or a CSV file into a pandas data frame and loop through all the rows to send it to the consumer (If it is a huge dataset, you can load and loop in small chunks).
  5. I recommend using JSON, but you can send data in other formats by setting the appropriate configuration. For more info, you can refer to the official documentation of kafka-python.
  1. value_deserializer needs to be set as per the data we are receiving. Here, it is set for JSON.
  2. message.value contains the actual JSON object received. You can filter it further to get the field you want. So, you must know the JSON structure beforehand. The price value is stored under the “amount” parameter in this case.
  3. Once you read the data in your consumer, you can do anything with it. For example, you can extract some variables from this data and pass it on to a deep learning model to predict something and print your prediction.
  4. You could write some logic and send an alert if some conditions are met. You could also plot the data as a graph on some web page to get a real-time graph.
  5. What you can do is limited by your imagination.
python producer.py
Producer Terminal
Consumer Terminal (It is actual real-time data)
$ rm -rf /tmp/kafka-logs /tmp/zookeeper

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
XQ

Tech 👨‍💻 | Life 🌱 | Careers 👔 | Poetry 🖊️