Data Engineering: Build a pipeline using Kafka, Python & AWS.

Mafikeng Sello Sydney
Sep 3, 2023


Apache Kafka, Python, AWS (EC2, S3, Glue, Athena)

In this article, we walk through the process of building a data pipeline that streams data from a source (in our case, a CSV file), ingests it into an AWS S3 bucket, crawls the data with AWS Glue, and queries it in Athena.

Data Architecture

Building data pipelines with cloud-based services has become a center of interest, and Docker plays a significant role in containerizing the microservices involved.

The diagram above shows the pipeline: data is streamed from a source and ingested into object storage, while Kafka acts as the engine that enables communication between the producer and the consumer.

Dataset

Bank customer segmentation

This dataset consists of over 1 million transactions by more than 800K customers of a bank in India. The data contains information such as customer age (DOB), location, gender, account balance at the time of the transaction, transaction details, transaction amount, and more.

The dataset can be used for different analyses including:

  1. Perform Clustering / Segmentation on the dataset and identify popular customer groups along with their definitions/rules
  2. Perform Location-wise analysis to identify regional trends in India
  3. Perform transaction-related analysis to identify interesting trends that can be used by a bank to improve/optimize their user experiences
  4. Customer Recency, Frequency, Monetary (RFM) analysis
  5. Network analysis or Graph analysis of customer data.

Source: Bank Customer Segmentation (1M+ Transactions)

Apache Kafka Definition

Apache Kafka is an event streaming platform. Event streaming is the practice of capturing data in real time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events.

It also means storing these event streams durably for later retrieval, and manipulating, processing, and routing them to different destination technologies as needed. Event streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.

Source: https://kafka.apache.org/intro

Step 1: Run an AWS EC2 Instance

The initial task of the project is to configure and launch a Linux AWS EC2 instance of the t2.micro instance type.
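Purely as an illustration, launching such an instance from the AWS CLI (run from a machine where the CLI is already configured; the AMI, key pair, and security group IDs below are placeholders) might look like this:

```bash
# Launch a single t2.micro Amazon Linux instance (all IDs are placeholders).
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type t2.micro \
    --count 1 \
    --key-name kafka-project \
    --security-group-ids sg-xxxxxxxxxxxxxxxxx
```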

Step 2: Configure the AWS CLI

In this step, we configure the settings that the AWS Command Line Interface (AWS CLI) uses to interact with AWS.
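A plausible version of these commands, assuming an Amazon Linux instance and a key pair file named kafka-project.pem (both names are placeholders), is:

```bash
# Connect to the EC2 instance over SSH using the key pair downloaded at launch.
ssh -i kafka-project.pem ec2-user@<EC2-PUBLIC-DNS>

# Configure the AWS CLI with an access key ID, secret access key,
# default region, and output format.
aws configure
```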

The commands above allow the user to connect to the instance and interact with AWS from the CLI. The .pem file contains the private key used to authenticate the connection.

Apache Kafka Configuration

Step 3: Install Java & Apache Kafka

Check whether the Java JDK is installed; if not, use sudo yum to install Java.
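A rough sketch of these installation steps on Amazon Linux follows; the Java package name and the Kafka version are assumptions, so pick a current release from the Kafka downloads page:

```bash
# Check whether Java is already available.
java -version

# Install a JDK if it is missing (the package name varies by Amazon Linux version).
sudo yum install -y java-1.8.0-openjdk

# Download and extract Apache Kafka (the version shown is only an example).
wget https://downloads.apache.org/kafka/3.5.1/kafka_2.13-3.5.1.tgz
tar -xzf kafka_2.13-3.5.1.tgz
cd kafka_2.13-3.5.1
```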

The commands above are used to install Apache Kafka on the Linux EC2 instance.

Apache Kafka can then be started, with ZooKeeper handling cluster coordination.
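Following the Kafka quickstart, starting the services from the extracted Kafka directory looks roughly like this (on a single EC2 instance you will typically also set advertised.listeners in config/server.properties to the instance's public IP so that external clients can reach the broker):

```bash
# Start ZooKeeper.
bin/zookeeper-server-start.sh config/zookeeper.properties

# In a separate terminal session, start the Kafka broker.
bin/kafka-server-start.sh config/server.properties
```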

Create a topic to store your events. All of Kafka's command-line tools have additional options; run the kafka-topics.sh command without any arguments to display usage information.
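For example, with a placeholder topic name:

```bash
bin/kafka-topics.sh --create --topic bank-transactions --bootstrap-server localhost:9092
```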

A Kafka client communicates with the Kafka brokers over the network to write (or read) events. Once received, the brokers store the events in a durable and fault-tolerant manner for as long as you need, even forever. Run the console producer client to write a few events into your topic.
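Assuming the same placeholder topic name:

```bash
bin/kafka-console-producer.sh --topic bank-transactions --bootstrap-server localhost:9092
```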

Open another terminal session and run the console consumer client to read the events.
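For example:

```bash
bin/kafka-console-consumer.sh --topic bank-transactions --from-beginning --bootstrap-server localhost:9092
```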

Step 4: Kafka producer & consumer

An Apache Kafka producer is a client application that publishes (writes) events to a Kafka cluster. The producer notebook sets up the client, along with the configuration settings used for tuning, and streams the CSV records to the topic (see the sketch below).
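A minimal sketch of such a producer, assuming the kafka-python library, pandas for reading the CSV, and placeholder broker, topic, and file names (the article's notebook may differ):

```python
import json
import time

import pandas as pd
from kafka import KafkaProducer

# Producer that serializes each record as JSON (broker address is a placeholder).
producer = KafkaProducer(
    bootstrap_servers=["<EC2-PUBLIC-IP>:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Read the bank transactions CSV (hypothetical file name).
df = pd.read_csv("bank_transactions.csv")

# Stream each row to the topic, pausing briefly to simulate a live feed.
for record in df.to_dict(orient="records"):
    producer.send("bank-transactions", value=record)
    time.sleep(1)

producer.flush()
```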

An Apache Kafka consumer is a client application that subscribes to (reads and processes) events. The consumer notebook provides an overview of the Kafka consumer and the configuration settings used for tuning; the consumer pushes all of the subscribed events to an AWS S3 bucket (see the sketch below).
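A matching consumer sketch, again assuming kafka-python plus s3fs for writing to S3 (the notebook may instead use boto3); the broker address, topic, and bucket names are placeholders:

```python
import json

from kafka import KafkaConsumer
from s3fs import S3FileSystem

# Consumer that deserializes the JSON events published by the producer.
consumer = KafkaConsumer(
    "bank-transactions",
    bootstrap_servers=["<EC2-PUBLIC-IP>:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# s3fs picks up the credentials configured earlier with `aws configure`.
s3 = S3FileSystem()

# Write each consumed event to the bucket as its own JSON object so that
# AWS Glue can crawl the prefix in the next step.
for count, message in enumerate(consumer):
    with s3.open(f"s3://my-kafka-bucket/transactions/event_{count}.json", "w") as f:
        json.dump(message.value, f)
```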

Step 5: Crawl data from the S3 bucket using AWS Glue
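This step can be done entirely from the AWS console: create a Glue crawler over the S3 prefix written by the consumer, run it to populate the Glue Data Catalog, and then query the resulting table in Athena. A scripted equivalent with the AWS CLI might look roughly like the sketch below; every crawler, role, database, table, and bucket name is a placeholder:

```bash
# Create and run a Glue crawler over the S3 prefix written by the consumer.
aws glue create-crawler \
    --name bank-transactions-crawler \
    --role AWSGlueServiceRole-kafka-project \
    --database-name kafka_project_db \
    --targets '{"S3Targets": [{"Path": "s3://my-kafka-bucket/transactions/"}]}'
aws glue start-crawler --name bank-transactions-crawler

# Once the crawler has populated the Data Catalog, query the table in Athena.
aws athena start-query-execution \
    --query-string "SELECT * FROM transactions LIMIT 10" \
    --query-execution-context Database=kafka_project_db \
    --result-configuration OutputLocation=s3://my-kafka-bucket/athena-results/
```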

This completes our pipeline. In future projects, we will add a workflow-orchestration layer using Apache Airflow to configure a custom pipeline, and run Kafka in a Docker container.
