Real-time data pipeline with Kafka to BigQuery

Burak Dogu
3 min read · Sep 8, 2022

Nowadays, data technology is moving toward cloud platforms such as Google Cloud Platform (GCP) and Amazon Web Services (AWS), which offer a wide variety of tools for ETL processes. Apache Spark is also one of the most important tools for both real-time and batch processing. In this post, I will walk through a real-time data pipeline built with GCP tools and Kafka. For more details and the scripts, you can visit the GitHub repository.

In this project, the real-time data source is the track currently playing on Spotify. The Spotify API is used to fetch this data and send it to a KafkaProducer, as sketched below.
Endpoint: https://api.spotify.com/v1/me/player/currently-playing
HTTP Method: GET
Requirements: OAuth token, user ID, headers
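For context, a minimal call against this endpoint could look like the sketch below, using the requests library. The token value is a placeholder you would obtain through Spotify's OAuth flow, and only the fields this pipeline needs are pulled out of the response.

import requests

# Placeholder: an OAuth access token with the user-read-currently-playing scope.
SPOTIFY_TOKEN = "YOUR_OAUTH_TOKEN"

response = requests.get(
    "https://api.spotify.com/v1/me/player/currently-playing",
    headers={"Authorization": f"Bearer {SPOTIFY_TOKEN}"},
)

# 200 returns the playing track; 204 means nothing is currently playing.
if response.status_code == 200:
    item = response.json()["item"]
    print(item["name"], item["artists"][0]["name"])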

Dataproc Cluster and Kafka Installation

A single-node Dataproc cluster was created with Google Cloud Shell. Kafka is used to capture the real-time data and deliver it to Spark, so the Kafka archive (a .tgz file) was uploaded to the Compute Engine instance created with the Dataproc cluster.
Kafka Topic:
bin/kafka-topics.sh --create --topic spotify-kafka --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
KafkaProducer:
A Python file was created on the VM to act as the KafkaProducer that sends the streaming data.
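A minimal sketch of such a producer is shown below, using the kafka-python library. The topic name matches the one created above; the polling interval and the message field names are assumptions of this sketch, not taken from the original script.

import json
import time
import requests
from kafka import KafkaProducer  # kafka-python

SPOTIFY_TOKEN = "YOUR_OAUTH_TOKEN"  # placeholder OAuth token
ENDPOINT = "https://api.spotify.com/v1/me/player/currently-playing"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get(ENDPOINT, headers={"Authorization": f"Bearer {SPOTIFY_TOKEN}"})
    if resp.status_code == 200:  # 204 means nothing is playing right now
        item = resp.json().get("item") or {}
        producer.send(
            "spotify-kafka",
            value={
                "song_name": item.get("name"),
                "artist_name": item.get("artists", [{}])[0].get("name"),
            },
        )
    time.sleep(10)  # poll every 10 seconds (arbitrary choice for this sketch)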

Spark Processes

The KafkaProducer sends the playing track's song name and artist name; in the Spark process, a timestamp is also added to the DataFrame to illustrate how many times, and when, the songs were listened to.
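The read side of this Spark Structured Streaming job could be sketched as follows; the schema field names simply mirror what the producer sketch above sends, and the Kafka package has to be available on the cluster.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Requires the spark-sql-kafka-0-10 package on the classpath.
spark = SparkSession.builder.appName("spotify-streaming").getOrCreate()

# Schema of the JSON messages sent by the KafkaProducer sketch above.
schema = StructType([
    StructField("song_name", StringType()),
    StructField("artist_name", StringType()),
])

raw_df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "spotify-kafka")
          .option("startingOffsets", "latest")
          .load())

# Kafka values arrive as bytes: cast to string, parse the JSON,
# and add a processing timestamp to track when each song was heard.
tracks_df = (raw_df.selectExpr("CAST(value AS STRING) AS json")
             .select(F.from_json("json", schema).alias("track"))
             .select("track.*")
             .withColumn("timestamp", F.current_timestamp()))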

Output mode is set to append so that each micro-batch is written into BigQuery. Google Cloud Storage is used for the checkpoint location and the temporary bucket, and the target table is identified by a table ID in the format <GCP project ID>.<dataset>.<table name>. Finally, the query is triggered hourly.
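Under those settings, the write side could look roughly like the sketch below, using the spark-bigquery-connector. The bucket names and the table ID are placeholders, and tracks_df is the streaming DataFrame from the previous sketch.

# tracks_df is the streaming DataFrame built in the previous sketch.
query = (tracks_df.writeStream
         .format("bigquery")
         .outputMode("append")                                   # write each micro-batch
         .option("table", "my-gcp-project.spotify_dataset.playing_tracks")
         .option("temporaryGcsBucket", "my-temporary-bucket")    # staging bucket for the load
         .option("checkpointLocation", "gs://my-checkpoint-bucket/spotify")
         .trigger(processingTime="1 hour")                       # hourly trigger
         .start())
query.awaitTermination()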
The data sent from Spark was ingested into BigQuery successfully.

In this post, I tried to explain how to build a streaming process and ingest the data into Google BigQuery. This pipeline could be extended by adding BigQuery scheduled queries to follow the desired schedule.
