How to Export Data from Kafka to Pub/Sub using Dataproc Serverless [Java]

Abhishek Pratap Singh
Google Cloud - Community
3 min read · Sep 16, 2024

Dataproc Serverless

Dataproc Serverless lets users run Spark batch workloads without provisioning and managing their own clusters. Users simply specify the workload parameters and submit the workload to the Dataproc Serverless service, which runs it on managed compute infrastructure that autoscales resources as required.
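
As a rough illustration (not part of this walkthrough), submitting a standalone Spark jar as a serverless batch can look like the following; the region, jar path, and main class are placeholders:

# Submit a Spark jar as a Dataproc Serverless batch (placeholder values)
gcloud dataproc batches submit spark \
  --region=us-central1 \
  --jars=gs://my-bucket/my-spark-job.jar \
  --class=com.example.MySparkJob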


Dataproc Templates

Dataproc Templates is an open-source repository maintained by Google Cloud Platform. It provides templates that can be used to run Spark workloads on Dataproc Serverless for various common use cases. The templates are available in Java and Python.

One of the common use cases is using Dataproc Serverless to migrate data from one data source to another. In this blog post, we will cover how to move data from Kafka to Pub/Sub using the KAFKATOPUBSUB Java Dataproc template.

Prerequisites

1. Log in to your GCP project and enable the Dataproc API.
Dataproc API

Enable the API by clicking the “Enable” button
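
Alternatively, the API can be enabled from the command line:

# Enable the Dataproc API for your project
gcloud services enable dataproc.googleapis.com --project=<PROJECT_ID>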

2. Make sure that “Private Google Access” is switched on in your subnet settings. This applies even if you are using the default subnet.
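
One way to check the setting, and enable it if needed, is from the command line (the example assumes the default subnet; the region is a placeholder):

# Check whether Private Google Access is enabled on the subnet
gcloud compute networks subnets describe default \
  --region=<REGION> --format="get(privateIpGoogleAccess)"

# Enable it if the command above prints "False"
gcloud compute networks subnets update default \
  --region=<REGION> --enable-private-ip-google-access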

3. Create a staging bucket. This can be done either through the Cloud console or by running the following commands:

export GCS_STAGING_BUCKET=<staging-bucket-name>
gsutil mb gs://$GCS_STAGING_BUCKET

4. Make sure Maven is installed and added to the PATH variable. You can confirm this by running the mvn --version command. If it is not on the PATH, locate the `bin` directory of your Apache Maven installation, copy its path, and run the following commands:

export PATH=<PATH_OF_APACHE_MAVEN_BIN_DIRECTORY>:$PATH
mvn --version

5. If the Kafka server is running on a GCP VM, add a VPC firewall rule to allow TCP ingress traffic on port 9092. This is required for Dataproc Serverless to reach the Kafka server.

New VPC rule added to allow TCP ingress traffic at port 9092
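
A rule along these lines can be created with gcloud; the rule name and source range below are placeholders and should be narrowed to match your own network:

# Allow ingress TCP traffic on port 9092 (Kafka) within the VPC
gcloud compute firewall-rules create allow-kafka-9092 \
  --network=default \
  --direction=INGRESS \
  --allow=tcp:9092 \
  --source-ranges=10.128.0.0/9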

6. Run the ZooKeeper service and the Kafka server in separate terminals. Confirm that you can see the messages in the relevant Kafka topic.

Sample Kafka topic with a few messages
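
With a standard Apache Kafka distribution, the above typically translates to commands like these (the topic name is a placeholder):

# Terminal 1: start ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties

# Terminal 2: start the Kafka broker
bin/kafka-server-start.sh config/server.properties

# Terminal 3: verify that messages exist in the topic
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic <KAFKA_TOPIC> --from-beginning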

7. Create a Pub/Sub topic.
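
This can be done from the console or with gcloud:

# Create the destination Pub/Sub topic
gcloud pubsub topics create <PUBSUB_TOPIC_NAME>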

Steps for Exporting Data from Kafka to Pub/Sub

1. Clone the Dataproc Templates repository and set up gcloud authentication:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
gcloud auth application-default login

2. Export the relevant variables

export GCP_PROJECT=<PROJECT_ID>
export REGION=<PROJECT_REGION>
export SUBNET=<SUBNET_NAME>
export GCS_STAGING_LOCATION=<GCS_PATH_TO_STAGING_LOCATION>

export GCS_CHECKPOINT_LOCATION=<GCS_PATH_TO_KAFKA_CHECKPOINT_LOCATION>
export KAFKA_SERVER=<KAFKA_BROKER_LIST>
export KAFKA_TOPIC=<KAFKA_TOPICS_LIST>
export PUBSUB_PROJECT=<PROJECT_ID>
export PUBSUB_TOPIC=<PUBSUB_TOPIC_NAME>
export STARTING_OFFSET=<STARTING_OFFSET_VALUE>
export STREAM_TERMINATION=<STREAM_TERMINATION_VALUE>
  • GCS_CHECKPOINT_LOCATION: Stores the checkpoint for the Kafka pipeline
  • KAFKA_SERVER: Broker list in the format 10.128.0.15:9092
  • KAFKA_TOPIC: Kafka topics from which data is to be imported
  • PUBSUB_PROJECT: Project ID where the Pub/Sub topic is present
  • PUBSUB_TOPIC: Pub/Sub topic name where data is to be exported
  • STARTING_OFFSET: Accepted values: “earliest”, “latest” (streaming only)
  • STREAM_TERMINATION: Time in milliseconds to wait before terminating the stream
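
Purely as an illustration, a filled-in set of these exports might look like the following (every value below is made up for this example):

# Example values only -- substitute your own project, bucket, and topic names
export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export SUBNET=default
export GCS_STAGING_LOCATION=gs://my-staging-bucket/staging

export GCS_CHECKPOINT_LOCATION=gs://my-staging-bucket/kafka-checkpoint
export KAFKA_SERVER=10.128.0.15:9092
export KAFKA_TOPIC=orders
export PUBSUB_PROJECT=my-gcp-project
export PUBSUB_TOPIC=orders-export
export STARTING_OFFSET=earliest
export STREAM_TERMINATION=60000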

3. Trigger the Dataproc job

bin/start.sh \
-- \
--template KAFKATOPUBSUB \
--templateProperty project.id=${GCP_PROJECT} \
--templateProperty kafka.pubsub.checkpoint.location=${GCS_CHECKPOINT_LOCATION} \
--templateProperty kafka.pubsub.bootstrap.servers=${KAFKA_SERVER} \
--templateProperty kafka.pubsub.input.topic=${KAFKA_TOPIC} \
--templateProperty kafka.pubsub.output.topic=${PUBSUB_TOPIC} \
--templateProperty kafka.pubsub.output.projectId=${PUBSUB_PROJECT} \
--templateProperty kafka.pubsub.starting.offset=${STARTING_OFFSET} \
--templateProperty kafka.pubsub.await.termination.timeout=${STREAM_TERMINATION}

4. Monitor the job on the Dataproc Batches page in the GCP console.
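
The same information is available from the command line; the batch ID is printed when the job is submitted:

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=${REGION}

# Describe a specific batch to see its state
gcloud dataproc batches describe <BATCH_ID> --region=${REGION}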

5. Check the messages in a Pub/Sub subscription.

Exported data in Pub/Sub
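
One quick way to inspect the exported messages is to attach a subscription to the topic and pull from it. Note that a subscription only receives messages published after it is created, so create it before triggering the job (the subscription name below is a placeholder):

# Create a subscription on the destination topic
gcloud pubsub subscriptions create kafka-export-sub --topic=${PUBSUB_TOPIC}

# Pull a few messages to confirm the export worked
gcloud pubsub subscriptions pull kafka-export-sub --auto-ack --limit=10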

