How to Export Data from Kafka to Pub/Sub using Dataproc Serverless [Java]

Abhishek Pratap Singh
Google Cloud - Community
3 min read · Sep 16, 2024

Dataproc Serverless

Dataproc Serverless lets users run Spark batch workloads without provisioning and managing their own clusters. Users simply specify the workload parameters and submit the workload to the Dataproc Serverless service, which runs it on managed compute infrastructure that autoscales resources as required.
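
As a rough illustration (not part of this walkthrough), submitting a standalone Spark jar as a serverless batch can look like the following; the region, jar path, and main class are placeholders:

# Submit a Spark jar as a Dataproc Serverless batch (placeholder values)
gcloud dataproc batches submit spark \
  --region=us-central1 \
  --jars=gs://my-bucket/my-spark-job.jar \
  --class=com.example.MySparkJob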


Dataproc Templates

Dataproc Templates is an open-source repository maintained by Google Cloud Platform. It provides templates that can be used to run Spark workloads on Dataproc Serverless for various common use cases. The templates are available in Java and Python.

One of the common use cases is using Dataproc Serverless to migrate data from one data source to another. In this blog post, we will cover how to move data from Kafka to Pub/Sub using the KAFKATOPUBSUB Java Dataproc template.

Prerequisites

1. Log in to your GCP project and enable the Dataproc API.
Dataproc API

Enable the API by clicking the “Enable” button
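
Alternatively, the API can be enabled from the command line:

# Enable the Dataproc API for your project
gcloud services enable dataproc.googleapis.com --project=<PROJECT_ID>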

2. Make sure that “Private Google Access” is switched on in your subnet settings. This applies even if you are using the default subnet.
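
One way to check the setting, and enable it if needed, is from the command line (the example assumes the default subnet; the region is a placeholder):

# Check whether Private Google Access is enabled on the subnet
gcloud compute networks subnets describe default \
  --region=<REGION> --format="get(privateIpGoogleAccess)"

# Enable it if the command above prints "False"
gcloud compute networks subnets update default \
  --region=<REGION> --enable-private-ip-google-access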

3. Create a staging bucket. This can be done either through the Cloud console or by running the following commands:

export GCS_STAGING_BUCKET=<staging-bucket-name>
gsutil mb gs://$GCS_STAGING_BUCKET

4. Make sure Maven is installed and added to the PATH variable. You can confirm this by running the mvn --version command. If it is not on the PATH, locate the `bin` directory of your Apache Maven installation, copy its path, and run the following commands:

export PATH=<PATH_OF_APACHE_MAVEN_BIN_DIRECTORY>:$PATH
mvn --version

5. If the Kafka server is running on a GCP VM, add a VPC firewall rule to allow TCP ingress traffic on port 9092. This is required for Dataproc Serverless to reach the Kafka server.

New VPC rule added to allow TCP ingress traffic at port 9092
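
A rule along these lines can be created with gcloud; the rule name and source range below are placeholders and should be narrowed to match your own network:

# Allow ingress TCP traffic on port 9092 (Kafka) within the VPC
gcloud compute firewall-rules create allow-kafka-9092 \
  --network=default \
  --direction=INGRESS \
  --allow=tcp:9092 \
  --source-ranges=10.128.0.0/9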

6. Run the ZooKeeper service and the Kafka server in separate terminals. Confirm that you can see the messages in the relevant Kafka topic.

Sample Kafka topic with a few messages
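
With a standard Apache Kafka distribution, the above typically translates to commands like these (the topic name is a placeholder):

# Terminal 1: start ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties

# Terminal 2: start the Kafka broker
bin/kafka-server-start.sh config/server.properties

# Terminal 3: verify that messages exist in the topic
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic <KAFKA_TOPIC> --from-beginning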

7. Create a Pub/Sub topic.
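
This can be done from the console or with gcloud:

# Create the destination Pub/Sub topic
gcloud pubsub topics create <PUBSUB_TOPIC_NAME>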

Steps for Exporting Data from Kafka to Pub/Sub

1. Clone the Dataproc Templates repository and set up gcloud authentication:
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
gcloud auth application-default login

2. Export the relevant variables

export GCP_PROJECT=<PROJECT_ID>
export REGION=<PROJECT_REGION>
export SUBNET=<SUBNET_NAME>
export GCS_STAGING_LOCATION=<GCS_PATH_TO_STAGING_LOCATION>

export GCS_CHECKPOINT_LOCATION=<GCS_PATH_TO_KAFKA_CHECKPOINT_LOCATION>
export KAFKA_SERVER=<KAFKA_BROKER_LIST>
export KAFKA_TOPIC=<KAFKA_TOPICS_LIST>
export PUBSUB_PROJECT=<PROJECT_ID>
export PUBSUB_TOPIC=<PUBSUB_TOPIC_NAME>
export STARTING_OFFSET=<STARTING_OFFSET_VALUE>
export STREAM_TERMINATION=<STREAM_TERMINATION_VALUE>
  • GCS_CHECKPOINT_LOCATION: Stores the checkpoint for the Kafka pipeline
  • KAFKA_SERVER: Broker list in the format 10.128.0.15:9092
  • KAFKA_TOPIC: Kafka topics from which data is to be imported
  • PUBSUB_PROJECT: Project ID where the Pub/Sub topic is present
  • PUBSUB_TOPIC: Pub/Sub topic name where data is to be exported
  • STARTING_OFFSET: Accepted values: “earliest”, “latest” (streaming only)
  • STREAM_TERMINATION: Time in milliseconds to wait before terminating the stream
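
Purely as an illustration, a filled-in set of these exports might look like the following (every value below is made up for this example):

# Example values only -- substitute your own project, bucket, and topic names
export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export SUBNET=default
export GCS_STAGING_LOCATION=gs://my-staging-bucket/staging

export GCS_CHECKPOINT_LOCATION=gs://my-staging-bucket/kafka-checkpoint
export KAFKA_SERVER=10.128.0.15:9092
export KAFKA_TOPIC=orders
export PUBSUB_PROJECT=my-gcp-project
export PUBSUB_TOPIC=orders-export
export STARTING_OFFSET=earliest
export STREAM_TERMINATION=60000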

3. Trigger the Dataproc job

bin/start.sh \
-- \
--template KAFKATOPUBSUB \
--templateProperty project.id=${GCP_PROJECT} \
--templateProperty kafka.pubsub.checkpoint.location=${GCS_CHECKPOINT_LOCATION} \
--templateProperty kafka.pubsub.bootstrap.servers=${KAFKA_SERVER} \
--templateProperty kafka.pubsub.input.topic=${KAFKA_TOPIC} \
--templateProperty kafka.pubsub.output.topic=${PUBSUB_TOPIC} \
--templateProperty kafka.pubsub.output.projectId=${PUBSUB_PROJECT} \
--templateProperty kafka.pubsub.starting.offset=${STARTING_OFFSET} \
--templateProperty kafka.pubsub.await.termination.timeout=${STREAM_TERMINATION}

4. Monitor the job on the Dataproc Batches page in the GCP console.
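
The same information is available from the command line; the batch ID is printed when the job is submitted:

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=${REGION}

# Describe a specific batch to see its state
gcloud dataproc batches describe <BATCH_ID> --region=${REGION}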

5. Check the messages in a Pub/Sub subscription.

Exported data in Pub/Sub
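
One quick way to inspect the exported messages is to attach a subscription to the topic and pull from it. Note that a subscription only receives messages published after it is created, so create it before triggering the job (the subscription name below is a placeholder):

# Create a subscription on the destination topic
gcloud pubsub subscriptions create kafka-export-sub --topic=${PUBSUB_TOPIC}

# Pull a few messages to confirm the export worked
gcloud pubsub subscriptions pull kafka-export-sub --auto-ack --limit=10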

