How to Export Data from Kafka to Pub/Sub using Dataproc Serverless [Java]
Dataproc Serverless
Dataproc Serverless lets users run Spark batch workloads without provisioning and managing their own clusters. Users simply specify the workload parameters and submit the workload to the Dataproc Serverless service, which runs it on managed compute infrastructure that autoscales resources as required.
Dataproc Templates
Dataproc Templates is an open-source repository maintained by Google Cloud Platform. It provides templates for running Spark workloads on Dataproc Serverless for various common use cases. The templates are available in Java and Python.
One common use case is using Dataproc Serverless to migrate data from one data source to another. In this blog post, we will cover how to move data from Kafka to Pub/Sub using the KAFKATOPUBSUB Java Dataproc template.
Prerequisites
1. Log in to your GCP project and enable the Dataproc API.
You can enable the API by clicking the “Enable” button on its console page, or from Cloud Shell as shown below.
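A minimal sketch using the standard gcloud services command:
gcloud services enable dataproc.googleapis.com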
2. Make sure that “Private Google Access” is switched on in your subnet settings. This is required even if you are using the default subnet; it can be enabled from Cloud Shell as shown below.
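A minimal sketch, assuming the default subnet (the region name here is illustrative):
gcloud compute networks subnets update default \
  --region=us-central1 \
  --enable-private-ip-google-access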
3. Create a staging bucket. This can be done by either of the following methods:
- Click the “Create” button on the Cloud Storage page in the GCP console, or
- Run the following commands in the Cloud Shell:
export GCS_STAGING_BUCKET=<staging-bucket-name>
gsutil mb gs://$GCS_STAGING_BUCKET
4. Make sure Maven is installed and added to the PATH variable. You can confirm this by running the mvn --version command. If Maven is not on the PATH, locate the `bin` directory of the Apache Maven installation, copy its path, and run the following commands:
export PATH=<PATH_OF_APACHE_MAVEN_BIN_DIRECTORY>:$PATH
mvn --version
5. If the Kafka server is on a GCP VM, add a VPC firewall rule to allow TCP ingress traffic on port 9092. This is required for Dataproc to access the Kafka server; a sketch of the rule is shown below.
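A minimal sketch using gcloud (the rule name and source range are illustrative; restrict the source range to your subnet in practice):
gcloud compute firewall-rules create allow-kafka-9092 \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:9092 \
  --source-ranges=10.0.0.0/8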
6. Run the ZooKeeper service and the Kafka server in separate terminals. Confirm that you are able to see the messages in the relevant Kafka topic, for example with the console consumer shown below.
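Assuming a standard Apache Kafka distribution (the topic name is a placeholder):
# Terminal 1: start ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
# Terminal 2: start the Kafka broker
bin/kafka-server-start.sh config/server.properties
# Terminal 3: read messages from the topic to verify
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <KAFKA_TOPIC> --from-beginning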
7. Create a Pub/Sub topic. You can do this from the Pub/Sub page in the GCP console, or from Cloud Shell as shown below.
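For example (the topic name is a placeholder):
gcloud pubsub topics create <PUBSUB_TOPIC_NAME>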
Steps for Exporting Data from Kafka to Pub/Sub
1. Clone the Dataproc Templates repository and set up gcloud login
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
gcloud auth application-default login
2. Export the relevant variables
export GCP_PROJECT=<PROJECT_ID>
export REGION=<PROJECT_REGION>
export SUBNET=<SUBNET_NAME>
export GCS_STAGING_LOCATION=<GCS_PATH_TO_STAGING_LOCATION>
export GCS_CHECKPOINT_LOCATION=<GCS_PATH_TO_KAFKA_CHECKPOINT_LOCATION>
export KAFKA_SERVER=<KAFKA_BROKER_LIST>
export KAFKA_TOPIC=<KAFKA_TOPICS_LIST>
export PUBSUB_PROJECT=<PROJECT_ID>
export PUBSUB_TOPIC=<PUBSUB_TOPIC_NAME>
export STARTING_OFFSET=<STARTING_OFFSET_VALUE>
export STREAM_TERMINATION=<STREAM_TERMINATION_VALUE>
- GCS_CHECKPOINT_LOCATION: Stores the checkpoint for the Kafka pipeline
- KAFKA_SERVER: Broker list in the format <host>:<port>, e.g. 10.128.0.15:9092
- KAFKA_TOPIC: Kafka topics from which data is to be imported
- PUBSUB_PROJECT: Project ID where Pub/Sub is present
- PUBSUB_TOPIC: Pub/Sub topic name where data is to be exported
- STARTING_OFFSET: Accepted values: “earliest”, “latest” (streaming only)
- STREAM_TERMINATION: Time in milliseconds to wait before terminating the stream
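For reference, here is a filled-in example with illustrative values (the project, bucket, and topic names are placeholders, not real resources):
export GCP_PROJECT=my-project-id
export REGION=us-central1
export SUBNET=default
export GCS_STAGING_LOCATION=gs://my-staging-bucket/staging
export GCS_CHECKPOINT_LOCATION=gs://my-staging-bucket/checkpoint
export KAFKA_SERVER=10.128.0.15:9092
export KAFKA_TOPIC=my-kafka-topic
export PUBSUB_PROJECT=my-project-id
export PUBSUB_TOPIC=my-pubsub-topic
export STARTING_OFFSET=earliest
export STREAM_TERMINATION=120000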
3. Trigger the Dataproc job
bin/start.sh \
-- \
--template KAFKATOPUBSUB \
--templateProperty project.id=${GCP_PROJECT} \
--templateProperty kafka.pubsub.checkpoint.location=${GCS_CHECKPOINT_LOCATION} \
--templateProperty kafka.pubsub.bootstrap.servers=${KAFKA_SERVER} \
--templateProperty kafka.pubsub.input.topic=${KAFKA_TOPIC} \
--templateProperty kafka.pubsub.output.topic=${PUBSUB_TOPIC} \
--templateProperty kafka.pubsub.output.projectId=${PUBSUB_PROJECT} \
--templateProperty kafka.pubsub.starting.offset=${STARTING_OFFSET} \
--templateProperty kafka.pubsub.await.termination.timeout=${STREAM_TERMINATION}
4. Monitor the job on the Dataproc Batches page in the GCP console, or with gcloud as shown below.
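A minimal sketch using gcloud (replace <BATCH_ID> with the ID printed when the job was submitted):
# List recent batch workloads in the region
gcloud dataproc batches list --region=$REGION
# Inspect a specific batch
gcloud dataproc batches describe <BATCH_ID> --region=$REGION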
5. Check the messages in a Pub/Sub subscription, as sketched below.
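For example, assuming you create a subscription just for verification (the subscription name is illustrative):
# Create a subscription on the output topic
gcloud pubsub subscriptions create verify-sub --topic=$PUBSUB_TOPIC
# Pull a few messages to confirm data arrived
gcloud pubsub subscriptions pull verify-sub --auto-ack --limit=5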