Running Cloud SQL Proxy on dataflow workers with custom containers

Sumit Banerjee
Google Cloud - Community
4 min read · Sep 15, 2024

Google Cloud Dataflow is a fully managed stream and batch processing service that allows developers to execute data processing tasks effortlessly. When Dataflow starts up worker VMs, it uses Docker container images to launch containerized SDK processes on the workers. Typically, a pipeline employs a prebuilt Apache Beam image. However, you might want to modify the runtime environment of user code in Dataflow pipelines by providing a custom container image. One use case for such customization is starting Cloud SQL Proxy. This blog post serves as a comprehensive guide to securely connect to your Cloud SQL database from Dataflow workers using Cloud SQL Proxy.

Why Cloud SQL Proxy?

Cloud SQL Proxy acts as an intermediary between your Dataflow worker and your Cloud SQL instance. It establishes a secure connection using your Google Cloud credentials, removing the need to whitelist IPs or manage SSL certificates directly on your Dataflow workers.
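
For example, once the proxy is listening on a worker, pipeline code can reach the database over localhost with ordinary credentials. The snippet below is only a sketch: it assumes a PostgreSQL instance, the PostgreSQL JDBC driver on the worker classpath, and the proxy on its default port 5432; DB_NAME, DB_USER, and DB_PASSWORD are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn that connects through the locally running Cloud SQL Proxy.
// Assumes a PostgreSQL instance and the proxy's default PostgreSQL port (5432).
class WriteToCloudSqlFn extends DoFn<String, Void> {

  private transient Connection connection;

  @Setup
  public void setup() throws Exception {
    // The proxy terminates the secure tunnel, so the pipeline only ever talks
    // to 127.0.0.1; no IP allowlists or SSL certificates are needed here.
    connection = DriverManager.getConnection(
        "jdbc:postgresql://127.0.0.1:5432/DB_NAME", "DB_USER", "DB_PASSWORD");
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // Use `connection` to read from or write to Cloud SQL for each element.
  }

  @Teardown
  public void teardown() throws Exception {
    if (connection != null) {
      connection.close();
    }
  }
}

In a real pipeline you would typically reach for Beam's JdbcIO, which wraps this connection handling for you, but the underlying idea is the same: connect to the proxy on localhost.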

Why Custom Containers?

This blog focuses on one such use case, running Cloud SQL Proxy on Dataflow with a custom container, but there are several other reasons you might want to use custom containers, such as:

  • To pre-install pipeline dependencies and reduce worker start time
  • To pre-install pipeline dependencies that aren’t available in public repositories
  • To pre-install pipeline dependencies when access to public repositories is turned off, possibly for security reasons
  • To pre-stage large files to reduce worker start time
  • To launch third-party software in the background
  • To customize the execution environment

Prerequisites

Before you begin, ensure you have:

  • Google Cloud Project: A valid Google Cloud project with billing enabled.
  • Cloud SQL Instance: An instance of Cloud SQL (PostgreSQL or MySQL) set up in your project.
  • Cloud SDK Installed: The Google Cloud SDK (gcloud) installed and configured on your local machine.
  • Docker: Docker installed for building custom containers.
  • Dataflow Permissions: Appropriate IAM permissions to run Dataflow jobs and access Cloud SQL.
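
If some of the required services aren't turned on yet in your project, a quick way to enable them (adjust the list to what you actually use) is:

gcloud services enable dataflow.googleapis.com sqladmin.googleapis.com \
    artifactregistry.googleapis.com cloudbuild.googleapis.com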

Step 1: Create a Custom Docker Image

To use Cloud SQL Proxy on Dataflow workers, you need to create a custom Docker image that installs the Cloud SQL Proxy and launches it when each worker container starts.

1.1. Create a Dockerfile

Create a Dockerfile with the following content:

FROM apache/beam_java8_sdk:2.59.0

# Download the Cloud SQL Proxy binary and make it executable
RUN curl -o /opt/cloud-sql-proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.13.0/cloud-sql-proxy.linux.amd64 \
    && chmod +x /opt/cloud-sql-proxy

# A RUN instruction only executes at build time, so the proxy must be started when
# the container launches: run it in the background, then hand control to the Beam
# boot script so the SDK worker process still comes up.
# Replace INSTANCE_CONNECTION_NAME with your instance's connection name (PROJECT:REGION:INSTANCE).
ENTRYPOINT ["/bin/bash", "-c", "/opt/cloud-sql-proxy INSTANCE_CONNECTION_NAME & exec /opt/apache/beam/boot \"$@\"", "--"]

Note:

  • The custom container’s runtime version must align with the runtime you’ll use to launch the pipeline. For instance, if you’re starting the pipeline from a local Java 11 environment, the FROM line should specify a Java 11 base image such as apache/beam_java11_sdk.
  • The container’s ENTRYPOINT must ultimately run the /opt/apache/beam/boot script, passing along the arguments it receives (as the bash -c entrypoint above does after starting the proxy). This script initializes the worker environment and launches the SDK worker process. If the boot script is never invoked, the Dataflow workers won’t start correctly.

1.2. Build and push the Docker Image

You can use Cloud Build or Docker to build your container image and push it to an Artifact Registry repository. Use the following command to build and push the image with Cloud Build (a plain Docker alternative is shown after the replacement list below):

gcloud builds submit --tag \
REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/FILE_NAME:TAG .

Replace the following:

  • REGION: the region of your Artifact Registry repository (typically the same region you’ll deploy your Dataflow job in).
  • PROJECT_ID: the ID of your Google Cloud project.
  • REPOSITORY: the image repository name.
  • FILE_NAME: the name to give the container image (the last path component of the image URI).
  • TAG: the image tag. Always specify a versioned container SHA or tag. Don’t use the :latest tag or a mutable tag.
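
As noted above, you can also build and push the image with Docker directly instead of Cloud Build. Roughly, and assuming you have first authenticated Docker against Artifact Registry, the equivalent commands are:

# One-time: let Docker authenticate to Artifact Registry in your region
gcloud auth configure-docker REGION-docker.pkg.dev

docker build -t REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/FILE_NAME:TAG .
docker push REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/FILE_NAME:TAG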

Step 2: Run the Dataflow job in the custom container

To execute a Dataflow job using the custom container we have created, set the following pipeline options:

  • Use --sdkContainerImage to specify an SDK container image for your Java runtime.
  • Use --experiments=use_runner_v2 to enable Runner v2.

Example:

mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner \
--inputFile=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--gcpTempLocation=TEMP_LOCATION \
--diskSizeGb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--sdkContainerImage=IMAGE_URI"

Replace the following:

  • INPUT_FILE: the Cloud Storage input path read by Dataflow when running the example.
  • OUTPUT_FILE: the Cloud Storage output path written to by the example pipeline. This file contains the word counts.
  • PROJECT_ID: the ID of your Google Cloud project.
  • REGION: the region to deploy your Dataflow job in.
  • TEMP_LOCATION: the Cloud Storage path for Dataflow to stage temporary job files created during the execution of the pipeline.
  • DISK_SIZE_GB: Optional. If your container is large, consider increasing the default boot disk size to avoid running out of disk space.
  • IMAGE_URI: the SDK custom container image URI. Always use a versioned container SHA or tag. Don’t use the :latest tag or a mutable tag.

Step 3: Verify and Monitor

Once the job is submitted, you can monitor its progress through the Google Cloud Console. Make sure to check the logs to verify that the Cloud SQL Proxy is starting correctly and that connections to your database are functioning as expected.
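
For example, worker logs can be pulled from Cloud Logging and searched for the proxy's startup output. The filter below is a sketch; it assumes the proxy's messages land in the worker log's textPayload, and JOB_ID is the ID of your Dataflow job:

# Find the job ID of the submitted job
gcloud dataflow jobs list --region=REGION

# Search the job's worker logs for Cloud SQL Proxy output
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="JOB_ID" AND textPayload:"cloud-sql-proxy"' \
  --project=PROJECT_ID --limit=50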

Conclusion

Leveraging Cloud SQL Proxy on Dataflow workers using custom containers offers a secure and optimized approach to accessing your Cloud SQL databases. This configuration combines the flexibility of custom containers with the assurance of a secure connection to your database. By following the steps detailed in this blog, you can establish a robust pipeline capable of handling data processing tasks that require access to Cloud SQL.
