Connect Private Data Fusion instance with a Private Cloud SQL instance using CloudSQL proxy

Daniel Villegas
Google Cloud - Community
Dec 15, 2022

A common use case for Cloud Data Fusion is a data pipeline that needs to gather data from a database in a Cloud SQL instance. This is easy when both the Data Fusion and Cloud SQL instances are deployed as public instances. However, the best-practice recommendation is to deploy both as private instances, so that only private access through internal IP addresses is allowed. This lab provides the steps required to securely access a private Cloud SQL instance from a private Cloud Data Fusion instance using a SQL proxy.

Overview

Cloud SQL is Google Cloud’s managed relational database service; it supports the MySQL, PostgreSQL and SQL Server database engines. As a managed service, Google takes care of all the infrastructure management tasks required to provision the database, allocate storage, and run it, so that the user only needs to load their data and use the database.

Cloud Data Fusion is Google Cloud’s native low/no-code solution for creating ETL/ELT pipelines at any scale. It makes it easy to create and run data wrangling and transformation pipelines by abstracting away the complexity of writing code to run on top of a Cloud Dataproc cluster, providing the user with a point-and-click visual user interface instead.

Both of the services described above can be deployed publicly, in which case a public IP is assigned to a compute instance in your project, so connectivity between the two is straightforward.

A proxy is required because the Cloud SQL network is not directly peered with the Cloud Data Fusion network, and transitive peers cannot communicate with each other (for more information, see the VPC Network Peering section in the VPC documentation).

These are the tasks we will perform to reach a private Cloud SQL instance from a private Cloud Data Fusion instance and fetch data from it privately and securely. Although it’s outside the scope of this exercise, following these steps would also make it possible to reach both instances privately from your on-premises network, should you have set up private connectivity with GCP (with HA VPN or any other kind of private interconnectivity) beforehand.

What we will cover

  1. How to launch a Private Cloud SQL instance with a MySQL database
  2. How to launch a private Cloud Data Fusion instance
  3. How to launch a proxy Compute Engine VM instance to allow private connectivity between the Cloud SQL and Data Fusion instances
  4. How to create a Connection to the Private Cloud SQL instance using the MySQL JDBC driver available in the Cloud Data Fusion Hub

GCP Services used for the solution:

  • Google Cloud SQL.
  • Google Compute Engine.
  • Google Cloud Data Fusion.

Setup — Enable the required APIs

Make sure to enable all the APIs we’ll need for this exercise. In Cloud Shell, run the gcloud commands below to enable the following APIs (you may also do so by visiting each section in the GCP console one by one and enabling each API when you’re prompted to do so):

  • Data Fusion API
  • Compute Engine API
  • Cloud SQL API
  • Service Networking API
  • SQL Admin API
  • Network management API

Set initial environment variables by running the following command in cloud shell:

export PROJECT_ID=$(gcloud config get-value project)

Now, enable the APIs we will be using in this lab by running this command in cloud shell:

gcloud services enable \
datafusion.googleapis.com \
compute.googleapis.com \
sql-component.googleapis.com \
servicenetworking.googleapis.com \
sqladmin.googleapis.com \
networkmanagement.googleapis.com

Note: Whenever you need to enable an API, you can check the list of available services to find the one you need with the gcloud services list command. For example, the command below lists the APIs available in GCP, limiting the output to the first 1000:

gcloud services list --available --filter "NAME:googleapis" --limit 1000
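Once the enable command finishes, you can optionally confirm everything is on. A quick sanity check (a sketch, assuming your default gcloud project is set):

```shell
# Count how many of the six required APIs are enabled; once the
# enable command has finished this should print at least 6.
gcloud services list --enabled --format="value(config.name)" \
  | grep -c -E "datafusion|compute|sql-component|servicenetworking|sqladmin|networkmanagement"
```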

Create a private Cloud SQL MySQL instance:

Configuring Private Service Access for Cloud SQL

To create a private Cloud SQL instance (and a private Data Fusion instance as well), you are required to set up a Private Services Access connection before creating the instance itself. A Private Services Access connection allows the VPC networks you define in your GCP project to communicate internally with another VPC in a Google-managed project (also referred to as the “service producer project” or “tenant project”) where the private Cloud SQL instance infrastructure actually resides. This effectively allows you to connect from your project (and from whatever network your project is authorized and connected to) to the private Cloud SQL instance, which by nature doesn’t have a public IP and hence has no connectivity outside its local network.

When you create a Private Services Access connection, you allocate an IP range that will be used by Google to assign private IPs to the private instances. Google will create a subnet from the allocated range you defined to provision resources on a certain region. The minimum CIDR block you can allocate is /24, but the recommendation is to allocate at least a /16 block.

Effectively, this creates a VPC peering between the VPC in your GCP project and the VPC created from the IP range you allocated to be used in the Google-managed Service producer project.

The following diagram depicts how this works in a sample deployment:

Note that each Google service creates a subnet from the defined range in which to provision resources. The subnet’s IP address range is typically a /24 CIDR block that is chosen by the service and comes from the allocated IP address range. You can’t modify the service producer’s subnet. A service in GCP provisions new resources in existing regional subnets that were previously created by that service. If a subnet is full, the service creates a new subnet in the same region.

This means that this same subnet cannot be allocated to other services (such as Data Fusion) for local communication between the two services; an independent connection from the service producer GCP project to your GCP project is required for each one.

If you do not set up this connection prior to creating any private instance, you will face the following warning when trying to create the instance:

Note: Although this is not the case in this exercise, if your own deployment uses Shared VPC, you will also need to enable the Service Networking API in the Shared VPC host project. You may do so by asking the network administrator of that project to run the required gcloud command:

gcloud services enable servicenetworking.googleapis.com

For this lab, we will use the 10.30.1.0/24 range for the private Cloud SQL instance connection.

Allocate the IP address range for Cloud SQL

Run the following command in cloud shell (alternatively you can perform these steps using the UI in the console):

gcloud compute addresses create google-managed-services-default \
--global \
--purpose=VPC_PEERING \
--addresses=10.30.1.0 \
--prefix-length=24 \
--network=projects/$PROJECT_ID/global/networks/default

Once created, you can check the IP allocation in the INTERNAL IP ADDRESSES sub-section in the VPC networks section:
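You can do the same check from the command line; this is a sketch assuming the allocation name used above:

```shell
# List global addresses reserved for VPC peering; the
# google-managed-services-default entry should show 10.30.1.0 with prefix 24.
gcloud compute addresses list --global \
  --filter="purpose=VPC_PEERING" \
  --format="table(name, address, prefixLength, network)"
```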

Create the private connection

Now that you have created the IP allocation, create the VPC peering connection between the VPC in your project and the IP range you just allocated. Run the following command in Cloud Shell:

gcloud services vpc-peerings connect \
--service=servicenetworking.googleapis.com \
--ranges=google-managed-services-default \
--network=default \
--project=$PROJECT_ID

After a few minutes, the operation will finish.

You should also be able to see the VPC peering in the cloud console, specifically, in the VPC network peering section.

Create the private Cloud SQL instance by referring to the Private Services Access connection created in the previous step

First, make sure all the required environment variables are set:

export INSTANCE_ID=test-private-cloudsql
export PROJECT_ID=$(gcloud config get-value project)
export VPC_NETWORK_NAME=default
export RANGE_NAME=google-managed-services-$VPC_NETWORK_NAME
export DB_PASSWORD=[YOUR_DB_PASSWORD]

Run the following command in Cloud shell to start the private Cloud SQL instance provisioning process. This will take a few minutes:

gcloud beta sql instances create $INSTANCE_ID \
--project=$PROJECT_ID \
--network=projects/$PROJECT_ID/global/networks/$VPC_NETWORK_NAME \
--database-version=MYSQL_8_0 \
--cpu=2 \
--memory=4GB \
--region=us-central1 \
--no-assign-ip \
--root-password=$DB_PASSWORD \
--allocated-ip-range-name=$RANGE_NAME

After a few minutes, the instance reaches the RUNNABLE state, meaning it’s ready to be used:

In the cloud console, you should also be able to see that the instance has been created, that it picked its private IP address from the allocation we created earlier, and that it has been attached to the peering using Private Services Access:

Create a private Data Fusion instance

Grant the required permissions on the Google-managed Data Fusion service account so that it can spawn Dataproc clusters.

We need to grant the Data Fusion service account permission to launch Dataproc clusters; if we don’t, our Data Fusion pipelines will fail when we try to run them later. When you go to the CREATE INSTANCE wizard in the Data Fusion section of the UI, you will see a warning like the one depicted below if you have not granted Data Fusion permission on the service account with which the Dataproc clusters will be created (by default this is the Compute Engine service account; you can check it in the Service Accounts section of the IAM & Admin page, and its format looks like [PROJECT_NUMBER]-compute@developer.gserviceaccount.com).

You may just click the GRANT PERMISSION button (make sure the Compute Service account is selected in the dropdown box below this warning):

Alternatively, you can follow this documentation and grant the required permission using the UI in the IAM section in the Cloud Console, or you can execute the following command in Cloud Shell:

(Optional) First obtain the Service Agent email address:

export PROJECT_ID=$(gcloud config get-value project)
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
export DATAFUSION_SA=service-$PROJECT_NUMBER@gcp-sa-datafusion.iam.gserviceaccount.com
export COMPUTE_SA=$PROJECT_NUMBER-compute@developer.gserviceaccount.com

(Optional) Now, grant the required permissions to the agent:

gcloud iam service-accounts add-iam-policy-binding \
$COMPUTE_SA \
--member="serviceAccount:$DATAFUSION_SA" \
--role="roles/iam.serviceAccountUser"
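To double-check that the binding took effect, you can list who holds the serviceAccountUser role on the Compute Engine service account (a sketch, reusing the variables exported above):

```shell
# Show the members that hold roles/iam.serviceAccountUser on the Compute SA;
# the Data Fusion service agent should be among them.
gcloud iam service-accounts get-iam-policy $COMPUTE_SA \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/iam.serviceAccountUser" \
  --format="value(bindings.members)"
```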

Note: If you see this prompt when navigating to the Data Fusion section, you probably have not enabled the API yet:

Deploy the private Data Fusion instance

In this exercise, we will use the Data Fusion API to create a private instance.

First set some environment variables that will be used later when invoking the API:

export PROJECT_ID=$(gcloud config get-value project)
export LOCATION="us-central1"
export DATA_FUSION_API_NAME=datafusion.googleapis.com
export DF_INSTANCE_ID=test-private-datafusion

Allocate the IP address range for Data Fusion

The same way we did when creating the private Cloud SQL instance, we have to allocate an IP range for Data Fusion. By default, Data Fusion uses a /22 range to deploy its internal components in the tenant project; in this case the range 10.40.0.0/22 will be used. Run this command to create the IP allocation:

gcloud compute addresses create google-managed-services-default-df \
--global \
--purpose=VPC_PEERING \
--addresses=10.40.0.0 \
--prefix-length=22 \
--network=projects/$PROJECT_ID/global/networks/default

Create a Data Fusion instance using the API

To create the instance using the API, run the following cURL command in Cloud shell:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://$DATA_FUSION_API_NAME/v1/projects/$PROJECT_ID/locations/$LOCATION/instances?instance_id=$DF_INSTANCE_ID" \
-d '{"description": "Private CDF instance created through REST.", "type": "BASIC", "privateInstance": true, "networkConfig": {"network": "default", "ipAllocation": "10.40.0.0/22"}}'

The API receives the request and begins processing it:

We can see in the Data Fusion section in GCP the instance being created:

Note: The instance creation will take approximately 20 minutes.

Once the instance creation process is over, we will have both Cloud SQL and Data Fusion private instances up and running.

From here, by clicking on the instance name, we can see the details with which the instance was created. Note that it used the IP address allocation we reserved earlier (10.40.0.0/22).

While on the instance details page, TAKE NOTE OF THE TENANT_PROJECT_ID in which the Data Fusion instance is running; we will need it in the next step. You can find it in the service account field, between the @ sign and the next dot, right before “.iam.gserviceaccount.com”; it usually ends with “-tp” (in the example above it would be “eh3a2c2bf9e20v09d-tp”).
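If you prefer the command line, the tenant project can also be extracted from the instance’s service account field (a sketch using the gcloud beta data-fusion surface and the variables exported earlier):

```shell
# Describe the instance and pull the tenant project out of the service
# account email (the portion between "@" and ".iam.gserviceaccount.com").
export DF_SA=$(gcloud beta data-fusion instances describe $DF_INSTANCE_ID \
  --location=$LOCATION --format="value(serviceAccount)")
export TENANT_PROJECT=$(echo $DF_SA | sed 's/^.*@\(.*\)\.iam\.gserviceaccount\.com$/\1/')
echo $TENANT_PROJECT
```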

Create the VPC peering between the local VPC and the range allocated for Data Fusion

Because the peering created during the Cloud SQL instance setup is for Private Services Access, and Data Fusion doesn’t use this but rather requires a different VPC peering, we cannot use the same gcloud services vpc-peerings connect command we used earlier to create this new connection. Cloud Data Fusion uses a VPC peering between the local VPC in your project and the Google-managed tenant project where the Data Fusion instance runs in its own remote VPC. In order to create the VPC peering between your local network and the network where Data Fusion is running, we need to gather two things:

1) The remote project ID where the Data Fusion instance runs (called “Tenant Project”, we retrieved it in the previous step from the Data Fusion instance details page).

2) The remote network in that tenant project: it’s generated automatically upon private Data Fusion instance creation and follows the naming convention <REGION>-<DATA_FUSION_INSTANCE_NAME>.

Run the following commands in cloud shell to create the VPC peering between your local VPC and the peer VPC in the tenant project where Data Fusion runs:

NOTE: Remember to replace the [YOUR-TENANT-PROJECT-ID] value with the tenant project id retrieved in the previous step.

export TENANT_PROJECT=[YOUR-TENANT-PROJECT-ID]
export LOCATION="us-central1"

gcloud compute networks peerings create peering-private-datafusion \
--network=default \
--peer-project $TENANT_PROJECT \
--peer-network $LOCATION-$DF_INSTANCE_ID

You should see now the new peering in the VPC network peering section in the console:
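You can also check from the command line that the peering exists and has become ACTIVE (a sketch; the peer network name follows the <REGION>-<DATA_FUSION_INSTANCE_NAME> convention noted above):

```shell
# List peerings on the default network; peering-private-datafusion
# should appear with state ACTIVE once both sides are up.
gcloud compute networks peerings list --network=default
```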

Enable Private Google Access in our local VPC subnet

For the Data Fusion pipelines to run properly, we need to enable Private Google Access in the subnet peered to both Cloud SQL and Data Fusion. Otherwise, when we run a pipeline in Data Fusion, the pipeline will get stuck on trying to provision the Dataproc cluster.

Run the following commands in Cloud Shell:

export SUBNET_NAME=default
export REGION=us-central1
gcloud compute networks subnets update $SUBNET_NAME \
--region=$REGION \
--enable-private-ip-google-access
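To confirm the setting took effect, you can read the flag back (a sketch, reusing the variables above):

```shell
# Should print "True" once Private Google Access is enabled on the subnet.
gcloud compute networks subnets describe $SUBNET_NAME \
  --region=$REGION \
  --format="value(privateIpGoogleAccess)"
```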

Navigate to the Cloud SQL section in the console and take note of the IP address assigned to the Cloud SQL instance (in this case it was assigned 10.30.1.3):
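Alternatively, you can capture the private IP from the command line (a sketch, assuming the $INSTANCE_ID variable exported earlier):

```shell
# Grab the instance's private IP; it should fall inside the
# 10.30.1.0/24 range allocated for Private Services Access.
export SQL_IP=$(gcloud sql instances describe $INSTANCE_ID \
  --format="value(ipAddresses[0].ipAddress)")
echo $SQL_IP
```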

Creating the CloudSQL proxy VM

Since both the Cloud SQL and Data Fusion instances have peerings against the local VPC in the GCP project (depicted in the diagram below with the default subnet range for us-central1, 10.128.0.0/20), both instances will be reachable from that network (meaning any resource, such as a VM, communicating from that 10.128.0.0/20 network should be able to reach both the Cloud SQL and Data Fusion private instances independently).

However, because VPC peerings do not propagate routes beyond the directly peered networks, the Cloud SQL instance cannot communicate directly with the Data Fusion instance: the path would have to transit two peerings.

The diagram above shows that, through the Private Services Access connections created in the previous steps, two consecutive peerings lie in the network path required for Data Fusion to reach Cloud SQL, so Data Fusion cannot reach it, because VPC peerings are non-transitive by design. To solve this problem, a proxy VM needs to be deployed within the local VPC (10.128.0.0/20) to relay traffic between the two networks (i.e. Data Fusion’s 10.40.0.0/22 and Cloud SQL’s 10.30.1.0/24). The diagram below shows what this setup will look like:

Exporting initial variables

First, export the variables we will be using in the next steps:

export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1
export ZONE=$(gcloud compute zones list --filter="region:${REGION}" --limit 1 --format="value(name)" --project=${PROJECT_ID})
export NETWORK=default
export SUBNET=default
export INSTANCE_NAME=cloudsql-proxy
export SQL_CONN=private-datafusion-to-cloudsql:us-central1:test-private-cloudsql
export CDF_IP_RANGE=10.40.0.0/22
export VM_IMAGE=$(gcloud compute images describe-from-family cos-stable --project=cos-cloud --format="value(name)")
export SQL_PORT=3306

Important Note: To obtain the value for the “SQL_CONN” environment variable, take it from the “Connection Name” field on the instance details page in the Cloud SQL section of the GCP console:

Create a firewall rule to allow Data Fusion ingress traffic. Run the following command in the Cloud Shell:

gcloud compute firewall-rules create allow-private-cdf \
--allow=tcp:22,tcp:${SQL_PORT} \
--source-ranges=$CDF_IP_RANGE --network=$NETWORK --project=$PROJECT_ID

Your output should look like this:

You should be able to see the new Firewall rule created in the console named “allow-private-cdf” and allowing ingress traffic in ports 22 & 3306:

Create the proxy VM with the following gcloud command in Cloud shell:

gcloud compute --project=${PROJECT_ID} instances create ${INSTANCE_NAME} \
--zone=${ZONE} \
--machine-type=g1-small \
--subnet=${SUBNET} \
--metadata=startup-script="docker run -d -p 0.0.0.0:${SQL_PORT}:${SQL_PORT} gcr.io/cloudsql-docker/gce-proxy:latest /cloud_sql_proxy -instances=${SQL_CONN}=tcp:0.0.0.0:${SQL_PORT}" \
--maintenance-policy=MIGRATE \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--image=${VM_IMAGE} \
--image-project=cos-cloud

You should see an output like this:
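Before moving on, you can optionally confirm that the proxy container started on the VM (it may take a minute or two after boot for COS to pull the image):

```shell
# List running containers on the proxy VM; the gce-proxy image should appear.
gcloud compute ssh ${INSTANCE_NAME} --project=${PROJECT_ID} --zone=${ZONE} \
  --command='docker ps --filter ancestor=gcr.io/cloudsql-docker/gce-proxy:latest'
```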

Now, export a variable to hold the internal IP address assigned to the VM so that we can make it a static IP reserved for this machine. Run this in a Cloud Shell:

export IP=$(gcloud compute instances describe ${INSTANCE_NAME} \
--project=${PROJECT_ID} --zone=${ZONE} \
--format="value(networkInterfaces[0].networkIP)")

Run this command to reserve the VM’s IP:

gcloud compute --project=${PROJECT_ID} addresses create mysql-proxy \
--addresses ${IP} --region ${REGION} --subnet ${SUBNET}

You can see the IP listed as a static internal IP in the IP addresses list of the VPC network section in the console (you can also see the external IP of the VM, listed as an ephemeral IP):

Note that this internal static IP is the one you are going to use to establish connections to the Cloud SQL instance from now on.
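A quick way to sanity-check that the proxy is accepting connections on that IP before going back to Data Fusion (a sketch; bash’s /dev/tcp redirection avoids needing a MySQL client on the VM):

```shell
# Open a TCP connection to port 3306 from within the proxy VM itself;
# "OPEN" is printed only if the proxy is listening.
gcloud compute ssh ${INSTANCE_NAME} --project=${PROJECT_ID} --zone=${ZONE} \
  --command='timeout 5 bash -c "cat < /dev/null > /dev/tcp/127.0.0.1/3306" && echo OPEN'
```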

Connecting from Data Fusion to Cloud SQL

Navigate to the Cloud Data Fusion section in the GCP console, click on the “View Instance” action URL to access the Data Fusion UI.

The URL should look like this:

https://test-private-datafusion-private-datafusion[PROJECT_NUMBER]-dot-usc1.datafusion.googleusercontent.com/

Install the JDBC connector for MySQL

From the Cloud Data Fusion UI, click the “No, Thanks” button in the Welcome popup:

Navigate to the Data Fusion HUB by clicking in the HUB button on the upper right side of the screen:

In the HUB, in the search bar at the top, search for “cloudsql” and select the “CloudSQL MySQL JDBC Driver” from the list of available plugins, click on it:

We’re prompted to download the driver’s .jar file and then upload it to Data Fusion so it can be deployed and made available for use in a data pipeline. Click on the Download link below the first step.

Follow the provided instructions and download the .jar file from the linked Google Cloud Platform GitHub repository to your local machine. The file should be similar to “mysql-socket-factory-connector-j-8-1.0.16-jar-with-driver-and-dependencies.jar”.

Once you have downloaded the file, click on the “Finish” button:

Now it’s time to upload the .jar file with the plugin to your Data Fusion instance. Click on the “Deploy” link next to the link you clicked in the previous step:

Provide the .jar file you downloaded to your local machine and click “Next”

Leave everything by default and click on the “Finish” button.

The deployment process will begin (it can take up to a minute). When it’s finished you will be presented with a success popup and a prompt to either “Create a Pipeline” or “Go to homepage”. Ignore the button, click on the “Go to homepage” link next to it.

Test connectivity to the MySQL database in Cloud SQL using the JDBC plugin

From the Cloud Data Fusion UI home page, click the “No, Thanks” button in the Welcome popup (if it appears again). From here, test creating a simple data pipeline by adding a Wrangler task to the canvas. Click on the “Wrangle” card.

Then click in the ADD CONNECTION button at the bottom of the left menu:

Select the CloudSQL MySQL connection from the list in the middle of the screen:

Enter a name for the new connection, select the driver you deployed earlier in the JDBC Driver field, fill in the rest of the required inputs according to the table below, and then click on the TEST CONNECTION button at the bottom:
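The exact inputs appear in a screenshot table in the original post; as a rough guide, the values used in this lab look like the sketch below (all of these are examples — crucially, the Connection Name must be the proxy VM’s static internal IP, not the Cloud SQL connection name):

```
Name:            cloudsql-mysql-private          (any name you like)
JDBC Driver:     (the driver deployed from the Hub)
Connection Name: [PROXY_VM_STATIC_INTERNAL_IP]   (an IP in 10.128.0.0/20)
Port:            3306
Username:        root
Password:        [YOUR_DB_PASSWORD]
```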

If everything went fine, you should see the connection attempt being successful, as in the image below:

DONE !! You connected Data Fusion with Cloud SQL through their private interfaces.

If you face an error like the one in the image below when hitting the TEST CONNECTION button, you likely have not configured the peerings correctly or are missing a firewall rule. Remember to put the SQL proxy VM’s internal IP in the Connection Name field; using the Connection Name from Cloud SQL or its private IP won’t work:

Congratulations!

You have securely connected a Private Data Fusion instance with a Private Cloud SQL database instance.

What we’ve covered

  • How to launch a Private Cloud SQL instance with a MySQL database
  • How to launch a private Cloud Data Fusion instance
  • How to launch a proxy Compute Engine VM instance to allow private connectivity between the Cloud SQL and Data Fusion instances
  • How to create a Connection to the Private Cloud SQL instance using the MySQL JDBC driver available in the Cloud Data Fusion Hub

Useful Documentation

Learn about Cloud SQL Private IP: https://cloud.google.com/sql/docs/mysql/private-ip?hl=en

Configure Private Services Access for CloudSQL: https://cloud.google.com/sql/docs/mysql/configure-private-services-access

Configure Private IP for Cloud SQL: https://cloud.google.com/sql/docs/mysql/configure-private-ip?hl=en

Create a Private Data Fusion instance:

https://cloud.google.com/data-fusion/docs/how-to/create-private-ip#curl

Granting Data Fusion Service account permission:

https://cloud.google.com/data-fusion/docs/how-to/granting-service-account-permission?_ga=2.97242134.-86038369.1663269464

Cloud Data Fusion Service Accounts:

https://cloud.google.com/data-fusion/docs/concepts/service-accounts#service_account_table

VPC peering:

https://cloud.google.com/vpc/docs/using-vpc-peering


Data & ML Customer Engineer at Google, enabling the onboarding of customers to Google Cloud.