Using HAProxy with Cloud Data Fusion to Navigate Complex Network Topologies

Aaron Pestel
8 min read · Sep 4, 2021


One of the most challenging aspects of any integration effort is networking, and Google Cloud Data Fusion (CDF) is no different. VPCs, VPC peering, subnets, VPNs, Interconnects, network routes, and firewall rules are not always as easy to configure correctly as they look on that nice network diagram in front of you.

In this blog, we are going to talk through a few common network topology challenges that users can encounter with CDF (or any integration technology) and how HAProxy can help mediate between networks that are difficult to connect directly.

First, let’s look at a typical network diagram for a Private IP instance of CDF connecting to a database in the local VPC.

Two VPCs with Peering

This is the simple setup. We can peer the CDF Tenant Project and the Customer Project (exporting and importing routes, as sketched below), and then the CDF Development Infrastructure (Studio, Wrangler, Preview, etc.) can access the database on the GCE VM, as can the CDF pipelines running on Dataproc VMs. If only life were always this simple…
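
As a rough illustration, creating that peering from the customer side with gcloud can look like the snippet below. This is a minimal sketch: the project and network names are placeholders, and the tenant project network name (which Google manages) is an assumption here.

## A minimal sketch of peering the Customer Project VPC to the CDF
## Tenant Project VPC. All project and network names are placeholders.
gcloud compute networks peerings create cdf-peering \
--network=customer-vpc \
--peer-project=cdf-tenant-project \
--peer-network=us-central1-my-cdf-instance \
--export-custom-routes \
--import-custom-routes

Let’s look at another common case where there are three VPCs involved.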

Three VPCs with Peering and a Proxy (Cloud SQL Proxy)

In this case, we see the need for a “Proxy” in the Customer Project VPC. Why is that? In GCP, VPC peering is not transitive. This means that a project can route to the VPCs it is peered with, but it can NOT route to the next level of VPCs through another peering. For example, VPC 1 can route to VPC 2 through a peering, and VPC 2 can route to VPC 3 through a peering, but VPC 1 can NOT route to VPC 3 by leveraging the peering between VPC 2 and VPC 3.

So in the diagram above, the CDF pipelines running on Dataproc in the Customer Project VPC can access the Cloud SQL database. However, the CDF infrastructure (Studio, Wrangler, Preview, etc.) can NOT access the database in the Cloud SQL Tenant Project unless it goes through a proxy that sits inside the Customer Project VPC, as shown above. That proxy can be the Cloud SQL Proxy (if the database is on Cloud SQL); if the third VPC were instead an additional Customer Project VPC (not Cloud SQL) running its own database, we could do the proxying with HAProxy, as illustrated below:

Three VPCs with Peering and a Proxy (HAProxy)
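
For the Cloud SQL variant mentioned above, a minimal sketch of running the (legacy v1) Cloud SQL Proxy on a GCE VM in the Customer Project VPC follows; the instance connection name and port are placeholders:

## Download the Cloud SQL Proxy (v1) on a VM in the Customer Project VPC
wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
chmod +x cloud_sql_proxy
## Listen on all interfaces so the CDF Tenant Project tools can reach the
## proxy, and connect to the Cloud SQL instance over its private IP
./cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:0.0.0.0:1433 \
-ip_address_types=PRIVATE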

Below is another common scenario where CDF needs to connect to an on-prem database.

Two VPCs and a VPN to On-Prem

In the scenario above, a proxy is not actually necessary in the Customer Project VPC because there is a VPN between the Customer Project VPC and the customer’s on-prem network. Unlike a second peering, this VPN connection is effectively transitive: its routes can be exported across the peering. The challenge is that this is not always easy to configure. Here are some things that cause problems:

  • The VPN on-prem routes have to be exported over the VPC peering (see the sketch after this list). Just because the CDF Tenant Project VPC is peered with the Customer Project VPC does not mean the peer knows that the Customer Project VPC has additional routes to the on-prem network.
  • The CDF Tenant Project VPC CIDR must be added as a route in the VPN. We typically think of communication going from the CDF Tenant Project (the client) to the customer’s on-prem database (the server). While that is true, communication has to go back the other way as well. When the CDF Tenant Project tries to open a connection to the on-prem database, it routes TCP packets that way; if the database cannot route the return packets (i.e., it does not know how to route packets back to the CDF Tenant Project through the VPN), then the database can never respond and a connection can never be established.
  • The Customer Project VPC firewall and the on-prem firewalls must allow traffic from the CDF Tenant Project VPC CIDR (the address range used by the CDF Tenant Project infrastructure), or connections will not be established.
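
For the first item, a minimal sketch of exporting the Customer Project VPC’s custom routes (which include the VPN routes to on-prem) across the peering; the peering and network names are placeholders:

## Export custom routes (including VPN routes to on-prem) across the
## peering so the CDF Tenant Project VPC learns them
gcloud compute networks peerings update cdf-peering \
--network=customer-vpc \
--export-custom-routes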

Often we see situations where a user says, “We can connect to the on-prem database from the Customer Project VPC, but not from the CDF Tenant Project tools (Studio, Wrangler, Preview).” Almost always, that is a problem with one of the three items listed above. Users will often choose to install an HAProxy GCE VM in the Customer Project VPC even in this VPN case, if for no other reason than testing, and that is a perfectly fine thing to do. The CDF Tenant Project tools (Studio, Wrangler, and Preview) only need to connect to the database for development-type tasks (getting schemas, testing connections, creating Wrangler recipes, etc.). Having them connect to the database through the proxy while the pipeline itself (on Dataproc) connects directly to the database is a fine configuration.

For whatever reason you might choose to install an HAProxy VM in the Customer Project, we’re going to show how to do that below. It’s quite simple once you know how to do it, so let’s get started…

There are really only four things we need to do:

  • Launch the HAProxy GCE VM with a startup script to configure it
  • Configure a firewall rule to allow the CDF Tenant Project to connect to the HAProxy GCE VM
  • [Optional] Test the proxy by SSH’ing into it and running a database client
  • Configure CDF Tenant Project tools (Studio, Wrangler, etc.) to use the private IP address of the HAProxy VM instead of the actual database IP address

Launch the HAProxy GCE VM

Here is a simple snippet you can run in Cloud Shell to launch the GCE VM. The snippet stores the startup script in an environment variable; the startup script installs HAProxy, sets up the frontend/backend proxy configuration, and restarts the HAProxy daemon.

Note: Make sure you modify the server mssqlserver line in the script to use the private IP address and port of your target database.

########################################################
## Store startup script in STARTUP_SCRIPT variable
########################################################
read -r -d '' STARTUP_SCRIPT << EOM
########################################################
## Install HAProxy
########################################################
apt-get update
apt-get install -y haproxy
## Configure the proxy to our backend database
cat <<EOF >> /etc/haproxy/haproxy.cfg
frontend tcp-in-mssql
    bind *:1433
    mode tcp
    use_backend mssql
backend mssql
    mode tcp
    server mssqlserver 10.18.0.3:1433 check
EOF
########################################################
## Restart the proxy
########################################################
service haproxy restart
EOM

gcloud compute instances create haproxy7 \
--zone=us-central1-a \
--tags=haproxy \
--metadata=startup-script="$STARTUP_SCRIPT"
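
If you want to confirm the startup script ran before moving on, one option (assuming the instance name used above) is to check the serial console output or the HAProxy service status:

## Check the startup script's output in the serial console log
gcloud compute instances get-serial-port-output haproxy7 --zone=us-central1-a
## Or SSH in and confirm the haproxy service is running
gcloud compute ssh haproxy7 --zone=us-central1-a --command="sudo systemctl status haproxy"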

Add a Firewall Rule to allow CDF to access the Proxy VM

You’ll note that in the step above, we added a “tag” of haproxy to the GCE VM we launched. We did that so that our firewall rule can use that tag as a target. You’ll also need the IP “source-ranges” from your Data Fusion Private IP CIDR range, which you can get from the Data Fusion instance details page in the GCP console.
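
You can also fetch that range from the command line. This is a sketch, assuming a private instance named my-instance and that the ipAllocation field of the instance’s network config holds the CIDR:

## Print the private IP CIDR range reserved for the CDF instance
gcloud beta data-fusion instances describe my-instance \
--location=us-central1 \
--format="value(networkConfig.ipAllocation)"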

Once you have that range, you can create the firewall rule to allow the CDF Tenant Project to access your HAProxy VM via this snippet that you can run in Cloud Shell.

gcloud compute firewall-rules create haproxy \
--direction=INGRESS --priority=1000 --network=default \
--action=ALLOW --rules=tcp:1433 \
--source-ranges=10.55.36.0/22 --target-tags=haproxy

Test the Proxy

It is a good idea to test the proxy before trying to connect to it from CDF. If you SSH into the proxy VM, here are some example commands you can run to install a database client and connect to the database. These commands are specific to SQL Server, but this proxy mechanism works for any database (SQL Server, MySQL, Postgres, Oracle, etc.). Note that if you connect to 127.0.0.1 (as in the snippet below), you’ll be connecting through the proxy; you can also use the private IP address of the database itself. Both should behave exactly the same from the proxy VM if the proxy is working correctly. If you can connect to the 127.0.0.1 proxy endpoint as if it were your database, you can be confident the proxy is working and will continue to work when the CDF Tenant Project tools (Studio, Wrangler, Preview, etc.) connect to it.

curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list
sudo apt-get update
sudo apt-get install -y mssql-tools unixodbc-dev
/opt/mssql-tools/bin/sqlcmd -U sqlserver -P G00gl3! -S "127.0.0.1,1433" -d master -Q "create database mydatabase"
/opt/mssql-tools/bin/sqlcmd -U sqlserver -P G00gl3! -S "127.0.0.1,1433" -d mydatabase -Q "create schema sqlserver"
/opt/mssql-tools/bin/sqlcmd -U sqlserver -P G00gl3! -S "127.0.0.1,1433" -d mydatabase -Q "create table sqlserver.mytable (col1 int)"
/opt/mssql-tools/bin/sqlcmd -U sqlserver -P G00gl3! -S "127.0.0.1,1433" -d mydatabase -Q "insert into sqlserver.mytable values (1)"
/opt/mssql-tools/bin/sqlcmd -U sqlserver -P G00gl3! -S "127.0.0.1,1433" -d mydatabase -Q "select * from sqlserver.mytable"
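
If you only need a quick, database-agnostic check that the proxy is listening and forwarding, a simple sketch (assuming netcat is available or installable on the VM):

## Quick TCP-level check that the proxy accepts connections
sudo apt-get install -y netcat
nc -vz 127.0.0.1 1433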

Connect CDF to the Proxy

The last step is to connect CDF to the database. This uses the exact same settings in the CDF database source, except that the IP address used is the private IP address of the proxy VM instead of the IP address of the destination database.
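
For example, with a JDBC-based SQL Server source, the connection string would simply point at the proxy; a sketch, where 10.128.0.5 stands in for the proxy VM’s private IP:

jdbc:sqlserver://10.128.0.5:1433;databaseName=mydatabase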

Final Thoughts

As we have discussed above, there may be many situations where you have difficulty connecting the CDF developer tools (Studio, Wrangler, Preview) to a remote database even though you can connect to the same database from a GCE VM in your project. Whether that is because a direct connection is not actually possible (non-transitive peering) or because you are struggling to get all the networking correct (complex on-prem VPNs), installing an HAProxy VM in the Customer Project gives you a way to start connecting your CDF tools to your database.

Some have asked whether the CDF pipelines running on Dataproc should also connect through the HAProxy or connect directly to the database. There is not a “right” answer to that. There are positives to funneling all the connections (CDF developer connections and CDF production pipeline connections) to the end database through a common proxy (firewall simplicity, limiting total connections, etc.). There are also positives to having the production pipeline go directly to the destination database without an additional hop (less latency and fewer points of failure). You need to weigh the trade-offs and make the best decision for your particular use case and environment.


Aaron Pestel

Google Cloud engineer, data specialist. Posts and opinions are my own.