Making “docker buildx” Fast in GCP Using arm64 VMs

Doug Donohoe
15 min read · Jan 25, 2023


A sailboat docked beneath a colorful wall

I wrote this article to share some hard-earned knowledge about creating multi-architecture Docker builds. I explain why these images are needed and summarize the lessons I’ve learned making them with docker buildx. Then I give a step-by-step tutorial explaining how to use an arm64 VM to speed up docker buildx builds on Google Cloud Build.

Apple Silicon

The conventional thinking has been this: The vast majority of cloud provider hardware runs on amd64 (aka Intel64 or x86_64) chips. Most developers work on Mac, Linux, or Windows machines that also use amd64 chips. Thus, if one is building container-based cloud software, it is sufficient to have Docker workflows that only build images for amd64.

That thinking was upended when Apple released their Arm-based “Apple Silicon” chips (aka the M1 and M2). It is now desirable and often necessary to have Docker images which support both amd64 and arm64 architectures.

In addition, cloud providers have begun offering arm64 hardware (e.g., Ampere chips at GCP/Azure and Annapurna Labs Graviton chips at AWS). There might be certain workloads that perform better on arm64 hardware.

Many open-source and internal company images still support only a single architecture, amd64. While it is possible to run amd64 images on an arm64 host (or vice-versa), it can be very, very slow due to QEMU emulation, which translates machine code from one chip architecture to another on the fly. Slow performance is most likely to cause issues when running more complex software like a database or an in-memory cache. One might find that integration tests take a painful amount of time to complete and/or are flaky due to timeouts.
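To check whether an image already supports multiple architectures, you can inspect its manifest list; for example (alpine here is just an arbitrary public image):

docker buildx imagetools inspect alpine:3.15
# The output lists the image's manifests and their platforms,
# e.g. linux/amd64, linux/arm64, linux/arm/v7, and so on.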

For all of these reasons, now more than ever, it is necessary to create multi-architecture images.

Building Multi-Architecture Images

Building multi-architecture images is most easily done using the docker buildx command (the alternative is to stitch together individually built images using docker manifest create TAG --amend AMD_TAG --amend ARM_TAG && docker manifest push TAG).
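For completeness, the manifest-based alternative expands to something like the following hedged sketch (registry, image names, and tags are illustrative):

# Build and push each architecture-specific image, typically on a host of the
# matching architecture (or under emulation):
docker build --tag registry.example.com/app:1.0-amd64 .
docker push registry.example.com/app:1.0-amd64
docker build --tag registry.example.com/app:1.0-arm64 .   # on an arm64 host
docker push registry.example.com/app:1.0-arm64
# Stitch the two into a single multi-architecture tag:
docker manifest create registry.example.com/app:1.0 \
  --amend registry.example.com/app:1.0-amd64 \
  --amend registry.example.com/app:1.0-arm64
docker manifest push registry.example.com/app:1.0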

For the purposes of this article, I’ll briefly summarize the lessons I learned getting docker buildx to work. The companion git repo has real, working examples and in-depth documentation, so I encourage you to explore it.

Base Images

One has to make sure that the base image is multi-architecture and all RUN commands take architecture into account. For example, see how the gcloud installation is handled in Dockerfile-build or how the gcc x86 patch is handled in Dockerfile-odb.
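As an illustrative sketch (not the actual contents of those Dockerfiles), the shell inside such a RUN step typically branches on the TARGETARCH build argument that buildx supplies once the Dockerfile declares ARG TARGETARCH:

# Illustrative only: map TARGETARCH (set by buildx when the Dockerfile
# declares `ARG TARGETARCH`) to the string a downloaded tool expects.
case "${TARGETARCH}" in
  amd64) TOOL_ARCH="x86_64" ;;
  arm64) TOOL_ARCH="aarch64" ;;
  *) echo "unsupported architecture: ${TARGETARCH}" >&2; exit 1 ;;
esac
curl -fsSL "https://example.com/some-tool-linux-${TOOL_ARCH}.tar.gz" | tar -xz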

Caching

Caching intermediate results can speed up subsequent builds substantially. A clean build might take 30 minutes, but a subsequent build with no changes to the Dockerfiles might only take two. Here’s the full command to build a runtime image with caching to/from Artifact Registry:

docker buildx build --file Dockerfile-runtime \
--platform linux/amd64,linux/arm64 \
--builder builder-local \
--progress plain \
--build-arg DOCKER_REPO=us-docker.pkg.dev/multi-arch-docker/docker-dev \
--build-arg ALPINE_VERSION=3.15 \
--cache-from type=registry,ref=us-docker.pkg.dev/multi-arch-docker/docker-dev/cache/runtime:3.15 \
--cache-to type=registry,ref=us-docker.pkg.dev/multi-arch-docker/docker-dev/cache/runtime:3.15,mode=max \
--pull --push \
--tag us-docker.pkg.dev/multi-arch-docker/docker-dev/runtime:3.15 .

Notice the use of --cache-from/--cache-to pointing at a dedicated cache path in the Docker repository. Also note the use of mode=max, which ensures all layers are cached.

Builder Name

If running concurrent docker buildx builds in Cloud Build, it is necessary to have a unique builder name for each step, otherwise you get an “already in use” error. See the note in the Makefile for details and the build steps in pr.yaml for how to set the name.

Long Build Push Error

On Google Cloud Build (and possibly other CI/CD platforms), you may encounter errors when pushing the final image if the build lasts longer than an hour (this is different from the Cloud Build timeout property, which can be set longer than an hour). I have fully documented the issue in my build-timeout repo.

Happily, there is a workaround, which is to split the docker buildx into two phases. The docker-buildx.sh script does just this and the odb build is an example of its use.
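I won’t reproduce docker-buildx.sh here; the following is only my hedged sketch of what such a two-phase split can look like (registry paths and tags are placeholders), not the script itself: a first invocation that warms the registry cache without pushing the image, followed by a second, much shorter invocation that pushes.

# Phase 1: build both architectures and populate the registry cache (no --push).
docker buildx build --platform linux/amd64,linux/arm64 \
  --cache-from type=registry,ref=REPO/cache/odb:TAG \
  --cache-to type=registry,ref=REPO/cache/odb:TAG,mode=max \
  --file Dockerfile-odb .
# Phase 2: rebuild from the warm cache and push; since almost everything is
# cached, this phase finishes quickly.
docker buildx build --platform linux/amd64,linux/arm64 \
  --cache-from type=registry,ref=REPO/cache/odb:TAG \
  --file Dockerfile-odb --push --tag REPO/odb:TAG .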

Emulation Woes

Due to QEMU emulation, actually building an image for a non-native architecture can also be quite slow. I’ve witnessed arm64 C/C++ builds take four to five times longer on amd64 hosts (and vice-versa). This is very visible when building on Google Cloud Build, which only has amd64 workers.

Fortunately, there is a way to speed things up — use an arm64 VM to do the arm64 half of the build. It isn’t exactly straightforward to make this work, so the remainder of this article explains precisely how to enable this in GCP and Google Cloud Build.

While this is specific to GCP, the concepts should be applicable to other CI/CD systems or cloud providers that have arm64 VMs, like AWS or Azure. Honestly, the biggest issues are likely to be around enabling the right permissions and proper networking.

Solution

Docker has the concept of “contexts”, which allows one to use a remote machine to do the actual work of a build. An arm64 VM can be created and configured to be used as a context for the arm64 part of the build.

In a normal local network, behind a firewall, with trusted machines, it is pretty easy to set this up. You can configure the Docker daemon to listen on a regular tcp port and point a context at that machine/port.

For a bit more security, it is also relatively straightforward to use ssh to secure the connection to the Docker daemon.
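Outside of Cloud Build, a hypothetical ssh-backed setup (host and user names are placeholders) would look something like this:

# Hypothetical example: point a context at a remote daemon over ssh,
# then register it as an arm64 builder.
docker context create arm-remote --docker "host=ssh://builder@arm-vm.example.com"
docker --context arm-remote ps
docker buildx create --name remote-arm --platform linux/arm64 arm-remote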

Unfortunately, in Google Cloud Build, it simply isn’t currently possible to configure the network to allow a Cloud Build worker to talk to a VM directly via ssh (see this bug for details).

The Solution: Connect Cloud Builder to arm64 VM over IAP/ssh

However, it is possible to use Google’s Identity-Aware Proxy to connect to a VM via ssh. As far as I can tell, Docker’s ssh support cannot be used directly, since gcloud has to be used for ssh. Instead, it is just as easy to open an ssh tunnel from the Cloud Build worker to the VM’s normal Docker tcp port. The worker then simply needs a Docker context that points at the local end of that tunnel.

It took some effort to figure all of this out, and the primary motivation for writing this article is to share what I learned in the hope of helping someone else out. I should note I drew inspiration from this post by Gabriel Hodoroaga. It pointed me in the right direction and is also an excellent tutorial on IAP.

Okay, now on to the step-by-step guide to make this a reality.

Side Note: Many of these steps are manual, through the GCP Console. There are undoubtedly gcloud equivalents for everything done herein, but I didn’t have time to figure them all out. I leave that as an exercise for the reader.

Prerequisites

I develop on a Mac, and have only tested these instructions on a Mac, but any commands given should also work for Linux and probably even Windows.

You’ll need to have Docker Desktop installed.

You’ll also need to have Go installed (for the crane tool).

I assume you have some familiarity with GCP, have access to a GCP account or can create one, and have the gcloud CLI installed.

For this article, I created a new GCP project called multi-arch-docker. If you have the right permissions, you might want to do the same at GCP Console — Manage Resources. A benefit of a new project is that it is easy to clean things up at the end — just delete the project.

You can also use an existing project; however, you will need to replace multi-arch-docker in certain places with your project name (for example, in the next paragraph). Be aware that some of my links to the GCP Console may not work for you, but I trust you’ll figure out how to fix them manually.

It helps to set an environment variable, which is used in some commands below, and also overrides the GCP_PROJECT variable in the Makefile:

export GCP_PROJECT=multi-arch-docker

The commands shown in the remainder of this article assume you have set your current gcloud project like so:

gcloud config set project $GCP_PROJECT

Doing this eliminates the need to specify the --project flag in subsequent gcloud commands.

Companion Repo

My multi-arch-docker companion git repo provides not only all of the source code discussed herein, but is a robust real-world example of tooling to create and maintain multi-architecture Docker images.

You’ll need the code to run the later examples, so go ahead and clone it into your favorite working directory:

git clone git@github.com:dougdonohoe/multi-arch-docker.git

Create Docker Repositories in Artifact Registry

Artifact Registry is where images are pushed to and pulled from. I believe it is a best practice to have multiple environments for images. In real-world use, I like to have docker-dev for development/testing and docker-prod for official use in other build steps. The Makefile infrastructure in the companion repo makes it easy to switch between environments via the ENV variable.

Enable the Artifact Registry API:

gcloud services enable artifactregistry.googleapis.com

For this tutorial, we only need one repository. In Artifact Registry, click + CREATE REPOSITORY and use these values:

  • Name: docker-dev
  • Format: Docker
  • Location Type: Multi-region, US
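If you prefer the CLI, I believe the gcloud equivalent is roughly the following (verify the flags against the current gcloud documentation):

gcloud artifacts repositories create docker-dev \
  --repository-format=docker \
  --location=us \
  --description="Development Docker images"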

If you instead want to use your own repository, you can prefix make commands with ENV=myrepo. To avoid that on each command, set an environment variable:

export ENV=docker-dev

Create a Service Account

For use by the VM, create a dedicated service account that can access Artifact Registry in GCP Console — Service Accounts. Click + CREATE SERVICE ACCOUNT and use these values:

  • Name: builder
  • Role: Artifact Registry Writer (to enable pulling/pushing images from/to Artifact Registry)
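Again, for the CLI-inclined, a rough gcloud equivalent (double-check the role ID) would be:

gcloud iam service-accounts create builder --display-name="builder"
gcloud projects add-iam-policy-binding $GCP_PROJECT \
  --member="serviceAccount:builder@$GCP_PROJECT.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.writer"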

SSH Credentials

An ssh key is needed for the Cloud Build worker to tunnel to the arm64 VM. First, create a temporary directory and the key:

mkdir /tmp/builder-keys
cd /tmp/builder-keys
ssh-keygen -t ed25519 -f build-google_compute_engine.ed25519 \
-N "" -C "root@builder"

This creates two files:

  • build-google_compute_engine.ed25519
  • build-google_compute_engine.ed25519.pub

Save these files as GCP secrets (enable the Secrets API if prompted):

gcloud secrets create build-google_compute_engine-ssh-priv \
--replication-policy="automatic" \
--data-file=build-google_compute_engine.ed25519
gcloud secrets create build-google_compute_engine-ssh-pub \
--replication-policy="automatic" \
--data-file=build-google_compute_engine.ed25519.pub

You should see them in GCP Console — Secret Manager. You can also fetch them via gcloud to verify they are there:

gcloud secrets versions access latest \
--secret=build-google_compute_engine-ssh-priv
gcloud secrets versions access latest \
--secret=build-google_compute_engine-ssh-pub

The public key also needs to be uploaded as project metadata. The instructions are based on these docs (enable the Compute API if prompted):

# Get existing keys (only found in an existing project, if at all)
gcloud compute project-info describe \
--format="value(commonInstanceMetadata[items][ssh-keys])" \
| tee ssh_metadata

# Append new key - Be careful here - you don't want to delete any
# existing metadata, so be sure and use the -a flag to append the
# new key to existing keys.
echo "root:$(cat build-google_compute_engine.ed25519.pub)" \
| tee -a ssh_metadata

# Visually verify the ssh_metadata file has any prior keys
# (blank lines are ok)
cat ssh_metadata

# Save
gcloud compute project-info add-metadata \
--metadata-from-file=ssh-keys=ssh_metadata

# Fetch metadata again to verify it worked
gcloud compute project-info describe \
--format="value(commonInstanceMetadata[items][ssh-keys])"

You should see them in GCP Console — Metadata.

Finally, remove the temporary directory and keys:

cd && rm -rf /tmp/builder-keys

IAP — Identity Aware Proxy

In order for the Cloud Builder to talk to the VM, it has to use Identity-Aware Proxy (IAP).

To use and configure IAP, you need to grant yourself permissions via
IAM Admin. Find your email and click the pencil icon, then add these roles:

  • IAP-secured Tunnel User
  • IAP Policy Admin
  • IAP Settings Admin
  • Compute Instance Admin (v1)
  • Service Account User

If you can’t grant yourself these permissions, ask a teammate or manager with higher privileges to help, or beg and plead with your local cloud ops team.
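If you would rather grant these roles from the command line, here is a hedged sketch (the role IDs are my best guesses at the console names above; verify them against the IAM documentation):

# Replace with your own account email.
MEMBER="user:you@example.com"
for ROLE in roles/iap.tunnelResourceAccessor roles/iap.admin \
            roles/iap.settingsAdmin roles/compute.instanceAdmin.v1 \
            roles/iam.serviceAccountUser; do
  gcloud projects add-iam-policy-binding $GCP_PROJECT \
    --member="$MEMBER" --role="$ROLE"
done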

Enable the IAP and Cloud Build APIs:

gcloud services enable iap.googleapis.com
gcloud services enable cloudbuild.googleapis.com

Add a firewall rule enabling IAP access from Cloud Build IPs:

gcloud compute firewall-rules create allow-ssh-ingress-from-iap \
--direction=INGRESS \
--action=allow \
--rules=tcp:22 \
--source-ranges=35.235.240.0/20

Because IAP is being used, you can optionally disable normal ssh for stronger security via the command below. Of course, you may not want to do this if other VMs in your project currently depend on direct ssh.

gcloud compute firewall-rules update default-allow-ssh --disabled

You can see firewall rules in GCP Console — Firewall.

You also need to add these roles to your Cloud Build service account so that it can fetch ssh keys, use IAP, and use the VM:

  • Compute Admin
  • Service Account User
  • IAP-secured Tunnel User
  • Secret Manager Secret Accessor

To determine your Cloud Build service account, run this:

PROJECT_NUMBER=$(gcloud projects list --filter=$GCP_PROJECT \
--format="value(PROJECT_NUMBER)")
SERVICE_ACCOUNT="$PROJECT_NUMBER@cloudbuild.gserviceaccount.com"
echo $SERVICE_ACCOUNT
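With $SERVICE_ACCOUNT in hand, a hedged sketch of granting the four roles above from the command line (verify the role IDs) is:

for ROLE in roles/compute.admin roles/iam.serviceAccountUser \
            roles/iap.tunnelResourceAccessor roles/secretmanager.secretAccessor; do
  gcloud projects add-iam-policy-binding $GCP_PROJECT \
    --member="serviceAccount:$SERVICE_ACCOUNT" --role="$ROLE"
done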

Creating the VM

Create an arm64 VM and perform some manual installation steps:

Create T2A Instance

In GCP Console — VM Instances, click Create Instance:

  • Name: builder-arm64-2cpu
  • Region/Zone: us-central1 / us-central1-a
  • Series: T2A
  • Machine Type: t2a-standard-2 (experimentation shows that 2 CPUs is sufficient to handle my multi-arch builds)
  • Boot Disk: 40G, SSD persistent disk, Debian GNU/Linux 11 (bullseye)
  • Service Account: builder@multi-arch-docker.iam.gserviceaccount.com
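A hypothetical gcloud equivalent of the console steps (the flags, and in particular the arm64 image family, should be verified before use) looks roughly like this:

gcloud compute instances create builder-arm64-2cpu \
  --zone=us-central1-a \
  --machine-type=t2a-standard-2 \
  --image-family=debian-11-arm64 \
  --image-project=debian-cloud \
  --boot-disk-size=40GB \
  --boot-disk-type=pd-ssd \
  --service-account=builder@$GCP_PROJECT.iam.gserviceaccount.com \
  --scopes=cloud-platform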

Environment Variables

Define some more environment variables for use in subsequent commands:

INSTANCE_NAME=builder-arm64-2cpu
ZONE=us-central1-a

Login over IAP

Log in and then become root, since all subsequent commands need to run as root (this makes life easier than having to sudo everything):

# On your machine
gcloud compute ssh --zone $ZONE $INSTANCE_NAME --tunnel-through-iap

# On VM
sudo su -

You may see a warning suggesting you install numpy for better performance. Exit the VM, run this, and then log in again:

# On your machine
$(gcloud info --format="value(basic.python_location)") -m pip install numpy

Install Docker

Run the Docker installation steps from the Debian install instructions (just copy and paste this whole block):

# On the VM
apt-get install --yes ca-certificates curl gnupg lsb-release make
mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian \
$(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install --yes docker-ce docker-ce-cli containerd.io docker-compose-plugin

# To test
docker run hello-world

Configure Docker Access

Configure Docker to be able to talk to Artifact Registry:

# On the VM
gcloud auth --quiet configure-docker us-docker.pkg.dev

SSH Access For Root

To allow root login and port forwarding, change these lines to yes or uncomment them in /etc/ssh/sshd_config on the VM:

PermitRootLogin yes
AllowTcpForwarding yes
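If you prefer to script the edit rather than open an editor, something like these sed commands (illustrative; inspect the file afterwards) works:

# On the VM (illustrative): force both settings to 'yes', whether they are
# currently commented out or set to another value.
sed -i -E 's/^#?\s*PermitRootLogin\s+.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i -E 's/^#?\s*AllowTcpForwarding\s+.*/AllowTcpForwarding yes/' /etc/ssh/sshd_config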

Restart ssh:

# On the VM
service ssh reload

Test that you can now connect from your machine as root:

gcloud compute ssh --zone $ZONE root@$INSTANCE_NAME --tunnel-through-iap

There is also a make command that does this:

cd multi-arch-docker
make ssh-vm

Port Forwarding

IAP can be combined with ssh port forwarding of the Docker port to allow a Cloud Build worker to access the VM. The first step is to enable the Docker daemon’s tcp socket on the VM.

By default, listening on the tcp socket isn’t turned on. To turn it on, add an override file to specify an alternate ExecStart command. This overrides the default, found in /lib/systemd/system/docker.service:

# On the VM
mkdir -p /etc/systemd/system/docker.service.d
cat << EOF > /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375 --containerd=/run/containerd/containerd.sock
EOF

Then restart Docker and validate tcp is working:

# On the VM
# To restart, reload daemon and restart docker:
systemctl daemon-reload
systemctl restart docker.service

# Check status
systemctl status docker --no-pager # should show "Active: active"
systemctl is-enabled docker # should show "enabled"

# Validate
docker -H tcp://0.0.0.0:2375 ps # should show empty list
netstat -tulpn | grep dockerd # should show '0 :::2375' in 3rd/4th columns

Test Remote Build on VM

The remote VM should be ready to use. Let’s test this from our local machine. First, in a separate terminal window, start an ssh tunnel (don’t forget to set your environment variables):

gcloud compute ssh --zone $ZONE $INSTANCE_NAME --tunnel-through-iap -- \
-L 8375:0.0.0.0:2375 -N

In another window, configure Docker to talk to Artifact Registry, set up a Docker context that uses the tunneled tcp port, and create a buildx builder that uses that context:

gcloud auth --quiet configure-docker us-docker.pkg.dev
docker context create arm_node_tunnel --docker "host=tcp://127.0.0.1:8375"
docker buildx create --use --name remote-arm-tunnel --platform linux/arm64 \
arm_node_tunnel

To list current context/builders:

docker context ls
docker buildx ls

The third-party base images need to be copied to our Docker repository (this is done to avoid Docker Hub rate limiting). The crane tool does the copying and is auto-installed, which is why Go must be installed.

cd multi-arch-docker
make thirdparty
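For reference, a standalone crane invocation of the same idea (image name and destination path are illustrative, not necessarily what the Makefile uses) looks like:

# Copy a multi-arch image, manifest list and all, from Docker Hub into
# the project's Artifact Registry repository.
crane copy alpine:3.15 us-docker.pkg.dev/$GCP_PROJECT/docker-dev/thirdparty/alpine:3.15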

Start top on the remote VM — when the build is running you should see processes pop up as Docker does its work.

Test the remote Docker setup by using the Makefile in the companion repo:

BUILDER=remote-arm-tunnel PLATFORMS=linux/arm64 \
TAG_MODIFIER="arm64-test-remote" make buildx-publish-runtime

This does an arm64-only build and uses the remote VM to do the work.

This actually publishes the runtime image to Artifact Registry, but with the arm64-test-remote suffix to avoid conflicting with the normal build tag.

After testing is done, remove the context and builder created above:

docker buildx rm remote-arm-tunnel
docker context rm arm_node_tunnel

Then stop your ssh tunnel with CTRL-C.

Cloud Build

Before running the cloud build, create the dedicated cloud builder image:

make build-publish-cloud-builder

Then, at long last, to run the cloud build:

make cloud-build

A link to the Cloud Build log will be displayed, which will take you to the build in the GCP Console. Output from the build will also stream to your terminal. You should again see activity in top on the VM.

Code Notes

A lot of the magic that makes this work is in the Cloud Build pr.yaml configuration file and the Makefile. I’d like to call out a few things:

Secrets

The ssh keys required for the tunnel are fetched from Secret Manager in the first step:

mkdir -p /builder/home/.ssh
gcloud secrets versions access latest \
--secret=build-google_compute_engine-ssh-priv \
> /builder/home/.ssh/google_compute_engine
gcloud secrets versions access latest \
--secret=build-google_compute_engine-ssh-pub \
> /builder/home/.ssh/google_compute_engine.pub
chmod 700 /builder/home/.ssh
chmod 600 /builder/home/.ssh/google_compute_engine

Note that in Cloud Build, the /builder/home directory is persistent across build steps.

Contexts

The setup step is what creates the Docker context for the arm64 remote host. The default context is built-in and used for amd64. Note that the builder’s port 8375 maps, via ssh tunnel, to the remote VM’s port 2375. I use different ports to make it clear which is on the local builder (8375) and which is on the VM (2375).

docker context create arm_node --docker "host=tcp://127.0.0.1:8375"

Docker Builder Creation

Each build target depends on the buildx-setup target, which creates a builder that assumes the existence of the default and arm_node contexts created above. For example, this target resolves to these commands for the build image:

docker buildx create --use --name build --platform linux/amd64 default
docker buildx create --append --name build --platform linux/arm64 arm_node

Per-Step Tunnel

Each step that does an actual build has to create an ssh tunnel because each step runs in its own container:

make ssh-tunnel

The ssh-tunnel target is defined as follows:

ssh-tunnel:
	gcloud compute ssh --project $(GCP_PROJECT) \
		--zone us-central1-a $(ARM64_VM) \
		--tunnel-through-iap -- -L 127.0.0.1:8375:0.0.0.0:2375 -N -f

The --tunnel-through-iap flag is the magic that invokes IAP to facilitate the use of ssh from the builder. Unlike the tunnel test done above, the addition of the -f flag puts the tunnel in the background.

Build Image

Finally, note that the custom build image is created from Dockerfile-cloud-builder, which adds the Docker CLI and crane tool to Google’s standard gcr.io/cloud-builders/gcloud image.

Clean Up

To shut down your VM:

make stop-vm

You can restart it with:

make start-vm

NOTE: After restarting the VM, it can take up to 60 seconds for IAP to become aware of it (until then, ssh might fail).

If you want to delete resources created in this tutorial:

# OF COURSE, PROCEED WITH CAUTION HERE - DON'T DELETE PRE-EXISTING THINGS
gcloud compute instances delete $INSTANCE_NAME
gcloud iam service-accounts delete builder@$GCP_PROJECT.iam.gserviceaccount.com
gcloud artifacts repositories delete docker-dev --location us
gcloud secrets delete build-google_compute_engine-ssh-priv
gcloud secrets delete build-google_compute_engine-ssh-pub
gcloud compute firewall-rules delete allow-ssh-ingress-from-iap

If you created a new project for this, you can also just delete that and every resource in it goes away:

# BE VERY VERY VERY VERY VERY VERY CAREFUL
gcloud projects delete multi-arch-docker

Summary

It seems like a lot of work to set things up properly to use an arm64 VM, but the juice is worth the squeeze. The time it takes to do complex builds, especially C++, is drastically reduced, as the amd64 and arm64 parts now take effectively the same time.

I sincerely hope this tutorial makes someone’s life easier. Thanks for reading.

Doug Donohoe is a seasoned software engineer, working remotely from Pittsburgh, Pennsylvania. These days, he’s doing work in Go running in Docker on Kubernetes, but is known to have slung some Scala in the past. Connect with him on LinkedIn.
