Making “docker buildx” Fast in GCP Using arm64 VMs
I wrote this article to share some hard-earned knowledge about creating multi-architecture Docker builds. I explain why these images are needed and summarize the lessons I’ve learned making them via docker buildx. Then I give a step-by-step tutorial explaining how to use an arm64 VM to speed up docker buildx builds on Google Cloud Build.
Apple Silicon
The conventional thinking has been this: The vast majority of cloud provider hardware runs on amd64 (aka Intel64 or x86_64) chips. Most developers work on Mac, Linux or Windows machines which also use amd64 chips. Thus, if one is building container-based cloud software, it is sufficient to have Docker workflows that only build images for amd64.
That thinking was upended when Apple released their Arm-based “Apple Silicon” chips (aka the M1 and M2). It is now desirable and often necessary to have Docker images which support both amd64 and arm64 architectures.
In addition, cloud providers have begun offering arm64 hardware (e.g., Ampere chips at GCP/Azure and Annapurna Labs Graviton chips at AWS). Certain workloads may even perform better on arm64 hardware.
Many open source and internal company images still only support single-architecture amd64 images. While it is possible to run amd64 images on an arm64 host (or vice-versa), it can be very, very slow due to QEMU emulation, which translates machine code from one chip architecture to another on-the-fly. Slow performance is most likely to cause issues when running more complex software like a database or an in-memory cache. One might find that integration tests take a painful amount of time to complete and/or are flaky due to timeouts.
For all of these reasons, now more than ever, it is necessary to create multi-architecture images.
Building Multi-Architecture Images
Building multi-architecture images is most easily done using the docker buildx command (the alternative is to stitch together individually-built images using docker manifest create TAG --amend AMD_TAG --amend ARM_TAG && docker manifest push TAG).
For the purposes of this article, I’ll briefly summarize the lessons I learned getting docker buildx to work. The companion git repo has real working examples and in-depth documentation, so I encourage you to explore it.
Base Images
One has to make sure that the base image is multi-architecture and all RUN commands take architecture into account. For example, see how the gcloud installation is handled in Dockerfile-build or how the gcc x86 patch is handled in Dockerfile-odb.
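The usual trick in such RUN commands is to branch on the machine architecture, since different tools often name the same architecture differently. A minimal sketch of the idea (the mappings below are illustrative, not taken from the repo’s Dockerfiles):

```shell
#!/bin/sh
# Map the kernel's architecture name (uname -m) to the name a hypothetical
# downstream tool expects in its download URLs. The specific mappings here
# are examples only; check each tool's release page for its actual names.
arch_name() {
  case "$1" in
    x86_64 | amd64)  echo "x86_64" ;;
    aarch64 | arm64) echo "arm" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Inside a Dockerfile RUN step, one might then write something like:
#   ARCH="$(arch_name "$(uname -m)")" && curl -fsSL ".../tool-${ARCH}.tar.gz" | tar xz
arch_name "$(uname -m)"
```

The same pattern works for apt package names, patch files, or anything else that varies per architecture.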
Caching
Caching intermediate results can speed up subsequent builds substantially. A clean build might take 30 minutes, but a subsequent build with no changes to the Dockerfiles might only take two. Here’s the full command to build a runtime image with caching to/from Artifact Registry:
docker buildx build --file Dockerfile-runtime \
--platform linux/amd64,linux/arm64 \
--builder builder-local \
--progress plain \
--build-arg DOCKER_REPO=us-docker.pkg.dev/multi-arch-docker/docker-dev \
--build-arg ALPINE_VERSION=3.15 \
--cache-from type=registry,ref=us-docker.pkg.dev/multi-arch-docker/docker-dev/cache/runtime:3.15 \
--cache-to type=registry,ref=us-docker.pkg.dev/multi-arch-docker/docker-dev/cache/runtime:3.15,mode=max \
--pull --push \
--tag us-docker.pkg.dev/multi-arch-docker/docker-dev/runtime:3.15 .
Notice the use of --cache-from/--cache-to pointing to a dedicated cache folder in the Docker repository. Also note the use of mode=max, which ensures all layers are cached (the default, mode=min, only caches the layers present in the final image).
Builder Name
If running concurrent docker buildx builds in Cloud Build, it is necessary to have a unique builder name for each step, otherwise you get an “already in use” error. See the note in the Makefile for details and the build steps in pr.yaml for how to set the name.
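The idea can be sketched as follows; the variable names here are hypothetical, and the repo’s Makefile is the authoritative version:

```shell
#!/bin/sh
# Hypothetical sketch: give each concurrent build step its own builder name
# so parallel "docker buildx create" calls don't collide ("already in use").
# IMAGE is an illustrative variable, not necessarily what the Makefile uses.
IMAGE="${IMAGE:-runtime}"
BUILDER_NAME="builder-${IMAGE}"
echo "$BUILDER_NAME"
# The name would then be passed along, e.g.:
#   docker buildx create --use --name "$BUILDER_NAME" --platform linux/amd64 default
```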
Long Build Push Error
On Google Cloud Build (and possibly other CI/CD platforms), you may encounter errors when pushing the final image if the build lasts longer than an hour (this is different from the Cloud Build timeout property, which you can set longer than an hour). I have fully documented the issue in my build-timeout repo.
Happily, there is a workaround, which is to split the docker buildx build into two phases. The docker-buildx.sh script does just this and the odb build is an example of its use.
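Roughly, the split is: a first, long invocation that only populates the registry cache, then a second, short invocation that rebuilds from the warm cache and pushes. This dry-run sketch just prints the two command lines (the flags are abbreviated and illustrative; docker-buildx.sh is the real implementation):

```shell
#!/bin/sh
# Sketch of the two-phase workaround: phase 1 does the long build and only
# writes the layer cache; phase 2 rebuilds from the warm cache (fast, so it
# stays under the one-hour credential window) and pushes.
CACHE_REF="us-docker.pkg.dev/multi-arch-docker/docker-dev/cache/runtime:3.15"
TAG="us-docker.pkg.dev/multi-arch-docker/docker-dev/runtime:3.15"

buildx_phase() {
  case "$1" in
    cache) echo "docker buildx build --cache-to type=registry,ref=${CACHE_REF},mode=max --tag ${TAG} ." ;;
    push)  echo "docker buildx build --cache-from type=registry,ref=${CACHE_REF} --push --tag ${TAG} ." ;;
  esac
}

buildx_phase cache
buildx_phase push
```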
Emulation Woes
Due to QEMU emulation, actually building an image for a non-native architecture can also be quite slow. I’ve witnessed arm64 C/C++ builds take four to five times longer on amd64 hosts (and vice-versa). This is very visible when building on Google Cloud Build, which only has amd64 workers.
Fortunately, there is a way to speed things up: use an arm64 VM to do the arm64 half of the build. It isn’t exactly straightforward to make this work, so the remainder of this article explains precisely how to enable this in GCP and Google Cloud Build.
While this is specific to GCP, the concepts should be applicable to other CI/CD systems or cloud providers that have arm64 VMs, like AWS or Azure. Honestly, the biggest issues are likely to be around enabling the right permissions and proper networking.
Solution
Docker has the concept of “contexts”, which allows one to use a remote machine to do the actual work of a build. An arm64 VM can be created and configured to be used as a context for the arm64 part of the build.
In a normal local network, behind a firewall, with trusted machines, it is pretty easy to set this up. You can configure the Docker daemon to listen on a regular tcp port and point a context at that machine/port.
For a bit more security, it is also relatively straightforward to use ssh to secure the connection to the Docker daemon.
Unfortunately, in Google Cloud Build, it simply isn’t currently possible to configure the network to allow a Cloud Build worker to talk to a VM directly via ssh (see this bug for details).
However, it is possible to use Google’s Identity Aware Proxy to connect to a VM via ssh. As far as I can tell, Docker’s ssh support cannot be used directly since gcloud has to be used for ssh. Instead, it is just as easy to use an ssh tunnel from the Cloud Build worker to the VM’s normal Docker tcp port. The worker then simply needs a Docker context that points at that local port.
It took some effort to figure all of this out and the primary motivation to write this article is to share my learnings to hopefully help someone else out. I should note I drew inspiration from this post by Gabriel Hodoroaga. It pointed me in the right direction and is also an excellent tutorial on IAP.
Okay, now on to the step-by-step guide to make this a reality.
Side Note: Many of these steps are manual, through the GCP Console. There are undoubtedly gcloud equivalents for everything done herein, but I didn’t have time to figure them all out. I leave that as an exercise for the reader.
Prerequisites
I develop on a Mac, and have only tested these instructions on a Mac, but any commands given should also work for Linux and probably even Windows.
You’ll need to have Docker Desktop installed.
You’ll also need to have Go installed (for the crane tool).
I assume you have some familiarity with GCP, have access to a GCP account or can create one, and have the gcloud CLI installed.
For this article, I created a new GCP project called multi-arch-docker. If you have the right permissions, you might want to do the same at GCP Console — Manage Resources. A benefit of a new project is that it is easy to clean things up at the end — just delete the project.
You can also use an existing project; however, you will need to replace multi-arch-docker in certain places with your project name (for example, in the next paragraph). Be aware that some links I have to the GCP Console may not work for you, but I trust you’ll figure out how to manually fix them.
It helps to set an environment variable, which is used in some commands below and also overrides the GCP_PROJECT variable in the Makefile:
export GCP_PROJECT=multi-arch-docker
The commands shown in the remainder of this article assume you have set your current gcloud project like so:
gcloud config set project $GCP_PROJECT
Doing this eliminates the need to specify the --project flag in subsequent gcloud commands.
Companion Repo
My multi-arch-docker companion git repo provides not only all of the source code discussed herein, but is also a robust real-world example of tooling to create and maintain multi-architecture Docker images.
You’ll need the code to run the later examples, so go ahead and clone it into your favorite working directory:
git clone git@github.com:dougdonohoe/multi-arch-docker.git
Create Docker Repositories in Artifact Registry
Artifact Registry is where images are pushed to and pulled from. I believe it is a best practice to have multiple environments for images. In real-world use, I like to have docker-dev for development/testing and docker-prod for official use in other build steps. The Makefile infrastructure in the companion repo easily allows switching between environments via the ENV variable.
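As a hypothetical sketch of how such an ENV variable might resolve to a repository path (the real logic lives in the companion repo’s Makefile):

```shell
#!/bin/sh
# Hypothetical sketch: resolve the Artifact Registry repository path from
# GCP_PROJECT and ENV. Variable names mirror the article; the authoritative
# version is the Makefile in the companion repo.
GCP_PROJECT="${GCP_PROJECT:-multi-arch-docker}"
ENV="${ENV:-docker-dev}"
DOCKER_REPO="us-docker.pkg.dev/${GCP_PROJECT}/${ENV}"
echo "$DOCKER_REPO"
```

Switching environments is then just a matter of exporting a different ENV (e.g., ENV=docker-prod).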
Enable the Artifact Registry API:
gcloud services enable artifactregistry.googleapis.com
For this tutorial, we only need one repository. In Artifact Registry, click + CREATE REPOSITORY and use these values:
- Name: docker-dev
- Format: Docker
- Location Type: Multi-region, US
If you instead want to use your own repository, you can prefix make commands with ENV=myrepo. To avoid that on each command, set an environment variable:
export ENV=docker-dev
Create a Service Account
For use by the VM, create a dedicated service account that can access Artifact Registry in GCP Console — Service Accounts. Click + CREATE SERVICE ACCOUNT and use these values:
- Name: builder
- Role: Artifact Registry Writer (to enable pulling/pushing images from/to Artifact Registry)
SSH Credentials
An ssh key is needed for the Cloud Build worker to tunnel to the arm64 VM. First, create a temporary directory and the key:
mkdir /tmp/builder-keys
cd /tmp/builder-keys
ssh-keygen -t ed25519 -f build-google_compute_engine.ed25519 \
-N "" -C "root@builder"
This creates two files:
build-google_compute_engine.ed25519
build-google_compute_engine.ed25519.pub
Save these files as GCP secrets (enable the Secrets API if prompted):
gcloud secrets create build-google_compute_engine-ssh-priv \
--replication-policy="automatic" \
--data-file=build-google_compute_engine.ed25519
gcloud secrets create build-google_compute_engine-ssh-pub \
--replication-policy="automatic" \
--data-file=build-google_compute_engine.ed25519.pub
You should see them in GCP Console — Secret Manager. You can also fetch them via gcloud to verify they are there:
gcloud secrets versions access latest \
--secret=build-google_compute_engine-ssh-priv
gcloud secrets versions access latest \
--secret=build-google_compute_engine-ssh-pub
The public key also needs to be uploaded as project metadata. The instructions are based on these docs (enable the Compute API if prompted):
# Get existing keys (only found in an existing project, if at all)
gcloud compute project-info describe \
--format="value(commonInstanceMetadata[items][ssh-keys])" \
| tee ssh_metadata
# Append new key - Be careful here - you don't want to delete any
# existing metadata, so be sure and use the -a flag to append the
# new key to existing keys.
echo "root:$(cat build-google_compute_engine.ed25519.pub)" \
| tee -a ssh_metadata
# Visually verify the ssh_metadata file has any prior keys
# (blank lines are ok)
cat ssh_metadata
# Save
gcloud compute project-info add-metadata \
--metadata-from-file=ssh-keys=ssh_metadata
# Fetch metadata again to verify it worked
gcloud compute project-info describe \
--format="value(commonInstanceMetadata[items][ssh-keys])"
You should see them in GCP Console — Metadata.
Finally, remove the temporary directory and keys:
cd && rm -rf /tmp/builder-keys
IAP — Identity Aware Proxy
In order for the Cloud Builder to talk to the VM, it has to use Identity Aware Proxy (IAP).
To use and configure IAP, you need to grant yourself permissions via IAM Admin. Find your email and click the pencil icon, then add these roles:
IAP-secured Tunnel User
IAP Policy Admin
IAP Settings Admin
Compute Instance Admin (v1)
Service Account User
If you can’t grant yourself these permissions, ask a teammate or manager who has higher privileges to help, or beg and plead with your local cloud ops team.
Enable the IAP and Cloud Build APIs:
gcloud services enable iap.googleapis.com
gcloud services enable cloudbuild.googleapis.com
Add a firewall rule enabling IAP access from Cloud Build IPs:
gcloud compute firewall-rules create allow-ssh-ingress-from-iap \
--direction=INGRESS \
--action=allow \
--rules=tcp:22 \
--source-ranges=35.235.240.0/20
Because IAP is being used, for stronger security I recommend (optionally) disabling normal ssh via the command below. Of course, you may not want to do this if there are other VMs in your project that currently depend on direct ssh.
gcloud compute firewall-rules update default-allow-ssh --disabled
You can see firewall rules in GCP Console — Firewall.
You also need to add these roles to your Cloud Build service account so that it can fetch ssh keys, use IAP and use the VM:
Compute Admin
Service Account User
IAP-secured Tunnel User
Secret Manager Secret Accessor
To determine your Cloud Build service account, run this:
PROJECT_NUMBER=$(gcloud projects list --filter=$GCP_PROJECT \
--format="value(PROJECT_NUMBER)")
SERVICE_ACCOUNT="$PROJECT_NUMBER@cloudbuild.gserviceaccount.com"
echo $SERVICE_ACCOUNT
Creating the VM
Create an arm64 VM and perform some manual installation steps:
Create T2A Instance
In GCP Console — VM Instances, click Create Instance:
- Name: builder-arm64-2cpu
- Region/Zone: us-central1 / us-central1-a
- Series: T2A
- Machine Type: t2a-standard-2 (experimentation shows that 2 CPUs is sufficient to handle my multi-arch builds)
- Boot Disk: 40G, SSD persistent disk, Debian GNU/Linux 11 (bullseye)
- Service Account: builder@multi-arch-docker.iam.gserviceaccount.com
Environment Variables
Define some more environment variables for use in subsequent commands:
INSTANCE_NAME=builder-arm64-2cpu
ZONE=us-central1-a
Login over IAP
Login and then make yourself root, since all subsequent commands need to run as root (this makes life easier than having to sudo everything):
# On your machine
gcloud compute ssh --zone $ZONE $INSTANCE_NAME --tunnel-through-iap
# On VM
sudo su -
You may get a warning to install numpy for better performance. Exit from the VM, run this, and then login again:
# On your machine
$(gcloud info --format="value(basic.python_location)") -m pip install numpy
Install Docker
Run Docker installation steps as found in the Debian install instructions (just copy+paste this whole block):
# On the VM
apt-get install --yes ca-certificates curl gnupg lsb-release make
mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian \
$(lsb_release -cs) stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install --yes docker-ce docker-ce-cli containerd.io docker-compose-plugin
# To test
docker run hello-world
Configure Docker Access
Configure Docker to be able to talk to Artifact Registry:
# On the VM
gcloud auth --quiet configure-docker us-docker.pkg.dev
SSH Access For Root
To allow root login and port forwarding, change these lines to yes or uncomment them in /etc/ssh/sshd_config on the VM:
PermitRootLogin yes
AllowTcpForwarding yes
Restart ssh:
# On the VM
service ssh reload
Test that you can now connect from your machine as root:
gcloud compute ssh --zone $ZONE root@$INSTANCE_NAME --tunnel-through-iap
There is also a make command that does this:
cd multi-arch-docker
make ssh-vm
Port Forwarding
IAP can be combined with forwarding the local Docker port over ssh to allow a Google Cloud Build worker to access the VM. The first step is to enable tcp on the machine.
By default, listening on the tcp socket isn’t turned on. To turn it on, add an override file to specify an alternate ExecStart command. This overrides the default, found in /lib/systemd/system/docker.service:
# On the VM
mkdir -p /etc/systemd/system/docker.service.d
cat << EOF > /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// -H tcp://0.0.0.0:2375 --containerd=/run/containerd/containerd.sock
EOF
Then restart Docker and validate tcp is working:
# On the VM
# To restart, reload daemon and restart docker:
systemctl daemon-reload
systemctl restart docker.service
# Check status
systemctl status docker --no-pager # should show "Active: active"
systemctl is-enabled docker # should show "enabled"
# Validate
docker -H tcp://0.0.0.0:2375 ps # should show empty list
netstat -tulpn | grep dockerd # should show '0 :::2375' in 3rd/4th columns
Test Remote Build on VM
The remote VM should now be ready to use. Let’s test it from our local machine. First, in a separate terminal window, start an ssh tunnel (don’t forget to set your environment variables):
gcloud compute ssh --zone $ZONE $INSTANCE_NAME --tunnel-through-iap -- \
-L 8375:0.0.0.0:2375 -N
In another window, configure Docker to talk to Artifact Registry, set up a buildx context using the tunneled tcp port, and create a builder to use that context:
gcloud auth --quiet configure-docker us-docker.pkg.dev
docker context create arm_node_tunnel --docker "host=tcp://127.0.0.1:8375"
docker buildx create --use --name remote-arm-tunnel --platform linux/arm64 \
arm_node_tunnel
To list current context/builders:
docker context ls
docker buildx ls
The third-party base images need to be copied to our Docker repository (to avoid Docker Hub rate limiting). The crane tool is used to do the copying; it is auto-installed, which is why Go needs to be installed.
cd multi-arch-docker
make thirdparty
Start top on the remote VM; when the build is running you should see processes pop up as Docker does its work.
Test the remote docker by using the Makefile in the companion repo:
BUILDER=remote-arm-tunnel PLATFORMS=linux/arm64 \
TAG_MODIFIER="arm64-test-remote" make buildx-publish-runtime
This will do an arm64-only build and use the remote VM to do the work. It actually publishes the runtime image to Artifact Registry, but with the arm64-test-remote suffix to avoid conflicting with the normal build tag.
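A hypothetical sketch of how such a tag modifier might be appended (variable names are illustrative; see the repo’s Makefile for the real logic):

```shell
#!/bin/sh
# Hypothetical sketch: append TAG_MODIFIER to the image tag only when it is
# set, so normal builds keep the plain version tag. The variable names here
# are illustrative, not necessarily what the Makefile uses.
DOCKER_REPO="us-docker.pkg.dev/multi-arch-docker/docker-dev"
VERSION="3.15"
TAG_MODIFIER="arm64-test-remote"
TAG="${DOCKER_REPO}/runtime:${VERSION}${TAG_MODIFIER:+-${TAG_MODIFIER}}"
echo "$TAG"
```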
After testing is done, remove the context and builder created above:
docker buildx rm remote-arm-tunnel
docker context rm arm_node_tunnel
Then stop your ssh tunnel with CTRL-C.
Cloud Build
To run the cloud build, it is necessary to create the dedicated cloud builder:
make build-publish-cloud-builder
Then, at long last, to run the cloud build:
make cloud-build
A link to the Cloud Build log will be displayed, which will take you to the build in the GCP Console. Output from the build will also stream to your terminal. You should again see activity in top on the VM.
Code Notes
A lot of the magic that makes this work is in the Cloud Build pr.yaml configuration file and the Makefile. I’d like to call out a few things:
Secrets
The ssh keys required for the tunnel are fetched in the first step:
mkdir -p /builder/home/.ssh
gcloud secrets versions access latest \
--secret=build-google_compute_engine-ssh-priv \
> /builder/home/.ssh/google_compute_engine
gcloud secrets versions access latest \
--secret=build-google_compute_engine-ssh-pub \
> /builder/home/.ssh/google_compute_engine.pub
chmod 700 /builder/home/.ssh
chmod 600 /builder/home/.ssh/google_compute_engine
Note that in Cloud Build, the /builder/home directory is persistent across build steps.
Contexts
The setup step is what creates the Docker context for the arm64 remote host. The default context is built-in and used for amd64. Note that the builder’s port 8375 maps, via ssh tunnel, to the remote VM’s port 2375. I use different ports to make it clear which is on the local builder (8375) and which is on the VM (2375).
docker context create arm_node --docker "host=tcp://127.0.0.1:8375"
Docker Builder Creation
Each build target depends on the buildx-setup target, which creates a builder that assumes the existence of the default and arm_node contexts created above. For example, this target resolves to these commands for the build image:
docker buildx create --use --name build --platform linux/amd64 default
docker buildx create --append --name build --platform linux/arm64 arm_node
Per-Step Tunnel
Each step that does an actual build has to create an ssh tunnel, because each step runs in its own container:
make ssh-tunnel
The ssh-tunnel target is defined as follows:
ssh-tunnel:
gcloud compute ssh --project $(GCP_PROJECT) \
--zone us-central1-a $(ARM64_VM) \
--tunnel-through-iap -- -L 127.0.0.1:8375:0.0.0.0:2375 -N -f
The --tunnel-through-iap flag is the magic that invokes IAP to facilitate the use of ssh from the builder. Unlike the tunnel test done above, the addition of the -f flag puts the tunnel in the background.
Build Image
Finally, note that the custom build image is created from Dockerfile-cloud-builder, which adds the Docker CLI and crane tool to Google’s standard gcr.io/cloud-builders/gcloud image.
Clean Up
To shut down your VM:
make stop-vm
You can restart it with:
make start-vm
NOTE: After restarting, it can take up to 60 seconds for IAP to become aware of it (before this, ssh might fail).
If you want to delete resources created in this tutorial:
# OF COURSE, PROCEED WITH CAUTION HERE - DON'T DELETE PRE-EXISTING THINGS
gcloud compute instances delete $INSTANCE_NAME
gcloud iam service-accounts delete builder@$GCP_PROJECT.iam.gserviceaccount.com
gcloud artifacts repositories delete docker-dev --location us
gcloud secrets delete build-google_compute_engine-ssh-priv
gcloud secrets delete build-google_compute_engine-ssh-pub
gcloud compute firewall-rules delete allow-ssh-ingress-from-iap
If you created a new project for this, you can also just delete it, and every resource in it goes away:
# BE VERY VERY VERY VERY VERY VERY CAREFUL
gcloud projects delete multi-arch-docker
Summary
It seems like a lot of work to set things up properly to use an arm64 VM, but the juice is worth the squeeze. The time it takes to do complex builds, especially C++, is drastically reduced, as the amd64 and arm64 parts take effectively the same time.
I sincerely hope this tutorial makes someone’s life easier. Thanks for reading.
Doug Donohoe is a seasoned software engineer, working remotely from Pittsburgh, Pennsylvania. These days, he’s doing work in Go running in Docker on Kubernetes, but is known to have slung some Scala in the past. Connect to him via LinkedIn.