Bit-for-bit reproducible builds with Dockerfile

Akihiro Suda
Published in nttlabs
Mar 19, 2023

At FOSDEM last month, I talked about the current status of bit-for-bit reproducible builds with Dockerfile:

Presentation slides at FOSDEM 2023 (PDF)

What are reproducible builds?

Reproducible builds are a practice that guarantees that identical binaries can be built from the same source by anybody at any time.

Slide 2

Reproducible builds are helpful for the security assessment of binary releases. If binary releases are reproducible, users can verify that they were actually built from the corresponding source release. Otherwise, it is practically impossible to verify whether they were built from the claimed source release: such binaries might have been built from malicious source code because of a compromised CI pipeline, or simply because of bad intent on the part of the releaser.

The reproducibility has to be attestable by anybody at any time, but not necessarily on any machine. Typically, the machine has to have a specific version of the toolchain. Sometimes the machine also has to use a specific version of the host operating system, with a specific filesystem and a specific CPU. This is very far from ideal, but sometimes inevitable.

Challenge 1: Reproducing timestamps

An obvious challenge for reproducible builds is that the output binary may contain timestamps. The community’s standard answer to this challenge is the SOURCE_DATE_EPOCH environment variable. SOURCE_DATE_EPOCH can be set to an arbitrary timestamp expressed as a decimal number of seconds since the UNIX epoch, such as $(git log -1 --pretty=%ct). Toolchains that support the SOURCE_DATE_EPOCH specification replace the timestamps with the value of SOURCE_DATE_EPOCH.
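As a rough illustration of what "supporting SOURCE_DATE_EPOCH" means for a tool, the sketch below prints the date a build would embed. The `build_date` helper is hypothetical and not taken from any particular toolchain; it assumes GNU `date`.

```shell
# Hypothetical helper: print the date that a build should embed.
build_date() {
  # Prefer SOURCE_DATE_EPOCH; fall back to the current time when it is unset.
  epoch="${SOURCE_DATE_EPOCH:-$(date +%s)}"
  # GNU date syntax; BSD date would need `date -u -r "$epoch"` instead.
  date -u -d "@${epoch}" '+%Y-%m-%dT%H:%M:%SZ'
}

export SOURCE_DATE_EPOCH=1675298208
build_date   # prints 2023-02-02T00:36:48Z
```

With the variable set, every run prints the same date, so the embedded timestamp no longer depends on when the build happens.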

BuildKit, the upstream implementation of Dockerfiles, has provided preliminary support for the SOURCE_DATE_EPOCH specification since BuildKit v0.11 (January 2023):

FROM debian:bullseye-20230109
ARG SOURCE_DATE_EPOCH
RUN echo "hello ${SOURCE_DATE_EPOCH}" >/hello

# === Workarounds below will not be needed when https://github.com/moby/buildkit/pull/3560 is merged ===

# Limit the timestamp upper bound to SOURCE_DATE_EPOCH.
# Workaround for https://github.com/moby/buildkit/issues/3180

RUN find $( ls / | grep -E -v "^(dev|mnt|proc|sys)$" ) \
-newermt "@${SOURCE_DATE_EPOCH}" -writable -xdev \
| xargs touch --date="@${SOURCE_DATE_EPOCH}" --no-dereference

# Squash the entire stage for resetting the whiteout timestamps.
# Workaround for https://github.com/moby/buildkit/issues/3168

FROM scratch
COPY --from=0 / /

Currently, the SOURCE_DATE_EPOCH value is applied only to the timestamps in the image metadata (e.g., the docker history timestamps); it is not automatically applied to the timestamps of the files inside the image layers. So, BuildKit v0.11 still requires a very complex Dockerfile like the one above to touch the files, but this will be significantly simplified when https://github.com/moby/buildkit/pull/3560 gets merged. The PR also handles the non-determinism of overlayfs “whiteouts”, which are created when a file in a lower layer is removed.
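To see what the find/touch workaround in the Dockerfile does, the following sketch applies the same clamping to a throwaway directory instead of the image root. The file names are made up for illustration, and GNU findutils and coreutils are assumed.

```shell
SOURCE_DATE_EPOCH=1675298208
dir="$(mktemp -d)"

touch "$dir/new"                        # mtime = now, newer than SOURCE_DATE_EPOCH
touch --date="@1000000000" "$dir/old"   # mtime in 2001, older than SOURCE_DATE_EPOCH

# Clamp: reset only the entries whose mtime is newer than SOURCE_DATE_EPOCH.
find "$dir" -newermt "@${SOURCE_DATE_EPOCH}" -writable -xdev -print0 \
  | xargs -0 -r touch --date="@${SOURCE_DATE_EPOCH}" --no-dereference

# Show the resulting mtimes as seconds since the epoch.
stat -c '%Y %n' "$dir/new" "$dir/old"
```

Only `new` is rewritten to 1675298208; `old` keeps its original mtime, which is why the workaround is described as limiting the upper bound of the timestamps rather than overwriting all of them.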

The above Dockerfile can be deterministically built with the following commands:

# Make sure to pin the BuildKit version
docker run \
-d \
--name=buildkitd \
--privileged \
--restart=always \
moby/buildkit:v0.11.0

docker cp buildkitd:/usr/bin/buildctl /usr/local/bin/buildctl

export BUILDKIT_HOST=docker-container://buildkitd

# 2023-02-02 00:36:48 UTC
SOURCE_DATE_EPOCH=1675298208

# Change to "true" for pushing the image to the registry
PUSH=false

buildctl build \
--frontend dockerfile.v0 \
--local dockerfile=. \
--local context=. \
--metadata-file metadata.json \
--output type=image,name=example.com/foo:$SOURCE_DATE_EPOCH,buildinfo=false,push=$PUSH \
--opt build-arg:SOURCE_DATE_EPOCH=$SOURCE_DATE_EPOCH

# Verify the image digest
[ "$(jq -r '."containerimage.digest"' < metadata.json)" = "sha256:b313cd9751ed3e0c3f7185c034fde857302d56642ac5518f1ebbf7fc2e8eed93" ]

Note that the buildctl CLI does not automatically propagate the SOURCE_DATE_EPOCH environment variable from the host to the containers. So, the --opt build-arg:SOURCE_DATE_EPOCH=$SOURCE_DATE_EPOCH flag is explicitly required.

Challenge 2: Reproducing package versions

The second challenge is how to reproduce the versions of the packages installed with apt/dnf/apk/pacman.

Thankfully, the Debian project has been preserving old apt packages on snapshot.debian.org since 2005:

# /etc/apt/sources.list , pinned to 2023-02-02 00:36:48 UTC
deb http://snapshot.debian.org/archive/debian/20230202T003648Z/ bullseye main
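The snapshot path in this example encodes the same instant as the SOURCE_DATE_EPOCH value used earlier. A minimal sketch of deriving one from the other, assuming GNU `date`:

```shell
SOURCE_DATE_EPOCH=1675298208

# The snapshot URL above addresses the archive state as YYYYMMDDTHHMMSSZ (UTC).
snapshot="$(date -u -d "@${SOURCE_DATE_EPOCH}" '+%Y%m%dT%H%M%SZ')"

sources_line="deb http://snapshot.debian.org/archive/debian/${snapshot}/ bullseye main"
echo "$sources_line"
```

This prints the same sources.list line as above, so a single timestamp can pin both the file timestamps and the package archive state.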

A similar server, snapshot.notset.fr, has also been available since 2017.

However, these servers aren’t mirrored widely, and aren’t appropriate for massive use, due to the limited bandwidth. The situation is similar for Fedora and Arch Linux too.

repro-get

My alternative approach is repro-get: decentralized & reproducible apt/dnf/apk/pacman.

repro-get cryptographically locks the package versions using plain old SHA256SUMS files:

35b1508eeee9c1dfba798c4c04304ef0f266990f936a51f165571edf53325cbc pool/main/h/hello/hello_2.10-2_amd64.deb
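The locking idea itself can be tried with the standard `sha256sum` tool alone. The sketch below does not use repro-get; `pkg.deb` is a stand-in file, and the `check_lock` helper is hypothetical.

```shell
# Hypothetical helper: verify every file listed in SHA256SUMS inside directory $1.
check_lock() {
  ( cd "$1" && sha256sum --check --status SHA256SUMS )
}

dir="$(mktemp -d)"
printf 'hello\n' > "$dir/pkg.deb"                    # stand-in for a downloaded package
( cd "$dir" && sha256sum pkg.deb > SHA256SUMS )      # lock the current content

if check_lock "$dir"; then first=verified; else first=mismatch; fi

printf 'tampered\n' > "$dir/pkg.deb"                 # simulate a changed package
if check_lock "$dir"; then second=verified; else second=mismatch; fi

echo "$first $second"   # prints: verified mismatch
```

Because the lock is a content hash rather than a version string, it does not matter where the blob was downloaded from, as long as the digest matches.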

Blobs can be fetched from several kinds of remotes, such as HTTP(S), OCI registries, and IPFS, to avoid sending heavy traffic to a central archive server such as snapshot.debian.org.

For example, the remote list can be configured as follows:

  • http://deb.debian.org/debian/{{.Name}} (Fast, ephemeral)
  • http://debian.notset.fr/snapshot/by-hash/SHA256/{{.SHA256}} (Slow, persistent)
  • oci://example.com/oras-image@sha256:{{.SHA256}}
  • http://ipfs.io/ipfs/{{.CID}}

In this example, repro-get first attempts to fetch http://deb.debian.org/debian/pool/main/h/hello/hello_2.10-2_amd64.deb (Fast, ephemeral) and then falls back to http://debian.notset.fr/snapshot/by-hash/SHA256/35b1508eeee9c1dfba798c4c04304ef0f266990f936a51f165571edf53325cbc (Slow, persistent) on HTTP 404.
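The template expansion can be sketched in a few lines of shell. This is only an illustration of the idea, not repro-get's actual implementation; the `expand_remote` helper is hypothetical, and no network access is performed.

```shell
# Hypothetical helper: expand one remote template for a given package.
# $1 = template, $2 = package path ({{.Name}}), $3 = digest ({{.SHA256}})
expand_remote() {
  printf '%s\n' "$1" | sed -e "s|{{\.Name}}|$2|g" -e "s|{{\.SHA256}}|$3|g"
}

name='pool/main/h/hello/hello_2.10-2_amd64.deb'
sha256='35b1508eeee9c1dfba798c4c04304ef0f266990f936a51f165571edf53325cbc'

# Candidates in list order; a real client would try the next one on HTTP 404.
first="$(expand_remote 'http://deb.debian.org/debian/{{.Name}}' "$name" "$sha256")"
second="$(expand_remote 'http://debian.notset.fr/snapshot/by-hash/SHA256/{{.SHA256}}' "$name" "$sha256")"

printf '%s\n%s\n' "$first" "$second"
```

The two printed URLs are the ones quoted in the paragraph above: the fast mirror is addressed by package path, while the persistent one is addressed by digest.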

Generating the hash

A hash file can be generated as follows:

$ repro-get hash generate >SHA256SUMS-amd64.old

$ apt-get install -y hello

$ repro-get hash generate --dedupe=SHA256SUMS-amd64.old >SHA256SUMS-amd64

The current user experience isn’t really great for keeping the hash file up-to-date. The plan is to provide some helper script for GitHub Actions to automate updating the hash file.

Reproducing packages

Packages can be reproduced from the hash file as follows:

$ cat SHA256SUMS-amd64
35b1508eeee9c1dfba798c4c04304ef0f266990f936a51f165571edf53325cbc pool/main/h/hello/hello_2.10-2_amd64.deb

$ repro-get install SHA256SUMS-amd64
(001/001) hello_2.10-2_amd64.deb Downloading from http://debian.notset.fr/snapshot/by-hash/SHA256/35b1508eeee9c1dfba798c4c04304ef0f266990f936a51f165571edf53325cbc
...
Preparing to unpack .../35b1508eeee9c1dfba798c4c04304ef0f266990f936a51f165571edf53325cbc ...
Unpacking hello (2.10-2) ...
Setting up hello (2.10-2) ...

See https://github.com/reproducible-containers/repro-get for further usage.

Demo

A demo is available at https://github.com/reproducible-containers/repro-get/tree/v0.3.0/examples/gcc .

Try the following commands to reproduce the image archive 0a3bcfebc67c85cac40e9c2cadee7b2b2b5077dc5ff985d8c396f008df818690 :

$ git clone https://github.com/reproducible-containers/repro-get.git
$ cd repro-get
$ git checkout v0.3.0
$ docker run -d --name buildkitd --privileged moby/buildkit:v0.11.0
$ docker cp buildkitd:/usr/bin/buildctl /usr/local/bin/buildctl
$ export BUILDKIT_HOST=docker-container://buildkitd
$ ./hack/test-dockerfile-repro.sh examples/gcc

...
0a3bcfebc67c85cac40e9c2cadee7b2b2b5077dc5ff985d8c396f008df818690 /.../0-oci.tar
0a3bcfebc67c85cac40e9c2cadee7b2b2b5077dc5ff985d8c396f008df818690 /.../1-oci.tar

The image is expected to be reproducible on any x86_64 machine. However, at least the BuildKit version must be pinned to the same version (v0.11.0). It’s also recommended to use the same filesystem (ext4) and the same host OS (Ubuntu 22.04).
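The output above lists the same digest for 0-oci.tar and 1-oci.tar, which is the property being tested. A comparable check can be sketched as follows; this is not the contents of hack/test-dockerfile-repro.sh, the `same_digest` helper is hypothetical, and the two files are small stand-ins rather than real image archives.

```shell
# Hypothetical helper: succeed only when both files have the same SHA-256 digest.
same_digest() {
  a="$(sha256sum < "$1" | cut -d' ' -f1)"
  b="$(sha256sum < "$2" | cut -d' ' -f1)"
  [ "$a" = "$b" ]
}

dir="$(mktemp -d)"
printf 'layer data\n' > "$dir/0-oci.tar"   # stand-in for the first build output
printf 'layer data\n' > "$dir/1-oci.tar"   # stand-in for the second build output

if same_digest "$dir/0-oci.tar" "$dir/1-oci.tar"; then
  result=reproducible
else
  result=different
fi
echo "$result"   # prints: reproducible
```

For a real check, the two stand-in files would be replaced by archives produced by two independent builds of the same Dockerfile.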

Slide 12

Wrap-up

BuildKit v0.11 supports bit-for-bit reproducible image builds, but it still needs very complex Dockerfiles for eliminating non-determinism of the timestamps and the package versions.

BuildKit v0.12 will require less complex Dockerfiles for deterministic timestamps, assuming that https://github.com/moby/buildkit/pull/3560 will be merged in v0.12.

The package versions can be pinned using repro-get: decentralized & reproducible apt/dnf/apk/pacman. It still needs huge improvements though, especially for the user experience of maintaining the hash files.

NTT is hiring!

We at NTT are looking for engineers who work in Open Source communities like BuildKit and their relevant projects. Visit https://www.rd.ntt/e/sic/recruit/ to see how to join us.

Our Japanese-language recruiting page is available at https://www.rd.ntt/sic/recruit/
