Extracting a single artifact from a Docker image without pulling

Tom Shaw
Apr 26, 2020

This post was inspired by a recent Docker Blog post by Tõnis Tiigi.

Update : The manual steps detailed throughout this post have been mostly automated by the “docker artifact” CLI plugin. The code can be found here : https://github.com/tomwillfixit/docker-artifact. All feedback welcome.

The “COPY --from=stage0 /binary0 /bin” approach was particularly interesting, so I updated a Dockerfile at work to use it and pull in some binaries from other container images. This worked a treat and the build looks much cleaner, but there was one problem: when copying a single file or directory from another image, the entire image was pulled. If the build runs on a host with a cache of images, this probably isn’t a big deal, as only the layers missing from the cache are pulled. However, on a fresh host or build agent with no cache, this adds quite an overhead to the build time. So I asked some folks on Twitter for help … Thanks to Darragh Bailey and Adrian Mouat for the responses.

What am I trying to achieve ?

I’d like to be able to pull a single binary from a Docker Image stored in Docker Hub without pulling the whole image. There may well be a simpler way to do this but I couldn’t find it so here are the steps I took. I’m planning to wrap these steps up in a single script called “copy_binary” that I can call within a Dockerfile like this :

RUN copy_binary --from=tomwillfixit/healthcheck /bin/helloworld.bin

This would allow the build to pull a single blob from Docker Hub which contains only the “helloworld.bin” binary file without pulling the rest of the “healthcheck” image.
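As a rough illustration of the interface described above, here is a minimal sketch of how such a script might split its arguments. The function name “parse_copy_binary_args” and the “cb_” variables are hypothetical, the default tag of “latest” is an assumption, and the sketch only handles Docker Hub style references (no registry host or port).

```shell
#!/bin/sh
# Hypothetical sketch of the argument handling for a "copy_binary" script.
# It only splits the arguments; nothing here talks to a registry.

parse_copy_binary_args() {
    cb_from=""
    cb_path=""
    for cb_arg in "$@"; do
        case "$cb_arg" in
            --from=*) cb_from="${cb_arg#--from=}" ;;
            -*) echo "unknown option: $cb_arg" >&2; return 2 ;;
            *) cb_path="$cb_arg" ;;
        esac
    done
    if [ -z "$cb_from" ] || [ -z "$cb_path" ]; then
        echo "usage: copy_binary --from=<repo>/<image>[:<tag>] <path>" >&2
        return 2
    fi
    # Split an optional tag off the image reference; assume "latest" otherwise.
    case "$cb_from" in
        *:*) cb_image="${cb_from%:*}"; cb_tag="${cb_from##*:}" ;;
        *)   cb_image="$cb_from"; cb_tag="latest" ;;
    esac
    # Print "<image> <tag> <path>" for the later steps to consume.
    printf '%s %s %s\n' "$cb_image" "$cb_tag" "$cb_path"
}
```

Printing the three fields on one line keeps the parsing separate from the registry calls, so each part can be tried on its own.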

How does this work?

The example code can be found here if you want to follow the steps. The following steps were used on Docker for Mac.

Step 1 : Build the Image

I’ve checked out the “healthcheck” repository and built the image as normal :

The most important part of the build output is the layer ID : 45efcfd27c03

After building the image it is necessary to push the image to Docker Hub before we can find the sha256 value of layer 45efcfd27c03.

Step 2: Find out where helloworld.bin is stored

Next up we need to find where the “helloworld.bin” is being stored so we can pull it.

We can use “docker inspect” to find the sha256 value of the RootFS layer where “helloworld.bin” is stored.

Since we are running Docker for Mac, we need to log in to the Docker for Mac VM to get the sha256 value of the blob which contains “helloworld.bin”. If you are using Linux, you can look in /var/lib/docker/image/overlay2/distribution/v2metadata-by-diffid/sha256 for the sha256 value of the blob containing “helloworld.bin”.

To log in to the VM we run :

Now that we are inside the Docker VM we can use the sha256 value which we got from the RootFS layer to find the actual location of the “helloworld.bin” file. (Update : If the sha256 value does not exist then it may be necessary to push the image to Docker Hub first. I need to investigate this requirement further.)
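The lookup described in this step might be sketched as follows. This is not the exact set of commands used for the post: it assumes the overlay2 storage driver, and it assumes each metadata file holds JSON with a “Digest” field containing the blob’s sha256 value. The function name “blob_digest_for_diffid” and the “METADATA_DIR” variable are made up for this example.

```shell
# Sketch: map a layer diff ID (from "docker inspect") to the registry blob digest.
# Assumes the overlay2 driver and a "Digest" field inside the metadata file.

METADATA_DIR="${METADATA_DIR:-/var/lib/docker/image/overlay2/distribution/v2metadata-by-diffid/sha256}"

blob_digest_for_diffid() {
    # $1 is a diff ID such as "sha256:<64 hex chars>"; the prefix is optional.
    diffid="${1#sha256:}"
    file="${METADATA_DIR}/${diffid}"
    if [ ! -r "$file" ]; then
        echo "no distribution metadata for ${diffid} (has the image been pushed?)" >&2
        return 1
    fi
    # Print the first sha256 digest recorded in the metadata file.
    grep -o 'sha256:[0-9a-f]\{64\}' "$file" | head -n 1
}

# Example usage (requires Docker, and usually root to read the metadata):
#   docker inspect --format '{{json .RootFS.Layers}}' tomwillfixit/healthcheck:latest
#   blob_digest_for_diffid sha256:<diff ID of the layer holding helloworld.bin>
```

The error message mirrors the note above: if no metadata file exists for the diff ID, pushing the image first may be what creates it.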

Step 3 : Add a Label which points to helloworld.bin

We know where the “helloworld.bin” file can be found but how can we pull just that one file from the Docker Hub? For the purpose of this post I’m going to add a label to the “healthcheck” image which tells me which layer “helloworld.bin” is in. It looks like this :

This new image with the label applied is then pushed to Docker Hub.
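One way to apply such a label is the “--label” flag of “docker build”. The sketch below only prepares and validates the label value; the helper name “label_for_digest” and the validation step are additions for illustration, not part of the original workflow. A label changes only the image configuration, so the layer digests should stay the same after the rebuild.

```shell
# Sketch: build the value for a --label flag from a blob digest.
# HELLOWORLD_BIN is the label name queried later in this post; the format
# check is an assumption added here to catch truncated IDs early.

label_for_digest() {
    case "$1" in
        sha256:*) hex="${1#sha256:}" ;;
        *) echo "expected sha256:<hex>, got: $1" >&2; return 1 ;;
    esac
    # A sha256 digest is exactly 64 lowercase hex characters.
    if [ "${#hex}" -ne 64 ] || printf '%s' "$hex" | grep -q '[^0-9a-f]'; then
        echo "malformed digest: $1" >&2
        return 1
    fi
    printf 'HELLOWORLD_BIN=%s\n' "$1"
}

# Example usage (requires Docker):
#   docker build --label "$(label_for_digest "$digest")" -t tomwillfixit/healthcheck:latest .
#   docker push tomwillfixit/healthcheck:latest
```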

Step 4 : Extracting “helloworld.bin” from Docker Hub

I’m using curl and jq to extract “helloworld.bin”. Firstly I authenticate myself with Docker Hub using “docker login” and then set some variables.

reg="registry.hub.docker.com"

repo="tomwillfixit"

image="healthcheck"

tag="latest"

name="${repo}/${image}"

token=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${name}:pull" | jq -r .token)

At this point we have our authentication token and we are ready to query the “HELLOWORLD_BIN” label that we added earlier. This label contains the sha256 value for the location of the “helloworld.bin” file.

curl -s -H "Authorization: Bearer $token" "https://${reg}/v2/${name}/manifests/${tag}" | jq -r '.history[0].v1Compatibility | fromjson | .container_config.Labels | to_entries[] | select(.key | startswith("HELLOWORLD_BIN")) | .value'
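The command above relies on the schema 1 manifest format (the “history” and “v1Compatibility” fields). If a registry returns a schema 2 manifest instead, those fields are absent, and the label can be read from the image config blob. Here is a minimal sketch of that alternative. It assumes $reg, $name, $tag and $token are set as above and that the tag resolves to a single-platform image rather than a manifest list. The function names are hypothetical.

```shell
# Sketch: read a label via a schema 2 manifest and the image config blob.
# Assumes $reg, $name, $tag and $token are already set, and that the tag
# points at a single-platform image (a manifest list has no .config.digest).

label_from_config() {
    # Reads an image config JSON document on stdin and prints the label named $1.
    jq -r --arg key "$1" '.config.Labels[$key] // empty'
}

fetch_label() {
    # Ask for the schema 2 manifest and pull out the digest of the config blob.
    config_digest=$(curl -s \
        -H "Authorization: Bearer $token" \
        -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
        "https://${reg}/v2/${name}/manifests/${tag}" | jq -r '.config.digest')
    # Download the config blob (small JSON document) and read the label from it.
    curl -s -L -H "Authorization: Bearer $token" \
        "https://${reg}/v2/${name}/blobs/${config_digest}" | label_from_config "$1"
}

# Example usage:
#   fetch_label HELLOWORLD_BIN
```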

In this example, “sha256:2db578c3bba06cf12b67ed42e72b8d0582e62dc2bde2fdcdaf77cb297fbd4fcb” is returned and this is the blob which holds the “helloworld.bin” file.

Now we download and extract the blob :

curl -s -L -H "Authorization: Bearer $token" "https://${reg}/v2/${name}/blobs/sha256:2db578c3bba06cf12b67ed42e72b8d0582e62dc2bde2fdcdaf77cb297fbd4fcb" | tar -xz
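The pipe above unpacks everything in the blob. A slightly more careful variant, sketched below, saves the blob to a file, checks it against the expected sha256 value, and extracts only the file of interest. The function names, the file name “layer.tar.gz”, and the member path “bin/helloworld.bin” (no leading slash) are assumptions for illustration.

```shell
# Sketch: verify a downloaded blob against its digest, then extract one file.
# File names and the member path below are illustrative assumptions.

sha256_of() {
    # Use sha256sum where available (most Linux systems), else shasum (macOS).
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 "$1" | cut -d ' ' -f 1
    fi
}

extract_from_blob() {
    # $1 = blob file (gzipped tar), $2 = expected digest,
    # $3 = path of the member inside the tar, $4 = destination directory
    blob="$1"; expected="${2#sha256:}"; member="$3"; dest="$4"
    actual=$(sha256_of "$blob")
    if [ "$actual" != "$expected" ]; then
        echo "digest mismatch: expected $expected, got $actual" >&2
        return 1
    fi
    mkdir -p "$dest"
    # Extract only the requested member rather than the whole layer.
    tar -xzf "$blob" -C "$dest" "$member"
}

# Example usage:
#   curl -s -L -H "Authorization: Bearer $token" \
#     "https://${reg}/v2/${name}/blobs/${digest}" -o layer.tar.gz
#   extract_from_blob layer.tar.gz "$digest" bin/helloworld.bin ./out
```

Checking the digest first is cheap, and it catches truncated downloads or an error page saved in place of the blob.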

Summary

In summary, we added a label at build time to tell us the location of a specific file called “helloworld.bin”, then we downloaded just the blob referenced by the sha256 value. In this example the image is small and we don’t really gain much but in a real world example the image may be a few hundred megabytes and being able to extract files without pulling the whole image may be useful for someone.

What’s next?

I’m going to wrap all this up into a script that can be baked into our base images and start labelling artifacts of interest at build time so they can be extracted if needed. Is this something that Docker may provide in future? An “ARTIFACT” directive to expose certain files for download from an image? I’ve no idea but it might have some cool use cases.

Update : “docker artifact” CLI plugin code : https://github.com/tomwillfixit/docker-artifact

Thanks for reading.

#StayHome #StaySafe #StayContained
