Spot the Docker difference
Can you use the Docker Registry to recreate a Dockerfile?
I wanted to explore what we can find out by inspecting registry information about Docker images. Could we tell the difference between different images? Could we recreate the Dockerfiles they were built from?
My first step is to look at the differences in the registry information about two images built from the same Dockerfile.
For experimentation purposes I’ve created a very simple Dockerfile:
I’ve used this to create two images which I’ve pushed to Docker Hub called lizrice/imagetest and lizrice/sameimage. The only thing that’s different about them is the name.
I wrote a quick Python script to pull down the registry information for a public image and print out the manifest and headers that the registry returns. Here’s the diffed output for those two images built from the same Dockerfile.
Superficially it looks like there are a lot of differences here, but on closer inspection there are a lot of similarities. Let’s start by looking at what the script is printing out.
- Lines 1 and 18 are names printed by my script — so of course these are different.
- Lines 2–17 are the manifests.
- Line 19 contains the headers.
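For reference, here is a minimal sketch of a script along these lines. It is not the original script: it assumes Docker Hub's anonymous token endpoint at auth.docker.io and the registry at registry-1.docker.io, the function names are my own, and a registry may no longer serve Schema 1 manifests even when asked for them.

```python
import json
import urllib.request

# Assumed Docker Hub endpoints; other registries will differ.
AUTH_URL = "https://auth.docker.io/token"
REGISTRY_URL = "https://registry-1.docker.io"
SCHEMA1_TYPE = "application/vnd.docker.distribution.manifest.v1+json"


def token_url(repo):
    """URL that hands out an anonymous pull token for a public repository."""
    return f"{AUTH_URL}?service=registry.docker.io&scope=repository:{repo}:pull"


def manifest_url(repo, tag="latest"):
    """URL of the manifest for repo:tag in the Registry API V2."""
    return f"{REGISTRY_URL}/v2/{repo}/manifests/{tag}"


def fetch_manifest(repo, tag="latest"):
    """Return (manifest, headers) for a public image. Needs network access."""
    with urllib.request.urlopen(token_url(repo)) as resp:
        token = json.load(resp)["token"]
    request = urllib.request.Request(
        manifest_url(repo, tag),
        headers={"Authorization": f"Bearer {token}", "Accept": SCHEMA1_TYPE},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp), dict(resp.headers)


def print_image_info(repo, tag="latest"):
    """Print the name, the manifest and the response headers, in that order."""
    manifest, headers = fetch_manifest(repo, tag)
    print(repo)
    print(json.dumps(manifest, indent=2, sort_keys=True))
    print(headers)
```

Calling `print_image_info("lizrice/imagetest")` and `print_image_info("lizrice/sameimage")` and diffing the two outputs would give a comparison of the kind discussed here.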
The headers contain a digest that uniquely identifies the image, but nothing from which you can learn anything about the image contents. So that’s no help for figuring out what was in the Dockerfile.
The only things that are different in the manifests for the images are the name and the signatures. We expect the signatures to be different — if they weren’t, they wouldn’t be serving much purpose as signatures. And the names, well, of course they aren’t the same.
But everything else is identical. So if we look at two images built from the same Dockerfile*, we can expect everything in the manifest, apart from their names and signatures, to be the same.
*Update: This isn’t always true!
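The comparison above can be sketched in a few lines of Python: drop the name and signatures fields and compare whatever is left. The function names are hypothetical, and the manifests in the usage example are made-up placeholders rather than the real registry output.

```python
def comparable_part(manifest, ignore=("name", "signatures")):
    """Copy of the manifest without the fields we expect to differ."""
    return {key: value for key, value in manifest.items() if key not in ignore}


def same_apart_from_name_and_signatures(manifest_a, manifest_b):
    """True if two manifests differ only in their name and signatures."""
    return comparable_part(manifest_a) == comparable_part(manifest_b)


# Placeholder manifests for illustration only.
first = {
    "name": "example/one",
    "tag": "latest",
    "fsLayers": [{"blobSum": "sha256:aaa"}],
    "signatures": [{"signature": "sig-one"}],
}
second = dict(first, name="example/two", signatures=[{"signature": "sig-two"}])
print(same_apart_from_name_and_signatures(first, second))
```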
What’s in the manifest
Let’s look in more detail at the manifest. The interesting information is in the fsLayers (file system layers) and history fields. As you probably know, Docker images are built from layers, and each layer in the image has both a file system layer and a history layer. There is a one-to-one correspondence between the fsLayer and history items in the manifest.
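As a rough sketch of that one-to-one correspondence, the two lists can be walked in step. This assumes the Schema 1 layout, where each fsLayers entry has a blobSum and each history entry holds a JSON string under v1Compatibility; the function name is my own.

```python
import json


def layer_pairs(manifest):
    """Pair each fsLayers blobSum with the decoded history entry at the same index."""
    fs_layers = manifest["fsLayers"]
    history = manifest["history"]
    if len(fs_layers) != len(history):
        raise ValueError("fsLayers and history should have the same length")
    pairs = []
    for fs_layer, entry in zip(fs_layers, history):
        # Each history entry wraps a JSON document in a string.
        config = json.loads(entry["v1Compatibility"])
        pairs.append((fs_layer["blobSum"], config))
    return pairs
```

Each pair then gives you the file system digest alongside the configuration recorded for the same layer.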
There are two layers in lizrice/imagetest:
Recall that this image is based on Alpine 3.3. If we look at the fsLayers from alpine:3.3, this is what we get.
That’s exactly the same SHA as the second of the layers in my test image.
This made me wonder what you get if you base an image on lizrice/imagetest. Here’s the very simple Dockerfile for lizrice/childimage:
LABEL com.label-schema.description='A child image'
And the fsLayers we get from this image:
Interesting! At the bottom of this list, we have that exact same layer from alpine:3.3, and the next one up is the same as lizrice/imagetest, which is the image that this child image was based on. But then, working up the list, there are two more layers with exactly the same SHA.
The reason for this is that the directives in the childimage Dockerfile don’t change the file system, they just change the configuration. We have to go to the history fields to see what’s happening there.
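If you want to spot these repeated layers programmatically, one simple approach is to count how often each blobSum appears in fsLayers. This is only a sketch with a hypothetical function name; a repeated digest is a hint that the layers didn't change the file system, not proof.

```python
from collections import Counter


def repeated_blobsums(manifest):
    """Map each blobSum that appears more than once to its number of appearances."""
    counts = Counter(layer["blobSum"] for layer in manifest["fsLayers"])
    return {digest: count for digest, count in counts.items() if count > 1}
```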
There are lots of details in the history fields that we don’t need to worry about for the purposes of this discussion. (You can use my script to take a look for yourself if you’re interested.)
Looking for correspondences with the Dockerfile, we find that each layer in the history contains a Cmd field that resembles the Dockerfile directive.
These Cmd fields from all the layers are presumably built into a script that gets run when the container starts.
Where the line doesn’t affect the container itself (such as MAINTAINER or LABEL directives), it’s turned into a no-op Cmd by being commented out. In fact, even the ADD file line is a no-op (but recall that file system changes are dealt with by fsLayers). In contrast, a RUN directive gets turned directly into something that is executed in that script.
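Going by that pattern, a Cmd field can be turned back into something resembling a Dockerfile line. The sketch below assumes the Cmd is a list such as `["/bin/sh", "-c", "..."]` and that no-op entries start with the `#(nop)` marker; the function name and the sample commands in the usage note are hypothetical.

```python
SHELL_PREFIX = "/bin/sh -c "
NOP_MARKER = "#(nop)"


def cmd_to_directive(cmd):
    """Best-effort guess at the Dockerfile line behind a history Cmd list."""
    if not cmd:
        return ""
    text = " ".join(cmd)
    if text.startswith(SHELL_PREFIX):
        text = text[len(SHELL_PREFIX):]
    if text.startswith(NOP_MARKER):
        # Commented-out entries already carry the directive name.
        return text[len(NOP_MARKER):].strip()
    # Anything else was executed in a shell, which is what RUN does.
    return "RUN " + text
```

For example, `cmd_to_directive(["/bin/sh", "-c", "#(nop) MAINTAINER someone@example.com"])` gives back the MAINTAINER line, while a plain shell command comes back prefixed with RUN.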
Here’s what we see in the Cmd fields for lizrice/childimage:
The top two layers in childimage are the same as the two lines in its parent. In other words, just as the fsLayers of a parent image are included in the child, so are the configuration layers in history.
In this example I’ve got identical MAINTAINER lines in both imagetest and childimage. As you can see, this results in a duplicate Cmd field in the history layer.
Adding and copying files
When the file system is modified by adding or copying files, the manifest gives us an identifier that presumably represents the modifications to the file system, which are then reflected in the equivalent layer in fsLayers. Note, however, that the hashes are not the same: the ADD file command for alpine:3.3 has a hash starting 86864, and the blobSum starts 6c123.
We can compare the ADD file between manifests to see if they are the same, but we can’t tell what they really are. You’d have to pull the image and inspect the file system to see the contents.
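Here is a small sketch of that comparison. It assumes the Cmd text contains something of the form `ADD file:<hex> in <path>` (or the COPY equivalent), which is my reading of the manifests rather than a documented format; the function names and the identifier in the usage note are made up.

```python
import re

# Assumed shape of an ADD/COPY entry in a history Cmd field.
ADD_PATTERN = re.compile(r"\b(ADD|COPY) (file:[0-9a-f]+) in (\S+)")


def added_files(cmd_text):
    """List (directive, identifier, destination) for each ADD/COPY found."""
    return [match.groups() for match in ADD_PATTERN.finditer(cmd_text)]


def same_added_files(cmd_a, cmd_b):
    """True if both Cmd strings reference the same file identifiers, in order."""
    ids_a = [identifier for _, identifier, _ in added_files(cmd_a)]
    ids_b = [identifier for _, identifier, _ in added_files(cmd_b)]
    return bool(ids_a) and ids_a == ids_b
```

With a placeholder identifier, `added_files("#(nop) ADD file:abc123 in /")` picks out the directive, the identifier and the destination. Matching identifiers only tell you the same content was added, not what that content is.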
Other fields in the manifest
If you care to look into the manifest in detail, you’ll see that MAINTAINER is used to populate the author field, and LABEL populates a Labels sub-field, but for the purpose of trying to recreate the Dockerfile we don’t need to look at this since the information is also in the Cmd field.
From this study we can draw two conclusions:
- If you’ve got two identical images — or rather, two images built from identical Dockerfiles — their fsLayers and history will be identical.
- A child image starts with the same fsLayers and history as its parent.
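The second conclusion suggests a simple test for a possible parent: the parent's fsLayers should appear, in order, at the bottom of the child's list. This is a sketch with hypothetical function names, not MicroBadger's actual matching logic, and a full check would compare the history entries in the same way.

```python
def blobsums(manifest):
    """The blobSum digests from fsLayers, in manifest order (base layer last)."""
    return [layer["blobSum"] for layer in manifest["fsLayers"]]


def could_be_parent(parent, child):
    """True if the child's layer list ends with all of the parent's layers."""
    parent_layers = blobsums(parent)
    child_layers = blobsums(child)
    if not parent_layers or len(parent_layers) >= len(child_layers):
        return False
    return child_layers[-len(parent_layers):] == parent_layers
```

A match only makes the image a candidate parent, since two unrelated images could share the same bottom layers.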
Does this mean you can reconstruct a Dockerfile from an image’s manifest? Well, sort of. When the file system is modified by a layer you can’t say what the file contents are, but you can compare with the manifest of another image to see if they are identical.
We’ve made use of this in MicroBadger to display as much as we can in terms of recreating the Dockerfile, including identifying possible parent images for child images. Here’s what the layer break-out for my test image looks like, showing how we’ve been able to match this to its parent, alpine:3.3
All this is based on the Manifest Schema 1 of Docker’s Registry API V2. Docker are right now in the process of moving to a second version of the schema which doesn’t have the config information broken out into layers.
Update — thanks to Stephen Day’s comments on Reddit I now know that the config information is available in the Schema V2. The config object is listed in the manifest, and can be pulled independently of the layers. Stephen wrote the spec so he should know :-)
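To make that concrete, here is a hedged sketch of how the config could be located and read under Schema 2. It assumes the manifest carries a config object with a digest, that the blob is served from the registry's blobs endpoint, and that the config JSON has a history list whose entries may include created_by; the function names and the default registry URL are my own choices.

```python
def config_blob_url(repo, manifest, registry="https://registry-1.docker.io"):
    """URL of the config blob referenced by a Schema 2 manifest."""
    digest = manifest["config"]["digest"]
    return f"{registry}/v2/{repo}/blobs/{digest}"


def created_by_lines(image_config):
    """The created_by text of each history entry in the image config, in listed order."""
    return [entry.get("created_by", "") for entry in image_config.get("history", [])]
```

The blob itself would be fetched with the same kind of authenticated request as the manifest.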