Last night we migrated a key service to a new environment. Everything went smoothly, so we concluded the maintenance window early, exchanged a round of congratulations, and killed the Zoom call. This morning I settled in at my desk and realized that this key service’s builds were breaking on master. My initial, and I think understandable, impulse was that I had somehow broken the build when I merged my work branch for the migration into master the night before. Nothing pours sand on your pancakes like waking up to find that the thing you thought went so well last evening is now a smoking pile of ruin.
Except that wasn’t the problem. There was no difference between the commit that triggered the last good build and the merge commit to master that was now failing. I’m fine with magic when it fixes problems. We even have an emoji for it. “Hey, that thingamajig is working now!” Magic. I do not like it when it breaks things, although it is possible to use the same emoji for those cases as well. The first clue as to what was really happening was that the broken thing was a strict requirements check we run on a newly built image before unit tests. It has a list of packages it expects to find, and fails if it finds any discrepancy between that and the image contents.
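For concreteness, here’s a minimal sketch of that kind of check. The image name, the manifest file, and the use of pip freeze are all assumptions for illustration, not our actual tooling (the same idea works for system packages with dpkg -l or similar):

```bash
#!/usr/bin/env bash
# Hypothetical names throughout: "our-service:candidate" and
# expected-packages.txt stand in for whatever the real check uses.
set -euo pipefail

# List every pip package in the freshly built image.
docker run --rm our-service:candidate pip freeze | sort > actual-packages.txt

# Fail the build on any discrepancy with the expected manifest.
if diff <(sort expected-packages.txt) actual-packages.txt; then
  echo "image contents match the manifest"
else
  echo "unexpected package discrepancy; failing build" >&2
  exit 1
fi
```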
In this case it was complaining about a specific set of libraries, and another member of our team quickly tracked them down to the official python 2.7.17 image on Docker Hub. Awesome. If the libraries are in the official image, then we can add them to our list and fix the problem. Except… why weren’t they already in the list? We built the list off of what we expected to be in the official python image, so it should have matched the image’s actual contents. The reason it didn’t is that we were not using python 2.7.17 when we built that list; we were using an earlier version of the image. So how did we end up building off of 2.7.17 this morning?
The answer was in the Dockerfile, where the base image in the FROM directive was given as python:2.7. The tag 2.7 is what’s known as a shared tag. In contrast to simple tags, which always point to a single version of an image built for one or more architectures, shared tags identify a set of images, each of which may be built for a different platform and architecture. In both cases it’s up to the host Docker daemon to choose which image to pull. I’m not sure what version of the image we built our manifest off of (we do know, and I could go look, but it’s late and I’m lazy), but at some point afterward the 2.7 tag was applied to a new release, and when we ran a build this morning it pulled a new image, with new stuff in it, and things broke.
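As a sketch of the distinction (the pinned version below is a placeholder, not the release we were actually on):

```dockerfile
# Shared tag: identifies a set of images, one per platform and
# architecture, and moves to each new 2.7.x release. A rebuild
# can silently pull different bits than the last one did.
FROM python:2.7

# Simple tag: pinned to a single release (placeholder version).
# FROM python:2.7.16
```

You can see the full set of per-platform images a shared tag currently identifies with docker manifest inspect python:2.7.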
Which is mostly good, by the way. That is exactly what the check is supposed to do. An image with unlisted packages did not get into production. The only downside was the broken build that needed to be fixed on a Thursday morning after a maintenance window the night before. But there’s a valuable lesson in it: if you don’t use a tag that identifies a specific image, then you don’t really know what bits are going to end up on your server disk. And it doesn’t have to be a shared tag to cause a problem. It just has to be a tag that can be removed from one image and applied to another. The latest tag is a common example. Shared tags presumably have their uses, like pulling multiple images at once, but they have no more place in a production Dockerfile than latest does.
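And if you want a reference that can’t be moved at all, pin the digest. Here’s one way to recover the digest of an image you’ve already pulled; the output shown is just the shape of the result, not a real digest:

```bash
# Print the repo digest of the locally pulled image.
docker inspect --format '{{index .RepoDigests 0}}' python:2.7
# => python@sha256:<64 hex characters>

# That digest can then be pinned in the Dockerfile:
# FROM python@sha256:<64 hex characters>
```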
Of course, this is all convention, and image authors are free to tag things however they wish. Probably someone will chime in with a comment about how I still don’t really know what bits are going to end up on the disk unless we build the image ourselves from scratch (which in some cases we do). True enough, but the whole topic of trusting images from Docker Hub is another deep discussion. The thing that bit us here isn’t anomalous or unexpected: the image a shared tag identifies will definitely change at some point; that’s more or less the reason shared tags exist. Interestingly, this problem hit us twice in one day, as later our on-call person had to help fix a Node.js build that broke for the exact same reason. So, probably not the way I would have chosen to remind ourselves to be specific with image tags, but a good reminder nevertheless.