Every software engineer who has worked at a once-small startup has had the pleasure of seeing tools evolve from infancy to maturity, ultimately growing into modern technology. A long and frequently amusing list of growing pains never fails to accompany this experience. In fact, the part of hiring I look forward to most is getting a beer and sharing some war stories with a new generation of employees.
We tell them about the way things used to be, back in the “dark ages”: the time when a misconfigured application DDoSed our own website; the time a malfunctioning link shared in company chat caused a service outage because too many curious people clicked it; or the (shamefully, several) times a test environment accidentally wiped a production database table because we lazily used the same credentials. We hope they will learn from our mistakes, and personally, I hope they will retell these absurd myths, giving them a life of their own.
This is one such story: the mystery of why make failed with an SSH error.
It may not surprise you that Xandr follows a service-oriented architecture for its real-time ad serving platform. The RPC strategy may also seem familiar: a single repository contains hundreds of protobuf message type definitions, and applications each compile the ones needed to interact with their respective dependencies. Though protobuf is language agnostic, and we do implement services in a variety of languages, the majority of the real-time platform is implemented in C. To facilitate development in C, we created a tool which parses each of the protobuf message specifications and generates some of the additional boilerplate needed to serialize and interpret messages in our code. Additionally, we created a 10-byte message header to prepend to serialized messages that contains some useful metadata. As this tool was originally written when we were still called AppNexus, the message format was dubbed AnMessage, and the tool to generate the additional code was called anm-gen.
At this point, you may be wondering why we had to create this tool. After all, protobuf is extensible, and code generation can be handled with protoc. The reasoning is less sound now, but the answer is that anm-gen and AnMessage were created at a time when protobuf didn’t support some of our needs. For example, AnMessage headers contain a field specifying (for the generated data) which protobuf message type was used, and another field to specify which version of that message type was used. Additionally, we generate code to parse or serialize JSON-encoded data.
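To make the header idea concrete, here is a minimal sketch of a fixed 10-byte header carrying a message type and a version. The field order, widths, and the payload-length field are assumptions made up for illustration; the real AnMessage layout is not described here.

```python
import struct

# Hypothetical layout (not the real AnMessage format):
# message type id (4 bytes), version (2 bytes), payload length (4 bytes).
# "!" selects network byte order with no padding, so the total is 10 bytes.
HEADER = struct.Struct("!IHI")


def wrap(message_type: int, version: int, payload: bytes) -> bytes:
    """Prepend the header to an already-serialized payload."""
    return HEADER.pack(message_type, version, len(payload)) + payload


def unwrap(frame: bytes):
    """Split a frame back into (message_type, version, payload)."""
    message_type, version, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return message_type, version, payload
```

A receiver can read the first 10 bytes, look up which message type and version to expect, and only then hand the payload to the right parser.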
This is one of those aforementioned tools that has grown from infancy to maturity. Originally, our message specifications were not protobuf, but JSON. Originally, anm-gen was not written in Golang, but Python. Recently, we added support for Differential Encoding of repeated integer types to conserve bandwidth. This project was successful, compressing some fields by more than 50% of their original size. However, while profiling our largest application, I discovered a large number of CPU samples with symbols related to resizing those arrays of diff-encoded integers. Glancing through the anm-gen code, I confirmed that we were not correctly pre-allocating space for the data, and set about fixing it.
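For readers unfamiliar with the technique, here is a rough sketch of differential encoding and of the kind of pre-allocation fix described above. It is written in Python for brevity and is not the code anm-gen generates; the function names are hypothetical. The saving comes from the fact that small differences tend to take fewer bytes in a variable-length integer encoding such as protobuf's varints.

```python
def delta_encode(values):
    """Store each integer as the difference from the previous one."""
    out = [0] * len(values)  # size the output once, up front
    previous = 0
    for i, value in enumerate(values):
        out[i] = value - previous
        previous = value
    return out


def delta_decode(deltas):
    """Rebuild the original integers with a running sum."""
    out = [0] * len(deltas)  # the element count is known, so no regrowth is needed
    total = 0
    for i, delta in enumerate(deltas):
        total += delta
        out[i] = total
    return out
```

Because the decoder knows how many elements are coming, it can allocate the destination once instead of growing it as each element arrives, which is the behavior the profiler was pointing at.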
I started out the same way I start all my changes. I pull the latest code for the project and check out a development branch. I make some changes which I am confident will work, and try to build and install the new version of the tool, so I can test it against some of our code locally.
Normally, I’d expect sudo make install to have dependencies on the build itself, so it’s understandable that this didn’t recompile. I lazily hit the up arrow on my keyboard and deleted the install target from the command.
I wasn’t anticipating make to fail. I definitely wasn’t anticipating the SSH error Permission denied (publickey) in the output. My first instinct was that, duh, I shouldn’t run this as root. Perhaps the SSH error is benign, and my user’s public key would be accepted.
Oh. Now it works? In my early career, this would have been the end of the story. By now, though, I am old and disgruntled, and need to know just why my SSH key is required. I want to solve this mystery, in part so I can eliminate a build-time dependency on SSH, but also out of morbid curiosity.
Let’s take a look at the Makefile line that is triggering the error.
cd /home/jshufro/anm-gen/cmd/anm_gen && \
go vet -mod=vendor && \
go install -mod=vendor -ldflags "..."
cd is benign. The folder exists. I try running sudo go vet -mod=vendor.
Now we’re getting somewhere. The question of the SSH error is unanswered, but we know it’s coming from the invocation of go here. To dig further, I decide to check the output of go version, so I can read up on what go vet does (spoiler alert: it definitely isn’t supposed to SSH).
I start to wonder if I’m going insane. Without sudo, this prints the predictable.
So, the binaries must be different?
The binaries are different! As you can see, the root invocation uses /usr/bin/go and the unprivileged invocation yields a binary in a local shim directory owned by goenv, a very useful go version management tool.
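The reason one command name can resolve to two binaries is ordinary PATH lookup: the shell takes the first executable it finds, and sudo commonly runs with a different search path than your login shell. The sketch below models that lookup. The shim directory shown is an assumption based on goenv's usual default, and the two PATH lists are illustrative rather than copied from my machine.

```python
import os


def resolve(command, path_dirs, is_executable=None):
    """Return the first directory entry on the search path that is executable."""
    if is_executable is None:
        # Default check against the real filesystem.
        def is_executable(path):
            return os.path.isfile(path) and os.access(path, os.X_OK)
    for directory in path_dirs:
        candidate = directory + "/" + command
        if is_executable(candidate):
            return candidate
    return None


# Illustrative state: two different programs that are both called "go".
INSTALLED = {"/home/jshufro/.goenv/shims/go", "/usr/bin/go"}
USER_PATH = ["/home/jshufro/.goenv/shims", "/usr/local/bin", "/usr/bin"]
ROOT_PATH = ["/usr/local/sbin", "/usr/local/bin", "/usr/sbin", "/usr/bin"]
```

With these inputs, resolve("go", USER_PATH, INSTALLED.__contains__) picks the shim, while the same call with ROOT_PATH falls through to /usr/bin/go.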
Well, this explains why omitting sudo builds successfully, but we still don’t know why /usr/bin/go is triggering SSH nor why it’s refusing to print a version.
I decided to check the package data for the binary to see which package installed it.
This was my moment of epiphany, but you, I’m sure, are confused about what appnexus-maestro-tools might be. AppNexus was founded in 2007 as a cloud computing company, and would later pivot to offering its real-time advertising platform. In the early days of the company, this package of perl scripts was put together to make it easier to SSH from host to host. The script is simple: it fetches a list of service hostnames from an internal tool called Maestro, pattern matches it against the input, and starts an SSH session with whichever host it finds. If more than one host matches the pattern, a list of hosts is returned and no SSH session is started. For example, to log onto my devbox, I type go 558 jshu which matches only my internal hostname of 558.bm-jshufro.user.nym2 and connects me. Since golang didn’t exist in 2007 (and wouldn’t for a couple more years), nobody thought twice about naming this utility go. It would take 12 years, but that decision would cause the issue I’m writing about today.
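The matching rule is easy to picture in code. This is a small Python sketch, not the original perl: it assumes each search term is a plain substring and that a host must contain every term, which is consistent with the example above but is still a guess about the real script. Only the first hostname in the list is real; the others are made up.

```python
def find_hosts(hostnames, terms):
    """Return every hostname that contains all of the search terms."""
    return [host for host in hostnames if all(term in host for term in terms)]


def pick_host(hostnames, terms):
    """Return the host to connect to, or None unless exactly one host matches."""
    matches = find_hosts(hostnames, terms)
    return matches[0] if len(matches) == 1 else None


# Illustrative inventory standing in for the list fetched from Maestro.
HOSTS = [
    "558.bm-jshufro.user.nym2",
    "559.bm-example.user.nym2",
    "101.api-example.prod.nym2",
]
```

Here pick_host(HOSTS, ["558", "jshu"]) returns my devbox, while pick_host(HOSTS, ["user"]) returns None because two hosts match and the script would print the list instead of connecting.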
Only one question remained: where was go trying to SSH into? I read the source and discovered it passed the input along to a script called mfind which is provided by the same package.
go parses the -mod=vendor portion as a parameter, which it ignores, returning the Unknown option: mod error. Subsequently, mfind is invoked like this:
There it is. The result of pattern matching finds vet in ‘dspcreativetesting’.
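Put together, the chain of events looks roughly like the sketch below. It is a Python approximation of the behavior I observed, not the perl source: the option handling is an assumption that mimics the Unknown option: mod message, and the host list contains only the two names mentioned in this post.

```python
def split_args(argv):
    """Separate dash-prefixed options (warned about, then dropped) from search terms."""
    warnings, terms = [], []
    for arg in argv:
        if arg.startswith("-"):
            name = arg.lstrip("-").split("=", 1)[0]
            warnings.append("Unknown option: " + name)
        else:
            terms.append(arg)
    return warnings, terms


def matching_hosts(hostnames, terms):
    """Substring match: keep hosts that contain every remaining term."""
    return [host for host in hostnames if all(term in host for term in terms)]


# The arguments make passed to what it believed was the Go toolchain.
warnings, terms = split_args(["vet", "-mod=vendor"])
hosts = matching_hosts(["558.bm-jshufro.user.nym2", "dspcreativetesting"], terms)
```

After the option is discarded, the only term left is vet, and the only host containing those three letters in a row is the one hiding them inside "creativetesting".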
This felt like the perfect synergy of errors. sudo make was using a coincidentally named binary which was not intentionally installed. That binary tolerated the parameters meant for golang-go and pattern matched exactly one host within Xandr’s ecosystem.
So, how did my environment get into this state? If I try to install both appnexus-maestro-tools and golang-go, dpkg returns a useful error…
dpkg: error processing archive /var/cache/apt/archives/golang-go_2%3a1.6-1ubuntu4_amd64.deb (--unpack):
trying to overwrite '/usr/bin/go', which is also in package appnexus-maestro-tools 0.2.25
This means that I couldn’t have possibly installed both. I didn’t install goenv without installing golang, so something else must have overwritten the global binary. Suddenly I remembered: a few months earlier, I ran dpkg -L anm-gen | xargs sudo rm -rf in order to clean up the aptitude-installed version of anm-gen and replace it with a locally built one. I forgot that dpkg -L lists /usr among its entries for any package that installs files within it. This deleted my entire /usr directory. Using netcat to copy a colleague’s /usr directory, I was able to restore my devbox to a mostly-working state. He had appnexus-maestro-tools installed, and I had golang-go installed. The Frankensteinian combination of our environments worked perfectly until I ran