On a recent trip I had the chance to engage first-hand with some enterprise-scale Angular customers. What I learned reinforces an impression I’ve gotten about our industry as micro-service architecture rolls out: we no longer do Continuous Integration (CI).
These companies seem self-assured that they have good testing practices: they have an automated test suite, and they run it continuously. That’s really great: it catches some bugs earlier and makes for happier developers, who don’t have to go back as often to debug through the sludge of years-old code. So you have the C in CI, no problem.
We must remember what the I in CI stands for. What are we integrating?
Any large organization requires breaking work down into departments. Ever wonder why a space agency’s control room has so many desks, each with its own controller? Rocketry and space flight are such a massive technical undertaking, with different scientific and engineering disciplines working together, that they’ve built an operations process which gives them immediate access to information from each.
Imagine if instead of the big control room, the astronauts were just talking to a “frontend” team who then had to consult what the “backend” team thought, sometimes with a several-day round-trip, and that team in turn was using outdated manuals for the spacecraft. Each department might be doing the right thing on its own, but they are not “integrated”. They don’t act as a single unit.
The problem I see in companies adopting a microservice architecture is similar. Each team has tested its own components and is ready for deployment. Then when it’s time to get some new software up there to our astronauts (or other users), we discover that it doesn’t work in the QA environment, and it takes a day to trace the failure to a change in some other department. This is really bad for the business. If it takes weeks to integrate the software each time we want to deploy, then critical business initiatives have to budget extra time for software changes.
This delay in shipping creates a terrible feedback loop in the organization. Because it takes weeks to release a change through “the process”, teams want more autonomy over their own release schedules. It seems like this should accelerate things, but of course it exacerbates the problem. These new autonomous units run only their own tests, so their interactions with the other systems they must integrate with in production go untested.
The usual software architect or consultant replies, “This is not a problem because rigorous API contracts are drawn at the boundaries of each service. Each part is tested against that API”. It sounds like a great answer. Does it work?
There’s a guy named Hyrum Wright who worked on the C++ team at Google. He had to make global changes across Google’s monorepo to migrate everyone to newer versions of some base library API. That API had been specified, was tested against, and was not being changed. Yet library changes would inevitably cause a bunch of test failures across Google. What were we doing wrong?
There’s now an observation called “Hyrum’s Law” based on this experience:
With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.
Another way to say this is, while your API surface is constrained in spirit, it is unconstrained in practice. Change how a result is sorted? Someone relied on the prior ordering.
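To make the sorting example concrete, here’s a minimal sketch — the service, its data, and the client are all hypothetical, not any real API:

```typescript
// Hypothetical service: the contract promises only "a list of users",
// and says nothing about their order.
interface User {
  id: number;
  name: string;
}

function listUsers(): User[] {
  // Today the implementation happens to return users ordered by id,
  // purely as a side effect of how the storage layer iterates.
  return [
    { id: 1, name: "Ada" },
    { id: 2, name: "Hyrum" },
    { id: 3, name: "Grace" },
  ];
}

// A client that accidentally depends on that observable-but-unpromised
// behavior: it assumes the first element is the oldest account.
function oldestAccount(): User {
  return listUsers()[0];
}

// This holds today, and silently breaks the day the service starts
// returning users ordered by name instead.
console.assert(oldestAccount().id === 1);
```

The client’s own tests pass, the service’s own tests pass, and the contract was never violated — yet re-sorting the result is a breaking change in practice.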
Now comes the incorrect conclusion from some architects: “So then that was a bug in the client of the API. The contract never guaranteed that the data would be sorted. Shame on them”. What is the business to make of this? The client software team made an avoidable error and is therefore negligent? Passing the blame this way doesn’t actually solve the business problem: these integration errors happen and are preventable. API contracts are not a sufficient way to prevent them.
As software engineers, we know that even if we are quite diligent, we’ll make accidental assumptions that happen to hold today. The solution to developers making human mistakes is to add QA that catches them. Therefore, we should be writing (and continuously running) tests that exercise the *entire stack* we’ll deploy: integrating it continuously. Only this can assure reliable delivery of our new code. And remember, the astronauts are depending on us.
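A sketch of the difference, with two hypothetical in-process “services” standing in for real deployed ones (the names, units, and values are invented for illustration): a unit test built on a mock encodes the team’s own assumption and passes, while a test that wires the real pieces together exposes the mismatch before production does.

```typescript
// "Backend" service: returns altitude in meters.
// Its contract never actually said kilometers.
function backendTelemetry(): { altitude: number } {
  return { altitude: 408000 }; // meters
}

// "Frontend" service: formats telemetry, assuming the number is km.
function formatAltitude(t: { altitude: number }): string {
  return `Altitude: ${t.altitude} km`;
}

// Unit test with a mock: the frontend team's assumption is baked into
// the mock itself, so this passes and everyone ships confidently.
const mockTelemetry = { altitude: 408 };
console.assert(formatAltitude(mockTelemetry) === "Altitude: 408 km");

// Integration test against the real backend: the mismatch that would
// otherwise surface in the QA environment (or in orbit) shows up here.
console.assert(formatAltitude(backendTelemetry()) !== "Altitude: 408 km");
```

Neither team made an error its own test suite could have caught; only exercising the integrated stack reveals the bad assumption.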
Postscript: Another time, I hope to write about more of the steps needed to implement true CI. The first step is to make a recipe which reliably creates a full new QA environment on someone’s computer. Then make it easy to run this on-demand against the current sources. A later step is to have incremental builds (as under Bazel) so that CI resource consumption stays within tighter bounds as the system grows.
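As a rough sketch of what that first step might look like — every service name and image below is invented, not a real system — a single declarative recipe, such as a docker-compose file, lets anyone stand up the whole stack on their own machine:

```yaml
# Hypothetical compose recipe for a disposable, on-demand QA environment.
services:
  backend:
    build: ./backend          # built from current sources, not a stale image
    environment:
      - DATABASE_URL=postgres://qa:qa@db:5432/qa
    depends_on:
      - db
  frontend:
    build: ./frontend
    ports:
      - "4200:80"             # the whole stack, reachable on one machine
    depends_on:
      - backend
  db:
    image: postgres:16
    environment:
      - POSTGRES_USER=qa
      - POSTGRES_PASSWORD=qa
      - POSTGRES_DB=qa
```

The point is less the tool than the property: one command, current sources, full stack — so integration testing stops being a special event.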