Test-Driven Infrastructure
A couple of weeks ago, we picked up a dev improvement task to try to reduce the number of errors in our logs. There’s still a lot of low-hanging fruit there, so we’re looking at the most frequently occurring errors and dealing with them one by one. This time, we picked the Jetty errors we were getting where people were trying to hit endpoints that didn’t exist.
This clearly wasn’t legitimate traffic, and was potentially harmful, so there was real merit in preventing this: well worth spending some time on. It was also quite straightforward to address: we run nginx in front of each of our exchanges to handle SSL termination, and filtering the traffic is well within nginx’s wheelhouse.
So we set to work whitelisting our endpoints in the nginx config, and at first everything seemed nice and straightforward.
Incrementalism!
We didn’t want to simply block all traffic that wasn’t on a whitelist: that’s a very aggressive move to make in one step, since it relies on creating a complete whitelist up front. Rather, we wanted to deliver the work incrementally. How does that work with whitelists?
Well, we started off with no filtering, which specified a location as follows:

location / {
    proxy_pass http://$upstream;
}

That just means anything that hits the root gets proxied to our application. Eventually we’re going to remove that in favour of more specific whitelists, but before then we can start locking down individual paths better.
For example, we have a couple of health checks: one just checks the application is alive, and another checks our configuration is up to date.
We can allow access to those endpoints, but block any other paths starting with health with some config like so:
location /health {
    location /health/alive {
        add_header Allow "GET" always;
        if ( $request_method !~ ^(GET)$ ) {
            return 405;
        }
        proxy_pass http://$upstream;
    }

    location /health/check-deals-were-recently-updated {
        add_header Allow "GET" always;
        if ( $request_method !~ ^(GET)$ ) {
            return 405;
        }
        proxy_pass http://$upstream;
    }

    location /health {
        deny all;
    }
}
This permits us to slowly roll out changes to the various different paths, making sure our approach is safe in production on less-contentious endpoints before rolling out more widely, and verifying one small piece of configuration at a time. Eventually, we’ll have whitelisted everything, and then we can stop accepting anything else on the root.
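For illustration, the end state we’re working towards might look something like this (a sketch, not the exact config we deployed):

```nginx
# Once every legitimate endpoint has its own whitelisted location block,
# the catch-all can stop proxying and instead deny any request that
# didn't match a more specific location.
location / {
    deny all;
}
```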
So we started making some changes of this ilk, happy we had a basic approach. As we thought around the problem a bit more, an important question came up.
How should we handle changes?
Our application lived in one repo, and our infrastructure lived in a separate repo containing our puppet configuration. The issue was that we’d now introduced a dependency: the nginx configuration in our puppet project had to reflect the application. When we add a new endpoint to our application, we also need to modify the puppet project to whitelist that new endpoint.
How should we manage keeping these things in sync?
This is the sort of thing which is incredibly easy to get wrong.
- Firstly, you need to know that, when making certain very specific changes to one project, further changes need to be made elsewhere.
- Secondly, you also need to remember to do so.
Various approaches were raised.
One was to monitor the problem: try to find ways to list the endpoints we had, and do some sort of verification that nginx had those paths whitelisted. It wasn’t clear whether listing the endpoints was automatable, and if not, that just means we have something else to keep in sync, with all the same vulnerabilities. This didn’t seem to solve the fundamental problem.
Another approach seemed far more effective: tests. If we ran all our acceptance tests through nginx, and they all passed, that would mean nginx was letting all the traffic through. At least: all traffic important enough to merit an acceptance test. If we add a new endpoint, with an acceptance test, that test will feed back to us that we need to add to the nginx whitelist.
Everyone agreed that being able to test in this way was desirable. The issue was that it looked like quite a lot of work to get there.
Convenience vs thoroughness
It’s super important to have a good workflow while developing. It should be easy to run the tests, and for them to run quickly. If that’s not true, there are two consequences:
- Firstly, it will consume time, attention, and happiness from developers.
- Secondly, they won’t run the tests as frequently. That means more work being done on top of broken code.
Our main exchange’s tests were built to be as easy to run as possible in the IDE. When I joined the team, we had a single task which ran all the tests.
That’s about as simple as it gets.
Simple doesn’t necessarily mean easy, though, which was illustrated well when a small change resulted in 40 failing tests. It was difficult to
identify exactly what had gone wrong and why from the test output.
Everything became clearer when we realised we had 39 failing acceptance tests and one failing unit test.
The unit test told us very straightforwardly what was broken, and why.
The acceptance tests were failing because the bug was preventing the application from starting up properly.
That prompted us to separate the unit and acceptance tests, and run the unit tests first.
This added a little complexity:
- Instead of one test task in the IDE, we had two.
- Acceptance tests needed to be annotated so we could distinguish them.
- The Maven build needed an additional step, with a new plugin.
But that complexity was minor and mostly upfront, and it made our day-to-day life easier.
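The Maven side of that split can be sketched with the Surefire and Failsafe plugins; the `com.example.AcceptanceTest` category here is a hypothetical marker interface, not our actual annotation:

```xml
<!-- Sketch: unit tests run first under Surefire, with anything marked as
     an acceptance test excluded; Failsafe then runs only the marked tests
     in the integration-test phase. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <excludedGroups>com.example.AcceptanceTest</excludedGroups>
  </configuration>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-failsafe-plugin</artifactId>
  <configuration>
    <groups>com.example.AcceptanceTest</groups>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>integration-test</goal>
        <goal>verify</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```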
On a similar note, our acceptance tests had been built to start up the application as the first step in the test.
This was nice and convenient when running from the IDE: running a test was a one-click operation. It also had a ton of downsides:
- We weren’t using the same initialisation code we were using when we were deployed.
- We weren’t testing the same artifact we deployed.
- We were wasting time initialising the application in every test suite.
We’d been punished by both of the first two points in the past. We’d seen packaging problems leave the application throwing NoClassDefFoundErrors and failing to start, problems that hadn’t shown up in tests. We’d seen errors in production because vital objects were only injected by the acceptance-test initialisation path, not the production one. But mostly, we wanted to speed up our deploy, which was an annoyance multiple times a day.
Before being in a position to run tests against the application, through nginx, one thing we needed to be able to do was separate the tests and the application. Another thing was to run nginx, based on the current configuration (which lived in a different project), ideally without actually installing it fresh each time on the test workstation.
These were things that we thought were valuable, but they weren’t trivial.
They were also things we’d need in order to protect our whitelisting: we’d want them in place before changing the endpoints for future features.
So, we agreed to do that as a follow-up task immediately after putting the whitelists in place.
This was a huge mistake, for which we were punished almost immediately.
When Everything Broke
Having already whitelisted one set of paths, the next set of paths seemed pretty straightforward. Like before, we picked a relatively non-critical set of paths to whitelist, to control the potential impact. There were some new subtleties to the syntax required, so we tested manually on our workstation by copying the config files and starting up nginx, and everything looked good. So, we were reasonably comfortable. We made sure we had the metrics for those endpoints available, and deployed the change.
It’s moments like these which illustrate the value of good monitoring. We hadn’t just broken the endpoints we thought we’d whitelisted: we were getting all sorts of errors and alerts. In situations like that, before trying to understand what’s happening, revert. That’s what we did and everything got back to normal pretty quickly.
It didn’t take much investigation to find out what had happened. A cursory look at the nginx logs showed us that nginx had failed to start up because a file was missing. That’s because when we introduced a new whitelist file, we hadn’t added the bit to the puppet manifest to actually put that file on the hosts. And if nginx isn’t running, none of the requests arriving on ports 443 and 80 get forwarded to our application.
In other words: we had a full outage for fifteen minutes.
There are many industries where an outage like that can cause real problems. Back when I was working on a currency exchange, that would have caused severe reputational damage and potentially large liabilities, if the market had moved. Fortunately, the world of ad serving is slightly different. The impact of this was simply that we weren’t serving ads for fifteen minutes: a drop in revenue for the day of maybe 2%.
So, not disastrous, but not a good sign. That’s when we reconsidered our previous position, and decided maybe it would be a good idea to have tests in place before making changes going forward.
First things first
Forget the config being compatible with the exchange, we had an immediate problem in that our config wasn’t compatible with nginx. How could we protect against puppet runs blowing up our services?
Our puppet deploy is all done through a script which applies the changes to one host, then, if it’s successful, radiates out to all the others. The run wasn’t successful here (after the change, nginx wasn’t running), but we hadn’t been notified of it. It looked like that’s because `service nginx restart` just makes sure it can actually invoke the nginx process; it doesn’t track what happens afterwards (after all, nginx is a daemon, so the service command can’t wait for a return code).
What we would ideally like to do is install the new config, validate it, and then restart nginx if that validation passes.
So we did that.
Puppet allows you to override the restart operation for a service. So we put the nginx config validation in as the first step, giving us the following code:

service { 'nginx':
    ensure     => running,
    enable     => true,
    hasrestart => true,
    restart    => 'nginx -t -c /etc/nginx/nginx.conf -p /etc/nginx && service nginx restart',
    require    => [File[…snip…]],
}
Just to validate this did what we wanted, we tried to push the same broken change with this in place. The puppet deploy failed. Success!
Learning from your mistakes
There was one silver lining. When we were thinking about tests before, it was in terms of validating the configuration. Now we knew there was something else we needed to test, too: the deployment of that configuration.
It wouldn’t be enough just to run tests against nginx initialised with our config. We’d need to run tests against nginx with config put in place by the same mechanism as production used.
So, we got to work. We fixed our acceptance tests so they had no dependencies on the application process, stopped them starting the application themselves, and required it to run in a separate process. We updated our Maven configuration to start the application prior to running integration tests (as Maven calls them) and shut it down afterwards. We provided a run target to easily start the application from within the IDE, and made sure the team was aware of the changes. That’s step 1.
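One way to wire that application lifecycle into the build is to bind start and stop steps around the integration-test phase; this is only a sketch, and the start/stop scripts are hypothetical stand-ins for whatever actually launches the packaged artifact:

```xml
<!-- Sketch: start the packaged application before integration tests run,
     and stop it afterwards. The script paths are assumptions. -->
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>start-application</id>
      <phase>pre-integration-test</phase>
      <goals><goal>exec</goal></goals>
      <configuration>
        <executable>${project.basedir}/scripts/start-app.sh</executable>
      </configuration>
    </execution>
    <execution>
      <id>stop-application</id>
      <phase>post-integration-test</phase>
      <goals><goal>exec</goal></goals>
      <configuration>
        <executable>${project.basedir}/scripts/stop-app.sh</executable>
      </configuration>
    </execution>
  </executions>
</plugin>
```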
Step 2 — putting nginx in the middle — was slightly trickier. We started off simply: we just mounted the config files in a Docker nginx container, started the container in our pre-integration-test phase, and redirected our acceptance tests to port 80. That proved we could run tests through nginx in our pipeline.
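That first cut can be sketched with a small compose file; both the sibling checkout path and the location of the config files within the puppet repo are assumptions about our layout, not something shown here:

```yaml
# Sketch: run stock nginx with the config files from the puppet repo
# mounted read-only, so acceptance tests on port 80 go through nginx.
services:
  nginx:
    image: nginx:stable
    ports:
      - "80:80"
    volumes:
      - ../puppet-repo/modules/nginx/files/conf.d:/etc/nginx/conf.d:ro
```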
That only allowed us to test the configuration files though, and not their deployment, which is where we’d failed in the past. It also required that the puppet repo was checked out in a specific location (so we could find the config files), and gave us another task to run before we could run acceptance tests in the IDE. Not just checked out, but up-to-date, too.
The application’s dependency on the infrastructure was starting to become an impediment.
What Changes Together, Lives Together
We’d moved from a situation where nginx was a more-or-less application-agnostic proxy, to it being dependent on the current capabilities of that application. Having to make updates in two places when changing one behaviour is a useful smell. Here it was telling us that the nginx config no longer belonged under the banner of infrastructure. It was now part of the application.
Our first idea was to move the puppet module into the exchange project, and run it using puppet apply (instead of using the puppet agent). Getting this to work in Docker proved troublesome.
Setting up Docker to run puppet and nginx took a little while, extracting the nginx config and building a module properly was fiddly, and we had to find a way to bundle all the manifests and files to get them onto the container effectively. We spent some time building an rpm of the puppet manifests and finally got the Docker image running puppet apply on them on startup.
Of course, puppet failed. Eventually, it was failing meaningfully: it was running our manifests. Then it said it was missing a class — it had a dependency on something else from our puppet project.
We spent ten minutes duplicating those dependencies before coming to the conclusion this was definitively the wrong way to go.
A quick chat with colleagues pointed out the obvious: instead of putting puppet manifests in an rpm, maybe put the config itself in the rpm?
Yeah. That worked.
Test-Driven Infrastructure
This now enabled us to do what we wouldn’t have thought twice about for a change to the Java code.
When we wanted to make a change to the config, we started with a failing test (a pre-existing test we’d make fail, by blocking another path to the exchange). Then we’d add to the whitelists, and watch the tests slowly pass. If we found that we could continue without opening up an endpoint, we’d write a new test which hit it (not necessarily fully testing its behaviour — just that it’s reachable, for now at least), and then add another item to the whitelist.
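A reachability test of that kind can be sketched as follows. To keep the sketch self-contained, a throwaway in-process HTTP server stands in for nginx-plus-application, and the endpoint path is just illustrative; a real test would point at the nginx port instead:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReachabilitySketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the application behind nginx: a stub server on an
        // ephemeral local port that answers the health endpoint.
        HttpServer stub = HttpServer.create(new InetSocketAddress(0), 0);
        stub.createContext("/health/alive", exchange -> {
            exchange.sendResponseHeaders(200, -1); // 200 with no body
            exchange.close();
        });
        stub.start();
        int port = stub.getAddress().getPort();

        // The reachability check itself: a 403 or 404 here would tell us
        // the path hadn't made it onto the whitelist.
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<Void> response = client.send(
                HttpRequest.newBuilder(
                        URI.create("http://localhost:" + port + "/health/alive"))
                    .GET().build(),
                HttpResponse.BodyHandlers.discarding());
        System.out.println(response.statusCode());

        stub.stop(0);
    }
}
```

The test asserts nothing about the endpoint’s behaviour, only that the request gets through; fuller behavioural tests can come later once the path is open.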
And with that, we know that our changes are safe, have the desired effect, and reflect what will happen on deploy.
