Blind spots and safety nets

I’m going to fall, but I’d rather not reach the ground.

Greg Sarjeant
DevSubOps
4 min read · Dec 30, 2019


The good news was that the system was secure. As I sat at my desk tapping the “f” key, my annoyance growing with each failure of that action to change in any way the text that was displayed in my terminal, I could at least comfort myself with the knowledge that nobody else outside the data center could log on to the server, either. My goal had been to lock the system down and I had certainly done so.

I’d been on a firewall kick. I was trying to define the most restrictive set of rules that would still allow our application to function, so that we could adopt it as part of our baseline configuration. I had just received some new development servers at a remote managed hosting provider, so I had the perfect playground to try them out.

I made absolutely sure that I had a comprehensive list of the firewall rules that I needed. Anything that one part of the application needed in order to talk to another part of the application would be allowed. Any other traffic would be denied. I documented all of the services on the systems and the ports they listened on. I determined which needed to be accessible from other servers and which didn’t. I double-checked the list. I ran it by the development teams and my fellow sysadmins. This was one well-validated list of firewall rules.

Now that I had a set of firewall rules with which there couldn’t possibly be anything wrong, I confidently put it in place — and immediately realized that I’d been so focused on what I needed to do to let my application communicate with the server that I’d ignored what I needed to do so I could keep communicating with it. I hadn’t allowed SSH. In what would be our last conversation for a time, I told the server not to listen to me any more. It happily obliged.
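This is roughly the shape that mistake takes. The sketch below is illustrative, not the actual ruleset from this story — the ports and addresses are invented — but the essential feature is what’s missing: nothing allows SSH on port 22, so the default-deny policy cuts off remote administration along with everything else.

```
#!/bin/sh
# Illustrative only: a default-deny ruleset that allows the application's
# traffic but forgets the rule that keeps the administrator connected.

# Start from a clean slate
iptables -F

# Allow loopback traffic so local services can talk to themselves
iptables -A INPUT -i lo -j ACCEPT

# Allow the (hypothetical) application's ports: a web tier on 80/443
# and an app tier reachable only from the other servers' subnet
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 8080 -j ACCEPT

# Deny everything else by default -- including, because no rule above
# mentions port 22, the SSH session this script is being run over.
iptables -P INPUT DROP
```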

I’d been tripped up by a blind spot, and now the only thing I could do was to send an email to the hosting provider and wait for someone there to reset the firewall. This gave me some extra time to sit at my desk and tap on the “f” key while I thought about how I could avoid this sort of thing in the future.

The problem with blind spots is that you can’t see them until it’s too late. That’s their nature. Now that I’d stumbled into this one, I could figure out a way to deal with it (about which more later), but I knew there would be another. There’s always another, and the solution to this one wouldn’t help me deal with the next one. Moreover, it’s folly to try to avoid the blind spots; that’s not how they work. The moment I’m able to avoid something, it isn’t a blind spot any more. Since the sorts of issues that I was concerned about were precisely those that I wouldn’t see ahead of time, preemptive approaches weren’t likely to help.

What I really wanted was a safety net, so that when I inevitably encountered the next blind spot, something could at least catch me and let me claw my way out of my hole before I hit the bottom. Instead of trying to figure out where all the blind spots were, I started to think about the likely consequences of stumbling into one, even if I couldn’t know what the specific blind spot would be. Once I understood those consequences, I could take steps to mitigate them.

In the case of my firewall misadventure, I knew what firewalls do: at a high level, they allow or deny network traffic to a computer. So, regardless of what I might screw up, there were only two likely consequences:

  1. I would allow traffic that I wanted to deny.
  2. I would deny traffic that I wanted to allow.

The first of these is easy to fix. The second, as I’d just learned, can be impossible to fix without getting physical access to the server. If I had had access, the solution to the problem would be fairly straightforward: restore the firewall to its last known working state. So the question I had to ask was this:

Is there some way for me to restore the firewall to a known good state without physical access to the server if I screw up and lock myself out of the box?

It turns out that there are a few ways. I ended up using the situationally handy but seldom-used at command, which lets you schedule a command to run at a specific time. The next time, before I started messing around with the firewall, I set up an at job that said “reload the default firewall rules in 10 minutes.” Now I had my safety net. If I screwed something up, even if that screwup locked me out of the server, all I had to do was wait a few minutes for my command to run and reset the firewall to the known good state. It worked beautifully.
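Here’s a minimal sketch of that safety net, assuming a Linux host with iptables. The saved-rules path is made up for the example, and these aren’t necessarily the exact commands I ran back then — the pattern is what matters.

```
# Save the current, working ruleset as the known good state
iptables-save > /etc/iptables/rules.known-good

# Schedule a job that restores it in 10 minutes, no matter what happens
# to my SSH session in the meantime
echo "iptables-restore < /etc/iptables/rules.known-good" | at now + 10 minutes

# ...experiment with the firewall...

# If everything still works, cancel the pending job:
# atq lists pending jobs and their numbers, atrm removes one
atq
atrm <job-number>
```

The same idea works with anything that can run a command later without you — cron, a systemd timer — but at fits the one-off, “do this at a specific time” shape of the problem nicely. The only discipline it demands is remembering to cancel the job once you’ve confirmed you can still get in; otherwise the firewall quietly resets itself under you ten minutes later.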

As handy as that at command was, the thing that was most helpful was changing my approach to risk mitigation. While it’s impossible to predict everything that could go wrong with a given operation, it’s usually possible to understand the most likely consequences of something going wrong. What might happen when I rename a piece of infrastructure? Alter a database table? Change an API signature? Deploy a new version of an application? What could I do to get myself back to a known good state if the unexpected happens?

Whenever I find myself feeling like I’m walking a tightrope in production, I stop to ask myself these kinds of questions — to think about what might happen if I slip and fall. Once I understand the consequences of falling, I can build myself an appropriate safety net. Most of the time I don’t need it, but when I do, I’m always glad it’s there.
