Don’t Fix the Bug, Fix the System

Published in

Inside League

8 min readFeb 22, 2019

Grace Hopper finds the first actual computer bug, 1947

Bugs are such a bummer! You’re just hanging out, trying to write some cool new thing, and then somebody comes up and says “hey remember that thing you wrote earlier and aren’t thinking about at all anymore? It’s totally busted and 100% of our customers hate you now, even Jennifer!”

Even worse is when you fix a bug, and then six months later someone says “hey I think this other thing is also busted” and you spend a bunch of time looking at it and IT’S BASICALLY THE SAME BUG WTF I FIXED YOU. This is the worst!

To try to have things not be the worst, when I find a bug I think “what class of bug is this?” Is it a logic error? A problem with module Foo assuming something incorrect about module Bar? A typo? Passing parameters in the wrong order? A misunderstanding about requirements?

Some of those classes of bugs (the requirements one especially!) are pretty hard to fix systematically, but there are plenty of avoidable bugs.

Let’s look at a couple techniques for this!

Changing the System so the Bug Isn’t Possible

This is the one smart people are always making pithy, unhelpful statements about!

“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies.”
- Tony Hoare
“If you want more effective programmers, you will discover that they should not waste their time debugging, they should not introduce the bugs to start with.”
- Edgar Dijsktra

This is always the goal, but I certainly don’t always get there, and I’d argue that you shouldn’t always get there! Given the choice between “make this 30-minute change that does the thing we need but makes our system less elegant” and “spend a month rearchitecting our system so it looks like that feature was planned from the start”, very few people have the luxury of picking the latter. Shipping is important and can mean the difference between competing and getting left behind.

Nevertheless, simplification should always be under consideration, because the opposite problem also happens: you end up left behind because your system is so complicated that making changes is a fraught and lengthy process.

The benefits of “code where a bug isn’t possible” seem pretty obvious, the tricky thing is figuring out when that’s possible! Here’s an example where I think fixing the system was the right call:

We have a type called Context that holds request-scoped information (like what user is making the request, whether they’re authenticated, etc). In particular, it had these two things called Subject and MessageType. MessageType was the name of the service being called, and Subject was…complicated. In one of our apps it was the same as MessageType, but in another one it was being used to figure out “is the message I just received a direct response to something I sent, or is it a broadcast?” in a publish-subscribe mechanism we have.

This was causing bugs all over the place, because people would look at Context and think “hmm which of these is the right thing for me to care about” and they would sometimes pick Subject when they meant MessageType, and then their code would work but only some of the time. Code that works but only some of the time is the worst kind of code!

So we removed Subject, and now we keep track of that “is this a direct response to something I sent” information by using a closure when we subscribe. Now this sort of “used Subject when I meant MessageType” bug isn’t possible, because Subject isn’t a thing anymore!

The “passed some function arguments in the wrong order” class of bugs also tends to be a great candidate for change-the-system fixes. For instance, we had a lot of places we were passing pagination info like (offset int, limit int) and it was hard to remember which order they went in, which had caused us a few bugs over the years when people got that order wrong, so now we pass those as distinct types and the problem has totally disappeared because the compiler yells at us if we do it wrong!

Catching All Versions of the Bug With Static Analysis

Static analyzers are programs that inspect your program without running it. It turns out there are lots of kinds of bugs that can be caught this way!

Our backend is written in Go, so there are a few good options we can use. Go vet is the most common Go-based static analyzer, but there are others. We run Staticcheck, for instance, and so should you! When we introduced it we found a bunch of long-standing problems in our code, and when we added it to our pre-merge continuous integration checks we prevented countless more bugs from getting into the code in the first place. Staticcheck and vet are both great, their “zero false positives” goals make them really easy to introduce, because the things you need to fix are all “make the code better” and not “change the code in some way to appease the static analysis tools” or “add a bunch of filters to ignore these false positives”.

We also run a spellchecker for our strings files, goimports, a separate spellchecker for our source code, ineffassign, unconvert, and gosec. Combined, these checks prevent dozens of issues every week — mostly nitpicky style things, but also some real bugs.

PRO TIP: the easiest time to introduce a new static analyzer is right now! Adding them to new, small projects is easy and keeps bugs out as things grow, but the bigger your codebase gets, the more issues you’ll have to resolve when trying to introduce new checks. Oculus CTO and programming wizard John Carmack wrote something about this when he was experimenting with the C++ analyzer PVS-Studio, and it certainly rings true for us:

“I noticed that each time PVS-Studio was updated, it found something in our codebase with the new rules. This seems to imply that if you have a large enough codebase, any class of error that is syntactically legal probably exists there. In a large project, code quality is every bit as statistical as physical material properties — flaws exist all over the place, you can only hope to minimize the impact they have on your users.”

That doesn’t mean that if you already have a big codebase you can’t add new analysis tools, it just means that if you wait longer it’s only going to get harder!

Writing Your Own Static Analyzers

Other people’s static analyzers are great because they’re already written, but an option more people should consider is writing analyzers just for your codebase!

Let me show an example of why that might be useful — our string localizer supports template variables for substitution, and it looks like this:

str := loc.T(“Hello, {{.first_name}}, welcome to League!”, map[string]interface{}{“first_name”: “Reilly”})

That would get localized for a French user to “Bonjour, Reilly, bienvenue à League!”. Using template variables like this is a useful thing to do because it allows the translators a bit more context than just having a %s in there, which might impact how they translate the phrase, and if you have multiple parameters they don’t have to appear in the same order in all languages.

However, this interface is prone to bugs! Here’s a bug I found in our code a couple weeks ago:

str := loc.T(“Reminder: You have bookings confirmed for {{.title}} on {{.date}}.”, map[string]interface{}{“time”: title, “date”: tm})

The parameter timehere should be title, which is the name of the appointment the user had booked.

Fixing this bug is pretty easy, and we could add a test that this one invocation is now correct, but it’s also easy to introduce this same kind of bug somewhere else in the code. Instead, let’s fix this bug everywhere, forever!

This T()interface is pretty convenient, so I don’t want to change it, but we can introduce a simple static analyzer to look for this kind of bug. If you’re curious, the full code for it looks like this (sooo much indenting!), but here’s a summary of what it does:

Are we calling a function named T()?
If so, is the first argument a string literal?
If so, is it a valid Go template? (error out if not)
If so, are subsequent arguments maps?
If so, do the keys to those maps match any template parameters in our first argument? (error out if not)

The whole thing’s 41 lines of code, and 15 of those lines are just }s to get out of all those nested checks. Not too bad!

Now, a check like this is never going to find its way into something like “go vet”, because a function called T() could do literally anything that’s Turing-complete, and I’m sure this check has false positives on really weird code samples that I didn’t think about.

BUT!

We aren’t writing go vet! We’re writing something that has to work on exactly one codebase: ours. That lets you make all kinds of assumptions that you can’t make with a general-purpose tool! Incidentally, that’s also why the rest of our analyzer isn’t open-source; it’s not useful for anyone else’s code (I did publish https://github.com/reillywatson/enumcover, which is a more generally useful check).

Once you get the hang of the go/ast package this sort of check is pretty quick to write. The above took me maybe an hour from start to finish, and now we never have this kind of bug again!

PRO TIP: ast.Print() is your best friend here! I can never remember all the different nodes and what they’re called, but I don’t have to. I just write the smallest sample program I can think of and run it through ast.Print() to see what the parse tree looks like, then write my analyzer function so it catches code that looks kind of like whatever that sample program’s parse tree is.

Here are a couple other kinds of bugs we catch the same way:

calling fmt.Printf() and friends in production code (we have an internal logger package we want to use instead, it provides more context for debugging and handles log file rotation)
calling certain functions that take an interface{} but the argument reeeeally needs to be a pointer (there’s a similar check-in go vet for Unmarshal calls, we use the same technique but with a list of our own functions)

The next version of Go, 1.12, includes some support for running custom analyzers without the performance cost of parsing the AST separately for each check you write (see here for more details). We aren’t using it yet (we run a harness built on top of Staticcheck), but we plan to move to it shortly after 1.12 is released.

Conclusion

Writing software at scale is hard, and the speed at which things change means bugs are going to happen! If you can treat every bug with the attitude “how did this happen and how can we make sure it doesn’t happen again” though, every bug becomes an opportunity to make everything better, instead of just making one thing less bad.

If you like working with smart people to make good things better, that’s totally an option, we’re hiring!

Don’t Fix the Bug, Fix the System

Changing the System so the Bug Isn’t Possible

Catching All Versions of the Bug With Static Analysis

Writing Your Own Static Analyzers

Conclusion

Written by Reilly Watson