Tests, Semantics, and Reliability

You can do so much better than fixtures

Brooklyn Zelenka
The Monad Nomad
11 min read · Oct 21, 2016


Not even NASA can write bug-free software.

Let that sink in. Despite enormous investment in quality assurance, some of the smartest, best-educated programmers in the world, working in excruciating detail for years on a single problem, still can't account for every factor.

There will be bugs, unexpected behaviour, and factors that are literally out of your control (hardware faults and OS bugs, for instance). In short, any computer will exhibit unexpected behaviour from time to time. There are simply too many variables to account for.

Does this mean that you should give up on trying to assure quality? Since you can't catch everything anyway, is it better not to test at all (or to test "just the really important bits") than to spend the time and energy? No! Of course you should still test your code!

This is a topic that has generated a fair bit of discussion over the past few years, and there is no shortage of articles running through the pros and cons.

In fact, there are better, more confidence-boosting methods of testing available than what most people picked up writing RSpec back in their Rails days.

Excuses, excuses…

Sure, tests can be a pain to write. In my experience, very few people actually red/green test, because they're generally doing exploratory coding followed by a refactor.

Tests are also time consuming, and phrasing your specs correctly often takes more time than writing the code in the first place. It's a trade-off of short-term versus long-term goals.

Here are a few excuses that I’ve heard over the years:

  • “Tests don’t create customer value”
  • “Tests are hard / time consuming”
  • “The compiler is smart enough to catch any major errors”
  • “It’s not part of the iOS culture”
  • "It's too late to start testing this project now"
  • “Self-documenting code is better than tests”
  • “You don’t need to test pure code”
  • “You don’t need to test functional code”
  • “You don’t need to test object-oriented code”
  • “You don’t need to test typed languages”
  • “OTP will restart any processes with errors, so just let it fail fast”
  • "It's a client project, so our team won't benefit from tests long-term"
  • “What’s a test?”

Communication

We call them programming languages for a reason: all code is communication. We're having a conversation with a machine, giving it detailed instructions on how to perform some task. It can be challenging to cut through the noise and get right down to intent. One of the much-lauded aspects of strong types, FP, and logic programming is that they make it easier to express meaning, at least as far as a shared sense of meaning can go with a machine (via the universal language: math).

Tests help verify that we say what we mean. By rephrasing an idea, you get another data point. You can label these ideas with human-readable annotations, so that you stay on the human end of the discussion. You also gain confidence that the machine "understands you", much in the same way as teaching a child a new skill: have them repeat a variation of the task back to you.

Confidence Now. Confidence Later.

No matter what the naysayers proclaim, we reap many benefits from a well-thought-out test suite. Not only do tests serve as secondary documentation; they also help you design your system, expose assumptions, enforce best practices, encourage flexible code, and double nicely as a regression suite. The return on investment will differ from project to project, but for the overwhelming majority of long-term projects, tests more than pay for themselves. The stronger the guarantees, the better for the project's continued health. As the saying goes: an ounce of prevention is worth a pound of cure. No one wants to be forced to rewrite their app from scratch.

Below I propose a very loose hierarchy of types of testing that give you an increasing amount of confidence. You need to satisfy the lower levels to get benefit from the higher, more refined kinds. The higher up the hierarchy a project stands, the more I feel that the code is well in hand, and the more confident I am making changes without worrying that things are going to get horribly mangled on deploy.

The ROI of working up the chain will vary from project to project, but really, the more and stronger your tests are, the better. The only balancing factor is the time investment. For instance, if you're working on medical applications, then please, please have the full range of beyond-comprehensive tests for your software and hardware. But in all cases, it's much cheaper to solve a bug before it makes it out into the wild.

Level 0: No Tests (i.e., testing in production)

Living Dangerously

Do you enjoy that rush of adrenaline when you get the 3am call that the server has melted? How about the challenge of walking into a codebase riddled with random, undocumented failure conditions? If so, you are a rare breed indeed, and can go back to live debugging your production system.

For the rest of us, not having any tests is dangerous to the long-term health of a project (also the long-term health of your devs 😉). Not having automated tests means that you'll either end up testing things by hand while writing code (far more time consuming than writing tests), or delegating QA to your customers (who may not stay customers). Playing whack-a-mole with fixes that regress other issues isn't fun for anyone.

Level 1: Top-level Feature Tests & Human QA

Absolutely Indispensable

Coding is a form of communication: you're telling a machine what to do, in extreme detail, and it's very easy to say a slightly wrong thing when giving detailed step-by-step instructions. We can unit test as much as we want, but unit tests alone can't show that the overall system actually does the right thing.

Having the ability to quickly test the top-level API is a huge boost in confidence! In fact, this is the fundamental end result that we care about. This is the minimum set of tests for the business. Most of the rest are for the developers: they help you debug, improve your code, and design more flexible software.
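
As a minimal sketch of what this can look like, here's a smoke test that exercises the running system from the outside, exactly as a client would, and checks the one result the business cares about. It assumes the app is a web service; the URL and /health endpoint are hypothetical stand-ins for your own top-level API.

    {-# LANGUAGE OverloadedStrings #-}

    import Control.Monad (unless)
    import Network.HTTP.Simple (getResponseStatusCode, httpLBS)

    -- Hit the whole stack from the outside and check the visible result.
    main :: IO ()
    main = do
      response <- httpLBS "http://localhost:4000/health"
      unless (getResponseStatusCode response == 200) $
        fail "top-level health check failed"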

Level 2: Fixture Tests

The term "fixture testing" comes from electrical engineering, where you hook up the component you're working on to a specialized electrical apparatus that provides a known input (a certain voltage, say) and checks that the output is correct. This is great for very simple cases, or when there are only a tiny number of possible states (like a lightbulb: on/off).

Computers, by contrast, are huge, complex series of circuits. The equivalent circuitry for even the most common functions is enormous. Programs have to make decisions, make all sorts of assumptions, and depend on messy heuristics.

On one hand, having fixture tests is great! Fast, simple, automated tests are much better than cranking open the terminal and running the function on the same input by hand. We get certainty about one or two specific examples, which can be re-run whenever we change our code, and which give us a sense that things work more or less as expected.
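
For instance, a fixture test with Haskell's HUnit might look like the sketch below, where fullName is a hypothetical function under test and both the input and the expected output are hardcoded:

    import Test.HUnit

    -- A hypothetical function under test.
    fullName :: String -> String -> String
    fullName first family = first ++ " " ++ family

    -- A fixture test: one known input, one known expected output.
    testFullName :: Test
    testFullName = TestCase $
      assertEqual "formats a known name"
                  "Kermit the Frog"
                  (fullName "Kermit" "the Frog")

    main :: IO Counts
    main = runTestTT testFullName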

But we also only have certainty about a handful of very specific examples. If that were all the behaviour we needed, we could just hardcode all of the return values in a little switch statement. But we have fancy logic for a reason: our code needs to be general.

Many developers stop at this stage. Fixture tests are very easy to think about, since they mirror what you'd do by hand in the terminal. However, they leave you open to very large classes of uncovered errors. With a few tweaks to your flow, you can get much more power and certainty by cranking up the sophistication just one notch.

The one place where I believe fixture tests do well is doc tests: concrete examples of usage embedded right in your documentation. The explanatory nature of these tests usually requires that the example be unchanging.
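
In Haskell, for example, the doctest tool runs the examples embedded in documentation comments and fails if the output ever drifts (reusing the hypothetical fullName from above):

    -- | Join a first and family name with a space.
    --
    -- >>> fullName "Kermit" "the Frog"
    -- "Kermit the Frog"
    fullName :: String -> String -> String
    fullName first family = first ++ " " ++ family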

Level 3: (Randomized) Factory Tests

Factory testing is the standard at most shops. While you can use factories to write fixture tests, doing so really defeats their purpose. In essence, you want the data to be as random as possible while still satisfying your intended assumptions (like dates being in a certain format).

The act of writing a unit test is more an act of design than of verification
— Robert “Uncle Bob” Martin

Why randomize your data? Isn't that just unneeded complication? Nope! You just need to change your style. Instead of always expecting your full_name function to return "Kermit the Frog", you can generate a random user and expect the result (as in the sketch after this list) to always:

  • be text (not empty, not an integer, &c)
  • contain user.first_name
  • contain user.last_name
  • have a first_name before last_name
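
Here's that test sketched with QuickCheck; the User type and genUser generator are hypothetical stand-ins for whatever your factory produces:

    import Data.List (isPrefixOf, isSuffixOf)
    import Test.QuickCheck

    data User = User { firstName :: String, lastName :: String }
      deriving Show

    -- The "factory": random, nonempty, lowercase names of any length.
    genUser :: Gen User
    genUser = User <$> genName <*> genName
      where genName = listOf1 (elements ['a' .. 'z'])

    fullName :: User -> String
    fullName u = firstName u ++ " " ++ lastName u

    -- Assert the invariants above for every generated user. The prefix
    -- and suffix checks cover both "contains" and the ordering.
    prop_fullName :: Property
    prop_fullName = forAll genUser $ \u ->
      let name = fullName u
      in  not (null name)
          && firstName u `isPrefixOf` name
          && lastName u  `isSuffixOf` name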

This test is now much more general, and we've uncovered several assumptions we've made about the behaviour of our code! When we later swap the name format to "Last, First", we still pass every expectation except the one that depends on ordering (so only a few changes are necessary). And when we genuinely break our code, we'll see the particular tests that cover that behaviour fail.

Intermittently failing tests are also great red flags that you wouldn't get from a fixture test: they show you that you haven't accounted for some case that will happen in the real world.

Level 4: Property Tests

Properties ("props") are a distillation of randomized factories, edging towards logic programming. They test some invariant behaviour across a large number of inputs, ideally without side effects. As a very simple example, we know that these always hold for integers:
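
Sketched as QuickCheck properties (the names are just descriptive labels):

    import Test.QuickCheck

    -- Addition is commutative.
    prop_addCommutative :: Int -> Int -> Bool
    prop_addCommutative x y = x + y == y + x

    -- Zero is the identity for addition.
    prop_addIdentity :: Int -> Bool
    prop_addIdentity x = x + 0 == x

    -- One is the identity for multiplication.
    prop_mulIdentity :: Int -> Bool
    prop_mulIdentity x = x * 1 == x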

You can also test more sophisticated properties. For example, here are the semi-famous monad laws, which ensure that monads behave consistently:
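
In Haskell these can be written down as executable properties. The sketch below checks them for one concrete monad (Maybe) with fixed functions f and g; a fuller suite would quantify over the functions as well:

    import Test.QuickCheck

    f, g :: Int -> Maybe Int
    f = Just . (* 2)
    g = Just . (+ 1)

    -- Left identity: injecting a value and then binding f is just f.
    prop_leftIdentity :: Int -> Bool
    prop_leftIdentity a = (return a >>= f) == f a

    -- Right identity: binding return changes nothing.
    prop_rightIdentity :: Maybe Int -> Bool
    prop_rightIdentity m = (m >>= return) == m

    -- Associativity: how the binds are nested doesn't matter.
    prop_associativity :: Maybe Int -> Bool
    prop_associativity m =
      ((m >>= f) >>= g) == (m >>= (\x -> f x >>= g))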

The beauty of prop testing is that we start to drill down into the deep structure of the code, looking at what the code is at a fundamental level. Edge cases start vanishing, because we’re quantifying over all inputs. We can start making assumptions about how our code will behave, such as being able to compose certain functions, or being able to skip/substitute entire subsystems with more efficient versions for particular situations (ex.: no need to actually set a value repeatedly).

We also get massive coverage. Because properties are (generally) very simple, you can run them over tens of thousands of inputs in a few seconds. We've gone from testing one input in a fixture test to testing 10,000+ possibilities per run.
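
With QuickCheck that's a one-line configuration change (reusing prop_addCommutative from the earlier sketch):

    import Test.QuickCheck

    -- Run the property against 10,000 random inputs instead of the
    -- default 100.
    main :: IO ()
    main = quickCheckWith stdArgs { maxSuccess = 10000 } prop_addCommutative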

Furthermore, mature prop-testing libs (like QuickCheck) provide shrinking: if the library runs into a failing case, it'll boil that case down to the smallest version of the example for you to work with. As a contrived example, let's say that your property is that a list never contains 42. If the library generates [99, 31, 42, 26252], the test fails, and the failing case then shrinks to [42], which obviously still fails. You can now go back to your code and find what's allowing 42 into your list.
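
That contrived example is a one-liner as a property; running it shows shrinking in action:

    import Test.QuickCheck

    -- Deliberately falsifiable: claims no list of ints contains 42.
    prop_no42 :: [Int] -> Bool
    prop_no42 xs = 42 `notElem` xs

    -- quickCheck finds a failing list (e.g. [99, 31, 42, 26252]) and
    -- shrinks it to the minimal counterexample: [42].
    main :: IO ()
    main = quickCheck prop_no42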

Many props have fancy names, such as reflexivity, idempotence, transitivity, commutativity, right identity, and so on. You don't have to worry about them, but they're good to know: they're tools for thinking about your code at a higher level of precision, and they give you a good place to start looking for otherwise hidden invariants.

Level 5: Proofs, Dependent Types, and Static Analysis

Proofs are the absolute gold standard of guarantees for your known problem space. While prop tests improve confidence about invariants, proofs guarantee that the invariants hold for all possible inputs. You don't even have to run a test suite against particular cases: they're true because math.

While proofs can be done on paper or with a proof assistant, some languages are pioneering dependently typed programming; Idris, Coq, and Dependent Haskell are a few that come to mind. With these, your program both does something useful and acts as a proof of itself by construction. The classic example is concatenating two length-indexed vectors and guaranteeing that the output length is the sum of the two input lengths. This must be true of every use of the function, or else the compiler will refuse your code (and give you a helpful error about the lengths not lining up).
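
Here's a sketch of that example using GADTs and type-level naturals in Haskell (Idris makes this considerably lighter-weight, but the idea is identical):

    {-# LANGUAGE DataKinds, GADTs, TypeFamilies, TypeOperators #-}

    data Nat = Z | S Nat

    -- Type-level addition on natural numbers.
    type family (n :: Nat) + (m :: Nat) :: Nat where
      'Z   + m = m
      'S n + m = 'S (n + m)

    -- A vector whose length is part of its type.
    data Vec (n :: Nat) a where
      Nil  :: Vec 'Z a
      Cons :: a -> Vec n a -> Vec ('S n) a

    -- The signature is the proof: the output length is the sum of the
    -- input lengths, and the compiler checks it at every call site.
    append :: Vec n a -> Vec m a -> Vec (n + m) a
    append Nil         ys = ys
    append (Cons x xs) ys = Cons x (append xs ys)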

The compiler can now also do all sorts of fancy things with memory allocation, since it knows exactly how long the output is going to be.

Level 6: Holistic (Human) QA Testing

That's not a typo; human QA is intentionally listed twice, but with a slightly different sense here. Rather than running through the known unknowns, great QA is active, creative, and holistic.

Unknown Unknowns

People often downplay the importance of good manual QA. QA is expensive, time consuming, and monotonous for non-specialists. We look for reasons to get around it, or to automate it away as much as possible. On one hand, this is great: you should only be QAing for stuff that you don't know about yet. Everything else should be automated.

QA helps catch extreme unknown unknowns before they get to your customers. Automated tests can only cover what you know about, and QA uncovers new and horrible bugs. Even with rock-solid proofs, you've only proven what you knew to prove, and there can (and will) be blind spots in almost any code.

Creative & Adversarial

I’ll even go so far as to say that the ideal QA is adversarial (in the Alice/Bob/Eve sense). Great QA looks to break the site from a technical and design perspective. Proofs and prop tests aren’t “smart”; they confirm your assumptions. Creativity isn’t something that computers do well (yet), so there’s no alternative to a good flesh and blood human at this time.

While automating away as many tests as possible is great, having someone trying to break your application in new and interesting ways can reveal all sorts of bugs in the system that users may run into but never report. I’ve worked at several companies that have failed basic security tests (ex. revealing the database password in plaintext via a simple cURL request that’s outside of the main app flow).

Wrap Up

Many developers aren’t familiar with tests beyond what comes bundled with their standard lib or framework (I’m looking at you, RSpec and ExUnit). As seen above, there’s so much more that you can do to ensure your computer actually understands what you mean.

The wider and deeper your tests are, the better. The more creatively and aggressively you test your system, the better. Tests improve the bottom line in the long term by keeping customers happy, and they reduce stress when making changes to your code. No one likes surprises the weekend after a big deployment.

Your ROI will vary, of course. If you're launching billion-dollar satellites that could fail and crash into a populated area, you'll probably want more and deeper tests/proofs/test launches than if you're building the next blogging platform. But even then, we want our software to work, because broken software causes billions of dollars of lost productivity. The trade-off is developer time versus peace of mind. Just as you buy insurance for peace of mind in case something goes wrong, investing in tests shields you from surprises down the road.

While there will always be surprises in the wild, that's no excuse for not catching everything we can in advance. Besides, we all like to sleep well at night knowing that our code works.
