Write Better Python with Hypothesis

Tim Renner
Published in HomeAway Tech Blog · Sep 24, 2018

When testing code, we at HomeAway always start with unit tests. For a pure function, we specify the inputs, what we know the output should be, and then run the function and compare the answer with truth. Unit tests are great for capturing the behavior of the function for the cases we write tests for. What about the stuff we didn’t think of? Usually this is edge-case stuff — NaN values, empty lists, etc. — that isn’t at the top of our minds but can and will cause our code to fail. Unit tests are great for testing what we already know. They’re not so great at testing the things we don’t know to test. If we want to test the cases we didn’t think of, we’re going to need a new set of tools and a different way of thinking about tests.

Hypothesis is a testing framework for Python that performs what’s called property-based testing. It’s designed to find the edge cases in code and ensure certain general guarantees about the code hold true. The framework calls the function thousands of times with generated data, specified loosely by types and bounds, and ensures the properties we’ve defined hold true. If an assertion fails, Hypothesis will keep searching to find the minimal example required to violate the assumptions and show it to us. It’s an excellent way to test code because Hypothesis will find all the nasty things we didn’t think of and shove them into our function.

Property-based tests are most commonly associated with Haskell’s QuickCheck library, but there are lots of implementations or near-implementations. Here’s a walkthrough of a function I implemented to calculate something called the “average agreement” between two ranked lists (read this paper if you dare — it’s part of a larger calculation of something called rank-biased overlap). The first version of this function I wrote passed all of the unit tests.

Here’s the first attempt at the function.
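A sketch of that first version (the name average_agreement is assumed from the test name shown further down; the body matches the fragments quoted later in the post):

def average_agreement(list1, list2, max_depth):
    # Compare the two lists at every depth up to max_depth and
    # average the per-depth agreement scores.
    agreements = []

    for depth in range(1, max_depth + 1):
        set1 = set(list1[:depth])
        set2 = set(list2[:depth])

        intersection = set1 & set2

        agreements.append(
            2 * len(intersection) / (len(set1) + len(set2))
        )

    return sum(agreements) / len(agreements)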

Basically, this function takes two lists and a depth at which to compare them, calculates the proportion of common elements at each depth up to the max depth, and then averages those values. If the lists are identical, the value is 1. If they’re completely different, the value is 0.

Here’s the property-based test I wrote using Hypothesis. I’m using pytest as the general testing framework. Hypothesis plugs right in using decorators.
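A sketch of the test, with the strategies spelled out the way they’re described below (the exact bounds are assumptions based on that description):

from hypothesis import given
from hypothesis import strategies as st

@given(
    list1=st.lists(st.integers(min_value=1)),
    list2=st.lists(st.integers(min_value=1)),
    max_depth=st.integers(min_value=1),
)
def test_average_agreement_properties(list1, list2, max_depth):
    value = average_agreement(list1, list2, max_depth)

    # The value is always between zero and one.
    assert 0.0 <= value <= 1.0

    # Swapping the lists shouldn't change the answer (symmetry).
    assert value == average_agreement(list2, list1, max_depth)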

For each test run, Hypothesis generates two lists of integers whose elements are at least 1, and a depth of at least 1. Choosing the data is usually easy, but choosing the properties and assertions isn’t. For this function I chose two properties:

  • The value is always between zero and one
  • The function should return the same value when the arguments are reversed (that is, it is symmetric)

For “normal” inputs — whatever that means — Hypothesis will make sure these assertions work. Hopefully it’s also going to find inputs that crash the function.

Which it did. Hypothesis was kind enough to show us how the code broke, and what it did to break it (under “Falsifying example” in the test output). It’s pretty easy (now) to see a ZeroDivisionError when the lists are both empty because of this line:

agreements.append(2 * len(intersection) / (len(set1) + len(set2))) 

If set1 and set2 are empty, we’re dead by ZeroDivisionError. But most importantly, if either of them is empty, the value’s automatically zero, full stop. So I put a short circuit in the code.
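The short circuit is just a guard at the top of the function, something like:

def average_agreement(list1, list2, max_depth):
    # An empty list can't agree with anything, and the denominator
    # below would be zero if both lists were empty, so bail out early.
    if len(list1) == 0 or len(list2) == 0:
        return 0.0

    # ... the rest of the function is unchanged ...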

Now that I’m handling the zero-length list case, it’s worth pointing out that even though the data generated by Hypothesis is random, the framework is stateful. It will remember the inputs that failed the test and make sure it calls the function with those inputs every time the test is run. Let’s check the test now.

It passed, but Hypothesis threw a warning. Here are a few things to notice:

  1. The test passed.
  2. There’s a warning that one of the test cases took over a third of a second to run.
  3. The test overall (one test) took over fifteen seconds to run.

That smells bad. This function shouldn’t take that long to run, even for decently sized lists. The details of the warning suggest that I need to set a “deadline” — a maximum time the function is allowed to execute. Once I do that, Hypothesis will show me the error and the inputs that caused it.
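Setting a deadline is just another decorator on the test. A sketch, with an assumed budget of 200 milliseconds (Hypothesis also accepts a timedelta):

from hypothesis import given, settings
from hypothesis import strategies as st

@settings(deadline=200)  # assumed value: fail any example slower than 200 ms
@given(
    list1=st.lists(st.integers(min_value=1)),
    list2=st.lists(st.integers(min_value=1)),
    max_depth=st.integers(min_value=1),
)
def test_average_agreement_properties(list1, list2, max_depth):
    ...  # same assertions as before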

Here’s what I got:

This has a lot of chatter about inconsistent timings, but the important thing to zero in on is this:

Falsifying example: test_average_agreement_properties(list1=[1], list2=[1], max_depth=80624)

Oops — what happens when the depth is much, much larger than the lengths of the lists?

for depth in range(1, max_depth + 1):
    set1 = set(list1[:depth])
    set2 = set(list2[:depth])

    intersection = set1 & set2

    agreements.append(
        2 * len(intersection) / (len(set1) + len(set2))
    )

The loop goes on past the lengths of the lists. Python doesn’t throw an error when a list is sliced past its length; it just returns the whole list. So “agreements” becomes this huge list where everything past the length of the longer list is the same value. Not only is this slow, it’s wrong. This function needs to truncate the max depth to the length of the longer list when it’s too large.
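Putting both fixes together, a sketch of the corrected function (the effective_depth name is just for illustration):

def average_agreement(list1, list2, max_depth):
    # An empty list can't agree with anything.
    if len(list1) == 0 or len(list2) == 0:
        return 0.0

    # Slicing past the end of a list just returns the whole list, so any
    # depth beyond the longer list's length repeats the same agreement
    # value. Truncate the depth instead of looping over it.
    effective_depth = min(max_depth, max(len(list1), len(list2)))

    agreements = []
    for depth in range(1, effective_depth + 1):
        set1 = set(list1[:depth])
        set2 = set(list2[:depth])
        intersection = set1 & set2
        agreements.append(
            2 * len(intersection) / (len(set1) + len(set2))
        )

    return sum(agreements) / len(agreements)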

Now when I run the test, I get this goodness: down from 15+ seconds to half a second, and the function won’t return wrong answers.

It’s easy to test the code that’s in front of us, but it’s hard to imagine every gnarly thing that could possibly be thrown at it. Hypothesis is a good way to fill that gap, generating lots of data and finding the edge cases that break the code, slow it down, or violate the assumptions we have about what the function should do. It’s a little more work to add to your test suite, but it’s well worth it.
