Tests are an incredibly important part of producing quality software quickly but, as with all things in life, they can cause more harm than good when used incorrectly. Consider the following overly simple function and test. In this case, the author wants to insulate their tests from external dependencies, so they rely on mocks.
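As a minimal sketch of what such a function and mock-based test could look like (the `get_key` signature, taking a key and a list of values, and the test inputs are assumptions for illustration, not necessarily the original listing):

```python
import hashlib
from typing import List
from unittest.mock import patch


def get_key(key: str, values: List[str]) -> str:
    # Assumed shape of the function: hash the key and values, prefix with the key.
    md5_hash = hashlib.md5(key)       # str passed straight through
    for value in values:
        md5_hash.update(value)        # str passed straight through
    return f"{key}:{md5_hash.hexdigest()}"


@patch("hashlib.md5")
def test_get_key(mock_md5):
    # hashlib.md5 is replaced by a mock, so no real hashing happens here.
    mock_md5.return_value.hexdigest.return_value = "world"
    assert get_key("hello", ["world"]) == "hello:world"
    mock_md5.assert_called_once_with("hello")
    mock_md5.return_value.update.assert_called_once_with("world")
    mock_md5.return_value.hexdigest.assert_called_once_with()
```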

Looks great! Fully tested, 4 assertions to ensure the code executed as expected. The tests even pass!

$ python3.6 -m pytest test_simple.py
========= test session starts =========
test_simple.py .
======= 1 passed in 0.03 seconds ======

Of course, the problem is that this code is wrong: md5 only accepts bytes, not str (see this blog post for an explanation of how bytes and strings changed in Python 3). The test case had little value; it really only tested string formatting, granting us a false sense of security. We thought we had written our code correctly and we had test cases to prove it!

Thankfully mypy catches these issues:

$ mypy test_simple.py
test_simple.py:6: error: Argument 1 to "md5" has incompatible type "str"; expected "Union[bytes, bytearray, memoryview]"
test_simple.py:8: error: Argument 1 to "update" of "_Hash" has incompatible type "str"; expected "Union[bytes, bytearray, memoryview]"

OK great, we fix the underlying code to first encode our strings as bytes:
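A sketch of that fix, under the same assumed `get_key` signature as before (illustrative, not necessarily the original listing). The mock-based test is updated to expect bytes:

```python
import hashlib
from typing import List
from unittest.mock import patch


def get_key(key: str, values: List[str]) -> str:
    # Encode each str to bytes before handing it to md5.
    md5_hash = hashlib.md5(key.encode())
    for value in values:
        md5_hash.update(value.encode())
    return f"{key}:{md5_hash.hexdigest()}"


@patch("hashlib.md5")
def test_get_key(mock_md5):
    # Still a mock-based test; it now asserts on the encoded arguments.
    mock_md5.return_value.hexdigest.return_value = "world"
    assert get_key("hello", ["world"]) == "hello:world"
    mock_md5.assert_called_once_with(b"hello")
    mock_md5.return_value.update.assert_called_once_with(b"world")
    mock_md5.return_value.hexdigest.assert_called_once_with()
```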

Our code now works, but we still have a problem. Let’s say someone else comes through and simplifies our code even further to just a couple of lines:

This code is functionally identical to the original code. For the same set of inputs it will always return the same output. Even so, our tests fail:

E AssertionError: Expected call: md5(b'hello')
E Actual call: md5(b'helloworld')

There’s clearly a problem with this simple test. It has been subject to both Type I error (it failed when the underlying production code was fine) and Type II error (it passed when the underlying production code was broken). In an ideal world, tests would fail if and only if the underlying code was broken. In an even more perfect world, passing tests would give us 100% confidence that our production code was correct. While neither of these ideals is achievable, they are paragons worth pursuing.

I call these kinds of tests “Tautology Tests”. They assert that the code is correct by ensuring the code executes as written which, of course, assumes the way it is written is correct.

I argue that Tautology Tests are a net negative for your code for a few reasons:

  1. Tautology Tests give engineers a false sense of security that their code is correct. They might look at the high code coverage and feel good about their project. Others coming into the code base will feel confident pushing changes as long as tests pass even though those tests aren’t really testing anything.
  2. Tautology Tests effectively freeze the implementation rather than test that the code behaves as desired. Whenever implementation details change, the tests must be updated to reflect the new implementation, instead of changing tests when the expected output changes. In turn, this trains engineers to correct the tests when they fail — instead of investigating why the test failed. When this happens tests reduce themselves to burdens, losing their original purpose as tools for preventing bugs from getting into production.
  3. Static analysis tools are capable of finding blatant errors in your code, like typos, that would otherwise be caught by a Tautology Test. Especially in the realm of dynamic languages, static analysis tools have improved significantly over the past 5 years, be it mypy for Python, Hack for PHP, or TypeScript for JavaScript. These tools are frequently more appropriate for finding typos and provide additional value to other engineers by making the code easier to understand and navigate.

In short, Tautology Tests frequently miss real issues, encourage the bad habit of blindly fixing tests, and cost substantially more to maintain than the value they provide.

Now consider if we rewrite the test to just test the expected output:
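A minimal sketch of such an output-based test, assuming the same `get_key` signature as above. The function is included so the example runs on its own, and the expected value is the MD5 hex digest of b"helloworld":

```python
import hashlib
from typing import List


def get_key(key: str, values: List[str]) -> str:
    md5_hash = hashlib.md5(key.encode())
    for value in values:
        md5_hash.update(value.encode())
    return f"{key}:{md5_hash.hexdigest()}"


def test_get_key():
    # No mocks: the real hashlib runs, and only the public result is checked.
    assert get_key("hello", ["world"]) == "hello:fc5e038d38a57032085441e7fe7010b0"
```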

Now my test doesn’t care about the internal details of get_key and will break only if get_key returns an incorrect value. I can change the internals of get_key as I see fit without having to update tests (unless I change the public behavior). My test is also succinct and easier to understand.

While this is a contrived example, it’s easy to find examples in real code that — for example — assume the output of an external service matches the implementation’s expectations just to increase code coverage.

How to Find Tautology Tests

  1. Tests that, when they fail, get updated much more frequently than the code they’re testing. Every such update is part of the cost we pay for test coverage. If that cost starts to exceed the value derived from the test, it’s a strong hint that our test code is too tightly coupled to the implementation. A related symptom: a small production code change requires updating a much larger number of tests.
  2. Test code that’s impossible to edit without looking at the implementation. If you can’t modify a test without also reading the implementation, there is a strong chance you’ve got a Tautology Test. Google’s “Testing on the Toilet” article Don’t Overuse Mocks had an all-too-familiar example of this, where you could rewrite the implementation from the test.

How to fix Tautology Tests

  • Keep I/O separate from logic. I/O is one of the most frequent reasons engineers have to reach for mocks. I/O is super important (without it, all we can do is spin CPU cycles and heat up our computers), but it should be pushed to the peripheries of your code instead of being interleaved with logic. The Sans-I/O working group in the Python community has some great documentation on this topic, and Cory Benfield covered this well at PyCon 2016 in his talk Building Protocol Libraries The Right Way.
  • Avoid mocking in-memory objects. If you find yourself mocking out dependencies that exist entirely within the confines of memory, there should be some very good reason for using a mock: maybe the underlying function is non-deterministic or takes too long to execute. Using real objects improves the value of the tests by testing more interactions per test case.

    Even then, there should be some tests that ensure our code is using that dependency in the correct way — like a test that ensures the output is within some expected range or matches. As an example below, we have a test that ensures our code works if randint returns a specific value and a test that ensures we’re calling randint correctly:
  • Use fixture data. If the dependency being mocked is an external service, consider creating a common set of fakes or using a mock server to provide fixture data. Centralizing the implementation of the fake allows for careful emulation of the real implementation’s behavior and minimizes how much test code has to change if the underlying implementation changes.
  • Don’t be afraid to leave some code uncovered! Given the choice between testing some code well and not testing it at all, the answer is clearly to test it well. Given the choice between a Tautology Test and no test, it’s less clear. Hopefully I’ve convinced you that Tautology Tests are a net negative and that leaving some code untested is a signal to future developers as to the current state of the world. They can choose to exercise caution when modifying that section of code or, preferably, use some of the techniques above to add proper tests.
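The randint advice above can be sketched as follows. `pick_winner` and both tests are hypothetical names invented for illustration; only `random.randint` and `unittest.mock.patch` are real library calls. The first test pins randint to a specific value, and the second runs the real randint to check that it is called with sensible bounds:

```python
import random
from typing import Sequence
from unittest.mock import patch


def pick_winner(entrants: Sequence[str]) -> str:
    """Hypothetical function under test: pick one entrant at random."""
    # randint is inclusive on both ends, so the upper bound is len - 1.
    return entrants[random.randint(0, len(entrants) - 1)]


def test_pick_winner_uses_value_from_randint():
    # Mock only the non-deterministic part and check the resulting output.
    with patch("random.randint", return_value=1):
        assert pick_winner(["ann", "bob", "cat"]) == "bob"


def test_pick_winner_calls_randint_correctly():
    # No mock: wrong bounds would show up as an IndexError or as an
    # entrant that never wins across many draws.
    entrants = ["ann", "bob", "cat"]
    seen = {pick_winner(entrants) for _ in range(1000)}
    assert seen == set(entrants)
```

Neither test asserts on the arguments passed to randint, so the bounds can be computed differently later without touching the tests, as long as every entrant can still win.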

It’s better to leave a line of code untested than to give the illusion that it is well tested.

Keep an eye out for Tautology Tests during code review as well. Ask yourself what this test is actually testing and not just “does it cover some lines of code?”

Remember, Tautology Tests are bad because they are not good.

Thanks to Kent Beck, Simon Stewart, Ben Hamilton, and Josh Cincinnati for feedback!