Structuring thousands of automated tests

Vlad-George Ardelean
TrustYou Engineering
13 min read · Apr 2, 2019

I’m going to write here about the structure of Python tests that we have adopted on multiple projects at TrustYou. The architecture has proven flexible and consistent, and has been applied successfully in the context of microservices with 1–2k tests. Our tests are easy to write, easy to place in a file/class, and then easy to find. By “structure” I actually mean several things, so let me be more specific:

1. A light theoretical model for testing
2. What do we define as unit, integration and end-to-end tests
3. The file and folder names where the tests are written
4. The structure of one file (python module) containing tests
5. The structure of one test method
6. What we define as a fixture
7. How to mock/stub all the things
8. Exceptions to everything! (the test stubs, the structure of the files, flow tests)
9. What we don’t know, and are probably doing the wrong way

…a lot of meaning in the word “structure” :)

1. A light theoretical model for testing

Let’s start with an image:

Image 1. Simple function call model: F1 and F2 are functions. I1 and I2 are the inputs to those functions (call arguments, and state); O1 and O2 are the outputs of those functions (both return values and modified state)

Here, we have a hypothetical function F1, which is called with some input (I1), and it produces some output (O1). Imagine the input as consisting of both the parameters the function is called with and all the relevant pieces of state that it makes use of (database state, global variables, file system, network calls, etc.). Yes, we’re modelling dirty functions here, not pure functions.

In the general case, function F1 is not isolated. It calls function F2 with some input (I2), and requiring its output (O2), it produces its own output (O1) in the end. F1 might also be called by another function itself, but here, we’re only considering the functions called by F1. Of course, in real life any function might call zero or more functions, but that’s irrelevant to this model, and adds no value. Again, think of all the inputs and outputs as consisting also of global state which was changed, along with actual function arguments and return values.
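To make the model concrete, here is a minimal sketch of it in code. The names (f1, f2, STATE, audit_log) and the tax example are invented for illustration; the model itself only speaks of F1, F2, I1/I2 and O1/O2.

```python
# Hypothetical example of the F1/F2 call model.
STATE = {"tax_rate": 0.25}  # global state: part of the input I2 that f2 reads


def f2(net_price):
    """F2: its input I2 is the argument plus STATE; its output O2 is the return value."""
    return net_price * (1 + STATE["tax_rate"])


def f1(net_prices, audit_log):
    """F1: its input I1 is the arguments; its output O1 is the return value
    plus the modified state (the entries appended to audit_log)."""
    total = 0.0
    for net_price in net_prices:
        gross = f2(net_price)    # F1 hands I2 to F2 and needs O2 back
        audit_log.append(gross)  # side effect: this is part of O1 as well
        total += gross
    return total


log = []
result = f1([100.0, 200.0], log)
```

After the call, both the return value and the contents of log count as O1, which is why a test for f1 may need to check both.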

2. What do we define as unit, integration and end-to-end tests

Unit tests are the category of tests whose definition is the most unambiguous. A unit test is a test which checks the behaviour of a function in isolation.

We consider 2 types of unit tests:

a. Those which, starting with I1, check that the function F2 is properly called with I2. These we call “mocking tests”. They check how our function behaves towards its callees.

b. Those which, starting with I1 and assuming F2 returns O2, check that F1 returns O1. These we call “stubbing tests”. They check how our function behaves towards its caller.

Ideally, unit tests will just test a specific scenario whereby a specific function is called. Any calls that it makes would (again, ideally) be mocked/stubbed away. Any functions it calls will be replaced with artefacts whose behaviour we control completely.
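As a rough sketch of the two kinds of unit tests, assume a hypothetical Notifier class where notify plays the role of F1 and send_email the role of F2. Neither name comes from our code base; they only illustrate the idea.

```python
from unittest import mock


class Notifier:
    def send_email(self, address, body):
        """F2: the callee. In real code this would talk to a mail server."""
        raise RuntimeError("real network call; must never run in a unit test")

    def notify(self, user):
        """F1: the function under test."""
        delivered = self.send_email(user["email"], "Hello " + user["name"])
        return "sent" if delivered else "failed"


def test_notify_passes_address_and_greeting_to_send_email():
    # Mocking test: starting from I1, check that F2 is called with I2.
    with mock.patch.object(Notifier, "send_email") as send_email_mock:
        Notifier().notify({"email": "ann@example.com", "name": "Ann"})
    send_email_mock.assert_called_once_with("ann@example.com", "Hello Ann")


def test_notify_returns_failed_when_send_email_returns_false():
    # Stubbing test: assuming F2 returns O2, check that F1 returns O1.
    with mock.patch.object(Notifier, "send_email", return_value=False):
        outcome = Notifier().notify({"email": "ann@example.com", "name": "Ann"})
    assert outcome == "failed"
```

The first test never looks at what notify returns, and the second never looks at how send_email was called. Keeping the two concerns apart is what makes each test easy to name.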

From another angle, we regard unit tests as at least not requiring any database connection (or network access). This definition lets us skip some of the mocking/stubbing when we consider that the test is good enough. Having all tests respect the ideal case doesn’t really add that much value, and so we try to apply these techniques when they most make sense.

Integration tests also test function calls, but here the purpose is to see that data actually go through our system. Here we don’t mock/stub functions, unless they’re “dangerous” (for example, sending out emails, making API calls to other services), or expensive and irrelevant (to speed up the run time of the test suite).

End-to-end tests are similar to integration tests. Originally, the purpose here was to try and use the system as our clients would. Since we’re writing microservices, this is convenient and easy to do. For the end-to-end tests we always need a web server to be started, and we call it with simple flows, like:

1. authenticate

2. make an HTTP request

3. verify the HTTP request was successful

4. try to clean up the resources (maybe with another HTTP request, or database queries)
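The four steps above can be sketched roughly as follows. Everything here is an assumption made for the sake of a self-contained example: the toy “notes” service, its /login and /notes endpoints, and the token are invented, and the server is started inside the test only so that the sketch can run on its own. Against a real service, only the test function at the bottom would remain.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

TOKEN = "secret-token"  # hypothetical credential
NOTES = {}              # in-memory "database" of the toy service


class NotesHandler(BaseHTTPRequestHandler):
    def log_message(self, *args):
        pass  # keep the test output quiet

    def _reply(self, status, payload):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        if self.path == "/login":
            self._reply(200, {"token": TOKEN})
        elif self.headers.get("Authorization") != TOKEN:
            self._reply(401, {"error": "unauthorized"})
        else:
            NOTES["1"] = "a note"
            self._reply(201, {"id": "1"})

    def do_DELETE(self):
        NOTES.pop(self.path.rsplit("/", 1)[-1], None)
        self._reply(200, {})


def call(port, method, path, token=None):
    """Send one HTTP request to the local server; return (status, JSON body)."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        headers = {"Authorization": token} if token else {}
        connection.request(method, path, headers=headers)
        response = connection.getresponse()
        return response.status, json.loads(response.read())
    finally:
        connection.close()


def test_create_note_flow():
    server = HTTPServer(("127.0.0.1", 0), NotesHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_port
    try:
        _, login = call(port, "POST", "/login")                       # 1. authenticate
        status, note = call(port, "POST", "/notes", login["token"])   # 2. make an HTTP request
        assert status == 201                                          # 3. verify it succeeded
        call(port, "DELETE", "/notes/" + note["id"], login["token"])  # 4. clean up
        assert NOTES == {}
    finally:
        server.shutdown()
        server.server_close()
```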

3. The file and folder names where the tests are written (the test package structure)

The file/folder structure took us some time to figure out. We were inspired by previous experience with JUnit, NUnit, and random articles. Since these frameworks come from languages without module-level functions, we also had to improvise a little when testing functions defined at the module level, and stuff in the __init__.py module. (We don’t really define much in __init__.py modules; we just import things there, for convenience.)

Writing tests within the described file/folder structures allows us to find tests for a given class/function very fast, and to always be certain where to write those tests. Also, no more duplicated tests (well… at least within their category of unit/integration/end-to-end).

Maybe there’s too much going on in this image, so let’s break it down.

On the left side, we have the packages, modules, classes (C), methods (m) and functions (f) in the app code.

On the right side, we have the tests corresponding to all of these.

You’ll notice an interesting thing, which I call “promotion”.

App functions have a test class corresponding to them in the tests.

App classes have a test module.

App modules have a test package.

App packages also get a package in the tests.

So every language construct in the app code gets a “bigger” language construct in the tests

Functions/Methods -> classes

Classes -> Modules

Modules -> Packages

You’ll notice also that the module-level functions, since they don’t belong to any class, will just get placed in a module called “test_functions”, inside the corresponding package.

Also, you’ll notice that stuff in the __init__.py module, will get placed in a corresponding test__init package.

This simple set of rules will allow a 1:1 translation between the vast majority of code constructs (functions/methods/classes, etc) and a corresponding place in the tests package. Thus, no more test duplication, and we’ll instantly know exactly what functionality is at least touched by tests.
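The “promotion” rules are mechanical enough to be written down as a small function. The sketch below is only an illustration: the helper name locate_tests and the snake_case/CamelCase conversions are assumptions, while the tests/unit_tests root, the test_functions module and the test__init package follow the conventions described in this article.

```python
import re


def snake_case(name):
    """HotelChain -> hotel_chain (assumed naming convention for test modules)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def camel_case(name):
    """deactivate_hotels_by_city -> DeactivateHotelsByCity."""
    return "".join(part.capitalize() for part in name.split("_") if part)


def locate_tests(dotted_module, func, cls=None, kind="unit_tests"):
    """Return (test file path, test class name) for an app function or method.

    dotted_module: app module, e.g. "project.models"
    func:          name of the function/method under test
    cls:           name of the app class, or None for a module-level function
    """
    parts = ["tests", kind]
    for name in dotted_module.split(".")[1:]:  # drop the top-level app package
        # modules/packages -> test packages; __init__ -> test__init
        parts.append("test__init" if name == "__init__" else "test_" + name)
    # classes -> test modules; module-level functions -> test_functions
    parts.append("test_%s.py" % snake_case(cls) if cls else "test_functions.py")
    # functions/methods -> test classes
    return "/".join(parts), "Test" + camel_case(func)


method_location = locate_tests("project.models", "deactivate_hotels_by_city", cls="Hotel")
function_location = locate_tests("project.utils", "parse_date")
```

Here method_location points at the class TestDeactivateHotelsByCity in tests/unit_tests/test_models/test_hotel.py, and function_location at TestParseDate in tests/unit_tests/test_utils/test_functions.py. The module project.utils and the function parse_date are made up for the example.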

You’ll also notice from the image that I presented only the structure of unit tests here. We use the same structure for integration/end-to-end tests, it’s just that maybe some modules are tested more thoroughly in the unit tests (parsers, validators, utilities), some in the integration (database access, external services, simple API operations), and some in the end-to-end test package (simple flows: authenticate, create resource, retrieve resource, cleanup).

4. The structure of one file (python module) containing tests

One python test module will generally represent the tests written for either a certain class, or for all the functions in a certain module.

Bringing your attention back to the test classes (which are used just as simple namespaces, nothing fancier), you’ll see that the methods in a test class can now be freely named to be as descriptive as possible. As such, you can tell what’s being tested without even looking at the body of the tests. All that’s needed is for the test methods to be as well-named as possible. Everything above the specific scenarios that a function is tested in is already decided by the structure, so we don’t need to be creative with those names at all! No decisions! hooray! :D

As test methods, we usually have:

a. one or more tests that the function works successfully. These are very important. If we don’t write code in a TDD fashion, then we’ll at least make sure to write these tests. TDD usually will require one to write a lot of these, in a “test works until here, test works until there, test works a little bit more” fashion.

b. probably even more tests for edge cases and scenarios where stuff didn’t go well. For APIs, it’s common that the failure scenarios are multiple: 400, 401, 403, 409 error codes vs just the 200 or 201 success codes.
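A test class following this layout might look like the sketch below. The function parse_page_size is a hypothetical validator invented for the example; the point is only the split between a few “it works” tests and a larger number of failure scenarios, each with a descriptive name.

```python
def parse_page_size(raw):
    """Hypothetical app function: parse a "page size" query parameter."""
    size = int(raw)  # raises ValueError for non-numeric input
    if not 1 <= size <= 100:
        raise ValueError("page size must be between 1 and 100")
    return size


class TestParsePageSize:
    # a. tests that the function works
    def test_returns_int_for_numeric_string(self):
        assert parse_page_size("25") == 25

    def test_accepts_lower_and_upper_bounds(self):
        assert parse_page_size("1") == 1
        assert parse_page_size("100") == 100

    # b. edge cases and failure scenarios
    def test_rejects_zero(self):
        self._assert_rejected("0")

    def test_rejects_value_above_upper_bound(self):
        self._assert_rejected("101")

    def test_rejects_non_numeric_string(self):
        self._assert_rejected("many")

    @staticmethod
    def _assert_rejected(raw):
        # Plain-Python stand-in for a "raises" helper from a test framework.
        try:
            parse_page_size(raw)
        except ValueError:
            return
        raise AssertionError("expected ValueError for %r" % (raw,))
```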

5. The structure of one test method

One unit test method usually begins with a “mocking” part. If you remember the function call model explained at point 1, we have F1 under test, and we replace F2 with a “test double”.

Martin Fowler’s article “Mocks aren’t stubs” ( https://martinfowler.com/articles/mocksArentStubs.html ) defines common meanings for the words mocks, stubs and test doubles:

  • Mocks are what we are talking about here: objects pre-programmed with expectations which form a specification of the calls they are expected to receive.
  • Stubs provide canned answers to calls made during the test, usually not responding at all to anything outside what’s programmed in for the test.
  • Test Double — the generic term for any kind of pretend object used in place of a real object for testing purposes.

Remembering point 1 again, since we’re testing F1, we’d ideally mock out every function called by F1. Specifically, by mocking, we mean replacing all F2 functions with objects that later, we’ll use in assertions. We’ll assert the number of times and the parameters that these were called with.

For integration tests, we just use less mocking, and for end-to-end tests we might not mock at all, or just minimally.

Inside the test function, we then generally distinguish 3 parts, the AAAs of testing:

Arrange: This first part can be understood as the test setup. Here’s where we set the behaviour of the test doubles (the objects that replaced functions/objects), and prepare other objects that were not replaced with doubles.

Things that we commonly replace with doubles:

  • Instance/static/class methods of classes
  • The default “self” and “cls” arguments for methods
  • Objects in any module’s scope: Functions, classes, module-level variables (constants)
  • Attributes of instances
  • Coroutines, generators
  • context managers

Things that we can’t replace with test doubles, unless under special conditions, if at all:

  • Inner functions (functions defined inside other methods/functions)
  • Inner classes (defined inside functions/classes)

…we don’t use a lot of inner functions/classes. When we do, they’re small and not complex, and we usually can test those by testing the functions/methods that define them. This is mostly a non-issue. However you might have noticed that at point 3, I haven’t even bothered to address where we place the tests for these kinds of objects. This is because we usually just test these together with the functions/methods that create them.

Act: This part consists usually of a single instruction, namely calling a method/function with the prepared setup and arguments.

Assert: Here we ensure that the actual outcome is how we expect it to be. We check the output value of callables (methods/functions) and any side-effects that they had.
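Here is a small sketch of the three parts in one test, replacing an instance attribute and a context manager with doubles. ReportWriter and its storage attribute are hypothetical names made up for this example.

```python
from unittest import mock


class ReportWriter:
    def __init__(self, storage):
        self.storage = storage  # instance attribute that the test replaces

    def write(self, name, lines):
        with self.storage.open(name) as handle:  # context manager
            for line in lines:
                handle.write(line + "\n")
        return len(lines)


class TestWrite:
    def test_writes_every_line_and_returns_count(self):
        # ARRANGE: a MagicMock supports the context manager protocol,
        # so storage.open(...) can be used in a "with" statement.
        storage = mock.MagicMock()
        handle = storage.open.return_value.__enter__.return_value
        writer = ReportWriter(storage)

        # ACT: a single call to the method under test
        written = writer.write("report.txt", ["a", "b"])

        # ASSERT: check the return value and the side effects
        assert written == 2
        storage.open.assert_called_once_with("report.txt")
        assert handle.write.call_args_list == [mock.call("a\n"), mock.call("b\n")]
        storage.open.return_value.__exit__.assert_called_once()
```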

6. What we define as a fixture

In Python, it’s very easy to think about a fixture as either a prepared response from a callable (most often), or a prepared set of input parameters for that function (less often).

More concretely, let’s think about database access, and the usual ORM-style classes which wrap it.

To start, here’s a watered-down database model of a hotel. It has 2 instance methods (def activate and def deactivate) which execute some logic related to a hotel instance, and its field status. Then we have a class method (def get_all_active_hotels) which logically wouldn’t make sense to be an instance method, and a “batch update” method def deactivate_hotels_by_city which just uses 2 of the previous methods, so we can illustrate a simple mocking technique

class Hotel(Model):
    id = StringField()
    name = StringField()
    zip = StringField()
    city = StringField()
    country = StringField()
    phone = StringField()
    status = StringField()  # active/inactive

    def activate(self):
        """Sets the status to "active" and calls other services"""
        pass

    def deactivate(self):
        """Sets the status to "inactive" and calls other services"""
        pass

    @classmethod
    def deactivate_hotels_by_city(cls, city):
        """Deactivates all hotels from the given city.

        This code is severely stripped down. This is not how we write
        code - we have heard about SQL WHERE clauses, of course, and
        batch updating. This serves just to illustrate a testing technique.
        """
        try:
            hotels = cls.get_all_active_hotels()
        except Exception:  # errors are ignored on purpose, see the tests
            hotels = []
        for hotel in hotels:
            if hotel.city == city:
                hotel.deactivate()

    @classmethod
    def get_all_active_hotels(cls):
        """Does DB querying, and returns a list of 0..N hotels"""
        return [Hotel(...), Hotel(...), ...]

The most appropriate example of a fixture here is related to the output of method def get_all_active_hotels, where we imagine it could:

  1. Return an empty list
  2. Raise an exception
  3. Return one hotel
  4. Return many hotels

Then, we can imagine that we might need the output of this function in many tests. That’s for us a fixture, first of all, just a way to reuse (mostly) example output from a function. There is a more subtle benefit to fixtures, when they are being used in a lot of places, but let’s first show a fixture:

class HotelFixture:
    class get_all_active_hotels:
        class empty:
            @staticmethod
            def output():
                return []

        class exception:
            @staticmethod
            def output():
                # we just return the error here, but we'll actually
                # raise it during tests
                return IOError('whoops')

        class one:
            @staticmethod
            def output():
                return [Hotel(id='a', city='London')]  # other fields omitted

        class many:
            @staticmethod
            def output():
                return [Hotel(...), Hotel(...), ...]

It looks a little weird to nest classes like that. We’re using classes here as just simple syntactic namespaces (code completion features work well with this structure)

Anyway, this fixture class is a good place to save behaviour examples from a real function.

If we have this, it’s then easy to reuse and check against these canned-responses

7. How to mock/stub all the things

Now that we have our example model class, some business methods, and a fixture, we can show real-ish examples of mocking/stubbing.

# file tests/unit_tests/test_models/test_hotel.py
from unittest import mock

from project.models import Hotel
# HotelFixture is the fixture class shown at point 6


class TestDeactivateHotelsByCity:
    # Decorators are applied bottom-up, so the mock created by the
    # lowest @mock.patch is passed as the first argument after self.
    @mock.patch('project.models.Hotel.get_all_active_hotels')
    @mock.patch('project.models.Hotel.deactivate')
    def test_ignores_exceptions(self, deactivate_hotel_mock, get_hotels_mock):
        # ARRANGE
        # When Hotel.get_all_active_hotels() is called,
        # it will raise the exception we prepared in the fixture
        get_hotels_mock.side_effect = HotelFixture.get_all_active_hotels.exception.output()

        # ACT
        # Notice, calling this, we already know it
        # won't raise exceptions
        Hotel.deactivate_hotels_by_city('London')

        # ASSERT
        deactivate_hotel_mock.assert_not_called()

Remember from point 1, the F1/F2, I1/I2, and O1/O2 model? (functions, inputs and outputs). We have just tested the scenario where def deactivate_hotels_by_city ignores the errors it receives when calling another method.

But many of you probably wonder by now about the following line

# It's a little long, right?
get_hotels_mock.side_effect = HotelFixture.get_all_active_hotels.exception.output()

You’d probably imagine this line works as well:

get_hotels_mock.side_effect = IOError('whoops')

And you would be totally right. However, fixtures are about reuse! Let’s imagine now that it’s not the def deactivate_hotels_by_city method that we’re testing, but the “problematic” method def get_all_active_hotels. When writing tests for this method, we’d at some point want to assert that an IOError is raised. Then probably later, we could change the error type that’s being raised, and then we’d have to go to all the tests and change the hardcoded IOError('whoops') instance. That’s not that bad really, especially for just this exception. However, for more complex objects (like a list of hotels), modifying hotels in tens/hundreds of places tends to suck more.

Ultimately, the decision of what behaviours of a function to “save” in a fixture, is up to the team. They know best what objects will be used more or less often. We however use these for database model instances (intensively) and calls to internal/external services (not quite so intensively) quite successfully.
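To illustrate the reuse, here is a self-contained sketch in which two tests draw their canned responses from the same fixture class. The Hotel class below is a plain-Python stand-in for the ORM model (no Model base class, only the fields the tests need), and the concrete hotels in the fixture are invented for the example.

```python
from unittest import mock


class Hotel:
    """Stand-in for the ORM model, reduced to what the tests need."""

    def __init__(self, id, city):
        self.id = id
        self.city = city
        self.status = "active"

    def deactivate(self):
        self.status = "inactive"

    @classmethod
    def get_all_active_hotels(cls):
        raise RuntimeError("would query the database")

    @classmethod
    def deactivate_hotels_by_city(cls, city):
        try:
            hotels = cls.get_all_active_hotels()
        except Exception:
            hotels = []
        for hotel in hotels:
            if hotel.city == city:
                hotel.deactivate()


class HotelFixture:
    class get_all_active_hotels:
        class one:
            @staticmethod
            def output():
                return [Hotel("a", "London")]

        class many:
            @staticmethod
            def output():
                return [Hotel("a", "London"), Hotel("b", "Paris"), Hotel("c", "London")]


class TestDeactivateHotelsByCity:
    @mock.patch.object(Hotel, "get_all_active_hotels")
    def test_deactivates_only_hotels_in_the_given_city(self, get_hotels_mock):
        hotels = HotelFixture.get_all_active_hotels.many.output()
        get_hotels_mock.return_value = hotels

        Hotel.deactivate_hotels_by_city("London")

        assert [h.status for h in hotels] == ["inactive", "active", "inactive"]

    @mock.patch.object(Hotel, "get_all_active_hotels")
    def test_leaves_hotel_active_when_city_does_not_match(self, get_hotels_mock):
        hotels = HotelFixture.get_all_active_hotels.one.output()
        get_hotels_mock.return_value = hotels

        Hotel.deactivate_hotels_by_city("Paris")

        assert hotels[0].status == "active"
```

If the shape of a hotel changes later, only the fixture class needs to be updated, not every test that stubs get_all_active_hotels.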

Going further, we have shown how to mock:

  1. Instance methods
  2. Class methods

These are the most common use-cases for mocking. However, we use asynchronous code quite a lot, so we also commonly mock:

  1. coroutines
  2. iterators
  3. the object self-reference (“self”), an interesting technique that Python makes quite easy

For mocking coroutines we either wrap the return values of functions in a Future object (for testing tornado code), or use CoroutineMock from the asynctest library. To be honest, we did this purely out of convenience. We’d probably have been able to use unittest.mock to mock some magic methods like __aiter__, __aenter__, __anext__, __aexit__, and would have eventually ended up with the same thing that asynctest just gives us out of the box. So thank you to Martin Richard for allowing us to write less code! :)
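As a rough illustration of mocking a coroutine, the sketch below uses AsyncMock, which is part of unittest.mock in the standard library since Python 3.8. It is swapped in here for asynctest’s CoroutineMock so that the example needs no third-party package; it is not the code we run. HotelClient and its methods are hypothetical names.

```python
import asyncio
from unittest import mock


class HotelClient:
    async def fetch_status(self, hotel_id):
        """F2: would make a network call in real code."""
        raise RuntimeError("real network call; must never run in a unit test")

    async def is_active(self, hotel_id):
        """F1: the coroutine under test."""
        return await self.fetch_status(hotel_id) == "active"


def test_is_active_awaits_fetch_status():
    # ARRANGE: replace the coroutine with an AsyncMock returning a canned value
    with mock.patch.object(
        HotelClient, "fetch_status", new_callable=mock.AsyncMock
    ) as fetch_mock:
        fetch_mock.return_value = "active"

        # ACT: drive the coroutine to completion
        result = asyncio.run(HotelClient().is_active("a"))

    # ASSERT: both the result and the awaited call
    assert result is True
    fetch_mock.assert_awaited_once_with("a")
```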

For iterators, those were simpler to just “figure out”. On our MagicMock object, we just create an __iter__ method where we return stuff like…lists :P

from unittest import mock


class Hotel:
    def get_next_hotels_iterator(self):
        return (Hotel(...) for x in range(9))  # dummy implementation

    def activate_next_hotels(self, number):
        """Activate a certain number of hotels that come "next" in
        some order. Then mark the work as done.
        """
        # yes, a while loop would make more sense.
        # Let's focus on the tests though :P
        for hotel in self.get_next_hotels_iterator():
            if number <= 0:  # stop once `number` hotels were activated
                break

            hotel.activate()
            number -= 1
        self.mark_work_done()


class TestActivateNextHotels:
    def test_activate_hotel_works(self):
        hotel1 = mock.MagicMock()
        hotel2 = mock.MagicMock()

        # This is how we mock the self-reference:
        # we call an instance method directly on the class, and
        # explicitly pass our magic mock as the reference to the
        # instance! I know, Python is crazy (fun)! :D
        mock_self = mock.MagicMock()

        # This is how we mock an iterator. Because self is a mock,
        # get_next_hotels_iterator() is looked up on mock_self, so we
        # configure it there and need no mock.patch.
        iterator_mock = mock_self.get_next_hotels_iterator.return_value
        iterator_mock.__iter__.return_value = iter([hotel1, hotel2])

        Hotel.activate_next_hotels(mock_self, 1)

        hotel1.activate.assert_called_once()
        hotel2.activate.assert_not_called()

        mock_self.mark_work_done.assert_called()

8. Exceptions to everything! (the test stubs, the structure of the files, flow tests)

We’re not committed to purity, because practicality should win, so here are the exceptions to our testing system that we allow, and even encourage.

First of all, like in the example shown where we saved the behaviour for Hotel.get_all_active_hotels in the fixture HotelFixture.get_all_active_hotels.exception.output, we wouldn’t really save all the exceptions raised by a function in a fixture. We do it sometimes, but life’s too short to catch all the exceptions :) Plus, creating fixtures does take time. For trivial things, like lambda x,y: x+y+1, we wouldn’t bother creating complex fixtures, if at all.

Second, it might be the case that some classes get a little too many methods. We might then have test modules for each method. The rule was that all tests for a class should be put in the same file. We’re breaking that rule here, because then that single file would get a little too big, and it’s simpler to have a few more files.

Third, when hunting down really complex bugs, or when the HTTP APIs do more complex things than just CRUD, we might create something called “flow” tests. These tests can have whatever crazy structure is needed by whoever is hunting the bug: create X, call Y, delete Z, try to create X again unsuccessfully, try to delete Z again unsuccessfully, try to delete Z yet again, ERROR! (systems will get complicated at some point, so you’ll have stuff like this).

9. What we don’t know, and are probably doing the wrong way

We have heard about cool and shiny libraries and tools, and do have the developer-crave to learn everything, but so far we didn’t have time to seriously check out:

  • hypothesis (library that auto-generates tests. Seemed fine for apps that do numeric calculations, or process strings intensively)
  • entity generators with dependencies (like factory-boy)…. we created our static fixtures, and never investigated
  • we don’t have UI tests; this is mostly due to the fact that we don’t have a UI (on my team), and I am not sure about how they do it in other teams.
  • we don’t have integration tests between server applications and cron-jobs.
  • we also don’t have tests that span multiple projects
  • we’re probably not that good at creating database-level fixtures, and because the database is a complex beast for now, we’ll probably not be able to create too complex database setup soon.

Vlad-George Ardelean
TrustYou Engineering

Python programmer, I like technical things and I hope we solve the climate challenges soon enough, so we can chill :)