Unit Testing: Our Tests Are Too Big and Some Ways to Fix Them

dm03514 · Dm03514 Tech Blog · Apr 28, 2019

Unit testing has a much smaller useful scope than the industry suggests. Poorly scoped unit tests inhibit software evolution by producing false positives, and they increase complexity and slow development down by inducing unnecessary cognitive overhead whenever tests need to be updated or understood. This post provides hands-on examples of how narrowing the scope of unit tests, so that they exercise behavior, allows software designs to evolve while minimizing false positives and test complexity.

Terminologies

Before we continue there are a couple of terms that need defining:

Unit test: This discussion uses Google’s definition of a unit test (what Google calls a “small test”), with constraints such as “No Network Access” and “No External Systems”; all decisions and discussions take place in relation to the constraints that Google outlines.

Solitary Unit Tests: Solitary tests use test doubles to stub out their dependencies; this will be discussed in detail later. Solitary is in contrast to Sociable unit tests, which call into their real dependencies. These terms were defined by Jay Fields in his amazing book Working Effectively with Unit Tests. This post strictly focuses on scoping Solitary unit tests (a short sketch contrasting the two styles follows these definitions).

Behavior/Functionality: This is the computation, action, or behavior of a component. The purpose of most tests is to provide feedback on the accuracy or correctness of certain behavior/functionality. Choosing the correct amount of behavior/functionality to test is one of the most difficult parts of testing.
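
To make the Solitary/Sociable distinction concrete, here is a minimal sketch; the RateTable and TaxCalculator classes are hypothetical, invented purely for illustration. The sociable test calls into the real dependency, while the solitary test replaces it with a double:

import unittest
from unittest.mock import MagicMock


class RateTable:
    """A real dependency: looks up the tax rate for an amount."""
    def rate_for(self, amount: float) -> float:
        return 0.5


class TaxCalculator:
    def __init__(self, rates):
        self.rates = rates

    def tax(self, amount: float) -> float:
        return amount * self.rates.rate_for(amount)


class TaxCalculatorTestCase(unittest.TestCase):

    def test_sociable(self):
        # Sociable: exercises TaxCalculator together with its real dependency.
        self.assertEqual(50.0, TaxCalculator(rates=RateTable()).tax(100))

    def test_solitary(self):
        # Solitary: the real dependency is replaced with a test double.
        rates = MagicMock(rate_for=MagicMock(return_value=0.5))
        self.assertEqual(50.0, TaxCalculator(rates=rates).tax(100))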

An Example of a Broadly Scoped Test

Let’s start by showing a test with too broad of scope:

class PriceAnalytics_Redis:
    def __init__(self):
        self.redis = Redis()

    def avg_price(self) -> float:
        prices = self.redis.prices()
        return sum(prices) / float(len(prices))

Next is the Redis class:

from typing import List

import redis


class Redis:
    def __init__(self):
        self._client = redis.StrictRedis()

    def prices(self) -> List[float]:
        """
        Prices returns an array of all prices in the system.

        :return:
        """
        # makes call to redis...
        return self._client.lrange('prices', 0, -1)

Finally, the unit test verifies avg_price:

import unittest
from unittest.mock import patch

from testingtutorials.behaviortesting.prices import PriceAnalytics_Redis


class PriceAnalytics_Redis_TestCase(unittest.TestCase):

    @patch('testingtutorials.behaviortesting.prices.Redis')
    def test_avg_price_success(self, mock_redis):
        redis_instance = mock_redis.return_value
        redis_instance.prices.return_value = [1, 1]

        analytics = PriceAnalytics_Redis()

        self.assertEqual(
            1,
            analytics.avg_price(),
        )

        redis_instance.prices.assert_called_once()

And to show that this works:

$ nosetests tests.behaviortesting.test_prices.PriceAnalytics_Redis_TestCase -v
test_avg_price_success (tests.behaviortesting.test_prices.PriceAnalytics_Redis_TestCase) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.001s
OK

What is scope?

Choosing a correct scope is essential to maximizing unit test value. Scope is everything a test touches, both logically (the number of assertions) and structurally (the number of dependencies), in order to exercise a target. Broadly scoped tests often touch implementation details or make too many assertions. Narrowly scoped tests explicitly configure their dependencies and exercise their target while minimizing the number of components they must touch to verify the desired behavior/functionality.

Scope has two components: what is being tested (functionality/behavior), and what’s required in order to test it.

Behavior/Functionality

A narrow unit test operates at the client interface level. It focuses on behavior in terms of inputs and outputs, not on the sequence of events (the implementation) that produces those inputs and outputs.

Compare this to a broadly scoped test, which often touches implementation details or tests too much behavior/functionality.

When scope is narrow, the implementation can change while the tests still provide useful, focused feedback.

A narrowly scoped test focused on behavior is decoupled from implementation. It protects the engineer and provides high-fidelity feedback on higher-level functionality (that of inputs and outputs). When a narrowly scoped test fails, it indicates that an important invariant or piece of application functionality is broken, not merely that something has changed or been moved around.

Testing behavior rarely, if ever, involves testing interactions. Interactions are more effectively tested at higher levels (i.e. Sociable unit tests or integration tests). At the Solitary unit test level there is almost no value in asserting that one component calls the correct methods of another component; the redis_instance.prices.assert_called_once() in the broad test above is exactly such an interaction assertion. In narrowly scoped unit tests, feedback on structure is already implicitly enforced through instantiation and usage of components (i.e. by the compiler/interpreter), so they shouldn’t require many explicit structural assertions. Broad scope extends beyond behavior; narrow scope isolates it. Narrow scope allows evolution to take place, whereas broadly scoped tests inhibit it.
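
As a minimal sketch of the difference (the Totaler class is hypothetical, invented purely for illustration): the first test asserts on behavior, the second on an interaction. Only the second would break if total() were refactored to, say, cache or batch its calls, even though the observable behavior stayed the same:

import unittest
from unittest.mock import MagicMock


class Totaler:
    def __init__(self, source):
        self.source = source

    def total(self) -> float:
        return sum(self.source.values())


class TotalerTestCase(unittest.TestCase):

    def test_behavior(self):
        # Behavior assertion: verifies the input -> output contract.
        source = MagicMock(values=MagicMock(return_value=[1, 2, 3]))
        self.assertEqual(6, Totaler(source).total())

    def test_interaction(self):
        # Interaction assertion: verifies how total() talks to its
        # collaborator, coupling the test to the implementation.
        source = MagicMock(values=MagicMock(return_value=[1, 2, 3]))
        Totaler(source).total()
        source.values.assert_called_once()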

Test Dependencies

The second aspect of broadly scoped tests are the dependencies required in order to execute a test. The diagram below shows the components and their relationships for the broadly scoped test code above:

UnitTest has a hard dependency on the unit it is exercising (PriceAnalytics). To test PriceAnalytics it dynamically patches PriceAnalytics’ import of Redis. The test also needs to know about the Redis implementation in order to provide a valid test double! This dependency graph is alarmingly complex, which should be concerning given how contrived the example is. This test has a single patch; in my experience, tests with multiple patches are common.

The broad test example oversteps its bounds in exercising a simple average calculation. It pulls in a lot of dependencies, which triggers false-positive test failures and induces unnecessary cognitive overhead. Appropriately scoped tests provide “just enough” feedback to notify when important functionality or an invariant has been violated, but do not inhibit refactoring or changing an implementation.

Contrast the dependency diagram above with that of a narrowly scoped solitary unit test.

The dependency graph becomes dramatically simpler! We’ll go into detail below about how this test is made achievable by restructuring the code.

When comparing test scopes: fewer assertions is narrower than more assertions; fewer component dependencies is narrower than more component dependencies; and a narrower scope requires fewer changes when an implementation changes than a broad scope does.

So, What’s the Problem with Broad Scope?

Most code bases I’ve seen in languages that support dynamic loading/patching have tests very similar to the one above; in my experience this is a very common pattern. While I do think that in most cases these tests are better than no feedback at all, there are a couple of huge issues with broadly scoped tests:

Inhibit Evolution

Evolution, in terms of testing, is when tests allow implementations to change while still providing feedback on expected behavior. Tests that encourage evolution are those that allow for aggressive refactoring.

Suppose the company wants to switch from Redis to Memcache, so a new Memcache datasource is added:

import json
from typing import List

from pymemcache.client.base import Client


class Memcache:
    def __init__(self):
        self._client = Client(('localhost', 11211))

    def prices(self) -> List[float]:
        """
        Prices returns an array of all prices in the system.

        :return:
        """
        # makes call to memcached...
        return json.loads(self._client.get('prices'))

An engineer updates price analytics to use Memcache:

class PriceAnalytics_MigrateMemcache:
    def __init__(self):
        self.memcache = Memcache()

    def avg_price(self) -> float:
        prices = self.memcache.prices()
        return sum(prices) / float(len(prices))

and then executes the test suite, which displays the following error:

$ nosetests tests.behaviortesting.test_prices.PriceAnalytics_MigrateMemcache_TestCase -v
test_avg_price_success (tests.behaviortesting.test_prices.PriceAnalytics_MigrateMemcache_TestCase) ... ERROR
======================================================================
ERROR: test_avg_price_success (tests.behaviortesting.test_prices.PriceAnalytics_MigrateMemcache_TestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/unittest/mock.py", line 1195, in patched
return func(*args, **keywargs)
File "/Users/dm03514/code/github.com/dm03514/testing-tutorials/tests/behaviortesting/test_prices.py", line 35, in test_avg_price_success
analytics.avg_price(),
File "/Users/dm03514/code/github.com/dm03514/testing-tutorials/testingtutorials/behaviortesting/prices.py", line 51, in avg_price
prices = self.memcache.prices()
File "/Users/dm03514/code/github.com/dm03514/testing-tutorials/testingtutorials/behaviortesting/prices.py", line 20, in prices
return json.loads(self._client.get('prices'))
File "/Users/dm03514/.envs/testing-tutorials/lib/python3.7/site-packages/pymemcache/client/base.py", line 450, in get
return self._fetch_cmd(b'get', [key], False).get(key, default)
File "/Users/dm03514/.envs/testing-tutorials/lib/python3.7/site-packages/pymemcache/client/base.py", line 751, in _fetch_cmd
self._connect()
File "/Users/dm03514/.envs/testing-tutorials/lib/python3.7/site-packages/pymemcache/client/base.py", line 274, in _connect
sock.connect(self.server)
ConnectionRefusedError: [Errno 61] Connection refused
----------------------------------------------------------------------
Ran 1 test in 0.005s
FAILED (errors=1)

The computation itself isn’t broken, but the test is failing. Nothing has changed about the application’s ability to calculate average prices! The test is failing for a reason unrelated to the computation being performed: the scope of the test is too broad. Any time spent updating, understanding, or refactoring this test is wasted time.

The test above results in a false positive; there is no actual issue with the application’s ability to calculate average prices. Beyond the false positive itself, having to reason about datasources when dealing with averaging prices induces unnecessary cognitive overhead. All of these issues are fairly manageable in small test suites, but in large suites they can hit critical mass: making a small implementation change means sifting through tens or hundreds of tests, trying to understand whether the errors are real or false positives. This decreases velocity and often requires many rote, mechanical test updates.

Interactions (integrations) between production-like components using production protocols should be tested at a higher level of the testing pyramid.
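
For contrast, here is a sketch of what such a higher-level test might look like. It assumes a real Redis instance running locally (the test case name and setup are illustrative, not from the original post), and it deliberately violates the “small test” constraints:

import unittest

import redis


class PricesRedisIntegrationTestCase(unittest.TestCase):
    """An integration test, not a small/unit test: it exercises the
    real Redis protocol over the network."""

    def setUp(self):
        self.client = redis.StrictRedis()
        self.client.delete('prices')
        self.client.rpush('prices', 1, 1)

    def tearDown(self):
        self.client.delete('prices')

    def test_prices_round_trip(self):
        # redis-py returns bytes, so a real datasource needs to convert
        # values to floats; that conversion is exercised here, at the
        # integration level, rather than in a Solitary unit test.
        prices = [float(p) for p in self.client.lrange('prices', 0, -1)]
        self.assertEqual([1.0, 1.0], prices)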

A Narrowly Scoped Test

Now that we’ve bashed broadly scoped tests, how do we fix the example test? The tricky part is that, counterintuitively, fixing tests happens at the program design and/or structural level, and often not at the test level. The program structure needs to be modified in order to support narrowly scoped tests:

from abc import ABC, abstractmethod
from typing import List


class Datasource(ABC):
    @abstractmethod
    def prices(self) -> List[float]:
        pass


class PriceAnalyticsEvolvable:
    def __init__(self, datasource: Datasource):
        self.datasource = datasource

    def avg_price(self) -> float:
        prices = self.datasource.prices()
        return sum(prices) / float(len(prices))

Using an interface decouples PriceAnalytics from any specific implementation, supporting a reduced test scope:

import unittest
from unittest.mock import MagicMock


class PriceAnalyticsEvolvableTestCase(unittest.TestCase):

    def test_avg_price_success(self):
        ds = MagicMock(
            prices=MagicMock(
                return_value=[1, 1]
            )
        )

        analytics = PriceAnalyticsEvolvable(
            datasource=ds
        )

        self.assertEqual(
            1,
            analytics.avg_price(),
        )

This is really amazing because the test can be reordered to become almost declarative:

class PriceAnalyticsEvolvableTestCase(unittest.TestCase):

    def test_avg_price_success(self):
        self.assertEqual(
            1,
            PriceAnalyticsEvolvable(
                datasource=MagicMock(
                    prices=MagicMock(
                        return_value=[1, 1]
                    )
                )
            ).avg_price()
        )

Since PriceAnalytics and avg_price are now decoupled from any concrete data store (Redis/Memcache), the calculation of avg_price no longer depends on any specific implementation and has a much narrower scope. There is exactly one test providing feedback on one thin slice of functionality. The calculation sits on top of the datasource and can be thought of as decorating a source of prices.

Narrowing Scope

Achieving narrow state/behavior-based tests requires explicit control of first-degree dependencies. Achieving narrowly focused tests that exercise behavior and functionality is a process of software design and structure. This is unfortunate, because software design and structure is a gigantic field and would make for a much longer blog post!

Step 1: Define Scope

What is the functionality/behavior being tested? What business logic? Unless your business is moving data between data stores, behavior is often the logic between data persistence and retrieval. In our case the scope of the test is described by the method name: avg_price. In order to calculate an average we need a collection of 0..n prices; where those prices come from is of no concern to the logic. The only reason a datastore would be considered at all is that the CODE has a structural dependency on one. We could completely reimagine this as a standalone function, and require an adapter to create a connection between it and a datasource:

def avg_price(prices: List[float]) -> float:

I have found that thinking in terms of behavior/functionality makes it easier to identify where business logic can live. In this example the goal is to have the minimal code necessary to execute avg_price. That requires prices and the PriceAnalytics module; nothing else should be required to verify avg_price.
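
If avg_price were extracted as a standalone function like the signature above suggests, the test would need no doubles and no patching at all. A minimal sketch:

import unittest
from typing import List


def avg_price(prices: List[float]) -> float:
    return sum(prices) / float(len(prices))


class AvgPriceTestCase(unittest.TestCase):

    def test_avg_price_success(self):
        # Pure input -> output feedback; no datastore in sight.
        self.assertEqual(1, avg_price([1, 1]))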

Another helpful approach to defining scope is to use the constraints listed in the definition of “Small Tests” above, namely “No Network Access” and “No External Systems”. Under these constraints, a test that mentions “Redis” is usually a smell, and a candidate to be refactored to focus on testing behavior/functionality.

Step 2: Abstract From Implementation

The next step is to separate the functionality/behavior from any implementation. This decoupling is most commonly performed through abstraction. By using an interface, PriceAnalytics is decoupled from any concrete implementation, meaning it can operate on ANY implementation that fulfills the Datasource interface: Redis, Memcache, or even our test stub!

Redis can now change independently of PriceAnalytics and of the unit test. Memcache can be added, tested, and verified independently of the unit test and of PriceAnalytics. At some point the interaction between PriceAnalytics and a concrete datasource should be verified, but that is out of scope for the unit test.
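
A sketch of what fulfilling the interface might look like, reusing the Datasource ABC defined above (the adapter class names and the bytes-to-float conversion are illustrative assumptions, not code from the original post):

import json
from typing import List

import redis
from pymemcache.client.base import Client


class RedisDatasource(Datasource):
    def __init__(self):
        self._client = redis.StrictRedis()

    def prices(self) -> List[float]:
        # redis-py returns bytes; convert to floats for callers.
        return [float(p) for p in self._client.lrange('prices', 0, -1)]


class MemcacheDatasource(Datasource):
    def __init__(self):
        self._client = Client(('localhost', 11211))

    def prices(self) -> List[float]:
        return json.loads(self._client.get('prices'))

Each adapter can now be verified against its real backend independently, without touching the avg_price unit test.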

Step 3: Explicitly Control Dependencies

The final step is to enable explicit configuration of all dependencies. The most common way to achieve this is dependency injection: shuffling the implementation configuration around so that the caller can configure the analytics with any of the implementations above. The broad unit test operated on code with a tight coupling to Redis:

class PriceAnalytics_Redis:
    def __init__(self):
        self.redis = Redis()

In order to instantiate PriceAnalytics, Redis has to be instantiated, creating a very tight coupling. It just happens that Python has facilities to dynamically substitute its own definition of what Redis is, which the original test accomplishes using patch:

@patch('testingtutorials.behaviortesting.prices.Redis')
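
For readers unfamiliar with the mechanism: patch rebinds a name in the module where it is looked up, for the duration of the test. A minimal self-contained sketch using os.getcwd (unrelated to the price example, purely to show the mechanism):

import unittest
from unittest.mock import patch


class PatchDemoTestCase(unittest.TestCase):

    # patch swaps os.getcwd for a MagicMock while the test runs; this
    # is the same trick the original test plays on the Redis class.
    @patch('os.getcwd')
    def test_patch_rebinds_name(self, mock_getcwd):
        import os
        mock_getcwd.return_value = '/tmp'
        self.assertEqual('/tmp', os.getcwd())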

Most of the complexity of the test stems from this interaction (see the first dependency graph above). Our code provides far too much value to deserve being tricked at runtime; instead, we hand it its dependencies explicitly through dependency injection:

class PriceAnalyticsEvolvable:
    def __init__(self, datasource: Datasource):
        self.datasource = datasource

This makes PriceAnalytics flexible and configurable by its clients, and therefore much more powerful. Since the test operates as a client, it is free to instantiate PriceAnalytics with its own datasource implementation.

Restructuring these dependencies turns what was an implementation detail into an explicit configuration detail.
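
With dependency injection in place, choosing a datasource becomes a wiring decision made by the caller. A sketch using the hypothetical adapters from Step 2:

# Production wiring: the caller picks the concrete datasource.
analytics = PriceAnalyticsEvolvable(datasource=RedisDatasource())

# Migrating to Memcache is a one-line configuration change;
# the avg_price unit test never has to be touched.
analytics = PriceAnalyticsEvolvable(datasource=MemcacheDatasource())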

Conclusion

Valuable unit testing must be designed in; it’s a product of code structure, not something added around it. Choosing an appropriate scope for unit tests, and minimizing the number of components necessary to verify that scope, can provide huge benefits over the life of a product in terms of bug reduction and refactoring velocity. A suboptimal scope can cost a huge amount of time through false positives and unnecessary updates, and can inhibit the ability to refactor. Favoring fewer unit test dependencies, less dynamic patching, and fewer assertions usually pays off compared to the alternative.
