Quantifying our testing pyramid

Ryan Kass
Vimeo Engineering Blog
Dec 17, 2019

You’ve heard of the testing pyramid, right? The idea, introduced by Mike Cohn, is that the cheaper a class of test is, the more that test should be represented in your codebase.

On this pyramid we see clear boundaries between unit tests, service tests, and UI tests, but in reality the number and composition of these tiers depends on the mission of your team and the organization you’re working within.

In this post I’ll talk about what our pyramid looks like on the Vimeo Search and Recommendations team and how it came to be. While some of our testing problems are specific to a search team, most of the decisions we face are borne out of being a small team writing microservices that fit into the architecture of a large software product.

Building our testing pyramid

Our testing pyramid looks like this:

The first thing that might stand out is that each tier is defined by the number of containers required to execute the test suite. This makes the base tier unit tests, since they only test the functionality of one specific component of our infrastructure. As we go up in the pyramid, tests become more integrated — focused on how each component fits together rather than the logic of a given component.

Now that we have a high-level view of our pyramid, let’s dive into each level to understand what kinds of tests live there and our motivations for writing them.

But first

Have a look at this (really high-level, oversimplified) architecture diagram:

Whenever someone uploads a video (or anything else) to vimeo.com, our Indexation Queue Worker detects it and prepares the corresponding video document for indexation into Elasticsearch. Then, when a visitor performs a search, vimeo.com calls out to our search API, which formulates a query to issue to Elasticsearch to retrieve the relevant documents for that query.
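The indexation half of that flow can be sketched as follows. This is an illustrative stand-in, not Vimeo’s actual code: the names (`VideoEvent`, `build_document`, `InMemoryIndex`) are hypothetical, and the in-memory index substitutes for the real Elasticsearch cluster.

```python
from dataclasses import dataclass

@dataclass
class VideoEvent:
    """A hypothetical upload event pulled off the indexation queue."""
    video_id: int
    title: str
    skills: list

def build_document(event: VideoEvent) -> dict:
    """Prepare the Elasticsearch document for an uploaded video."""
    return {
        "_id": event.video_id,
        "title": event.title,
        "profile_skills": [{"id": s} for s in event.skills],
    }

class InMemoryIndex:
    """Stand-in for the Elasticsearch cluster."""
    def __init__(self):
        self.docs = {}

    def index(self, doc: dict) -> None:
        self.docs[doc["_id"]] = doc

def run_worker(queue, index: InMemoryIndex) -> None:
    """The worker loop: drain the queue, index each prepared document."""
    for event in queue:
        index.index(build_document(event))
```

With this shape, the search API’s job reduces to translating a user’s query into an Elasticsearch request over the documents the worker has indexed.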

As we go through the testing pyramid, keep this information in mind.

Building our pyramid: Tier 1

The most interesting test variation on this tier is our Elasticsearch query validation strategy. Much of the job of the search API is formulating the query to issue to Elasticsearch. For this, we use a PHP templating engine called Twig. To demonstrate how we use and test with Twig, included below is sample code used in our marketplace search. Vimeo’s marketplace connects video creators available to hire with people who are seeking to hire video creators. The skills filter enables them to narrow down their search to creators with a certain skill (like acting or editing).

{
  "bool": {
    {% if request.query.get('skills') | length > 0 %}
    "should": [
      {% for value in request.query.get('skills') %}
      {
        "match": {
          "profile_skills.id": {{ value }}
        }
      }{% if not loop.last %},{% endif %}
      {% endfor %}
    ],
    "minimum_should_match": 1
    {% endif %}
  }
}
To test this skills filter, we populate a request object, render the Twig template, and assert that the output contains the skills block that we expect.

$request = self::request(['skills' => '12,3']);
$template = $this->twig->load('marketplace/index.twig');
$output_string = $template->render(['request' => $request]);
$this->assertRenderedOutputExpected(self::EXPECTED, $output_string);

Using a templating language like Twig enables us to keep our code modularized in a way that makes adding tests for new queries effortless, which in turn encourages us to have a large number of tests at the base of our pyramid.
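For readers more at home in Python, the same render-and-assert pattern can be sketched with Jinja2, the engine whose syntax Twig closely follows. The template string and helper below are illustrative simplifications of the skills filter, not our production code:

```python
import json
from jinja2 import Environment

# Simplified analogue of the Twig skills filter above; `skills` is passed
# directly instead of through a request object.
TEMPLATE = """
{
  "bool": {
    {% if skills | length > 0 %}
    "should": [
      {% for value in skills %}
      { "match": { "profile_skills.id": {{ value }} } }{% if not loop.last %},{% endif %}
      {% endfor %}
    ],
    "minimum_should_match": 1
    {% endif %}
  }
}
"""

def render_query(skills):
    """Render the query template for a list of skill IDs."""
    return Environment().from_string(TEMPLATE).render(skills=skills)

# The test asserts the rendered output is valid JSON with the expected clauses.
query = json.loads(render_query([12, 3]))
assert query["bool"]["minimum_should_match"] == 1
assert len(query["bool"]["should"]) == 2
```

Because each template renders to a plain string, the tests stay fast and need no running Elasticsearch at all.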

Building our pyramid: Tier 2

Our first level of integration tests verifies that when we do issue a query to Elasticsearch, we receive back the data that we expect. You might notice that, in the query from our Tier 1 test, we specify a minimum_should_match value…it wasn’t always there. A couple of weeks before launch, we noticed that although our query to filter by skills correctly returned results with the specified skillset at the top of the result set, it wouldn’t actually filter out creators without that skillset. Easy enough to fix, but the fact that we didn’t catch this in an automated way didn’t sit right with us, which is why we built filter tests. Filter tests enable us to load a controlled dataset into an Elasticsearch container and, for a given query, define which documents should be excluded and included in the result set.
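The shape of a filter test can be sketched like this. To stay self-contained, the `search` function below emulates locally the behavior that `minimum_should_match: 1` enforces; in the real suite that call is a round trip to a live Elasticsearch container, and all names here are hypothetical:

```python
# A controlled dataset of creator documents.
DATASET = [
    {"id": 1, "profile_skills": [12, 3]},  # has both requested skills
    {"id": 2, "profile_skills": [12]},     # has one requested skill
    {"id": 3, "profile_skills": [7]},      # has no requested skill
]

def search(dataset, skills):
    """Stand-in for the search API: keep creators with at least one
    requested skill (what minimum_should_match: 1 enforces)."""
    return [d["id"] for d in dataset
            if any(s in d["profile_skills"] for s in skills)]

def assert_filter(dataset, skills, included, excluded):
    """Assert exactly which documents a query includes and excludes."""
    results = set(search(dataset, skills))
    assert set(included) <= results, f"missing: {set(included) - results}"
    assert not results & set(excluded), f"leaked: {results & set(excluded)}"

# Creator 3 lacks skills 12 and 3, so it must be filtered out — the exact
# bug the missing minimum_should_match let slip through.
assert_filter(DATASET, skills=[12, 3], included=[1, 2], excluded=[3])
```

Declaring the excluded documents explicitly is the point: a query that merely ranks bad matches lower, rather than dropping them, now fails the suite.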

In the following diagram, notice that three containers get launched as part of this test suite: Elasticsearch, Search API, and Elasticsearch Executor, which initializes our indices:

In this model, our test suite not only runs the test cases and assertions, but it also loads our test data. This tier helps us to avoid the headache of pulling in all of the dependencies that our indexer requires while also enabling us to verify that our search API utilizes the Elasticsearch cluster to query and filter on documents correctly.
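The three containers in a suite like this might be wired together along these lines; the service names, images, and build paths below are assumptions for illustration, not Vimeo’s actual configuration:

```yaml
# Illustrative Docker Compose wiring for the Tier 2 suite.
version: "3"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.0
    environment:
      - discovery.type=single-node
  es-executor:
    build: ./es-executor      # initializes our indices, then exits
    depends_on:
      - elasticsearch
  search-api:
    build: ./search-api
    depends_on:
      - elasticsearch
```

The test suite itself then runs against search-api, loading the controlled dataset and making its inclusion and exclusion assertions.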

Building our pyramid: Tier 3

Too often, we’d push a (well unit-tested) change to our indexer, see the dreaded indexation errors graph spike, and have to roll back. Because our indexer sources data from so many different services within the Vimeo ecosystem, building a meaningful integration test for the indexer wasn’t something our team could do alone. Luckily at Vimeo, our Developer Experience team maintains a dev tool, which is a utility that enables engineers to bring up a suite of services defined in Docker Compose that is a close representation of the entire Vimeo technology stack. These services include MySQL, HAProxy, our queueing infrastructure, caching infrastructure, and many microservices that power Vimeo. Though we paid a big upfront cost in building these tests, having a small but complete top level of our pyramid routinely helps us to catch indexation bugs. When we’re working on a new feature and tests fail at this level, we usually also add a unit test to cover the error. Bugs that are caught only by the top level of the pyramid indicate potentially dangerous edge cases. Explicitly addressing these states early on makes the code more readable and easier to iterate on.

Closing thoughts

The concept of the testing pyramid is a useful framework for trading off among:

  • Achieving maximum confidence when releasing new functionality
  • Isolating tests to just the affected systems
  • Maintaining fast iteration times

Docker and all its surrounding tools have made integration testing significantly easier, but, when working in a large organization with hundreds of microservices, it’s difficult to know which to include in your integration test and which to leave out. It can be tempting to test only at the top level — if we know a feature works end to end, we can be maximally confident in releasing that functionality, right? Well, yes, but say we want to change that functionality — changing a bunch of unit tests is easier than changing a bunch of heavyweight integration tests.

Furthermore, when working in the context of a large engineering organization, tests that are introduced at the top of the pyramid affect the iteration cycles of hundreds of developers. Applying the pyramid philosophy here can help teams find a middle ground by building a few tests that include the maximum number of company-wide services, building a few more that integration-test all of your team’s services, and leaving the rest to unit testing.
