Deep Purple: Typeform’s production smoke testing

Toni Feliu
Typeform's Engineering Blog
8 min read · Jun 26, 2019

--

By Paulina Górecka and Toni Feliu

If only automated tests could cover all the possible issues! But even then, automated tests would not run under production traffic, which makes the system behave differently than it does in pre-production environments.

That’s why we decided to start testing…in production. This is how we built our solution and how it works.

Always test, please

New features require development, and development should always include tests. Code without tests becomes a hypothesis, which means it’s missing the proof that the code actually works.

Still, fully tested code may behave unexpectedly in production. Moreover, moving to a microservices infrastructure, as we did at Typeform, increases the complexity of communication between services. We can’t just rely on system error rates because they are relative to the general traffic.

Does that mean that we should only test in production? No. And to be clear, we should test our features before AND after the code is released. That means implementing both the test pyramid during development and tests in production (which is what this article is about).

Know your system is down before your customers do

The Customer Success team and their user tickets are a great source of awareness about production incidents. However, the existence of those incidents means that we discovered them too late. They increase the number of support tickets, trigger unexpected critical alerts, and create extra engineering work on functionality we considered complete. On top of that, production incidents can be difficult to reproduce and debug.

Accepting that reality allows us to start asking the right questions, the first of which is, “How often is our platform down?”

Critical user actions

The first step in getting some answers is to define what “down” means for us.

Not all issues are equally important. For example, users not being able to edit their company names may be annoying, but users not being able to log in is a much bigger problem.

That leads us to a cool exercise. Reversing the “What does ‘down’ mean?” question produces a new question that sounds even more frightening: how often does our platform work?

To define what it means to be “up” for Typeform, we joined forces with our design and product teams to analyze our user journey. Our priorities are our users’ needs and experiences. We want to make sure that our product actually allows users to use our services in the production environment. By considering what makes up a Typeform user’s critical path, we identified 10 breaking points on the user journey, including visiting our homepage, logging in, submitting a form, and reading form answers. These are our 10 critical user actions.

In other words, our system is really “up” when all 10 critical user actions are happening. The critical user actions are the requirements for our production test suite, which now runs periodically and provides data that let us observe system reliability, so we are the first to know when we have a problem in production.

The next sections explain how we implemented this test suite.

Introducing deep-purple 🎉

Because we want to look at our system through the users’ eyes, the most important part of the implementation is staying as close to real human-computer interaction as possible.

To automate this behavior scenario, we use Puppeteer to interact with the Typeform UI, the Performance.timing API to evaluate the results, and Jest to manage the test suite. This setup allows us to recreate critical user actions precisely and catch failures based on page content loading duration. We call the setup deep-purple because it allows us to detect smoke on the water. 😎

Once ready, deep-purple runs regular checks to visualize Typeform’s health. A Jenkins machine executes the test suite every minute. Results are sent to a Datadog dashboard, and in case of failure, alerts are sent to the Slack channels for the teams responsible for the affected functionalities.

Knowing about failures is the first step, but it’s vital to be able to recover quickly by understanding the cause of the issue. To support this investigation, we capture screenshots, network HAR files, and console output. For deeper insights, we add the Jenkins build number to the User-Agent request header so the system logs can be traced in Kibana.
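To illustrate the idea, here is a minimal sketch of how such debugging artifacts could be collected around a Puppeteer session; the function name, file paths, and the use of the community puppeteer-har package are assumptions for illustration, not our actual setup.

```javascript
// Illustrative sketch: collect debugging artifacts around a Puppeteer test run
const puppeteer = require('puppeteer');
const PuppeteerHar = require('puppeteer-har'); // community package for HAR capture (assumption)

async function runWithArtifacts(buildNumber, testBody) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Append the Jenkins build number to the User-Agent so server logs can be traced in Kibana
  const defaultUserAgent = await browser.userAgent();
  await page.setUserAgent(`${defaultUserAgent} deep-purple/${buildNumber}`);

  // Record console output and network traffic for later debugging
  const consoleLines = [];
  page.on('console', (msg) => consoleLines.push(msg.text()));
  const har = new PuppeteerHar(page);
  await har.start({ path: `artifacts/${buildNumber}.har` });

  try {
    await testBody(page);
  } catch (error) {
    // On failure, keep a screenshot next to the HAR file and console output
    await page.screenshot({ path: `artifacts/${buildNumber}-failure.png`, fullPage: true });
    throw error;
  } finally {
    await har.stop();
    console.log(consoleLines.join('\n'));
    await browser.close();
  }
}

module.exports = runWithArtifacts;
```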

Hands on: Production smoke testing 🌪

According to Wikipedia:
“Smoke tests […] cover the most important functionality of a component or system, used to aid assessment of whether main functions of the software appear to work correctly.”

Our goal is to know whether main functionalities work in production, so we call our solution Production smoke testing.

Continuing with Wikipedia’s definition, “…smoke testing…is preliminary testing to reveal simple failures severe enough to, for example, reject a prospective software release.” Combining that concept of simple failures with our critical user actions list, we focus only on two kinds of tests:

  • Page X should load
  • Page X should be “interactable”

NOTE: Instead of testing a Typeform page, the following examples use Google.com for illustrative purposes. Please check the whole solution here.

Check that a page loads 🔍

This is how the load page test looks:
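The sketch below captures its shape; the file name, the selector, and the GooglePage page object are illustrative assumptions rather than the actual Typeform code.

```javascript
// test/google_smoke.test.js — a minimal sketch
const puppeteer = require('puppeteer');
const GooglePage = require('../pages/google_page');

jest.setTimeout(30000); // give the browser time to start

describe('Google smoke test', () => {
  let browser;
  let page;

  beforeAll(async () => {
    browser = await puppeteer.launch();
    page = await browser.newPage();
  });

  afterAll(async () => {
    await browser.close();
  });

  it('loads google page', async () => {
    const googlePage = new GooglePage(page, { timeout: 5000 });
    await googlePage.goto();
    // Fails if the page is not considered loaded within 5 seconds
    return googlePage.waitForPageLoaded();
  });
});
```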

Several technologies are used here. To start, tests are defined using Jest, which allows us to describe a Google smoke test. The answer to the generic question, “What does it do?” is also in the code: it loads google page. In other words, if Google.com’s main page cannot load, the test will fail. Simple enough, right?

To detail it a bit more, the test returns the result of going to Google.com and waiting for it to load. If the page loads within 5000 ms (that is, 5 seconds), the test will pass.

To go even deeper, Google.com has been implemented following the PageObject pattern, which offers a single reusable way of performing page actions. For instance, the goto function is defined in the pages/page_object base class:
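A minimal sketch of what that base class could look like (names are illustrative):

```javascript
// pages/page_object.js — a minimal sketch of the base class
class PageObject {
  constructor(page, options = {}) {
    this.page = page;
    this.options = { timeout: 5000, ...options };
  }

  async goto() {
    // this.url is provided by each child page object
    await this.page.goto(this.url, { waitUntil: 'load' });
  }
}

module.exports = PageObject;
```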

And waitForPageLoaded is defined in the child class pages/google_page:
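Again as an illustrative sketch (the selector and URL stand in for the real thing):

```javascript
// pages/google_page.js — a minimal sketch of the child class
const PageObject = require('./page_object');

// CSS selector for the Google search box
const SEARCH_BOX = 'input[name=q]';

class GooglePage extends PageObject {
  get url() {
    return 'https://www.google.com';
  }

  async waitForPageLoaded() {
    // The page counts as loaded once the search box exists in the DOM
    return this.page.waitForSelector(SEARCH_BOX, { timeout: this.options.timeout });
  }
}

module.exports = GooglePage;
```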

Thanks to Puppeteer’s native function waitForSelector, we can wait until the searchbox CSS selector exists in the page. This tells us that the Google page loaded before the timeout value (passed in through the test options object) was exceeded.

Let’s check the result of that test execution!

$ yarn jest google
PASS test/google_smoke.test.js
  console.log test/google_metrics.test.js:12
    google page took 0.613s to load

So far, so good. Now, we still want to check that our test will fail in case the Google page cannot load. A simple way to achieve that (have you ever seen Google NOT load?) would be to temporarily change the CSS locator, making it impossible for the test to find the search box.

Here is the result:

$ yarn jest google
FAIL test/google_smoke.test.js (17.035s)
  ● Google smoke test › loads google page
    TimeoutError: waiting for selector "input[name=qxxxx]" failed: timeout 5000ms exceeded

As expected, the test fails and shows the reason why, which is that the search box cannot be found. When the test is passing, it offers information about how long it took for Puppeteer to load the Google page:

google page took 0.613s to load

However, we should treat that information as a bonus, because the smoke test will always pass as long as the page does not time out.
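For reference, a load duration like the one above could be computed from the browser’s Performance timing API through Puppeteer’s page.evaluate; the following is a minimal sketch with an illustrative function name, not the actual measurement code.

```javascript
// Illustrative sketch: measure page load duration via the Performance timing API
async function logLoadTime(page, label) {
  const timing = await page.evaluate(() => {
    const { navigationStart, loadEventEnd } = window.performance.timing;
    return { navigationStart, loadEventEnd };
  });

  const seconds = (timing.loadEventEnd - timing.navigationStart) / 1000;
  console.log(`${label} took ${seconds}s to load`);
  return seconds;
}
```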

Check that a page is “interactable” ☝️

Besides loading, we want to test whether our page is ready to interact with the user. To answer that question, we can just perform some action on the page and then check that the page responds accordingly.

What is the first interaction users typically do on Google.com?

They typically search. More specifically, they insert text inside the searchbox.

What do users see when they do that, even before submitting their search?

They see search suggestions:

[Image: suggestions appearing while typing in the Google search box]

That looks like good proof that Google.com is interacting with the user.

Now, how could we implement that check? First, we declare a new test case:
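A sketch of that test case, reusing the GooglePage object and setup from the earlier example (again illustrative):

```javascript
// Added to test/google_smoke.test.js — a minimal sketch
it('interacts with the user', async () => {
  const googlePage = new GooglePage(page, { timeout: 5000 });
  await googlePage.goto();
  await googlePage.waitForPageLoaded();
  // Typing into the search box should make suggestions appear
  return googlePage.suggest('deep purple');
});
```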

That test delegates all the page interactions to the suggest function in the Google page object:
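Continuing the GooglePage sketch, suggest and its helper waitForSuggestion could look roughly like this; the suggestion selector is an assumption for illustration only.

```javascript
// pages/google_page.js — continuing the earlier sketch
// Selector for the first suggestion entry (an assumption for illustration)
const FIRST_SUGGESTION = 'ul[role=listbox] li';

class GooglePage extends PageObject {
  // ...url and waitForPageLoaded as shown earlier...

  async suggest(text) {
    // Type into the search box, then wait until a suggestion shows up
    await this.page.type(SEARCH_BOX, text);
    return this.waitForSuggestion();
  }

  async waitForSuggestion() {
    return this.page.waitForFunction(
      (selector) => {
        const suggestion = document.querySelector(selector);
        // Two conditions: the element exists and its text is not empty
        return suggestion !== null && suggestion.textContent.trim().length > 0;
      },
      { timeout: this.options.timeout },
      FIRST_SUGGESTION
    );
  }
}
```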

suggest works as follows: it types some text into the searchbox and waits for suggestions to appear. That last piece is done by waitForSuggestion, which uses Puppeteer’s waitForFunction, passing an inline function as an argument.

That inline function checks two conditions: the first suggestion element exists on the page, and the text of that element is not empty. If Google.com does not send any suggestions, waitForFunction will fail after the timeout, and that’s precisely what we want to test: whether the page interacts with the user.

Facing challenges 🤹‍♂️

deep-purple has been running for 6 months now. During that period, we faced a couple of challenges that we are happy to share.

Challenge #1: Data pollution

At Typeform we try to understand the behaviors and needs of our users by collecting and analyzing data that feed business metrics, and those business metrics in turn guide company development.

Our tests run every minute. That means 1440 times a day or more than half a million times per year. In other words, we are polluting real users’ tracking data. Keeping everyone aware of that fact is not always straightforward.

Luckily, our data team manages this issue through collaboration. They make sure to separate the data deep-purple generates from real users’ data to keep business metrics clean. Yay!

Challenge #2: Bypassing security

Typeform added CAPTCHA checks on user login. That was a very wise move from a security point of view, and very troubling for deep-purple at the same time.

CAPTCHA functionality is designed to prevent automated logins, which is exactly what our deep-purple tests need to do. That means that once CAPTCHA is enabled in production, deep-purple will start failing…forever.

We solve this issue with communication. Again! 😃

We collaborated with the user accounts and security teams, explained the situation, and came up with a solution that excluded only deep-purple users from the CAPTCHA policy without compromising security.

Challenge #3: Metrics don’t “perform”

It is not easy to describe production conditions. There is a lot of randomness in both user actions and system performance. When we go out into the wild, we lose control over our scenarios. We can observe what’s going on in production, but it is really difficult to look at things in detail. That makes it hard to test against numbers.

The metrics we get from these tests are synthetic: the tests run on a fast machine that lives geographically close to our production servers.

We cannot draw conclusions from the numbers themselves. They only make sense when we consider them as a trend. It is important to stress that our numbers are not performance metrics. Treating them as performance metrics could be very misleading and cause a lot of confusion.

Conclusion

Testing in production provides data that allow us to observe unpredictable errors and discover patterns that are not obvious without production traffic.

At Typeform, our deep-purple tool helps us know how often our system works and alerts us about failures before our customers do.

That does not mean that we abandoned regular software testing before code releases. On the contrary: we want both!
