Shaping Experimentation Culture on Engineering Teams

Navneet Rao
Thumbtack Engineering
Apr 7, 2023

In the past year, I had the opportunity to chat with research scientists, machine learning engineers, and technology leaders at companies of different sizes. One of the topics that caught my attention was the culture of experimentation within their teams. I found that experimentation culture is heavily influenced by the underlying engineering culture and the maturity of experimentation frameworks within their respective organizations. A recurring challenge was the lack of a systematic culture of experimentation, with teams often investing in one-off ideas that sometimes worked and sometimes didn’t.

Thumbtack is a platform where millions of customers seek help from local businesses and professionals to care for their homes. We offer access to professionals across 500 different categories of services through our website and customer-centric iOS/Android apps.

For Thumbtack’s marketplace, my team focuses on improving how customers can discover and book local businesses using data science and machine learning. We are a cross-functional team consisting of engineers, applied scientists, and data scientists who heavily rely on experimentation. In 2022, we conducted around 25–30 experiments across different problem spaces, and over the past two years, we have fostered a culture of experimentation that has enabled us to consistently drive business impact, particularly in areas such as search experience and ranking. In this blog, we will be sharing some of our experiences in shaping our team’s experimentation culture and what we have learned along the way.

Introduction

When it comes to experimentation, most teams fall into one of two categories: those who are able to continuously run A/B tests in production, and those who wish they could but, for whatever reason, cannot.

As a data-driven technology company, Thumbtack places a strong emphasis on improving the experience for our customers and professionals through controlled online experimentation (e.g., A/B tests). Teams start by formulating a hypothesis, conducting research to validate their assumptions, and building the changes needed to test the hypothesis on a subset of users, then run experiments to confirm or reject it. At any given time, we may have numerous experiments in progress across multiple teams and user experiences.
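To make the "test on a subset of users" step concrete, one common approach is deterministic hash-based bucketing, where a user is always assigned the same variant for a given experiment. The sketch below is a minimal, generic illustration and does not reflect Thumbtack's internal experimentation framework; the function and experiment names are made up.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing the user id together with the experiment name keeps assignments
    stable across sessions while staying independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_pct else "control"

# Example: expose 10% of users to a hypothetical new ranking model.
print(assign_variant("user_42", "ranking_model_v2", treatment_pct=0.10))
```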

There is a failure to understand that you can actually run an organization thinking like a scientist. And by that I mean, just recognizing that every opinion you hold at work is a hypothesis waiting to be tested. And every decision you make is an experiment waiting to be run.

— Adam Grant, in an article from The Wall Street Journal.

In 2020, my team was equipped with a data-driven engineering culture and the ability to conduct A/B tests across various user experiences. However, like many other teams, we sometimes made opportunistic, one-off investments in interesting ideas. When these ideas failed to produce the desired results, we would move on to the next one. It became apparent that simply having a framework for A/B testing was not enough. We needed to establish a culture of systematic experimentation.

The remainder of this blog post will delve into the steps we took to create this culture, as well as the lessons we learned along the way. We assume that our readers are already familiar with the basic concepts of why and how companies conduct A/B tests. If you are new to online experimentation or are hoping to learn how to establish a foundation for controlled online experimentation within your team or organization, we recommend reading the book “Trustworthy Online Controlled Experiments” [1].

Shaping Experimentation Culture

1. Take a longer term view

If you’re building a product experience for the very first time, you’ll likely take a milestone-based, incremental approach to building it. But once a product experience is built, teams are expected to refine and optimize it via controlled experimentation, identifying key customer problems and trying out various solutions. Some customer problems have clear solutions. For problems tied to broader user or business metrics, the solution often takes the shape of an ongoing investment in the problem space.

Here’s an example of this:

At Thumbtack, our platform enables customers to find and book local businesses for a wide range of services, such as house cleaning. When a customer searches for house cleaners in their neighborhood, we provide them with a list of local businesses that meet their needs. The order in which these search results appear is determined by machine learning algorithms that take into account factors like the customer’s location and search criteria. To help customers make informed decisions, we also provide helpful information such as the average customer rating for each business. Once a customer has selected a professional, they can provide additional details about their house cleaning job, such as the number of bedrooms in their home, and then book a professional. Helping customers book professionals is a user problem that directly relates to business metrics like the number of projects being done on our platform.

Fig 1. Customer searching for house cleaners in their neighborhood

Small changes to the user interface (e.g. changing what information we show about a business), or small changes to the order of the search results (e.g. using a new machine learning model to rank search results), can have a significant impact on how easily customers can find the right pro for their job.
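As a rough illustration of how a small model change can reorder the same results, here is a hedged sketch with made-up businesses, features, and weights. It is not Thumbtack's production ranking model; it only shows that adding one feature to a scorer can change which pro appears first.

```python
from typing import Callable

# Illustrative candidate businesses with toy features.
candidates = [
    {"name": "Sparkle Cleaners", "avg_rating": 4.9, "distance_km": 6.0, "num_reviews": 12},
    {"name": "Tidy Home Co", "avg_rating": 4.6, "distance_km": 2.0, "num_reviews": 240},
    {"name": "Fresh Start LLC", "avg_rating": 4.8, "distance_km": 4.5, "num_reviews": 85},
]

def rank(scorer: Callable[[dict], float]) -> list:
    """Return business names ordered by descending model score."""
    return [c["name"] for c in sorted(candidates, key=scorer, reverse=True)]

def baseline(c: dict) -> float:
    # Rating-heavy linear score.
    return 2.0 * c["avg_rating"] - 0.1 * c["distance_km"]

def candidate_model(c: dict) -> float:
    # Same score plus a small weight on review volume (a "new feature").
    return baseline(c) + 0.005 * c["num_reviews"]

print(rank(baseline))         # ['Sparkle Cleaners', 'Fresh Start LLC', 'Tidy Home Co']
print(rank(candidate_model))  # ['Tidy Home Co', 'Fresh Start LLC', 'Sparkle Cleaners']
```

An A/B test is what tells us whether the reordering produced by the new scorer actually helps customers find the right pro.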

When you approach a problem like this expecting to quickly identify and fix the biggest customer problems, you often set yourself up for failure. Changes such as adding more features to your ML model or adjusting the user interface sometimes work and sometimes don’t.

Given there will always be room to improve the product experience on a critical business problem, shift your perspective: view key product surfaces & core optimization challenges as an opportunity to test 5–20 hypotheses over the next 1–2 years. On our team, we created an explicit roadmap that systematically explored various dimensions of the problem space, e.g. different types of features or different types of algorithms. Each experiment we ran had the explicit goal of furthering our understanding of that problem space as we explored the impact of different changes through A/B testing, as illustrated below:

Fig 2. Hypothesis Testing

This allowed us to think about success & failure over a longer time horizon than the immediate outcome at hand.
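For readers who want a concrete picture of what evaluating one of these hypotheses can look like, below is a minimal two-proportion z-test on a conversion-style metric. This is a generic sketch with illustrative numbers, not Thumbtack's analysis tooling; a real analysis would also cover guardrail metrics, customer segments, and pre-registered decision criteria.

```python
from math import sqrt
from statistics import NormalDist

def ab_test_conversion(control_conv: int, control_n: int,
                       treatment_conv: int, treatment_n: int) -> dict:
    """Two-proportion z-test comparing treatment vs. control conversion rates."""
    p_c = control_conv / control_n
    p_t = treatment_conv / treatment_n
    p_pool = (control_conv + treatment_conv) / (control_n + treatment_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return {"lift": p_t - p_c, "z": z, "p_value": p_value}

# Illustrative numbers only: 12,000 users per arm.
print(ab_test_conversion(control_conv=1440, control_n=12000,
                         treatment_conv=1560, treatment_n=12000))
```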

2. Don’t shy away from larger investments

Let’s say you’re trying to teach a monkey how to recite Shakespeare while on a pedestal. How should you allocate your time and money between training the monkey and building the pedestal?

— Astro Teller, CEO of X, the moonshot factory [2]

In his blog, Tackle The Monkey First [2], Astro Teller argues that even though the core problem in the above scenario is based on the success or failure of training the monkey, some people are biased towards building the pedestal since to them it represents a quick win.

In 2020, we were using a simple ranker consisting of an ensemble of logistic regression models for search ranking. Our experiments testing new features had stopped demonstrating wins. Offline testing with tree-based nonlinear models had shown some promise, but online testing meant serving complex nonlinear models at scale in production, which required larger investments in ML infrastructure.
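To give a flavor of the kind of offline comparison that motivated this investment, the sketch below contrasts a logistic regression baseline with a gradient-boosted tree model on synthetic data using scikit-learn. It is a hedged, self-contained illustration, not our production training pipeline, and the synthetic features are only a stand-in for real (query, business) feature vectors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (query, business) feature vectors and click/book labels.
X, y = make_classification(n_samples=20_000, n_features=30, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)
trees = GradientBoostingClassifier().fit(X_train, y_train)

for name, model in [("logistic regression", linear), ("gradient-boosted trees", trees)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: offline AUC = {auc:.3f}")
```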

We wanted to be able to easily use and maintain near-state-of-the-art neural networks to model the complex underlying interactions. Aiming for quick wins and incrementally iterating towards various goals is so ingrained in engineering teams that we sometimes bias away from larger investments.

By investing in ML infrastructure for serving complex nonlinear models at scale, first for tree-based models in 2021 & eventually for deep learning models in 2022, we unlocked the ability to model complex interactions in a much richer feature space. This led to many experiment wins over the last two years, driving significant business outcomes. We tend to overestimate what can be achieved in the short term but sometimes underestimate what can be achieved in the longer term.

3. Build internal & external alignment

As you might expect, aligning expectations with leadership was a crucial first step. At Thumbtack, we use OKRs to define goals for teams in conjunction with leadership. By sharing how long-term investments would create capability unlocks for the business, we aligned on the need to take a longer-term, roadmap-based view of our experiments. We sequenced each milestone in new experiment capabilities (e.g. creation of a feature store, an inference service to serve nonlinear ML models) with a series of experiments that leveraged those capabilities (e.g. addition of many new features, experiments with different nonlinear models) to demonstrate the value to the business.

Internally, within the team, we shared the vision for how we wanted to evolve our experimentation capabilities and how each person could contribute to that shared vision. We then created cross-functional working groups to execute against the longer-term roadmap of experiments, so they could build deeper expertise across the problem space.

When you launch experiments as A/B tests, it is quite common for a majority of them to fail. It’s essential to internalize that failure is expected and that systematic exploration of various hypotheses is necessary to achieve success in experimentation. Thus, we also built alignment around outcomes. We set expectations around launching a certain number of experiments every six months, with a clear understanding that a majority will likely fail, and that this is okay because we learn from each successive experiment.

4. Enable experiment velocity

As mentioned in the previous section, it is crucial to internalize that a majority of experiments will likely fail. On the flip side, this means that to drive success with experimentation we need to increase experiment velocity. Removing bottlenecks can speed up end-to-end experiment times, but the bottlenecks themselves are unique to your team.

Below we illustrate how we dealt with two major bottlenecks in our effort to increase experiment velocity on ranking:

  • Experiment launches required significant effort
  • Resource constraints on experiment analysis delayed ship decisions

In some cases, the time taken to launch an experiment for long-term problems exceeded the actual effort required to make the product change (e.g. testing a new feature). To us, this indicated automation was necessary. We streamlined our A/B testing process and simplified feature creation, reducing the time taken to launch experiments. More information on these investments can be found in our blog post on search ranking [3].

Another bottleneck was that our ML engineers and applied scientists needed to wait for our data scientists to analyze the A/B test data, determine the effect on different customer segments, and then provide a recommendation on whether we could ship the change. This waiting period could range from a few days to a couple of weeks, a significant delay in the decision-making process. To address this, we created a plug-and-play analytics dashboard (for long-term problems like ranking) that allowed non-data-scientists to input details like the experiment name and dates and generate the experiment analysis. This enabled engineers and applied scientists to take ownership of post-experiment analysis and speed up ship decisions.
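The sketch below captures the spirit of that self-serve analysis: a helper that takes an experiment name and date range, filters assignment and outcome data, and reports per-segment conversion lift. The function, table schema, and column names are hypothetical and do not describe Thumbtack's internal dashboard.

```python
import pandas as pd

def analyze_experiment(events: pd.DataFrame, experiment: str,
                       start: str, end: str, segment_col: str = "platform") -> pd.DataFrame:
    """Summarize per-segment conversion lift for one experiment.

    `events` is assumed to have one row per user with columns:
    experiment, date, variant ('control'/'treatment'), converted (0/1),
    plus a segment column such as platform.
    """
    window = events[(events["experiment"] == experiment)
                    & (events["date"].between(start, end))]
    rates = (window.groupby([segment_col, "variant"])["converted"]
                   .mean()
                   .unstack("variant"))
    rates["abs_lift"] = rates["treatment"] - rates["control"]
    rates["rel_lift_pct"] = 100 * rates["abs_lift"] / rates["control"]
    return rates

# Hypothetical usage with an events table loaded elsewhere:
# print(analyze_experiment(events, "ranking_model_v2", "2022-03-01", "2022-03-14"))
```

Wrapping a helper like this behind a dashboard is what let non-data-scientists run the standard analysis on their own, while data scientists stayed involved for deeper or ambiguous results.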

Both these changes had a significant impact on our experiment velocity, reducing the end-to-end experiment time and enabling us to launch more experiment variations.

Conclusion

Shaping a culture of experimentation is indeed an ongoing journey that requires reinvention. It’s crucial to remember that it cannot be a one-size-fits-all approach. We hope these insights from an experiment-heavy team can inspire other engineers, applied scientists, and technology leaders in the industry to shape their unique experimentation culture.

Interested in staying connected with Thumbtack? Follow us on LinkedIn or check out our latest openings at thumbtack.com/careers.

Acknowledgement

I’d like to thank the following folks for their input on how we have shaped the experimentation culture on our engineering team: Vishrut Arya, Richard Demsyn-Jones, Brandon Sislow, Tom Shull.

References

[1] Kohavi, Ron, Diane Tang, and Ya Xu. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, 2020.

[2] Teller, Astro. Tackle The Monkey First, X, the moonshot factory blog, 2016.

[3] Rao, Navneet. Evolution of Search Ranking at Thumbtack, Thumbtack Engineering Medium Publication, 2022.
