A/B testing at Thumbtack

By: Dima Kamalov

Published in

Thumbtack Engineering

10 min readJun 8, 2018

Thumbtack currently runs about 30 A/B tests per month, ranging in duration from a week to six months. We experiment on virtually every area of our product — customer signup, pro signup, matching algorithm of customers with pros, messaging features, reviews features, SEO traffic, and many more. Our experiment analysis computes 600 different metric definitions defined by both analysts and engineers.

This blog post is in seven parts:

Experiment Design
Experiment Configuration
Experiment Assignment Service
A brief overview of our Data Platform
Experiment Result Computation
Metric Definitions
Experiment Result Visualization Service

Here’s how these parts are connected:

Experiment Design

About half of Thumbtack engineers will own at least one A/B test within their first 6 months. Many of them have not previously designed any A/B tests. As a result, our culture of experimentation starts with education:

We host a one hour workshop on data and experimentation at Thumbtack for all new engineers about once per month. We host additional brownbags on more specific topics about once per year.
We have extensive (about 20k words) internal documentation which we strive to keep relevant and up to date. We’ve explicitly spent two weeks of time in the past two years focusing on organizing and updating our documentation.
For each experiment, we fill out this template to make sure we’re thinking about some common experiment design questions and to keep notes for posterity.
We provide a sample size calculation tool so people think about expected power of their planned tests
There is a group of experiment reviewers who take turn reviewing pending experiments, working especially closely with first time experimenters. This group is composed of a mix of data scientists, analysts, and engineers with substantial A/B testing experience. Some common mistakes caught in review include assigning an experiment too late in the funnel to track a metric, or picking metrics inconsistent with the randomization unit.

While our experiment design is quite custom tailored to product needs, there are some common characteristics:

We typically use an alpha of 0.05 and a beta of 0.2. Because about one in four experiments at Thumbtack leads to a significant improvement, the alpha is representative of our false positive rate (if this was closer to one in 20, we’d need to reevaluate). We have stricter criteria for alpha (false positive) than beta (false negative) primarily because additional features take effort to maintain.
We run two-sided tests except for experiments that only look to prevent regressions
We pre-calculate experiment durations based on a sample size necessary to achieve the desired power.
We generally use seedfinder for experiments on pre-existing users, and have not yet had a need to balance pre-existing metrics for other types of experiments.
We typically assign users in one of about fifteen standardized places to keep track of common metrics across our product, rather than only assigning users exposed to a particular niche feature

Experiment Configuration

We have a separate Git repository for experiment configurations. An experiment is configured with a YAML file template. The configuration is reviewed by the experiment reviewer.

A simple configuration file looks something like this:

name: Get bid email UI change (July 2016)
description: |
  Move "view custom quote" button to the top and rename as "view quote(s)"
  <JIRA link>
  Experiment design doc: <google docs link>
owner: lingduo.kong@thumbtack.com
buckets:
  baseline:
    name: baseline
  button_modified:
    name: Button Modified
metric_set: quotes_view
randomization_unit: user_id

This experiment will be set up randomized on user_id, with users equally assigned to each of the buckets.

We support much more advanced configuration. For instance:

An experiment can be randomized on one of eight different randomization units. Randomization can be set to avoid pre-existing bias using seedfinder
There can be any number of buckets with any traffic allocation; experiments can be set up not to overlap with some specific other experiments.
An experiment can be configured to be exposed only to Thumbtack employees (“dogfooding”)
Experiment assignment and data to count for analysis can be decoupled, for when we need to e.g. have a pre-announcement about an experiment a user is taking part in

There is a build job on the experiment-definitions managed by Jenkins which uploads additions and changes of experiment configurations into the Experiment Assignment Service.

Experiment Assignment Service

Thumbtack is at an adolescent stage of a service oriented architecture — we have a couple dozen microservices alongside a PHP monolith. Experiment Assignment Service (EAS) is a microservice which is called by other services to get and override experiment assignments through the following Thrift definition:

service Assignments {
   void storeAssignment(
        1:TExperimentContext context,
        2:TExperimentAssignment assignment,
        3:TExperimentStoreAssignmentOptions options
   ) throws (
        1:exceptions.ServiceException serviceException,
        2:exceptions.InputException inputException,
   )   TExperimentAssignments getAssignments(
        1:TExperimentContext context,
        2:list<string> experimentIds,
        3:TExperimentAssignmentOptions options
   ) throws (
        1:exceptions.ServiceException serviceException,
        2:exceptions.InputException inputException,
   )    map<string, TExperimentAssignment> getAssignmentsBulk(
        1:set<TExperimentContext> contexts,
        2:string experimentId,
        3:TExperimentAssignmentOptions options
   ) throws (
        1:exceptions.ServiceException serviceException,
        2:exceptions.InputException inputException,
   )
}

TExperimentsContext is information about the experiment participant — e.g. their id, location, and so on.

EAS is a Scala Play service. Its job is to store configuration about an experiment, answer requests for assignment and experiment configurations, and log assignments. Configurations are stored in DynamoDB. Most assignments for currently active experiments are stored in a Postgres database to allow assignment overrides through the storeAssignment function in the code above. Every time an assignment is fetched, we log an event through fluentd so the downstream analysis job knows which users were assigned to which experiments.

Algorithmically EAS draws heavily from this Google presentation (see especially Slide 13) to set up a layering system for controlling experiment overlap. EAS has support for custom seeds to enable seedfinder. We withhold 5% of traffic from all experiments.

A brief overview of our Data Platform

Nate Kupp wrote a fantastic description of our data platform as of 2016. Since then we’ve had been one major infrastructure change: we’ve moved to Google Cloud rather than running our own Hadoop cluster.

Our primary data sources are

Postgres tables dumped into GCS through sqoop
DynamoDB tables dumped into GCS through a custom Spark ETL, and
Event logs transferred into GCS and grouped/enriched with custom Spark ETLs.
A SQL-based ETL driven by our analytics team doing some extraction on sources 1–3 using BigQuery SQL

Experiment metrics tend to primarily rely on sources (3) and (4). (3) is more performant for large volumes of event data, while (4) is cleaner and better organized.

Experiment Result Computation

We process on the order of a few terabytes of data for experiment metric calculations. As a result we can’t completely ignore efficiency. A couple of especially important considerations:

Each experiment has many metrics calculated on it. We should determine which data belongs to each experiment only once rather than e.g. separately for each metric.
Metrics for different experiments are independent of each other, so we should do this computation entirely in parallel.

Determining which data belongs to each experiment

In order to achieve the first goal, we create an ExptRows dataset.

Each ExptRow has:

The randomization unit value and experiment id. These are the key on the dataset — there is one ExptRow for each randomization unit (“participant”) and experiment pair.
Some metadata about assignment, derived from the Experiment Assignment Service event logs. This includes things like when the randomization unit was first assigned to the experiment, when the experiment ended, and what bucket the participant was in.
Arrays of many data source objects — there is one array per event grouping (source (3) mentioned earlier) and one array per table from the SQL-based ETL (source (4)).

This dataset is created by a Spark job written in Scala. This it is currently bottlenecked on reading data from GCS into local RAM — we’ve experienced a throughput on the order of 100MBPS per n1-standard-16 instance using Google’s Cloud Storage Connector, and substantially slower for a few small sources of data we read directly from BigQuery. As a result, we preload data to memory or the workers’ local disk for all experiments.

Once the data is preloaded, we filter it for experiments as necessary. The metric_set field defined in the experiment configuration determines some settings on how to filter the data. For instance, our quotes objects get matched to some experiments using the customer’s user_id and to other experiments using the professional’s user_id. At this stage we also look for timestamp matches — whether a quote happened after a user was assigned to the experiment and before the experiment ended.

Note that a single participant’s data may belong to multiple experiments. Currently the organizational benefit here outweighs the data duplication costs, but we may need to revisit this as we scale.

Metric Computation

We group the ExptRows by experiment, and then we have just a single operation to run on each group of rows for an experiment. This is what Spark truly shines at.

Group by keys are expensive compared to reduce by keys. This is why it was very important that we pre-grouped the ExptRows by randomization unit values earlier — we want to do this grouping only once. We can now take advantage of the grouping: for each Expt Row and for each metric, we compute a statistical object containing values like numerator, denominator, squares, and cross terms. The object can then be reduced across many workers. The code is quite straightforward (though some date, segment, and bucket comparison logic redacted for simplicity):

val metrics = exptRows
  .flatMap { case (exptRow) =>
    metrics.flatMap { case (metric) =>
      generateStatisticsObjects(metric, exptRow) // stats objects keyed by (experiment, metric, bucket)
    }
  }
  .reduceByKey(_ combine _) // combine is a custom-defined aggregator function which mostly just sums various properties
  .mapValues(statisticsObjectToMetric)

We then store these computed metrics into DynamoDB for quick lookup by our visualization tool. We precompute all of our metrics — the visualization tool only looks up the results of these computations. As a result, we get much quicker lookups at the price of more limited customization.

Metric Definitions

A metric definition needs to know how to extract a statistics object out of each participant’s data. This mostly boils down to being able to extract a numerator and a denominator — for any metric for which the mean is distributed normally, we use Welch’s T-Test. (We do support some covariance corrections in our metric definitions, but that is beyond the scope of this post.) Note that a distribution does not need to be normally distributed for its mean to be normally distributed! Despite the very longtailed nature of many Thumbtack metrics, we’ve verified that their means are nevertheless normally distributed: the empirically computed standard error of 10k subsamples means lines up nearly exactly with the theoretical standard error based on bell curve variance.

As a result, a metric definition needs to only define two functions: ExptRow -> Double for numerator, and ExptRow-> Double for denominator. When we had 30 metrics, we just had these as hardcoded Scala functions. As we grew, we abstracted this to a templating language — rather than writing a scala function, we only specify which part of ExptRow to use as a source, and which one or more of a couple dozen pre-existing transformations to apply. Here is what a definition looks like in practice:

case object CustomerContacts3DayPerOnePlusRequest extends MetricDefinition{
 val metricName = "customer contacts (3 day) per 1+ request"
 val numerator = AQuotesSources.CustomerContactsWithin3Days
 val denominator = ARequestsSources.CountOnePlusQuotes
}

With the extractors defined as follows:

val timeToCustomerContact = timeDiff[AnalyticsQuote, Long](_.sent_time, _.first_customer_contact_time)
val customerContactWithin3Days = lessThan[AnalyticsQuote, Long](timeToCustomerContact(_), 86400*1000*3)
val CustomerContactsWithin3Days = countIf[AnalyticsQuote](customerContactWithin3Days(_))

We also have functionality to split a metric by some segments of users. Each experiment’s metric set specifies a set of one or more segment extractors. A segment extractor looks very similar to a metric extractor — it just transforms an ExptRow → String rather than ExptRow → Double.

We are in the process of migrating these from Scala to a UI so they look less scary, but even the current iteration has enabled us to have 600 metrics with a dozen different authors .

Experiment Results Visualization Service

We have a lightweight metric visualization service that reads stored metrics from DynamoDB and displays them. We group experiments by metric set, with a couple key metrics visible on the front page:

Clicking through to each experiment displays more detailed metrics which you can look at various user segments and dates:

The backend of the visualization service is in golang, primarily because we have many other golang services and so the infrastructure was readily available. The backend’s job is to determine which keys to fetch metrics for from DynamoDB, as well as obtain configuration metadata from the Experiment Assignment Service.

The frontend is in Angular, using a somewhat outdated snapshot of our website frontend framework. The charts are made with highcharts. We note a number in green or red when it is up or down and statistically significant, or in grey when it is not significant. The results on the charts are cumulative, so you can see an experiment achieving significant results over time.

Some Upcoming Work You Could Do If You Join Thumbtack

We’re in the progress of building the visualization service into a more fully fledged experiment and metric definition control UI.
We’ve started thinking about measuring and reducing interference between experiment units, particularly through market-based testing. While we’ve manually run a few of these tests, we’d like to scale it with the rest of our infrastructure

Acknowledgements

Our experimentation would not be possible without:

Carolina Galleguillos, Dima Kamalov, Niranjan Shetty, and Yu Guo contributed extensively to the experiment design process.
Andrew Lam is the primary author of the Experiment Assignment Service.
Andreas Sekine, Andrew Lam, Dima Kamalov, Erica Erhardt, Ihor Khavkin, Nate Kupp, Stanley Ku, Venky Iyer and Yuehua Zhang are the primary contributors to the Data Platform
Dima Kamalov and Stanley Ku are the primary authors of the metric definition and analysis computation code
Andreas Sekine is the primary author of our metric visualization tool
Thumbtack established a healthy culture of focusing on data and experimentation even before we had A/B testing at scale. This is thanks to evangelism from several analysts and engineers: Matt Isanuk, Nate Kupp, and Steve Howard particularly come to mind
Writeups that came before ours. Here are a few we found particularly helpful:

This Lyft post focused on theory
This Microsoft paper on their process infrastructure
This Pinterest post focused on their system infrastructure

Originally published at https://engineering.thumbtack.com on June 8, 2018.