Validating Search Ranking with the Simulator

Mark Andrew Yao
Dec 4, 2020 · 5 min read

At Thumbtack, we’re focused on becoming the best platform to fix, maintain, and improve your home. We achieve this by connecting customers to the right home service professionals for the task at hand. To deliver a great experience to both customers and professionals, our engineering teams need the algorithms that match customers to professionals to stay reliable even as we tweak and refine them every day. The simulator is a command line tool we built to help us detect ranking regressions and maintain ranking quality.

Our ranking algorithm takes in a customer’s search and outputs a ranked list of professionals to show the customer (details in this post). It also returns extra data to show in the UI, such as relevant review snippets or tags to highlight professionals that stand out. Because our ranking algorithm is complex, developers can introduce regressions when making changes. Regressions cost Thumbtack revenue, give customers poor experiences, and hurt professionals’ lead prospects. Hence, it’s very important to us to avoid them as much as possible using tools like the simulator.

The Simulator

To catch bugs and regressions, we built the simulator: a command line tool that replays historical requests and stores the output for inspection.

How We Use It

We run the simulator from the command line, telling it how many requests to replay, which date window to draw them from, which service endpoint to target, and which file to write to. It replays the given number of requests from the chosen date window against the specified service endpoint, then writes the responses to the given file.

To check for ranking regressions, we deploy our production branch to a development instance of our ranking micro-service and replay a set of historical requests against it. Afterwards, we re-deploy the development instance, this time with the feature branch containing the changes we want to test. Then, we replay the same historical requests against the re-deployed instance and analyze the results.

The simulator has built-in tools to compare results (e.g. master branch vs. feature branch). For ranking, the following comparators are useful:

  1. Response Comparator: compares the entire response body and flags any differences
  2. List Comparator: compares the ranked order of professionals returned
  3. Set Comparator: compares the set of professionals returned, ignoring order
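To make the difference between the list and set comparators concrete, here is a minimal sketch in Go. The function names `SameOrder` and `SameSet` and the professional IDs are assumptions made for illustration; the simulator's actual comparators may look quite different.

```go
package main

import "fmt"

// SameOrder reports whether two ranked lists hold the same IDs in the same
// order. This is the idea behind a list comparator. (Hypothetical helper.)
func SameOrder(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// SameSet reports whether two lists hold the same IDs, ignoring order and
// duplicates. This is the idea behind a set comparator. (Hypothetical helper.)
func SameSet(a, b []string) bool {
	inA := make(map[string]bool, len(a))
	for _, id := range a {
		inA[id] = true
	}
	inB := make(map[string]bool, len(b))
	for _, id := range b {
		if !inA[id] {
			return false
		}
		inB[id] = true
	}
	return len(inA) == len(inB)
}

func main() {
	// Made-up results from the master and feature branch runs.
	master := []string{"pro-1", "pro-2", "pro-3"}
	feature := []string{"pro-1", "pro-3", "pro-2"}

	fmt.Println("same order:", SameOrder(master, feature)) // same order: false
	fmt.Println("same set:", SameSet(master, feature))     // same set: true
}
```

A result that differs in order but not in set membership points at scoring or sorting rather than at candidate fetching or filtering.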

An Example Use Case

The simulator has been indispensable in helping us flag and resolve potential regressions. One example came when we refactored the code path that calculates a professional’s ranking score (which we use to sort the pro list). Some of our machine learning models use logistic regression, which involves calculating a dot product between feature coefficients and feature values. We store coefficients and values in two separate maps keyed by feature name. Then, to compute the dot product, we iterate over the coefficient map as in the snippet below (interactive version).

Sample code snippet for computing a dot product for a regression model
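A minimal sketch of that computation might look like the following. The function name `dotProduct`, the feature names, and the numbers are all made up for illustration; this is not the production code.

```go
package main

import "fmt"

// dotProduct multiplies each coefficient by the matching feature value and
// sums the terms. It ranges over the coefficient map, so the terms are added
// in whatever order Go happens to iterate the map.
func dotProduct(coefficients, values map[string]float64) float64 {
	sum := 0.0
	for name, coefficient := range coefficients {
		// A feature missing from values reads as 0 and contributes nothing.
		sum += coefficient * values[name]
	}
	return sum
}

func main() {
	// Hypothetical features for one professional.
	coefficients := map[string]float64{"review_count": 0.5, "avg_rating": 2.0}
	values := map[string]float64{"review_count": 4, "avg_rating": 1.5}

	fmt.Println(dotProduct(coefficients, values)) // 5
}
```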

While using the simulator and list comparator to test this refactor, we noticed slight ranking changes. Using the set comparator, we found no significant differences in the set of returned professionals. This suggested an issue with ranking, rather than with fetching candidates or filtering out those that don’t match requirements. Finally, the response comparator revealed that the changes were swaps between adjacent professionals.

Eventually, we traced the root cause to the fact that we were iterating over Go maps when calculating the dot product. According to the Go documentation, the iteration order for maps is not specified and is not guaranteed to be the same from one iteration to the next. On top of that, floating-point addition under the IEEE Standard for Floating-Point Arithmetic (IEEE 754) is not associative: two mathematically equivalent expressions can produce different results when their operations run in a different order, like in the example below (interactive version).
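One way to see the effect is to sum the same three numbers in two different orders. The helper `sumInOrder` is a hypothetical name used only for this sketch.

```go
package main

import "fmt"

// sumInOrder adds the values left to right, exactly in the order given.
func sumInOrder(values []float64) float64 {
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum
}

func main() {
	forward := sumInOrder([]float64{0.1, 0.2, 0.3})  // (0.1 + 0.2) + 0.3
	backward := sumInOrder([]float64{0.3, 0.2, 0.1}) // (0.3 + 0.2) + 0.1

	fmt.Println(forward)             // 0.6000000000000001
	fmt.Println(backward)            // 0.6
	fmt.Println(forward == backward) // false
}
```

The two sums differ only in the last bit, but that is enough to flip a comparison between two scores that should have tied.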

Even when two professionals should have had identical scores, this inconsistent iteration order produced the ranking differences we observed. After identifying the root cause, we stored a sorted list of feature names on the model configuration and summed the terms in that order. These kinds of bugs are easily introduced and are tricky to debug without a tool like the simulator.
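The fix can be sketched as follows. The `Model` type, its fields, and `NewModel` are assumptions made for illustration; the point is only that the summation order comes from a sorted slice rather than from map iteration.

```go
package main

import (
	"fmt"
	"sort"
)

// Model is a hypothetical model configuration. FeatureNames is sorted once,
// when the model is built, and fixes the order in which terms are summed.
type Model struct {
	Coefficients map[string]float64
	FeatureNames []string
}

// NewModel builds a Model and records its feature names in sorted order.
func NewModel(coefficients map[string]float64) Model {
	names := make([]string, 0, len(coefficients))
	for name := range coefficients {
		names = append(names, name)
	}
	sort.Strings(names)
	return Model{Coefficients: coefficients, FeatureNames: names}
}

// Score computes the dot product by walking the sorted feature names, so the
// same inputs always add up in the same order.
func (m Model) Score(values map[string]float64) float64 {
	sum := 0.0
	for _, name := range m.FeatureNames {
		sum += m.Coefficients[name] * values[name]
	}
	return sum
}

func main() {
	// Made-up coefficients and feature values.
	model := NewModel(map[string]float64{"c": 1, "a": 1, "b": 1})
	values := map[string]float64{"a": 0.1, "b": 0.2, "c": 0.3}

	fmt.Println(model.FeatureNames) // [a b c]
	fmt.Println(model.Score(values) == model.Score(values))
}
```

Sorting once at configuration time keeps the per-request cost the same as before while making the result reproducible across runs and deployments.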

How It Works

For a micro-service to be eligible for simulation, it must record the HTTP requests it receives and their responses. At Thumbtack, we built a small wrapper around our HTTP endpoint handlers that records these in events, with fields such as:

  1. Metadata
     • Run-ID: A unique identifier for each request received by a micro-service and its corresponding response
     • Timestamp of the request
     • Name of the micro-service

  2. Request Data
     • Request Path
     • Request Query String
     • Request Body

  3. Response Information
     • Status Code
     • Response Body
All these simulation events are then stored in an events store (we use Google BigQuery) where the simulator can query them.

Data flow for the simulator. Micro-services record requests and the simulator queries for them to replay them.

When we run the simulator, the main thread queries BigQuery for requests to replay. Then, it spins up the requested number of workers, plus a single response collector worker. Because the simulator is written in Go, we use Go concurrency constructs like goroutines and channels to manage the workers’ inputs and outputs. More concretely, workers read from a request channel populated by the main thread, issue requests to the micro-service, and write responses to a response channel. Supporting a variable number of workers lets us replay more requests concurrently and simulate faster, or use fewer workers to lower throughput and avoid overwhelming development instances of our micro-services. Finally, the response collector pulls these responses from the channel and outputs them to a file.
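The worker layout described above can be sketched with two channels and a `sync.WaitGroup`. The `Request` and `Response` types, the `Replay` function, and the `send` callback are hypothetical stand-ins: in this sketch `send` fakes the HTTP call and the collector gathers responses in memory instead of writing a file.

```go
package main

import (
	"fmt"
	"sync"
)

// Request and Response are simplified, hypothetical stand-ins for the
// recorded request and the replayed response.
type Request struct{ RunID, Path string }
type Response struct {
	RunID  string
	Status int
}

// Replay fans requests out to numWorkers goroutines and gathers the
// responses with a single collector goroutine.
func Replay(requests []Request, numWorkers int, send func(Request) Response) []Response {
	if numWorkers < 1 {
		numWorkers = 1 // without at least one worker nothing would drain the channel
	}
	requestCh := make(chan Request)
	responseCh := make(chan Response)

	// Workers: read a request, issue it, write the response.
	var workers sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		workers.Add(1)
		go func() {
			defer workers.Done()
			for req := range requestCh {
				responseCh <- send(req)
			}
		}()
	}

	// Collector: the only goroutine that touches the collected slice.
	var collected []Response
	done := make(chan struct{})
	go func() {
		defer close(done)
		for resp := range responseCh {
			collected = append(collected, resp)
		}
	}()

	// Main thread: populate the request channel, then shut down in order.
	for _, req := range requests {
		requestCh <- req
	}
	close(requestCh)
	workers.Wait()
	close(responseCh)
	<-done
	return collected
}

func main() {
	requests := []Request{{"run-1", "/search"}, {"run-2", "/search"}, {"run-3", "/search"}}
	fake := func(r Request) Response { return Response{RunID: r.RunID, Status: 200} }

	responses := Replay(requests, 2, fake)
	fmt.Println(len(responses)) // 3
}
```

Responses arrive in completion order rather than request order, which is why each one carries its Run-ID in this sketch.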

Internal architecture of the simulator tool

Next Steps

Even though we’ve benefited a lot from the simulator, there’s still much we can add to it. For instance, we can run the simulator alongside automated tests, which would tell developers when their change might cause a regression. Currently, developers judge for themselves which code changes they should test with the simulator. More automated testing would catch less obvious regressions and promote thorough testing. It would also sometimes be helpful to modify the historical requests before replaying them. That way, we could simulate a wider range of scenarios, not only the ones we’ve encountered in the past.

The simulator is one of many tools we’ve built at Thumbtack to help create a more effective marketplace for local services. If this sounds interesting to you, come join us!

From the Engineering team at Thumbtack