Qualitative evaluation of search ranking algorithms

Bharadwaj Ramachandran
May 17

When a customer lands on the Thumbtack homepage, they implicitly trust our product to help them find the best business for the job they have in mind — be it painting their home, hiring a photographer, or finding an accountant to do their taxes. The first step in this experience is interacting with our search bar and sifting through a list of ranked businesses.

Browsing search results is an integral part of the customer journey toward finding the right pro to hire on Thumbtack, so it stands to reason that this surface is the subject of many of our experiments. However, iterating on search ranking poses a unique challenge during development. We rely heavily on machine learning to power search ranking, a topic we’ve discussed in a previous post, and whenever we develop a new ranking algorithm, we need to verify that it works as intended, both qualitatively and quantitatively.

Figure 1: A search results page on Thumbtack

To evaluate results quantitatively, we use a simulator tool that replays historical search requests against two different algorithms and compares their results; we’ve covered the details in a previous blog post. On top of that, new models are evaluated offline against test and validation datasets. However, neither offline evaluation in Jupyter notebooks nor the simulator tool is designed to help us do the following:

  1. Quickly debug why a particular model ranks a business at a specific position in the results, relative to other businesses
  2. Perform qualitative (human) evaluation of the new ranking model’s overall performance on a set of searches, relative to the baseline model
  3. Visualize how different parts of the ranking algorithm interact with each other
  4. View model inputs and outputs for a particular search result in one place

To address these problems, we built an internal tool that we call the “side-by-side evaluator tool”, or “SxS tool” for short. To understand how the tool works, it helps to understand the basics of our ranking architecture. Our modular ranking system allows us to compose different ranking algorithms and additional components such as filters on top of one another. To illustrate how this works, let us examine the structure of a ranker.

The structure of a ranker

Figure 2: The structure of an example ranker

As can be seen in Figure 2, the example ranker runs three steps sequentially.

It first runs an ensemble of machine learning models. In this case, there are two models to run: the contact model and the response model. Each model has a list of features nested beneath it, and the results of the models are combined to produce a ranking score. The list is reordered based on the ranking score and passed into the next step.
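To make this step concrete, here is a minimal Python sketch of a model-ensemble step under some stated assumptions: the Business dataclass, the StubModel class, and the equal weighting of the two model scores are all illustrative, and the post does not describe how the contact and response models are actually trained or blended.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Business:
    business_id: str
    features: Dict[str, float]   # model features, e.g. {"num_reviews": 42.0}
    ranking_score: float = 0.0


class StubModel:
    """Stand-in for a trained model; a real model would load learned parameters."""

    def __init__(self, weights: Dict[str, float]):
        self.weights = weights

    def predict(self, features: Dict[str, float]) -> float:
        return sum(self.weights.get(name, 0.0) * value
                   for name, value in features.items())


def run_model_ensemble(businesses: List[Business],
                       contact_model: StubModel,
                       response_model: StubModel) -> List[Business]:
    """Score each business with both models, combine the scores, and reorder the list."""
    for b in businesses:
        contact_score = contact_model.predict(b.features)
        response_score = response_model.predict(b.features)
        # Equal weighting is an illustrative choice; the real blending is not described.
        b.ranking_score = 0.5 * contact_score + 0.5 * response_score
    return sorted(businesses, key=lambda b: b.ranking_score, reverse=True)
```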

Next, it runs a series of filters. Each filter in this component is responsible for removing businesses from the list if they satisfy a certain condition. In this example, there are two filters: the DedupeFilter and the TruncateFilter. The first ensures that there are no duplicates in the list, and the second truncates the list of businesses to reduce the size of the response we return to the front end.
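Continuing the sketch (and reusing the Business dataclass above), the filter step can be modeled as a chain of objects that each take the list and return a possibly shorter list. The DedupeFilter and TruncateFilter below are simplified stand-ins for the components named above, and the max_results cutoff is an arbitrary illustrative value.

```python
from typing import List


class DedupeFilter:
    """Drop businesses that appear more than once in the list (simplified stand-in)."""

    def apply(self, businesses: List[Business]) -> List[Business]:
        seen, deduped = set(), []
        for b in businesses:
            if b.business_id not in seen:
                seen.add(b.business_id)
                deduped.append(b)
        return deduped


class TruncateFilter:
    """Keep only the top N results to limit the response size (illustrative cutoff)."""

    def __init__(self, max_results: int = 50):
        self.max_results = max_results

    def apply(self, businesses: List[Business]) -> List[Business]:
        return businesses[:self.max_results]


def run_filters(businesses: List[Business], filters) -> List[Business]:
    """Apply each filter in order; each one may remove businesses from the list."""
    for f in filters:
        businesses = f.apply(businesses)
    return businesses
```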

Lastly, the re-ranker step is responsible for reordering the list based on heuristics that aren’t used in our machine learning models. In our example, there is a single re-ranker called the SortByReviewsReranker. It gives customers the ability to reorder the list based on an additional search term that searches over business reviews.
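Finally, here is a sketch of a re-ranker and of the glue that chains the three steps from Figure 2 together. The keyword-count heuristic inside SortByReviewsReranker and the reviews_by_business lookup are assumptions made for illustration; the real component is certainly more involved.

```python
from typing import Dict, List


class SortByReviewsReranker:
    """Reorder results by how often an extra search term appears in each business's reviews."""

    def __init__(self, review_query: str, reviews_by_business: Dict[str, List[str]]):
        self.review_query = review_query.lower()
        self.reviews_by_business = reviews_by_business

    def rerank(self, businesses: List[Business]) -> List[Business]:
        def match_count(b: Business) -> int:
            reviews = self.reviews_by_business.get(b.business_id, [])
            return sum(review.lower().count(self.review_query) for review in reviews)

        # Python's sort is stable, so ties keep their model-based order from earlier steps.
        return sorted(businesses, key=match_count, reverse=True)


def run_ranker(businesses, contact_model, response_model, filters, rerankers):
    """Chain the three steps from Figure 2: model ensemble, then filters, then re-rankers."""
    businesses = run_model_ensemble(businesses, contact_model, response_model)
    businesses = run_filters(businesses, filters)
    for reranker in rerankers:
        businesses = reranker.rerank(businesses)
    return businesses
```

Composing the ranker out of small, swappable pieces like this mirrors the modularity described earlier: trying a different model, filter, or re-ranker is just a change to the lists passed into run_ranker, which is exactly the kind of variation the SxS tool lets us flip on and off.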

Now that we know the rough structure of a ranker and understand this example, let’s take a look at what the side-by-side evaluator tool allows us to do.

The SxS tool

Figure 3: Parameters of the SxS tool

The image above shows the top half of the SxS tool. Manual testing and evaluation of new algorithms is a core use case of the tool, so we added drop-downs at the top of the page to vary the geographic region and search term for which we’re comparing results. In this case, the page we’ve selected is roofers in Dallas, TX. The tool also lets us choose which rankers we’re evaluating “side by side” via the two “Ranker” drop-downs. The remaining options allow us to parametrize various aspects of the request, as well as verify whether either list is sorted according to a specific field. These options are especially useful when debugging in our development environment. However, the two most important parts of the tool are hidden behind the “show ranker config” and “show metadata selector” buttons.
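Before we look behind those buttons, it may help to see how the parameters above could fit together. The exact request schema isn’t shown in the post, so the dataclass below is purely a hypothetical shape with illustrative field names.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SxSComparisonRequest:
    """Hypothetical shape of a side-by-side comparison request; all field names are illustrative."""
    search_term: str                        # e.g. "roofers"
    geo_region: str                         # e.g. "Dallas, TX"
    left_ranker: str                        # ranker shown in the left column
    right_ranker: str                       # ranker shown in the right column
    request_overrides: Dict[str, str] = field(default_factory=dict)  # other request parameters
    verify_sorted_by: Optional[str] = None  # check that a list is sorted by this field


request = SxSComparisonRequest(
    search_term="roofers",
    geo_region="Dallas, TX",
    left_ranker="baseline_ranker",
    right_ranker="candidate_ranker",
    verify_sorted_by="ranking_score",
)
```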

Ranker configuration

Figure 4: Editing the ranker configuration on the SxS tool
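The screenshot doesn’t reproduce here, but based on the example ranker above and the ability to toggle sub-components described in the conclusion, the editable configuration plausibly resembles a nested structure along these lines. The schema, field names, and enabled flags are all assumptions for illustration.

```python
# Hypothetical ranker configuration for the example ranker; the real schema is not shown
# in the post. Flipping an "enabled" flag and re-running the comparison is the kind of
# edit the config editor is meant to make easy.
example_ranker_config = {
    "models": [
        {"name": "ContactModel", "enabled": True, "weight": 0.5},
        {"name": "ResponseModel", "enabled": True, "weight": 0.5},
    ],
    "filters": [
        {"name": "DedupeFilter", "enabled": True},
        {"name": "TruncateFilter", "enabled": True, "max_results": 50},
    ],
    "rerankers": [
        {"name": "SortByReviewsReranker", "enabled": False},
    ],
}
```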

Metadata selector

Figure 5: Side-by-side comparison with metadata selections made
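As Figure 5 suggests, the metadata selector controls which pieces of ranking metadata (feature values, model outputs, and so on) are displayed next to each result in the two columns. Here is a toy sketch of that idea, reusing the Business dataclass from earlier; the metadata dictionary and the selected field names are illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical per-business ranking metadata: feature values and model outputs keyed by name.
RankingMetadata = Dict[str, float]


def print_side_by_side_metadata(left: List[Business],
                                right: List[Business],
                                metadata: Dict[str, RankingMetadata],
                                selected_fields: List[str]) -> None:
    """Print only the selected metadata fields for each position, left ranker vs. right ranker."""
    for rank, (left_biz, right_biz) in enumerate(zip(left, right), start=1):
        left_meta = metadata.get(left_biz.business_id, {})
        right_meta = metadata.get(right_biz.business_id, {})
        print(f"#{rank}: {left_biz.business_id}  vs  {right_biz.business_id}")
        for name in selected_fields:
            print(f"    {name}: {left_meta.get(name)}  |  {right_meta.get(name)}")
```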

Conclusion and next steps

The SxS tool addresses each of the four needs we listed at the start:

  1. The ranking metadata inspector gives the developer the ability to inspect the data behind each result at a granular level. This fulfills the first of our wants, which was to make debugging easier.
  2. Viewing changes in the SxS tool makes human evaluation of a particular search easier. This is especially true when the evaluation is done across members of the ranking team, and when we use the analytics tools at our disposal to ensure that we cover a wide variety of search terms, geographies, and markets.
  3. The ability to change rankers and toggle their sub-components in the ranker config allows us to visualize how different parts of the ranker affect search results.
  4. The metadata selector tool functions as a shortcut to see how model inputs translate to model outputs for a particular search result.

As for next steps, there are many quality-of-life improvements we could make to the tool. Right now, developers need to do a significant amount of scrolling to compare ranking metadata for a business whose rank changed between the ranker on the left and the ranker on the right: if a business is ranked in the 2nd spot on the left but the 8th spot on the right, the developer has to scroll between those two positions to compare its metadata across the two rankers. On the more quantitative side, we might want to view some descriptive statistics about each ranker at a glance to make it easier to spot data issues. Going forward, we want not only to visualize the features, configurations, and machine learning model outputs, but also to introduce model explainability into the SxS tool so we can better visualize model predictions.

If problems like search ranking, machine learning, and model evaluation interest you, join us as we build out a robust marketplace for local services!

Special thanks to Navneet Rao, Joe Tsay, Richard Demsyn-Jones, Mark Andrew Yao, Karen Lo, and Dhananjay Sathe for feedback on this post.
