Evolution of Search Ranking at Thumbtack

Navneet Rao
Thumbtack Engineering
Dec 8, 2022

Millions of customers use Thumbtack every year to find and book professionals and local businesses to help them care for their home. Customers can find professionals for around 500 categories of services on Thumbtack. When a customer searches for professionals (pros) in a specific category, we display an ordered list of professionals in their neighborhood.

At Thumbtack, search ranking refers to the problem of finding the most relevant professionals for a customer’s search request and ordering them based on their relevance. The more relevant the professionals, the easier it is for customers to find and book the right professionals for their job. Search ranking is thus an important optimization problem that has a direct impact on both the number of projects being completed via Thumbtack and the global conversion on our platform.

Recently, we successfully tested and productized an ensembled Deep Cross Network (DCN V2) [1]. This was the first time we tested a near state-of-the-art neural-network based machine learning (ML) model to improve search ranking. In this blog post, we describe our experience significantly evolving search ranking using ML over the past 2 years and some of our learnings along the way.

Fig 1. Customer searching for house cleaners in their neighborhood

Problem Context

In 2019, we first transitioned to using machine learning to power search ranking. A handful of carefully crafted features powered a set of logistic regression based ML models to predict the relevance of professionals in a specific category. But as we continued to experiment with a few new features in 2020, improvements in offline metrics were not translating into wins in online A/B tests, and it felt like we had hit a brick wall.

Fig 2. Search ranking phases in 2020

The various phases involved in retrieving the most relevant professionals for a search request (sketched in code after the list below) included:

  1. Candidate Selector: Chooses the initial set of professionals in a category
  2. Candidate Enricher: Enriches the set with information from other microservices, e.g., available budget from the Budgets service
  3. Candidate Filterer: Filters the set based on filtering policies, e.g., filters out professionals who aren't targeting customer-specified preferences
  4. Featurizer: Computes or retrieves the features needed by the ML models
  5. Simple Ranker: Predicts the relevance of each professional to the search query using a simple ensemble of logistic regression models and orders them by relevance
  6. Policy based Re-ranker: Re-ranks the list of professionals based on certain marketplace policies
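
To make the flow concrete, here is a minimal sketch of how these phases could be composed (shown in Python for readability; the actual service is written in Go, and the type and function names here are hypothetical stand-ins, not our production code):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class SearchRequest:
    category_id: str
    zip_code: str
    preferences: Dict[str, str] = field(default_factory=dict)

@dataclass
class Candidate:
    pro_id: str
    features: Dict[str, float] = field(default_factory=dict)
    relevance: float = 0.0

def rank_pros(request: SearchRequest, select: Callable, enrich: Callable,
              passes_filters: Callable, featurize: Callable,
              score: Callable, rerank: Callable) -> List[Candidate]:
    """Compose the six ranking phases; each phase is injected as a callable."""
    candidates = select(request)                                        # 1. Candidate Selector
    candidates = [enrich(c) for c in candidates]                        # 2. Candidate Enricher
    candidates = [c for c in candidates if passes_filters(c, request)]  # 3. Candidate Filterer
    for c in candidates:
        c.features = featurize(c, request)                              # 4. Featurizer
        c.relevance = score(c.features)                                 # 5. Simple Ranker
    candidates.sort(key=lambda c: c.relevance, reverse=True)
    return rerank(candidates, request)                                  # 6. Policy-based Re-ranker
```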

In this post, we will be focusing on the evolution of the featurization and ranking phases of the overall process. The candidate selection/enrichment/filtering and policy layers are beyond the scope of this post.

Challenges

As mentioned above, small improvements in offline metrics using new features had not translated to online A/B test wins in 2020. Since we were using a relatively simple ranker consisting of an ensemble of logistic regression models with a small set of handcrafted features, there were clearly many avenues we could have explored. Here are some of the challenges that were discussed:

A. Experimentation Process

There was significant effort involved in the various phases of ranking experimentation: data creation, offline experimentation, model productization, online experimentation, etc. This resulted in longer experiment iterations. And since the experiments hadn't quite panned out even after putting in that effort, it was becoming harder to justify larger investments in this space.

B. Model Complexity

Early offline experiments with non-linear tree-based machine learning models like Random Forests had shown promising results. But at the time our machine learning algorithm (Logistic Regression) was written in Go and baked into our ranking service. We lacked the infrastructure to serve non-linear models in production at the scale required for ranking.

Non-linear models thrive when they have access to large datasets with a rich feature space. With millions of customers we potentially had access to larger datasets, but new feature creation at the time was a time-intensive process. Even if we had the underlying events logged, we still had to build the feature computation and then wait at least a month to have enough data to train the model using that feature.

C. Position Bias

Position bias refers to the tendency of customers to interact more with search results at the top of the list, irrespective of their relevance to the customer's job preferences. Though we knew this was likely a problem, we hadn't yet experimented with ways to measure and mitigate the effect.

Evolving Experimentation Processes

Offline Experimentation: The Python notebooks used to experiment with new features and new models can get bloated over time. By investing in streamlining the notebooks, introducing experiment tracking with MLflow, and abstracting model training code into a standardized library, we were able to speed up offline experimentation.
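
As a rough illustration of the kind of standardized tracking this enables, here is a minimal MLflow sketch; the experiment name, parameters, and metric are hypothetical and this is not our actual training library:

```python
import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def run_offline_experiment(X_train, y_train, X_val, y_val, feature_names, params):
    """Train a candidate model and log what is needed to compare runs later."""
    mlflow.set_experiment("search-ranking-offline")  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_param("n_features", len(feature_names))

        model = LogisticRegression(**params).fit(X_train, y_train)
        val_loss = log_loss(y_val, model.predict_proba(X_val)[:, 1])

        mlflow.log_metric("val_log_loss", val_loss)
        mlflow.sklearn.log_model(model, "model")
    return model
```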

Feature Creation: Creating a new feature in production required us to log event data, create the feature computation and then wait at least a month to gather enough samples to train our ML models. For features that were based on events that weren’t being logged in the past, this was a given. But for features that were based on events that were already being logged (which were a large portion of our feature ideas), we decided to create a simple feature store that could compute new features and backfill feature values based on past events. More details on this are shared here. This greatly accelerated our ability to test out new features in production. In conjunction with this, we also introduced new feature validation checks to reduce training-serving skew.
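
As a simplified illustration of the backfill idea, assuming past events are available as a table of (pro_id, event_type, timestamp) rows, a point-in-time aggregate can be recomputed from history as below; the event types and feature name are hypothetical:

```python
import pandas as pd

def backfill_contact_rate(events: pd.DataFrame, as_of: pd.Timestamp,
                          window_days: int = 90) -> pd.DataFrame:
    """Recompute a per-pro aggregate feature as it would have looked at `as_of`.

    `events` is assumed to have columns: pro_id, event_type, timestamp.
    Only events strictly before `as_of` are used, so backfilled values match
    what online computation would have produced (reducing training-serving skew).
    """
    window_start = as_of - pd.Timedelta(days=window_days)
    past = events[(events["timestamp"] >= window_start) & (events["timestamp"] < as_of)]

    views = past[past["event_type"] == "profile_view"].groupby("pro_id").size()
    contacts = past[past["event_type"] == "contact"].groupby("pro_id").size()

    feature = (contacts / views).fillna(0.0).rename("contact_rate_90d")
    return feature.reset_index()
```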

Feature Space Evolution: Armed with the capability to more easily test new features in production, we expanded our feature set from a handful of carefully handcrafted features (<10) to a sandbox of over 100 features.

At Thumbtack, there are 3 broad feature categories that can be used to rank search results:

Customer Features: Features that relate to the customer making the search request

Pro Features: Features that relate to the pro being ranked in the search request

Request Features: Features that relate to the search request involving a specific customer and a list of eligible professionals

Fig 3. Features for a customer’s search request

Each feature category has 2 subcategories of feature types:

Customer Features

Descriptive Customer Attributes: Attributes about the customer

Aggregated/Learned Customer Features: Aggregated or learned customer behavior

Pro Features

Descriptive Pro Attributes: Attributes about the professional

Aggregated/Learned Pro Features: Aggregated or learned pro behavior

Request Features

Descriptive Request Attributes: Attributes about the request

Request Specific Pro Features: Computed pro features specific to the search request

As one might expect, feature exploration and testing initially involved features related to the professional, like whether they are a Top Pro or their average response time. This gradually evolved to include request features, like the page source of a request, to create more dynamic search rankings that adapt to the session-specific nature of each search. More recently, as efforts are underway to personalize results for individual customers, we have successfully introduced customer-specific features, like the number of past requests by the customer, to further improve relevance.
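
Concretely, a single training example for one (customer, pro, request) combination can be thought of as the concatenation of these three feature groups. The feature names below are hypothetical illustrations, not our production schema:

```python
# Hypothetical example of how the three feature groups combine into one training row.
example_row = {
    # Customer features
    "customer_num_past_requests": 4,         # aggregated/learned customer behavior
    "customer_account_age_days": 730,        # descriptive customer attribute
    # Pro features
    "pro_is_top_pro": 1,                     # descriptive pro attribute
    "pro_avg_response_time_minutes": 42.0,   # aggregated/learned pro behavior
    # Request features
    "request_page_source": "category_page",  # descriptive request attribute
    "pro_distance_to_customer_miles": 3.2,   # request-specific pro feature
}
```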

Model Productization: Because of the various Thrift & Go changes required to configure and deploy new ML models to production for testing, it took multiple days to launch an experiment even if there were no feature changes involved. By building automation to streamline this, we were able to bring down the time required to start an experiment to under a day (after we had a candidate model identified during offline experimentation).

Post Experiment Analysis: On a cross-functional team like ours, product analysts are responsible for thinking through post experiment analysis in conjunction with engineers and data scientists. Since we wanted to bring down experiment iteration time, we created a comprehensive analytics dashboard for ranking that analyzed factors like the effect on different segments of customer traffic and on all key downstream metrics, to the point where anyone could plug in an experiment name and get the analysis necessary to make a ship decision.

Experimentation Velocity: By evolving our offline experimentation, model productization, post experiment analysis and feature creation process we have been able to drive faster iteration for our ranking experiments. But this alone was not enough.

  1. We coupled this with evangelizing the uncertainty of ranking experimentation and embracing the fact that only 1 in 3 or 1 in 4 experiments may succeed. This meant evangelizing the idea of systematically exploring our hypothesis space as we rapidly tried new ideas.
  2. We created goals around experimentation velocity for the team as a whole, e.g., 6+ ranking experiments every 6 months hoping 1 or 2 might succeed rather than hoping that specific experiment ideas will succeed.

Evolving Modeling

In 2020, knowing that we wanted to explore complex non-linear models, we wanted to quickly move away from the ensembled logistic regression models originally baked into our ranking service in Go. This led to us building the feature store mechanism mentioned earlier. To serve complex non-linear models, we also built a model inference layer backed by AWS SageMaker, extracting model inference out of the ranking service. In the diagram below you can see the current evolution of the search ranking phases, which now include feature stores for customer and pro features and a model inference service that can serve complex non-linear models.

Fig 4. Search ranking phases in 2022
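
As a rough sketch of what calling such an inference layer can look like from the ranking service's perspective (shown in Python for readability; the endpoint name and payload schema are illustrative assumptions, not our actual service contract):

```python
import json
import boto3

ENDPOINT_NAME = "search-ranking-model"  # hypothetical endpoint name
runtime = boto3.client("sagemaker-runtime")

def score_candidates(feature_rows):
    """Send featurized candidates to a SageMaker endpoint and return their scores."""
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"instances": feature_rows}),
    )
    return json.loads(response["Body"].read())
```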

Tree-based ML models: Tree-based ML models are a class of algorithms that build tree-like decision structures to make predictions. Since we were using scikit-learn, initial offline exploration of tree-based models started with Random Forests but quickly proceeded to Gradient Boosted Decision Trees. This was the first non-linear model we successfully productized, in early 2021, lifting global conversion via search ranking by 1.4% relative to baseline (this launch also included a set of new features).

As we added more features and trained more complex models, we quickly assessed 2 factors:

  1. We needed to understand the latency bounds our models had to operate within. To address this, we ran simulated latency tests, artificially introducing delay to understand the effect of higher latency on customer experience, and thus learned the latency threshold we had to operate under.
  2. We needed a more scalable and efficient modeling framework that could handle more complex models. For this, we ran performance tests with XGBoost (a simplified version of such a benchmark is sketched after this list) and immediately saw significant gains in inference latency, leading us to productize XGBoost right away and creating headroom for further model complexity.
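
The kind of quick inference performance comparison we ran can be approximated with a simple benchmark like the sketch below, using synthetic data; the dataset sizes, batch size, and model settings are arbitrary:

```python
import time
import numpy as np
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
X = rng.random((20_000, 50))
y = (rng.random(20_000) > 0.5).astype(int)

sk_model = GradientBoostingClassifier(n_estimators=100).fit(X, y)
xgb_model = xgb.XGBClassifier(n_estimators=100, tree_method="hist").fit(X, y)

batch = X[:300]  # roughly one search request's worth of candidates
for name, model in [("sklearn GBDT", sk_model), ("XGBoost", xgb_model)]:
    start = time.perf_counter()
    for _ in range(50):
        model.predict_proba(batch)
    print(f"{name}: {(time.perf_counter() - start) / 50 * 1000:.2f} ms per batch")
```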

Model Optimization: Here are some practical learnings from exploring an offline feature space of over 200 features and optimizing our tree-based models:

  1. As we used model explainability techniques like Partial Dependence Plots and SHAP, we noticed that model outputs didn't vary with certain numeric features in directions that aligned with the expected product experience. We addressed this by imposing monotonic constraints on those features (see the sketch after this list) and quickly noticed performance gains.
  2. Feature selection over 200 features wasn't going to be easy, so we started with Recursive Feature Elimination. But Boruta SHAP, a robust feature selection strategy that combines the Boruta feature selection algorithm with Shapley values for calculating feature importance, generally outperformed other methods. It has now become our go-to feature selection strategy.
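
Here is a minimal sketch of imposing a monotonic constraint in XGBoost; the features, constraint directions, and synthetic data are hypothetical, chosen only to illustrate the mechanism:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Hypothetical numeric features: [avg_rating, num_reviews, response_time_hours]
X = rng.random((5_000, 3))
y = (0.8 * X[:, 0] + 0.3 * X[:, 1] - 0.5 * X[:, 2]
     + 0.1 * rng.standard_normal(5_000) > 0.3).astype(int)

# +1: predictions may only increase with the feature; -1: only decrease; 0: unconstrained.
model = xgb.XGBClassifier(
    n_estimators=200,
    tree_method="hist",
    monotone_constraints=(1, 1, -1),  # e.g., higher rating/review count never hurts, slower response never helps
)
model.fit(X, y)
```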

Mitigating Position Bias: As noted earlier, customers tend to interact more with search results at the top of the list irrespective of their relevance to job preferences, which is referred to as position bias. Here is what worked well for us to measure and mitigate its effects:

  1. Measure: In late 2020, we created an adapted version of the RandPair algorithm [2], randomly swapping a few search results so we could estimate the effect of position bias on our platform.
  2. Mitigate: In 2021, using the position bias estimates, we created a set of features calibrated for position bias (e.g., click rate), which led to small performance gains for our model. We also investigated and modified the loss function for XGBoost using Inverse Propensity Scoring [3], which led to a pronounced improvement in performance (a simplified weighting-based version is sketched after this list).
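
A simplified way to fold inverse propensity scoring into training is to weight clicked examples by the inverse of the estimated examination propensity at the position where they were shown. The sketch below uses sample weights rather than a modified loss, and the propensity values and data are made up for illustration:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(7)

# Made-up examination propensities per displayed position (estimated via randomized swaps).
propensity = {1: 1.0, 2: 0.72, 3: 0.55, 4: 0.43, 5: 0.35}

# Synthetic training data: features, click labels, and the position each pro was shown at.
X = rng.random((2_000, 10))
positions = rng.integers(1, 6, size=2_000)
y = (rng.random(2_000) < 0.2).astype(int)

# Clicked examples are up-weighted by 1 / P(examined at their position).
position_propensity = np.array([propensity[p] for p in positions])
weights = np.where(y == 1, 1.0 / position_propensity, 1.0)

model = xgb.XGBClassifier(n_estimators=100, tree_method="hist")
model.fit(X, y, sample_weight=weights)
```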

Neural-network based ML models: In early 2021, when we were initially optimizing tree-based models, we invested some effort in exploring neural networks for our problem space. At the time, we were using ~20 features, and tree-based models outperformed the neural network models on metrics like NDCG and log loss. Given the smaller feature space, the effort involved in training neural models, and the lack of infrastructure to support scalable production deployments, we decided to postpone further exploration. This year, with a richer feature space of over 100 features and improved model inference capabilities, we decided it was a good time to re-invest in neural networks.

The DCN V2 model is able to learn predictive feature interactions effectively and in a resource-efficient manner [1]. Following the exciting results from the research, we decided to test it out (the core cross layer is sketched after the results below). Here were some of our initial testing outcomes:

  1. Offline evaluation: Surprisingly, though DCN V2 outperformed our XGBoost-based baseline on log loss by 0.7% during offline evaluation, it didn't improve other key metrics like NDCG or MRR. Though this would normally have given us some pause, we decided to productize the model for online testing since it would also serve as a way to build a foundation for serving neural models in production.
  2. Online evaluation: A/B testing DCN V2, we demonstrated that it could lift global conversion by 0.5% when added to our ranking ensemble. This led to us successfully productizing our first near state-of-the-art ML model for search ranking at Thumbtack!
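
For intuition, the core of DCN V2 is its cross layer, which learns explicit feature interactions as x_{l+1} = x_0 ⊙ (W·x_l + b) + x_l [1]. Below is a minimal Keras sketch of that idea in a parallel deep & cross architecture; the layer sizes and overall structure are illustrative, not our production model:

```python
import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    """DCN V2 style cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l."""
    def build(self, input_shape):
        dim = input_shape[0][-1]
        self.w = self.add_weight(shape=(dim, dim), initializer="glorot_uniform", name="w")
        self.b = self.add_weight(shape=(dim,), initializer="zeros", name="b")

    def call(self, inputs):
        x0, xl = inputs
        return x0 * (tf.matmul(xl, self.w) + self.b) + xl

def build_dcn(num_features: int, num_cross_layers: int = 2) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(num_features,))
    # Cross network: explicit feature interactions.
    x = inputs
    for _ in range(num_cross_layers):
        x = CrossLayer()([inputs, x])
    # Deep network: implicit interactions.
    deep = tf.keras.layers.Dense(128, activation="relu")(inputs)
    deep = tf.keras.layers.Dense(64, activation="relu")(deep)
    # Combine both branches and predict relevance.
    combined = tf.keras.layers.Concatenate()([x, deep])
    output = tf.keras.layers.Dense(1, activation="sigmoid")(combined)
    return tf.keras.Model(inputs, output)

model = build_dcn(num_features=100)
model.compile(optimizer="adam", loss="binary_crossentropy")
```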

Conclusion & Future Direction

In the last 2 years, we have worked on over 20 ranking experiment iterations and shipped 6 of them to production. Search ranking improvements across time have cumulatively resulted in a relative 4.5% conversion rate lift, as measured by A/B tests.

This was a result of tactical and strategic investments that spanned across experimentation & modeling processes.

Going into 2023, now that we have productized a near state-of-the-art machine learning model for search ranking, we plan to iterate on and optimize it further, while also improving our experimentation infrastructure. Personalization is also a dimension where we have only scratched the surface, and we look forward to improving the customer experience via more personalized search results.

Interested in staying connected with Thumbtack? Follow us on LinkedIn or check out our latest openings at thumbtack.com/careers.

Acknowledgement

The evolution of search ranking at Thumbtack is a result of direct & indirect contributions from current and former employees including but not limited to: Samuel Oshin, Tim Huang, Derek Zhao, Richard Demsyn-Jones, Mark Andrew Yao, Oleksandr Pryimak, Amber (Anyu) Wang, Dhananjay Sathe, Ling Xie, Sophia Bussey, Wade Fuller, Tom Shull, Karen Lo, Eric Ortiz, Bharadwaj Ramachandran, Abhilash Arora, Michael Anthony, Adam Hollock, Joseph Tsay, Alpesh Gaglani.

References

[1] Wang, Ruoxi, et al. “DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems.” Proceedings of the Web Conference 2021. 2021.

[2] Wang, Xuanhui, et al. “Position bias estimation for unbiased learning to rank in personal search.” Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 2018.

[3] Hu, Ziniu, et al. “Unbiased LambdaMART: An unbiased pairwise learning-to-rank algorithm.” The World Wide Web Conference. 2019.
