How We Personalized Recommendations for Professionals on Thumbtack

Thumbtack Engineering · 11 min read · Jun 17, 2024

Authors: Aishwarya V Srinivasan and Shishir Dash

Hundreds of thousands of professionals (pros) host their businesses on Thumbtack. Every year they serve millions of customers looking to get jobs done across more than 500 categories of home services (House Cleaning, Plumbing, etc.). A single pro can offer any number of services on Thumbtack.

Each service category has its own set of preferences for the pro to choose from. For example, in House Cleaning, pros can choose to offer cleaning services only for small apartments, or to indicate that they do not offer carpet cleaning. These preferences vary by service category, and pros need to fill them in carefully to ensure they get matched with customers who align with their preferences.

We recommend actions that pros can take to potentially grow their businesses on Thumbtack. In the past year, we decided to personalize the actions we recommend to our pros using a data-driven approach rather than a rules-based heuristic approach. In this blog, we will first set context around the problem, then walk you through how we addressed it. We will also touch on the impact it had on our product and our users.

The Problem

Pros want to grow their business on Thumbtack; that is, they want more leads/bookings that will get them more jobs. There are several actions they can take to boost their potential for more leads, such as:

  • Expanding their current within-category preferences.
  • Expanding the service categories and/or geographies they offer.
  • Becoming more competitive, for example by setting higher bids.

Often, it is not very obvious when they need to take specific actions in order to maximize the chances of growing their business on Thumbtack.

We partly addressed this problem by launching a recommendations carousel for pros on the app. We tailor data-driven recommendations to each pro with specific calls to action (see example below).

Figure A. Recommendations Carousel for Pros

We have many recommendation types, each with its own eligibility criteria. Until recently, for ease of implementation and quicker additions, we ranked these recommendations based on product intuition, with the same ordering for every pro. However, different pros might have (a) different needs for their business, and (b) different sets of actions that are more suited to those needs. Some pros may optimize for more leads on Thumbtack, while others might be more interested in higher-quality leads, higher chances of job conversion, or something else. Thus, we realized there was an opportunity to prioritize and personalize these recommendations for individual pros.

We hypothesized that each pro has their own preferences, and that treating all pros equivalently with a static set of recommendations is likely sub-optimal. Our pros are unique, just like our customers, and one size does not always fit all. Thus, by surfacing timely, valuable, and personalized recommendations to pros, we can help them get more leads/bookings on Thumbtack.

Data Challenges for Personalization Model

To test our hypothesis, we needed to track and store signals which could be helpful as potential features for our personalization model. This meant adding tracking for key recommendation metadata like position and rendition.

  • "Position" signals track the position of the recommendations pros viewed, clicked, and enabled. Pros prefer the top results and aren't willing to scroll indefinitely, so the position of a recommendation within a list is a key factor influencing pro behavior. If we train a model that does not account for position, we risk learning from a dataset heavily skewed by pro preference for the top recommendations. Interested readers can learn more about position bias from this article.
  • "Rendition" signals are metadata about which recommendations were rendered (regardless of whether they were viewed or clicked), at what position, to which pro, and at what time.

Taken together, such tracking helps us coalesce a clean, structured training dataset for a personalization model. However, a key hurdle stood in the way of this first personalization model: we did not have any historical metadata to learn from. We thus first needed to curate a historical dataset with proxy metadata. Our chosen proxy leveraged a quirk of our legacy method of launching recommendations in the non-personalized world. In these launch tests, we had always used position 1 of the carousel to try out a brand new recommendation. We spliced together view and engagement data from all such "first launch" experiments for every card. Since every recommendation, in its first experiment-launch, was shown to pros in the same top position, the effect of position on pro behavior is reasonably well controlled for in this spliced-together dataset, which in effect gives us a reasonable estimate of position bias. While not a perfect de-biasing heuristic, it was a good approximation of what would have happened had each recommendation been observed by pros at equivalent positions.
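To make the splicing concrete, here is a minimal sketch of how such a proxy dataset might be assembled. The file, table, and column names (rendition_events, engagement_events, rec_type, etc.) are hypothetical, not our actual schema:

```python
import pandas as pd

# Hypothetical event logs; names and columns are illustrative only.
renditions = pd.read_parquet("rendition_events.parquet")    # pro_id, rec_type, position, experiment, ts
engagements = pd.read_parquet("engagement_events.parquet")  # pro_id, rec_type, experiment, enabled

# Find each recommendation type's first experiment-launch, during which
# the card was always rendered at position 1 of the carousel.
first_launch = (
    renditions.sort_values("ts")
    .groupby("rec_type", as_index=False)
    .agg(first_experiment=("experiment", "first"))
)

# Keep only views from those first launches, all at the same top position.
views = renditions.merge(first_launch, on="rec_type")
views = views[(views["experiment"] == views["first_experiment"]) & (views["position"] == 1)]

# Splice views together with enablement outcomes to form the proxy dataset.
proxy_dataset = views.merge(
    engagements[["pro_id", "rec_type", "experiment", "enabled"]],
    on=["pro_id", "rec_type", "experiment"],
    how="left",
).fillna({"enabled": 0})
```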

Model Design & Inference

Since our goal was for pros to get more leads/bookings on Thumbtack, we decided to rank recommendations by the estimated improvement to lead volume given that the pro views the recommendation. This estimate can be obtained by multiplying two sub-component estimates (formalized after the list below): one for how likely the pro is to enable the recommendation, and one for the expected lead volume upon enablement.

  • Personalization model: We created a personalization model for predicting the probability that a pro will enable a recommendation — P(Enable recommendation k | view recommendation k). For this, we used a tree based binary classification approach where the label is a simple binary outcome: 1 when a pro enabled a recommendation, 0 when they did not.
  • Value Given Enablement (VGE) model: We created a separate model for estimating the expected lead volume for each recommendation type given that the pro enabled the recommendation, i.e., E(lead volume | enabled recommendation k).
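In equation form, the ranking score for pro p and recommendation k is the product of the two estimates (the notation here is ours, not from the original post):

```latex
\widehat{\Delta\,\mathrm{leads}}(p, k) \;=\;
\underbrace{P\big(\mathrm{enable}_k \mid \mathrm{view}_k,\ p\big)}_{\text{personalization model}}
\;\times\;
\underbrace{\mathbb{E}\big[\mathrm{lead\ volume} \mid \mathrm{enable}_k,\ p\big]}_{\text{VGE model}}
```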

At inference time, for each pro-recommendation combination, we get predictions from each model. Multiplying the outputs gives us rank scores ordered by estimated lead impact for each pro individually.
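As a sketch, inference amounts to multiplying the two model outputs and sorting. Function and variable names below are illustrative, not production identifiers:

```python
import numpy as np

def rank_recommendations(candidate_features, candidate_recs, p_enable_model, vge_by_rec_type):
    """Rank one pro's candidate recommendations by estimated lead impact.

    candidate_features: feature matrix, one row per candidate (rec type is itself a feature).
    candidate_recs: list of recommendation types, aligned with the rows above.
    p_enable_model: classifier whose predict_proba gives P(enable | view).
    vge_by_rec_type: dict mapping recommendation type to E(lead volume | enable).
    """
    p_enable = p_enable_model.predict_proba(candidate_features)[:, 1]
    vge = np.array([vge_by_rec_type[rec_type] for rec_type in candidate_recs])
    scores = p_enable * vge          # estimated lead improvement per candidate
    order = np.argsort(-scores)      # highest estimated impact first
    return [candidate_recs[i] for i in order]
```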

Figure B. Model for Estimated improvement to lead volume given that the pro views the recommendation

Personalization Model

As mentioned previously, we created a personalization model for predicting the probability that a pro will enable a recommendation. This model is a boosted tree-based binary classifier which includes recommendation type as a feature and predicts whether the recommendation will be enabled or not. Boosted tree-based models are effective because they combine the strengths of multiple simple models (weak learners) into a more accurate ensemble. Each weak learner focuses on correcting the errors of the previous model, progressively refining the overall prediction. This iterative process allows the model to capture complex patterns in the data and improve its accuracy, much like a group of individuals with different skills collaborating to solve a problem. The final ensemble is a weighted combination of these weak learners, resulting in a robust and accurate predictor.
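The post does not name a specific library; as an illustration, a classifier along these lines could be fit with scikit-learn's gradient boosting (feature names here are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

# Hypothetical training set: one row per (pro, date, recommendation view).
df = pd.read_parquet("training_views.parquet")

# Recommendation type is itself a feature; one-hot encode it alongside
# illustrative pro-level features (names here are made up).
X = pd.get_dummies(
    df[["rec_type", "days_on_platform", "weekly_leads", "num_categories"]],
    columns=["rec_type"],
)
y = df["enabled"]  # 1 if the pro enabled the viewed recommendation, else 0

model = HistGradientBoostingClassifier()
model.fit(X, y)
p_enable = model.predict_proba(X)[:, 1]  # P(enable | view) for each sample
```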

We curated our dataset at a per-pro, per-date level. All features were calculated as of that date to prevent leakage of future information during training. Each sample in the dataset was a view of a given recommendation type by a given pro on a given date, with all features; the target was whether the pro enabled the recommendation or not. We split our data by date and by pro such that (1) pros in the training dataset were not included in the test dataset, and (2) the most recent dates were set aside for testing. This split ensured that we assessed both the model's ability to generalize to new, unseen pros and its ability to perform in the real-world production environment, where we predict the future given past training data.
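A minimal sketch of that split logic follows; the cutoff window, held-out fraction, and column names are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

def split_by_pro_and_date(df, date_col="date", pro_col="pro_id",
                          test_days=28, test_pro_frac=0.2, seed=7):
    """Hold out a random subset of pros (pro-generalization test) and the
    most recent dates (temporal test); train on the remainder.
    Assumes date_col is a datetime column."""
    rng = np.random.default_rng(seed)
    pros = df[pro_col].unique()
    heldout_pros = set(rng.choice(pros, size=int(len(pros) * test_pro_frac), replace=False))
    cutoff = df[date_col].max() - pd.Timedelta(days=test_days)

    is_heldout_pro = df[pro_col].isin(heldout_pros)
    is_recent = df[date_col] > cutoff

    train = df[~is_heldout_pro & ~is_recent]
    test = df[is_heldout_pro | is_recent]  # unseen pros and/or future dates
    return train, test
```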

Pro specific recommendations

We compared this model's performance to that of other algorithms such as logistic regression, bagged tree-based models, and neural networks. Boosted trees ended up striking the best balance between ranking performance, ease of training, and ease of inference. Model interpretability was less of a need, since the motive was to improve pro engagement rather than to understand why the model recommended a certain card type. In the end, we went ahead with boosted trees because we wanted to first validate our hypotheses with a solid baseline and relatively fast inference.

VGE Model

As mentioned previously, we created a separate model for estimating the expected lead volume for each recommendation type given that the pro enabled the recommendation. For this piece of the puzzle, we estimated the expected increase in lead volume using a Fixed Effects Modeling (FEM) approach. FEM models an outcome we're interested in (in this case, lead volume) as a linear combination of various inputs, with these inputs split into fixed effects and exogenous variables.

  • Fixed effects represent effects in the data that are common to many entities (such as industry) and thus create some natural groupings.
  • Exogenous variables represent the feature vectors specific to each pro. In particular, we are interested in pro specific characteristics, and pro behaviors or actions.

A pro's intrinsic traits, like the state they live in or the broad category of services they offer (such as plumbing or carpentry), could be modeled as fixed effects, since they define a market and create a shared effect across many pros. A pro-specific history of reviews falls under the pro-characteristics subset of exogenous variables, while the logged event "pro enabled a recommendation" falls under the pro-behaviors subset. The outcome we're interested in quantifying is each recommendation's enablement impact in units that matter to the pro: lead volume, in this case.

Once we represent all our data in this manner, we can use simple regression techniques to estimate the relative impact of each input on the outcome. We were specifically interested in the impact from the pro’s enablement action. If we use a boolean to represent this action, the difference in estimated lead volume when it’s 1 vs when it’s 0 can give us a direct quantifier for the causal impact of enabling the recommendation.

VGE model

All else being equal, this difference can be obtained by simply estimating the coefficients c1, c2, …, and it is straightforward to interpret as the value of the enablement action, or VGE.
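For concreteness, here is a plausible reconstruction of the linear model behind the figure above (the subscripts and symbols are ours, not from the original post):

```latex
\mathrm{leads}_i \;=\; \alpha_{m(i)} \;+\; c_1\,\mathrm{enabled}_i \;+\; c_2\,x_{i,2} \;+\; \dots \;+\; c_n\,x_{i,n} \;+\; \varepsilon_i
```

Here α_m(i) is the fixed effect for pro i's market m(i), enabled_i is the boolean enablement action, and the remaining x's are the exogenous pro features. The VGE is then the estimated coefficient c1.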

One might reasonably ask: why not just use simple regression? Why use these fixed effects? The answer: fixed effects modeling allows us to control for the shared effect of the natural grouping from the pro's market or industry using numeric techniques. This decouples the within-market variation from the between-market variation, and we are more interested in the within-market variation (i.e., between pros). Imagine a simpler problem with just one outcome and one feature x for each pro. Different pros end up as points in this two-dimensional space, and estimating the simple regression amounts to fitting a straight line. If there were multiple markets in this world, the scatter would likely be very "lumpy", with, say, carpenters represented as a smear in one corner and plumbers in another. Estimating a single straight line across all points would likely not be a truthful representation. Instead, we should estimate different straight lines for the different category-specific smears. Fixed effects modeling essentially allows us to do this using efficient algorithmic techniques. The coefficients "c" we get from an FEM tool (as opposed to those obtained from a non-FEM regressor) implicitly adjust for the known presence of different groups.
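The post does not name an estimation tool; as one illustration, the dummy-variable form of a fixed-effects regression can be fit with statsmodels (the dataset and column names are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-pro dataset: lead volume, enablement flag, a pro feature, and market.
df = pd.read_parquet("pro_outcomes.parquet")
df["enabled"] = df["enabled"].astype(int)  # boolean action encoded as 0/1

# C(market) adds one dummy per market: the dummy-variable form of market
# fixed effects. The coefficient on `enabled` is then the VGE estimate.
fit = smf.ols("leads ~ enabled + review_count + C(market)", data=df).fit()
print(fit.params["enabled"])  # estimated lift in lead volume from enablement
```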

We used all our available data (recall our proxy dataset) to train the above linear model. For every recommendation type, the corresponding coefficient served as our estimate of lead volume given that the pro enables the recommendation.

Inference

We decided on offline inference with a weekly refresh cadence due to the nature of the recommendations and of pro behavior. Pros typically engage with the carousel at a slower cadence than users of better-known carousels like content or movie recommendations. So, while real-time activity tracking and inference could improve performance, it is probably not necessary for a first attempt at personalization.

The following diagram depicts how we did offline inference.

Figure C. Offline inference for personalizing pro recommendations

  • Training the model was a one-time run whose output model object was stored in Google Cloud Storage (GCS).
  • All features come from an array of tables in BigQuery (BQ).
  • We deployed the offline inference job in Airflow using PythonVirtualenvOperator and stored weekly predictions back into BQ tables.
  • We used Airflow for scheduling and setting dependencies with downstream backend tasks that consume the output predictions.
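A skeletal version of such a job is sketched below. The DAG id, bucket, table names, and the callable body are all placeholders; all the post tells us is that the job used PythonVirtualenvOperator on a weekly cadence, read features from BQ, loaded the model from GCS, and wrote predictions back to BQ:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def run_inference():
    # Imports live inside the callable, since it runs in its own virtualenv.
    # Bucket, table, and model paths below are placeholders.
    import io

    import joblib
    from google.cloud import bigquery, storage

    bq = bigquery.Client()
    features = bq.query("SELECT * FROM `project.dataset.pro_features`").to_dataframe()

    blob = storage.Client().bucket("models-bucket").blob("pro_recs/model.joblib")
    model = joblib.load(io.BytesIO(blob.download_as_bytes()))

    features["p_enable"] = model.predict_proba(features.drop(columns=["pro_id"]))[:, 1]
    bq.load_table_from_dataframe(
        features[["pro_id", "p_enable"]], "project.dataset.pro_rec_scores"
    ).result()

with DAG(
    dag_id="pro_recs_offline_inference",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",  # weekly refresh cadence (Airflow 2.4+ keyword)
    catchup=False,
):
    PythonVirtualenvOperator(
        task_id="score_pro_recommendations",
        python_callable=run_inference,
        requirements=[
            "scikit-learn",
            "pandas",
            "db-dtypes",
            "joblib",
            "google-cloud-bigquery",
            "google-cloud-storage",
        ],
    )
```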

Performance Evaluation

Estimating the business impact of any model offline is a challenging task. Offline evaluation metrics such as ROC AUC and Average Precision often do not correlate well with online A/B testing results. Moreover, this was our first model-based initiative, so we had no historical data on how offline performance correlates with online performance.

From a binary classification perspective, we chose Average Precision and ROC AUC as the key metrics. From a ranking evaluation perspective, we chose Mean Rank, as it is simple and effective. We also estimated the lead volume impact we might observe in an A/B test by comparing baseline and variant rankers offline, which helped us decide the Minimum Detectable Effect (MDE) for our A/B test.
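For instance, the classification metrics can be computed directly with scikit-learn, and mean rank as the average position a candidate ranker assigns to the recommendations pros actually enabled (a sketch; variable names are ours):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def offline_metrics(y_true, y_score, group_ids):
    """y_true: 1 if enabled; y_score: ranker score; group_ids: one id per (pro, date) carousel."""
    y_true, y_score, group_ids = map(np.asarray, (y_true, y_score, group_ids))
    metrics = {
        "roc_auc": roc_auc_score(y_true, y_score),
        "avg_precision": average_precision_score(y_true, y_score),
    }
    # Mean rank: the average position (1 = top) the ranker assigns to
    # recommendations that were actually enabled. Lower is better.
    ranks = []
    for g in np.unique(group_ids):
        mask = group_ids == g
        order = np.argsort(-y_score[mask])               # best-scored first
        positions = np.empty_like(order)
        positions[order] = np.arange(1, mask.sum() + 1)  # rank of each item
        ranks.extend(positions[y_true[mask] == 1])
    metrics["mean_rank"] = float(np.mean(ranks)) if ranks else float("nan")
    return metrics
```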

Outcome

As measured by our A/B tests, personalizing recommendations significantly improved leads/bookings per pro by +1.5%. The increase in leads/bookings for pros was primarily driven by pros expanding into more categories, adding more job preferences and serving more geographies.

Future Work

This work has inspired us to take up future iterations that personalize not just the ranking of recommendations, but also their eligibility and content. Next steps include exploring an embedding model for pros which would utilize all available text information (intros, reviews, message text, etc.) to see if it improves our models.

Acknowledgements

This work would not have been possible without the immense support and teamwork of our backend engineers Maaz Mohamedy, Michael Cao, and Bharat Nayak, who evolved the backend architecture to support model-based recommendations, improved the runtime of backend jobs by 25x, and put an amazing signal-tracking framework in place for future model iterations. We deeply appreciate the help from our product and engineering partners Michelle Pan and Ewa Chang for facilitating this project. Last but not least, thanks to Xueying Yan for her depth of expertise on pro recommendations, for helping us understand the existing system, and for shaping this lane of work for the future.
