Review Relevance @ Thumbtack

Navneet Rao
Jan 27

Every year, millions of customers use Thumbtack’s marketplace to find the right local professionals for pretty much anything. Some use Thumbtack to quickly find a reliable plumber to fix their leaky garbage disposal, while others might use Thumbtack to consult with an interior designer when remodeling their home. Customers can find professionals for around 500 categories of services on Thumbtack. Based on factors like the type of job, their budget or the urgency of their need, a professional who would be right for one customer may not necessarily be the right one for another.

In this post, we will outline how we define the problem of relevance at Thumbtack, and how we leverage our review data to improve review relevance. We will also dive into how we built a snippet extraction pipeline for customer reviews of professionals on Thumbtack using machine learning.

Review Relevance @ Thumbtack

The term relevance is traditionally used in the context of search engines. Christopher Manning describes relevance as the art of ranking content for a search based on how well that content satisfies the needs of the user and the business. We always strive to improve search relevance. At Thumbtack, we also use the term more broadly, to refer to the content we surface to help customers narrow down to the right professional for their needs.

Review relevance refers to the content we surface using Thumbtack’s review data to help customers narrow down to the right professional for their needs. Thumbtack has over 5 million 5-star reviews written by customers detailing their interactions with professionals on our platform. When customers search for a professional in a specific category like Handyman, they see a ranked list of the professionals in their area. Reviews are one of the data sources we leverage to help customers decide which of those professionals to contact. Extracting a review snippet for a professional from past customer reviews is an example of a review relevance problem, which we describe in the next section.

Review Snippet Extraction

Customers need to know that a professional is accountable, reliable, timely, easy to communicate with, and safe to bring into their home. One of the ways we addressed this user problem was by shipping the review snippets feature. On the page that shows the ranked list of professionals in an area, we displayed a short snippet from the top review for each professional. You can see an example of these snippets for a set of professionals in the Music Lessons category in Fig 1.

Fig. 1: Example list of professionals with review snippets for Music Lessons

We hoped that showing review snippets demonstrating past experience would help customers build confidence in contacting a professional for their job, and it did. The feature led to a significant increase in customer conversion.

Review Data Characteristics

When customers write a review for a professional, they can optionally tag the review with up to 3 attributes they liked about the professional, choosing from professionalism, work quality, punctuality, value, and responsiveness.

Fig. 2 Write a review page

Previous Approach

At Thumbtack, we use Elasticsearch (ES) to store review data. The feature worked in the following way: given a category name (like Music Lessons), we ran a custom ES query to rank the reviews for a professional, and used ES’s highlighting capabilities to extract and highlight the top review snippet. Though this worked well enough at the time (e.g. as shown previously in Fig. 1), there were some issues with our approach:

  1. Due to the highly variable nature of how customers write reviews for a professional, what customers expressed in the review text did not always match the attributes they had tagged.
  2. The ES-based approach only highlighted terms matching the category name, e.g. words like music or lessons for the category Music Lessons.
  3. The snippet picked by the ES-based approach was sometimes neutral in tone, and did not always provide relevant information to help a customer make a decision.
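To make the old behavior concrete, the highlighting approach can be sketched as an ES query that matches a professional's reviews against the category name and asks ES for a highlighted fragment. The field names below (`pro_id`, `review_text`) are hypothetical; the production query was a custom variant of this idea.

```python
def build_highlight_query(category_name, pro_id):
    """Sketch of an Elasticsearch-style highlight query: match reviews for
    one professional against the category name and ask ES to highlight the
    matching terms in a short fragment."""
    return {
        "query": {
            "bool": {
                "filter": [{"term": {"pro_id": pro_id}}],
                "must": [{"match": {"review_text": category_name}}],
            }
        },
        "highlight": {
            "fields": {
                "review_text": {"fragment_size": 120, "number_of_fragments": 1}
            }
        },
        "size": 1,
    }

query = build_highlight_query("Music Lessons", "pro-123")
```

Because the `match` clause only scores category-name terms, the highlighted fragment can only ever emphasize words like music or lessons, which is exactly issue 2 above.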

New Approach

Recently, we decided to revisit this user problem and see how we could further improve the customer experience. We were interested not just in extracting high-quality review snippets, but also in whether we could extract snippets that illustrated one of the 5 attributes (professionalism, work quality, punctuality, value, and responsiveness) that the reviews were tagged with.

Some customers write a review in just one sentence, while others write entire paragraphs expressing their satisfaction or dissatisfaction. We wanted to pick sentences that illustrate specific attributes, regardless of whether or not the customer actually mentions that attribute by name. For example, a sentence like “They were really quick in getting back to me” can illustrate responsiveness. Using machine learning and natural language processing allowed us to model this better than ES could.

Extracting the best review snippet from the many reviews for a professional could be conceptualized as an extractive summarization problem over multiple documents, where each review represents a separate document. Since we have up to 3 attribute tags associated with each review, we also have at our disposal a multi-label text classification dataset.
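Concretely, each review's attribute tags can be binarized into a 5-dimensional label vector, giving a multi-label text classification dataset. A minimal sketch, with made-up reviews:

```python
# Fixed attribute order for the label vectors.
ATTRIBUTES = ["professionalism", "work_quality", "punctuality", "value", "responsiveness"]

# Toy reviews with the (up to 3) attribute tags customers selected.
reviews = [
    ("They were really quick in getting back to me.", {"responsiveness"}),
    ("Showed up on time and did flawless work.", {"punctuality", "work_quality"}),
    ("Great price for the quality delivered.", {"value", "work_quality"}),
]

def to_label_vector(tags):
    """Binarize a set of tags into a fixed-order 0/1 vector."""
    return [1 if attr in tags else 0 for attr in ATTRIBUTES]

X = [text for text, _ in reviews]
Y = [to_label_vector(tags) for _, tags in reviews]
```

From this multi-label dataset, one binary classifier per attribute can be trained by slicing out the corresponding column of `Y`.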

There have been many recent advances in the area of extractive text summarization. Zhong et al. frame it as a matching problem, where they try to select the candidate summary that most closely resembles the representation of the source document. Section 2 of their paper also describes more recent work in this problem space. Since this was our first attempt at applying machine learning to our problem space, we chose a simpler route, framing review and candidate-snippet ranking as a classification problem.

We first use supervised learning to train classification models that predict how strongly a review text illustrates each of the 5 attributes. We then apply a 2-stage approach to extract a snippet from all of a professional’s reviews.

Stage 1: Top K Reviews Generation

In stage 1, we select the top k reviews for a professional, for a specific attribute, from all of their reviews. This involves two steps:

  1. Attribute Selector: The attribute selector counts the number of times a professional’s reviews are tagged with each attribute. It then selects the attribute with the highest count.
  2. Review Attribute Scorer: The review attribute scorer uses the trained classification model for the selected attribute and ranks the reviews by how well each illustrates that attribute. It then keeps the top k reviews.

For example, in the case of a moving company named XYZ Moving, if we assume they have 50 reviews and the count of the attribute tags is highest for the responsiveness attribute, then we use the responsiveness classification model and rank the reviews by the highest scores for that attribute.
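The two stage-1 steps can be sketched as follows. The keyword-count scorer here is only a stand-in for the trained per-attribute classification model, and the reviews are made up:

```python
from collections import Counter

def select_attribute(reviews):
    """Attribute selector: pick the attribute that a professional's reviews
    are most often tagged with."""
    counts = Counter(tag for _, tags in reviews for tag in tags)
    return counts.most_common(1)[0][0]

def top_k_reviews(reviews, attribute, score_fn, k=3):
    """Review attribute scorer: rank reviews by the model's score for the
    selected attribute and keep the top k."""
    scored = [(score_fn(text, attribute), text) for text, _ in reviews]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Stand-in for the trained per-attribute model: a keyword heuristic.
KEYWORDS = {"responsiveness": {"quick", "responsive", "replied", "fast"}}

def mock_score(text, attribute):
    words = set(text.lower().split())
    return len(words & KEYWORDS.get(attribute, set()))

reviews = [
    ("Very quick and responsive movers", {"responsiveness"}),
    ("Replied fast to every question", {"responsiveness"}),
    ("Decent price overall", {"value"}),
]
attr = select_attribute(reviews)                      # "responsiveness"
best = top_k_reviews(reviews, attr, mock_score, k=2)  # top 2 reviews
```

In production, `mock_score` would be replaced by the likelihood output of the trained classifier for the selected attribute.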

Fig. 3: Top K reviews extraction

We then use the top-k reviews to extract the best snippet using our snippet extraction pipeline in stage 2, as described in the next section.

Stage 2: Review Snippet Extraction

We define a snippet extraction pipeline consisting of 4 phases:

Fig. 4 Review Snippet Extraction
  1. Snippet Candidate Generation: Each of the top k reviews is first split into sentences using NLTK (a text processing framework). A custom sentence merging algorithm then acts on the sentences for a review to appropriately merge shorter sentences into a set of candidate snippets. The classification model is re-applied on the snippets to get the likelihood scores associated with each candidate for a specific attribute.
  2. Snippet Candidate Highlighting: A linear classification model can generate feature importance scores for terms in the model’s vocabulary. We create a feature importance threshold above which we deem a term important enough to be highlighted. During runtime, we highlight the terms in the candidate snippet whose scores are higher than the threshold. E.g. For the responsiveness model, terms like quickly or responsive might have high feature importance scores, and would thus be highlighted in a sentence.
  3. Snippet Candidate Ranking: After candidate generation and highlighting, we rank the snippet candidates based on their likelihood scores.
  4. Snippet Candidate Filtering: In this phase, we apply LanguageTool (a grammar checker with a Python wrapper) to detect spelling errors within the candidate snippets. Snippets with spelling mistakes are filtered out, as are snippets shorter than a predetermined length. We then select the top remaining candidate as our final review snippet.
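The four phases above can be sketched end to end. Everything below is illustrative: the term weights stand in for the linear model's feature importances, the naive regex splitter stands in for NLTK's sentence tokenizer, and the length check stands in for the LanguageTool spell-check filter.

```python
import re

def split_sentences(review):
    """Naive sentence splitter (production uses NLTK's tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]

def merge_short(sentences, min_len=40):
    """Merge short sentences with their neighbors into candidate snippets."""
    candidates, buf = [], ""
    for s in sentences:
        buf = (buf + " " + s).strip()
        if len(buf) >= min_len:
            candidates.append(buf)
            buf = ""
    if buf:
        candidates.append(buf)
    return candidates

# Stand-in for per-term feature importances from a linear model.
WEIGHTS = {"quick": 2.1, "responsive": 1.8, "nice": 0.2}
THRESHOLD = 1.0

def score(snippet):
    """Stand-in likelihood score: sum of term importances."""
    return sum(WEIGHTS.get(w.lower().strip(".,!?"), 0.0) for w in snippet.split())

def highlight(snippet):
    """Mark terms whose importance exceeds the threshold."""
    return " ".join(
        f"*{w}*" if WEIGHTS.get(w.lower().strip(".,!?"), 0.0) > THRESHOLD else w
        for w in snippet.split()
    )

def extract_snippet(top_reviews, min_snippet_len=30):
    # 1. Candidate generation: split and merge each top-k review.
    candidates = [c for r in top_reviews for c in merge_short(split_sentences(r))]
    # 4. Filtering (spell check omitted in this sketch; length check only).
    candidates = [c for c in candidates if len(c) >= min_snippet_len]
    # 3. Ranking by likelihood score, then 2. highlighting the winner.
    candidates.sort(key=score, reverse=True)
    return highlight(candidates[0]) if candidates else None

snippet = extract_snippet(
    ["They were quick to respond. Really responsive crew and nice people overall."]
)
```

Here the word nice stays unhighlighted because its weight falls below the threshold, while quick and responsive are emphasized, mirroring the feature-importance highlighting described in phase 2.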

Experimentation Strategy

As firm believers in Agile iteration, we built up our eventual approach over 3 iterations. We started experimentation by focusing on just 1 attribute (professionalism), and once we were confident that we had a good extraction pipeline, we extended our work to all attributes and introduced the attribute selection mechanism described earlier.

For our classification model, we started with Logistic Regression, and then experimented with a few model families including Support Vector Machines, Gradient Boosted Decision Trees, and Convolutional Neural Networks. We used standard evaluation metrics (precision, recall, and F1) to evaluate review classification performance.
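As a refresher, these metrics reduce to simple ratios over the confusion counts for each binary attribute classifier. A minimal pure-Python version (the label vectors are made up):

```python
def precision_recall_f1(y_true, y_pred):
    """Binary-classification metrics from true and predicted 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. predictions for the responsiveness attribute on 6 reviews
p, r, f = precision_recall_f1([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 1, 0])
```

In practice a library such as scikit-learn computes these per attribute; the hand-rolled version is only to make the definitions explicit.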

Snippet Evaluation

Since there was no ground truth, we knew trying to evaluate the extracted snippets was going to be a little tricky. We decided that we would consider an extracted snippet high quality if it demonstrated:

  1. Appropriate Length — Not be too long or too short, ideally 1–3 short sentences
  2. Representation — Feature space representation strongly indicative of at least one of the five attributes
  3. Appropriate Spelling — No spelling mistakes in the snippet picked
  4. Narrative Cohesiveness — Has meaningful content that tells a good story within 1–3 sentences, and is not just a set of positive phrases

After each iteration, prior to the launch of an A/B test, we generated snippet variations that the team evaluated using the criteria defined above, and the best variant was picked for the A/B test. Since it’s hard to objectively score a variant against these criteria, we used a relative evaluation mechanism in which an evaluator decides which variant they like better.

Results

For the first 2 iterations of our experiment, there was no significant improvement in our key metrics. In our 3rd A/B test, we finally saw our efforts bear fruit. The test resulted in a significant improvement over baseline for our key customer conversion metric.

Fig. 5 illustrates an example of review snippets generated using the new approach for catering professionals in Cambridge, MA.

Fig. 5 Review snippets for catering professionals in Cambridge, MA

Future Work

There is more we can try by framing the problem as an extractive text summarization problem. We also hope to build more powerful BERT-based classification models that might improve the underlying snippet representation in the semantic space. Another avenue is using attention mechanisms with our neural-network-based approach to improve the highlighting of important terms in a sentence.

As we build the best platform to help homeowners fix, maintain or improve their home, there are many interesting machine learning challenges facing our two-sided marketplace. This includes problems related to search, ranking, relevance, monetization and recommendations. If the problem of review relevance seems interesting, or if you would love to tackle some of the other challenges mentioned above, come join us! We would love to have you aboard.

Acknowledgement

The new review snippet extraction approach was jointly investigated by the author and Richard Demsyn-Jones along with Tom Shull, Joe Tsay & Wade Fuller. This work would also not have been possible without help from Mark Andrew Yao, Yibai Shu as well as others in the Marketplace org at Thumbtack.

About Thumbtack

Thumbtack (www.thumbtack.com) is a local services marketplace where customers find and hire skilled professionals. Our app intelligently matches customers to electricians, landscapers, photographers and more with the right expertise, availability, and pricing. Headquartered in San Francisco, Thumbtack has raised more than $400 million from Baillie Gifford, Capital G, Javelin Venture Partners, Sequoia Capital, and Tiger Global Management among others.

Thumbtack Engineering

From the Engineering team at Thumbtack