Leaves to Neurons: Using Deep-Cross Networks for Ranking Search Results

Tim Huang
Thumbtack Engineering
May 19, 2023

Ranking at Thumbtack is an integral part of helping consumers find the perfect professional for their job. Millions of consumers search for professionals each year across various job categories on Thumbtack and choose from a ranked list of relevant professionals in their area. Professionals are ranked according to a relevance model that predicts the probability of user action, and so the quality of the ranking model plays an essential role in whether consumers click through to contact a professional and get a job done.

Thumbtack’s ranking architecture has evolved rapidly in the past 3 years, as detailed in the previous Evolution of Search Ranking blog post [1]. For a long period, variations of gradient boosted decision trees (GBDT) were the primary ranking model and previous attempts at testing neural networks for ranking did not produce the desired impact. However, after expanding our feature set and trying new architectures, we recently were able to successfully train and implement our first neural network ranking model using the novel Deep-Cross Network (DCN-V2) architecture [2]. In this blog post, I will dive into our interpretation of this architecture and our experience with applying neural networks to a large-scale ranking problem.

Problem context

Ranking is a unique and difficult problem at Thumbtack. With professionals on our platform ranging from house cleaners to landscapers, user behavior can vary significantly based on many factors. The optimal ranking, one that provides the best consumer experience and maximizes conversion, can depend on many different features, including:

  • Consumer job needs
  • Preferences & skills associated with the professional
  • Consumer features based on past behavior
  • Contextual features specific to the search request

Given our diverse and expanding feature set, GBDT models continued to perform well as the team iterated on new features, different training approaches, and position de-biasing of the training data. During this time, the team also evaluated neural networks to further utilize the capabilities of modern modeling techniques, both to better match consumers to great professionals and to unlock the door for more advanced techniques (such as supporting embeddings, text data, etc.). The team tested standard feed-forward deep neural network (DNN) architectures, but was unable to beat the performance of GBDTs in offline analyses.

How features affect model type

Why did GBDTs exhibit better performance than neural networks? GBDTs are not to be overlooked in ranking despite being an older technique: their performance on tabular data in ranking applications is still competitive with neural networks, as shown in “Are neural rankers still outperformed by gradient boosted decision trees?” [3].

For our analyses, we hypothesize that GBDTs outperformed the DNNs because GBDTs more easily learn implicit feature interactions. In a product environment like ours, feature interactions play a crucial role; for example, certain job type and zip code combinations may intrinsically have lower conversion rates. Feature interactions can be learned by the model (implicit) or manually captured with feature engineering (explicit). As mentioned in [3], tree-based models are known to learn and partition feature spaces effectively, since a decision tree essentially separates the data through branches of conditionals, creating clear feature interactions. Traditional DNNs, on the other hand, are known to struggle with learning higher-order feature interactions [3] and need assistance from engineered features that capture interactions explicitly. With limited engineered features, our GBDT models were likely able to learn more implicit interactions on that feature set than DNNs could.

A common way of capturing feature interactions through manual feature engineering is by multiplying features together to create higher order polynomial feature crosses. These are relationships that a DNN would typically struggle to model if the features are not already explicitly crossed, as DNNs use a complex series of nonlinearities to approximate functions. This is referenced in Table 1 of the DCN-V2 paper [2], where a DNN struggles with modeling polynomials of increasing difficulty. The diagram below further illustrates a roundabout way for DNNs to approximate polynomials through a series of ReLUs. Note that this limitation of DNNs applies to polynomials in general, but [2] focuses specifically on polynomial feature interactions.

From Brendan Fortuner on “Can neural networks solve any problem?” [4]

This isn’t to say that GBDTs can easily model polynomials either; to model a generic polynomial, a GBDT would use a combination of data splits to simulate the function. But to return to why we are focusing on polynomials: they are a mechanism for capturing feature interactions through feature crosses. Rather than relying on multiplying features into polynomials, GBDTs may better model interactions through a different mechanism, using the branches of decision trees to create logical “AND” conditions that also capture interactions. The exact reason why GBDTs model feature interactions better still needs further testing, but [3] does state that GBDTs handle feature interactions better than DNNs, and [2] suggests polynomial modeling as a means for DNNs to improve interaction modeling.
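To make “explicit feature crosses” concrete, here is what manual feature engineering of this kind looks like. This is a minimal sketch; the feature names and values are hypothetical, not actual Thumbtack data:

```python
import pandas as pd

# Hypothetical feature values for illustration only.
df = pd.DataFrame({
    "job_type": ["cleaning", "landscaping", "cleaning"],
    "zip_code": ["94103", "10001", "10001"],
    "historical_ctr": [0.12, 0.08, 0.10],
    "historical_contact_rate": [0.05, 0.02, 0.04],
})

# Explicit categorical cross: one combined key per (job_type, zip_code) pair.
# A GBDT can approximate this interaction with chained "AND" splits down a
# tree branch; a plain DNN cannot multiply or combine inputs directly and
# must approximate the interaction through layers of nonlinearities.
df["job_x_zip"] = df["job_type"] + "_" + df["zip_code"]

# Explicit numeric cross: a degree-2 polynomial term between two features.
df["ctr_x_contact"] = df["historical_ctr"] * df["historical_contact_rate"]
```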

Introducing the Deep-Cross Network (DCN-V2)

The DCN-V2 is a new architecture introduced by researchers at Google [2] to directly target these limitations of neural networks in large-scale click-through-rate (CTR) applications. The heart of the architecture is the cross layer, which allows the model to multiply inputs with each other. A single cross layer computes

x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l

where x_0 is the original input to the cross network, x_l is the output of the previous cross layer, W_l and b_l are learned parameters, and ⊙ denotes element-wise multiplication (equation from Wang et al., 2020 [2]).

By allowing the model to multiply inputs together, the model is able to explicitly create polynomials to replicate feature crosses. One interpretation of the cross layer is that it provides the neural network a mechanism for simulating the process of manually creating feature crosses. Traditionally in feature engineering, we would manually multiply two features together to get a cross capturing that feature interaction. However, manual feature exploration and engineering becomes increasingly difficult in a large-scale environment, given that this process “involves a combinatorial search space” and “often requires domain expertise” [2]. Now with a cross layer, a neural network can automatically learn and “explicitly” create these pseudo-feature crosses where it deems necessary, akin to manual feature engineering. Compare this to a traditional DNN, where at no point is an input directly multiplied by another input, but instead this multiplicative relationship has to be approximated through complex non-linearities.
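The cross layer equation translates to only a few lines of code. Below is a minimal sketch in TensorFlow/Keras, following the equation above; this is an illustration, not Thumbtack’s actual implementation:

```python
import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    """One DCN-V2 cross layer: x_{l+1} = x0 * (W_l x_l + b_l) + x_l."""

    def build(self, input_shape):
        # input_shape is a list: [shape of x0, shape of x_l]; both share dim.
        dim = int(input_shape[0][-1])
        # A Dense layer holds the learned weight matrix W_l and bias b_l.
        self.dense = tf.keras.layers.Dense(dim)

    def call(self, inputs):
        x0, xl = inputs
        # The element-wise product with the original input x0 creates the
        # explicit feature crosses; adding x_l back is the residual path
        # that lets features pass through when no crossing is needed.
        return x0 * self.dense(xl) + xl
```

(TensorFlow Recommenders also provides a ready-made implementation of this layer, tfrs.layers.dcn.Cross, if you would rather not roll your own.)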

This idea is similar to factorization machines (FM), which are also motivated by automating feature interaction learning within models. However, the DCN-V2 paper argues that models such as DeepFM have practical limitations, such as high computational costs and requiring all feature embeddings to have equal dimensions, that make them “impractical for industrial-scale applications” [2]. Additionally, the paper compares FM architectures and AutoInt (automatic feature interaction learning) with DNNs, and finds that DNNs surprisingly “performed neck to neck with most baselines and even outperformed certain models” [2]. DCN-V2, on the other hand, consistently outperformed DNNs.

A single cross layer is able to explicitly model quadratic relationships between two features, so a full network is able to explicitly model polynomials of order D+1, where D is the number of cross layers. Additionally, with the residual input being added back in, certain inputs that don’t necessarily need polynomial interactions can be passed through the cross layers. Having multiple cross layers also allows for not only interactions between features, but also interactions between features and intermediate representations.
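To see where the order D+1 comes from, one can expand the recursion (a sketch in the paper’s notation, dropping the bias terms for brevity):

```latex
\begin{aligned}
x_1 &= x_0 \odot (W_0 x_0) + x_0 && \text{entries of degree} \le 2 \text{ in } x_0 \\
x_2 &= x_0 \odot (W_1 x_1) + x_1 && \text{degree} \le 3 \\
&\ \ \vdots \\
x_D &= x_0 \odot (W_{D-1}\, x_{D-1}) + x_{D-1} && \text{degree} \le D + 1
\end{aligned}
```

Each cross layer multiplies the running representation by x_0 at most once more, so the maximum polynomial degree grows by one per layer.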

The DCN-V2 paper proposes a stacked and a parallel architecture for this neural network. The stacked architecture consists of a series of cross layers followed by a traditional dense network, while the parallel architecture runs the cross layers and dense layers side by side to produce crossed and non-crossed sets of representations. Note that in the proposed architecture, all categorical features first go through an embedding layer before being input to the cross/dense network.

This combination of cross and dense layers is likely motivated by the idea of keeping the advantages of a traditional DNN while solving some of its limitations. A DNN is still a very powerful model, with its ability to model complex non-linearities, and it shines especially with continuous features. The cross network tackles the limitations of DNNs mentioned earlier around feature interactions by providing additional expressiveness through feature crosses. However, a cross network alone is restrictive, since it can only model polynomials up to an order determined by the number of cross layers, which is why the DCN-V2 combines deep and cross networks. The interpretation that cross layers let the model simulate manual feature crossing is particularly apparent in the stacked architecture, where the structure suggests the cross network performs some “feature engineering” to explicitly model feature crosses and prepare the inputs for the dense network. The cross and dense networks work together to fill the gaps in each other’s limitations. A sketch of the stacked variant is shown below.

From Wang et al, 2020 [2]
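The sketch below wires the pieces together in the stacked configuration, reusing the CrossLayer defined earlier. The feature names, vocabulary sizes, layer widths, and dropout rate are all placeholders for illustration:

```python
import tensorflow as tf

def build_stacked_dcn(num_numeric, vocab_sizes, n_cross=2, hidden=(256, 128)):
    """Stacked DCN-V2 sketch: embeddings -> cross network -> dense network."""
    numeric_in = tf.keras.Input(shape=(num_numeric,), name="numeric")
    cat_inputs, embedded = [], []
    for name, vocab in vocab_sizes.items():
        inp = tf.keras.Input(shape=(), dtype="int64", name=name)
        dim = max(2, round(vocab ** 0.25))   # rule-of-thumb embedding size
        embedded.append(tf.keras.layers.Embedding(vocab, dim)(inp))
        cat_inputs.append(inp)
    x0 = tf.keras.layers.Concatenate()([numeric_in, *embedded])

    x = x0
    for _ in range(n_cross):   # cross network: learned "feature engineering"
        x = CrossLayer()([x0, x])
    for units in hidden:       # dense network consumes the crossed features
        x = tf.keras.layers.Dense(units, activation="relu")(x)
        x = tf.keras.layers.Dropout(0.2)(x)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # P(user action)
    return tf.keras.Model(inputs=[numeric_in, *cat_inputs], outputs=out)

# Placeholder feature counts and vocabulary sizes for illustration.
model = build_stacked_dcn(num_numeric=40,
                          vocab_sizes={"search_category": 1200,
                                       "zip_code": 40000})
model.compile(optimizer="adam", loss="binary_crossentropy")
```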

Experiment setup

We revisited neural networks with a push to apply DCN-V2, and analyzed how it learns compared to both a traditional DNN and the current production GBDT model. To make a fair comparison, a new training dataset was pulled to train the DCN-V2 and DNN models as well as to retrain the production GBDT. By this time, our GBDT feature set had grown from under 20 features to over 60. The same data and feature set were used for the DCN-V2, DNN, and GBDT to isolate the impact of the architecture change.

Each categorical feature was mapped to an embedding layer. We set the embedding dimensions roughly following the rule of thumb dimension = (number of unique values)**0.25, then fine-tuned them. We also added L2 regularization, which was not mentioned in the DCN-V2 paper, to each embedding that exhibited strong overfitting, in addition to the L2 regularization we applied to the cross and dense networks. Finally, we added dropout to the dense network, also not mentioned in the paper, which helped further reduce overfitting. The snippet below illustrates this setup.
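For example, the rule of thumb and a per-feature embedding penalty might look like this in Keras. The vocabulary size and regularization strength are placeholders; in practice the strength is tuned per feature:

```python
import tensorflow as tf

vocab_size = 1200                 # e.g. unique search categories (placeholder)
dim = round(vocab_size ** 0.25)   # rule of thumb: ~6 dimensions here

embedding = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=dim,
    # Per-feature L2 penalty on the embedding table; tuned more
    # aggressively for features that exhibit strong overfitting.
    embeddings_regularizer=tf.keras.regularizers.l2(1e-5),
)
```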

Results

A key behavior we noticed immediately was the DCN-V2’s tendency to overfit on the search-category (type of job the consumer is searching for) and zip code features. We traced the overfitting to these features by removing all regularization, allowing the model to overfit aggressively, and examining feature importances from these overfitted models. We then tuned the embedding regularization for search-category and zip code to tackle the overfitting that was causing poor validation performance.

Through feature importance analyses (permutation importance and Shapley plots), the DCN-V2 exhibited strong interactions between the search-category, zip code, and the historical click-through rate (% of impressions that lead to clicking on a professional) of that search-category. This was further corroborated by visualizing the kernel weights of the cross layer matrix in a heatmap, following the example in Figure 6 of the DCN-V2 paper [2]. The historical CTR has traditionally been among our strongest features, but the search-category feature increased in importance with the DCN-V2. This is in line with our expectations: given the diversity of services on our platform, search-category interactions should help provide better ranking quality. Additionally, when looking at the features that moved the most in feature importance between the DCN-V2 and the GBDT, the device feature (whether the user is on mobile or desktop) stood out with notably increased importance for the DCN-V2. A sketch of the permutation importance approach is below.
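Permutation importance itself is simple to sketch. The version below is a minimal illustration, assuming a fitted model whose predict method returns P(action) as probabilities and a NumPy feature matrix; it is not our production tooling:

```python
import numpy as np
from sklearn.metrics import log_loss

def permutation_importance(model, X, y, feature_names, n_repeats=3, seed=0):
    """Shuffle one feature column at a time and measure how much the
    log loss degrades; a bigger degradation means a more important feature."""
    rng = np.random.default_rng(seed)
    base = log_loss(y, model.predict(X).ravel())
    scores = {}
    for j, name in enumerate(feature_names):
        deltas = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature/label link
            deltas.append(log_loss(y, model.predict(Xp).ravel()) - base)
        scores[name] = float(np.mean(deltas))
    return scores
```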

We optimized and evaluated the models with log loss. The DCN-V2 exhibited a 0.59% lift (log loss reduction) over the retrained GBDT model. However, we observed that even training a traditional DNN on the expanded feature set produced a 0.32% lift over the retrained GBDT. These lifts seem on par with the experimental results in the DCN-V2 paper [2]: in Table 6, the DCN-V2 exhibited log loss improvements over a DNN ranging from 0.3 to 0.9% on their test datasets, though the paper does not provide a GBDT comparison. Our DNN’s lift over the GBDT once more features were added supports the hypothesis that GBDTs learn implicit interactions better while DNNs need engineered features that capture interactions explicitly: the expanded feature set includes more engineered features, such as the historical search-category-specific CTR, and having more precomputed features reduces the need for the DNN to learn implicit interactions. The addition of the cross layer then further pushes the capabilities of neural networks relative to GBDTs in capturing more interactions.
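For clarity, “lift” here is the relative log loss reduction. A small illustration with placeholder labels and predictions (not our actual data):

```python
import numpy as np
from sklearn.metrics import log_loss

# Placeholder labels and model probabilities for illustration only.
y_true = np.array([1, 0, 0, 1, 0])
p_gbdt = np.array([0.60, 0.30, 0.40, 0.70, 0.20])   # retrained GBDT baseline
p_dcn  = np.array([0.70, 0.25, 0.35, 0.75, 0.20])   # DCN-V2 predictions

ll_gbdt, ll_dcn = log_loss(y_true, p_gbdt), log_loss(y_true, p_dcn)
lift = (ll_gbdt - ll_dcn) / ll_gbdt
print(f"log loss reduction: {lift:.2%}")  # positive = DCN-V2 is better
```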

After running a visitor-randomized A/B test with the new DCN-V2 ranker at Thumbtack, we noticed a significant improvement in conversion specifically for native app users. We hypothesize this is because native users are logged in to the app, whereas a large portion of web users are not. Being logged in provides the ranker additional data about the consumer’s past behavior, which likely allows the DCN-V2 to learn consumer-specific interactions better than the GBDT and provide more personalized rankings, though this requires further analysis.

Next steps

Shipping our first neural network ranker is a milestone for ranking at Thumbtack. Not only does this help consumers find more relevant professionals more often, it also sets a precedent for devoting more time to neural networks and opens the door to more projects involving interesting near-state-of-the-art techniques. Now that we have this neural network infrastructure, some potential new ranking projects to experiment with include: training ranker models with listwise metrics instead of log loss, using raw text data from reviews as model input, and trying more advanced architectures beyond DCN-V2. With exciting ranking projects on the horizon, stay tuned for more ranking blog posts to come.

Acknowledgement

This milestone was achieved with contributions from Thumbtack’s ranking group including but not limited to: Samuel Oshin, Derek Zhao, Richard Demsyn-Jones, Navneet Rao, Dhananjay Sathe, Wade Fuller, Tom Shull.

References

[1] Rao, Navneet. “Evolution of Search Ranking at Thumbtack”. 2022.

[2] Wang, Ruoxi et al. “DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems.” Proceedings of the Web Conference 2021. Association for Computing Machinery, 2021.

[3] Qin, Zhen et al. “Are Neural Rankers still Outperformed by Gradient Boosted Decision Trees?.” International Conference on Learning Representations. 2021.

[4] Fortuner, Brendan. “Can neural networks solve any problem?”. 2017.
