5 Minute Paper Summary: Predicting Latent Structured Intents from Shopping Queries, by Google Shopping

Thomas Packer, Ph.D.
Published in TP on CAI · 6 min read · Jan 30, 2020

During an internship at Google Shopping, Chao-Yuan Wu joined forces with other Google employees to write this paper (I’m guessing), which was very satisfying to read. Right before I found this paper, I was working on multi-label query classification and helping my own co-workers write a paper about extracting named entities from queries. I was thinking, “I bet we don’t need to hand-label any entity names in query text given all the user behavior data we have access to. I wonder if anyone else has done this before.” I didn’t even go looking for a paper; it just appeared as a reference in another paper I was reading that very day, at which point a lightbulb above my head turned on. The paper is 9 pages long, and I read it immediately.

This paper is interesting because it shows one way to infer, from a search query alone, multiple attributes of the products a user of an e-commerce site might be targeting. They don’t need to actually extract any named entities from the query. And even though they use deep learning, they don’t need to hand-label any training data, given that they have access to a lot of e-commerce search users’ behavioral data.

The same techniques could, in principle, also be used for intent recognition and slot filling in a conversational AI (CAI) system, though you’d need a lot of dialogs with some kind of implicit feedback from a lot of users to do so.

Below, I detail the kinds of information I look for when adding a paper like this to the “related work” section of my own publications.

Task

Their task is general query-to-product-attribute mapping, i.e., predicting latent structured intents from shopping queries.

The goal of these models is to learn a function that maps a query to the set of attributes relevant to the intent of the user who issued this query.

Data

The needed data consists of the following:

  1. E-commerce queries issued to two search engines, Google and Google Shopping.
  2. Product attributes in a catalog of “millions” of products in the Google Shopping e-commerce website.
  3. The clicks users make on product pages after issuing a specific query.

Types of attributes:

  1. Feature tags like “queen size” as a property of mattresses and “waterproof” as a property of cameras.
  2. Age groups like “adult”.
  3. Categories like grocery, electronics, and clothing.
  4. Brands, product lines, and merchants.

For each query, we define a set of implied (associated) attributes to be the attributes contained in items clicked by users who issued the query. … These query-attribute-set pairs are then considered as ground-truth examples for our model.
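
To make the data setup concrete, here is a minimal sketch of my own (not code from the paper) of how such query-to-attribute-set pairs could be assembled from click logs and a product catalog. The field names and example values are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical inputs: click logs pair a query with a clicked product id,
# and the catalog maps each product id to its structured attributes.
click_logs = [
    {"query": "high-end bike", "product_id": "p1"},
    {"query": "high-end bike", "product_id": "p2"},
    {"query": "queen mattress", "product_id": "p3"},
]
catalog = {
    "p1": {"brand:acme", "feature:carbon frame", "category:bikes"},
    "p2": {"feature:21 speed", "category:bikes"},
    "p3": {"feature:queen size", "category:mattresses"},
}

# A query's implied attribute set is the union of the attributes of all
# products clicked by users who issued that query.
query_to_attributes = defaultdict(set)
for event in click_logs:
    query_to_attributes[event["query"]] |= catalog[event["product_id"]]

# These (query, attribute-set) pairs are the weak ground truth -- no
# hand-labeled entity spans in the query text are needed.
training_pairs = list(query_to_attributes.items())
```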

Approach

They treat query understanding as a multi-label text classification problem using only implicit user feedback as ground truth. That includes predicting one or more product categories per query. But it also, most notably, replaces NER: there is no need to hand-label any named entities in the training data. They do this using a combination of two Bi-LSTMs (character-level and word-level) and an autoencoder network.

A key idea of our model is to jointly train a query network, that learns the query-to-attributes mapping from past users’ interaction responses to the presented results, with a product network, that learns attribute correlations from product metadata. Joint training is achieved by using a shared layer of attribute embeddings. To model unstructured queries in the query network, a highly flexible function class is needed. In this paper we adopt Long Short-term Memory (LSTM) bidirectional recurrent neural networks (BRNNs) and to achieve both robustness and generalizability, we propose a hybrid word-level, character-level approach, that effectively ensembles a word-level model, which works well for head queries, and a character-level model, that works well for tail queries…
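
As I read it, the query side looks roughly like the PyTorch-style sketch below: a word-level Bi-LSTM and a character-level Bi-LSTM encode the full query, their final states are combined, and the result is scored against a shared attribute-embedding matrix. The layer sizes and the simple concatenation of the two encoders are my assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class HybridQueryNetwork(nn.Module):
    """Sketch of a hybrid word-level + character-level Bi-LSTM query encoder."""

    def __init__(self, n_words, n_chars, n_attributes, dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)
        self.char_emb = nn.Embedding(n_chars, dim)
        self.word_lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.char_lstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        # Attribute embeddings shared with the product (autoencoder) network.
        self.attribute_emb = nn.Embedding(n_attributes, 4 * dim)

    def forward(self, word_ids, char_ids):
        # Final hidden states of the forward and backward directions.
        _, (word_h, _) = self.word_lstm(self.word_emb(word_ids))
        _, (char_h, _) = self.char_lstm(self.char_emb(char_ids))
        query_vec = torch.cat(
            [word_h[0], word_h[1], char_h[0], char_h[1]], dim=-1
        )  # shape: (batch, 4 * dim)
        # One logit per attribute via dot products with the shared embeddings.
        return query_vec @ self.attribute_emb.weight.T
```

A multi-label loss (e.g., binary cross-entropy over these logits against the click-derived attribute sets) would complete the query side; the joint training with the product network comes next.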

Jointly trained hybrid RNN & autoencoder. We propose to jointly train a metadata network that models the correlations between labels. We seek to augment the query with terms learnt from user consumption behavior.

The product attribute network in our model is one form of autoencoder. However, our goal is to jointly train a better attribute embedding, instead of obtaining the representation as in traditional settings.

We consider a much simpler and efficient alternative that trains both the character-level and word-level RNNs on full queries.
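
Here is a hedged sketch of how I picture the joint objective: the product network is an autoencoder over a product’s multi-hot attribute vector, it reuses the query network’s attribute-embedding matrix, and both multi-label losses are optimized together. The loss weighting, the averaging encoder, and decoding by dot product are my assumptions.

```python
import torch
import torch.nn as nn

class ProductAttributeAutoencoder(nn.Module):
    """Sketch: reconstructs a product's multi-hot attribute vector while
    sharing the attribute-embedding matrix with the query network."""

    def __init__(self, attribute_emb: nn.Embedding):
        super().__init__()
        self.attribute_emb = attribute_emb               # shared layer
        dim = attribute_emb.weight.shape[1]
        self.encoder = nn.Linear(dim, dim)

    def forward(self, attr_multi_hot):
        # Encode: average the embeddings of the attributes present on the product.
        summed = attr_multi_hot @ self.attribute_emb.weight
        avg = summed / (attr_multi_hot.sum(-1, keepdim=True) + 1e-6)
        hidden = torch.relu(self.encoder(avg))
        # Decode: score the hidden vector against the same shared embeddings.
        return hidden @ self.attribute_emb.weight.T

bce = nn.BCEWithLogitsLoss()

def joint_loss(query_net, product_net, word_ids, char_ids,
               click_attr_targets, product_attr_vectors, alpha=1.0):
    """Both terms are multi-label BCE; gradients from both networks flow
    into the shared attribute embeddings."""
    query_logits = query_net(word_ids, char_ids)
    product_logits = product_net(product_attr_vectors)
    return bce(query_logits, click_attr_targets) + alpha * bce(
        product_logits, product_attr_vectors
    )
```

In this sketch, constructing the product network as ProductAttributeAutoencoder(query_net.attribute_emb) is what ties the two networks together through the shared layer.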

In the extrinsic evaluation (DCG scoring), they re-rank results from a legacy production retrieval system using the top K identified latent attributes. I gather that during this evaluation they allowed the new system to make use of pseudo-relevance feedback, query expansion, and all the other parts of the legacy system that they did not mention, though this is not clearly stated. The paper also does not fully describe what the legacy production model is.
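
My rough mental model of that re-ranking step, with the boost weight and k as assumed parameters rather than values from the paper:

```python
def rerank(legacy_results, predicted_attributes, product_attrs, k=3, boost=1.0):
    """Sketch: boost the legacy score of products carrying any of the
    top-k attributes predicted for the query.

    legacy_results: list of (product_id, legacy_score) from the old system
    predicted_attributes: attributes sorted by model confidence, best first
    product_attrs: product_id -> set of catalog attributes
    """
    top_k = set(predicted_attributes[:k])
    rescored = [
        (pid, score + boost * len(top_k & product_attrs.get(pid, set())))
        for pid, score in legacy_results
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```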

Results

They report a 2 to 4 point gain in F1-score from jointly training the autoencoder compared to using only a query model, and another 2 to 4 point gain from using the hybrid character-word model compared to either the character or the word model alone.

The approach also significantly improves the DCG score compared to the legacy ranking.

It is interesting that an F-measure of 0.544 in attribute extraction produced an 11% gain in DCG@1 in production, while an F-measure of 0.473 (their MLP baseline) produced no gain in DCG@1.
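
For readers unfamiliar with the metric, DCG@k is just a discounted sum of relevance grades over the top k ranked results; a minimal version (with made-up relevance grades) looks like this:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results."""
    return sum(
        rel / math.log2(rank + 2)        # rank 0 is discounted by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )

# Made-up relevance grades for the same results under two rankings.
print(dcg_at_k([2, 3, 0, 1], k=1))   # legacy ranking puts a grade-2 item first
print(dcg_at_k([3, 2, 1, 0], k=1))   # re-ranking puts the grade-3 item first
```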

Their ability to improve DCG compared to production is very sensitive to the number of extracted attributes to use. Use too many or too few, and the system hurts ranking results. They tuned this number.

We hypothesize that in future work where we use soft predictions [confidence scores] for score-boosting, the need to find an optimal k can be eliminated.

Contributions

The approach handles any kind of important query attribute or latent product attribute, including categories and named entities. I believe it performs entity recognition and resolution in one step and should also be able to handle subjective/qualitative/relative attributes like “big” as well as spelling errors without any additional effort.

I especially like their contribution to unsupervised ML:

Here we consider this problem in an unsupervised setting where correlations are learned instead of being given via human annotations as in knowledge graphs. Our solution is thus more general and scales better.

we do not assume that any terms in our query refer to any specific entities. Instead, we want to understand the latent intent of a query and find the implied attributes. In other words, note that in query “high-end bike”, none of the terms in this query refers to “21 speed” or “carbon frame” but they are the likely attributes implied.

Insights

By using an autoencoder to learn the natural correlations among attributes, low-frequency attributes have a better chance of being predicted, as the results demonstrate.

Concerns

Their F1 score improved by only 2 points using the autoencoder, which is about what you might get by balancing (up-sampling) the infrequent attributes. It is also the infrequent attributes that get the most boost in F1. Did they try to up-sample?
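
For concreteness, this is the kind of baseline I have in mind (my own sketch, not something from the paper): weight each training pair by the inverse frequency of its rarest attribute so that tail attributes are sampled more often.

```python
from collections import Counter

def upsampling_weights(training_pairs):
    """Sketch: weight each (query, attribute_set) pair by the inverse count
    of its rarest attribute, so tail attributes get over-sampled.
    Assumes every pair has a non-empty attribute set."""
    attr_counts = Counter(
        attr for _, attrs in training_pairs for attr in attrs
    )
    return [
        1.0 / min(attr_counts[attr] for attr in attrs)
        for _, attrs in training_pairs
    ]
```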

Regarding their learning co-occurrence pattern of product attributes: If they train the product network on the product metadata alone (not weighted or up-sampled by purchase frequency), will the correlations really reflect those of the latent attributes implied by queries? I wonder if they might get better results by weighting latent attributes by query frequency.

How was F-measure computed? Macro-averaged over all attributes? Weighted? I assume macro-averaged based on how the numbers seem to correspond to the figures, but I’m not sure.
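
The distinction matters for comparing the numbers above; with scikit-learn on multi-label indicator matrices it is only a change of the `average` argument (the matrices below are made up):

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up multi-label indicator matrices: rows are queries, columns are attributes.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])

print(f1_score(y_true, y_pred, average="macro"))     # every attribute counts equally
print(f1_score(y_true, y_pred, average="weighted"))  # frequent attributes count more
```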

Qualitative evaluation shows us only positive (correct) examples with no mention of selection process. Are these cherry-picked? What do mistakes look like?

They would probably get a stronger signal of latent attributes if they used data generated when a user explicitly clicks on a product type or another attribute in a left-hand navigation filter.

Conclusions

Their approach is based on fairly basic deep learning that intelligently identifies correlations between parts of a query (using both a word-level and a character-level model) and the attributes of products found in implicit user feedback, which an e-commerce site generates every day at no extra cost.

They also jointly train a product-attribute autoencoder to find correlations among attributes. This improves the F-measure for attributes of tail queries.

Resources

Join the CAI Dialog on Slack at cai-dialog.slack.com
