Making Walmart’s Shopping Assistant Proactive

Saving time for our customers using proactive actions, predictive suggestions, and feature discovery.

Komal Dhuri
Walmart Global Tech Blog


We introduced Walmart’s retail voice agent, currently available on Google Assistant and Siri, to reduce friction during online shopping. The assistant lets users hold seamless conversations and smartly offers products based on personal preferences and past orders. Users prefer a faster voice-based experience for searching or reordering products by simply saying something like “get me the milk I usually buy.” Although we are still in the early stages of our voice assistant’s development, we have seen steady growth in new and engaged users. As the user base has grown, we have observed that new users need help discovering the assistant’s capabilities. Surfacing those capabilities by prompting and guiding the user has consistently proven to drive better engagement.

To enable such feature discovery, we built a proactive module: an ensemble model that uses the user’s external environment interactions and current conversational interactions as context to provide personalized hybrid recommendations, predictive suggestions, and proactive actions. These act as visual aids that help a user complete their task. In the rest of the article, we dive deeper into the concepts and NLP algorithms we used to solve this problem.

Walmart Assistant on Google Assistant

What are proactive actions and predictive suggestions?

Proactive actions are tasks performed right away at the very beginning of the conversation. They are performed with minimum interaction between the user and the assistant. These actions provide a starting point for a conversation with the assistant, give quick access to important pending tasks, and help track previously completed tasks. For instance,

If a user has a pending order with Walmart, or if an order was recently delivered, picked up, or canceled, the assistant will prompt the user to do an “Order status check” as soon as it is invoked.

Currently, the user is prompted to perform such tasks. Over time, as the assistant is trained on more instances of proactive actions successfully used by users, along with other personalized features, it will become more confident in taking such actions by itself, providing a more refined and enhanced user experience.

Predictive suggestions are triggered as follow-up suggestions to the main proactive action. They help a user navigate through a task, reducing conversational confusion. For instance,

Along with a follow-up confirmation question to a request to remove an item from the user’s cart, the assistant also prompts predictive suggestions such as “Affirmative”, “Negative”, and “Show my cart”.

For FAQs related to pickup, the assistant prompts predictive suggestions like “Book a time”, “Store pickup hours”, and “Store location”.

Model Architecture

Proactive API Architecture

Our intent and entity classification module, previously described here, acts as the main NLU architecture of our voice assistant. We enhanced this architecture by incorporating a proactive module. This module automatically generates diverse predictive suggestions, performs proactive actions, and sends reminders to the user. It is an ensemble of a real-time external event-based model and a dialogue-based query model.

Typically, users combine devices, apps, and other modalities in their shopping journey. They might add products to their cart using the voice assistant, check out using the app, and check order status on Walmart chat. To proactively assist users for better task satisfaction, we focus on the information coming from all these modalities of a user’s shopping journey and call it the external context. This information is used as contextual signals, or features, in the event-based model to assist the text-based query model.

The query model treats next-action prediction as a multi-label classification task over conversations. It is trained to predict all possible next actions based on the dynamically changing conversation within the user session. For this multi-label classification, we used a deep learning approach with contextual embeddings instead of a traditional one-vs-all approach.
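To make the multi-label framing concrete, here is a minimal sketch of how a set of valid next actions can be encoded as a multi-hot target vector; the action names are illustrative, not the assistant’s actual label set.

```python
# Hypothetical action vocabulary for the next-action classifier.
ACTIONS = ["affirmative", "negative", "show_cart", "order_status",
           "book_time", "store_hours"]
ACTION_INDEX = {a: i for i, a in enumerate(ACTIONS)}

def to_multi_hot(next_actions):
    """Map a set of valid next actions to a 0/1 target vector.

    Unlike single-label classification, several entries may be 1 at
    once, since multiple follow-ups can be valid for the same turn.
    """
    vec = [0] * len(ACTIONS)
    for action in next_actions:
        vec[ACTION_INDEX[action]] = 1
    return vec

# After "remove <product_y> from my cart", several follow-ups are valid:
target = to_multi_hot({"affirmative", "negative", "show_cart"})
print(target)  # -> [1, 1, 1, 0, 0, 0]
```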

The outputs from these two models are combined by a filter layer, which applies a weighted combination of business ranks and the confidence scores from the query model and the real-time external-context model.
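A minimal sketch of such a filter layer, assuming per-action scores from each model and a business-priority score; the weights and score values are illustrative assumptions, not the production configuration.

```python
def fuse_scores(query_scores, event_scores, business_rank,
                w_query=0.5, w_event=0.3, w_rank=0.2):
    """Blend per-action scores from the dialogue query model, the
    external event-based model, and a business-priority rank into a
    single ranked list of candidate actions. Weights are assumed."""
    actions = set(query_scores) | set(event_scores) | set(business_rank)
    fused = {
        action: (w_query * query_scores.get(action, 0.0)
                 + w_event * event_scores.get(action, 0.0)
                 + w_rank * business_rank.get(action, 0.0))
        for action in actions
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse_scores(
    query_scores={"show_cart": 0.9, "order_status": 0.4},
    event_scores={"show_cart": 0.1, "order_status": 0.9},  # pending order
    business_rank={"show_cart": 0.5, "order_status": 0.7},
)
# The external-context signal (a pending order) outranks the query
# model's preference, so "order_status" is surfaced first.
```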

BERT Model Architecture for Next Sentence Classification
BERT Next Sentence Classification: Image Source https://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html

For contextual embedding in the multi-label classification task, we used the pre-trained BERT language model. It is a multilingual transformer-based model that has achieved state-of-the-art results on various NLP tasks (Devlin et al., 2018). This model was fine-tuned on synthetically generated user-turn dialogues. The original pre-trained model learns (via the next sentence prediction task) to draw a relationship between two sentences, sentence_a and sentence_b. Our multi-label classification model further uses this information to predict all possible next actions for the user dialogue up to the current time instant. This deep learning architecture was trained with sigmoid_cross_entropy_with_logits().
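The key property of this loss is that each label gets an independent sigmoid cross-entropy term, so several next actions can be correct at once. Below is a pure-Python sketch of the element-wise formula, in the numerically stable form used by TensorFlow’s tf.nn.sigmoid_cross_entropy_with_logits; in the real model these logits come from a classification head on top of BERT’s pooled output.

```python
import math

def sigmoid_cross_entropy_with_logits(logits, labels):
    """Element-wise sigmoid cross-entropy, stable form:
    max(x, 0) - x*z + log(1 + exp(-|x|))
    where x is a logit and z is the 0/1 target for that label."""
    return [max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
            for x, z in zip(logits, labels)]

# One training example: two valid next actions, one invalid.
losses = sigmoid_cross_entropy_with_logits(
    logits=[2.0, 1.5, -3.0],   # raw scores from the classification head
    labels=[1.0, 1.0, 0.0],    # multi-hot targets
)
# All three predictions agree with the targets, so each loss is small.
```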

Arranging Data for Model

The data used for the above classification task is an 80-to-20 mix of synthetically generated in-house dialogue data and dialogue from live logs, respectively. The Apriori algorithm was applied to the live logs to analyze patterns in the usage of different task flows. This analysis of permutations and combinations of task flows was used to synthetically generate a large dataset covering over 20 task flows and their respective sub-flows. The large Walmart catalog was used to ensure that the fine-tuned model also learns domain-specific jargon. Below is an example of a synthetically created user-turn dialogue.

Invocation: Talk to my assistant.
Add to cart: Add <product_x>.
Affirmative: yes, add 1 from <brand_x> please.
Remove an item: remove <product_y> from my cart.
Show me my cart: What are the products in my cart.
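A hypothetical sketch of the generation step: a mined task flow (a sequence of intents) is expanded into user turns by filling slot templates with catalog items. The flow names, templates, and products here are illustrative stand-ins, not the actual in-house data.

```python
import random

# Per-intent utterance templates; {product} slots are filled from the catalog.
TEMPLATES = {
    "invocation":  ["Talk to my assistant."],
    "add_to_cart": ["Add {product}.", "Put {product} in my cart."],
    "remove_item": ["Remove {product} from my cart."],
    "show_cart":   ["What are the products in my cart?"],
}
CATALOG = ["whole milk", "brown eggs", "orange juice"]

def generate_dialogue(flow, rng):
    """Expand one task flow into a sequence of synthetic user turns."""
    return [rng.choice(TEMPLATES[intent]).format(product=rng.choice(CATALOG))
            for intent in flow]

rng = random.Random(7)  # seeded for reproducible synthetic data
dialogue = generate_dialogue(["invocation", "add_to_cart", "show_cart"], rng)
```

Enumerating frequent flow patterns this way makes it cheap to cover many permutations of sub-flows with catalog-specific vocabulary.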

Evaluation Metrics

Accuracy_score was used as one of the evaluation metrics for this multi-label classification task. In addition, Hamming loss and the samples-averaged F1 score were also used.

Accuracy_score:
It is the strictest metric, indicating the percentage of samples that have all their labels classified correctly.

F1 samples:
Calculates the metric for each instance and averages over instances (meaningful only for multi-label classification, where it differs from accuracy_score).

Hamming loss:
The fraction of labels that are incorrectly predicted, i.e., the number of wrong labels divided by the total number of labels.
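The three metrics can be sketched in a few lines of pure Python over multi-hot label vectors; scikit-learn provides equivalents via accuracy_score, hamming_loss, and f1_score(average="samples").

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label set is exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted wrongly."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred)
                for ti, pi in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))

def f1_samples(y_true, y_pred):
    """F1 computed per sample, then averaged over samples."""
    scores = []
    for t, p in zip(y_true, y_pred):
        tp = sum(ti and pi for ti, pi in zip(t, p))
        denom = sum(t) + sum(p)
        scores.append(2 * tp / denom if denom else 1.0)
    return sum(scores) / len(scores)

# Two samples, three labels; the second sample has one extra label.
y_true = [[1, 1, 0], [0, 1, 0]]
y_pred = [[1, 1, 0], [1, 1, 0]]
```

Note how the metrics diverge: only one of two samples is fully correct (subset accuracy 0.5), yet only one of six individual labels is wrong (Hamming loss ≈ 0.17), and the partially-correct sample still earns partial F1 credit.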

Thanks to the smaller BERT models open-sourced by Google, we were able to experiment with several model sizes and evaluate their performance and impact on latency.

Model Accuracy scores
Multi-label Classification Metric Comparisons

Other evaluation metrics, such as user hit ratio, click rate, and positional click rate, were used to gauge the enhanced user experience with the assistant. We noticed that,

20 to 30% of daily user queries are clicked suggestions.

This further strengthened the hypothesis that proactive actions and predictive suggestions are key for reducing friction and building a stronger user-assistant relationship.

Conclusion

The main focus of this blog was to introduce proactive actions and predictive suggestions in the conversational assistant to bridge the gap between user expectations from their voice assistant and technical capabilities of Natural Language Processing. In the future, Walmart’s assistant will keep getting smarter by leveraging more contextual signals of the user. We are just getting started on a multi-year journey to make Walmart’s voice assistant truly proactive to enable seamless shopping experience for our customers.

References

[1] NLP: Extract contextualized word embeddings from BERT (Keras/TF): https://towardsdatascience.com/nlp-extract-contextualized-word-embeddings-from-bert-keras-tf-67ef29f60a7b

[2] Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)

[3] https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits

[4] https://github.com/google-research/bert
