Knowledge distillation of multilingual BERT for Walmart’s conversational AI assistant

Deepa Mohan
Walmart Global Tech Blog
May 25, 2021 · 7 min read
Walmart’s askSam AI assistant

Introduction

Walmart’s conversational AI assistant has been enabling voice-based shopping through the Google Assistant and Siri platforms for over two years. We extended the AI assistant to our store associates, who could benefit tremendously from its Natural Language Understanding (NLU) capabilities and serve our customers better. Store managers have been using the assistant to get the location of products in the store, check stock for different items, query top sales for departments, and even check on fellow associates’ schedules.

The core component of our askSam AI assistant is the NLU pipeline, which is an ensemble of robust, state-of-the-art machine learning models. Intent classification and Named Entity Recognition (NER) are vital NLU tasks that enable us to understand our associates’ intent and extract relevant entities from their text and speech queries.

Product Entity tagging in the Walmart askSam app

The voice assistant has recently been adopted by our store associates internationally as well. We had to extend our assistant’s NLU capabilities to understand and converse with our Spanish-speaking associates in the U.S. and Mexico. This led to the development of a multilingual retail voice assistant that would enable us to scale to more countries such as Canada and Chile. The following table summarizes some of the Spanish voice queries commonly seen by our assistant and their corresponding NER tags.

Walmart askSam voice queries and NER tags

Our conversational assistant is an extremely latency-sensitive application. It was critical for us to investigate deep learning model architectures optimized for latency and memory footprint, so we could efficiently scale and serve over a million multilingual queries worldwide. In this blog, we discuss how we enabled fast NER on Spanish and English queries, with a focus on multilingual BERT knowledge distillation.

Most open-source multilingual transformer models are massive, with over 110 million parameters, and computationally expensive. The lack of labeled retail data in Spanish, the huge amount of unlabeled data from our Walmart Mexico catalog, the diversity of native language accents, and the cost of deployment infrastructure were a few of the factors we considered in making architecture choices for our multilingual downstream NER task. Knowledge distillation of multilingual BERT helped us come up with a lightweight, fast, production-friendly version that retained over 97% of the teacher model’s F1 score for most product-attribute entities and showed around a 4x inference speedup with 2x compression.

Fine-tuning multilingual BERT

We initially fine-tuned the pre-trained multilingual BERT released by Google Research on our Spanish and English conversation data. For Spanish, we prepared a dataset of manually curated voice query templates that are reflective of our associates’ usage patterns in the stores. Using Walmart products and brands from our Walmart Mexico catalog, paraphrasing techniques, and language translation APIs, we generated around one million labeled samples for English and Spanish. This served as the baseline experiment for our multilingual NER task. Though this model performed surprisingly well on unseen Spanish queries in the live logs, its latency was still very high compared to the distilled BERT models already deployed in production for our English NLU systems.
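For context, the sketch below shows what fine-tuning multilingual BERT for token classification looks like with the Hugging Face transformers library. The entity tags, the sample query, and the label-alignment scheme are illustrative stand-ins, not our actual training setup.

```python
# A minimal sketch of fine-tuning mBERT for NER-style token classification.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-PRODUCT", "I-PRODUCT", "B-BRAND", "I-BRAND"]  # illustrative tag set

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

# One hypothetical labeled Spanish query; word-level tags are aligned to word pieces below.
words = ["donde", "esta", "la", "leche", "lala"]
word_tags = ["O", "O", "O", "B-PRODUCT", "B-BRAND"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
labels = [-100 if wid is None else LABELS.index(word_tags[wid])
          for wid in enc.word_ids(0)]          # -100 is ignored by the loss
labels = torch.tensor([labels])

outputs = model(**enc, labels=labels)          # cross-entropy over entity tags
outputs.loss.backward()                        # an optimizer step would follow
```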

After researching several distillation techniques and cloud offerings, we found that XtremeDistil, published by Microsoft Research, showed promising results with respect to performance, model compression, and speedup for the multilingual NER task. In close collaboration with the Microsoft Research team, we applied its multi-stage distillation techniques to the massive transformer models to derive student transformer models for our Walmart assistant’s Spanish NLU use case.

Transformer Distillation architecture

The distillation process detailed in the paper offers flexibility in selecting the following configurations (a code sketch of these choices follows the list):

  1. Architecture of the teacher model (mBERT, XLM-RoBERTa, etc.)
  2. Architecture of the student model (TinyBERT, MiniLM, BiLSTM, etc.)
  3. Tokenizer (BERT WordPiece tokenizer, SentencePiece tokenizer, etc.)
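As a rough illustration, the snippet below loads one teacher/student pairing from the public Hugging Face Hub checkpoints corresponding to these variants. The actual distillation runs were driven by the XtremeDistil training code, so this is only a sketch of the configuration choices.

```python
# Illustrative loading of teacher/student/tokenizer combinations.
from transformers import AutoModel, AutoTokenizer

teachers = {
    "mbert": "bert-base-multilingual-cased",         # BERT WordPiece tokenizer
    "xlm-roberta": "xlm-roberta-base",               # SentencePiece tokenizer
}
students = {
    "tinybert": "huawei-noah/TinyBERT_General_4L_312D",
    "minilm": "microsoft/Multilingual-MiniLM-L12-H384",  # reuses the XLM-R SentencePiece vocab
}

teacher_tok = AutoTokenizer.from_pretrained(teachers["mbert"])
teacher = AutoModel.from_pretrained(teachers["mbert"])

student_tok = AutoTokenizer.from_pretrained(students["tinybert"])
student = AutoModel.from_pretrained(students["tinybert"])
```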

We experimented with both mBERT and XLM-RoBERTa as the teacher model, and TinyBERT and MiniLM as the student model, weighing the performance loss due to distillation against the inference latency gains.

The training objectives of the teacher, the number of layers in the deep network, the number of attention heads, the data the teacher models were pre-trained on, the amount of context they can keep, the embedding sizes, and other hyper-parameters greatly influenced the distillation objectives of the student model.

Tokenization

We experimented with mBERT as the teacher model with both the WordPiece tokenizer and the SentencePiece tokenizer. We also evaluated XLM-RoBERTa with the SentencePiece tokenizer. In-house normalization was also applied before tokenization to handle retail-specific punctuation and to ignore language-specific accents and case.
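The snippet below is a simplified sketch of that flow: a hypothetical normalization step (the exact in-house rules are not shown here) followed by the WordPiece and SentencePiece tokenizers.

```python
# Illustrative pre-tokenization normalization plus the two tokenizer families.
import re
import unicodedata
from transformers import AutoTokenizer

def normalize(query: str) -> str:
    query = query.lower()
    # Drop accents: "dónde está la leche" -> "donde esta la leche"
    query = "".join(c for c in unicodedata.normalize("NFD", query)
                    if unicodedata.category(c) != "Mn")
    # Collapse punctuation that is noise for retail queries (illustrative rule)
    query = re.sub(r"[^\w\s%$.]", " ", query)
    return re.sub(r"\s+", " ", query).strip()

wordpiece = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
sentencepiece = AutoTokenizer.from_pretrained("xlm-roberta-base")

q = normalize("¿Dónde está la leche Lala de 1L?")
print(wordpiece.tokenize(q))      # WordPiece sub-tokens
print(sentencepiece.tokenize(q))  # SentencePiece sub-tokens (▁-prefixed)
```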

Dataset

We performed our distillation experiments on three different in-house labeled Walmart conversational NER datasets: search refinement for shopping, multiple-product entity recognition for text shopping, and the askSam product search use case. The proportion of labeled English data was larger than that of labeled Spanish data due to constraints such as annotator availability, inaccuracies of language translation APIs on retail data, and the fact that we were bootstrapping our assistant for Spanish language understanding for the first time. Our entities were broadly classified into entities specific to product attributes, such as product and brand, and generic entities, such as zip code and time reference. We also leveraged the vast amount of unlabeled data from both our extended assortment and the same-day delivery items in the Walmart Mexico catalog for the product-specific entities.
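For illustration, a labeled sample in this setup can be thought of as a token sequence paired with BIO-style entity tags; the query and tag names below are hypothetical and only mirror the product-attribute vs. generic split described above.

```python
# Hypothetical labeled Spanish sample in BIO format (not taken from the actual datasets).
sample = {
    "tokens": ["muestrame", "el", "shampoo", "sedal", "cerca", "del",
               "codigo", "postal", "72000"],
    "tags":   ["O", "O", "B-PRODUCT", "B-BRAND", "O", "O",
               "O", "O", "B-ZIP_CODE"],
}
assert len(sample["tokens"]) == len(sample["tags"])
```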

Distillation features

The domain-specific labeled data helped the teacher model adapt to the retail-domain NER task. The student model was also trained to minimize the cross-entropy loss on the labeled data. The unlabeled data was used to enable the student to learn from the teacher by optimizing the following constraints (a rough code sketch follows the list).

1. Minimizing the KL divergence between the internal representations of the teacher and the student — Representation loss on unlabeled data.

2. Minimizing the MSE loss by comparing the classification scores of the student with respect to the teacher logits — MSE logit loss on unlabeled data.

3. Leveraging the teacher’s mBERT embeddings, with a non-linear projection and SVD on the embeddings, to align the output spaces of the teacher and the student.
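The PyTorch snippet below is a loose sketch of these objectives, not the actual XtremeDistil implementation: the representation loss treats (projected) hidden states as distributions for the KL term, the logit loss is a plain MSE against the teacher’s scores, and the SVD helper reduces the teacher’s embedding matrix to the student’s dimensionality. The shapes, the projection layer, and the tag count are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def representation_loss(student_hidden, teacher_hidden, projection):
    """KL divergence between the (projected) student and teacher internal
    representations, softmax-normalized over the hidden dimension."""
    s = F.log_softmax(projection(student_hidden), dim=-1)
    t = F.softmax(teacher_hidden, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

def logit_loss(student_logits, teacher_logits):
    """MSE between the student's per-token classification scores and the
    teacher's logits on unlabeled data."""
    return F.mse_loss(student_logits, teacher_logits)

def reduce_teacher_embeddings(teacher_embeddings, dim):
    """SVD-based reduction of the teacher's word-embedding matrix, used to
    align the teacher and student output spaces."""
    u, s, _ = torch.linalg.svd(teacher_embeddings, full_matrices=False)
    return u[:, :dim] * s[:dim]

# Toy shapes: batch of 32 queries, 16 word pieces, teacher hidden 768,
# student hidden 384, 9 entity tags, toy vocabulary of 1,000 words.
projection = torch.nn.Linear(384, 768)
rep = representation_loss(torch.randn(32, 16, 384), torch.randn(32, 16, 768), projection)
mse = logit_loss(torch.randn(32, 16, 9), torch.randn(32, 16, 9))
small_emb = reduce_teacher_embeddings(torch.randn(1000, 768), 384)
```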

The loss functions discussed above were optimized in the distillation process in a stage-wise fashion with gradual unfreezing, so that each stage learns its parameters conditioned on those learned in the previous stage. This way we leveraged both labeled and unlabeled data to best learn the student model parameters.
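As a toy illustration of the gradual unfreezing (again, not the XtremeDistil code itself), each stage toggles which student parameters are trainable, so a later stage starts from whatever the earlier, partially frozen stages learned:

```python
import torch

# Stand-in student: an "encoder" layer followed by a token classifier.
student = torch.nn.Sequential(
    torch.nn.Linear(384, 384),   # placeholder encoder
    torch.nn.Linear(384, 9),     # classifier over entity tags
)

def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1: train only the classifier (e.g., against teacher representations).
set_trainable(student[0], False)
set_trainable(student[1], True)
# Later stages: unfreeze the encoder, conditioned on the stage-1 parameters.
set_trainable(student[0], True)
```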

Distillation architecture

Distillation experiment results

The following table summarizes the performance and latency measurements on the distilled student model.

Teacher model: bert-base-multilingual-cased with WordPiece tokenizer

Datasets: In-house Walmart conversational AI datasets

Student models: TinyBERT_L-4_H-312_v2 and Multilingual-MiniLM-L12-H384.

Teacher, Student Model Metrics Comparison

Student model performance

The teacher and the student models were evaluated on a set of unseen queries from our live logs. For most of our product and generic entities, the performance drop was less than 4% on average for English queries with the TinyBERT student. For Spanish queries, we noticed a performance drop of around 9% for product entities with the TinyBERT student model. With the MiniLM student model, we saw only about a 2% drop in F1 score for product entities while still staying within the allowable latency margins. We saw a larger performance drop for generic entities in Spanish, which could be attributed to the low representation and support for these entities in the dataset and our live production logs.

Student model latency

The GPU (Tesla V100) inference latency was around 4 ms for the TinyBERT model and around 8 ms for the MiniLM model. We were further able to improve the latency to less than 1 ms by exporting the model to ONNX and serving it with ONNX Runtime, without any drop in performance compared to the distilled student model.
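A minimal sketch of such an export is shown below; the checkpoint name stands in for our fine-tuned student, and the output path, tag count, and sample query are illustrative.

```python
# Export a (stand-in) student to ONNX and run it with ONNX Runtime.
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "huawei-noah/TinyBERT_General_4L_312D"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=9).eval()
model.config.return_dict = False          # export plain tensor outputs

enc = tokenizer("cuantas piezas de leche lala hay", return_tensors="pt")
torch.onnx.export(
    model,
    (enc["input_ids"], enc["attention_mask"]),
    "student_ner.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=14,
)

session = ort.InferenceSession("student_ner.onnx",
                               providers=["CPUExecutionProvider"])
logits = session.run(["logits"], {
    "input_ids": enc["input_ids"].numpy(),
    "attention_mask": enc["attention_mask"].numpy(),
})[0]                                     # shape: (batch, seq, num_tags)
```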

Student model footprint

In terms of memory footprint, we were able to reduce the size of the model by 4x with the TinyBERT student compared to the original 12-layer mBERT model, and by 2x with the MiniLM student.

Summary

With the multilingual transformer distillation process and the MiniLM student model, we were able to bring the inference latency of the multilingual BERT NER model down to under a few milliseconds, while retaining more than 97% of the F1 score. This also resulted in significant compute and deployment infrastructure cost savings for our multilingual downstream NER task, enabling us to scale and serve our customers worldwide.

Future Work

As we expand into more geographies, we want to enable more languages on our assistant. From a distillation perspective, we want to close the performance gap of our student models by experimenting with the following techniques:

  1. Experiment with the ratio of task-specific labeled vs. unlabeled data for native languages.
  2. Distill from larger and better-performing multilingual teacher models.
  3. Improve initialization of the student model.
  4. Investigate the SentencePiece tokenizer for specific languages.
  5. Distill attention heads.
  6. Custom distillation for joint classification tasks and other deep learning model architectures for NER.


Acknowledgements

Walmart Global Tech Conversational AI team and Iman Mirrezaei, Adrian Sanchez, and Simral Chaudhary for helping with multiple experiments during this effort.

Subho Mukherjee and Steven Shi from the Microsoft Research team for sharing learnings and best practices for distillation of massive multilingual models.
