The science behind consolidating Answer Bot production Models: Part 1

Paul Gradie
Jul 30, 2019 · 8 min read
Consolidating Learnings Into Answer Bot’s Core Model

At Zendesk, we believe that great customer service builds strong customer relationships and that Machine Learning ( ML ) can play a powerful role in creating good customer experiences. Whether it is enhancing search tools or powering entire self-service systems, ML has the potential to transform the user experience.

Self-Service is hot right now.

A recent Harvard Business Review study showed that upwards of 81% of customers tend to prefer finding answers for themselves before ever taking the next step of reaching out to a real person. In 2018 Zendesk boasted nearly four BILLION ( yes, that was billion… With a B ) knowledge base article views, and this number has been trending upwards each year. Furthermore, results from a Zendesk Benchmark report of over 40,000 companies showed that nearly half of all customers would not be willing to wait for more than an hour before seeking support via a secondary channel. Together, this means that customers want to find their own answers, and want to do it quickly.

To help make finding information easy for our customers’ customers, Zendesk offers Guide, a place for our customers to put common information in the form of FAQ articles. When someone sends an email to a company in the Zendesk family, their email goes to an agent who may choose to either respond with an answer or kindly direct them to an article in Guide. We saw an opportunity in this to use recent advancements in ML and natural language processing ( NLP ) to develop a system that can ‘automagically’ respond to customer inquiries with relevant Guide Articles. We call this system Answer Bot.

What is Answer Bot?

Answer Bot is a recurrent artificial neural network trained over millions of support conversations that learns to associate questions with answers. Models are built using TensorFlow™ and served using TensorFlow Serving. The service is deployed using Docker in AWS by a team of engineers ( #shoutouts team Koal-AI! ). The network consumes text data and encodes it into numerical vectors representing the data, which in turn can be used for a range of downstream tasks. For example, we can use these vectors to measure the similarity between documents.

Answer Bot consumes and encodes ticket inquiries and articles, which are then compared against one another to learn which articles will solve a given ticket.
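To make the idea of comparing encoded documents concrete, here is a minimal sketch. It assumes cosine similarity as the comparison measure, which the post does not specify, and the three-dimensional vectors, article names, and the `rank_articles` helper are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_articles(ticket_vector, article_vectors):
    """Return article ids sorted from most to least similar to the ticket."""
    scores = {
        article_id: cosine_similarity(ticket_vector, vector)
        for article_id, vector in article_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up encodings; a real encoder would produce much longer vectors.
ticket = [0.9, 0.1, 0.0]
articles = {
    "reset-password": [0.8, 0.2, 0.1],
    "billing-faq": [0.0, 0.1, 0.9],
}
ranking = rank_articles(ticket, articles)
```

In this toy setup the ticket vector points in nearly the same direction as the first article, so that article is ranked first.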

Answer Bot is a universal model at Zendesk, which means that for a given language one model can serve to encode data for all accounts. This is advantageous for us since it solves the so-called ‘cold start’ problem. If we were to produce models for each individual account, then small or new accounts with little data would be restricted from using Answer Bot since they would lack sufficient data to train a model. By generalizing over all of the accounts, Answer Bot becomes a universal Zendesk encoder, which is capable of encoding sequences from any account, regardless of their size.

Zendesk offers Support in over 30 languages to over 140,000 customer accounts across the globe. With so many customers using Zendesk across so many languages, multilingual support is an important feature for the Answer Bot product. Ideally, we would like to be able to allow any customer to create an account, create some knowledge base articles, turn on Answer Bot, and immediately reap the benefits of the AI-powered Zendesk experience. For this to happen, we need to be able to scale Answer Bot to all of the languages supported by Zendesk, while retaining the universal model characteristic.

How we do things now and why we want to change

Answer Bot currently supports six different languages, and we serve a single model for each language. When an incoming email is selected for routing to Answer Bot, language prediction is performed and a language code is used to route the ticket to the appropriate model.

Incoming tickets are sent to a language prediction service, which returns a code that is used to determine which model should process the ticket.
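The routing step can be sketched roughly as follows. This is only an illustration: `predict_language` is a toy stand-in for the real language prediction service, the model names are hypothetical, and only two of the six supported languages are shown.

```python
# Hypothetical mapping from language code to a deployed per-language model.
MODEL_BY_LANGUAGE = {
    "en": "encoder-en",
    "es": "encoder-es",
}


def predict_language(text):
    """Toy stand-in for the language prediction service (an assumption)."""
    spanish_markers = {"hola", "gracias", "ayuda"}
    words = set(text.lower().split())
    return "es" if words & spanish_markers else "en"


def route_ticket(text, predict=predict_language):
    """Return the language code and the model that should process the ticket."""
    code = predict(text)
    model = MODEL_BY_LANGUAGE.get(code)
    if model is None:
        raise LookupError(f"no model deployed for language {code!r}")
    return code, model


spanish_route = route_ticket("hola necesito ayuda")
english_route = route_ticket("I need help with my account")
```

Every new language adds another entry to the mapping and another model to deploy and test, which is the scaling cost discussed next.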

Serving a single model for each supported language complicates the infrastructure that supports Answer Bot. It makes deploying and testing new languages difficult and is not practical to scale to the 30+ languages that are supported by Zendesk.

A burning question for us: can we consolidate the six models, each of which currently supports a single language, into one model? This question has some significant technical and experimental implications, which are the focus of this series.

Understanding the Answer Bot encoder model

For this series, we can think of the encoder models as having two parts: the embedding lookup table and the core model.

Input sentences are split into words, which are converted to vectors via the embedding lookup table. Those vectors are then fed to the core model.

The first part is the word embedding lookup table, which is a large matrix where the number of rows corresponds to the size of the allowed vocabulary. When the model consumes a text document, the word embedding lookup table is used to convert words into vectors that represent the words. The second part is the core model, which is essentially a collection of trainable parameters and mathematical operations performed using those parameters. The core model consumes the word embedding vectors and ultimately returns a single vector that represents the entire document.
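The two parts described above can be sketched in a few lines. This is a simplified illustration, not the production model: the tiny vocabulary and its vectors are invented, and mean pooling stands in for the core model, which in Answer Bot is a recurrent network with trained parameters.

```python
# Part 1: the embedding lookup table, one row (vector) per vocabulary word.
# The words and values below are invented for illustration.
EMBEDDINGS = {
    "<unk>": [0.0, 0.0],  # fallback row for out-of-vocabulary words
    "reset": [1.0, 0.0],
    "my": [0.5, 0.5],
    "password": [0.0, 1.0],
}


def lookup(words):
    """Convert words to vectors, mapping unknown words to the <unk> row."""
    return [EMBEDDINGS.get(word, EMBEDDINGS["<unk>"]) for word in words]


def core_model(vectors):
    """Part 2 stand-in: mean pooling instead of a trained recurrent network."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def encode(sentence):
    """Return one vector that represents the whole document."""
    words = sentence.lower().split() or ["<unk>"]
    return core_model(lookup(words))


document_vector = encode("reset my password")
```

Whatever the length of the input, `encode` returns a single fixed-size vector, which is the property the downstream comparison relies on.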

All of our encoder models share the same core model architecture. That is to say, they all contain the same number of parameters and the same operations performed in the same order. The critical difference is that between any two models, the values of the core model parameters will be different. When we train models, we typically will not modify the word embedding lookup table, but we will update the parameters of the core model.

What these models do not share is the word embedding lookup table. Each language has its own vocabulary and the matrix that represents that vocabulary is different. These similarities and differences are summarised in the following diagram.

Each model in production shares the same architecture. The parameter values between any two models will, however, be different.

Ultimately, the problem we faced was to train a model using data from all of the six languages while also making available the vocabulary from each language.

Choosing the approach

During our research kickoff, we identified several approaches to solving this problem. In this section, we’ll discuss some of these approaches along with the technical hurdles that accompany them.

1. Learn the word embeddings over all languages together with the model

With this approach, we would build the combined vocabulary across all six languages and learn the embeddings during the model training process.

Empirically, we have found that our models tend towards overfitting when word embeddings and model parameters are learned together, so our typical approach to creating models includes a pre-training phase where we create the word embeddings separately from the core model. This also helps speed up model iteration as we experiment with different ideas. If we learn the embedding vectors each time we train the model, we may need to learn embedding vectors for upwards of 1 million tokens each time we perform an experiment. (This assumes approximately 150,000–200,000 unique tokens per language and an initial experimental design that studies the consolidation of six languages.)

Learn embeddings and other model parameters during training.
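The token estimate above is simple arithmetic, shown here as a short sketch. The per-language range comes from the text; the embedding dimension of 300 is an assumption added only to give a feel for the number of weights involved, and the totals are upper bounds because tokens shared between languages would be counted once in a real combined vocabulary.

```python
def combined_vocabulary_bounds(num_languages, tokens_per_language):
    """Upper bounds on the combined vocabulary, ignoring shared tokens."""
    low, high = tokens_per_language
    return num_languages * low, num_languages * high


vocab_low, vocab_high = combined_vocabulary_bounds(6, (150_000, 200_000))

# Assumption: 300-dimensional vectors. The post does not state the dimension.
EMBEDDING_DIM = 300
weights_low = vocab_low * EMBEDDING_DIM
weights_high = vocab_high * EMBEDDING_DIM
```

Under these assumptions the combined table holds between 900,000 and 1,200,000 rows, all of which would have to be re-learned in every experiment.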

2. Pre-train the word embeddings over all the languages

Alternatively, we could follow our current model training strategy and pre-train the word embeddings using tickets from all languages together with open source tools such as FastText. This approach, however, also comes with some significant experimental overhead.

The training process for Zendesk word embeddings typically requires about a week for a single language, and preliminary attempts to train over six languages required nearly a month. Since we would need to experiment with many versions of these learned embeddings (e.g. various dimension sizes, various minimum and maximum n-gram settings, etc.), we would be looking at several months of embedding building.
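A back-of-the-envelope sketch shows how quickly this adds up. The grid values below are hypothetical, the four-week figure rounds the "nearly a month" quoted above, and the runs are assumed to execute one after another.

```python
from itertools import product

# "Nearly a month" per six-language training run, rounded to four weeks.
WEEKS_PER_SIX_LANGUAGE_RUN = 4


def sequential_weeks(grid, weeks_per_run=WEEKS_PER_SIX_LANGUAGE_RUN):
    """Count the runs in a settings grid and the weeks needed to run them in turn."""
    runs = list(product(*grid.values()))
    return len(runs), len(runs) * weeks_per_run


# Hypothetical grid: two vector sizes and two (min, max) n-gram ranges.
grid = {
    "dim": [100, 300],
    "ngram_range": [(2, 5), (3, 6)],
}
num_runs, total_weeks = sequential_weeks(grid)
```

Even this small grid yields four runs and roughly sixteen weeks of sequential training, and each additional setting multiplies the total.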

3. Route data to language-specific pre-trained separate embedding tables

Another idea we considered was to learn a model that used individual pre-trained embedding tables inside the TensorFlow computational graph. With this approach, incoming language data could be routed to a specific embedding table inside the graph by providing an additional language code feature with the model inputs. For example, if text data was received with an ‘es’ language code, then the text data would be converted to embedding vectors using the embedding table specific to Spanish.

This is a natural extension of the system we currently use in production; the main difference is that instead of using a different core model for each language, we would use a single core model for all languages. However, the approach limits potentially valuable cross-table access. For example, if Spanish text data contains a word that is missing from the Spanish lookup table but present in the English lookup table, the core model still treats the word as unknown. This matters because, at Zendesk, we find that a lot of text is actually mixed-language data. Subtle variations on this approach were identified during our kickoff and will be the subject of a future blog series.

A model architecture with embedding table routes.
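The cross-table limitation can be illustrated with a small sketch. It uses plain dictionaries rather than a TensorFlow graph, and the words, vectors, and the `embed_routed` helper are invented for the example.

```python
# One pre-trained table per language; contents are invented for illustration.
TABLES = {
    "en": {"invoice": [1.0, 0.0], "password": [0.0, 1.0]},
    "es": {"factura": [1.0, 0.1], "contraseña": [0.1, 1.0]},
}
UNKNOWN = [0.0, 0.0]  # vector used for words missing from the routed table


def embed_routed(words, language_code):
    """Look up words only in the table selected by the language code."""
    table = TABLES[language_code]
    return [table.get(word, UNKNOWN) for word in words]


# Mixed-language text that was tagged as Spanish.
mixed_ticket = ["factura", "password"]
vectors = embed_routed(mixed_ticket, "es")
```

Here "password" has a perfectly good vector in the English table, but because the text was routed to the Spanish table it comes back as the unknown vector.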

4. Merge pre-trained embedding tables

Lastly, we considered finding a way to merge pre-trained embedding tables. This approach leverages readily available embedding tables; however, it also introduces problems with word ambiguity and information loss when combining the embedding tables.

Merging the tables raises the problem of word collision. If two different languages contain the same word ( albeit with potentially different meanings ), then a naive merge would leave the resulting table with two different vectors associated with that single word. A lookup table can only hold a one-to-one mapping between a word and the vector that represents that word, so the term merge used here specifically refers to the process of reconciling words that coexist in more than one embedding table.

A model architecture with merged embeddings.
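The collision problem can be made concrete with a small sketch. This is not the merge method used for Answer Bot, which is the subject of the next post; the averaging rule below is just one possible reconciliation policy, chosen here as an assumption, and the tables are invented and assumed to share the same vector dimension.

```python
def find_collisions(table_a, table_b):
    """Words that appear in both embedding tables."""
    return sorted(set(table_a) & set(table_b))


def merge_tables(table_a, table_b, resolve):
    """Merge two tables, calling resolve(vec_a, vec_b) on colliding words."""
    merged = dict(table_a)
    for word, vector in table_b.items():
        if word in merged:
            merged[word] = resolve(merged[word], vector)
        else:
            merged[word] = vector
    return merged


def average(u, v):
    """Example policy (an assumption): element-wise mean of the two vectors."""
    return [(x + y) / 2 for x, y in zip(u, v)]


# "pie" is a dessert in English and means "foot" in Spanish.
english = {"pie": [1.0, 0.0], "ticket": [0.0, 1.0]}
spanish = {"pie": [0.0, 1.0], "factura": [1.0, 1.0]}

collisions = find_collisions(english, spanish)
merged = merge_tables(english, spanish, average)
```

After the merge, "pie" has a single vector that matches neither original, which is a small-scale picture of the ambiguity and information loss mentioned above.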

Making the Call

After considering the pros and cons of each of these approaches, we chose to explore number four: merge pre-trained embedding tables. This approach offered the shortest time to initial result, shortest iteration cycles, and ( intuitively ) had a reasonably high chance of success. Furthermore, the approach retained compatibility with our production system and required no additional features (such as the language code for routing). And, as with approaches 1 and 2, we gain access to the total lexicon across all of the languages, but at a much lower setup cost. That access would not be possible with, for example, an ideal routing architecture.

Next Post

In the next post, we’ll get technical and discuss basic ways to compute this merge. We’ll also attempt to visually understand what happens to the information captured by the word embeddings once the merge is complete.


Zendesk Engineering

Engineering @ Zendesk
