At Zendesk, we believe that great customer service builds strong customer relationships and that machine learning (ML) can play a powerful role in creating good customer experiences. Whether it is enhancing search tools or powering entire self-service systems, ML has the potential to transform the user experience.
Self-Service is hot right now.
A recent Harvard Business Review study showed that upwards of 81% of customers prefer to find answers for themselves before reaching out to a real person. In 2018, Zendesk boasted nearly four BILLION (yes, that was billion… with a B) knowledge base article views, and this number has been trending upwards each year. Furthermore, results from a Zendesk Benchmark report covering over 40,000 companies showed that nearly half of all customers would not be willing to wait more than an hour before seeking support via a secondary channel. Together, this means that customers want to find their own answers, and want to do it quickly.
To make finding that information easy for our customers’ customers, Zendesk offers Guide, a place for our customers to put common information in the form of FAQ articles. When someone sends an email to a company in the Zendesk family, the email goes to an agent, who may choose either to respond with an answer or to kindly direct them to an article in Guide. We saw an opportunity to use recent advancements in ML and natural language processing (NLP) to develop a system that can ‘automagically’ respond to customer inquiries with relevant Guide articles. We call this system Answer Bot.
What is Answer Bot?
Answer Bot is a recurrent artificial neural network trained over millions of support conversations that learns to associate questions with answers. Models are built using TensorFlow™ and served using TensorFlow Serving. The service is deployed using Docker in AWS by a team of engineers (#shoutouts team Koal-AI!). The network consumes text data and encodes it into numerical vectors representing the data, which in turn can be used for a range of downstream tasks. For example, we can use these vectors to measure the similarity between documents.
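As a toy illustration of that last point, cosine similarity is a common way to compare two encoded documents. The vectors below are made up for the example; they stand in for the output of an encoder like the one described above.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical encoder outputs for a question and two candidate articles.
question_vec = [0.9, 0.1, 0.3]
article_a = [0.8, 0.2, 0.4]    # similar topic
article_b = [-0.5, 0.9, -0.1]  # unrelated topic

print(cosine_similarity(question_vec, article_a))  # close to 1.0
print(cosine_similarity(question_vec, article_b))  # much lower
```

A ranking system can then surface the articles whose vectors score highest against the question vector.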
Answer Bot is a universal model at Zendesk, which means that for a given language one model can serve to encode data for all accounts. This is advantageous for us since it solves the so-called ‘cold start’ problem. If we were to produce models for each individual account, then small or new accounts with little data would be restricted from using Answer Bot since they would lack sufficient data to train a model. By generalizing over all of the accounts, Answer Bot becomes a universal Zendesk encoder, which is capable of encoding sequences from any account, regardless of their size.
Zendesk offers Support in over 30 languages to over 140,000 customer accounts across the globe. With so many customers using Zendesk across so many languages, multilingual support is an important feature for the Answer Bot product. Ideally, we would like to be able to allow any customer to create an account, create some knowledge base articles, turn on Answer Bot, and immediately reap the benefits of the AI-powered Zendesk experience. For this to happen, we need to be able to scale Answer Bot to all of the languages supported by Zendesk, while retaining the universal model characteristic.
How we do things now and why we want to change
Answer Bot currently supports six different languages, and we serve a single model for each language. When an incoming email is selected for routing to Answer Bot, language prediction is performed and a language code is used to route the ticket to the appropriate model.
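Conceptually, that routing step looks something like the following sketch. The endpoint names and language set are invented for illustration; in production the dispatch would target separate TensorFlow Serving deployments.

```python
# Hypothetical mapping from predicted language code to serving endpoint.
MODEL_ENDPOINTS = {
    "en": "answer-bot-encoder-en",
    "es": "answer-bot-encoder-es",
    "fr": "answer-bot-encoder-fr",
    "de": "answer-bot-encoder-de",
    "pt": "answer-bot-encoder-pt",
    "ja": "answer-bot-encoder-ja",
}

def route_ticket(text, predicted_language):
    """Pick the serving endpoint for a ticket based on its predicted language."""
    endpoint = MODEL_ENDPOINTS.get(predicted_language)
    if endpoint is None:
        raise ValueError(f"Unsupported language: {predicted_language}")
    return endpoint

print(route_ticket("¿Cómo restablezco mi contraseña?", "es"))
```

Every new language adds another entry here and another model to train, deploy, and monitor, which is precisely the scaling pain described below.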
Serving a single model for each supported language complicates the infrastructure that supports Answer Bot. It makes deploying and testing new languages difficult and is not practical to scale to the 30+ languages that are supported by Zendesk.
A burning question for us: can we consolidate the six language-specific models we currently serve into a single model? This question has some significant technical and experimental implications, which are the focus of this series.
Understanding the Answer Bot encoder model
For this series, we can think of the encoder models as having two parts: the word embedding lookup table and the core model.
The first part is the word embedding lookup table, which is a large matrix where the number of rows corresponds to the size of the allowed vocabulary. When the model consumes a text document, the word embedding lookup table is used to convert words into vectors that represent the words. The second part is the core model, which is essentially a collection of trainable parameters and mathematical operations performed using those parameters. The core model consumes the word embedding vectors and ultimately returns a single vector that represents the entire document.
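The two-part structure can be sketched in miniature. Everything here is a toy stand-in: the vocabulary and vectors are random, and mean pooling substitutes for the trained recurrent core model, purely to show how the lookup table and core model hand off to each other.

```python
import random

random.seed(0)
EMBED_DIM = 4

# Toy word embedding lookup table: one row (vector) per vocabulary word.
vocab = ["how", "do", "i", "reset", "my", "password"]
lookup_table = {word: [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
                for word in vocab}
UNK = [0.0] * EMBED_DIM  # vector used for out-of-vocabulary words

def embed(tokens):
    """Embedding lookup: map each token to its row in the table."""
    return [lookup_table.get(tok, UNK) for tok in tokens]

def core_model(word_vectors):
    """Stand-in for the core model: mean pooling here, whereas the real
    encoder is a recurrent network with trained parameters."""
    n = len(word_vectors)
    return [sum(vec[d] for vec in word_vectors) / n for d in range(EMBED_DIM)]

doc_vector = core_model(embed("how do i reset my password".split()))
print(len(doc_vector))  # one fixed-size vector for the whole document
```

Whatever the document length, the output is a single fixed-size vector, which is what makes downstream similarity comparisons possible.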
All of our encoder models share the same core model architecture. That is to say, they all contain the same number of parameters and the same operations performed in the same order. The critical difference is that between any two models, the values of the core model parameters will be different. When we train models, we typically will not modify the word embedding lookup table, but we will update the parameters of the core model.
What these models do not share is the word embedding lookup table. Each language has its own vocabulary, and the matrix that represents that vocabulary is different. These similarities and differences are summarized in the following diagram.
Ultimately, the problem we faced was to train a model using data from all of the six languages while also making available the vocabulary from each language.
Choosing the approach
During our research kickoff, we identified several approaches to solving this problem. In this section, we’ll discuss some of these approaches along with the technical hurdles that accompany them.
1. Learn the word embeddings over all languages together with the model
With this approach, we would build the combined vocabulary set across all six languages and learn the embeddings during the model training process.
Empirically, we have found that our models tend towards overfitting when word embeddings and model parameters are learned together, so our typical approach to creating models includes a pre-training phase where we create the word embeddings separately from the core model. This also helps speed up model iteration as we experiment with different ideas. If we learn the embedding vectors each time we train the model, we may need to learn embedding vectors for upwards of 1 million tokens each time we perform an experiment (assuming approximately 150,000–200,000 unique tokens per language and an initial experimental design that consolidates six languages).
2. Pre-train the word embeddings over all the languages
Alternatively, we could follow our current model training strategy and pre-train the word embeddings using tickets from all languages together with open source tools such as FastText. This approach, however, also comes with some significant experimental overhead.
The training process for Zendesk word embeddings typically requires about a week for a single language, and preliminary attempts to train over six languages required nearly a month. Since we would need to experiment with many versions of these learned embeddings (e.g. various dimension sizes, various min and max n-gram settings, etc.), we would be looking at several months of embedding building.
3. Route data to language-specific pre-trained separate embedding tables
Another idea we considered was to learn a model that used individual pre-trained embedding tables inside the Tensorflow computational graph. With this approach, incoming language data could be routed to a specific embedding table inside the graph by providing an additional language code feature with the model inputs. For example, if text data was received with an ‘es’ language code, then the text data would be converted to embedding vectors using the embedding table specific to Spanish.
This is a natural extension of the system we currently use in production; the main difference is that instead of using a different core model for each language, we would use a single core model for all languages. However, the approach rules out potentially valuable cross-table access. For example, if Spanish text data contains a word missing from the Spanish lookup table but present in the English lookup table, the word is still considered unknown by the core model. At Zendesk, we find that a lot of text is actually mixed-language data. Subtle variations on this approach were identified during our kickoff and will be the subject of a future blog series.
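A tiny example makes the cross-table limitation concrete. The tables and vectors below are invented; the point is only that strict routing consults a single table, so a word present in another language's table is still treated as unknown.

```python
# Toy per-language embedding tables to illustrate strict routing.
tables = {
    "en": {"refund": [0.2, 0.7], "invoice": [0.5, 0.1]},
    "es": {"factura": [0.4, 0.2], "reembolso": [0.1, 0.8]},
}
UNK = [0.0, 0.0]  # vector for out-of-vocabulary words

def lookup_with_routing(word, language_code):
    """Strict routing: only the table for the predicted language is consulted."""
    return tables[language_code].get(word, UNK)

# A Spanish ticket containing the English word "refund":
vec = lookup_with_routing("refund", "es")
print(vec)  # [0.0, 0.0] -- unknown, even though the English table has it
```

With mixed-language tickets being common, every such miss discards information the system actually possesses.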
4. Merge pre-trained embedding tables
Lastly, we considered finding a way to merge pre-trained embedding tables. This approach leverages readily available embedding tables; however, it also introduces problems with word ambiguity and information loss when combining the tables.
By merging the tables, the problem of word collision arises. If two different languages contain the same word (albeit with potentially different meanings), then during a merge the resulting table would contain two different vectors associated with that single word. A lookup table can only have a one-to-one mapping between a word and the vector that represents it, so the term merge here refers specifically to the process of reconciling the coexistence of words between embedding tables.
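A naive sketch shows the collision problem directly. The words, vectors, and the averaging strategy are all illustrative; averaging is just one of several possible ways to reconcile a collision, and later posts in this series examine the merge computation properly.

```python
def merge_tables(table_a, table_b):
    """Naively merge two embedding tables, averaging vectors on collision.
    Averaging is one possible reconciliation strategy, not the only one."""
    merged = dict(table_a)
    for word, vec in table_b.items():
        if word in merged:
            # Word collision: the same string exists in both languages.
            merged[word] = [(x + y) / 2 for x, y in zip(merged[word], vec)]
        else:
            merged[word] = vec
    return merged

en = {"pie": [1.0, 0.0], "book": [0.3, 0.3]}   # English "pie" (dessert)
es = {"pie": [0.0, 1.0], "libro": [0.6, 0.2]}  # Spanish "pie" (foot)
merged = merge_tables(en, es)
print(merged["pie"])  # [0.5, 0.5] -- one vector must now serve both senses
```

The collision entry illustrates the information loss: the merged vector for "pie" sits between the dessert sense and the foot sense, representing neither faithfully.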
Making the Call
After considering the pros and cons of each of these approaches, we chose to explore number four: merging pre-trained embedding tables. This approach offered the shortest time to an initial result, the shortest iteration cycles, and (intuitively) a reasonably high chance of success. Furthermore, it retained compatibility with our production system and required no additional features (such as the language code for routing). And like approaches 1 and 2, it gives us access to the total lexicon across all of the languages, which would not be possible with, for example, an ideal routing architecture, but at a much lower setup cost.
In the next post, we’ll get technical and discuss basic ways to compute this merge. We’ll also attempt to visually understand what happens to the information captured by the word embeddings once the merge is complete.