Adventures in bootstrapping chatbot NLU models
To build a real conversational chatbot, you need not only a good dialogue strategy but also reliable natural language understanding (NLU) support. Ideally, the NLU models supporting chatbot dialogues would always be trained on a large corpus of real user data corresponding to the use cases supported by the chatbot. When you create a brand new chatbot from scratch, however, such representative data is not always available; in fact, most of the time it isn’t, which is why we need to bootstrap the NLU with some kind of artificial data (meant to be replaced with real data once users start interacting with the chatbot).
But having good data is not the end of the story; we need to understand how to best use that data to train good NLU models, and obtain the best performance possible from the NLU engine, no matter which engine is used. To truly do that, we need to benchmark several NLU engines using large corpora of user data from different domains; this is the focus of a separate project in our group.
Meanwhile, in order to have a workable version of our chatbot, we needed to find a way to produce decent NLU support with limited or no user data. This post reports on my experience trying to bootstrap and train NLU models with a well-known, off-the-shelf NLU engine (let’s call it “engine Z”) by following the provider’s recommendations. My observations are not scientific, but they provide insights into potentially significant issues with the NLU engines made available to chatbot developers.
Based on the various use cases and sample dialogues that we defined early in our design process, we created the initial inventory of intents and entities (basic concepts used in most NLU models for task-oriented dialogues) that the chatbot would support and the dialogues would handle.
To train the intents, we followed guidelines and good practices available in engine Z’s documentation, and which are common to most NLU products currently on the market. In a nutshell:
- Make sure that intents are distinct (to avoid confusion and overlapping between intents).
- Training data should be balanced. In other words, the number of training expressions per intent should be roughly equivalent.
- Training expressions should be diversified, to provide various syntactic forms (e.g. questions vs declarative phrases).
- Training data should be representative of usage; this criterion is much harder to meet when user data is not available, but we used common sense.
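The balance criterion above can be checked mechanically before training. Here is a minimal sketch; the `max_ratio` tolerance and the data shape are assumptions of this sketch, since the guidelines only say that counts should be “roughly equivalent”:

```python
def check_balance(training_data, max_ratio=2.0):
    """Flag an unbalanced intent inventory.

    training_data: dict mapping intent name -> list of training expressions.
    max_ratio: hypothetical tolerance between the largest and smallest
    per-intent expression counts; any concrete threshold is our own choice.
    """
    counts = {intent: len(exprs) for intent, exprs in training_data.items()}
    balanced = max(counts.values()) <= max_ratio * min(counts.values())
    return balanced, counts
```

Running a check like this on every revision of the training data makes it harder to drift out of balance as intents are added.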
Some NLU providers also recommend including valid entities in the training expressions for intents. In certain cases, it is possible to use rules or placeholders to add a certain level of abstraction in the intent training data, but not all engines allow that. When classes or abstract entities cannot be used, actual sample entities can be inserted in training expressions to help the classifier. Engine Z does not allow such abstraction, so we worked with sample entities.
To bootstrap intent models, we decided to rely on ABNF grammars (used in speech recognition) to write rules and generate lists of expressions. Using grammars allowed us to easily create reusable patterns and permutations without having to write down every phrase. We tried to come up with expressions as realistic as possible for each intent, and followed the general recommendations for defining and training intents.
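As an illustration of the approach, a couple of alternation rules in the spirit of an ABNF grammar can be expanded into a full expression list with a few lines of Python. The slots and phrases below are made-up examples, not our actual grammars:

```python
from itertools import product

# Each slot lists interchangeable phrases; expansion picks one phrase
# per slot, in order, like alternatives in an ABNF rule.
CHECK_PHONE_RULE = [
    ["check", "verify", "look up"],
    ["my"],
    ["phone number", "contact info"],
]

def expand(rule):
    """Generate every combination of the alternatives in each slot."""
    return [" ".join(parts) for parts in product(*rule)]
```

With three alternatives in the first slot and two in the last, this single rule already yields six training expressions.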
We then fed the NLU models with this artificial data.
We started doing anecdotal testing of initial versions of our demo chatbot. It generally worked pretty well, and we made adjustments here and there to make the NLU more robust as testers started using it.
We observed a fair number of issues with intent classification using engine Z.
Incorrect intent substitution
Our chatbot allows users to check/obtain information, and also to change/modify information, supporting a mixed initiative dialogue strategy. We created a handful of intents for “checking” things, and another set of intents for “changing” things. The expressions used to train each type often included similar portions, although verbs and verb phrases, as well as differentiating nouns or noun phrases, were always distinct. Despite that, the NLU often substituted one type for the other, for example, returning a “check phone number” intent instead of a “change phone number” intent.
No fuzzy matching for intents
Engine Z does not use fuzzy matching on intents, only on entities. This limitation has huge consequences: the slightest typo derails intent recognition.
All words are created equal
We noticed that if an intent had non-essential words in the training expressions, that intent would be incorrectly returned when the input clearly meant something different, but included those non-essential words. Many different experiments allowed us to deduce that there is no syntactic analysis and that the same weight is given to all the words, no matter the position or the role they play in the sentence. For example, adding an extra pronoun in a training expression for an intent but omitting it in another is enough to make the engine incorrectly return that intent in irrelevant contexts.
No extrapolation beyond training expressions
More often than not, if the expression was not in the training list, even when similar expressions were included in the training set, it was not recognized. We assumed that the engine would extrapolate based on examples, but it does not seem to be the case.
Ghost intents for entity-only input
More often than not, when the user only entered an entity, without any mention of an intent, the system returned some intent with a high confidence score. In most cases, our dialogue can ignore the intent and only process the entity. However, since our dialogue uses a mixed initiative strategy, there may be situations where this “ghost” intent will get the user on the wrong path.
Interpreting confidence scores
Engine Z returns confidence scores for every intent (but unfortunately, not for entities). These scores range from 0 to 1, the higher the better (in theory), as with most NLU engines. However, without extensive experiments, it is difficult to set reliable thresholds to correctly accept or reject the returned results. Moreover, high scores are often returned for incorrect intents, while valid input often returns very low scores (even when similar expressions are found in the training data for that intent).
It goes without saying that setting reliable confidence thresholds is essential to ensure that the bot will react adequately to user queries and provide a decent user experience.
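In practice, the dialogue needs some policy on top of the raw score. A common pattern, shown here as a sketch, is a two-threshold decision (accept / confirm with the user / reject). The threshold values below are placeholders, since the whole problem is that reliable values require extensive per-engine experiments:

```python
def decide(intent, score, accept=0.8, clarify=0.4):
    """Map a confidence score to a dialogue action.

    accept/clarify thresholds are illustrative placeholders only;
    they would have to be tuned empirically for each engine.
    """
    if score >= accept:
        return ("accept", intent)
    if score >= clarify:
        return ("confirm", intent)   # ask the user to confirm the intent
    return ("reject", None)          # fall back or reprompt
```

The middle "confirm" band is what keeps a shaky classifier usable: rather than silently acting on a dubious intent, the bot asks the user first.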
What did work (somehow)
After several trials and errors, we found a “recipe” that seemed to work pretty well, most of the time, for engine Z:
- We stripped training expressions of all non-essential carrier words and phrases.
- We kept essential verbs and nouns that are needed to distinguish one intent from another.
- We even removed subject pronouns.
- We ended up not including entities in context, because that did not particularly help.
When we trained engine Z with this strategy, we obtained pretty decent results when testing with complete and complex phrases, as long as they included the “skeleton” words that we had put in the training expressions. This did not hold every time, and confidence scores are still a major roadblock, but this method provided the least inconsistent results.
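The stripping step of the recipe can be illustrated as follows. The carrier-word list here is a hypothetical stand-in; our real lists were assembled by hand for our domain:

```python
# Hypothetical carrier words and pronouns to strip from training
# expressions; a real list would be built per language and domain.
CARRIER_WORDS = {
    "i", "you", "we", "would", "like", "to", "please", "can",
    "could", "the", "a", "an", "my", "me",
}

def to_skeleton(expression):
    """Keep only the distinguishing 'skeleton' words of a training phrase."""
    return " ".join(
        w for w in expression.lower().split() if w not in CARRIER_WORDS
    )
```

For instance, "I would like to change my phone number" reduces to the skeleton "change phone number", which is what we actually fed to engine Z.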
Would this recipe work with a different NLU engine? Unfortunately, initial tests on a second engine showed that it did not. At all.
Questions and concerns
These anecdotal observations raise a few questions:
- If all words are equal for a given NLU engine, does this mean that we need to pre-process training data in order to exclude anything not essential?
- What amount of pre-processing will be required to obtain proper training data for a given engine?
- If different NLU engines use different training strategies, how can we train and benchmark them with the same training corpus?
- If inconsistencies in the training data have significant impacts on classification performance, then how do we use real user data? People are not consistent!
It seems to be necessary to adapt training corpora and strategies in order to get good performance from any NLU engine.
These non-scientific experiments showed that we need to test and benchmark various NLU engines with large corpora and different strategies in order to put together a reliable methodology and select the NLU engine best suited to our needs. Results from the ongoing benchmarking experiments will hopefully help us move in the right direction.
Thanks to my colleague Guillaume Voisine for his help and insights.