Supervised Word Vectors from Scratch in Rasa NLU
We’ve released a new pipeline which is totally different from the standard Rasa NLU approach. It uses very little memory, handles hierarchical intents and messages containing multiple intents, and has fewer out-of-vocabulary issues. And in principle it can do intent recognition in any language. Go check out Rasa NLU 0.12 now! Or head to Gitter if you have questions.
Reinventing how Rasa NLU works under the hood
Since we open sourced Rasa NLU in late 2016 we’ve been blown away by the growth of the community and the diversity of things people build with it. From startups to the Fortune 500, we’ve seen Rasa NLU used in customer service, IT automation, and dozens of other use cases and industries. The academic world has also embraced Rasa, with a number of research papers citing it and universities using it in coursework.
We also developed a course on building chatbots together with DataCamp. This has already reached thousands of data scientists who are learning that there’s nothing mysterious or magical about the machine learning algorithms behind chatbots.
While the APIs are pretty well established, we’re constantly doing research and reinventing how Rasa NLU works under the hood. Today, we’ve released some new tech that we’re really excited about.
Breaking all the rules: no word vectors, no classifier.
The standard way we’ve been doing intent classification since Rasa NLU was released is to represent sentences as a sum of word vectors, and then train a classifier on that representation. We run regular benchmarks on a dozen different datasets, where we try different word vectors and classifiers to see what really moves the needle. Mostly it’s the quality (or appropriateness) of your word vectors that matters, and using a neural net instead of a support vector machine (SVM) doesn’t make any difference.
The bag-of-word-vectors approach is an embarrassingly good baseline, but of course it has limitations. You often don’t have word vectors for some important words, for example jargon specific to your domain, or common typos. That’s especially true if you’re working with languages other than English, or with informal language. With our standard approach, you can’t learn vectors for these words, so they never carry any signal. Another downside is that you have tens of thousands of vectors stored in memory that you’ll never use, since most conversational AI deals with a narrow domain.
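To make the standard approach and its out-of-vocabulary problem concrete, here is a minimal sketch in plain Python. The tiny 3-dimensional vectors, the intent names, and the nearest-centroid classifier (a stand-in for the SVM) are all assumptions chosen for illustration; this is not the Rasa NLU implementation.

```python
# Toy "pre-trained" word vectors (hypothetical values; real ones have
# hundreds of dimensions and tens of thousands of entries).
WORD_VECTORS = {
    "hello": [1.0, 0.0, 0.0],
    "hi": [0.9, 0.1, 0.0],
    "warranty": [0.0, 1.0, 0.0],
    "guarantee": [0.0, 0.9, 0.1],
    "what": [0.0, 0.2, 0.2],
}


def featurize(sentence, dim=3):
    """Represent a sentence as the sum of its word vectors."""
    total = [0.0] * dim
    for word in sentence.lower().split():
        vector = WORD_VECTORS.get(word)
        if vector is None:
            continue  # out-of-vocabulary word: it carries no signal
        total = [t + v for t, v in zip(total, vector)]
    return total


def train_centroids(examples):
    """Average the sentence features per intent (stand-in for a real classifier)."""
    sums, counts = {}, {}
    for text, intent in examples:
        features = featurize(text)
        previous = sums.get(intent, [0.0] * len(features))
        sums[intent] = [p + f for p, f in zip(previous, features)]
        counts[intent] = counts.get(intent, 0) + 1
    return {i: [s / counts[i] for s in sums[i]] for i in sums}


def classify(text, centroids):
    """Pick the intent whose centroid is closest to the sentence features."""
    features = featurize(text)

    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))

    return min(centroids, key=lambda intent: squared_distance(centroids[intent]))


centroids = train_centroids([
    ("hello", "greet"),
    ("hi", "greet"),
    ("what warranty", "ask_warranty"),
])
print(classify("guarantee", centroids))  # a synonym still lands near the right intent
print(featurize("warrantee"))            # a typo has no vector, so the features are all zeros
```

The last line shows the limitation described above: a misspelled or domain-specific word that is missing from the pre-trained vocabulary contributes nothing to the sentence representation.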
Our new TensorFlow embedding pipeline does almost the exact opposite. It doesn’t use pre-trained word vectors, and should work on any language (though we haven’t tried them all yet!). The inspiration for the new method was the StarSpace paper from Facebook.
Instead of relying on pre-trained vectors, the pipeline learns embeddings for both the intents and the words simultaneously. And instead of training a classifier, it uses these embeddings to rank the similarity between an input sentence and all of the intents. This means you aren’t stuck with out-of-the-box pre-trained word vectors; you learn your own, specifically for your domain.
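The ranking step can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the 2-dimensional embeddings are hand-picked stand-ins for values a trained model would learn, the intent names are hypothetical, and cosine similarity is just one possible similarity measure.

```python
import math

# Hand-picked stand-ins for embeddings that would normally be learned
# jointly from the training data (hypothetical values).
WORD_EMBEDDINGS = {
    "thanks": [1.0, 0.1],
    "cheers": [0.9, 0.2],
    "warranty": [0.1, 1.0],
    "covered": [0.2, 0.8],
}
INTENT_EMBEDDINGS = {
    "thank": [1.0, 0.0],
    "ask_warranty": [0.0, 1.0],
}


def embed_sentence(text, dim=2):
    """Embed a sentence as the sum of the embeddings of its known words."""
    total = [0.0] * dim
    for word in text.lower().split():
        vector = WORD_EMBEDDINGS.get(word)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return total


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_intents(text):
    """Return (intent, similarity) pairs, most similar intent first."""
    sentence = embed_sentence(text)
    scores = [(intent, cosine(sentence, vector))
              for intent, vector in INTENT_EMBEDDINGS.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for intent, score in rank_intents("is it covered"):
    print(intent, round(score, 3))
```

There is no separate classifier here: the prediction is simply the intent at the top of the ranking, and the remaining scores show how close the other intents came.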
We made a few modifications to the original algorithm so that it works robustly on the small datasets Rasa developers usually have, while still scaling well to larger ones. We kept the embed-and-rank idea, but we changed the architecture, added some regularisation and dropout, and modified the loss function. We also found default hyperparameters that do well on a number of different datasets.
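For readers unfamiliar with embed-and-rank training, the starting point in the StarSpace paper is a margin ranking loss: the correct intent should be more similar to the sentence than each sampled wrong intent, by at least some margin. The sketch below shows only that generic loss, with an assumed margin of 0.2; it is not the modified loss used in the Rasa pipeline, whose details are not given here.

```python
def margin_ranking_loss(sim_positive, sims_negative, margin=0.2):
    """Generic hinge-style ranking loss (StarSpace-style, not Rasa's variant).

    sim_positive:  similarity between the sentence and its correct intent
    sims_negative: similarities between the sentence and sampled wrong intents
    margin:        how much better the correct intent must score (assumed value)
    """
    return sum(max(0.0, margin - sim_positive + sim_negative)
               for sim_negative in sims_negative)


# The first wrong intent is already far enough away and adds no loss;
# the second is within the margin and is penalised.
print(margin_ranking_loss(0.9, [0.1, 0.8]))
```

Minimising a loss of this shape pushes sentence embeddings towards their own intent and away from the others, which is what makes the similarity ranking meaningful at prediction time.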
A quick experiment
Because you only have to embed the words and intents you care about, the memory footprint is much smaller. Loading a full set of GloVe word vectors in memory can take up a couple of GB of RAM. We trained a model on one of our customers’ datasets (6000 labeled utterances) and spun up a Rasa NLU server with this embedding model loaded, and it took up just 120 MB.
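A back-of-the-envelope calculation shows where the difference comes from. The sketch below counts only the raw float32 embedding table, not the rest of the server process. The GloVe figures (about 2.2 million words, 300 dimensions) are those of the large Common Crawl release; the domain vocabulary size and embedding dimension are hypothetical numbers chosen for illustration.

```python
BYTES_PER_FLOAT32 = 4


def embedding_bytes(vocab_size, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Size of a dense embedding table: one vector of `dim` floats per word."""
    return vocab_size * dim * bytes_per_value


# Large pre-trained GloVe table: about 2.2M words, 300 dimensions each.
glove = embedding_bytes(2_200_000, 300)

# Hypothetical domain-specific table: 5,000 words, 20 dimensions each.
domain = embedding_bytes(5_000, 20)

print(f"{glove / 1e9:.2f} GB")   # 2.64 GB
print(f"{domain / 1e6:.2f} MB")  # 0.40 MB
```

Under these assumptions the learned table is thousands of times smaller than the pre-trained one, so the embeddings themselves account for only a small part of the server’s memory use.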
What’s also great is that since you’re also learning embeddings for the intents themselves, you can train a model that knows that some intents are more similar than others. You can even make this explicit and split up your intents into multiple tokens so the model can share information between them. This lets you build up hierarchical intents, like question+product+warranty. An intent like this is split into 3 tokens, and we learn an embedding for each of them. Intents that have tokens in common share those embeddings, so it’s easy for the model to capture that they are related. You can turn this behaviour on with a flag in the pipeline configuration.
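The splitting itself is simple, as this minimal sketch shows. It assumes the tokens are separated by "+", as in the example label above; the second label, question+product+price, is a hypothetical sibling intent added for illustration, and the function names are not part of Rasa NLU.

```python
def split_intent(label, symbol="+"):
    """Split a hierarchical intent label into its tokens."""
    return label.split(symbol)


def shared_tokens(label_a, label_b, symbol="+"):
    """Tokens of label_a that also appear in label_b, in label_a's order."""
    tokens_b = set(split_intent(label_b, symbol))
    return [token for token in split_intent(label_a, symbol) if token in tokens_b]


def encode_intent(label, vocabulary, symbol="+"):
    """Multi-hot encoding: 1 for every vocabulary token present in the label."""
    tokens = set(split_intent(label, symbol))
    return [1 if token in tokens else 0 for token in vocabulary]


labels = ["question+product+warranty", "question+product+price"]  # second one is hypothetical
vocabulary = sorted({token for label in labels for token in split_intent(label)})

print(vocabulary)
print(shared_tokens(labels[0], labels[1]))
for label in labels:
    print(label, encode_intent(label, vocabulary))
```

Because both labels activate the same question and product tokens, whatever the model learns about those tokens from one intent is automatically available to the other.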
You can also use this to model messages that contain multiple intents. For example, it’s pretty common to see a message like this:
“thanks! Oh and what’s the warranty on that?”
Here the user has clearly said two different things.
With the embedding pipeline, we can model this as two intents joined in a single label: a thank-you plus a question about the warranty. That wasn't possible with the standard SVM. To see how well this works, we created a synthetic dataset: we started with a small dataset (800 labeled utterances) from a demo we built, then combined messages and their labels to create a multi-intent dataset (3600 utterances in total).
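One way to build such a synthetic dataset is to pair up single-intent messages and join both the texts and the labels. The sketch below is an assumption about how this could be done, not the script we used; the intent names thank and ask_warranty, the "+" separator, and the choice to sort the label tokens are all hypothetical.

```python
import itertools


def combine_examples(examples, symbol="+", joiner=" "):
    """Create multi-intent examples from pairs of single-intent examples.

    Both orderings of the two messages are generated, and the combined
    label is sorted so that each pair of intents maps to one label.
    """
    combined = []
    for (text_a, intent_a), (text_b, intent_b) in itertools.permutations(examples, 2):
        if intent_a == intent_b:
            continue  # two messages with the same intent add nothing new
        label = symbol.join(sorted([intent_a, intent_b]))
        combined.append((text_a + joiner + text_b, label))
    return combined


single_intent = [
    ("thanks!", "thank"),
    ("what's the warranty on that?", "ask_warranty"),
]

for text, label in combine_examples(single_intent):
    print(label, "|", text)
```

In practice you would probably sample from the pairs rather than keep all of them, since the number of combinations grows quadratically with the size of the original dataset.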
We evaluated two pipelines on this dataset: our standard spacy_sklearn pipeline with the small English model, and the TensorFlow embedding pipeline. We ran rasa_nlu.evaluate in cross-validation mode with 3 folds.
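If cross-validation is new to you: the data is split into 3 folds, and each fold takes one turn as the test set while the model is trained on the other two. The sketch below illustrates only that splitting logic in plain Python; it is not how rasa_nlu.evaluate is implemented, and the function name is hypothetical.

```python
def k_fold_splits(items, k=3):
    """Yield (train, test) pairs so that every item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]  # deal items round-robin into k folds
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


utterances = ["u0", "u1", "u2", "u3", "u4", "u5", "u6"]
for train, test in k_fold_splits(utterances, k=3):
    print(len(train), "train /", len(test), "test:", test)
```

The reported score is then the average over the folds, which gives a more reliable estimate than a single train/test split, especially on small datasets.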
The results are quite dramatic! The embedding architecture absolutely crushes it, and this is without any hyperparameter search.
Open source is the way forward
We now have two totally opposite but complementary approaches to intent classification in Rasa NLU. The TensorFlow embedding pipeline does well on all of our benchmark datasets, even the small ones. Please let us know at email@example.com how it performs on yours. We’re really curious to know!
Keep building awesome stuff with the Rasa Stack, and keep us posted!
Originally published at blog.rasa.com on April 18, 2018.