Do-it-yourself NLP for bot developers

I believe in most cases it makes sense for bot makers to build their own natural language parser, rather than using a third party API. There are good strategic and technical arguments for doing this, and I want to show how easily you can put something together. This post has 3 sections:

  1. argues why you should bother
  2. shows that the simplest possible approach works OK
  3. shows something you can actually use

So what NLP do you need for a typical bot? Let’s say you’re building a service to help people find restaurants. Your users might say something like:

I’m looking for a cheap Mexican place in the centre.

To respond to this you need to do two things:

i. Understand the user’s intent: they are searching for a restaurant and not saying “hello”, “goodbye”, or “thanks”.

ii. Extract cheap, Mexican, and centre as fields you can use to run a query.

In a previous post I mentioned that tools like wit and LUIS make intent classification and entity extraction so simple that you can build a bot like this during a hackathon. I’m a huge fan of both of these tools, and the teams behind them. But they’re not the right solution in every case.

1. Three reasons to do it yourself

Firstly, if you actually want to build a business based on conversational software, it’s probably not a good policy to pass on everything your users tell you to either Facebook or Microsoft. Secondly, I just don’t believe a web API is the solution for every problem in programming. https calls are slow, and you’re always constrained by the design choices that went into the API endpoints. Libraries, on the other hand, are hackable. Thirdly, there’s a good chance you’ll be able to build something that performs better on your data & use case. Remember, the general-purpose APIs have to do well on every problem, you only have to do well on yours.

2. Word2vec + heuristics — sophistication = working code

We’ll start by building the simplest possible model to get a feel for how this works, not using any libraries (other than numpy) .

I strongly believe that the only thing you can really do in machine learning is to find a good change of variables. If that seems cryptic, I’m currently writing another post explaining what I mean by that, so check back later. The point is that if you have an effective way to represent your data, then even extremely simple algorithms will do the job.

We’re going to use word vectors, which are vectors of a few tens or hundreds of floating point numbers which to some extent capture the meaning of a word. The fact that this can be done at all, and the way in which these models are trained, are both super interesting. For our purposes this means that the hard work has already been done for us: word embeddings like word2vec or GloVe are a powerful way to represent text data. I decided to use GloVe for these experiments. You can download pre-trained models from their repo, and I took the smallest (50 dimensional) pre-trained model. 
Below I’ve pasted some code, based on the python examples in the GloVe repo. All this does is load the word vectors for the whole vocabulary into memory.

Now let’s try to use these vectors for the first task: picking out Mexican as a cuisine type in the sentence “I’m looking for a cheap Mexican place in the centre”. We’ll use the simplest method possible: provide some example cuisine types, and just look for the most similar words in the sentence. We’ll loop through the words in the sentence, and pick out the ones whose average cosine similarity to the reference words is above some threshold.

Let’s try it out on an example sentence.

Amazingly, this is enough to correctly generalise, and to pick out “indian” as a cuisine type, based on its similarity to the reference words. Hence why I say that once you have a good change of variables, problems become easy.

Now on to classifying the user’s intent. We want to be able to put sentences in categories like “saying hello”, “saying thanks”, “asking for a restaurant” , “specifying a location”, “rejecting a suggestion”, etc, so that we can tell the bot’s backend which lines of code to execute. There are many ways we could combine word vectors to represent a sentence, but again we’re going to do the simplest thing possible: add them up. If you think it’s abhorrent to add together word vectors, I hear you. Read the appendix at the bottom for why this works.

We can create these bag-of-words vectors for each sentence, and again just categorise them using a simple distance criterion. Again, amazingly, this can already generalise to unseen sentences.

Neither the parsing nor classification that I’ve shown is particularly robust, so we’ll move on to something better. But I hope that I’ve shown that there’s really nothing mysterious going on, and in fact very simple approaches can already work.

3. Something you can actually use

There are lots of things we can do better. For example, converting a block of text into tokens requires more than just splitting on whitespace. One approach is to use a combination of SpaCy / textacy to clean up and parse your text, and scikit-learn to build your models. Here I’ll use a nice library called MITIE (MIT Information Extraction) which does everything we need and has python bindings.

There are two classes we can use directly. First, a text categoriser:

Second, an entity recogniser:

The MITIE library is quite sophisticated and uses spectral word embeddings rather than GloVe. The text categoriser is a straightforward SVM, whereas the entity recogniser uses a structural SVM. The repo has links to some relevant literature if you’re interested.

As you might expect, using a library like this (or a combination of SpaCy and your favourite ML lib) gives much better performance than the hacky code I posted at the start. In fact, in my experience you can very quickly beat the performance of wit or LUIS on your particular problem, purely because you can tune the parameters for your data set.


I hope I’ve convinced you that it’s worth having a go at creating your own NLP modules when building a bot. Please do add your ideas, suggestions, and questions below. I look forward to the discussion. If you enjoyed this, you can follow me here on medium or, better yet, on twitter.

Thanks to Alex, Kate, Norman, and Joey for reading drafts!

Interested to join us building more human-like bots? We’re hiring!

Appendix on Sparse Recovery: How can this work?

So how is it possible that you can add up (or average) the word vectors in a sentence, and somehow preserve the meaning? That’s kind of like being told that a class of 10 students scored an average of 75% on a test and trying to figure out what the individual scores were. Well, almost. It turns out to be one of those counterintuitive things about high dimensional geometry.

If you sample 10 vectors from a set of 1000, you can actually find out which ones you picked just from knowing the average, if the vectors are sufficiently high dimension (say 300). It boils down to the fact that there is a lot of space in R³⁰⁰, and so if you sample a pair of vectors at random, you can expect them to be (almost) linearly independent.

We’re not interested in the length of word vectors (they are normalised in the code above), so we can consider the vectors as points on unit sphere in ^d. Say we have a set of N vectors V ⊂ ^d, which are i.i.d on the unit d-sphere. The question is, given some subset S of V, how big does d need to be before we can recover all of S given only x = Σ v_i , the sum of the vectors in S? It’s possible if the dot product between x and any vector v`∉ S is very small, and only vectors in S have v·x ~ 1. 
We can use a result called ‘concentration of measure’ which tells us exactly what we need. For i.i.d points on the unit d-sphere, the expectation value of the dot product between any two, E(v·w) = 1/√d . And the probability that the dot product is greater than a is P(v·w>a) ≤ (1-a²)^(d/2). So we can write down the probability ε that some vector v`∉ S is too close to one of the vectors v∈S in terms of the dimensionality of the space. This gives the result that to reduce the probability of failure to ε, we need d > S log(NS/ε). So if we want to recover a subset of 10 vectors from a total of 1000 with a 1% failure tolerance, we can do it in 138 dimensions or higher.

Going back to the test score analogy, thinking of it this way might make things clearer. Instead of getting the average total score, we now get the average score per question. Now you get the average per-question scores from 10 students, and all we’re saying is that when the test has more questions, it becomes easier to tell which students they were. And that’s not so counterintuitive after all.