Natural Language Pipeline for Chatbots

Pavel Surmenok
Nov 6, 2016


Chatbot developers usually use two technologies to make the bot understand the meaning of user messages: machine learning and hardcoded rules. See more details on chatbot architecture in my previous article.

Machine learning can help you identify the intent of a message and extract named entities. It is quite powerful but requires lots of data to train the model. A rule of thumb is to have around 1,000 examples per class for classification problems.

If you don’t have enough labeled data, you can handcraft rules that identify the intent of a message. Rules can be as simple as “if a sentence contains the words ‘pay’ and ‘order’, then the user is asking to pay for an order”. The simplest implementation in your favorite programming language could look like this:
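A minimal sketch in Python (the function name is illustrative):

```python
def detect_pay_order_intent(message: str) -> bool:
    # Naive rule: the intent is "pay for an order" if both keywords
    # appear anywhere in the message.
    text = message.lower()
    return "pay" in text and "order" in text

# detect_pay_order_intent("How can I pay for my order?") -> True
```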

Any intent classification code can make errors of two types. False positives: the user doesn’t express an intent, but the chatbot identifies an intent. False negatives: the user expresses an intent, but the chatbot doesn’t find it. This simple solution will make lots of errors:

  1. The user can use the words “pay” and “order” in different sentences: “I made an order by mistake. I won’t pay.”
  2. A keyword is a substring of another word: “Can I use paypal for order #123?”
  3. Spelling errors: “My orrder number is #123. How can I pay?”
  4. Different forms of words: “How can I pay for my orders?”

Your chatbot needs a preprocessing NLP pipeline to handle typical errors. It may include these steps:

1. Spellcheck

Get the raw input and fix spelling errors. You can do something very simple or build a spell checker using deep learning.
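A very simple version can be built with the standard library alone: match each word against a known vocabulary by edit similarity. This is only a sketch under the assumption that you have a vocabulary list for your domain; the cutoff value is arbitrary.

```python
import difflib

# Hypothetical domain vocabulary; in practice this would be much larger.
VOCABULARY = ["pay", "order", "number", "how", "can", "my", "for"]

def correct_word(word, vocabulary=VOCABULARY):
    # Return the closest vocabulary word, or the original if nothing is close.
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word

# correct_word("orrder") -> "order"
```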

2. Split into sentences

It is very helpful to analyze every sentence separately. Splitting text into sentences is easy: you can use one of the NLP libraries, e.g. NLTK, StanfordNLP, or SpaCy.
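For illustration, a crude regex-based splitter; a library sentence tokenizer is far more robust (abbreviations, decimals, quotes):

```python
import re

def split_sentences(text):
    # Rough sketch: split on ., !, or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# split_sentences("I made an order by mistake. I won't pay.")
# -> ["I made an order by mistake.", "I won't pay."]
```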

3. Split into words

This is also very important because hardcoded rules typically operate on words. The same NLP libraries can do it.
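Again, a regex sketch that also keeps order numbers like “#123” together; a library tokenizer handles punctuation and contractions better:

```python
import re

def tokenize(sentence):
    # Lowercase words and "#"-prefixed numbers; trailing punctuation is dropped.
    return re.findall(r"#?\w+", sentence.lower())

# tokenize("Can I use paypal for order #123?")
# -> ["can", "i", "use", "paypal", "for", "order", "#123"]
```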

4. POS tagging

Some words have multiple meanings, for example “charge” as a noun and “charge” as a verb. Knowing the part of speech can help disambiguate the meaning. You can use the same NLP libraries, or Google SyntaxNet, which is a little more accurate and supports multiple languages.
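Real taggers use trained statistical models, but the idea can be shown with a toy hand-written disambiguation rule (the context word lists are invented for the example):

```python
def tag_charge(tokens):
    # Toy rule: "charge" after "to" or a pronoun is a verb (VB),
    # otherwise a noun (NN). Other tokens are left untagged ("?").
    verb_contexts = {"to", "you", "they", "we", "i"}
    tags = []
    for i, tok in enumerate(tokens):
        if tok == "charge":
            prev = tokens[i - 1] if i > 0 else ""
            tags.append("VB" if prev in verb_contexts else "NN")
        else:
            tags.append("?")
    return list(zip(tokens, tags))

# tag_charge(["will", "you", "charge", "my", "card"])[2] -> ("charge", "VB")
```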

5. Lemmatize words

One word can have many forms: “pay”, “paying”, “paid”. In many cases, the exact form of the word is not important for writing a hardcoded rule. If the preprocessing code can identify a lemma, the canonical form of the word, it helps simplify the rule. Lemmatization, identifying lemmas, is based on dictionaries that list all forms of every word. The most popular dictionary for English is WordNet. NLTK and some other libraries allow using it for lemmatization.
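The dictionary lookup itself is trivial; here is a sketch with a tiny hand-made lemma table (NLTK’s WordNet-based lemmatizer does the same lookup against the full dictionary, plus morphological rules):

```python
# Tiny illustrative lemma table; real systems use WordNet.
LEMMAS = {"paying": "pay", "paid": "pay", "pays": "pay", "orders": "order"}

def lemmatize(word):
    # Fall back to the lowercased word itself when no lemma is known.
    return LEMMAS.get(word.lower(), word.lower())

# lemmatize("paid") -> "pay"
```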

6. Entity recognition: dates, numbers, proper nouns

Dates and numbers can be expressed in different formats: “3/1/2016”, “1st of March”, “next Wednesday”, “2016–03–01”, “123”, “one hundred”, etc. It may be helpful to convert them to a unified format before doing pattern matching. Other entities that require special treatment: locations (countries, regions, cities, street addresses, places), people, phone numbers.
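As one narrow example, a regex can rewrite M/D/YYYY dates into ISO format before pattern matching; a real entity recognizer would also handle “next Wednesday”, “1st of March”, and so on:

```python
import re

def normalize_dates(text):
    # Sketch: rewrite M/D/YYYY into YYYY-MM-DD (assumes US month-first order).
    def to_iso(match):
        month, day, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", to_iso, text)

# normalize_dates("Delivery on 3/1/2016") -> "Delivery on 2016-03-01"
```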

7. Find concepts/synonyms

If you want to search for a breed of dog, you don’t want to list all the dog breeds in the rule, because there are hundreds of them. It is better if the preprocessing code identifies a dog breed in the message and marks the word with a special tag. Then you can just look for that tag when applying the rule.

WordNet can be used to identify common concepts. You may need to add domain-specific concept libraries, e.g. a list of drug names if you are building a healthcare bot.
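Concept tagging can be sketched as a lookup table mapping words to concept tags; WordNet hypernyms or a domain-specific list could populate such a table automatically (the tag name and entries below are invented for the example):

```python
# Hand-made concept table for illustration.
CONCEPTS = {
    "labrador": "~dog_breed",
    "poodle": "~dog_breed",
    "beagle": "~dog_breed",
}

def tag_concepts(tokens):
    # Replace known words with their concept tag; leave others unchanged.
    return [CONCEPTS.get(t.lower(), t) for t in tokens]

# tag_concepts(["my", "Poodle", "barks"]) -> ["my", "~dog_breed", "barks"]
```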

After preprocessing is done you have a nice clean list of sentences and lists of words inside each sentence. Each word is marked with a part of speech and concepts, and you have a lemma for every word. The next step is to define patterns for intent identification.

You can invent your own pattern language using the common logical operators AND, OR, and NOT. If you create an internal DSL (domain-specific language) based on Python, a rule can look like this:
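One way to sketch such an internal DSL is with combinators that return predicate functions over the preprocessed list of lemmas (all names below are illustrative):

```python
def has(word):
    # Predicate: the lemma appears in the sentence.
    return lambda lemmas: word in lemmas

def AND(*rules):
    return lambda lemmas: all(rule(lemmas) for rule in rules)

def OR(*rules):
    return lambda lemmas: any(rule(lemmas) for rule in rules)

def NOT(rule):
    return lambda lemmas: not rule(lemmas)

# Rule: the user asks to pay for an order, unless they mention cancelling.
pay_order = AND(has("pay"), has("order"), NOT(has("cancel")))

# pay_order(["how", "can", "i", "pay", "for", "my", "order"]) -> True
```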

Alternatively, you can invent an external DSL, which can be more readable, but you will need extra work to create a compiler or an interpreter for that language. If you use the ChatScript language, a rule can look like this:
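A rough sketch of what a ChatScript-style topic and rule for the pay-for-order intent might look like (the topic name, rule label, and response text are illustrative):

```
topic: ~payment ( pay order )

u: PAY_ORDER ( pay order ) How would you like to pay for your order?
```

Here the parenthesized pattern matches the keywords in order within the user’s input, and the interpreter, rather than hand-written code, takes care of lemmas and word boundaries.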

Do you use a chatbot engine with hardcoded rules? Have you developed your own? What issues have you encountered when building or using a chatbot engine? Please share in the comments!
