NLP Behind Chatbots — Demystifying RasaNLU — #1 — Training

Published in

bhavaniravi

7 min read · Jul 11, 2018

Chatbots & Me

I was one of the early adopters of chatbot technology. I have been building chatbots for the past two years, consuming NLP platforms like Microsoft LUIS, Dialogflow, wit.ai, etc. When I wanted to understand the bare bones of it, Ashish Cherian came across Rasa, and we built a chatbot using it and conducted a workshop at PyDelhi Conference 2017.

NLP & Me

This year I wanted to sharpen my ML skills, and I narrowed my focus to just NLP. After a round of tokenizing, POS tagging, topic modeling, and text classification, it was time to put it all together into a chatbot framework, but I had no idea how to go about it.

It was around the same time that Rasa sent me their newsletter with a call for first PRs. I forked the repo, set up the environment, and played around with it a bit.

About a week later, the ML expert Siraj Raval announced the 100DaysOfMLCode challenge, and that boosted my enthusiasm to a different level, making me want to do it for sure. I decided to dedicate the whole 100 days to NLP, starting with reading and understanding Rasa NLU’s code base.

Before you jump in

This blog is for those who want to understand the ML concepts that drive chatbot technologies. If you are a complete newbie to chatbots & NLP, I strongly recommend that you go through the following links, understand the basics, and build a chatbot using RasaNLU before diving deeper.

Disclaimer:

If you are planning to move forward without trying the links above, it is going to be super hard for you to understand the rest of the blog.

RasaNLU is built using Python. Basic Python knowledge will help you understand the code snippets.

Reading code — The starting point

The starting point of any repository can be found by looking at its documentation. It will be the first file you import from the package or the file you hit from the command line.

RasaNLU has two entry points — Train and Server.

1. The training part (train.py) generates an ML model from the training data you feed it.

$ python -m rasa_nlu.train \
    --config sample_configs/config_spacy.yml \
    --data data/examples/rasa/demo-rasa.json \
    --path projects

2. The server part is where the generated ML model is served via an API.

$ python -m rasa_nlu.server --path projects

Since server.py needs the model generated by train.py, let’s start with the training part.

Training

In this blog, we are going to explore the training part. Here we feed in the training JSON file along with a few configuration details and get trained ML models at the end.

1. Configuration

In this step, the command line arguments passed to train.py are parsed and loaded into a configuration object, cfg. The training configuration is defined by config_spacy.yml. It contains two main pieces of information: the language of your bot and the NLP library to use.

$ cat config_spacy.yml
language: "en"
pipeline: "spacy_sklearn"

The cfg object also holds the path to your training data and path to store models after the training is complete.
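As a rough sketch of this step, the snippet below parses the three flags from the training command with argparse and merges them with the two config keys into one dictionary. The parse_simple_config helper is hypothetical: it only handles flat key: "value" lines and stands in for a real YAML parser. It is not Rasa’s own loader.

```python
import argparse


def build_parser():
    # Flags mirror the training command shown earlier in the post
    parser = argparse.ArgumentParser(description="train a model (sketch)")
    parser.add_argument("--config", required=True, help="pipeline config file")
    parser.add_argument("--data", required=True, help="training data file")
    parser.add_argument("--path", default="projects", help="where to store models")
    return parser


def parse_simple_config(text):
    # Hypothetical stand-in for a YAML parser: flat `key: "value"` lines only
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        cfg[key.strip()] = value.strip().strip('"')
    return cfg


args = build_parser().parse_args([
    "--config", "sample_configs/config_spacy.yml",
    "--data", "data/examples/rasa/demo-rasa.json",
    "--path", "projects",
])
cfg = parse_simple_config('language: "en"\npipeline: "spacy_sklearn"')
# The config also carries the data path and the model output path
cfg.update(data=args.data, path=args.path)
print(cfg)
```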

2. Loading the training data

With RasaNLU you can read the training data from your local machine or from an external API. This comes in handy when you want to fetch data from a pre-existing database. In that case, you need to write an API layer on top of your database that returns data in a format consistent with Rasa’s training data format.

The load_data function reads the data from the respective paths and returns a TrainingData object.

{
  "text": "show me chinese restaurants",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 8,
      "end": 15,
      "value": "chinese",
      "entity": "cuisine",
      "extractor": "ner_crf",
      "confidence": 0.854,
      "processors": []
    }
  ]
}
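To make the format concrete, here is a minimal sketch of a loader that reads examples in the shape shown above and checks that each entity’s start and end offsets point at its value. The Example class and load_examples function are illustrative names, not Rasa’s load_data or TrainingData. The strict offset check is a simplification for this example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Example:
    text: str
    intent: str
    entities: list = field(default_factory=list)


def load_examples(raw_json):
    """Parse a JSON list of examples and validate entity offsets."""
    examples = []
    for item in json.loads(raw_json):
        for ent in item.get("entities", []):
            # The character span [start, end) should spell out the value
            span = item["text"][ent["start"]:ent["end"]]
            if span != ent["value"]:
                raise ValueError(f"offsets give {span!r}, expected {ent['value']!r}")
        examples.append(Example(item["text"], item["intent"], item.get("entities", [])))
    return examples


raw = json.dumps([{
    "text": "show me chinese restaurants",
    "intent": "restaurant_search",
    "entities": [{"start": 8, "end": 15, "value": "chinese", "entity": "cuisine"}],
}])
examples = load_examples(raw)
print(examples[0].intent, examples[0].entities[0]["entity"])
```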

3. Training the ML model

In this step, the loaded TrainingData is fed into an NLP pipeline and gets converted into an ML model. A spaCy pipeline looks something like the one in the image.

The first step is to create a Trainer object which takes the configuration parameter cfg and builds a pipeline. A pipeline is made up of components. Each component is responsible for a specific NLP operation.

The Trainer.train function iterates through the pipeline and performs the NLP task defined by each component. You can think of the train function as a controller that hands control over to the different components in the pipeline and updates the context with the output or info derived from each component.

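# context accumulates whatever earlier components produced
context = {}
for i, component in enumerate(self.pipeline):
    # every component sees the training data, the config and the context so far
    updates = component.train(working_data, self.config, **context)
    if updates:
        context.update(updates)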
Though there is a single pipeline, I am going to split it into three parts.

The preprocessing step: the data is transformed to extract the required information
Entity Extractor & Intent Classifier: the preprocessed data is used to create the ML models that perform intent classification and entity extraction
Persistence: storing the result

Preprocessing

3.1 SpacyNLP

To use spaCy, we need to create spaCy’s NLP object for the language provided in the configuration file. If spaCy does not support that language, it throws an error.

>>> import spacy
>>> nlp = spacy.load('en')

3.2 SpacyTokenizer

This step converts each training sample from your training file into a list of tokens (words). At the end of this step, we have a bag of words.

>>> tokens = nlp("Suggest me a chinese food")
>>> [token.text for token in tokens]
['Suggest', 'me', 'a', 'chinese', 'food']
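Later steps, such as entity extraction, need to know where each token sits in the sentence, not just its text. The sketch below keeps a character offset per token. It uses a plain whitespace split as a stand-in for spaCy’s tokenizer, and the Token class here is an illustrative assumption rather than Rasa’s or spaCy’s own type.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    offset: int  # index of the token's first character in the sentence

    @property
    def end(self):
        # index just past the token's last character
        return self.offset + len(self.text)


def whitespace_tokenize(sentence):
    # Each run of non-space characters becomes one token
    return [Token(m.group(), m.start()) for m in re.finditer(r"\S+", sentence)]


tokens = whitespace_tokenize("Suggest me a chinese food")
print([(t.text, t.offset, t.end) for t in tokens])
```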

3.3 SpacyFeaturizer

Now that we have the bag of words, we can feed them into the ML algorithms. However, an ML algorithm understands only numerical data. It is the featurizer’s job to convert tokens into word vectors. At the end of this step, we will have a list of numbers that makes sense only to ML models. spaCy’s tokens come with a vector attribute, which makes this conversion easy.

>>> features = [token.vector for token in tokens]
[ 1.77235818e+00  2.89104319e+00  1.34855950e+00  4.57144260e-01
 -1.24784541e+00  3.25931263e+00 -6.40985250e-01 -1.46328235e+00
 -5.12969136e-01 -2.17798877e+00 -3.69897425e-01  4.26086336e-01...
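A classifier needs one fixed-size vector per sentence, however many tokens the sentence has. A common way to get one, and the default behind spaCy’s doc.vector, is to average the token vectors. The sketch below shows that averaging with a tiny hand-made vector table. The three-dimensional values are invented for illustration and have nothing to do with real spaCy vectors.

```python
# Toy 3-dimensional word vectors; the values are made up for illustration
TOY_VECTORS = {
    "show": [1.0, 0.0, 2.0],
    "me": [0.0, 1.0, 0.0],
    "restaurants": [2.0, 2.0, 1.0],
}
DIM = 3


def sentence_vector(tokens):
    """Average the token vectors; unknown tokens count as zero vectors."""
    total = [0.0] * DIM
    for tok in tokens:
        vec = TOY_VECTORS.get(tok.lower(), [0.0] * DIM)
        total = [a + b for a, b in zip(total, vec)]
    return [x / len(tokens) for x in total] if tokens else total


features = sentence_vector(["show", "me", "restaurants"])
print(features)
```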

3.4 RegexFeaturizer

RasaNLU supports regex in training samples, e.g. for capturing entities like zip codes or mobile numbers. In such cases, RegexFeaturizer looks for the regex patterns in the training examples and marks 1.0 if the pattern matches, else 0.0. This step does not involve spaCy, as the functionality is particular to Rasa.

import re

found = []
for i, exp in enumerate(self.known_patterns):
    # search the message text for this regex pattern
    match = re.search(exp["pattern"], message.text)
    if match is not None:
        found.append(1.0)  # pattern found in the message
    else:
        found.append(0.0)
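The same idea as a standalone function you can run. The two patterns are hypothetical examples, and the list-of-dicts shape simply follows the exp["pattern"] access in the loop above.

```python
import re

# Hypothetical patterns, in the shape used by the loop above
KNOWN_PATTERNS = [
    {"name": "zipcode", "pattern": r"\b\d{5}\b"},
    {"name": "greet", "pattern": r"\bhey\b"},
]


def regex_features(text, known_patterns=KNOWN_PATTERNS):
    """Return one 1.0/0.0 flag per known pattern."""
    return [
        1.0 if re.search(p["pattern"], text) is not None else 0.0
        for p in known_patterns
    ]


print(regex_features("restaurants near 10115"))
print(regex_features("hey there"))
```

The resulting flags can be appended to the word-vector features, so the downstream models get an extra hint whenever a pattern fires.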

Entity Extraction

3.5 NER_CRF EntityExtractor

NER_CRF is one of the well-known algorithms used to perform named entity extraction. NER stands for Named Entity Recognition, and CRF stands for Conditional Random Fields, the statistical model that drives entity extraction.

"entities": [
    {
      "start": 8,
      "end": 15,
      "value": "chinese",
      "entity": "cuisine",
      "extractor": "ner_crf",
      "confidence": 0.854,
      "processors": []
    }
]

The extractor parameter in the training data lets you choose between the extractors supported by Rasa, and the processors parameter defines the list of operations to be done before NER. RasaNLU’s documentation explains the different extractors and their use cases.

The above entity example goes through a series of transformations before being fed to the CRF algorithm. At the end of the training, a CRF ML model trained with the entity samples is generated.

NER is a vast topic that deserves a separate blog. For a deep dive into how this ML model is built, refer to the following resources.
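One transformation that is commonly done before training a sequence model like a CRF is turning character offsets into one tag per token. The sketch below uses the simple BIO scheme (B = begins an entity, I = inside an entity, O = outside) and whitespace tokens. Both choices are assumptions for illustration, since Rasa’s extractor may use a different tagging scheme and richer per-token features.

```python
import re


def bio_tags(text, entities):
    """Label each whitespace token with B-/I-<entity> or O."""
    tagged = []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for ent in entities:
            # A token belongs to an entity if it lies inside the entity span
            if m.start() >= ent["start"] and m.end() <= ent["end"]:
                prefix = "B" if m.start() == ent["start"] else "I"
                tag = f"{prefix}-{ent['entity']}"
                break
        tagged.append((m.group(), tag))
    return tagged


entities = [{"start": 8, "end": 15, "value": "chinese", "entity": "cuisine"}]
print(bio_tags("show me chinese restaurants", entities))
```

The CRF then learns to predict the tag column from features of each token and its neighbours.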

3.6 EntitySynonymMapper

EntitySynonymMapper generates a mapping between the entity and its synonyms provided by the training file. The chatbot you build should be able to understand every variation of the entity. This mapper handles different variations of a single entity.

# Input -> training_data.json
"entity_synonyms": [
      {
        "value": "vegetarian",
        "synonyms": ["veg", "vegg", "veggie"]
      }
]

# Output -> entity_synonyms.json
{"veg": "vegetarian", "vegg": "vegetarian", "veggie": "vegetarian"}
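A minimal sketch of how such a mapping can be built and then applied to extracted entities. The function names are illustrative, and lowercasing the lookup key is an assumption made for this example rather than a statement about Rasa’s mapper.

```python
def build_synonym_map(entity_synonyms):
    """Flatten value/synonyms records into a synonym -> value lookup."""
    mapping = {}
    for record in entity_synonyms:
        for synonym in record["synonyms"]:
            mapping[synonym.lower()] = record["value"]
    return mapping


def normalize_entities(entities, mapping):
    """Replace each entity value with its canonical form, if one is known."""
    return [
        {**ent, "value": mapping.get(ent["value"].lower(), ent["value"])}
        for ent in entities
    ]


synonym_map = build_synonym_map(
    [{"value": "vegetarian", "synonyms": ["veg", "vegg", "veggie"]}]
)
print(normalize_entities([{"entity": "cuisine", "value": "Veggie"}], synonym_map))
```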

Intent Classification

3.7 SklearnIntentClassifier

This classifier uses sklearn’s SVC with GridSearchCV. The intent names, after label encoding, serve as the labels, and the text features generated by the featurizer serve as the data used to train an ML model.

>>> from sklearn.preprocessing import LabelEncoder
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import SVC
# training data (features)
>>> X_train
[ 1.77235818e+00  2.89104319e+00  1.34855950e+00 4.57144260e-01 -1.24784541e+00  3.25931263e+00 -6.40985250e-01 -1.46328235e+00...
# labels (LabelEncoder numbers the classes in sorted order)
>>> Y = ["greet", "bye", "restaurant_search", "greet"...
>>> Y_train = LabelEncoder().fit_transform(Y)
>>> Y_train
[1, 0, 2, 1...
# training the ML model
>>> clf = GridSearchCV(SVC(...))
>>> clf.fit(X_train, Y_train)
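The label encoding step is easy to reproduce without sklearn. sklearn’s LabelEncoder assigns integers to the distinct labels in sorted order, and the pure-Python sketch below mimics that. The function name is made up for this example.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


intents = ["greet", "bye", "restaurant_search", "greet"]
classes, label_to_id = fit_label_encoder(intents)
y_train = [label_to_id[label] for label in intents]
print(y_train)

# After prediction, integer ids are mapped back to intent names
predicted_id = 2
print(classes[predicted_id])
```

Keeping the classes list around matters: the classifier only ever predicts integers, so the encoder has to be saved with the model to translate them back into intent names.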

4. Storing the model in a persisted path

Rasa lets you store the trained model in cloud storage such as AWS, GCS, or Microsoft Azure, or on your local system. The persisted_path parameter defines that configuration for you and stores the trained model in the respective location. The final output after training through all the pipeline components is an Interpreter object, which generates and saves the following files that are later used during the serve step.

crf_model.pkl
entity_synonyms.json
intent_classifier_sklearn.pkl
regex_featurizer.json
training_data.json
model_metadata.json
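The general pattern behind this file list is that each component writes its own artifact and a metadata file records what was written. The sketch below illustrates that pattern for local storage only, with pickle and json. The persist function, the placeholder objects, and the metadata keys are assumptions for illustration, not Rasa’s actual persistence code.

```python
import json
import os
import pickle
import tempfile


def persist(model_dir, artifacts, metadata):
    """Write each artifact to its own file and record the file names in metadata."""
    os.makedirs(model_dir, exist_ok=True)
    for file_name, obj in artifacts.items():
        path = os.path.join(model_dir, file_name)
        if file_name.endswith(".json"):
            with open(path, "w") as f:
                json.dump(obj, f)
        else:
            # Non-JSON artifacts (e.g. trained models) are pickled
            with open(path, "wb") as f:
                pickle.dump(obj, f)
    metadata = dict(metadata, files=sorted(artifacts))
    with open(os.path.join(model_dir, "model_metadata.json"), "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata


model_dir = os.path.join(tempfile.mkdtemp(), "model_sketch")
saved = persist(
    model_dir,
    {
        "entity_synonyms.json": {"veggie": "vegetarian"},
        "intent_classifier_sklearn.pkl": {"placeholder": "a trained classifier"},
    },
    {"language": "en", "pipeline": "spacy_sklearn"},
)
print(sorted(os.listdir(model_dir)))
```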


In the next part, I will cover how Rasa NLU enables you to consume these ML models via an API.