Building Rasa NLU custom component for lemmatization with spaCy

Tatiana Parshina
3 min read · Apr 13, 2019

In this post, we will implement a Rasa NLU custom component with lemmatization using the spaCy library for Natural Language Processing (NLP) in Python.

Lemmatization is the process of converting a word to its base form. For example, the base form of “looking” is “look”, and the base form of “laptops” is “laptop”.

Rasa NLU overview

Rasa NLU is an open-source natural language processing tool for intent classification and entity extraction in AI chatbots.

By default, Rasa NLU comes with a bunch of pre-built components for tokenization:

  • tokenizer_whitespace is a tokenizer that uses whitespace as a separator
  • tokenizer_jieba is a tokenizer that uses Jieba, for the Chinese language
  • tokenizer_mitie is a tokenizer that uses MITIE
  • tokenizer_spacy is a tokenizer that uses spaCy

By default, tokenizer_spacy returns the verbatim text content of each token.

Let’s create a custom tokenizer based on tokenizer_spacy which returns the lemma as a token instead of the verbatim text.

Lemmatization with spaCy

Before we begin, let’s install spaCy and download the ‘en’ model:

pip install spacy
python -m spacy download en

The spaCy library provides functionality for:

  • Tokenization is segmenting text into words, punctuation marks, etc.
  • Lemmatization is assigning the base forms of words.

Once you’ve downloaded and installed a model, you can load it via spacy.load(). This will return a Language object (let’s call it nlp) containing all components and data needed to process text. Calling the nlp object on a string of text will return a processed Doc.

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. Each token has attributes:

  • text is the verbatim text content
  • lemma_ is the base form of the token, with no inflectional suffixes

Each Doc consists of individual tokens, and we can iterate over them and output text and lemma for each token:

Output: Text (the original word text) → Lemma (the base form of the word)

  • “they” → “-PRON-” (pronoun)
  • “are” → “be”
  • “looking” → “look”
  • “for” → “for”
  • “new” → “new”
  • “laptops” → “laptop”

Rasa NLU custom component with lemmatization

In Rasa NLU, incoming messages are processed by a sequence of components. These components are executed one after another in a so-called processing pipeline.

Rasa NLU allows you to create a custom Component to perform a specific task which NLU doesn’t currently offer (for example, lemmatization). The code below is based on Rasa NLU’s spacy_tokenizer.py, with text replaced by lemma_:

Rasa NLU pipeline configuration

You should reference the new custom component inside the Rasa NLU pipeline configuration file nlu_config.yml. The name of the component should follow the pattern “module_name.class_name”.

An example of a pipeline configuration file with the lemmatization component:
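One possible nlu_config.yml, assuming the custom component class is named LemmatizationTokenizer and lives in a file lemmatization_tokenizer.py next to the config (both names are placeholders; the remaining entries are the usual spaCy-based components from rasa_nlu 0.x):

```yaml
language: "en"

pipeline:
- name: "nlp_spacy"
- name: "lemmatization_tokenizer.LemmatizationTokenizer"
- name: "intent_featurizer_spacy"
- name: "intent_classifier_sklearn"
```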

Then you can train the Rasa NLU model with the custom component and test how it performs.

Summary

In this post, you have learned how to create a custom component with lemmatization from spaCy and add it to the Rasa NLU pipeline in nlu_config.yml. You can add your own logic in the component to implement custom tokenization rules.


Follow me on Instagram: tatiana.data
