Building Rasa NLU custom component for lemmatization with spaCy
In this post, we will implement a Rasa NLU custom component with lemmatization using the spaCy library for Natural Language Processing (NLP) in Python.
Lemmatization is the process of converting a word to its base form. For example, the base form of “looking” is “look”, and the base form of “laptops” is “laptop”.
Rasa NLU overview
Rasa NLU is an open-source natural language processing tool for intent classification and entity extraction in AI chatbots.
By default, Rasa NLU comes with a bunch of pre-built components for tokenization:
- tokenizer_whitespace is a tokenizer that uses whitespace as a separator
- tokenizer_jieba is a tokenizer that uses Jieba for the Chinese language
- tokenizer_mitie is a tokenizer that uses MITIE
- tokenizer_spacy is a tokenizer that uses spaCy
By default, tokenizer_spacy returns the verbatim text content as tokens.
Let’s create a custom tokenizer based on tokenizer_spacy that returns the lemma as a token instead of the verbatim text.
Lemmatization with spaCy
Before we begin, let’s install spaCy and download the ‘en’ model:
pip install spacy
python -m spacy download en
spaCy library provides functionality for:
- Tokenization is segmenting text into words, punctuation marks, etc.
- Lemmatization is assigning the base forms of words.
Once you’ve downloaded and installed a model, you can load it via spacy.load(). This will return a Language object (let’s call it nlp) containing all components and data needed to process text. Calling the nlp object on a string of text will return a processed Doc.
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. Each token has attributes:
- text is the verbatim text content
- lemma_ is the base form of the token, with no inflectional suffixes
Doc consists of individual tokens, and we can iterate over them and output text and lemma for each token:
Output: Text (the original word text) → Lemma (the base form of the word)
- “they” → “-PRON-” (pronoun)
- “are” → “be”
- “looking” → “look”
- “for” → “for”
- “new” → “new”
- “laptops” → “laptop”
Rasa NLU custom component with lemmatization
In Rasa NLU, incoming messages are processed by a sequence of components. These components are executed one after another in a so-called processing pipeline.
Rasa NLU allows creating a custom Component to perform a specific task which NLU doesn’t currently offer (for example, lemmatization). The code below is based on Rasa NLU’s spacy_tokenizer.py, with text replaced by lemma_:
Rasa NLU pipeline configuration
You should reference the new custom component inside the Rasa NLU pipeline configuration file nlu_config.yml. The name of the component should follow the pattern “module_name.class_name”.
An example of the pipeline configuration file with the lemmatization component:
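A possible nlu_config.yml, assuming the custom class is named SpacyLemmaTokenizer and saved in a module called my_lemma_tokenizer.py at the project root (both names are placeholders for wherever you put your component):

```yaml
language: "en"

pipeline:
- name: "nlp_spacy"
- name: "my_lemma_tokenizer.SpacyLemmaTokenizer"
- name: "intent_featurizer_spacy"
- name: "intent_classifier_sklearn"
```

The custom tokenizer must come after nlp_spacy in the pipeline, because it requires the spacy_doc that nlp_spacy provides.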
Then you can train the Rasa NLU model with the custom component and test how it performs.
In this post, you have learned how to create a custom component with lemmatization from spaCy and add it to the Rasa NLU pipeline in nlu_config.yml. You can add your own logic in the component to implement custom tokenization rules.
- Rasa NLU pipeline component configuration
- spaCy Linguistic Features
- The full source code: https://github.com/TatianaParshina/rasa_chatbot
Follow me on Instagram: tatiana.data