RASA Regex Entity Extraction

RASA, an open source ML framework for building contextual AI assistants and chatbots, keeps improving, and building a bot is getting easier and easier thanks to the team ❤

I’ve seen questions in the RASA forum saying that the regex entity extractor is not working. The docs say:

You can use regular expressions to help the CRF model learn to recognize entities. In your training data (see Training Data Format) you can provide a list of regular expressions, each of which provides the CRFEntityExtractor with an extra binary feature, which says if the regex was found (1) or not (0).

For example, the names of German streets often end in strasse. By adding this as a regex, we are telling the model to pay attention to words ending this way, and will quickly learn to associate that with a location entity.
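
To make the "extra binary feature" idea concrete, here is a minimal sketch in plain Python. It is not Rasa's actual implementation: the whitespace tokenizer, the function name `regex_features_for`, and the two sample patterns are assumptions for illustration only. Each token gets a 1 or 0 per pattern, and the CRF treats that flag as one more hint among many rather than as a hard rule.

```python
import re

# Hypothetical patterns, in the same name + pattern shape as regex features.
PATTERNS = [
    {"name": "street", "pattern": r"\w+strasse\b"},
    {"name": "five_digits", "pattern": r"\b\d{5}\b"},
]


def regex_features_for(text):
    """Return (token, flags) pairs; flags[i] is 1 if PATTERNS[i] overlaps the token."""
    features = []
    for tok in re.finditer(r"\S+", text):  # assumed whitespace tokenizer
        flags = []
        for p in PATTERNS:
            # A token is flagged when any match of the pattern overlaps its span.
            hit = any(
                m.start() < tok.end() and m.end() > tok.start()
                for m in re.finditer(p["pattern"], text)
            )
            flags.append(1 if hit else 0)
        features.append((tok.group(), flags))
    return features


for token, flags in regex_features_for("I live at Hauptstrasse 12 in 10115 Berlin"):
    print(token, flags)
```

The flags are inputs to the model, so a token with all zeros can still be labelled as an entity if its other features resemble the training examples.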

So there are two things you must make sure of for this to work. (a) CRFEntityExtractor is in your NLU pipeline, and RegexFeaturizer is in the pipeline before CRFEntityExtractor. (b) You have regex entries in nlu.md and associated examples under an intent. I created 2 files.

$ tree
.
├── config.yml
└── data
    └── nlu.md

My data looks like this

$ cat data/nlu.md 
## intent:inform
- [AB-123](customer_id)
- [BB-321](customer_id)
- [ABC-1234](product_id)
- [BBB-4321](product_id)
- [12345](transaction_id)
- [23232](transaction_id)
## regex:customer_id
- \b[A-Z]{2}-\d{3}\b
## regex:product_id
- \b[A-Z]{3}-\d{4}\b
## regex:transaction_id
- \b\d{5}\b

And the config file

$ cat config.yml 
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: "en"
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"

Train the data

rasa train nlu

Then start the NLU shell to test my input

rasa shell nlu

First, my input is 5 digits “09877”

{
"intent": null,
"entities": [
{
"start": 0,
"end": 5,
"value": "09877",
"entity": "transaction_id",
"confidence": 0.6067366425710821,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [],
"text": "09877"
}

So far so good. As expected, CRFEntityExtractor successfully extracted my 5-digit input as transaction_id, since my regex says transaction_id should be 5 digits. My next test input is 6 digits, “834343”

{
"intent": null,
"entities": [
{
"start": 0,
"end": 6,
"value": "834343",
"entity": "transaction_id",
"confidence": 0.6067366425710821,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [],
"text": "834343"
}

Hm, it is still detected as transaction_id. Not quite accurate. How about customer_id and product_id? They are kind of similar, right? Here are the results. See my inline comments (prefixed with ##).

Next message:
ABC-123
## Extracted as customer_id. Note that ABC-123 has three letters,
## so strictly it matches neither the customer_id nor the product_id regex
{
"intent": null,
"entities": [
{
"start": 0,
"end": 7,
"value": "abc-123",
"entity": "customer_id",
"confidence": 0.48642198021933686,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [],
"text": "ABC-123"
}
Next message:
ABC-12345
## As a whole, this should match none of the patterns, but it says transaction_id
{
"intent": null,
"entities": [
{
"start": 0,
"end": 9,
"value": "abc-12345",
"entity": "transaction_id",
"confidence": 0.42455430023491963,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [],
"text": "ABC-12345"
}
Next message:
XX-32
## Again, this should match none of the patterns,
## but it says transaction_id
{
"intent": null,
"entities": [
{
"start": 0,
"end": 5,
"value": "xx-32",
"entity": "transaction_id",
"confidence": 0.37426077213495307,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [],
"text": "XX-32"
}

I think this is where the confusion comes from. Why does it kind of work, yet inputs that don't match any regex pattern still get extracted? The regex is only one feature the CRF learns from, not a hard filter. People on the forum then suggest validating this in a custom action. Sure, you can do that, but why not a pure RegexEntityExtractor?
I drafted a RegexEntityExtractor. It is a straightforward extractor: when a pattern says it should match 3 digits, it looks for exactly 3 digits, and otherwise it doesn’t extract anything.
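
The strict behaviour can be checked without Rasa at all. The sketch below applies the three patterns from nlu.md to the inputs tried above using Python's `re` module. The helper name `strict_match` is made up for illustration.

```python
import re

# The three patterns from data/nlu.md.
PATTERNS = {
    "customer_id": r"\b[A-Z]{2}-\d{3}\b",
    "product_id": r"\b[A-Z]{3}-\d{4}\b",
    "transaction_id": r"\b\d{5}\b",
}


def strict_match(text):
    """Return the names of all patterns that occur somewhere in text."""
    return [name for name, pat in PATTERNS.items() if re.search(pat, text)]


for sample in ["09877", "834343", "AB-123", "ABC-123", "ABC-12345", "XX-32"]:
    print(sample, "->", strict_match(sample) or "no match")
```

One thing to watch: ABC-12345 still produces a transaction_id hit, because `\b` treats the hyphen as a word boundary, so the trailing 12345 qualifies on its own. If an ID must be the whole message, anchoring the pattern with `^` and `$` would be one way to express that.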

Try Custom Component RegexEntityExtractor

To add it, create regex.py in your working directory and paste in the following script.

import os
import re
import warnings
from typing import Any, Dict, List, Optional, Text

import rasa.utils.io
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.extractors import EntityExtractor
from rasa.nlu.model import Metadata
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.utils import write_json_to_file


class RegexEntityExtractor(EntityExtractor):
    # This extractor may be kind of extreme: it takes the user's message
    # and returns whatever the regex patterns match.
    # Confidence is always 1.0, just like Duckling.
    provides = ["entities"]

    def __init__(
        self,
        component_config: Optional[Dict[Text, Text]] = None,
        regex_features: Optional[List[Dict[Text, Any]]] = None
    ) -> None:
        super(RegexEntityExtractor, self).__init__(component_config)
        # List of {"name": ..., "pattern": ...} dicts taken from nlu.md
        self.regex_feature = regex_features if regex_features else []

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:
        # Nothing to learn: just remember the patterns from the training data
        self.regex_feature = training_data.regex_features

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional[Metadata] = None,
        cached_component: Optional["RegexEntityExtractor"] = None,
        **kwargs: Any
    ) -> "RegexEntityExtractor":
        file_name = meta.get("file")

        if not file_name:
            regex_features = None
            return cls(meta, regex_features)

        # w/o string cast, mypy will tell me
        # expected "Union[str, _PathLike[str]]"
        regex_pattern_file = os.path.join(str(model_dir), file_name)
        if os.path.isfile(regex_pattern_file):
            regex_features = rasa.utils.io.read_json_file(regex_pattern_file)
        else:
            regex_features = None
            warnings.warn(
                "Failed to load regex pattern file from '{}'".format(regex_pattern_file)
            )
        return cls(meta, regex_features)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        """Persist this component to disk for future loading."""
        if self.regex_feature:
            file_name = file_name + ".json"
            regex_feature_file = os.path.join(model_dir, file_name)
            write_json_to_file(
                regex_feature_file,
                self.regex_feature, separators=(",", ": "))
            return {"file": file_name}
        else:
            return {"file": None}

    def match_regex(self, message):
        extracted = []
        for d in self.regex_feature:
            # finditer picks up every occurrence, not only the first one
            for match in re.finditer(d["pattern"], message):
                entity = {
                    # start()/end() give the span of the matched text;
                    # pos/endpos would be the bounds of the searched string
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                    "confidence": 1.0,
                    "entity": d["name"],
                }
                extracted.append(entity)
        extracted = self.add_extractor_name(extracted)
        return extracted

    def process(self, message: Message, **kwargs: Any) -> None:
        """Process an incoming message."""
        extracted = self.match_regex(message.text)
        message.set(
            "entities", message.get("entities", []) + extracted, add_to_output=True
        )
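
One detail worth checking when reporting entity spans: on a Python match object, `start()` and `end()` give the position of the matched text, whereas `pos` and `endpos` are the bounds of the region that was searched. The two only coincide when the ID is the entire message, as in the tests in this post. Here is a small standalone check; the sample sentence is made up.

```python
import re

text = "my transaction number is 12345, thanks"
match = re.search(r"\b\d{5}\b", text)

# Span of the matched text: this is what an entity's start/end should be.
print(match.start(), match.end(), text[match.start():match.end()])

# Bounds of the searched region: the whole string by default.
print(match.pos, match.endpos)
```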

Then add your current working directory to PYTHONPATH

export PYTHONPATH=$(pwd):$PYTHONPATH

Add this component to config.yml

$ cat config.yml 
# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: "en"
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "regex.RegexEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"

So now my directories look like this

├── config.yml
├── data
│ └── nlu.md
├── models
│ ├── nlu-20190629-152221.tar.gz
│ └── nlu-20190629-152550.tar.gz
└── regex.py

Train again

rasa train nlu

Okay let’s test. Start nlu shell.

rasa shell nlu

First, I entered 5 digits. Both extractors extracted as transaction_id. Expected.

Next message:
09877
{
"intent": null,
"entities": [
{
"start": 0,
"end": 5,
"value": "09877",
"entity": "transaction_id",
"confidence": 0.7534748414773789,
"extractor": "CRFEntityExtractor"
},
{
"start": 0,
"end": 5,
"value": "09877",
"confidence": 1.0,
"entity": "transaction_id",
"extractor": "RegexEntityExtractor"
}
],
"intent_ranking": [],
"text": "09877"
}

Next, I entered 7 digits. Remember, transaction_id should be 5 digits, but CRFEntityExtractor extracted it anyway, while RegexEntityExtractor did not.

Next message:
9878987
{
"intent": null,
"entities": [
{
"start": 0,
"end": 7,
"value": "9878987",
"entity": "transaction_id",
"confidence": 0.5528128173709483,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [],
"text": "9878987"
}

Remember, customer_id should be [A-Z]{2}-\d{3} and product_id should be [A-Z]{3}-\d{4}. First, my input is XZ-321. This should match customer_id. Both extractors extracted it as expected.

XZ-321  
{
"intent": null,
"entities": [
{
"start": 0,
"end": 6,
"value": "xz-321",
"entity": "customer_id",
"confidence": 0.6541966803942679,
"extractor": "CRFEntityExtractor"
},
{
"start": 0,
"end": 6,
"value": "XZ-321",
"confidence": 1.0,
"entity": "customer_id",
"extractor": "RegexEntityExtractor"
}
],
"intent_ranking": [],
"text": "XZ-321"
}

How about XZZ-321? This should match neither product_id nor customer_id, but again CRFEntityExtractor picked it up incorrectly.

XZZ-321
{
"intent": null,
"entities": [
{
"start": 0,
"end": 7,
"value": "xzz-321",
"entity": "customer_id",
"confidence": 0.3333333333333333,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [],
"text": "XZZ-321"
}

How about this? !@#!@#@#. Only CRFEntityExtractor returns anything, and what it returns is noise.

!@#!@#@#
{
"intent": null,
"entities": [
{
"start": 0,
"end": 1,
"value": "!",
"entity": "customer_id",
"confidence": 0.3333333333333333,
"extractor": "CRFEntityExtractor"
},
{
"start": 1,
"end": 7,
"value": "@#!@#@",
"entity": "customer_id",
"confidence": 0.3333333333333333,
"extractor": "CRFEntityExtractor"
},
{
"start": 7,
"end": 8,
"value": "#",
"entity": "customer_id",
"confidence": 0.3333333333333333,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [],
"text": "!@#!@#@#"
}

Summary

Sure, this can be done at the custom action level, but then you are replicating the logic in both nlu.md and the custom action. If you define the regex pattern only in nlu.md and let RegexEntityExtractor filter things for you, you can keep your code DRYer. You also get predictable, character-level pattern matching (just like Duckling), which comes in handy when you have a custom pattern that Duckling doesn’t cover.