NERwhal: A multi-lingual suite for named-entity recognition
This blog post introduces the Python package NERwhal, which we developed for our open document anonymization app OpenRedact. OpenRedact is a Prototype Fund project, supported by the German Federal Ministry of Education and Research.
If you are tasked with English-language named-entity recognition, you will likely find many pre-trained deep learning models to base your work on. Much natural language processing research focuses on English, and this is where most training data and models are produced. For English, freely available named-entity recognition (NER) models can recognize 18 different categories out of the box or be highly specific to a certain domain. But if you are tasked with recognizing named entities in a language with fewer resources, such as German, as OpenRedact does, there are fewer models available. Pre-trained German models recognize only 4 different categories, and there is generally much less training data. While training your own deep learning model always remains an option, it may not be economical to obtain a sufficient number of labeled training examples.
In such cases, you need an economical method to detect named entities. This is where rule-based approaches come to the rescue: they were established well before the rise of deep learning and are still effective and widely used. For scenarios where a deep learning approach alone is not sufficient, we have developed the NERwhal suite. NERwhal combines different rule-based and statistical recognition methods and provides an interface to quickly define your own recognizers.
How does it work?
Under the hood, NERwhal uses several recognition methods and makes them available behind a unified API:
- Regular expressions: A regular expression lets you define a named entity as the set of strings that match a pattern.
- Entity Ruler: spaCy’s Entity Ruler lets you define patterns for sequences of tokens.
- FlashText: The FlashText algorithm can search texts very efficiently for long lists of keywords.
- Deep Learning: The Stanza library and models (which provide state-of-the-art results for NER in many languages) power NERwhal’s statistical recognition. Stanza currently provides NER models for 8 languages.
At present, you can define custom recognizers based on regular expressions, the Entity Ruler or FlashText.
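To give a feel for what a regular-expression recognizer does, here is a minimal sketch in plain Python. It does not use NERwhal's actual API: the `Match` and `RegexRecognizer` names, the default score, and the simplified date pattern are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Match:
    """A recognized entity span (hypothetical type, not NERwhal's own)."""
    start: int
    end: int
    tag: str
    text: str
    score: float


class RegexRecognizer:
    """Hypothetical recognizer that tags every match of one regular expression."""

    def __init__(self, tag: str, pattern: str, score: float = 0.8):
        self.tag = tag
        self.regex = re.compile(pattern)
        self.score = score  # fixed confidence assigned to every match

    def recognize(self, text: str) -> List[Match]:
        # Turn each regex hit into a tagged span with character offsets.
        return [
            Match(m.start(), m.end(), self.tag, m.group(), self.score)
            for m in self.regex.finditer(text)
        ]


# Simplified pattern for dates written as DD.MM.YYYY (illustrative only).
date_recognizer = RegexRecognizer("DATE", r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")
sample = "Der Vertrag wurde am 24.12.2020 unterzeichnet."
matches = date_recognizer.recognize(sample)
```

Each match carries character offsets, a tag, and a score, which is the kind of information a suite needs in order to compare results from different recognition methods.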
Smart combination of the results
The suite combines the results of these methods to improve the overall recognition quality. For example, a match with a higher score can overwrite a match with a lower score, and a match's confidence score can be increased if the same entity is matched multiple times.
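The two rules just described could be sketched roughly as follows. This is not NERwhal's implementation: the `Match` type, the `combine` function, the boost of 0.1, and the greedy overlap resolution are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Match:
    """A scored entity span (hypothetical type, not NERwhal's own)."""
    start: int
    end: int
    tag: str
    score: float


def combine(matches: List[Match], boost: float = 0.1) -> List[Match]:
    # Rule 1: identical matches (same span and tag) reinforce each other.
    # The boost value is an assumption; scores are capped at 1.0.
    merged: Dict[Tuple[int, int, str], Match] = {}
    for m in matches:
        key = (m.start, m.end, m.tag)
        if key in merged:
            best = max(merged[key].score, m.score)
            merged[key] = Match(m.start, m.end, m.tag, min(1.0, best + boost))
        else:
            merged[key] = m

    # Rule 2: among overlapping matches, the higher score wins.
    # Visit matches from highest to lowest score and keep a match
    # only if it does not overlap one that was already kept.
    kept: List[Match] = []
    for m in sorted(merged.values(), key=lambda x: x.score, reverse=True):
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda x: x.start)


candidates = [
    Match(0, 12, "PER", 0.6),    # e.g. from a statistical model
    Match(0, 12, "PER", 0.7),    # e.g. from a keyword list
    Match(4, 12, "LOC", 0.65),   # overlaps the PER span
    Match(20, 30, "DATE", 0.9),
]
result = combine(candidates)
```

In this example the two identical `PER` matches are merged and boosted, the overlapping `LOC` match loses against the boosted `PER` match, and the `DATE` match is kept unchanged.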
Each recognizer can define a list of context words that may occur near a named entity. If a context word is found in the same sentence as the entity, the confidence score is increased.
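A context-word boost of this kind might look like the sketch below. The function names, the boost of 0.15, and the naive sentence splitting on `.`, `!` and `?` are assumptions for illustration only; a real suite would more likely rely on a proper sentence segmenter.

```python
import re
from typing import Iterable

# Naive sentence boundary: whitespace that follows ., ! or ? (an assumption).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_around(text: str, start: int) -> str:
    """Return the sentence that contains the character offset `start`."""
    begin = 0
    for boundary in SENTENCE_END.finditer(text):
        if boundary.start() > start:
            return text[begin:boundary.start()]
        begin = boundary.end()
    return text[begin:]


def boost_for_context(
    text: str,
    start: int,
    score: float,
    context_words: Iterable[str],
    boost: float = 0.15,
) -> float:
    """Raise the score if any context word occurs in the entity's sentence."""
    sentence = sentence_around(text, start).lower()
    words = set(re.findall(r"\w+", sentence))
    if any(word.lower() in words for word in context_words):
        return min(1.0, score + boost)
    return score


text = "Bitte rufen Sie uns an. Telefon: 030 1234567. Vielen Dank."
entity_start = text.index("030")

# "Telefon" appears in the same sentence as the number, so the score rises.
boosted = boost_for_context(text, entity_start, 0.5, ["Telefon", "Tel"])
# "Dank" only appears in a different sentence, so the score stays the same.
unchanged = boost_for_context(text, entity_start, 0.5, ["Dank"])
```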
Custom vs integrated recognizers
The variety of contexts and domains in which named entities may occur is vast, and it is difficult to find all of them without also producing false positives. Therefore, recognition methods typically need to be specific not only to a language, but also to a type or style of text.
The intended use of NERwhal is to define your own custom recognizers. To exemplify NERwhal's usage and to help you bootstrap your own recognition suite, some integrated recognizers are provided.
NERwhal as part of OpenRedact
OpenRedact aims to make document anonymization more efficient by using a semi-automatic approach. The NERwhal suite is employed for the automatic detection of personal data. Several recognition methods are used and the results are combined to achieve the highest possible confidence. Still, the recognition may produce false positives or overlook sensitive data. The latter in particular must never occur in an anonymization use case. That's why, after automatically recognizing the personal data, OpenRedact presents the results to the user in an annotation tool, where they can correct the findings.
What’s next, and how can you help?
In the future, we would love to use NERwhal in other projects, both to increase the visibility of our library and to establish NER systems more quickly.
If you are interested in the library, check out our GitHub. We track issues there and greatly appreciate contributions that report or resolve them.