Named Entity Recognition for Unstructured Documents
Named Entity Recognition (NER) on unstructured text has numerous uses. Companies sometimes exchange documents (contracts, for instance) that contain personal information, and that information may need to be anonymised before the documents are shared. It may also need to be anonymised before employees at certain levels can access it. Another use is to assess the actions performed by entities in large texts. NER can be used to tackle these and many other cases where persons must be identified in large sets of documents.
The text used for the anonymisation tests was inspired by the England squad at the 1986 FIFA World Cup, and reads as follows:
“Gary Winston Lineker was an excellent football player.
GARY WINSTON LINEKER was a striker.
gary winston lineker was born in England.
gARY WiNsTon lInEker is married to Danielle Bux.
Gary W. Lineker, Kanny Sansom and Peter Shilton played together.
The midfields were:
— Bryan Robson;
— Ray Wilkins;
— Chris Waddle.”
The text was written to exercise two difficulties:
- Writing names in unusual ways (e.g. ‘gARY’, ‘gary’, ‘GARY’)
- Writing names without any character to separate them (as in the list of defenders)
The spaCy API
The "en" and "en_core_web_md" models were used to find names in the text above. The latter is a large model of about 1GB. The hit rates were 0.47 and 0.79 respectively.
The strings "GARY WINSTON LINEKER", "gary winston lineker" and "gARY WiNsTon lInEker" were not correctly recognised by either model. This shows that NER with the spaCy models is case sensitive: the classifier expects names to be written with the first letter in upper case and the rest in lower case. The important lesson here is that lowercasing the text as a preprocessing step must not be performed before named entity recognition with these models.
Another noticeable fact was that changing the bullet character from "-" to "•" dramatically changed the results. The names in the defenders list were all correctly recognised after updating the bullet character. The hit rate for both models jumped to 0.79. The only unrecognised names were the ones that did not follow the case pattern of proper names.
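As a rough illustration of the procedure used in this section, the sketch below pulls PERSON entities out of a spaCy document and scores them against a list of expected names. The helper names (`person_entities`, `hit_rate`, `run_model`) and the definition of the hit rate (the share of expected names returned verbatim) are assumptions made for illustration; the original evaluation code may differ.

```python
def person_entities(doc):
    """Return the text of every entity labelled PERSON in a spaCy Doc."""
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


def hit_rate(expected, found):
    """Share of expected names that appear verbatim among the found entities.

    This definition is an assumption; the article does not spell one out.
    """
    if not expected:
        return 0.0
    found_set = set(found)
    hits = sum(1 for name in expected if name in found_set)
    return hits / len(expected)


def run_model(text, model_name="en_core_web_md"):
    """Load a spaCy model (it must be installed separately) and extract persons."""
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load(model_name)
    return person_entities(nlp(text))
```

Calling `hit_rate(expected_names, run_model(text))` with the same expected names for each model gives one comparable figure per model.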
The Stanford Classifiers
According to its website, "Stanford NER is a Java implementation of a Named Entity Recognizer" (https://nlp.stanford.edu/software/CRF-NER.shtml). However, the NLTK module provides a Python wrapper (StanfordNERTagger) for this piece of software.
Three different classifiers were used:
- 3class: Location, Person, Organization
- 4class: Location, Person, Organization, Misc
- 7class: Location, Person, Organization, Money, Percent, Date, Time
Only the Person entity was assessed in this evaluation.
The Stanford classifiers require some extra work. They tag persons' names, but do not group consecutive tagged tokens into a single individual. For example, they recognised "Peter" and "Shilton" as persons' names, but did not recognise "Peter Shilton" as a single entity. To overcome this limitation, I used a naive approach, which was to treat consecutive tagged names as the name of one entity. A more advanced approach, such as using part-of-speech tagging, might have given better results.
For the text above, the hit rates were 0.71, 0.64 and 0.64 for the 3class, 4class and 7class classifiers respectively. The biggest problem was, once again, the list of defenders. As the names have no separator between them, the naive approach used to find distinct entities returned "Gary Stevens Kenny Sansom Terry Butcher" for that list. If a separator is used, the hit rate for the 3class classifier increases to 0.93. The only string not recognised as a person is "gary winston lineker".
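The naive grouping of consecutive tagged names can be sketched as follows. The function name `merge_person_tokens` and the hand-written tagged tokens are illustrative assumptions; the input mimics the list of `(token, tag)` pairs that NLTK's `StanfordNERTagger.tag` returns, with `PERSON` for names and `O` for everything else.

```python
def merge_person_tokens(tagged_tokens):
    """Join runs of consecutive PERSON-tagged tokens into single names."""
    names, current = [], []
    for token, tag in tagged_tokens:
        if tag == "PERSON":
            current.append(token)
        elif current:
            # A non-person token closes the name collected so far.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hand-written tags for illustration, not real tagger output.
with_separator = [
    ("Peter", "PERSON"), ("Shilton", "PERSON"), (";", "O"),
    ("Bryan", "PERSON"), ("Robson", "PERSON"), (";", "O"),
]
# Same names with the separators dropped.
without_separator = [pair for pair in with_separator if pair[1] == "PERSON"]

print(merge_person_tokens(with_separator))     # ['Peter Shilton', 'Bryan Robson']
print(merge_person_tokens(without_separator))  # ['Peter Shilton Bryan Robson']
```

The second call shows why a list without separators collapses into one long name: nothing tagged `O` ever closes the run.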
The code and results for Stanford NER experiments can be found here.
Both spaCy and Stanford NER models can be used for named entity recognition on unstructured documents with reasonably good outcomes. The former has the advantage of automatically grouping the tokens of a person's name into a single entity. The latter is more flexible, as it recognises names typed in unconventional ways (not following the case pattern for proper names).
However, it was surprising that both models rely on letter casing to identify named entities. Lowercasing the text is commonly one of the initial tasks in Natural Language Processing (NLP). This preprocessing task should be performed after the NER task, as lowercasing beforehand would ruin named entity recognition.
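One way to respect that ordering is to run NER on the original text, mask the detected spans, and only then lowercase what remains. The sketch below assumes the recogniser's output has already been reduced to non-overlapping `(start, end)` character offsets; the function name `anonymise_then_lower` and the `[PERSON]` placeholder are hypothetical choices, not part of either library.

```python
def anonymise_then_lower(text, person_spans, placeholder="[PERSON]"):
    """Mask person spans found on the original text, then lowercase the rest.

    person_spans: non-overlapping (start, end) character offsets.
    """
    pieces, cursor = [], 0
    for start, end in sorted(person_spans):
        pieces.append(text[cursor:start].lower())  # text before the name
        pieces.append(placeholder)                 # the name itself is masked
        cursor = end
    pieces.append(text[cursor:].lower())           # text after the last name
    return "".join(pieces)


text = "Peter Shilton played in England."
# Offsets written by hand; in practice they would come from the NER step.
spans = [(0, 13)]
print(anonymise_then_lower(text, spans))  # [PERSON] played in england.
```

With spaCy, such offsets are available on each entity as `ent.start_char` and `ent.end_char`.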