A Beginner’s Guide to Named Entity Recognition (NER)
What is NER and why should you care?
What is Named Entity Recognition (NER)?
It is the process of identifying proper nouns in a piece of text and classifying them into appropriate categories. These categories can be generic, like ‘Organization’, ‘Person’, ‘Location’, etc., or they can be tailor-made for a particular application, e.g. ‘Programming Language’, ‘Blogging site’, etc.
In simpler words, if you want to find out ‘who’, ‘what’, ‘when’, ‘where’ from a sentence, then NER is your task.
Take the following example (or try it yourself here):
Why is it even a problem?
When I first heard about NER, it sounded terribly boring and even trivial. You might be thinking, “Isn’t this a simple lookup problem? Why do we even need ML here?” and so on. Let me try to convince you otherwise:
1. “Can’t I just look up the names?”
You can, and often you do. You might think you’ll simply keep a giant list of all the proper nouns that exist in the world. But when you read a sentence, like
“Mr. Moony presents his compliments to Professor Snape, and begs him to keep his abnormally large nose out of other people’s business”
you can still infer from the context that ‘Moony’ and ‘Snape’ are names of people, even though you might never have heard these names in real life. So, there must be something more to the problem of NER than just looking it up.
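To make the limitation concrete, here is a minimal Python sketch of the lookup approach. The name list `KNOWN_PEOPLE` and the helper `lookup_people` are hypothetical, made up purely for this illustration; a real gazetteer would be far larger but would have the same blind spot for names it has never seen.

```python
# Hypothetical gazetteer: a tiny lookup table of known first names.
KNOWN_PEOPLE = {"Anthony", "Robert", "Sachin"}

def lookup_people(sentence):
    """Return the tokens of `sentence` that appear in the gazetteer."""
    # Crude tokenization: drop commas and periods, then split on whitespace.
    tokens = sentence.replace(",", " ").replace(".", " ").split()
    return [tok for tok in tokens if tok in KNOWN_PEOPLE]

sentence = ("Mr. Moony presents his compliments to Professor Snape, and begs him "
            "to keep his abnormally large nose out of other people's business")

found = lookup_people(sentence)
print(found)  # 'Moony' and 'Snape' are not in the list, so nothing is found
```

A human reader picks up both names from context ("Mr.", "Professor"), while the lookup table has nothing to go on.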
2. “Can’t I just use handcrafted rules?”
Again, you can, and often you do. You might select all the Title-Case words or UPPER-CASE words.
However, as you can imagine, this is obviously prone to errors if the text is not properly formatted. This can especially be a problem when the text is obtained from applications like speech recognition.
Even if you manage to isolate all the entities using rules like these, they still have to be classified into proper categories.
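As a rough illustration of both weaknesses, the sketch below applies a single Title-Case rule with Python's `re` module. The pattern and the helper name `title_case_candidates` are assumptions chosen for this example, not a recommended rule set.

```python
import re

# Rule: a word made of one uppercase letter followed by lowercase letters
# is treated as a candidate entity.
TITLE_CASE = re.compile(r"\b[A-Z][a-z]+\b")

def title_case_candidates(text):
    """Return every Title-Case word in `text`."""
    return TITLE_CASE.findall(text)

clean = "Why are Anthony Gonzales and Sachin Tendulkar in Paris?"
# The same sentence as it might come out of a speech recognizer: no casing.
asr_like = "why are anthony gonzales and sachin tendulkar in paris"

clean_hits = title_case_candidates(clean)
asr_hits = title_case_candidates(asr_like)
print(clean_hits)
print(asr_hits)
```

On the clean sentence the rule also flags the sentence-initial "Why", on the lowercase text it finds nothing at all, and in neither case does it say whether a candidate is a person or a location.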
How do I formulate it in ML terms?
So, assuming that you are now convinced that the problem is worth solving, let’s formulate it in more concrete ML terms: Given a sentence, classify each word into a category of proper nouns. If a word is not a proper noun, then categorize it as ‘Other’.
Thus, this is a sequence labelling problem.
There are some finer details to be added here. First, the classification will be done on tokens and not words. Tokens are the units produced by a tokenizer, which may, for example, split punctuation off from a word.
Second, to handle multi-word entities better (and also to distinguish where one entity ends and the next begins), we use multiple fine-grained classes for each category. For example, the sentence
“Why are Anthony Gonzales, Robert Downey Jr, and Sachin Tendulkar in Paris?”
will be tagged as:
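One way to spell out that tagging is the short Python sketch below. The tokenization, the entity spans, and the label names `PER` and `LOC` are assumptions made for illustration; real datasets and libraries define their own label sets.

```python
# Tokens of the example sentence, with punctuation split off as separate tokens.
tokens = ["Why", "are", "Anthony", "Gonzales", ",", "Robert", "Downey", "Jr",
          ",", "and", "Sachin", "Tendulkar", "in", "Paris", "?"]

# Entity spans as (start, end_exclusive, label). PER/LOC are assumed label names.
spans = [(2, 4, "PER"), (5, 8, "PER"), (10, 12, "PER"), (13, 14, "LOC")]

def spans_to_tags(n_tokens, spans):
    """Give the first token of each span a B- tag, the rest I- tags, all others O."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tags = spans_to_tags(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```

Every token gets exactly one tag: the three names each start with a `B-PER` tag, and the commas between them are tagged `O`.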
This is typically called the ‘IOB scheme’, where ‘B’ stands for ‘Beginning’, ‘I’ stands for ‘Inside’, and ‘O’ stands for ‘Outside’. There are even more fine-grained schemes, such as the BILUO scheme, where ‘L’ stands for ‘Last’ and ‘U’ stands for ‘Unit’ (an entity that consists of a single token). spaCy uses the latter scheme. The BILUO scheme has the same expressivity as the IOB scheme, but this paper shows that models are easier to train with BILUO tags than with IOB tags.
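Since both schemes carry the same information, IOB tags can be rewritten as BILUO tags mechanically. The function below is a minimal sketch of that conversion, assuming well-formed IOB input in which every entity starts with a `B-` tag; the name `iob_to_biluo_tags` and the `PER`/`LOC` labels are chosen for this example.

```python
def iob_to_biluo_tags(tags):
    """Convert a list of IOB tags into BILUO tags (assumes well-formed IOB)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag of the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo

iob = ["O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]
biluo = iob_to_biluo_tags(iob)
print(biluo)
```

The three-token person entity becomes `B-PER, I-PER, L-PER`, and the single-token location becomes `U-LOC`, so the end of every entity is marked explicitly.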
What are the features?
What gives us information about whether a word is a Named Entity or not? It can be:
- Prefix, suffix
- POS Tag
- Relative location in the sentence (beginning, end, etc.)
- Other words in the sentence.
- Whether the word is ‘OOV’ (Out Of Vocabulary)
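The features listed above can be sketched as a per-token dictionary. This is only an illustration: the function name `token_features`, the feature names, the hand-written POS tags, and the tiny vocabulary are all assumptions, and a real system would get POS tags from a tagger and the vocabulary from training data. Dictionaries of this kind are the sort of input that CRF toolkits such as sklearn_crfsuite work with.

```python
def token_features(tokens, pos_tags, i, vocabulary):
    """Build a feature dictionary for the token at position `i`."""
    word = tokens[i]
    last = len(tokens) - 1
    return {
        "prefix3": word[:3],                      # prefix
        "suffix3": word[-3:],                     # suffix
        "pos": pos_tags[i],                       # POS tag (supplied by the caller)
        "is_first": i == 0,                       # relative location in the sentence
        "is_last": i == last,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",    # other words
        "next_word": tokens[i + 1].lower() if i < last else "<EOS>",
        "is_oov": word.lower() not in vocabulary, # out of vocabulary
    }

tokens = ["Sachin", "Tendulkar", "is", "in", "Paris"]
pos_tags = ["PROPN", "PROPN", "AUX", "ADP", "PROPN"]  # hand-written for the example
vocabulary = {"is", "in", "paris"}                    # toy vocabulary

features = token_features(tokens, pos_tags, 0, vocabulary)
print(features)
```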
What algorithms can I use?
NER being a sequence labelling task, we can use a wide range of algorithms for it. We can roughly categorize them as:
1. Traditional ML based:
- Conditional Random Fields (CRF) [sklearn_crfsuite, a separate package with a scikit-learn-style API, is a very handy framework for this]
- Maximum-entropy Markov model
2. Neural Networks based:
- LSTMs, bi-LSTM (A good framework is Flair)
- CNNs (spaCy uses a CNN-based architecture)
- Transformers (spaCy has recently launched support for them here)
Where do I get the data for training?
There are several open-source datasets available for NER training. Each of them has its own characteristics. Some of them are:
- W-NUT 2017 (This set is more focused on novel, emerging entities)
- OntoNotes 5 (Contains multiple genres of speech)
For a much more extensive collection of datasets, refer to this GitHub repository.
How do I know if my model is performing well?
When learning about a new ML problem, people often focus a lot on the algorithm, but forget to think about performance metrics. To know which model is performing the best, we need some model-independent metrics for any ML task.
1. Traditional Metrics
As a classification task, the performance of NER is usually measured in terms of classification metrics (over all the tokens) like precision, recall, F-score, accuracy, etc. These are fine if you just want to compare different models in a simple way.
Keep in mind that there will be a severe class imbalance in any NER data because most of the words in any corpus are not proper nouns (typically, the ‘O’ tag alone makes up more than 75% of the tags). So in my opinion, a macro-average over all the classes is more appropriate than a micro-average (for any metric).
This post gives a great overview of a few different metrics derived from these simple metrics that have been used over the years.
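To illustrate the macro- versus micro-average point, the sketch below scores a deliberately useless model that tags every token as ‘O’. The tag sequences are a made-up toy example and `per_class_f1` is a hypothetical helper written for this illustration, not a library function.

```python
from collections import Counter

def per_class_f1(gold, pred):
    """Token-level F1 for every tag that appears in `gold` or `pred`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores

# Toy data: 8 of 10 tokens are 'O', and the model never predicts an entity.
gold = ["O"] * 8 + ["B-PER", "B-LOC"]
pred = ["O"] * 10

scores = per_class_f1(gold, pred)
macro_f1 = sum(scores.values()) / len(scores)
# With exactly one tag per token, micro-averaged F1 equals token accuracy.
micro_f1 = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(f"micro F1: {micro_f1:.2f}")
print(f"macro F1: {macro_f1:.2f}")
```

The micro-average comes out at 0.80 even though the model has not found a single entity, while the macro-average drops to about 0.30 because the two entity classes each score zero.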
2. New Developments
However, the above-mentioned simple metrics do not tell us a lot. They are aggregated over the entire dataset, so we cannot characterize the strengths or weaknesses of the model in meaningful ways. This recent paper tries to address this problem by breaking up the dataset into different chunks based on explainable criteria and examining the model performance on each chunk separately.
You can read more about it in this post that I wrote recently.
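The general idea of evaluating per chunk can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the records are invented toy data, entity length is just one possible criterion, and the helpers `length_bucket` and `accuracy_by_bucket` are hypothetical rather than the paper's actual method.

```python
from collections import defaultdict

# Invented toy records: (entity text, gold label, predicted label).
records = [
    ("Paris", "LOC", "LOC"),
    ("London", "LOC", "LOC"),
    ("Sachin Tendulkar", "PER", "PER"),
    ("Robert Downey Jr", "PER", "ORG"),
    ("Bank of New York Mellon", "ORG", "LOC"),
]

def length_bucket(entity_text):
    """Assign an entity to a bucket by its length in whitespace tokens."""
    n = len(entity_text.split())
    if n == 1:
        return "1 token"
    if n == 2:
        return "2 tokens"
    return "3+ tokens"

def accuracy_by_bucket(records, bucket_fn):
    """Fraction of correctly labelled entities within each bucket."""
    correct, total = defaultdict(int), defaultdict(int)
    for text, gold, pred in records:
        bucket = bucket_fn(text)
        total[bucket] += 1
        correct[bucket] += int(gold == pred)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}

by_length = accuracy_by_bucket(records, length_bucket)
for bucket, accuracy in sorted(by_length.items()):
    print(f"{bucket:10s} {accuracy:.2f}")
```

On this toy data the overall accuracy is 0.60, which hides the fact that every error falls in the bucket of entities with three or more tokens.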
Summing it up
NER is the problem of classifying words in a sentence into categories of proper nouns. It is not completely solvable by lookup or handcrafted rules, but they help. There are various traditional ML-based as well as deep-learning-based algorithms to solve it. Frameworks like sklearn_crfsuite, spaCy, Flair, etc. make it really easy to implement these algorithms.