Two minutes NLP — Quick Intro to Coreference Resolution with NeuralCoref
Mentions, Word Embeddings, NeuralCoref, and SpaCy
Coreference Resolution is the task of finding the expressions in a text that refer to the same real-world entity, grouping them, and optionally replacing ambiguous expressions such as pronouns with the entity they refer to. It is an important step for many higher-level NLP tasks such as document summarization, question answering, and information extraction.
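Here is an example of what Coreference Resolution does: in the text "My sister has a dog. She loves him.", the pronoun "She" refers to "My sister" and "him" refers to "a dog".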
A typical coreference resolution algorithm goes like this:
- Spans of words that potentially refer to real-world entities are extracted. We call these spans mentions.
- For each mention and each pair of mentions, we compute a set of features. This is commonly done by averaging the word embeddings of the mention and of its adjacent words, so that the features also capture context.
- Then, we input these features into machine learning models to find the most likely antecedent for each mention (if there is one).
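The three steps above can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: the 3-dimensional embeddings are invented, the mentions are specified by hand instead of being extracted, and cosine similarity with a fixed threshold stands in for the trained machine learning model. The helper names (mention_features, best_antecedent) are hypothetical and are not part of any library.

```python
import math

# Toy 3-dimensional word embeddings; the values are invented for illustration.
EMBEDDINGS = {
    "my": [0.2, 0.0, 0.1],
    "sister": [0.9, 0.1, 0.0],
    "has": [0.0, 0.0, 0.1],
    "a": [0.0, 0.1, 0.1],
    "dog": [0.1, 0.9, 0.0],
    ".": [0.0, 0.0, 0.0],
    "she": [0.8, 0.1, 0.1],
    "loves": [0.1, 0.1, 0.1],
    "him": [0.1, 0.8, 0.1],
}


def mention_features(tokens, start, end, window=1):
    """Average the embeddings of the mention tokens[start:end] and of up to
    `window` adjacent words on each side."""
    lo = max(0, start - window)
    hi = min(len(tokens), end + window)
    vectors = [EMBEDDINGS[token.lower()] for token in tokens[lo:hi]]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; a stand-in for a trained pairwise scoring model."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def best_antecedent(features, index, threshold=0.8):
    """Return the index of the best-scoring earlier mention, or None if no
    candidate scores above `threshold`."""
    best, best_score = None, threshold
    for candidate in range(index):
        score = cosine(features[index], features[candidate])
        if score > best_score:
            best, best_score = candidate, score
    return best


tokens = ["My", "sister", "has", "a", "dog", ".", "She", "loves", "him", "."]
# Step 1 (mention extraction) is done by hand here: (start, end) token spans.
mentions = [(0, 2), (3, 5), (6, 7), (8, 9)]
# Step 2: one feature vector per mention.
features = [mention_features(tokens, start, end) for start, end in mentions]
# Step 3: pick the most likely antecedent for each mention, if there is one.
antecedents = [best_antecedent(features, i) for i in range(len(mentions))]

for (start, end), antecedent in zip(mentions, antecedents):
    text = " ".join(tokens[start:end])
    if antecedent is None:
        print(f"{text!r} has no antecedent")
    else:
        a_start, a_end = mentions[antecedent]
        print(f"{text!r} -> {' '.join(tokens[a_start:a_end])!r}")
```

With these toy vectors, "She" is linked to "My sister" and "him" to "a dog", while the first two mentions get no antecedent. A real system would learn the scoring function from annotated data and would also use pairwise features such as the distance between mentions.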
One of the most popular libraries for Coreference Resolution in Python is NeuralCoref.
NeuralCoref
NeuralCoref is a pipeline extension for spaCy 2.1+ which annotates and resolves coreference clusters using neural networks.
Install the library using pip and make sure you have a compatible version of spaCy. Remember to download the spaCy models for the English language.
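As a quick sanity check after installing, the following sketch reports which versions of the two packages are installed. The install commands in the comments assume the small English model en_core_web_sm; the at_least helper is hypothetical and only encodes the "spaCy 2.1+" requirement stated above. Check the NeuralCoref README for the exact spaCy versions it supports.

```python
# Install commands, run in a shell (the model name is an assumption):
#   pip install neuralcoref
#   python -m spacy download en_core_web_sm

from importlib import metadata  # available in Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def at_least(version, minimum=(2, 1)):
    """True if a 'major.minor...' version string is >= `minimum`."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum


for name in ("spacy", "neuralcoref"):
    print(name, installed_version(name) or "not installed")
```

For example, at_least(installed_version("spacy")) can be used to check the minimum spaCy version once spaCy is installed.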
Then, import both spaCy and NeuralCoref in your code and add the latter to the spaCy parsing pipeline.
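A minimal sketch of this step, assuming the en_core_web_sm model has been downloaded. The function name build_coref_pipeline is hypothetical; spacy.load and neuralcoref.add_to_pipe are the library calls doing the work.

```python
def build_coref_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy English model and add NeuralCoref to its pipeline."""
    # Imported here so that defining the function does not require the
    # packages to be installed.
    import spacy
    import neuralcoref

    nlp = spacy.load(model_name)   # the model name is an assumption
    neuralcoref.add_to_pipe(nlp)   # appends the coreference component
    return nlp
```

Calling nlp = build_coref_pipeline() then gives a pipeline whose output documents carry the coreference annotations.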
Last, parse a sentence with spaCy. NeuralCoref will automatically resolve the coreferences and annotate them as extension attributes in the spaCy Doc, Span, and Token objects under the ._. dictionary.
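The sketch below shows one way to read the document-level annotations. It assumes doc was produced by a spaCy pipeline that already has NeuralCoref added; summarize_coreferences is a hypothetical helper, while has_coref, coref_clusters, and coref_resolved are extension attributes that NeuralCoref sets on the Doc.

```python
def summarize_coreferences(doc):
    """Collect NeuralCoref's document-level annotations from a parsed Doc."""
    # Each cluster groups the mentions of one entity; `main` is the most
    # representative mention of the cluster.
    clusters = {
        cluster.main.text: [mention.text for mention in cluster.mentions]
        for cluster in (doc._.coref_clusters or [])
    }
    return {
        "has_coref": doc._.has_coref,       # any coreference found?
        "clusters": clusters,               # main mention -> all mentions
        "resolved": doc._.coref_resolved,   # text with mentions replaced
    }


# Usage, assuming `nlp` is a spaCy pipeline with NeuralCoref added:
#   doc = nlp("My sister has a dog. She loves him.")
#   print(summarize_coreferences(doc)["resolved"])
```

Span and Token objects carry similar attributes, for example span._.is_coref and token._.in_coref, which tell whether a span or token takes part in a coreference cluster.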