How to Boost Your Topic-Modeling Performance with Coreference Resolution

Improving the accuracy of identifying land conflict topics in news articles from 83% to 93%.

Zaheeda Chauke
Omdena
Nov 26, 2019


The Hugging Face coreference demo in action on my own example. Try it for yourself!

As a Junior Data Scientist, my Machine Learning journey thus far has led me to NLP challenges involving good old-fashioned text classification. So I was enthused when Omdena presented an opportunity to broaden my skill set and delve into an aspect of NLP I was not familiar with.

After applying to Omdena, I was accepted as a Machine Learning Engineer to collaborate in their AI for Good challenge with the World Resources Institute.

The challenge

Identifying environmental conflict events in India using news media articles.

Part of this project was to scrape news media articles to identify environmental conflict events such as resource conflicts, land appropriation, human-wildlife conflict, and supply chain issues.

With an initial focus on India, we also connected conflict events to their jurisdictional policies to identify how to resolve those conflicts faster or to identify a gap in legislation.

Part of the pipeline in building this Language Model was a semi-supervised Topic Modeling task, whose process and outcome are detailed below.

In short, to make the Topic Modeling model more robust, Coreference Resolution was suggested as one of the possible additions.

I took the initiative to work on this task and was later elected as the Task Manager.

The team consisted of 27 other collaborators, ranging from data wranglers and data engineers to machine learning engineers. Together we were ready to contribute!

Where to begin? Research.

Since I had no experience with Coreference Resolution, I knew my best starting point, as it is with most projects, would be researching this topic.

What exactly is Coreference Resolution?

Coreference resolution is the task of finding all expressions that refer to the same entity in a text (1)

I like to explain with practical examples, so here is a real one featuring my son :)

“My mom is the best!”, said Hamza, “She wakes me up, makes me healthy food, then she lets me eat junk food”.

  • Entities: “Hamza”, “my mom”.
  • Expressions which refer to “Hamza”: “My”, “me”, “me”, “me”.
  • Expressions which refer to “my mom”: “She”, “she”.

Pretty simple!

Use Cases

  1. In the context of this project, Coreference Resolution is best used to enrich downstream topic modeling by replacing references to an entity with the entity itself, in order to better model the actual meaning of the text. This increases the Tf-Idf weight of the generalized entities and removes ambiguous words that are meaningless for classification (a toy sketch of this effect follows this list).
  2. Another use case would be to use the coreference-resolved text as additional features, along with Named Entity Recognition tags, in any classification approach. A one-hot-encoded version of the unique entities can be used as input to factorization machines or other approaches for sparse modeling.
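
To make the first use case concrete, here is a minimal sketch with my own toy sentences (not the project data), using scikit-learn's TfidfVectorizer to compare term weights before and after resolution:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy example: the same two sentences before and after coreference resolution.
original = "The villagers protested the mine. They said it polluted their river."
resolved = ("The villagers protested the mine. "
            "The villagers said the mine polluted the villagers' river.")

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform([original, resolved])
vocab = vectorizer.vocabulary_

# Entity terms carry more weight once pronouns are replaced by the entities they refer to.
for term in ["villagers", "mine"]:
    column = vocab[term]
    print(f"{term}: original={weights[0, column]:.2f}, resolved={weights[1, column]:.2f}")
```

In the original sentence, "they", "it", and "their" soak up weight without telling the model anything about the topic; in the resolved version that weight shifts onto "villagers" and "mine", which is exactly what a topic model needs.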

Which packages are available to implement it?

I explored almost every available Python package out there.

I toyed around with some packages that seemed good in theory but were rather challenging to apply to our specific task. We needed a package that would be user-friendly, since the resulting script had to be something 28 people could pick up and apply without much struggle.

The shortlist came down to NeuralCoref, Stanford NLP, Apache OpenNLP, and AllenNLP. After trying out each package, I personally preferred AllenNLP, but as a team we decided to use NeuralCoref, with a short but effective script written by one of the collaborators, Srijha Kalyan.
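
That script isn't reproduced here, but a minimal NeuralCoref sketch along the same lines (assuming spaCy 2.x, the en_core_web_sm model, and neuralcoref are installed) looks like this:

```python
import spacy
import neuralcoref

# NeuralCoref plugs into a spaCy 2.x pipeline as an extra component.
nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

text = ('"My mom is the best!", said Hamza, '
        '"She wakes me up, makes me healthy food then she lets me eat junk food".')
doc = nlp(text)

print(doc._.has_coref)       # True if at least one coreference cluster was found
print(doc._.coref_clusters)  # the clusters, e.g. one for Hamza and one for "My mom"
print(doc._.coref_resolved)  # the text with every mention replaced by its cluster's main entity
```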

When applied, the package identifies the entities in the given text and produces “clusters” or “chains”. Each cluster consists of the entity (“Hamza”), the references linked to that entity (“My”, “me”), and their index (position) in the text.
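
Continuing the sketch above, each cluster exposes its main entity, its mentions, and their token positions:

```python
# Each cluster pairs a "main" mention with every other mention of the same entity.
for cluster in doc._.coref_clusters:
    print("Entity:", cluster.main.text)
    for mention in cluster.mentions:
        # mention.start and mention.end are token indices into the document
        print("  mention:", mention.text, "at tokens", mention.start, "to", mention.end)
```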

The code was applied to the article data that had been annotated by fellow collaborators from the Annotation Task Group. The result was a CSV file with the original article titles, the original article text, and a new column of coreference-resolved article text, written not as chains but in the same running-text format as the original articles.
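
The file and column names below are illustrative rather than the project's actual schema, but the batch step over the annotated articles was roughly of this shape:

```python
import pandas as pd
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

def resolve(text):
    """Return the coreference-resolved article, or the original text if no clusters are found."""
    doc = nlp(text)
    return doc._.coref_resolved if doc._.has_coref else text

# "annotated_articles.csv" and "article_text" are illustrative names, not the project's actual files or columns.
df = pd.read_csv("annotated_articles.csv")
df["coref_text"] = df["article_text"].astype(str).apply(resolve)
df.to_csv("annotated_articles_coref.csv", index=False)
```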

The output was then sent to the Topic Modeling Task Team, which at that point was sitting on an accuracy of 83%. With the coreference-resolved data, the accuracy jumped to 93%!

That’s a ten-percentage-point improvement! All the hard work and hours we put into learning this task were clearly worth it!

I’m proud to say I have a new skill added to my Data Science ninja Resume!

Want to become an Omdena collaborator and join one of our tough AI for Good challenges? Apply here.
