How to Identify Land Conflicts in India Through Semi-Supervised NLP Topic Modeling

Using semi-supervised learning to identify topics in newspaper articles about land conflicts, with a model accuracy of 93 percent.

Joanne Burke
Omdena
Nov 19, 2019

As a long-time analytics manager, I pivoted into data science.

When I was growing up, my mom was a social worker and my dad was an engineer, so I was naturally drawn to the growing number of AI for Good initiatives.

I found Omdena online and applied for an NLP project. I was accepted to a 2-month challenge with the World Resources Institute on identifying causes of land conflicts in India and connecting these topics to the respective policies to accelerate restoration efforts.

I could not be more excited!

Our project began with many great ideas from a group of 30 diverse data scientists and computer engineers. We were given a corpus of newspaper articles, 200 gold-standard examples, and excellent direction on using state-of-the-art NLP approaches.

My Topic Modeling Journey

I took on the task of building the topic model. A word cloud of the gold-standard articles yielded many ideas.

A starter unsupervised LDA topic model indicated a variety of topics, but I saw topic overlap and a lack of differentiation. Topic models are tricky: they are highly dependent on data inputs and parameter fine-tuning. Here is the pyLDAvis visualization.

I could see at most three distinct topics. Yikes!

pyLDAvis and Top 30 Words
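The overlap visible in a chart like this can also be checked numerically. The sketch below is not from the original project: it assumes you have already pulled the top words for each topic out of an LDA model, and the helper name `topic_overlap` and the sample word lists are made up for illustration. It scores each pair of topics by the Jaccard similarity of their top-word sets, so pairs near 1.0 are barely distinguishable.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def topic_overlap(top_words: Dict[str, List[str]]) -> List[Tuple[str, str, float]]:
    """Return (topic_a, topic_b, jaccard) for every topic pair, highest overlap first."""
    pairs = []
    for (name_a, words_a), (name_b, words_b) in combinations(top_words.items(), 2):
        set_a, set_b = set(words_a), set(words_b)
        union = set_a | set_b
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(set_a & set_b) / len(union) if union else 0.0
        pairs.append((name_a, name_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical top words, only to show the shape of the input.
example_topics = {
    "topic_0": ["land", "farmer", "court"],
    "topic_1": ["land", "farmer", "police"],
    "topic_2": ["rain", "crop", "water"],
}

for name_a, name_b, score in topic_overlap(example_topics):
    print(f"{name_a} vs {name_b}: {score:.2f}")
```

A high score between two topics suggests they could be merged or that the preprocessing needs more work before the model can separate them.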

So I was stuck. I researched state-of-the-art semi-supervised approaches, and then someone on the challenge posted a link to the CorEx (Correlation Explanation) model. A breakthrough! I added stopwords and stemming, and expanded the vocabulary size in the TF-IDF vectorization of the articles.
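To make those three preprocessing steps concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the project's code: the stopword list is tiny, `crude_stem` is a hypothetical stand-in for a real stemmer, and `max_features` is an assumed name for the vocabulary-size cap. A real pipeline would more likely use a library vectorizer and a proper stemmer.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Tiny illustrative stopword list; a real pipeline would use a much fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for", "is", "was"}


def crude_stem(token: str) -> str:
    """Strip a few common suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Lowercase, tokenize on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def tfidf(docs: List[str], max_features: int) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document, limited to max_features terms."""
    tokenized = [preprocess(d) for d in docs]

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Vocabulary cap: keep the terms with the highest total count.
    totals = Counter(t for tokens in tokenized for t in tokens)
    vocab = {term for term, _ in totals.most_common(max_features)}

    n_docs = len(docs)
    rows = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        length = sum(counts.values()) or 1
        rows.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / length) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return rows
```

Raising `max_features` lets rarer, more topic-specific words into the vocabulary, which is the effect of expanding the vocabulary size.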

The CorEx model is semi-supervised. By adding keywords as anchors, I could provide guidance to the model for selecting correlated words for each topic.
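As a rough sketch of what anchoring can look like with the `corextopic` package from the CorEx repository, the snippet below fits an anchored model on a binary document-word matrix. Several things here are assumptions rather than details from the project: the anchor words are hypothetical, the helper names are mine, the `anchor_strength` value is arbitrary, and a binary `CountVectorizer` is used where the article mentions TF-IDF. The first helper covers one practical detail: after stemming, an anchor only has an effect if it matches a term in the vocabulary.

```python
from typing import List, Sequence


def usable_anchors(anchors: Sequence[Sequence[str]],
                   vocabulary: Sequence[str]) -> List[List[str]]:
    """Keep only the anchor words that actually occur in the model vocabulary."""
    vocab = set(vocabulary)
    return [[word for word in topic if word in vocab] for topic in anchors]


def fit_anchored_corex(docs, anchors, n_topics, anchor_strength=3.0, seed=42):
    """Fit a CorEx topic model guided by anchor words (requires extra packages)."""
    # Imported lazily so usable_anchors() works without these packages installed.
    from sklearn.feature_extraction.text import CountVectorizer
    from corextopic import corextopic as ct

    vectorizer = CountVectorizer(binary=True, stop_words="english")
    doc_word = vectorizer.fit_transform(docs)          # sparse binary matrix
    words = list(vectorizer.get_feature_names_out())   # needs scikit-learn >= 1.0

    kept = usable_anchors(anchors, words)
    if any(len(topic) == 0 for topic in kept):
        raise ValueError("every anchored topic needs at least one in-vocabulary word")

    model = ct.Corex(n_hidden=n_topics, seed=seed)
    model.fit(doc_word, words=words, anchors=kept, anchor_strength=anchor_strength)
    return model, words


# Hypothetical anchors: one inner list of keywords per guided topic.
EXAMPLE_ANCHORS = [["drought", "rainfall"], ["water", "irrigation"]]
```

After fitting, `model.get_topics(n_words=10)` lists the top words per topic. Setting `n_topics` higher than the number of anchor lists leaves the extra topics unanchored, which is one way additional themes can surface.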

To align with the policies, I defined five core topics. After iterating on the CorEx model, however, two more topics came to light: Drought/Rainfall and Water/Irrigation.

Back to the WRI stakeholders

I needed to go back to the World Resources Institute stakeholders and ask them to add to the policy documents, based on the frequency of the two additional topics. They agreed!

Final CorEx Model and tSNE graph

Now the topic model was working. It could identify 85% of the articles, but I wanted better. Another task in the WRI challenge used state-of-the-art methods for coreference resolution, which marks pronouns and other references within the text. After I used neuralcoref to add in all mentions of important nouns in the corpus, my model reached 93%. Much better.

Never one to pass up a good chance to learn a new visualization, I engineered a scatter pie plot from WordNet vectorizations reduced to 2-D representations with t-SNE.

Please refer to my GitHub link for sample code for one topic. The final representation shows the differentiation of topics very well.

Much better!

My topic model was added to one of the two final solution pipelines. I am proud to have done my part.

Joanne is a Cornell University graduate with actuarial, analytics, and (recently) data science experience. Like-minded data scientists and machine learning engineers, please connect on LinkedIn.

Bibliography

CorEx Model: https://github.com/gregversteeg/corex_topic

Python Code for CorEx, Scatter Pie, and BiLSTM model:

Want to become an Omdena Collaborator and join one of our tough AI for Good challenges? Apply here.

Data Scientist, formerly analytics manager and actuary