Using AI to Identify Environmental Conflict Events — From Scraping News Articles to Map Visualization

The power of Artificial Intelligence meets collaboration, by humans, for humans.

Nikhel Gupta
Omdena
8 min read · Nov 21, 2019


Geolocated conflict events related to agriculture (light blue) and education (brown) in 2017–2018 for five example states of India. | Credit: Omdena AI

Environmental conflicts have emerged as major issues that deeply affect the socio-economic state of a region or an entire nation. These conflicts relate to natural resources, land, wildlife, supply chains, and more. They are widespread around the globe and increasing rapidly.

According to the Environmental Justice Atlas, India has the highest number of environmental conflicts, followed by Colombia and Nigeria. For instance, approximately 66% of all civil cases in the Supreme Court of India relate to land disputes covering more than 2.5 million hectares. These disputes affect an estimated 7.7 million Indians and threaten investments worth $200 billion, according to a June 2019 report by the Centre for Policy Research.

Thus, it is high time for policymakers to use the available data and develop policies with a pinch of science!

Protesters brutalized by police for opposing the proposed expansion of a factory. | Credit: N. Rajesh, The Hindu.

How did I end up on this project?

I am not an expert in environmental science, but while expanding my domain knowledge for data science, I came across Omdena.

I joined one of their challenges, run in collaboration with the World Resources Institute, alongside 32 other data scientists from various continents.

Country map of data scientists in the challenge. | Credit: Omdena AI

The data

The data for this challenge was scraped from news media reports, yielding about 65,000 candidate conflict articles. The process involved downloading GDELT records for a given country and time period using Google BigQuery, scraping the full news text of each article using news-please, and manually labeling one month of news media data, approximately 1,600 articles, as Negative (no conflict news) or Positive (conflict news). A rough sketch of this pipeline is shown below.
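The exact scraping scripts are not reproduced here, but the pipeline can be sketched roughly as follows. The sketch assumes access to the public GDELT 2.0 events table on BigQuery; the query fields, country code, and date range are illustrative.

from google.cloud import bigquery
from newsplease import NewsPlease

# Pull candidate article URLs for one country and time window from GDELT (illustrative query).
client = bigquery.Client()
query = """
SELECT DISTINCT SOURCEURL
FROM `gdelt-bq.gdeltv2.events`
WHERE ActionGeo_CountryCode = 'IN'
  AND SQLDATE BETWEEN 20170101 AND 20181231
"""
urls = [row.SOURCEURL for row in client.query(query).result()]

# Scrape the full text of each candidate article with news-please.
articles = []
for url in urls:
    try:
        article = NewsPlease.from_url(url)
        if article and article.maintext:
            articles.append({"url": url, "title": article.title, "text": article.maintext})
    except Exception:
        continue  # skip pages that cannot be downloaded or parsed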

The major tasks for this project were:

  1. Applying coreference resolution to the full news text using spaCy and NeuralCoref.
  2. Training ELMo, BERT, and logistic regression models on the labeled, coreference-resolved text to predict positive and negative conflict articles.
  3. Topic modeling to find relevant topics using CorEx.
  4. Creating custom entities for NER using spaCy.
  5. Matching government policies to the conflict news events to understand policy gaps.
  6. Visualizing conflict events and their connection to policies.

In the following sections, I’ll take you through all of these tasks one by one.

Coreference resolution

As the name suggests, this is the task of locating all expressions in a text that refer to the same entity. In order to understand the sentiment behind an article and classify it as a conflict or non-conflict event, it is important to replace pronominal words like he, his, her, she, them, their, us, etc. with the nouns they refer to.

Before the rise of AI, a typical algorithm for this purpose would first extract a series of entities from the sentences and then compute a set of hand-engineered features for each expression. For example, take a sentence from one of the scraped news articles:

Shahdara residents recall their experience with chilli powder in their throats.

In this sentence, ‘their’ refers to ‘Shahdara residents’, and to resolve it we could encode a rule. Similarly, we could add rules for many other patterns in a full news article. However, doing that for all the articles in this analysis is close to impossible.

Wait, why work if machines can do it for you, right?

Modern Natural Language Processing (NLP) techniques like neural networks let us do this job easily: train a model on a coreference-annotated dataset and use the trained model to perform coreference resolution on all the articles. Even better, there are tools already trained on such large datasets that we can use directly to resolve our news text data. One such tool is NeuralCoref, a pipeline extension for spaCy that annotates and resolves coreferences using a neural network.

A working example of coreference resolution using Neuralcoref based on neural networks and Spacy.

The following is an example to do that in Python. First, we have to install some modules:

!pip uninstall -y spacy
!pip uninstall -y neuralcoref
!pip install spacy==2.1.0
!pip install neuralcoref
!python -m spacy download en

And applying NeuralCoref to a sentence (as above) is as simple as:

import spacy
import neuralcoref

# Load the English model and add NeuralCoref to the spaCy pipeline.
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

# Run the pipeline and print the coreference-resolved text.
sen_nlp = nlp(sentence)
print(sen_nlp._.coref_resolved)

This will print the above sentence as:

Shahdara residents recall Shahdara residents experience with chilli powder in Shahdara residents throats.

Identifying conflict events

This was one of the major tasks, aimed at classifying the news articles into positive (conflict) and negative (non-conflict) documents. Using the ~1,600 manually labeled, coreference-resolved articles, three different models were trained to find the best solution.

One of the major advancements in NLP is transfer learning, i.e. taking a model pre-trained on a huge dataset and adapting it to a different task. ELMo and BERT are two of the most widely used pre-trained models for this purpose. ELMo representations are built from the internal states of a bi-directional Long Short-Term Memory (LSTM) network, which is robust to long-term dependency problems and useful for capturing the contextual features of the input text. The following classification report shows the recall and f1-score for positive and negative news articles after transferring the ELMo model and training it for several epochs.

Classification report using a test dataset and ELMo model.
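Our exact training code is not shown here, but the general pattern is to extract a fixed-size ELMo embedding per article and train a classifier head on top of it. Below is a minimal sketch assuming TensorFlow 1.x and the public ELMo module on TensorFlow Hub; the pooling choice, the MLP classifier head, and the variables texts and labels (the coreference-resolved articles and their manual labels) are illustrative assumptions.

import tensorflow as tf            # assumes TensorFlow 1.x
import tensorflow_hub as hub
from sklearn.neural_network import MLPClassifier

# Load the pre-trained ELMo module from TensorFlow Hub.
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

def elmo_embed(texts):
    # The "default" output is a mean-pooled 1024-d vector per input string.
    embeddings = elmo(texts, signature="default", as_dict=True)["default"]
    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        return sess.run(embeddings)

# texts / labels: coreference-resolved articles and their conflict labels (embed in batches in practice).
X = elmo_embed(texts)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50)
clf.fit(X, labels)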

BERT, on the other hand, is an unsupervised, deeply bidirectional system for pre-training NLP models. It is a general-purpose language understanding model trained on a large amount of free text from Wikipedia. Its pre-trained representations can be either context-free or contextual. Fine-tuning BERT on our task produced the following scores.

Classification report using a test dataset and BERT model.
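The team's exact BERT setup is not reproduced here; a minimal fine-tuning sketch using the Hugging Face transformers library looks roughly like the following. The model name, sequence length, learning rate, number of epochs, and the texts / labels variables are illustrative assumptions.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# texts / labels: the labeled, coreference-resolved articles (0 = non-conflict, 1 = conflict).
enc = tokenizer(texts, truncation=True, padding=True, max_length=256, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few epochs of fine-tuning
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        out.loss.backward()
        optimizer.step()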

These results could be improved through data augmentation or better hyper-parameter tuning. However, we found more peace of mind in the well-known logistic regression algorithm. First, the labeled, coreference-resolved text was converted into vectors using a Bag of Words model with a bigram vectorizer; the vectorized data were then used to classify the articles into positive and negative conflict texts using a logistic regression classifier, with grid search to optimize the hyper-parameters. This method classified the news articles with a recall and f1-score of 0.98 for conflict-related texts.

Classification report using a test dataset and logistic regression model.
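This pipeline maps closely onto scikit-learn. Here is a minimal sketch, where texts and labels stand for the coreference-resolved articles and their manual labels, and the hyper-parameter grid is illustrative rather than the exact one we searched.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

# Bag-of-words with unigrams and bigrams over the coreference-resolved article texts.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, stratify=labels, random_state=42)

# Grid search over regularization settings for the logistic regression classifier.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
    scoring="f1", cv=5,
)
grid.fit(X_train, y_train)
print(classification_report(y_test, grid.predict(X_test)))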

Topic Modeling

At this point, we had models that classify the news articles into conflict-related and non-conflict documents with very high precision. Time to understand the topics of these articles, i.e. which kinds of conflict events they describe!

For this purpose, we used the anchored Correlation Explanation (CorEx, see this beautifully written paper) algorithm with anchor words of interest. CorEx supports hierarchical topic modeling and provides a mechanism for injecting domain knowledge through input anchor words. The modeling is as simple as running the following Python script on the bag-of-words vectors to generate topics.

from corextopic import corextopic as ct

# TOPICS = 7; bag_of_words is the document-term matrix and vocab the feature names from the vectorizer.
model = ct.Corex(n_hidden=TOPICS, seed=42)
model = model.fit(bag_of_words, words=vocab, anchors=anchors, anchor_strength=4)

Here vocab points to the feature names from the vectorizer, and anchor_strength tells the model how strongly to rely on the following anchor words for the 7 topics:

anchors = [
    ['land', 'acre', 'hectares', 'acquisition', 'land acquisition', 'agricultural', 'acres', 'degradation', 'landslides', 'property', 'resettlement'],
    ['farmer', 'farming', 'agricultural', 'produce', 'crop', 'crops', 'agrarian', 'farms', 'farm', 'field', 'fields', 'soil', 'sugarcane', 'vegetables', 'farmers', 'agriculture', 'tractor', 'prices crops', 'debt', 'quota', 'food', 'fruits', 'livestock', 'cow', 'wheat', 'harvest', 'harvesting', 'horticulture', 'loan', 'loans', 'milk', 'paddy', 'rice', 'plant', 'plants', 'potatoes', 'potato'],
    ['mining', 'coal', 'miner', 'miners', 'sand mining', 'sand', 'bauxite', 'iron ore', 'limestone', 'manganese ore', 'granite'],
    ['forest', 'forests', 'forest department', 'reserve', 'forest officials', 'forestry'],
    ['animal', 'leopard', 'leopards', 'animals', 'wildlife', 'tiger', 'attacked', 'slaughter', 'lion', 'lions', 'threat', 'tigress', 'bear', 'birds', 'cat', 'cattle', 'crocodile', 'elephant', 'elephants', 'pangolin', 'pangolins', 'species'],
    ['drought', 'droughts', 'monsoon', 'rain', 'rains', 'rainfall', 'disaster'],
    ['water', 'irrigation', 'monsoon', 'rain', 'flood', 'floods', 'flooded', 'climate change', 'climate', 'dam', 'dams', 'drinking'],
]

And,

model.get_topics(n_words=3)

gave us the following 7 topics with 3 words per topic.

Topic #1: land, resettlement, degradation
Topic #2: crops, farm, agriculture
Topic #3: mining, coal, sand
Topic #4: forest, trees, deforestation
Topic #5: animal, attacked, tiger
Topic #6: drought, climate change, rain
Topic #7: water, drinking, dams

Custom entities

After topic modeling, the next job was to find custom entities such as actors (e.g. government, court), numbers (e.g. the number of people affected by a conflict), actions (e.g. protest), locations (e.g. city, state), and dates. We labeled all the positive conflict articles for these entities and, using them as the training dataset, trained a spaCy model to find custom entities in the scraped news articles. Below is a representation of the entity recognition for one of the news articles. The actors here are farmers, farms, rains, and the government. Neither the number of people affected nor a clear action is mentioned in the article.
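The labeled data and training script are not included here, but training a custom NER model in spaCy 2.x roughly follows the pattern below. The example sentence, character offsets, and entity label names are hypothetical stand-ins for the project's annotation scheme.

import random
import spacy
from spacy.util import minibatch

# Hypothetical training example: text plus character offsets for custom entity labels.
TRAIN_DATA = [
    ("Farmers protested against the government in Nashik.",
     {"entities": [(0, 7, "ACTOR"), (8, 17, "ACTION"), (30, 40, "ACTOR"), (44, 50, "LOCATION")]}),
    # ... one entry per labeled conflict article
]

# Start from a blank English pipeline and add an NER component with the custom labels.
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

# Standard spaCy 2.x training loop.
optimizer = nlp.begin_training()
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.3, losses=losses)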

Matching news articles to policy documents and map visualization

Next, we matched each news article to the policy that could be applicable to its conflict. We downloaded all the policy documents related to the topics found with topic modeling and ran cosine similarity between articles and policies.
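Our exact matching code is not reproduced here; a minimal sketch of the idea looks like this, assuming TF-IDF vectors (the choice of vectorizer is illustrative) and that article_texts and policy_texts are already loaded as lists of strings.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# article_texts: positive conflict articles; policy_texts: downloaded policy documents.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(article_texts + policy_texts)

article_vecs = matrix[: len(article_texts)]
policy_vecs = matrix[len(article_texts):]

# Similarity of every article to every policy; pick the best-matching policy per article.
similarities = cosine_similarity(article_vecs, policy_vecs)
best_policy_per_article = similarities.argmax(axis=1)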

Finally, we created an app using plotly and the models from all tasks.
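The app itself is not shown here, but plotting geolocated conflict events on an interactive map with Plotly can be sketched as follows; the coordinates, topics, and titles in the example dataframe are made up.

import pandas as pd
import plotly.express as px

# Hypothetical dataframe of geocoded conflict events with their topic.
events = pd.DataFrame({
    "lat": [19.07, 28.61],
    "lon": [72.88, 77.21],
    "topic": ["agriculture", "land"],
    "title": ["Farmer protest over crop prices", "Land acquisition dispute"],
})

# Plot the events on an interactive map, colored by topic.
fig = px.scatter_mapbox(
    events, lat="lat", lon="lon", color="topic", hover_name="title",
    zoom=4, mapbox_style="open-street-map",
)
fig.show()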

The following video link shows a walkthrough of the app by Kali Deverasetti.

Conclusions

In this project with Omdena and the World Resources Institute, we worked collaboratively on various NLP tasks: coreference resolution, identifying articles as positive or negative conflict articles, topic modeling, custom entity recognition, matching conflict articles to policies with cosine similarity, and finally presenting our results in an app with map visualization. Following is the list of individuals who worked on this project.

Antonia Calvi Carlos Arturo P. Dennis Dondergoor Joanne Burke kali prasad deverasetti José Manuel Ramírez R. Jyothsna sai Tagirisa Kulsoom Abdullah Michael Lerner Rishika Rupam Sai Tanya Kumbharageri Shivam Swarnkar Srijha Kalyan Tomasz Grzegorzek Zaheeda Tshankie Dustin Gogoll Irene Nandutu Saurav Suresh Gabriela Urquieta Elizabeth Tishchenko Nikhel Gupta

Want to become an Omdena Collaborator and join one of our tough AI for Good challenges? Apply here.

If you want to receive updates on our AI Challenges, get expert interviews, and practical tips to boost your AI skills, subscribe to our monthly newsletter.

We are also on LinkedIn, Instagram, Facebook, and Twitter.
