My Omdena Journey: Being a Task Manager, Taking Risks and Not Being Scared of Failing

How we used the BERT language model with a classification layer to label land conflict articles in India, reaching an F1 score of 0.92.

Antonia Calvi
Omdena
8 min read · Nov 30, 2019


Photo by Markus Spiske on Unsplash

By the end of the summer between my undergraduate degree and my upcoming master's, I felt like I had traveled and relaxed enough and wanted to get back into working. I was just about to start a Master's in Machine Learning, so I wanted to brush up on the things I knew as well as learn new topics.

I had heard a lot about data science applied to social good and was actively researching the field, and that's when I stumbled upon Omdena. The whole concept really excited me: the potential of working on large-scale projects that tackle worldwide problems, with the final outcome actually making an impact.

I was also really drawn to working collaboratively with a group of other passionate people. I applied for the challenge with the World Resources Institute on identifying land conflict articles in India and linking them to mediating policies so that conflicts can be resolved faster. Land conflicts affect more than 7 million people in India.

Conflicts related to land and resources

When I got accepted, I was very happy.

The beginning wasn’t so easy

At first, everything was a bit overwhelming. I had never worked with such a large group of people, all coming from different backgrounds and with different skills, and I had never worked on such a large-scale project either. It was reassuring, however, to feel that most people felt the same way. The first week included a lot of reading and researching and getting our heads around the project, and after that, we all decided on the tasks we wanted to contribute to. An aspect of Omdena I strongly appreciate is their belief that you learn by doing: if you have an idea, do not be scared to attempt it, and if it doesn't turn out great, it's part of the learning process.

Photo by Nick Morrison on Unsplash

It’s important to stay involved and not get scared

Because I really enjoy working with others and learning from other people, I joined multiple tasks and contributed as much as I could to each one. A great deal of learning also comes from the exchange of ideas on how to tackle each problem, so keeping on top of all of this was something I found extremely beneficial.

I was also put in charge of my own task: identifying newspaper articles dealing with environmental conflicts. When I joined Omdena, I knew I wanted to contribute in the best way I could, but I never thought I would be given such a big responsibility. Being in charge of this task pushed me to stay motivated and not be afraid to test different approaches to the same problem.

Being a task manager does not mean having all the answers

As a task manager, I was expected to show gradual progress every week, and I knew that in the end a solution had to be delivered. But there was no expectation of having answers immediately; what was expected was iterative progress towards a final solution.

Answers are found by testing different methods, and for this task that meant two stages: an unsupervised approach at first, followed by a supervised one.

Phase 1: Unsupervised method

At the first stage of the challenge, we did not have any labeled data: just a collection of thousands of unlabelled articles, plus a gold-standard dataset of 250 articles known to be about environmental conflicts.

I was attempting to find a way of differentiating between articles without labels, and I thought of approaching this through KMeans clustering, a method that attempts to group similar items together into clusters.

But how do you do KMeans clustering on newspaper articles, which consist only of words a model cannot interpret directly? You have to translate them into a format the algorithm can work with: each article has to be given a numerical representation.

There has recently been a lot of research into creating vector representations for words, paragraphs, and documents such that they are able to encode 'meaning'. 'Meaning' in this case can be interpreted as where in the embedding space the vectors lie, with the intuition that words/documents/paragraphs whose vectors are close together have similar meaning or content.

So how did I do this for the articles?

I transformed each document into a vector representation using the Universal Sentence Encoder (easily available on TF-Hub), which was trained specifically for sentences and documents. I then performed KMeans clustering with 20 clusters, a number chosen through the elbow method and silhouette analysis. A pitfall of KMeans clustering is that you have to specify the number of clusters the algorithm should create, and when you don't know what to expect (because the data is unlabelled), this can be quite hard. The elbow method and silhouette analysis are two standard ways of choosing the number of clusters based on the consistency within clusters.
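Here is a minimal sketch of that step, assuming scikit-learn for the clustering and the TF-Hub Universal Sentence Encoder module (the module URL, the `load_articles` helper, and the exact range of k values tried are illustrative, not the exact code we ran):

```python
import numpy as np
import tensorflow_hub as hub
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Universal Sentence Encoder from TF-Hub (one 512-dim vector per text).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

articles = load_articles()            # hypothetical helper returning a list of article texts
embeddings = np.array(embed(articles))

# Elbow method + silhouette analysis to pick the number of clusters.
for k in range(5, 31, 5):
    km = KMeans(n_clusters=k, random_state=42).fit(embeddings)
    print(k, km.inertia_, silhouette_score(embeddings, km.labels_))

# Final clustering with the chosen k (20 in our case).
kmeans = KMeans(n_clusters=20, random_state=42).fit(embeddings)
cluster_labels = kmeans.labels_
```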

The challenge now lay in identifying what kind of documents belonged to each cluster. To tackle this, I performed NMF topic modeling on each cluster and looked at its top 7 topics.
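As a rough sketch, this per-cluster topic modeling can be done with scikit-learn's TF-IDF vectorizer and NMF; the parameters and the `top_topics_for_cluster` helper below are illustrative rather than the exact code used:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def top_topics_for_cluster(texts, n_topics=7, n_words=10):
    """Fit NMF on one cluster's articles and return the top words of each topic."""
    tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
    doc_term = tfidf.fit_transform(texts)
    nmf = NMF(n_components=n_topics, random_state=42).fit(doc_term)
    terms = tfidf.get_feature_names_out()
    return [[terms[i] for i in topic.argsort()[::-1][:n_words]]
            for topic in nmf.components_]

# Inspect each cluster's topics and manually flag the environment-related ones.
for cluster_id in range(20):
    cluster_texts = [a for a, c in zip(articles, cluster_labels) if c == cluster_id]
    print(cluster_id, top_topics_for_cluster(cluster_texts))
```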

If the topics appeared to be environmentally related, I would consider the documents in that cluster as positive documents. Examples of good topics are shown in the following picture:

I ended up choosing 7 of the 20 clusters as 'good' clusters and then evaluated my results on the gold-standard dataset that was provided. Ideally, all of its documents should have been classified as positive; 81.64% of them fell into one of the 7 chosen clusters.
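That check amounts to assigning the gold-standard articles to clusters and counting how many land in the chosen ones; a minimal sketch, with `gold_embeddings` and `good_clusters` as placeholder names:

```python
# `gold_embeddings`: USE vectors for the 250 gold-standard articles (placeholder name).
# `good_clusters`: the set of 7 cluster ids judged to be environment-related.
gold_clusters = kmeans.predict(gold_embeddings)
coverage = sum(c in good_clusters for c in gold_clusters) / len(gold_clusters)
print(f"Gold-standard coverage: {coverage:.2%}")
```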

Overall this was not a bad result, given the lack of labeled data. However, a big problem of having no labels was that there was also no way of estimating how many negative documents were being wrongly classified as positive by this model.

Phase 2: Supervised

As a group, we all soon realized that the lack of labeled data made everyone's job, whatever their task, really hard. It made it difficult to validate our results and judge whether changes or different attempts led to an improvement. We therefore all buckled down and labeled a total of 1,500 documents, both positive and negative. Once this was done, I tested the unsupervised method against these labels and found that only 76% of the positive documents were retrieved, while 349 of the 1,282 negative documents came back as false positives. This needed improvement.

I decided to attempt a supervised method of labeling these texts. I did this by making use of BERT, a language model developed by Google with a deep sense of language context. By adding a classification layer on top of the model, BERT can be used for classification tasks.
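A simplified sketch of how such a setup can look, using the Hugging Face transformers library (the library choice, the `train_texts`/`train_labels` names, and the hyperparameters are illustrative, not the exact code we ran):

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained BERT with a classification head: 2 labels (conflict / not conflict).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

# `train_texts` / `train_labels` stand in for the hand-labeled articles.
enc = tokenizer(train_texts, truncation=True, padding=True,
                max_length=512, return_tensors="pt")
labels = torch.tensor(train_labels)

model.train()
for epoch in range(3):                  # in practice, iterate over mini-batches
    optimizer.zero_grad()
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
```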

A first test resulted in 86% accuracy. Although this may seem good at first glance, our data was extremely imbalanced: we had far fewer articles dealing with environmental conflicts than the opposite. Of the labeled articles, 1,282 were negative and 286 were positive, so labeling everything negative would by itself yield 81.8% accuracy. To better evaluate the model, I therefore started looking at the F1 score, which takes into account both the precision and recall of the model and is therefore a better way of assessing its performance. This first version of the model yielded an F1 score of 0.71. The aim was to get values closer to 1.
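To see why accuracy is misleading here, a quick sanity check with scikit-learn using the class counts above:

```python
from sklearn.metrics import accuracy_score, f1_score

# Class counts from our labeled set: 1,282 negative vs. 286 positive articles.
y_true = [0] * 1282 + [1] * 286
y_all_negative = [0] * len(y_true)      # a "model" that labels everything negative

print(accuracy_score(y_true, y_all_negative))             # ~0.818
print(f1_score(y_true, y_all_negative, zero_division=0))  # 0.0 on the positive class
```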

By looking back at the data, we realized that some of it had been wrongly labeled. Cleaning this up yielded a strong improvement: an F1 score of 0.83. Finally, another task in the challenge was to augment the articles with pronoun resolution, i.e. replacing each pronoun with the noun it refers to. The intuition behind this was that it would make the actors in the articles explicit, in the hope of classifying the text better. This did indeed cause a further improvement, with an F1 score of 0.92 on the positive label, and it was therefore the final model used.
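One way to do this kind of pronoun (coreference) resolution is with neuralcoref, a spaCy extension; the sketch below is purely illustrative and not necessarily the tool the task team used:

```python
import spacy
import neuralcoref   # spaCy extension for coreference resolution

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

text = ("The villagers protested against the mining company. "
        "They said it had taken their land.")
doc = nlp(text)
# Text with each pronoun replaced by the phrase it refers to.
print(doc._.coref_resolved)
```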

What to take away from my experience?

What I think is important to note is that the final model used for this task required a lot of iteration and trial and error. Even though some methods were attempted and then not used in the delivered project (KMeans), I learned that a lot of the working process is like this. What's more, attempting to classify with KMeans required a bit of thinking 'outside the box' and made me explore topic modeling, which was new to me. Another massive takeaway is the importance of data, and how big an improvement it can make to already good models.

A further improvement would be to augment the data we had by creating extra documents from the original articles with methods such as synonym replacement. Although this was discussed, we ran out of time to actually test it.
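For illustration, here is a minimal sketch of what synonym-replacement augmentation could look like, using NLTK's WordNet; since we never built it, this is only an assumption of how it might be done:

```python
import random
from nltk.corpus import wordnet   # requires nltk.download("wordnet") beforehand

def synonym_replace(text, n_swaps=5):
    """Return a copy of `text` with a few words swapped for WordNet synonyms."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n_swaps]:
        synonyms = {l.name().replace("_", " ")
                    for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        synonyms.discard(words[i])
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
    return " ".join(words)

augmented = synonym_replace("Farmers lost their land to the new dam project.")
```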

As outsiders to any project, we tend to be exposed only to the good final results, and rarely realize how many things were tried before those were achieved. We can't always have the correct answer, or the right resources, immediately.

Working with Omdena really made me understand this.

I am so grateful to have taken part in this project. It taught me a lot, especially on a personal level: testing things out, keeping on top of deadlines, and communicating and working with a team from around the world.

Want to become an Omdena Collaborator and join one of our tough AI for Good challenges? Apply here.
