Photo by akilash sooravally / Shutterstock

Applying AI to Analyze Domestic Violence in Lockdowns — From No Data to Building an ML Classifier for Tweets

Data mining, topic modeling, document annotations, NLP, and stacking machine learning models: A complete journey.

Harshita Chopra
Published in Omdena
8 min read · Jul 9, 2020

Artificial Intelligence and its possibilities have always fascinated me. Making machines learn through data is undeniably phenomenal. When I got to learn about Omdena and its wonderful initiative for bringing AI to social good using the power of global collaboration, I couldn’t stop myself from participating in its empowering challenges.

I was delighted to be given the role of Machine Learning Engineer in my first project. Connecting with a team of 50 fabulous collaborators from countries around the world, including domain experts, data scientists, and AI practitioners, was a golden opportunity to learn in the best possible way.

Having the space to turn my ideas into something of value, and seeing them improved along the way, made me learn quickly. The cordial environment also gave me experience in leading task groups and interacting with many innovative minds.

In this blog post, I’ll walk you through a major part of the project I led and contributed to for the past month.

The problem

There has been a surge in Domestic Violence (DV) and online harassment cases during the COVID-19 lockdowns in India. Homes are no longer a safe place for victims trapped in abusive relationships with their family members.

Domestic violence involves a pattern of psychological, physical, sexual, financial, and emotional abuse. Acts of assault, threats, humiliation, and intimidation are also considered acts of violence.

Data substantiating DV from government sources is available only in summary form. Incidents are largely reported via phone calls, which makes collecting the data and mapping it difficult.

The goal of the challenge was to collect and analyze data from different social media platforms or news sources so as to gain insights on the rise in DV incidents during the nation-wide lockdown.

Word-clouds from tweets describing incidents of domestic abuse

The solution

Social media platforms are a huge and largely untapped resource for social data and evidence. They generate a vast amount of data every day on a wide variety of topics. Consequently, they are a key source of information for anyone seeking to study social issues, even stigmatized and taboo topics like DV.

Victims experiencing abuse need early access to specialized services such as health care, crisis support, and legal guidance. Social support groups therefore play a leading role in raising awareness and in offering victims emotional, instrumental, and informational support.

Red Dot Foundation plans to take on this challenge. When victims seek help, it is important to identify and analyze those critical posts and to respond to the help needed as quickly as possible.

Tasks were divided to mine data from different sources: Twitter, Reddit, YouTube, news articles, government reports, and Google Trends. After acquiring large amounts of data, the next step was to filter out relevant posts through topic modeling and keywords. This was followed by annotating the data and then building an NLP-based machine learning classifier.

In this blog post, tweets are in the spotlight!

Scraping data with the right queries

Tweets were extracted from both the pre-lockdown and the lockdown period so that we could gauge the surge in domestic abuse. Hence, we took a time frame of January 2020 to May 2020.

Tweepy (a Python client for Twitter’s official API) returns tweets only from the past seven days on the free plan, which was a serious limitation. Hence, we needed an alternative for mining old tweets along with their location.

GetOldTweets3 is an effective Python library for this task, and because it is open source, the pipeline can be extended while staying open source too.
Twitter’s advanced search does a great job of generating a customized query. To extract harassment-related posts, here are a few examples of the queries we used:

Using AND combinations of ‘relationship’ words with ‘actions’ and ‘common nouns’ yields the required results. The until and since attributes set the limits of the time period.

The setNear() feature accepts a location name (e.g., Delhi, Maharashtra, or India) or the latitude and longitude of that region. The central point of India is approximately (22, 80) degrees. The setWithin() feature accepts the radius around this point, and 1800 km roughly covers India and nearby places.
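
As a rough illustration of how such queries might be assembled, the sketch below pairs relationship words with action words and bundles each query with the date and location limits described above. The keyword lists, the exact dates, and the `build_search_specs` helper are hypothetical and are not the project's actual queries. In GetOldTweets3, each dictionary would typically be passed to a `TweetCriteria` object through methods such as `setQuerySearch()`, `setSince()`, `setUntil()`, `setNear()`, and `setWithin()`.

```python
from itertools import product

# Hypothetical keyword lists, for illustration only.
RELATIONSHIPS = ["husband", "wife", "in-laws"]
ACTIONS = ["beat", "abuse", "harass"]


def build_search_specs(relationships, actions,
                       since="2020-01-01", until="2020-05-31",
                       near="22,80", within="1800km"):
    """Pair every relationship word with every action word."""
    specs = []
    for rel, act in product(relationships, actions):
        specs.append({
            # Space-separated terms: Twitter search requires both to appear.
            "query": f"{rel} {act}",
            "since": since,    # assumed start of the time frame
            "until": until,    # assumed end of the time frame
            "near": near,      # approximate centre of India (lat,long)
            "within": within,  # radius that roughly covers India
        })
    return specs


specs = build_search_specs(RELATIONSHIPS, ACTIONS)
print(len(specs), specs[0]["query"])
```

Keeping the query specifications as plain data makes it easy to log which keyword pair produced which tweets when the results are filtered later.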

After executing more such queries with different keywords, we had thousands of tweets, some on relevant topics and some irrelevant.

Data needs to be classified — Would topic modeling work?

Since a considerable number of tweets in our dataset were not related to the kind of harassment we were looking for, we needed some filtering. Classifying tweets into broad topics was the goal. Topic modeling was the first thing that clicked.

Topic modeling is an unsupervised learning process that automatically identifies the topics present in a collection of documents and derives hidden patterns in a text corpus, which assists better decision making.

Latent Dirichlet Allocation is the most popular technique to do so. It assumes that documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
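
To make that generative story concrete, here is a minimal sketch of how LDA assumes a single document comes into being. It only simulates the forward process (mixture of topics, then words); it does not perform the inference step that recovers topics from real data. The toy topics and their probabilities are made up for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document the way LDA assumes documents are produced."""
    # 1. Pick this document's mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the mixture.
        topic_id = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topics[topic_id].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical toy topics: word -> probability within the topic.
TOPICS = [
    {"awareness": 0.5, "helpline": 0.3, "support": 0.2},
    {"husband": 0.4, "beat": 0.4, "police": 0.2},
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(t, 2) for t in theta], doc)
```

Fitting LDA means running this story in reverse: given only the words, estimate the topic mixtures and word distributions that most plausibly produced them.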

> Topic 0 words are generally included in awareness posts or #BanTiktok posts due to their inappropriate content.
> Topic 1 words are headlines or real victim stories.

Topic modeling works best when the topics are considerably distinct and not too closely related to each other.
The generated topics didn’t meet our goal of classifying tweets as relevant or irrelevant, since our dataset as a whole talks about kinds of harassment. Hence we had to pick another approach.

Rule-based classification turned out to be a more precise approach for this task. I created three sets of keywords to look for: relationships, violence verbs, and not-relevant words. The following algorithm was implemented in Python to filter out some documents.

Relationships: [List of keywords like wife, husband, housemaid etc.]
Verbs : [List of keywords like harass, beat, abuse etc.]
Not-relevant : [List of keywords like webinar, politics, movie etc.]
Iterating through document:
{
R1: (any word in Relationships) AND (any word in Verbs) -> 'Keep'
R2: (any word in Not-relevant) -> 'Discard'
}
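
A minimal runnable sketch of these rules might look like the following. The keyword sets are abbreviated stand-ins for the longer project lists, and the precedence is an assumption: here a not-relevant word (R2) overrides a relationship-plus-verb match (R1), which the outline above does not spell out.

```python
import re

# Hypothetical, abbreviated keyword sets; the project's lists were longer.
RELATIONSHIPS = {"wife", "husband", "housemaid"}
VERBS = {"harass", "beat", "abuse"}
NOT_RELEVANT = {"webinar", "politics", "movie"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_relevant(text):
    tokens = tokenize(text)
    # R2: any not-relevant word discards the document (assumed to take priority).
    if tokens & NOT_RELEVANT:
        return False
    # R1: keep only if a relationship word AND a violence verb both appear.
    return bool(tokens & RELATIONSHIPS) and bool(tokens & VERBS)


tweets = [
    "Her husband would beat her every night during lockdown",
    "Join our webinar: why husbands abuse",
    "Loved the new movie about a husband and wife",
    "Stay home, stay safe",
]
kept = [t for t in tweets if is_relevant(t)]
print(kept)
```

Exact token matching misses inflected forms such as "beats" or "harassed", so in practice the keywords would need to be matched against stems or lemmas.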

The dataset was filtered based on our needs. Before modeling, we needed annotations to train a supervised model.

Deciding Labels and Annotating Tweets

Experienced domain experts proposed the categories used to classify tweets based on their context. Proper labeling guidelines were set up, and training sessions helped collaborators label tweets properly, keeping the edge cases in mind. For document annotation, a quick and efficient tool called Doccano was used. Several collaborators assisted by taking up queues of data points and annotating them. The following labels were used:

  • DV_OPINION_ADVOCATE
    (advocating against domestic abuse)
  • DV_OPINION_DENIER
    (denying the existence of domestic abuse)
  • DV_OPINION_INFO_NEWS
    (stating factual information or news)
  • DV_STORY
    (describing an incident of domestic abuse)
  • NON_D_VIOLENCE_ABOUT
    (other kinds of harassment)
  • NON_D_VIOLENCE_DIRECTED
    (harassment directed at individual or community)
  • NO_VIOLENCE

Analytics derived from NER

And data for modeling is ready!

After all the annotations and hard work with collaborators, we were ready with an incredible and novel training dataset.

Tinkering with Natural Language Processing…

Pre-processing consisted of lowercasing the text, removing URLs, punctuation, and stopwords, and then lemmatizing. Once that was done, we were ready to experiment with modeling techniques.
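
The cleaning steps can be sketched with the standard library alone. The stopword list below is a tiny illustrative one, and lemmatization is only marked as a comment because it needs an external resource (for example a WordNet-based lemmatizer), which this sketch deliberately avoids.

```python
import re
import string

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "to", "of", "and", "in",
             "her", "his", "by"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, strip URLs and punctuation, and drop stopwords."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)    # remove URLs before punctuation
    text = text.translate(PUNCT_TABLE)   # delete ASCII punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    # Lemmatization would be applied to `tokens` here (omitted in this sketch).
    return tokens


print(preprocess("She was HARASSED by her in-laws!! Read: https://example.com/story"))
# -> ['she', 'harassed', 'inlaws', 'read']
```

URLs are removed before punctuation on purpose: stripping punctuation first would break the URL into ordinary-looking tokens that the pattern could no longer catch.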

To convert words to vectors (machines learn through numbers), a TF-IDF vectorizer was used first. It gave decent results, but its vocabulary was limited to the training data, and the inference data would contain a greater variety of words. Therefore, we decided to use pre-trained word embeddings.

Our model used FastText English word vectors (wiki-news-300d-1M.vec) and IndicNLP word vectors (indicnlp.v1.hi.vec) for the Hindi and Hinglish text present in the documents.
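
For readers unfamiliar with `.vec` files, the sketch below reads the plain-text format (a header line with the word count and dimension, then one word and its numbers per line) and turns a tokenized tweet into a single vector by averaging. Averaging is an assumption made for illustration, since the post does not say how word vectors were pooled, and the three-dimensional toy data stands in for the real 300-dimensional files.

```python
import io


def load_vec(handle):
    """Read the text .vec format: a 'count dim' header, then one word per line."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def embed(tokens, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


# Hypothetical 3-dimensional toy data standing in for a real .vec file.
TOY_VEC = "3 3\nhusband 1.0 0.0 2.0\nbeat 3.0 2.0 0.0\nhome 0.0 1.0 1.0\n"
vectors, dim = load_vec(io.StringIO(TOY_VEC))
print(embed(["husband", "beat", "unknownword"], vectors, dim))
# -> [2.0, 1.0, 1.0]
```

With real files, `io.StringIO(TOY_VEC)` would be replaced by an opened file handle, and words missing from one vocabulary could be looked up in the other.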

Since tweets related to DV stories were few in number, data augmentation was used on them, creating new sentences by replacing original words with synonyms.

nlpaug is a library for textual augmentation in machine learning experiments. Its goal is to improve model performance by generating augmented textual data. It can also generate adversarial examples to help guard against adversarial attacks.
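
The idea behind synonym-based augmentation can be shown without the library. The sketch below is not nlpaug's API; it is a hand-rolled illustration with a hypothetical synonym table, whereas a library such as nlpaug can draw synonyms from resources like WordNet.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "beat": ["hit", "thrash"],
    "scared": ["afraid", "frightened"],
    "home": ["house"],
}


def augment(sentence, rng, n_variants=2):
    """Create variants by randomly swapping words that have a known synonym."""
    words = sentence.split()
    variants = set()
    for _ in range(n_variants * 5):  # a few retries to find distinct variants
        new_words = [
            rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < 0.5 else w
            for w in words
        ]
        candidate = " ".join(new_words)
        if candidate != sentence:
            variants.add(candidate)
        if len(variants) == n_variants:
            break
    return sorted(variants)


original = "he beat her and she is scared at home"
rng = random.Random(42)
variants = augment(original, rng)
for variant in variants:
    print(variant)
```

Each variant keeps the label of the sentence it came from, which is how the minority class gains extra training examples.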

Bringing into play — MACHINE LEARNING MODEL(s)

A number of models, including BERT, Bag of Words with SVM, XGBoost, and RandomForest, were evaluated. Since there were only minute differences between similar classes, we needed to combine two sets of labels.

After combining similar labels:

Limitations faced: data under these classes was not easily separable because three classes all talked plainly about domestic violence (story, opinion, news/information), which made it tough for the classifier to spot marginal variations in semantics.

Also, DV_STORY had the fewest samples, even though it was the most relevant class.

Hence, to deal with the imbalanced dataset, undersampling with NeighbourhoodCleaningRule from the imbalanced-learn library was used. The resampled data was fed to stacked models.

Stacking is a way of combining predictions from multiple different types of ML models by introducing a meta-learner that is trained on their outputs.
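
To give a feel for what this kind of undersampling does, here is a simplified pure-Python sketch of nearest-neighbour cleaning: a non-minority sample is dropped when most of its nearest neighbours carry a different label. This is not imbalanced-learn's implementation, which adds a further cleaning step; in the library, the sampler is imported from `imblearn.under_sampling` and applied with `fit_resample(X, y)`. The toy points and the choice of `k=3` are assumptions for illustration.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clean_majority(X, y, minority_label, k=3):
    """Drop non-minority samples whose k nearest neighbours mostly disagree."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == minority_label:
            keep.append(i)  # minority samples are never removed
            continue
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: squared_distance(xi, X[j]),
        )[:k]
        votes = Counter(y[j] for j in neighbours)
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy 2-D data: one NO_VIOLENCE point sits inside the DV_STORY cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5.5, 5.5),
     (5, 5), (5, 6), (6, 5)]
y = ["NO_VIOLENCE"] * 6 + ["DV_STORY"] * 3

X_res, y_res = clean_majority(X, y, minority_label="DV_STORY")
print(Counter(y_res))
```

Only the ambiguous majority point near the minority cluster is removed, so the class boundary becomes cleaner without discarding any of the scarce DV_STORY samples.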

Source: GeeksforGeeks

Level 0 learners:
- Random Forest Classifier
- Support Vector Classifier
- MLP Classifier
- LGBM Classifier

Level 1 meta-learner:
SVC with hyperparameter tuning and custom class weights.
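
A stack of this shape can be sketched with scikit-learn's `StackingClassifier`. This is an approximation under stated assumptions, not the project's code: the data is synthetic, the LGBM learner is left out to avoid an extra dependency, no hyperparameter tuning is shown, and `class_weight="balanced"` is a placeholder for the custom class weights mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the tweet embeddings and their five class labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Level 0: base learners (LGBMClassifier omitted in this sketch).
level0 = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC()),
    ("mlp", MLPClassifier(max_iter=500, random_state=0)),
]

# Level 1: an SVC meta-learner trained on cross-validated level-0 outputs.
stack = StackingClassifier(
    estimators=level0,
    final_estimator=SVC(class_weight="balanced"),  # placeholder weights
    cv=5,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The `cv=5` setting matters: the meta-learner sees only predictions made on folds the base learners were not trained on, which keeps it from simply learning to trust overfitted base models.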

Class Encoding — 0: DV_INCIDENT, 1: DV_OPINION, 2: DV_OPINION_INFO_NEWS, 3: NON_D_VIOLENCE_ABOUT, 4: NO_VIOLENCE

This sums up the classification task. The model was used to predict labels on an inference set of 8000 tweets. Misclassifications in the crucial classes were skimmed through and corrected in order to deliver the best data for analysis. The inferences were used to create interactive dashboards showing trends of domestic violence along with insights extracted from Named Entity Recognition.

I feel elated to be a part of this incredible community of change-makers. Creating some strong connections through this journey made the experience even better.

I am thrilled to collaborate on upcoming high-impact projects, making the world a better place using AI for good!

More About Omdena

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.
