Multi-Label Classification of sexual harassment personal stories. #MeToo

Sunil Belde
Published in Analytics Vidhya · 9 min read · Mar 8, 2021

Source: Google Images

Table of Contents : 📋

  1. Overview of the problem.
  2. How Machine Learning helps us solve this problem.
  3. Source of our data and its overview.
  4. Existing solutions for this type of task.
  5. Selection of the right performance metric.
  6. First cut approach.
  7. Exploratory Data Analysis.
  8. Data preprocessing and Feature Engineering.
  9. Experimentation with different models.
  10. Creating an End-to-End pipeline with the best model for getting predictions from raw input.
  11. Future work.
  12. References.

1. Overview of the problem. 🔍

To begin with, many women around the world are sexually harassed and assaulted by men. Many of them feel helpless and cannot defend themselves for lack of courage to speak up.

To shed light on this issue, Tarana Burke, an American activist from New York, started the MeToo movement in 2006 on the Myspace social network to promote “empowerment through empathy”. This helped other women with similar experiences raise their voices against such acts and defend themselves.

Later, in October 2017, American actress Alyssa Milano posted her views on sexual harassment on Twitter with the MeToo hashtag (#MeToo). The hashtag soon spread all over the world in English. The movement spread so widely that millions of people started sharing their stories with the MeToo hashtag on social media, gaining the attention of large corporations and civilians alike.

On the day of Milano’s tweet, October 15, 2017, the MeToo hashtag was used in more than 200,000 tweets, and it was tweeted more than 500,000 times within the next 24 hours.

Facebook reported that 45% of its users in the US had a friend who had posted using the term. Many celebrities also replied with their #MeToo stories.

This statistical evidence shows how large the impact of the movement was.

With this increasing number of personal stories, it became difficult to manually categorize the stories shared on the online forum SafeCity, which identifies crime patterns from them and takes the necessary actions to make public places safer.

2. How Machine Learning helps us solve this problem. 📝

The problem is to categorize these stories, each of which may fall into one or more categories. This type of problem maps to multi-label classification, where a story can have one or more labels associated with it.

As the stories are in the form of text, we can leverage the power of natural language processing to complete this task.

3. Source of our data and its overview 📂

The problem was taken from a research paper.

The dataset used in the research paper was obtained from Safecity, an online platform that collects and analyses crowd-sourced, anonymous reports of violent crime, identifying patterns and key insights.

Introduction to dataset —

We have the data in three files: train, dev, and test.

Number of rows in train.csv = 7201

Number of rows in dev.csv = 990

Number of rows in test.csv = 1701

All the data files contain 4 columns:

Description (string object): Personal story of a victim that was shared on social media with MeToo hashtag.

Commenting (integer): Label stating whether the description belongs to the Commenting category or not.

Ogling/Facial Expressions/Staring (integer): Label stating whether the description belongs to any of the Ogling, Facial Expressions, or Staring categories.

Touching/Groping (integer): Label stating whether the description belongs to the Touching or Groping categories.

Example Data point —

Description : I was at the tap when a boy came to pour water. He found a 14 years old girl waiting to fetch water and just grabbed her hands and dragged her away.

Commenting : 0

Ogling/Facial Expressions/Staring : 0

Touching/Groping : 1

This description falls into the Touching/Groping category.
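For reference, a minimal sketch of loading the three files with pandas (the file names are assumptions based on the description above):

```python
import pandas as pd

# File names are assumed; adjust paths to wherever the Safecity CSVs are stored
train = pd.read_csv("train.csv")
dev = pd.read_csv("dev.csv")
test = pd.read_csv("test.csv")

print(train.shape, dev.shape, test.shape)   # expected: (7201, 4) (990, 4) (1701, 4)
print(train.columns.tolist())               # Description + the 3 label columns
```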

4. Existing solutions for this type of task ✏️

The research paper mentioned earlier used deep learning models with CNN, RNN, and CNN-RNN style architectures on top of word embeddings for multi-label classification.

Improvements —

Now we will add some additional features on top of word embeddings such as

  • Length of the text at word level and character level.
  • Sentiment scores of the given description.

Experiments —

We will try out different word embeddings such as:

  • Pretrained GloVe embeddings
  • tfidf-W2V with GloVe embeddings as word vectors
  • Pretrained FastText embeddings
  • tfidf-W2V with FastText embeddings as word vectors

5. Selection of the right performance metric 📏

F1-Score = 2 * (precision * recall) / (precision + recall)

In the case of multi-label classification, there are two ways of calculating the F1-score:

Macro Averaged F1-Score: here we calculate the per-class scores individually and average them. We have 3 classes in our case, so we compute the precision and recall of each class, take their means, and combine them into the macro F1-score:

Macro-Precision = (Precision_1 + Precision_2 + Precision_3) / 3

Macro-Recall = (Recall_1 + Recall_2 + Recall_3) / 3

Macro-F1 = 2 * (Macro-Precision * Macro-Recall) / (Macro-Precision + Macro-Recall)

Micro Averaged F1-Score: here we calculate precision and recall globally, by summing up the true positives, false positives, and false negatives over all classes instead of computing them per class. The F1-score is then the harmonic mean of this precision and recall.

Micro-Precision = (TP_1 + TP_2 + TP_3) / (TP_1 + FP_1 + TP_2 + FP_2 + TP_3 + FP_3)

Micro-Recall = (TP_1 + TP_2 + TP_3) / (TP_1 + FN_1 + TP_2 + FN_2 + TP_3 + FN_3)

Micro-F1 = 2 * (Micro-Precision * Micro-Recall) / (Micro-Precision + Micro-Recall)

Hamming loss : It is the fraction of labels that are classified incorrectly.

Ex: if 2 labels out of 3 are classified incorrectly for a sample, its Hamming loss is 2/3.

Exact match ratio: it indicates the percentage of samples that have all their labels classified correctly. It only counts fully correct classifications and ignores partially correct ones.
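To make these metrics concrete, here is a small sketch using sklearn on a toy multi-label example (note that sklearn's macro average is the unweighted mean of the per-class F1-scores, which can differ slightly from combining averaged precision and recall):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, accuracy_score

# Toy ground truth and predictions: 3 stories, 3 labels each
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1]])

print("Macro F1    :", f1_score(y_true, y_pred, average="macro"))
print("Micro F1    :", f1_score(y_true, y_pred, average="micro"))
print("Hamming loss:", hamming_loss(y_true, y_pred))
print("Exact match :", accuracy_score(y_true, y_pred))  # subset accuracy
```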

Here, correct classification of every class is important, so we will use the macro F1-score as our performance metric: the macro F1-score treats all classes with equal importance, whereas the micro F1-score aggregates over all classes combined.

6. First cut approach.

  1. We will first analyse the dataset we have, i.e., the total number of data points, what each feature corresponds to, and whether any duplicates are present in the data, and remove them.
  2. Next, we will do some basic Exploratory Data Analysis (EDA) and draw conclusions from the resulting plots.
  3. We will do data preprocessing and cleaning, which involves tasks like removing stopwords (obtained from the nltk library), removing special symbols, de-concatenating contractions (can’t -> cannot, it’s -> it is), and converting the entire text to lowercase.
  4. As a baseline model we will use the multi-output classifier from sklearn. For this, we convert each text description into a tfidf-w2v representation, where the word vectors come from a pretrained GloVe model. We then fit the multi-output classifier on the train data and keep its results as the baseline (a minimal sketch follows this list).
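Here is a minimal sketch of that baseline, assuming X_train/X_dev are the prepared feature matrices and y_train/y_dev the 3-column label matrices, and using logistic regression as the base estimator (the choice of base estimator is an assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import f1_score

# One binary classifier per label, wrapped by MultiOutputClassifier
base = LogisticRegression(class_weight="balanced", max_iter=1000)
clf = MultiOutputClassifier(base)
clf.fit(X_train, y_train)        # X_*: tfidf-w2v (+ extra) features, y_*: 3 binary labels

pred = clf.predict(X_dev)
print("Macro F1:", f1_score(y_dev, pred, average="macro"))
```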

7. Exploratory Data Analysis (EDA) 📊

No missing data is present: the only feature is the description text, and every row contains one.

Next, we check for duplicates among the stories.

We find that there are 542 duplicates; let us analyse those duplicate rows further.

By examining the duplicates of a single description, we notice that many rows have been duplicated with different labels. This could have happened during manual labeling, where the same description is interpreted as falling into different categories in different circumstances.

To avoid this ambiguity, we replace all the duplicates of a data point with a single point having labels 1, 1, 1 (a sketch of this step is shown below).
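A pandas sketch of this step (the column names are assumptions based on the dataset description):

```python
import pandas as pd

LABEL_COLS = ["Commenting", "Ogling/Facial Expressions/Staring", "Touching/Groping"]

def resolve_duplicates(df):
    """Collapse duplicated descriptions into a single row labelled 1, 1, 1."""
    dup_mask = df.duplicated(subset="Description", keep=False)
    resolved = df[dup_mask].drop_duplicates(subset="Description").copy()
    resolved[LABEL_COLS] = 1                      # assign all three labels
    return pd.concat([df[~dup_mask], resolved], ignore_index=True)

train = pd.read_csv("train.csv")
train = resolve_duplicates(train)
```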

We plot the distribution of points among the different classes:

There is some class imbalance between the classes that we need to keep in mind while training the models.

Now, we plot the label count vs. the number of stories having that many labels:

From this plot we can see that most of the stories are associated with a single label or no label at all, and very few of them have all three labels.
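Both plots can be produced with a short matplotlib sketch like the one below (column names are assumptions):

```python
import matplotlib.pyplot as plt

LABEL_COLS = ["Commenting", "Ogling/Facial Expressions/Staring", "Touching/Groping"]

# Distribution of positive points per class
train[LABEL_COLS].sum().plot(kind="bar", title="Positive stories per class")
plt.ylabel("Number of stories")
plt.show()

# Number of labels per story vs. how many stories have that label count
train[LABEL_COLS].sum(axis=1).value_counts().sort_index().plot(
    kind="bar", title="Stories per label count")
plt.xlabel("Number of labels")
plt.ylabel("Number of stories")
plt.show()
```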

8. Data preprocessing and Feature Engineering 🔧 🔩

As part of data preprocessing, the following text cleaning is performed (sketched in code after the list):

  • De-concatenation of words
  • Removing special characters
  • Stopword removal
  • Stemming
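A minimal sketch of this cleaning step, assuming nltk for stopwords and stemming (the contraction map shown is only a small illustrative subset):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("stopwords")  # needed once
STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

# A few common contractions; the full mapping used would be longer
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "n't": " not"}

def clean_text(text):
    """Lowercase, expand contractions, strip special characters, remove stopwords, stem."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)                   # drop special characters
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(stemmer.stem(w) for w in words)

print(clean_text("He wouldn't stop staring, it's scary!!"))
```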

As part of feature engineering, we will extract the following features from the text data:

  • Description vector representation (300-dim)
  • Length of the text at word level and character level (2-dim)
  • Sentiment scores of the text using nltk (4-dim)

We will get a total of 306 features.

We will try out the following 4 types of description vector representations and treat them as 4 sets of data:

  • Pretrained GloVe embeddings
  • tfidf-W2V with GloVe embeddings as word vectors
  • Pretrained FastText embeddings
  • tfidf-W2V with FastText embeddings as word vectors

Using code like the following, we can create the word embeddings from GloVe and FastText:
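A sketch of the loading step, assuming the standard 300-dim pretrained files glove.840B.300d.txt and crawl-300d-2M.vec (the file names are assumptions):

```python
import numpy as np

def load_vectors(path, skip_header=False):
    """Load 300-dim word vectors from a plain-text file into a {word: np.array} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        if skip_header:
            next(f)                       # FastText .vec files start with "<vocab> <dim>"
        for line in f:
            values = line.rstrip().split(" ")
            word, vec = " ".join(values[:-300]), values[-300:]
            embeddings[word] = np.asarray(vec, dtype="float32")
    return embeddings

glove_embeddings = load_vectors("glove.840B.300d.txt")
fasttext_embeddings = load_vectors("crawl-300d-2M.vec", skip_header=True)
```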

Using the functions below, we can extract all the required features:
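Below is a sketch of those feature extractors: the 2-dim length features, the 4-dim nltk (VADER) sentiment scores, and the tfidf-weighted average of word vectors that gives the 300-dim description representation (exact implementation details are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download("vader_lexicon")  # needed once
sia = SentimentIntensityAnalyzer()

def length_features(text):
    """2-dim feature: word count and character count."""
    return np.array([len(text.split()), len(text)], dtype="float32")

def sentiment_features(text):
    """4-dim feature: VADER neg/neu/pos/compound scores."""
    s = sia.polarity_scores(text)
    return np.array([s["neg"], s["neu"], s["pos"], s["compound"]], dtype="float32")

def tfidf_w2v(texts, embeddings, dim=300):
    """300-dim tfidf-weighted average of word vectors for each description."""
    tfidf = TfidfVectorizer().fit(texts)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
    vectors = []
    for text in texts:
        words = [w for w in text.split() if w in embeddings and w in idf]
        if not words:                                # no known words -> zero vector
            vectors.append(np.zeros(dim, dtype="float32"))
            continue
        weights = np.array([idf[w] for w in words])  # repeated words act as the tf part
        vecs = np.array([embeddings[w] for w in words])
        vectors.append((weights[:, None] * vecs).sum(axis=0) / weights.sum())
    return np.vstack(vectors)
```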

9. Experimentation with different models 🔧

In total we extracted 306 features.

We will train each model on the 4 different word embedding formats.

Training a stacking classifier on the GloVe embeddings format:
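A sketch of one such stacking classifier on the GloVe feature set, wrapped in MultiOutputClassifier so that one stack is trained per label (the choice of base estimators is an assumption):

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from lightgbm import LGBMClassifier

estimators = [
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=200, class_weight="balanced")),
    ("lgbm", LGBMClassifier(class_weight="balanced")),
]
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(class_weight="balanced"),
)

clf = MultiOutputClassifier(stack)       # one stacking classifier per label
clf.fit(X_train_glove, y_train)          # X_train_glove: 306-dim GloVe-based features
```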

For the machine learning models we use class_weight='balanced', as we have some imbalance in the data.

For the deep learning models we write a custom loss function that takes care of the class imbalance by giving different weights to positive and negative labels for each of the three classes (sketched below).
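A sketch of such a loss, written as a weighted binary cross-entropy for Keras (the weight values shown are purely illustrative):

```python
from tensorflow.keras import backend as K

def weighted_binary_crossentropy(pos_weights, neg_weights):
    """Binary cross-entropy with separate weights for the positive and negative
    side of each of the 3 labels (pos_weights/neg_weights are length-3 lists)."""
    pos_w = K.constant(pos_weights, dtype="float32")
    neg_w = K.constant(neg_weights, dtype="float32")

    def loss(y_true, y_pred):
        y_true = K.cast(y_true, "float32")
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        per_label = -(pos_w * y_true * K.log(y_pred)
                      + neg_w * (1.0 - y_true) * K.log(1.0 - y_pred))
        return K.mean(per_label, axis=-1)

    return loss

# Illustrative usage (weights roughly inverse to label frequency, values assumed):
# model.compile(optimizer="adam",
#               loss=weighted_binary_crossentropy([2.0, 3.5, 1.5], [1.0, 1.0, 1.0]))
```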

Machine Learning Models
Deep Learning Models

Observing the results of all the trained models, we can conclude that GloVe embeddings perform better than FastText embeddings, so we use GloVe embeddings for deployment.

Best Machine Learning Models: LightGBM and the Stacking Classifier

Best Deep Learning Model: LSTM
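For illustration, a rough sketch of what such an LSTM multi-label classifier could look like in Keras, with GloVe vectors in a frozen embedding layer (the layer sizes and hyperparameters are assumptions, not necessarily those used in the experiments):

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN, VOCAB_SIZE, EMB_DIM = 100, 20000, 300   # assumed values

def build_lstm(embedding_matrix):
    """LSTM encoder followed by 3 sigmoid units, one per label."""
    inp = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM,
                         weights=[embedding_matrix], trainable=False)(inp)
    x = layers.LSTM(128)(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(3, activation="sigmoid")(x)
    model = models.Model(inp, out)
    # plain binary cross-entropy here; the weighted loss above can be swapped in
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_lstm(np.random.rand(VOCAB_SIZE, EMB_DIM))   # dummy matrix, for illustration
model.summary()
```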

10. Creating an End-to-End pipeline with the best model for getting predictions from raw input.

To create the pipeline, we save the best model as a pickle file and load it at prediction time. We should not train any model inside the pipeline.

We need to perform all the preprocessing and feature extraction steps on the raw input to build the feature vector for the model.
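A minimal sketch of the prediction pipeline, reusing the helper functions from the feature engineering section (the pickle file names are assumptions; in a real deployment the fitted tf-idf vectorizer would also be saved and reused rather than refit):

```python
import pickle
import numpy as np

with open("best_model.pkl", "rb") as f:          # best trained model, saved earlier
    model = pickle.load(f)
with open("glove_embeddings.pkl", "rb") as f:    # pretrained GloVe vectors
    glove_embeddings = pickle.load(f)

LABELS = ["Commenting", "Ogling/Facial Expressions/Staring", "Touching/Groping"]

def predict_labels(raw_text):
    """Raw description -> list of predicted categories."""
    cleaned = clean_text(raw_text)                             # preprocessing
    vec = tfidf_w2v([cleaned], glove_embeddings)[0]            # 300-dim description vector
    extra = np.concatenate([length_features(cleaned), sentiment_features(cleaned)])
    features = np.concatenate([vec, extra]).reshape(1, -1)     # 306-dim feature vector
    preds = model.predict(features)[0]
    return [label for label, p in zip(LABELS, preds) if p == 1]

print(predict_labels("He grabbed her hand and dragged her away."))
```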


You can see a recorded video of the Flask application deployed on an AWS EC2 instance below:

11. Future Work 🕐

  • A pretrained BERT model can be used to get a 768-dim description vector representation. This 768-dimensional output is passed through a fully connected neural network followed by a sigmoid activation layer; in our case there will be 3 sigmoid units, giving the probability of each class label (a rough sketch is given after this list).
  • Character-level embeddings can be utilized in place of word-level embeddings with appropriate architectures.
  • LIME analysis can be done to get more interpretability of the model by obtaining the word-level contributions to each prediction.
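As referenced in the first point, a rough sketch of that BERT-based classifier, assuming the Hugging Face transformers library with Keras (model choice, sequence length, and layer sizes are assumptions):

```python
from tensorflow.keras import layers, models
from transformers import TFBertModel

MAX_LEN = 128   # assumed maximum sequence length

def build_bert_classifier():
    """768-dim BERT [CLS] representation -> fully connected layer -> 3 sigmoid units."""
    bert = TFBertModel.from_pretrained("bert-base-uncased")
    input_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")
    attention_mask = layers.Input(shape=(MAX_LEN,), dtype="int32")
    cls = bert(input_ids, attention_mask=attention_mask)[0][:, 0, :]   # [CLS] vector
    x = layers.Dense(256, activation="relu")(cls)
    out = layers.Dense(3, activation="sigmoid")(x)    # one probability per class label
    model = models.Model([input_ids, attention_mask], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```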

— — — —THANK YOU FOR READING — — — —
