Published in Analytics Vidhya

Sexual Harassment(Safe City) — Case Study

Me Too

Contents :

  1. Overview
  2. About the Data.
  3. Selection of right performance metric.
  4. Exploratory Data Analysis.
  5. Data preprocessing.
  6. Vectorization and Feature Extraction
  7. Experimentation with different models.
  8. Results
  9. End-to-End pipeline.
  10. Sample Output.
  11. Deployment
  12. APP Link
  13. Future Work.
  14. References.

1 — Overview: Safe City is a multi-label dataset of sexual harassment complaints in which each complaint is labelled against three classes: Commenting, Ogling and Groping. This is an NLP classification problem adapted from the paper "Safe City: Understanding Diverse Forms of Sexual Harassment Personal Stories", and I am using the same dataset as the paper. Safe City is a crowdsourcing platform for personal stories of sexual harassment and abuse; each story includes one or more tagged forms of sexual harassment along with a description.

2 — About the Data: The data has been scraped from

The dataset consists of three parts: Train, Dev and Test.
Each part contains four columns:

  • Description (string object): Personal story of a victim that was shared on social media with the Me-too hashtag.
  • Commenting (integer): Label stating whether the description belongs to the Commenting category or not.
  • Ogling/Facial Expressions/Staring (integer): Label stating whether the description belongs to any of the Ogling, Facial Expressions or Staring categories.
  • Touching /Groping (integer): Label stating whether the description belongs to the Touching or Groping category.

3 — Performance Metrics: There are many performance metrics we can use for multi-label classification:

  • Accuracy: Accuracy is the proportion of correct predictions (True Positives and True Negatives) among all data points. It is simple, but it can be misleading when evaluating a model, especially on imbalanced data.
Accuracy formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision: It is the ratio of True Positives (TP) to all predicted positives (TP and FP). In other words, of all the data points our model predicted as positive, how many were actually positive.
Precision formula: Precision = TP / (TP + FP)
  • Recall: It is the ratio of True Positives (TP) to all actual positives (TP and FN). In other words, of all the data points that were actually positive, how many our model predicted as positive.
Recall formula: Recall = TP / (TP + FN)

As we can see, Precision and Recall both score only the positive class and leave a loose end for the negative class. In some problems you might care only about the positive class, but in this problem we need information about both.
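To make the definitions above concrete, here is a minimal sketch in plain Python for a single binary label. The toy labels are made up for illustration and are not taken from the Safe City data; returning 0.0 when a denominator is zero is a convention chosen here, not something the article specifies.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def precision(y_true, y_pred):
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    # Assumed convention: 0.0 when the model predicts no positives at all.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy imbalanced example: 8 negatives, 2 positives,
# and a model that always predicts "negative".
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

print(accuracy(y_true, y_pred))   # 0.8
print(precision(y_true, y_pred))  # 0.0
print(recall(y_true, y_pred))     # 0.0
```

The model never finds a positive case, yet accuracy is still 0.8, which is why accuracy alone can mislead on imbalanced data.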

  • F1 Score: It is the harmonic mean of Precision and Recall, so it takes both False Positives (FP) and False Negatives (FN) into account. F1 is usually more useful than accuracy, especially if you have an uneven class distribution.

Unfortunately, it is not as easy to interpret as accuracy. For instance, an F1 Score of 0.86 does not tell us directly what went right or wrong, unlike Precision or Recall, because it blends the two into a single number.

F1 formula: F1 = 2 × Precision × Recall / (Precision + Recall)
  • Hamming Loss: It is the fraction of wrongly predicted labels out of the total number of labels. It is very useful for multi-label classification because it also gives some credit to partially correct predictions.
  • Complement Hamming Loss: Also known as Hamming Score, it is a very simple metric, just the complement of hamming loss, i.e. 1 - Hamming Loss. In simple words, hamming loss tells us the fraction of labels our model got wrong, while hamming score tells us the fraction it got right, much like accuracy. We will use this metric because we are familiar with accuracy and can interpret it easily.
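As a rough illustration of the last two metrics, the sketch below computes hamming loss and hamming score for multi-label predictions in plain Python. The exact-match accuracy function and the toy label rows are additions for comparison only; they are not from the article.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted wrongly."""
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


def hamming_score(y_true, y_pred):
    """Complement of hamming loss: fraction of labels predicted correctly."""
    return 1.0 - hamming_loss(y_true, y_pred)


def exact_match_accuracy(y_true, y_pred):
    """Fraction of rows where every label is correct (no partial credit)."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


# Toy rows: [Commenting, Ogling, Groping], made up for illustration.
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]]

print(hamming_loss(y_true, y_pred))          # 0.25
print(hamming_score(y_true, y_pred))         # 0.75
print(exact_match_accuracy(y_true, y_pred))  # 0.5
```

The second row gets two of its three labels right, so hamming score rewards it partially, whereas exact-match accuracy counts the whole row as wrong.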

4 — Exploratory Data Analysis: EDA is the backbone of a well-trained, well-performing model. This part helps us gain insights into the data, such as class imbalance.

  • Word Cloud: It is an image composed of the words used in a particular text or subject, in which the size of each word indicates its frequency or importance. So, the more often a specific word appears in our text, the bigger and bolder it appears in our word cloud.
Code for Word Cloud
Word Cloud
  • Number of Label Associations: For a more detailed analysis of the data, we plot the number of classes associated with each event.
Code for Label Association
Label Associations

In the plot above, Zero, One, Two and Three describe how many kinds of harassment took place. For example, a label count of two means that two of touching, staring and commenting took place. Zero means a neutral event, i.e. none of touching, staring or commenting took place.
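The counting behind such a plot can be sketched as follows. This is a minimal illustration rather than the original code: it assumes each row is a dict keyed by the three label column names listed in the data section, and the toy rows are made up.

```python
from collections import Counter

# Label column names as listed in the data description.
LABEL_COLUMNS = [
    "Commenting",
    "Ogling/Facial Expressions/Staring",
    "Touching /Groping",
]


def label_association_counts(rows):
    """Map 'number of labels set on a row' -> 'how many rows have that number'."""
    return Counter(sum(row[col] for col in LABEL_COLUMNS) for row in rows)


def make_row(commenting, ogling, groping):
    """Hypothetical helper that builds one toy row."""
    return dict(zip(LABEL_COLUMNS, (commenting, ogling, groping)))


toy_rows = [
    make_row(1, 0, 0),  # one label
    make_row(0, 0, 0),  # zero labels (neutral)
    make_row(1, 1, 0),  # two labels
    make_row(1, 1, 1),  # three labels
    make_row(0, 0, 1),  # one label
]

counts = label_association_counts(toy_rows)
for n_labels in range(4):
    print(n_labels, counts[n_labels])
```

The resulting counts for 0, 1, 2 and 3 labels are what a bar chart of label associations would display.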

  • Label Count: Here we plot the count of every class/label in each data part (Train, Dev and Test) to see whether the data is imbalanced.
Code for Label Count
Label Count
  • Getting the list of frequent and rare words: Next we get the lists of frequent and rare words for each data part (Train, Dev and Test) and each class/act. This helps us plan our preprocessing.
Code for frequent and rare words
Frequent and Rare words
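One simple way to obtain such lists is to count tokens with `collections.Counter`. The sketch below is an assumption about the approach, not the original code: the tokenizer regex, the definition of "rare" as appearing exactly once, and the toy descriptions are all illustrative choices.

```python
import re
from collections import Counter


def word_counts(descriptions):
    """Count lower-cased word tokens across a list of descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def frequent_and_rare(descriptions, top_n=2):
    """Return the top_n most frequent words and the words seen exactly once."""
    counts = word_counts(descriptions)
    frequent = [word for word, _ in counts.most_common(top_n)]
    rare = sorted(word for word, count in counts.items() if count == 1)
    return frequent, rare


# Toy descriptions, made up for illustration.
toy_descriptions = [
    "He was staring at me on the bus",
    "He kept staring and commenting on the bus",
    "Someone was commenting loudly near the bus stop",
]

frequent, rare = frequent_and_rare(toy_descriptions)
print(frequent)
print(rare)
```

On this toy input the most frequent words are "the" and "bus", which hints at why such lists are useful when planning preprocessing: very frequent words often carry little information about the class.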

5 — Data Preprocessing: This is one of the most important parts of any machine learning or deep learning project. If the data is not preprocessed well, all the effort spent on feature engineering and modeling goes in vain.

  • Removing Duplicate data points: Duplicates are an extreme case of nonrandom sampling, and they bias our fitted model. Including them will essentially lead to the model overfitting this subset of data points.
Code for duplicate data point removal
  • De-contraction: Contractions are shortened words that we write with an apostrophe. Since we want to standardize our text, it makes sense to expand these contractions. For example, “aren’t” will become “are not”.
Code for De-contraction
  • Handle special characters, numbers and lower case: This step is essential because special characters and numbers can add noise to the data, which can adversely affect the performance of the machine learning model. So, we remove all special characters and put “#” in place of numbers, as many embedding techniques do. Lower-casing converts capitalized words to lower case to avoid unnecessary complexity.
Code for handling numbers, special characters and lower case
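The three preprocessing steps above can be sketched with the standard library alone. This is a hedged illustration, not the original code: the contraction list is a small hand-picked subset, replacing every digit with one “#” is an assumed reading of the number handling, and all function names are hypothetical.

```python
import re

# Small illustrative subset of contraction rules; order matters,
# so the specific cases come before the generic "n't".
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]


def drop_duplicates(texts):
    """Remove exact duplicate texts while keeping the first occurrence."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def decontract(text):
    """Expand common English contractions, e.g. "aren't" -> "are not"."""
    text = text.replace("’", "'")  # normalise curly apostrophes first
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


def clean(text):
    """De-contract, lower-case, mask digits with '#', drop special characters."""
    text = decontract(text).lower()
    text = re.sub(r"\d", "#", text)          # assumption: one '#' per digit
    text = re.sub(r"[^a-z#\s]", " ", text)   # remove remaining special characters
    return re.sub(r"\s+", " ", text).strip()


print(clean("They aren’t here, it was 10 PM!"))
# they are not here it was ## pm
```

De-contraction runs before special characters are removed because the apostrophe is what marks a contraction.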

6 — Vectorization and Feature Extraction: ML and DL algorithms do not work with raw text data. To overcome this, we convert the text into a vector representation; this is called feature extraction or feature encoding.

  • Bag of Words: Bag of words is a representation of text that describes the occurrence of words within a document. It is called a bag of words because any information about the order or structure of words is discarded. If the corpus is large, it produces very high-dimensional vectors in which most values are zero (sparse), which requires more memory and computation.
  • Term Frequency: It is a scoring of the frequency of the word in the current document.
Term Frequency Formula
  • Inverse Document Frequency: It is a scoring of how rare the word is across documents.
  • TFIDF: It rescales the frequency of a word by how often it appears across documents. Formula: TFIDF = TF × IDF, where TF is Term Frequency and IDF is Inverse Document Frequency. Note: TFIDF gives larger values to less frequent words. It is high when both IDF and TF are high, i.e. the word is rare across the corpus but frequent within a document.
TFIDF Formula
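The ideas above can be sketched in a few lines of plain Python. This uses one common textbook definition, assumed here: TF as the word count divided by the document length, and IDF as log(N / number of documents containing the word). Libraries often use smoothed variants, so their numbers will differ. The toy documents are made up.

```python
import math
from collections import Counter


def tokenize(doc):
    return doc.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in docs for word in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def tf(word, doc):
    """Term frequency: share of the document's tokens equal to `word`."""
    tokens = tokenize(doc)
    return tokens.count(word) / len(tokens)


def idf(word, docs):
    """Inverse document frequency: log(N / documents containing `word`)."""
    n_containing = sum(1 for doc in docs if word in tokenize(doc))
    if n_containing == 0:
        return 0.0  # assumed convention for words absent from the corpus
    return math.log(len(docs) / n_containing)


def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


docs = ["he was staring", "he was commenting", "he followed me"]

vocab, vectors = bag_of_words(docs)
print(vocab)       # ['commenting', 'followed', 'he', 'me', 'staring', 'was']
print(vectors[0])  # [0, 0, 1, 0, 1, 1]
print(tfidf("he", docs[0], docs))       # 0.0, since "he" is in every document
print(tfidf("staring", docs[0], docs))  # highest in docs[0], rare in the corpus
```

The word that appears in every document gets a TFIDF of zero, while the word that is unique to one document gets the highest score there.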
  • Word2Vec: It is a technique for NLP in which the algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec represents each distinct word with a particular list of numbers called a vector. The vectors are learned so that a simple mathematical function, cosine similarity, indicates the level of semantic similarity between the corresponding words.
Code for Word2Vec
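Training Word2Vec needs an external library and a corpus, so the sketch below only illustrates the cosine similarity that is used to compare word vectors once they exist. The three-dimensional vectors are hand-written and hypothetical; real Word2Vec vectors are learned and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, NOT output of a trained model.
toy_vectors = {
    "stare": [0.9, 0.1, 0.0],
    "ogle": [0.8, 0.2, 0.1],
    "bus": [0.0, 0.1, 0.9],
}

similar = cosine_similarity(toy_vectors["stare"], toy_vectors["ogle"])
different = cosine_similarity(toy_vectors["stare"], toy_vectors["bus"])
print(similar > different)  # True
```

Vectors pointing in nearly the same direction score close to 1, and unrelated directions score close to 0.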
  • Fast Text: Fast Text is another word embedding method that is an extension of the word2vec model. Instead of learning vectors for words directly, Fast Text represents each word as a bag of character n-grams. This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes. Once the word has been represented using character n-grams, a skip-gram model is trained to learn the embeddings.
Code for Fast Text
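To show what representing a word by character n-grams looks like, here is a small standalone sketch. It only generates the n-grams and does not train any embedding; wrapping the word in "<" and ">" boundary markers follows the scheme described for Fast Text, and the function name and default lengths are assumptions made for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because prefixes and suffixes such as "<wh" or "re>" get their own n-grams, words that share them also share part of their representation, including words never seen during training.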
  • BERT: We will also use the state-of-the-art BERT as a vectorization technique for this problem. It is a neural network-based technique for natural language processing pre-training. In plain English, it can be used to capture the context of words.
Code for BERT
  • Feature Extraction: In another experiment we use a simple autoencoder to extract features. To understand it properly, let’s break it into parts:
  • Vectorizing: As discussed above, vectorizing is important, so in this part we vectorize, pad and truncate the text data.
Code for Vectorizing
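Vectorizing, padding and truncating can be sketched without any deep learning library. The version below is an assumption about the general approach, not the original code: it maps words to integer ids, reserves 0 for padding and 1 for unknown words, and pads or truncates at the end of the sequence. All names are hypothetical.

```python
PAD, UNK = "<pad>", "<unk>"


def build_vocab(texts):
    """Assign an integer id to every word, reserving 0 for PAD and 1 for UNK."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of ids (truncate, then pad at the end)."""
    ids = [vocab.get(word, vocab[UNK]) for word in text.split()]
    ids = ids[:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


train_texts = ["he was staring", "he was commenting loudly"]
vocab = build_vocab(train_texts)

print(encode("he was staring", vocab, max_len=4))                  # [2, 3, 4, 0]
print(encode("he was commenting loudly today", vocab, max_len=4))  # [2, 3, 5, 6]
print(encode("someone was staring", vocab, max_len=4))             # [1, 3, 4, 0]
```

Fixed-length id sequences like these are what a neural model such as an autoencoder expects as input.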

Modeling: In this part we define the autoencoder architecture and train it.

Auto-Encoder Model

7 — Experimentation with different models: Now we experiment with ML algorithms using the different vectorization techniques discussed above. To avoid writing the same code again and again, I wrote a function named “auto_models” that takes the following inputs:

  • Train Data: The preprocessed train data (sometimes raw data, to experiment with).
  • Dev Data: The preprocessed dev data (sometimes raw data, to experiment with).
  • Test Data: The preprocessed test data (sometimes raw data, to experiment with).
  • Vectorizer Mode: A string naming the vectorization technique we want to use. Available Vectorizer Modes: {None, BOW, TFIDF, W2V, BERT, Fast, DL}
  • Model Mode: A string naming the model we want to use. Available Model Modes: {None, KNN, LGR, RF}
auto_models function

8 — Results: I trained several models and recorded their hamming scores.


9 — End-to-End pipeline: Now we take the best-performing model with its respective vectorization technique.

Pipeline

10 — Sample Output:

Sample Output

11 — Deployment: For deployment I have used AWS and Streamlit. Some sample outputs:

Sample Output
Sample Output

13 — Future Work:

  • Character-level embeddings can be used in place of word-level embeddings with appropriate architectures.
  • Use LSTM or Bi-directional LSTM.

14 — References:

For complete code GitHub:

For my detailed Career history/background follow:



Ismail Siddiqui

Machine Learning Engineer at AppyHigh. I have phenomenal problem-solving and Machine Learning skills. Seeking to do the impossible tasks that no one else can do.