Automatic Classification of Sexual Harassment Cases
Analyzing one of the biggest sexual harassment datasets to build a category classifier with an average accuracy of 93%.
Sexual harassment is defined by SAPAC (Sexual Assault Prevention and Awareness Center, University of Michigan) as any kind of unwelcome sexual advances, requests for sexual favors, and other verbal or physical conduct of a sexual nature [1].
It is a severe and pervasive worldwide problem, and it is not a gender-specific phenomenon. Efforts to fire harassers from their jobs, to offer sexual harassment training inside companies, and even to imprison or lynch these victimizers may provide an air of justice, but they do little to change society’s culture, and the relentless feeling of insecurity and shame persists.
Before I use any technical jargon about building the category classifier, I want to invite every victim not to be ashamed of their story, because it will inspire others. As the Swiss psychiatrist and psychoanalyst Carl Jung once said,
“I am not what happened to me, I am what I choose to become” — Carl Jung.
You cannot change what has happened, but you can refuse to be reduced by it. Also, if you witness any kind of sexual abuse, whether verbal or physical, denounce it. Silence is one of the main issues.
In the following, I will cover our data set and how we produced the best prediction models.
Working with one of the biggest datasets for sexual harassment
In our case, the data was provided by Safecity India, a platform launched in 2012 that crowdsources personal stories of sexual harassment and abuse in public spaces [2]. They have collected over 10,000 stories from over 50 cities in India, Kenya, Cameroon, and Nepal.
More specifically, they provided us a .csv file with 12,122 sexual harassment cases in a table with 9 columns, which as a pandas DataFrame looks like this:
In addition to the focal tasks of this project, and as part of the NLP channel, we decided to automate the category classification based on the sexual harassment case descriptions. Performing this classification task manually is time-consuming, and leaving it entirely in the hands of the victim could introduce ambiguity in how the categories are distinguished.
To make future expansions of the dataset more consistent, we needed to create a model that is able to learn the patterns in the case descriptions and, lastly, predict the categories of each case.
Data preparation
First, we adapted the dataset to feed a non-symbolic AI model. The only features our model needs are the description and the category label, serving as the input and output respectively. However, these two features are strings, and one specific case can fall into more than one category, so we vectorized them (into numeric values) for our model to understand and learn their distributions.
I extracted every unique category and extended the category column into its boolean representation, converting the problem into a multilabel classification task. The difference between multilabel and multiclass classification is that multiclass classification assumes each sample is assigned to exactly one label: a fruit can be either an apple or a pear, but not both at the same time.
Multilabel classification assigns to each sample a set of target labels.
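As a side note, the same kind of boolean expansion we build by hand below can also be obtained with scikit-learn’s MultiLabelBinarizer; here is a minimal sketch with made-up labels (not the actual Safecity data):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy example: each case can carry several category labels at once
samples = [["Stalking", "Commenting"], ["Catcalls/Whistles"], ["Stalking"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(samples)

print(mlb.classes_)  # ['Catcalls/Whistles' 'Commenting' 'Stalking']
print(y)             # one row per case, one 0/1 column per category
```

Each row of `y` is the binary mask of the categories that apply to that sample.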
The following code shows how to extract the unique categories,
categories = []
for index, row in report_1.iterrows():
    cat_group_list = [i for i in row.CATEGORY.split(',')]
    del(cat_group_list[-1])  # last item is a comma
    for category in cat_group_list:
        category = [category.lstrip()]
        if category not in categories:
            categories.append(category)

category_list = [cat[0] for cat in categories]
Following this approach, we found 14 unique categories:
Touching /Groping, Catcalls/Whistles, Sexual Invites, Stalking, Others, Commenting, Rape / Sexual Assault, North East India Report, Indecent exposure, Chain Snatching, Ogling/Facial Expressions/Staring, Taking pictures, Poor / No Street Lighting and Online Harassment.
Next, I iterated over the dataset to transform each case category feature into 14 new non-exclusive boolean features where a 1 means that a specific category was present in the sexual harassment case.
import pandas as pd

categories_bool = [[]]
category_bool = [0] * len(categories)
for index, row in report_1.iterrows():
    row_category = row['CATEGORY'].split(',')
    for category in row_category:
        index_match = list(filter(lambda x: categories[x][0] == category.lstrip(),
                                  range(len(categories))))
        if len(index_match) == 1:
            category_bool[index_match[0]] = 1
    categories_bool.append(category_bool)
    category_bool = [0] * len(categories)
del(categories_bool[0])

# Extend the column CATEGORY by its boolean representation
df_categories = pd.DataFrame.from_records(categories_bool)
df_categories.columns = category_list
df_ready = pd.concat([report_1, df_categories], axis=1)
Figure 2 shows a sample of the extended version of the category column. Only the first 8 categories are displayed to fit this article’s margins. The rows do not have the same index as those in Figure 1 (head).
Familiarizing with the data
Now that we have the output column in a machine-friendly format, it’s time to vectorize the description column (input) and to perform some EDA (Exploratory Data Analysis).
For this purpose, I built one .txt file for each sexual harassment category, collecting all the case descriptions for that particular category. Initially, I added tokenization, lemmatization, and stemming to the pipeline, but the overall results of the final models were no better than directly applying the Count or Tf-idf vectorizers.
You can find the complete analysis in my notebook and the function to build the collections [3].
Some of the case descriptions contain words that are not in English. To normalize the data, we defined a regex pattern and filtered the data with it. Nevertheless, I kept the non-English words for the EDA to avoid missing information. On the other hand, the dataset to be fed to the ML model does need to be normalized. Additionally, it’s better to lowercase the whole corpus and to remove stopwords like articles, some adverbs, and connectors, because they don’t carry any conceptual meaning worth considering. One very direct way to achieve this is by means of the Natural Language Toolkit (NLTK).
import re
import nltk
from nltk.corpus import stopwords as sw
from tqdm import tqdm

content = category.read().lower()
pattern_wd_eng = r'[A-Za-z]+'
tokens = re.findall(pattern_wd_eng, content)
no_stops = [t for t in tqdm(tokens) if t not in sw.words('english')]
Now, there are two main ways to vectorize the collections using scikit-learn feature_extraction tools for text: CountVectorizer and TfidfVectorizer.
CountVectorizer just counts word frequencies. As simple as that. With TfidfVectorizer, on the other hand, the value increases proportionally to the count but is offset by the frequency of the word across the corpus. This is the IDF (inverse document frequency) part, and it helps adjust for the fact that some words appear more frequently in general.
Without IDF, less meaningful words like ‘girl’ or ‘boy’ (the most common words across the category collections) would carry a higher weight than words that are more exclusive and relevant to each category.
Inverse document frequency is defined as:

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N is the total number of documents in the corpus, and the term in the denominator is the number of documents in which the term t appears.
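A tiny worked example of this formula, using a made-up four-document corpus (not the Safecity data), shows why common words get lower weights:

```python
import math

# Hypothetical toy corpus of 4 short "case descriptions"
docs = [
    "he followed me home",
    "he kept staring",
    "a man followed her",
    "he touched her arm",
]

def idf(term, docs):
    # number of documents in which the term appears
    n_containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / n_containing)

print(idf("followed", docs))  # log(4/2) ~ 0.693: rarer word, higher weight
print(idf("he", docs))        # log(4/3) ~ 0.288: common word, lower weight
```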
Today, people nearly always use the TfidfVectorizer.
I also used the TfidfTransformer (scikit-learn) and TfidfModel (gensim) modules in different steps of the analysis and pipeline definition. The main differences between these modules are as follows:
With Tfidftransformer you systematically compute word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the Tf-idf scores.
With Tfidfvectorizer, on the contrary, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and Tf-idf scores all using the same dataset [4].
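With default settings, the two routes should produce the same matrix; a minimal sketch on a toy corpus (made-up sentences, not the actual data):

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ["he followed me", "he kept staring at me", "a man followed her"]

# Two-step route: word counts first, then IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: counts, IDF values, and Tf-idf scores at once
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```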
TfidfModel returns a TransformedCorpus (a Tf-idf corpus) when bow (bag of words) is a corpus. So now let’s create the model to extract the most significant words of each bow (collection).
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel

def tf_idf_model(category_tokens):
    # Build a dictionary with all the unique tokens, merging all categories
    dictionary = Dictionary(category_tokens)  # after tokenization
    corpus = [dictionary.doc2bow(collection) for collection in category_tokens]
    # Create a new TfidfModel using the corpus: tfidf
    tfidf = TfidfModel(corpus)
    for index, tokens_and_cnt in enumerate(corpus):
        print('\n', fileNames[index])
        # Calculate the tfidf weights of the doc: tfidf_weights
        tfidf_weights = tfidf[tokens_and_cnt]
        # Sort the weights from highest to lowest: sorted_tfidf_weights
        sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
        # Print the top 20 weighted words
        for term_id, weight in sorted_tfidf_weights[:20]:
            print(dictionary.get(term_id), weight)
    return corpus, dictionary, tfidf

# category_tokens is the corpora of all tokens for all the collections
corpus, dictionary, tfidf = tf_idf_model(category_tokens)
In order to visualize the most frequent terms of each collection, I decided to include a word cloud functionality using the wordcloud library.
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

def WordCloud_gen(corpus=corpus, corpus_index=0, dictionary=dictionary, tfidf=tfidf):
    stopwords_eng = sw.words("english")
    tfidf_weights = tfidf[corpus[corpus_index]]
    weights = [(dictionary[pair[0]], pair[1]) for pair in tfidf_weights]
    alice_mask = np.array(Image.open('silhouette.png'))
    wc = WordCloud(background_color="white", max_words=400,
                   mask=alice_mask, stopwords=stopwords_eng)
    wc.generate_from_frequencies(dict(weights))

    plt.figure(figsize=(8, 6), dpi=120)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    return wc
For example, the word cloud for the ‘Stalking’ category is generated with:

WordCloud_gen(corpus_index=3)
Let’s build the model
There are several supervised ML approaches for NLP multilabel classification tasks, and there is no universal model to apply. I had to test different models and tune them to reach an accuracy above 90% on average.
Good introductory articles about Natural Language Processing for text classification using Naive Bayes, SVM, and Logistic Regression models can be found in [5] and [6]. These models are going to be built using a pipeline and a OneVsRest strategy.
Scikit-learn provides a pipeline utility to help automate machine learning workflows. There is usually a lot of data to manipulate and many transformations to apply, so I leveraged pipelines to train every classifier that the OneVsRest strategy implies.
The OneVsRest strategy trains N binary classifiers, one per class, treating that class as positive and the rest as negative. In other words, it is a multi-label algorithm that accepts a binary mask over multiple labels. The result of each prediction is an array of 0s and 1s marking which class labels apply to each input sample. In document classification, One-Vs-Rest trains, in our case, 14 different classifiers, one per sexual harassment category, converting this problem into a set of binary classification problems.
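The strategy can be seen directly in scikit-learn: OneVsRestClassifier fits one binary estimator per label. A minimal sketch on toy multilabel data (made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multilabel data: 6 samples, 3 labels (binary mask per sample)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0],
              [0, 1, 0], [0, 0, 1], [0, 0, 1]])

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(len(clf.estimators_))  # 3: one binary classifier per label
```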
Being Naive
The Naive Bayes model supports OneVsRest. Let’s start by dropping the missing values in the description column and splitting our data into train and test sets, using the train_test_split function from scikit-learn.
report_1 = report_1[pd.notna(report_1['DESCRIPTION'])]
x_report_1 = report_1[report_1.columns[4]]
y_report_1 = report_1[report_1.columns[14:28]]
X_train, X_test, y_train, y_test = train_test_split(x_report_1, y_report_1, test_size=0.2, random_state=17)
categories = y_train.columns
You can read more about how naive bayes work here.
The classifier that I trained is a multinomial Naive Bayes classifier. And the pipeline definition is the following:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

NB_pipeline_CountV = Pipeline([
    ('count', CountVectorizer(stop_words='english')),
    ('clf', OneVsRestClassifier(MultinomialNB())),
])

NB_pipeline_TfidfV = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', OneVsRestClassifier(MultinomialNB())),
])
And a helper function to execute them:
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support

def execute(pipeline, X_train=X_train, X_test=X_test, confusion_matrix=False, verbose=True):
    accuracies = []
    for category in categories:
        pipeline.fit(X_train, y_train[category])
        prediction = pipeline.predict(X_test)
        if len(X_test) == 1:  # for test strings
            if verbose:
                print('Prediction for {} is {}'.format(category, prediction))
        else:
            accuracy = metrics.accuracy_score(y_test[category], prediction)
            accuracies.append(accuracy)
            if verbose:
                print('Test accuracy for {} is {}'.format(category, accuracy))
                print('precision_recall_fscore_support_weighted',
                      precision_recall_fscore_support(y_test[category], prediction,
                                                      average='weighted'))
            if confusion_matrix:
                print(metrics.confusion_matrix(y_test[category], prediction, labels=[0, 1]))
    if verbose and len(accuracies) != 0:
        print('mean: ', sum(accuracies) / len(accuracies))
    return accuracies
MultinomialNB has a tuning parameter called alpha, which controls Laplacian smoothing. Any time you use counts to estimate parameters, which can lead to zero values, you should use smoothing techniques. The goal is to raise zero probability values to a small positive number (and correspondingly reduce the other values so that the sum is still 1). More about how Laplacian smoothing works here.
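The effect of the smoothing can be sketched with made-up counts (the numbers below are purely illustrative):

```python
# Laplace (additive) smoothing: (count + alpha) / (total + alpha * vocab_size)
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    return (count + alpha) / (total + alpha * vocab_size)

# An unseen word: the raw estimate would be 0/10, i.e. probability zero
print(smoothed_prob(0, 10, 5))  # 1/15, a small positive number instead of 0
# A seen word: slightly discounted so the probabilities still sum to 1
print(smoothed_prob(4, 10, 5))  # 5/15 instead of 4/10
```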
Tuning alpha only makes sense for the Tf-idf model, due to its mathematical formulation. Nevertheless, the accuracy improvement of the classifiers is not very significant, as shown in Figure 4.
Playing with kernels
I will not get into too much detail about how Support Vector Machines work and how to apply hyperparameter tuning to it because I found a great article explaining it step by step on [7].
The Support Vector Classifier tries to find the best hyperplane to separate the different classes by maximizing the distance between the sample points and the hyperplane.
The hyperparameters that we can tune for SVMs include the kernel type, gamma, C, and the degree of the polynomial used to find the hyperplane to split the data when the kernel is set to ‘poly’.
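One common way to search over these hyperparameters is scikit-learn’s GridSearchCV; here is a hedged sketch on synthetic data, with an illustrative grid rather than the exact one used in the notebook:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for the vectorized case descriptions
X, y = make_classification(n_samples=200, n_features=10, random_state=17)

param_grid = {'kernel': ['linear', 'rbf'],
              'C': [0.1, 1.0, 10.0],
              'gamma': ['scale', 'auto']}

# 3-fold cross-validated search over all combinations
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # best combination found by cross-validation
```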
You can read more about how SVMs work here.
Check my notebook in [3] if you want to see the whole hyperparameter tuning process. The pipelines for the base and best models (after tuning) are the following:
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer

# SVM base and after hyperparameter tuning
SVM_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
acc_SVM = execute(SVM_pipeline, verbose=True)

SVM_pipeline_hyper = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(svm.SVC(kernel='linear', gamma='auto', C=1.0))),
])
acc_SVM_hyper = execute(SVM_pipeline_hyper, verbose=True)
Accuracy results can be further seen in the models’ comparison chart in the last section of this article.
Logistic Regression
The logistic regression classifier is a linear classifier that uses the calculated logits (scores) to predict the target class.
from sklearn.linear_model import LogisticRegression

logit_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),
])
acc_logit = execute(logit_pipeline, verbose=True)
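The logit-to-probability step mentioned above can be sketched in a few lines: logistic regression turns a linear score into a probability with the sigmoid function.

```python
import math

# The sigmoid squashes a logit (linear score) into a probability in (0, 1)
def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))

print(sigmoid(0.0))          # 0.5: exactly on the decision boundary
print(sigmoid(3.0) > 0.5)    # True: positive logit -> predicted class 1
print(sigmoid(-3.0) > 0.5)   # False: negative logit -> predicted class 0
```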
You can find more about how logistic regression works here and here. Accuracy results can be further seen in the models’ comparison chart of the next section of this article.
Model Comparison
So after hyperparameter tuning, SVM is the best model so far, with an average accuracy of ~93%.
There are many more algorithms for multilabel classification, like decision trees or random forests, but the results so far are good enough to trust the model to predict the correct class or classes of sexual harassment cases.
Testing the best model
Now that the model is trained and we have obtained a decent accuracy on average, let’s try out a sexual harassment description which is not part of our dataset, and compare its output label to the category that a human would give it. The following test string was extracted from social media:
test_string = ["""He started rubbing my shoulders, telling me I looked stressed. Then he went down in my shirt. He walked around the living room area and he came back. That is when he touched my breast. Then he grabbed my waist of my pants and also grabbed my hair as I tried to leave. I made it to the door and left. It happened so quickly. I was just trying to get out of the townhouse"""]

acc = execute(SVM_pipeline_hyper, X_test=test_string)
I leave it to you to judge which category this case should belong to. This is the output of the execution:
My experience at Omdena
It has been a really enriching journey, both intellectually and interpersonally, and Omdena has continuously elevated me to my best in both respects. Thank you to every single person involved with the community. These are the kinds of actions that make me feel proud and hopeful for humanity.
If you want to be part of the #AIforGood movement, join Omdena’s global community of changemakers.
If you want to receive updates on our AI Challenges, get expert interviews, and practical tips to boost your AI skills, subscribe to our monthly newsletter.
We are also on LinkedIn, Instagram, Facebook, and Twitter.
___________________________________________________________________
References
[1] https://sapac.umich.edu/article/what-sexual-harassment
[2] https://safecity.in/about/
[3] https://colab.research.google.com/drive/1MeMYFa-3UpOv79O1cpE6y0YZC9uGrag9#scrollTo=39tlYqzp0jXS
[4] https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XY_aMEYzbIU
[6] https://towardsdatascience.com/multi-label-text-classification-with-scikit-learn-30714b7819c5
[7] https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769