Exploratory Data Analysis of the World’s Biggest Sexual Harassment Database

What kinds of incidents are the most common? When and where do they occur? Preparing the data for machine learning.

Phenyo Phemelo Moletsane
Omdena
8 min read · Oct 12, 2019


I am a student based in South Africa. I am currently doing my Master’s degree in Big Data Science at the University of Pretoria. I am interested in using data science to solve social issues.

My first challenge with Omdena

After my application and interview to join the Omdena community, I was invited to be part of the SafeCity Challenge, whose goal was to create automated safety tips and heat sensitivity algorithms to prevent sexual harassment. During the interview, I was asked which challenge I wanted to join. My response was: "I want to take part in the SafeCity challenge. As a woman, the issue being addressed is very close to my heart and I want to contribute to the solution."

As a beginner in data science, being part of Omdena was an important career milestone. I have always been passionate about social change, and joining Omdena was a great way to make a positive impact while improving my technical skills.

The most exciting part of my experience was working with people from diverse backgrounds and different parts of the world. The diversity in the community created a collaborative culture that offered unique ideas, perspectives, and a wide range of solutions, and we all learned from each other.

I took up a role as a task leader in the cleaning and exploration of the SafeCity sexual harassment reports data. Most of the cases reported come from India. As a result, the analysis that follows is focused on India and does not claim to offer an accurate picture of the entire world.

My role involved coordinating the task, giving feedback on a weekly basis, making sure that communication flowed smoothly among the task participants, and summarizing any results obtained. My experience in the challenge has been both challenging and fulfilling, with many lessons learned.

This blog post covers data exploration and visualization to extract useful insights after performing pre-processing. If you want to see how the first part of the pre-processing was done, you can check out this post:

Data exploration is a vital step in any data science project and should be performed prior to the machine learning phase. It is useful for extracting insights and understanding data quickly.

The data is provided by SafeCity, a platform that aims to make cities safer by using crowdsourced data of personal stories of sexual harassment and abuse in public spaces. First, we import the required package and read the cleaned data, which is a CSV file. To get an overview of the data, let's view its first few rows.

import pandas as pd

# Read the cleaned SafeCity reports
df = pd.read_csv('safecity.csv')
df.head()
First few rows of the data set

The metadata is defined below:

  • incident_no: Unique ID for each row
  • year: Year when the user reported the incident.
  • month: Month when the user reported the incident.
  • dayofweek: Day of the week when the user reported the incident.
  • hour: Hour when the incident was reported.
  • description: Description of the event reported.
  • Categories of sexual harassment: one indicator column per category, namely touching/groping, catcalls/whistles, sexual invites, stalking, others, commenting, rape/sexual assault, chain snatching, ogling/staring, indecent exposure, taking pictures, poor/no street lighting, and northeast India report.
  • country: The country in which the incident occurred.
  • latitude: Latitude coordinate of the location where the incident occurred.
  • longitude: Longitude coordinate of the location where the incident occurred.

Distribution of sexual harassment reports by category

Let's have a look at the distribution of harassment reports by category. Which category is most common?

import matplotlib.pyplot as plt
import seaborn as sns

# Set tick label sizes before plotting so they apply to this figure
plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)

# Sum the category indicator columns to count reports per category
sums = df.sum(axis=0)[9:24]

plt.figure(figsize=(15, 10))
incident_categories = sns.barplot(x=sums.index.to_list(), y=sums.values, alpha=0.8)
incident_categories.set_xlabel("Sexual Harassment Category", fontsize=20)
incident_categories.set_ylabel("No. of Incidents", fontsize=20)
incident_categories.axes.set_title("No. of Incidents reported per Category", fontsize=20)
plt.xticks(rotation=90)
plt.show()

Commenting is the most reported incident, while online harassment, northeast India report, human trafficking, and petty robbery are rarely reported. The catcalling and staring categories also appear to be common. If we have a classification problem where we predict the category of sexual harassment, it is clear that we are going to face imbalanced classes, as the categories are not equally represented. Imbalanced classes are a common problem in machine learning classification, as most algorithms are biased towards the majority class. The most common solution is to resample the data set before fitting a classifier. By exploring our data, we can identify such potential problems and address them before applying machine learning models.
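As a rough illustration (not part of the original pipeline), a minimal resampling sketch could look like the following, assuming the labels have been collapsed into a single, hypothetical category column of a frame called df_labeled. In practice the reports are multi-label, so a per-label strategy or a dedicated library such as imbalanced-learn may be more appropriate.

from sklearn.utils import resample
import pandas as pd

# Hypothetical single-label frame: one 'category' value per report
counts = df_labeled['category'].value_counts()
majority_size = counts.max()

# Upsample every minority class to the size of the largest class
balanced_parts = []
for category, size in counts.items():
    subset = df_labeled[df_labeled['category'] == category]
    if size < majority_size:
        subset = resample(subset, replace=True, n_samples=majority_size, random_state=42)
    balanced_parts.append(subset)

df_balanced = pd.concat(balanced_parts)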

When do sexual crime incidents occur?

To answer this question, we can look at time patterns across several different time scales: months, days, and hours.
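A minimal sketch of how these counts can be visualized with seaborn, using the hour, dayofweek, and month columns described in the metadata above:

import matplotlib.pyplot as plt
import seaborn as sns

# Count incidents at each time scale: hour of day, day of week, and month
fig, axes = plt.subplots(1, 3, figsize=(20, 5))
for ax, column in zip(axes, ['hour', 'dayofweek', 'month']):
    sns.countplot(x=column, data=df, ax=ax)
    ax.set_title("Incidents per {}".format(column))
plt.tight_layout()
plt.show()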

From the above observations, sexual harassment incidents seem to be low between midnight and 7 am. We see a gradual rise from 9 am, peaking around 1 pm, and incidents appear to occur mostly from late morning through to the night. Friday has the highest incident count, with about 2,300 incidents, followed by Saturday; the lowest count is seen on Monday. September saw the highest number of sexual harassment cases, with the lowest numbers in July and March.

It is important to note that the above observations are based on raw counts, so we cannot conclude that this pattern will always hold.

Where do sexual harassment incidents occur?

With the help of Python's Folium library, we are going to create an interactive map that visualizes geospatial data. For this task, we use the longitude and latitude variables. The code below creates a map containing markers based on the resulting data set. We define the default location that will serve as the central location. Another important parameter is zoom_start, which controls the initial magnification of the map; I set it to 4 to get a good display. The map tile used here is OpenStreetMap (the default tile). We then add circle markers to the map, with pop-ups detailing the latitude, longitude, and harassment category. We can click on these markers to get more information about the incident in the area of interest.

import folium

# Set the starting (central) coordinates and create an empty map
start_coord = [0, 0]
maps = folium.Map(location=start_coord, tiles='OpenStreetMap', zoom_start=4)

# Use the first 5,000 reports for faster processing
df_heatmap = df.iloc[0:5000, :]

# Add a red circle marker with an informative pop-up for each report
for i in range(len(df_heatmap)):
    folium.CircleMarker(
        location=[df_heatmap.iloc[i]['latitude'], df_heatmap.iloc[i]['longitude']],
        popup="Latitude: {}<br>Longitude: {}<br>Harassment category: {}".format(
            df_heatmap.iloc[i]['latitude'],
            df_heatmap.iloc[i]['longitude'],
            df_heatmap.iloc[i]['category']),
        color='red',
        fill=True
    ).add_to(maps)

We can save the map as an HTML webpage.

maps.save('safecity_map.html')

The map looks like this:

From the map visualization above, it can be seen that there is a high concentration of sexual harassment reports in South Asia, mostly in India. Regions like Australia and South America have very few sexual harassment reports.

Which words appear the most in the descriptions of the incidents?

Let's visualize the descriptions of the reports with a word cloud. Since the description data is text, we first perform some pre-processing to remove noise: we remove punctuation, special characters, numbers, and stopwords, and lemmatize the descriptions.

  1. Remove punctuation and special characters:
# Keep only letters (and '#'), replacing everything else with a space
df['description'] = df['description'].str.replace("[^a-zA-Z#]", " ", regex=True)

2. Remove stopwords

Stopwords include the most common words such as ‘the’, ‘to’, ‘and’. These words create noise and are not important when performing modeling.

from nltk.corpus import stopwords

# nltk.download('stopwords') may be needed the first time
stop_words = stopwords.words('english')

def remove_stopwords(description):
    desc_new = " ".join([i for i in description if i not in stop_words])
    return desc_new

# Remove stopwords from the descriptions
descriptions = [remove_stopwords(b.split()) for b in df['description']]
# Lower case
descriptions = [r.lower() for r in descriptions]

3. Tokenization and Lemmatization

Tokenization breaks each description string into a list of words, and we then use the spaCy library to lemmatize the tokens. Lemmatization further removes noise from the text by reducing the different forms of a word to a single base form.

import spacy

# 'en_core_web_sm' replaces the deprecated 'en' shortcut in newer spaCy versions
lem = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatization(texts, tags=['NOUN', 'ADJ']):
    results = []
    for sent in texts:
        doc = lem(" ".join(sent))
        results.append([token.lemma_ for token in doc if token.pos_ in tags])
    return results

# Tokenize, lemmatize, and join back into cleaned description strings
tokenized_descriptions = pd.Series(descriptions).apply(lambda x: x.split())
descriptions2 = pd.Series(lemmatization(tokenized_descriptions))
descriptions3 = []
for k in range(len(descriptions2)):
    descriptions3.append(' '.join(descriptions2[k]))

4. Word cloud

Now that our descriptions are clean, let us visualize the keywords with a word cloud. A word cloud highlights the frequency of words in a text. We set the maximum number of words in the cloud to 100.

from wordcloud import WordCloud

max_words = 100
maximum_font = 50

def w_cloud(data, bgcolor):
    plt.figure(figsize=(15, 10))
    # Build the word cloud from the cleaned descriptions
    wc = WordCloud(background_color=bgcolor, max_words=max_words, max_font_size=maximum_font)
    wc.generate(' '.join(data))
    plt.imshow(wc)
    plt.axis('off')
    plt.show()

w_cloud(descriptions3, 'white')

The frequency of words is visualized by size and color. The bigger the size of the word, the higher the frequency. We can see that the words: man, guy, boy, lady, and girl have come up quite frequently in the report descriptions. These words are relevant to sexual harassment as harassment is a gender-related crime.

We also notice harassment-related words such as touching, uncomfortable, private part, and location of incidents such as market, metro station, and station which may form an important part of the incident description.

As we can see, a word cloud provides a quick way to analyze text and depicts the keywords in it. However, a word cloud does not provide any context or meaning, so further analysis is required to extract insights from the text. We could go on to use this data for topic modeling and sentiment analysis, or use it as a predictive variable for sexual harassment category classification.
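For illustration only, a simple topic model could be fitted on the cleaned descriptions with scikit-learn; the vocabulary size and number of topics below are arbitrary assumptions, not choices made in the original analysis:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Bag-of-words representation of the cleaned descriptions
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(descriptions3)

# Fit an LDA topic model; 10 topics is an arbitrary choice for illustration
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

# Show the ten highest-weighted words of each topic
# (get_feature_names_out requires scikit-learn >= 1.0)
words = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-10:]]
    print("Topic {}: {}".format(idx, ", ".join(top_words)))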

Conclusions

Throughout this post, I used the data from SafeCity to extract insights from the sexual harassment reports. After completing the data exploration process, one can start applying machine learning techniques to the data. The data exploration step helps us understand the data better, summarize its characteristics, rapidly identify patterns, detect problems or outliers, and decide how to proceed with the machine learning task at hand.

If you want to be part of the #AIforGood movement, join our global community as an (aspiring) data scientist or AI enthusiast.

If you want to receive updates on our AI Challenges, get expert interviews, and practical tips to boost your AI skills, subscribe to our monthly newsletter.

We are also on LinkedIn, Instagram, Facebook, and Twitter.
