Photo by Mihai Surdu on Unsplash

My Unique Experiences in Omdena’s AI Challenge on Preventing Sexual Harassment in India

Working with a community around the world on using AI to prevent sexual harassment.

Ana Lopez Moreno · 4 min read · Sep 29, 2019

I am Ana Lopez from Colombia, and my passion is AI, ML, data science, and everything around these topics. I love Python and solving social problems with AI.

My mantra is to work together and solve problems that make a positive impact on other people.

Summary of my first challenge

When I heard about Omdena, I was very interested because of their social-impact challenges; for me, that is a big motivation.

I applied to the first challenge but was not selected. However, I kept following Omdena, and when this challenge opened I applied again; after the selection process, I received the good news.

I was over the moon, but I also had mixed feelings, because it was more than a challenge for me: it was a test of my technical knowledge and my communication skills.

My first role in the challenge was ML Engineer, but it changed to Data Engineer, and I really enjoyed that role because it taught me many new things.

I worked with a group of people from different countries, including fellow Colombians living abroad. It is amazing to have the opportunity to meet people from around the world.

The first week was a bit crazy: many new posts to read, everyone participating in the channels, introducing themselves, and proposing action steps.

We got the problem statement for the challenge and had the link to the dataset. I read about the case many times and explored the data.

The AI challenge partner was the award-winning NGO Safecity India, which has one of the biggest datasets on sexual harassment cases. I installed their app to understand how people were reporting the incidents. The challenge's focus was on Mumbai and Delhi.

I focused on data cleaning

I think this is one of the most important but often overlooked tasks in a machine learning project, because you need to understand the data in detail to decide how to use it in the most impactful way.

I built a table with the metadata, and this information was used to identify problems with the dataset (a sketch of this profiling step follows the list):

  • #: Unique sequence ID for each row.
  • INCIDENT TITLE: Title the user gave to the reported incident.
  • INCIDENT DATE: Date on which the user reported the incident.
  • LOCATION: Location where the incident occurred.
  • DESCRIPTION: Description of the event, written by the user.
  • CATEGORY: Category (or categories) the user selected from the options the system offers; users are not limited to a single selection: Online Harassment, Petty Robbery, Stalking, Ogling/Lewd Facial Expressions/Staring, Catcalls/Whistles, Taking pictures without permission, Commenting, Indecent Exposure/Masturbation in public, Sexual Invites, Touching/Groping, Rape/Sexual Assault, Human Trafficking, Others.
  • LATITUDE: Latitude coordinate where the event occurred.
  • LONGITUDE: Longitude coordinate where the event occurred.
  • More Info: Additional information provided by the user.
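
A minimal sketch of that profiling step, assuming the Safecity export is a CSV with the columns above (the file name here is hypothetical):

```python
import pandas as pd

# Load the Safecity export (file name is hypothetical)
df = pd.read_csv("safecity_incidents.csv")

# Summarize each column: type, missing values, and distinct values
metadata = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isna().sum(),
    "unique": df.nunique(),
})
print(metadata)
```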

We had the opportunity to work with Spell, a cloud platform for machine learning that partnered with Omdena. This made the process much easier: I only needed to create a developer account to work in Jupyter. I decided to write the code in Visual Studio Code and created the repository on GitHub. Then I installed the Spell package on my machine to run the code from my console using resources in the cloud. It was great.

Running a job took a single command in the console (something like `spell run "python clean_data.py"`, where the script name is hypothetical), and my code ran in the cloud without problems.

The first problem was missing values. In our challenge, the location of each sexual harassment incident was important, and we had 421 missing values in the LATITUDE and LONGITUDE columns. Dropping this information was not an option, so I searched for APIs to solve this.

I found HERE, whose developer account offered useful options for resolving coordinates from descriptions of places. In our case, the LOCATION column was useful for completing the data, because it contained a description of where the incident happened.
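
To give an idea of the approach, here is a minimal sketch that fills the missing coordinates from the LOCATION text. It assumes HERE's v7 REST geocoding endpoint and a hypothetical API key; the exact endpoint we used at the time may have differed:

```python
import requests

GEOCODE_URL = "https://geocode.search.hereapi.com/v1/geocode"  # HERE geocoding endpoint (v7)
API_KEY = "YOUR_HERE_API_KEY"  # hypothetical key

def geocode(location):
    """Return (lat, lng) for a free-text location, or (None, None) if not found."""
    resp = requests.get(GEOCODE_URL, params={"q": location, "apiKey": API_KEY})
    items = resp.json().get("items", [])
    if not items:
        return None, None
    pos = items[0]["position"]
    return pos["lat"], pos["lng"]

# Fill coordinates only where they are missing, using the LOCATION description
mask = df["LATITUDE"].isna() | df["LONGITUDE"].isna()
df.loc[mask, ["LATITUDE", "LONGITUDE"]] = [
    geocode(loc) for loc in df.loc[mask, "LOCATION"]
]
```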

With the missing values resolved, we found that some descriptions were in Hindi. We decided to use the googletrans package to detect descriptions in languages other than English and translate them into English.
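
A minimal sketch of that step with googletrans, detecting each description's language and translating anything that is not already English:

```python
from googletrans import Translator

translator = Translator()

def to_english(text):
    """Translate a description to English if it is in another language."""
    if not isinstance(text, str) or not text.strip():
        return text
    if translator.detect(text).lang != "en":
        return translator.translate(text, dest="en").text
    return text

df["DESCRIPTION"] = df["DESCRIPTION"].apply(to_english)
```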

Next, duplicates, special characters, and text normalization were handled with the help of pandas, nltk, re, and other packages.
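
As an illustration, a sketch of those cleaning steps (the normalization choices here are mine; stopword removal needs a one-time nltk download):

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download
STOPWORDS = set(stopwords.words("english"))

def normalize(text):
    """Lowercase, strip special characters, and remove English stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop special characters
    return " ".join(t for t in text.split() if t not in STOPWORDS)

df = df.drop_duplicates()  # remove duplicate reports
df["DESCRIPTION"] = df["DESCRIPTION"].fillna("").apply(normalize)
```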

Working as a group, we cleaned the data faster, and we decided to save the dataset in the cloud and make it publicly accessible.

Not the last time I'll participate in Omdena's challenges

I am very happy about this opportunity, and my experience was wonderful: I learned a lot and connected with a group of people with strong skills in different areas. It was enriching for all of us.

We made a great team and I will surely continue participating in Omdena’s challenges. :)

If you want to be part of the #AIforGood movement, join our global community as an (aspiring) data scientist or AI enthusiast.

If you want to receive updates on our AI Challenges, get expert interviews, and practical tips to boost your AI skills, subscribe to our monthly newsletter.

We are also on LinkedIn, Instagram, Facebook, and Twitter.

Ana Lopez Moreno

ML Engineer. I write about my personal experiences working on social projects with people around the world. AI For Good. For the People, By the People.