Combating human trafficking using machine learning: Part 3.

Juanchobanano
7 min read · Jun 11, 2022


Photo from Canva. License terms can be found here.

Hey! Welcome to Part 3 of this series; if you haven’t read the second part, you can read it here. As I mentioned in the previous post, in this article we dig into the problem of feature engineering and transformation of the Canadian escorts dataset, using both the individual advertisement data and its underlying graph structure. Since it might take us a while to extract all the desired features, I will divide the feature engineering process into two articles. In this post we will focus on individual advertisements, and in the following one on exploiting the graph structure of the data.

Let’s begin!

Feature Engineering

My methodology for the feature engineering of the advertisements consists of two parts: first, I compute individual features for each advertisement, and then I use them to characterize the emerging communities of the dataset, which are the instances we want to use for learning to recognize risky communities.

However, there is an important fact to keep in mind. Since our focus is on communities rather than individual advertisements, it is very likely that the final number of instances (communities) will be rather small compared to the total number of ads, so we will most probably have to use a regularization technique to avoid overfitting in our models (note, however, that a community can consist of a single advertisement).

Exploiting individual advertisements

Figure 1. Example advertisement text with highlighted risk factors such as movement control (“new to the town”), vulnerable ethnicity (“Asian”) and usage of third person pronouns (“busty girl”).

As I mentioned in the previous article, most of the research on human trafficking in listing websites (Alvari et al., 2016; Alvari et al., 2017; Nagpal et al., 2017; Giommoni et al., 2021; Snyder et al., 2017) only focuses on learning and predicting on individual advertisements, ignoring the underlying graph structure of the data. However, that graph structure can be used to better understand how these organizations operate, i.e. to identify which phone numbers, external websites, emails and locations are being used to promote their potential victims.

In addition, there is another major advantage of this approach: filling missing values. For example, given a community of ads for which we know the age of the people advertised in half of the posts, we can use the average of those ages to fill in the missing ages of the other half, and we can use similar techniques to fill in other features. Concretely, we can use this approach to get a broader sense of how risky an individual advertisement is, given the features of the rest of the posts that belong to the same community.
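As a minimal sketch of this imputation idea (the column names and the toy data are illustrative, not the actual dataset), each missing age can be filled with the mean age of its community using pandas:

```python
import pandas as pd

# Hypothetical toy frame: each ad belongs to a community and may lack an age.
ads = pd.DataFrame({
    "community_id": [0, 0, 0, 1, 1],
    "age": [22.0, None, 24.0, None, 30.0],
})

# Fill each missing age with the mean age of its own community.
ads["age"] = ads["age"].fillna(
    ads.groupby("community_id")["age"].transform("mean")
)
```

The same `groupby(...).transform(...)` pattern works for other numeric features where a community-level average is a sensible proxy.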

Now, let’s extract some features!

Processing initial features

The original dataset contains the following features: title, text, category, phone number, region, city, place, post date, latitude, longitude, email, external website, ethnicity and age. However, I decided to drop the place, category and post date columns: the place column is empty, the category column only contains one value (“escorts”), and even though the post date might be useful, I won’t focus on temporal data for now.

In addition, I encoded the phone number, region, city, email, external website and ethnicity features using the LabelEncoder object of sklearn. With the exception of the ethnicity feature, I will use these features only for identifying the emerging communities. As for the ethnicity feature, I will use it to determine how many ethnicities are present in the same community.
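For readers unfamiliar with it, LabelEncoder simply maps each distinct value to an integer code (the sample values below are made up; the real columns come from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the ethnicity column.
ethnicities = ["asian", "latina", "asian", "european"]

encoder = LabelEncoder()
codes = encoder.fit_transform(ethnicities)
# encoder.classes_ holds the categories sorted alphabetically,
# so each value's code is its position in that sorted list.
```

Equal strings get equal codes, which is all we need for matching ads into communities and for counting distinct ethnicities per community.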

Now, I define four functions that will help us extract features from the texts. The purpose of these functions is to find bigrams, words and phrases of interest.

Third person pronouns

This is one of the riskiest factors according to prosecutors, because it gives a sense that the person advertised is being controlled by a third party. Hence, our interest is to identify third-person pronouns such as “she”, “her”, “hers” and “herself”. Additionally, we also seek to find bigrams such as “new girl”, “sexy chicks”, among others. Notice that we consider a bigram window because sometimes we might find patterns like “sexy asian chicks”, so the bigram window allows us to keep identifying “sexy chicks” even though there is a word in the middle.
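The post links the actual functions in the notebook; here is a minimal sketch of how this matching could look, with an illustrative pronoun set and bigram list. The `window` parameter implements the bigram window described above:

```python
import re

THIRD_PERSON = {"she", "her", "hers", "herself"}
BIGRAMS = [("new", "girl"), ("sexy", "chicks")]

def count_pronouns(text, pronouns=THIRD_PERSON):
    """Count occurrences of any pronoun from the set in the ad text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(tok in pronouns for tok in tokens)

def count_gapped_bigrams(text, bigrams=BIGRAMS, window=2):
    """Count bigrams allowing up to `window - 1` words in between,
    so 'sexy asian chicks' still matches ('sexy', 'chicks')."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = 0
    for i, tok in enumerate(tokens):
        for first, second in bigrams:
            if tok == first and second in tokens[i + 1 : i + 1 + window]:
                hits += 1
    return hits
```

The same helpers work for the first-person plural pronouns and bigrams discussed next, just by swapping the pronoun set and bigram list.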

First-person plural pronouns

Another risk factor for prosecutors is the presence of first-person plural pronouns in the advertisements. This is a factor of interest because it implies that several people are being promoted in the advertisement, which might be related to the existence of some organized group (which most of the time is a prostitution organization). Therefore we are interested in pronouns such as “us”, “our”, “ours”, “ourselves”, and in bigrams such as “with us”, “message us”, “visit us”, etc.

Service is restricted (somehow)

According to the Peruvian prosecutors I interviewed, human trafficking organizations force their victims to have sexual relationships without any kind of restrictions or limits. Since the main focus of these criminals is to satisfy their clients’ requirements, they force their victims to have any kind of sex without any type of protection (use of a condom). Thus, we seek to identify the advertisements that specify some type of limit in the promoted service, because this might give us some clue about the person’s chance to decide what kind of relationship she wants to have.
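A simple way to sketch this feature is phrase matching against a restriction lexicon; the phrases below are illustrative placeholders, not the list actually used:

```python
# Hypothetical restriction phrases; the real list would come from
# domain experts and the keyword sources cited later in the post.
RESTRICTION_PHRASES = [
    "condom only",
    "covered only",
    "i don't do",
    "no exceptions on protection",
]

def has_restrictions(text):
    """Return 1 if the ad mentions any service limit, else 0."""
    text = text.lower()
    return int(any(phrase in text for phrase in RESTRICTION_PHRASES))
```

A 1 here is, counterintuitively, a sign of lower risk: the advertised person appears able to set limits on the service.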

Service place

Human trafficking networks most, if not all, of the time want to keep their victims hidden. In fact, most of them forbid their victims to interact with any kind of computer or cellphone, because this could potentially give them the chance to contact law enforcement or relatives. For prosecutors, therefore, advertisements that only offer in-call services are more likely to be related to human trafficking, because they force the client to go to a specific place, usually controlled by these organizations. Therefore, we characterize advertisements in the following way: -1 if no place is specified, 0 if the ad offers out-call services and 1 if it only offers in-call services. Note that we will treat this feature as a number rather than a categorical feature.
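A minimal sketch of that -1/0/1 encoding could look like this (the substring checks are a simplification of whatever detection the notebook actually does):

```python
def service_place(text):
    """Encode service place: -1 unspecified, 0 out-call offered,
    1 in-call only."""
    text = text.lower()
    incall = "incall" in text or "in-call" in text
    outcall = "outcall" in text or "out-call" in text
    if outcall:
        return 0   # out-call offered (possibly alongside in-call)
    if incall:
        return 1   # in-call only: client must come to a controlled place
    return -1      # no place specified
```

Treating the result as a number keeps the ordering "unspecified < out-call < in-call only", which matches the increasing risk described above.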

Keywords

For this feature, we seek to identify several keywords or expressions commonly used by traffickers, related to the victim’s movement control and the victim’s age. For the former, one strategy of these organizations is to constantly move their victims across several cities or regions so that it is harder for law enforcement to keep track of them; that’s why we look for expressions such as “new in town”, “new girl”, “short-term”, etc. For the latter, criminals usually use keywords referring to the victim’s age, such as “new face”, “turned 18”, “i am a newb”, among others.
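A sketch of the keyword counter, using only the example expressions quoted above (the full lexicon is the one provided by Stop the Traffik, not reproduced here):

```python
# Example expressions from the post; the complete lists come from
# the Stop the Traffik (2022) keyword dataset.
MOVEMENT_KEYWORDS = ["new in town", "new girl", "short-term"]
AGE_KEYWORDS = ["new face", "turned 18", "i am a newb"]

def count_keywords(text, keywords):
    """Count how many of the keywords/expressions appear in the ad text."""
    text = text.lower()
    return sum(kw in text for kw in keywords)
```

Running it once with each list yields two separate features, one for movement control and one for age-related language.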

Note: I want to thank Stop the Traffik (2022), who provided me with a list of keywords commonly found in advertisements proven to be related to human trafficking.

Using the functions

Finally, applying all the previous functions, we get the distributions of our new features shown below. As I mentioned before, once we have identified the emerging communities of our dataset, using features such as phone number, email and external website, we will use these brand-new features to characterize the communities and, finally, create a communities dataset.
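For completeness, applying such per-ad functions over the whole dataset is a one-liner per feature with pandas (the helper and column names here are illustrative):

```python
import pandas as pd

ads = pd.DataFrame({"text": [
    "New in town! Message us for details",
    "Independent and experienced",
]})

# Hypothetical helper: flag ads mentioning a movement-control expression.
def mentions_movement(text):
    return int("new in town" in text.lower())

ads["movement_control"] = ads["text"].apply(mentions_movement)
```

Each new column can then be aggregated per community (sums, means, maxima) when building the communities dataset.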

PS: You can visit the Jupyter notebook I used here.

Distributions of the new features. Images by the author.

What’s next?

In the next post we will focus on exploiting the graph structure of the data. Given that we have already computed all the desired features of the individual advertisements in this article, we will use this information to build a new dataset where the instances are communities rather than individual advertisements.

References

  • Chirag Nagpal, Kyle Miller, Benedikt Boecking and Artur Dubrawski (2017). An Entity Resolution Approach to Isolate Instances of Human Trafficking Online.
  • Hamidreza Alvari, Paulo Shakarian and J. E. Kelly Snyder (2017). Semi-supervised learning for detecting human trafficking.
  • Hamidreza Alvari, Paulo Shakarian and J. E. Kelly Snyder (2016). A Non-Parametric Learning Approach to Identify Online Human Trafficking.
  • Luca Giommoni and Ruth Ikwu (2021). Identifying human trafficking indicators in the UK online sex market.
  • Stop the Traffik (2022). Human trafficking keywords dataset.


Juanchobanano

Hi! My name is Juan Esteban Cepeda. I’m a computer scientist interested in human trafficking, computational consciousness and reinforcement learning.