Combating human trafficking using machine learning: Part 4.

7 min read · Jun 24, 2022

Clip from Canva. License terms can be found here.

Hey! Welcome to Part 4 of this series. If you haven’t read the third part, you can read it here. As promised, now that we have computed all the desired features of the individual advertisements in the previous post, this article focuses on exploiting the graph structure of the data, which will help us build a community dataset.

Let’s begin!

But... what is a community?

Since the first article I have mentioned that we will use the graph structure of the data to identify communities of advertisements. However, I haven’t provided a formal definition of the concept “community”. This concept is very important for us because we can use these structures not only to get a wider sense of how risky a community is (given the features of the advertisements that are part of it) but also to understand how these organizations operate, i.e. to identify which phone numbers, external websites, emails and locations are being used to promote their potential victims. Without further ado, let’s define the concept:

Community definition.
Let G = (V, E) be a graph and A, B vertices of G. We say that vertices A and B are related if and only if there exists a path P = (e1, e2, …, eN) in G that connects them. Hence, a Community C is a maximal set of advertisements such that for all A, B that belong to C, A and B are related (note that the path P might consist of edges of different types).

Since we already uploaded the whole dataset into a Neo4j graph database using the entities and relationships mentioned here, we can rely on Graph Data Science (GDS for short), a data science toolkit provided by Neo4j that allows us to execute different graph algorithms. In our case, we will use the “Connected Components” algorithm (named Weakly Connected Components in GDS), which finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set (this is exactly what we want according to our definition of community!).
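To make the idea concrete outside Neo4j, here is a minimal sketch in plain Python. It is not the GDS implementation: it is a small union-find that assumes every relationship is available as a pair of node identifiers. The function name `find_communities`, the identifiers and the toy data are all hypothetical.

```python
from collections import defaultdict


def find_communities(ad_ids, edges):
    """Group advertisements into connected components with union-find.

    ad_ids: iterable of advertisement identifiers.
    edges:  iterable of (node_a, node_b) pairs; nodes may be ads or shared
            entities such as phone numbers or emails.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in edges:
        union(a, b)

    # Only advertisements are kept in the final communities.
    communities = defaultdict(set)
    for ad in ad_ids:
        communities[find(ad)].add(ad)
    return sorted(communities.values(), key=len, reverse=True)


# Toy data (hypothetical): ads linked through shared contact details.
ads = ["ad1", "ad2", "ad3", "ad4"]
links = [
    ("ad1", "phone:555-0100"),
    ("ad2", "phone:555-0100"),
    ("ad2", "email:x@example.com"),
    ("ad3", "email:x@example.com"),
]
communities = find_communities(ads, links)
```

Here `ad1`, `ad2` and `ad3` end up in the same community because a path of mixed edge types (phone, then email) connects them, while `ad4` forms a community of a single ad.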

Figure 1. Image by the author. Histogram of the number of advertisements in the communities detected. Single-advertisement communities are ignored in this graph.

Using this algorithm we get a total of 2197 communities (remember that we started with a total of 3463 advertisements), of which 1815 correspond to individual advertisements (i.e. communities of a single ad). The five biggest communities found by Neo4j consist of 280, 101, 35, 24 and 21 advertisements respectively, while the average of the rest is around 12 (Figure 1).

The next figures show the distribution of some features of the second biggest community, which consists of 101 advertisements. There are some remarkable facts to point out about this community. For example, its advertisements are distributed across several regions, which might hint at the existence of an organization operating across the whole country (Figure 2).

Figure 2. Image by the author. Region distribution of the second biggest community detected.

Besides that, this community contains some advertisements written in the third person (Figure 3) and in the first person plural (Figure 4), which, again, are two of the riskiest factors according to prosecutors. The former gives a sense that the person advertised is being controlled by a third party, while the latter suggests that several victims are being promoted in the same advertisement, leading to a possible case of facilitation of prostitution.

Figure 3. Image by the author. Third person distribution of the second biggest community detected.

Figure 4. Image by the author. First plural person distribution of the second biggest community detected.

Finally, it is also interesting that this community is composed of people of different ethnicities, which might imply that the organization behind these advertisements has access to several vulnerable populations.

Figure 5. Image by the author. Ethnicity distribution of the second biggest community detected.

Exploiting communities data

Figure 6. Image by the author. Communities detected in Canadian escorts advertisements.

In the previous article we calculated several features of the individual advertisements. Specifically, for each advertisement we computed a vector consisting of the following attributes: third person flag, first plural person flag, service is restricted (somehow), service place and human trafficking keywords flag. Now, we are interested in computing community features using these variables. To achieve this objective, we follow the human trafficking patterns in online advertisements explained in the UNODC (2020) report on human trafficking and in Giommoni, L. & Ikwu, R. (2021).

Text similarity

Figure 7. The difference between Euclidean distance and cosine similarity. An Opportunistic Routing for Data Forwarding Based on Vehicle Mobility Association in Vehicular Ad Hoc Networks — Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/The-difference-between-Euclidean-distance-and-cosine-similarity_fig2_320914786 [accessed 24 Jun, 2022]

Given a set of advertisements, we want to compute the cosine distance (Figure 7) between the two least similar texts. However, this requires encoding the texts as vectors. The latter can be achieved using models such as Word2Vec (Vatsal has a great post about this topic here), from which we can obtain, for a document (a sequence of words), a vector embedding in some (usually high) dimensional space. There are a lot of Python libraries that offer Word2Vec models; in our case, we will use spaCy (check Lars Nielsen’s post about this NLP library here). With this feature we want to understand whether criminal organizations use similar texts to promote different victims or rather write a particular description for each victim.
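As a rough illustration of this feature, the sketch below computes the lowest pairwise cosine similarity in a community (equivalently, the largest cosine distance, which is 1 minus the similarity). It assumes the texts have already been encoded as equal-length numeric vectors, for example with spaCy's `doc.vector`. The function names, the conventions for edge cases and the toy vectors are my own assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero embeddings
    return dot / (norm_u * norm_v)


def min_pairwise_similarity(vectors):
    """Similarity of the two least similar texts in a community."""
    if len(vectors) < 2:
        return 1.0  # assumed convention: a single ad is fully self-similar
    return min(cosine_similarity(u, v) for u, v in combinations(vectors, 2))


# Toy 2-dimensional "embeddings" (hypothetical values).
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
lowest = min_pairwise_similarity(toy_vectors)
```

A value close to 1 would suggest that every ad in the community reuses nearly the same text, while a low value means at least two ads were written quite differently.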

Several regions or cities

Figure 8. Canada regions. Image extracted from https://www.touropia.com/gfx/b/2019/08/canada.png

Given a set of advertisements, we want to determine whether a community offers sexual services in more than one region or city. This feature is very important for prosecutors because it suggests the existence of an organization promoting different victims in several places across the country. In fact, the community explored in the previous section spans many regions and cities that are quite far apart from each other, which makes it unlikely that the same person is offering the same service in all of them, because being in different parts of the country simultaneously is physically impossible. Hence, it is considered a risk factor by law enforcement.
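A minimal sketch of this flag, assuming each advertisement carries a free-text location that may be missing or inconsistently capitalized (the function name and the sample values are hypothetical):

```python
def spans_multiple_locations(locations):
    """Return True when the ads mention more than one distinct location."""
    distinct = {loc.strip().lower() for loc in locations if loc and loc.strip()}
    return len(distinct) > 1


flag_spread = spans_multiple_locations(["Ontario", "Quebec", "ontario "])
flag_single = spans_multiple_locations(["Alberta", None, "alberta"])
```

Normalizing case and whitespace before counting avoids flagging a community just because the same place was typed in two different ways.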

Community size

Figure 9. Example communities size. Image extracted from https://neo4j.com/blog/graph-databases-drupal-neo4j-module-rules-integration/

Another feature we want to compute is the normalized community size. Rizwan Alam has a great explanation of this process here. In our case, since we have communities consisting of just one advertisement, it is clear that the minimum community size is 1. One major issue that arises with this measure is the presence of outliers. For example, the two biggest communities consist of 280 and 101 advertisements respectively, while most of the others consist of only one ad.
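The sketch below shows one way this could look, using min-max normalization. The optional log transform is my own assumption, added only to illustrate how the effect of the outliers could be dampened; it is not necessarily what was used for the dataset in this series.

```python
import math


def min_max_normalize(sizes, use_log=False):
    """Scale community sizes to [0, 1]; optionally take logs first."""
    values = [math.log(s) for s in sizes] if use_log else list(sizes)
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [0.0 for _ in values]  # all communities have the same size
    return [(v - smallest) / (largest - smallest) for v in values]


# Sizes taken from the text: the biggest communities plus two single ads.
sizes = [280, 101, 35, 1, 1]
linear = min_max_normalize(sizes)
logged = min_max_normalize(sizes, use_log=True)
```

With the plain version, the community of 101 ads is mapped to roughly 0.36 because the community of 280 stretches the scale. The log version keeps the same ordering but spreads the smaller communities more evenly.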

Minimum age

Given a set of advertisements, we want to determine the minimum age of the people promoted in that community. This might give a clue that the organization operating behind the community is promoting underage people.
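In code this is a one-liner, with the caveat that some advertisements may not state an age. The sketch below assumes missing ages are stored as `None` and are simply skipped (the function name and the sample ages are hypothetical):

```python
def community_min_age(ages):
    """Smallest declared age in a community, or None if no ad states one."""
    known = [age for age in ages if age is not None]
    return min(known) if known else None


youngest = community_min_age([24, None, 19, 31])
unknown = community_min_age([None, None])
```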

Feature Aggregation

In the previous post we explained the importance of features such as third person, first plural person, service place, service is restricted and human trafficking words. Now, using the community information, we want to aggregate the individual features of each advertisement into a single set of community-level features that represents the overall risk of each community.

We will consider the following approach: if there is at least one advertisement in a community that is written in third person, written in first plural person or includes human trafficking words, we will set the corresponding community flag to true. As for the service place, we will set the service place of a community to the riskiest one (in-call only being the riskiest) found in at least one advertisement that belongs to that community, and for the service is restricted feature, we will consider the most common value found in the community. Additionally, we will consider for the first time the ethnicity information of the advertisements. Specifically, we will use the number of ethnicities that appear in a community as another predictor for our problem.

Now, I define the functions that will help us extract these features from the advertisements of each community. The purpose of these functions is to get a global sense of how risky a community is based on the elements that belong to it.
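The original functions are not reproduced here, so what follows is only a sketch of how the aggregation rules described above could be written. It assumes each advertisement is a dictionary with hypothetical keys, and the risk ordering of the service places other than in-call only is my own assumption.

```python
from collections import Counter

# Assumed risk ordering for service place: higher means riskier.
# Only "incall" being the riskiest comes from the text above.
SERVICE_PLACE_RISK = {"outcall": 0, "incall and outcall": 1, "incall": 2}


def aggregate_community(ads):
    """Collapse a list of advertisement dicts into one community-level dict."""
    restricted = Counter(ad["service_restricted"] for ad in ads)
    return {
        # A flag is true if at least one ad in the community sets it.
        "third_person": any(ad["third_person"] for ad in ads),
        "first_plural_person": any(ad["first_plural_person"] for ad in ads),
        "ht_keywords": any(ad["ht_keywords"] for ad in ads),
        # Riskiest service place found in the community.
        "service_place": max(
            (ad["service_place"] for ad in ads),
            key=SERVICE_PLACE_RISK.__getitem__,
        ),
        # Most common value of the restriction feature.
        "service_restricted": restricted.most_common(1)[0][0],
        # Number of distinct ethnicities, ignoring missing values.
        "n_ethnicities": len({ad["ethnicity"] for ad in ads if ad["ethnicity"]}),
    }


# Toy community (hypothetical values).
toy_ads = [
    {"third_person": False, "first_plural_person": True, "ht_keywords": False,
     "service_place": "outcall", "service_restricted": True, "ethnicity": "A"},
    {"third_person": True, "first_plural_person": False, "ht_keywords": False,
     "service_place": "incall", "service_restricted": False, "ethnicity": "B"},
    {"third_person": False, "first_plural_person": False, "ht_keywords": False,
     "service_place": "incall and outcall", "service_restricted": False,
     "ethnicity": "A"},
]
features = aggregate_community(toy_ads)
```

For this toy community the third person and first plural person flags are true, the service place becomes `incall`, the restriction feature takes its majority value and two ethnicities are counted.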

Building a new dataset

Figure 10. Image by the author. Community dataset info.

Finally, applying all the previous functions, we get the community dataset (Figure 10) consisting of a total of 2197 communities. Each row of this dataset corresponds to a community vector embedding consisting of the features previously explained.

What’s next?

In the next post we will apply weak learning to generate pseudo-labels for our dataset. Using these pseudo-labels we will determine which are the most important predictors for our problem and, finally, using this information we will train a machine learning model to classify our data.

References

Giommoni, L. & Ikwu, R. (2021). Identifying human trafficking indicators in the UK online sex market.
UNODC (2020). Global Report on Trafficking in Persons 2020.