Grooming Detection Part. 1: How to detect grooming in chat data?

Roxane Bois
Besedo Engineering Blog
9 min read · Jan 10, 2023
Photo by Jake Schumacher on Unsplash

The rapid evolution of technology lets us connect more easily with people worldwide. However, this broad connectivity is not without risks. Young users now go online at an increasingly early age, so it is important to protect them from harmful encounters that can lead to grooming.

Grooming is a communication process in which an offender tries to engage an underage user in a relationship via the web, most of the time to perpetrate molestation [1][5]. At Besedo, user protection is a priority, especially child protection. Chat message moderation raises sensitive yet important questions, and we want to be prepared to handle this kind of danger for our clients and their users.

This blog post series about CSAM detection aims to raise awareness of the phenomenon and suggest some technical tips for dealing with it. This first post covers the most relevant state-of-the-art findings on grooming and gives an overview of the existing datasets for grooming detection. Let’s go!

Two approaches in grooming detection

In the state of the art, we can find two kinds of papers about grooming detection: papers from the forensic community and papers from the data science community. The table below sums up the strengths and weaknesses of both, as discussed by Milon-Flores and Cordeiro in [2].

Forensic approach. Pros: behavioural features. Cons: not automated; analyses complete conversations, when it is already too late.
Preventive approach. Pros: automated, fast, works in real time. Cons: filtered data; information lost in preprocessing.
Pros and Cons of the Forensic Approach vs. the Preventive Approach to Detect Grooming

Even though our application is on the data science side, we found it very important to also learn from forensics. Let’s start with the forensic studies, as they bring great insights into offenders’ and victims’ behaviours, which is exactly what we want to detect in our data.

How do offenders approach minors online?

To summarise the information we found in the SOTA, here is a worst-case scenario of what grooming implies. Several studies describe 3 main stages in the grooming communication process, but we added a 4th given the high risks incurred at that last stage. We will use a male character here for convenience.

1) Breaking the ice: Grooming can start as a normal conversation, with a simple “Hello, what’s your name?”. In this “breaking the ice” stage, the offender tries to gather personal information about his targeted victim: a (full) name, age, and location. This information seems commonplace, but sharing it is dangerous for the victim. If the offender is experienced, he may even isolate the victim in a less moderated, private chat to remain discreet: “Do you have Discord?”. He can also pretend to be a child; in that case, unfortunately, the offender is nearly impossible to detect…

2) The deceptive relationship: After a few exchanges, a relationship takes shape. The offender tries to gain his victim’s trust. They may talk about parents, school, friends, and hobbies. He manipulates his victim so that nobody knows they are talking to an older person: isolation and secrecy are a big part of grooming. Exclusivity can appear through “flirtatious” texts that imitate a strong, fusional relationship.

3) Sexual intentions: Once trust is built, discussions can effortlessly drift towards sexuality. Groomers tend to discuss awful things like puberty stages and sexual fantasies. The offender might text like a young person discovering relationships and sexuality and/or ask lots of inappropriate questions (“Have you ever kissed a boy?”, “What would you do if I did that?”). He can also reframe sexual actions as acceptable given that he is talking to a younger person (i.e., saying things without saying them explicitly, through metaphors and analogies). Some very raw and harsh sexual messages can be found too.

4) Real-life approach: The worst-case scenario includes a real-life meeting, which represents huge physical, emotional and psychological risks for the victim: sexual molestation and child abuse, which can also be recorded and shared online in paedophile communities (leading to CSAM).

Of course, conversations between offenders and minors are not always this subtle. Sometimes groomers are sexually explicit from the start, which makes them easier to detect and stop. But these forensic studies teach us that contact information, requests to move to external chats, sexual words, and the arrangement of a real-life meeting may be key features for detecting grooming in text. In addition, know that in public chats, denunciations by other users can help detect grooming too: “You perv”, “He’s a pedo”, “she’s like 14” [4].

Existing grooming datasets

Most of the time, grooming appears in chat data, which is difficult to collect. In addition, child abuse is a very sensitive subject and a rare phenomenon in real-life data: according to an important study on grooming [1], a representative dataset would contain less than 4% grooming conversations. Now, where can we find grooming data?

  • PAN12 is the most well-known grooming dataset. It was made for a grooming detection challenge in 2012. PAN12 contains chat messages with binary labels at both the chat and message levels. Its true negatives come from IRC logs (a computer science forum), its false positives from Omegle (which may contain consensual sexual conversations), and its true positives from Perverted Justice’s Archives, which seems to be the only grooming data publicly available since then.

📌 Perverted Justice’s Archives contain one-on-one conversations between real groomers and people pretending to be underage victims. Almost every grooming study is based on this data. However, keep in mind that this dataset carries some biases:
· The dataset is ten years old, and the English language evolves very fast.
· The people answering the groomers are not real victims but volunteers acting like them.
· Regarding PAN12: despite the Omegle chat messages used as false positives, there is a huge gap between computer science conversations and one-on-one personal conversations containing grooming.
· Plus, it is possible that the Omegle data itself contains grooming.

  • In early 2022, Milon-Flores and Cordeiro [2] built a new grooming dataset, PJZC, using the same method as PAN12. The authors took newer IRC logs (2013–2022) as true negatives, “chit-chat” conversations as false positives, and the last grooming chats from PJ’s Archives (2013–2014) as true positives. This dataset is also binary annotated, but only at the chat level.
PAN12: 139,740 grooming texts (chat-level label) out of 2,669,388 texts in total.
PJZC: 20,570 grooming texts (chat-level label) out of 372,161 texts in total.
Description of Open-access Grooming Datasets PAN12 and PJZC

Given all this, we think a nice and representative open-access dataset to start grooming detection with is the one used in Milon-Flores and Cordeiro’s study, which mixes PJZC and PAN12. Its content is described in the table above. It covers a wide period (2012–2022, though only 2012–2014 for the grooming chats) without overlap.
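PAN12 is distributed as XML files of conversations. As a starting point, here is a minimal loading sketch; it assumes the usual PAN12-style layout (a <conversations> root holding <conversation> elements whose <message> children contain <author>, <time> and <text> tags), so verify the tag names against your copy of the corpus.

```python
# Minimal sketch: stream PAN12-style chat XML into per-message rows.
# Assumption: <conversations>/<conversation>/<message> with <author> and
# <text> children, as in the PAN12 corpus files. Check your copy.
import xml.etree.ElementTree as ET

def load_messages(xml_path):
    """Yield (conversation_id, author, text) for every chat message."""
    root = ET.parse(xml_path).getroot()
    for conv in root.iter("conversation"):
        conv_id = conv.get("id")
        for msg in conv.iter("message"):
            author = msg.findtext("author", default="")
            text = msg.findtext("text", default="") or ""
            yield conv_id, author, text.strip()

# Usage (hypothetical file name):
# rows = list(load_messages("pan12-training-corpus.xml"))
```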

💡 If you have some resources and time for data annotation, know that recent chats from teen communities can be interesting for detecting grooming. For instance, think of Discord or Reddit, which are not heavily moderated, or even TikTok’s and Instagram’s comments. In-game and live-platform chats, social media, and dating websites in general might be interesting data sources, but their access is often restricted.

Processing chat messages

On the data science side now, we find topics such as (e)SPD, standing for (early) Sexual Predator Detection. Those papers aim to detect sexual predators, including groomers, in one-on-one chat messages. An important question to tackle in your grooming detection is: should you classify messages or conversations? A clear decision has to be made, as chat messages are often very short and more context is usually needed to understand the overall meaning.

Suppose you decide to classify messages. In that case, the challenge with chat data is to add context to each single message, since we often can’t make a solid decision (i.e., grooming or not) based on one text alone: the previous texts should also be taken into account. The field of Early Text Classification (ETC) proposes some solutions to represent this context. It focuses on dynamic representations of features and/or texts to process chat data. For instance, a quick and suitable representation is to concatenate the texts: this is the solution Milon-Flores and Cordeiro [2] used. If you do so, you will have the following data representation:

“index”, “text”, “label”
“0”, “Hello baby, how are you?”, “grooming”
“1”, “Hello baby, how are you? Hello”, “grooming”
“2”, “Hello baby, how are you? Hello Are you home alone?”, “grooming”
Chat messages with concatenated texts
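Below is a minimal Python sketch of this concatenated representation, writing one row per message with the running conversation so far; the messages and label are the illustrative ones from the table above.

```python
# Build the concatenated-text representation: each row carries the whole
# conversation up to and including the current message.
import csv

messages = ["Hello baby, how are you?", "Hello", "Are you home alone?"]
label = "grooming"  # illustrative chat-level label

with open("concatenated.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "text", "label"])
    history = []
    for i, msg in enumerate(messages):
        history.append(msg)
        writer.writerow([i, " ".join(history), label])
```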

But this solution can introduce biases, because the previous texts are processed inside the current text. Some context may be added using metadata instead:

“index”, “text”, “chat history*”, “user logs**”, “label”
“0”, “Hello baby, how are you?”, “”, “”, “grooming”
“1”, “Hello”, “Hello baby, how are you?”, “”, “grooming”
“2”, “Are you home alone?”, “Hello baby, how are you? Hello”, “Hello baby, how are you?”, “grooming”
Chat messages with additional metadata

*chat history: history of the overall conversation

**user logs: history of the user messages
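Here is a comparable sketch for the metadata representation: each row keeps the raw message and adds the conversation history and the sender’s own past messages as separate columns. The (user, text) pairs are illustrative.

```python
# Build the metadata representation: raw text plus chat history and
# per-user message logs as separate columns.
import csv

chat = [("A", "Hello baby, how are you?"),
        ("B", "Hello"),
        ("A", "Are you home alone?")]
label = "grooming"  # illustrative chat-level label

with open("with_metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["index", "text", "chat_history", "user_logs", "label"])
    history, per_user = [], {}
    for i, (user, text) in enumerate(chat):
        writer.writerow([i, text,
                         " ".join(history),                 # conversation so far
                         " ".join(per_user.get(user, [])),  # this user's past messages
                         label])
        history.append(text)
        per_user.setdefault(user, []).append(text)
```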

That way, all information can be considered in the training process, but not all the texts from the chat history and the user logs may be relevant to the final decision. Another solution is to complement or replace those additional columns with specific features representing the text or the history/logs content.

Grooming Features

Some data science studies use grooming features inspired by forensic studies to detect grooming. Those features can be extracted with a rule-based matching solution, and its output can then be used for text augmentation. A made-up example is provided in the table below.

“index”, “text”, “sexual_count”, “approach_count”, “parents_count”, “spans”, “label”
“74”, “I pick you up at 1:30”, “0”, “2”, “0”, “[“pick you up”, “1:30”]”, “grooming”
Chat messages with grooming features
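As an illustration, here is a minimal rule-based matching sketch in the spirit of the table above; the pattern lists are hypothetical placeholders, and a real system would use lexicons derived from the forensic studies and from your own data.

```python
# Rule-based feature extraction: count pattern matches per feature family
# and keep the matched spans. The regexes below are illustrative only.
import re

FEATURES = {
    "sexual": [r"\bsexy?\b", r"\bnaked\b"],
    "approach": [r"\bpick you up\b", r"\b\d{1,2}:\d{2}\b"],
    "parents": [r"\bparents?\b", r"\bhome alone\b"],
}

def extract_features(text):
    counts = {name: 0 for name in FEATURES}
    spans = []
    for name, patterns in FEATURES.items():
        for pat in patterns:
            for match in re.finditer(pat, text, flags=re.IGNORECASE):
                counts[name] += 1
                spans.append(match.group())
    return counts, spans

counts, spans = extract_features("I pick you up at 1:30")
# counts -> {'sexual': 0, 'approach': 2, 'parents': 0}
# spans  -> ['pick you up', '1:30']
```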

A rule-based matching solution can also be used on its own to filter potential grooming messages, which are then reviewed by human experts: no model is necessarily needed. Those features can also help with data annotation. Here is a non-exhaustive list of features that can help in grooming detection, with some examples. They come from the studies referenced in the table below.

Note that your task definition may lead you to prioritise some grooming features over others. Pick from the table the features most relevant to your data and to what you want to detect. For example:

  • Explicit grooming denunciations may be sufficient to detect grooming in public chats or on live platforms (e.g., “Ok… pedo.”).
  • On dating websites, people looking for underage partners (e.g., “I’m 35, (…) looking for a younger girl”) and underage users interested, or not, in sexual exchanges (e.g., “16yo gayteen 👅 Chat Hot 👀”) could be prioritised. You can also pay attention to the sugar daddy/sugar mommy phenomenon when sugar babies are underage.
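If you go down that road, a tiny sketch like the following can flag both cases; the regexes are hypothetical starting points, not validated lexicons.

```python
# Flag denunciations and self-stated underage ages in a message.
# Both patterns are illustrative placeholders.
import re

DENUNCIATION = re.compile(r"\b(pedo|perv(ert)?|groomer)\b", re.IGNORECASE)
STATED_AGE = re.compile(r"\b(\d{1,2})\s*(yo|years?\s*old)\b", re.IGNORECASE)

def flag_message(text):
    flags = []
    if DENUNCIATION.search(text):
        flags.append("denunciation")
    m = STATED_AGE.search(text)
    if m and int(m.group(1)) < 18:
        flags.append("underage_stated")
    return flags

print(flag_message("Ok... pedo."))            # ['denunciation']
print(flag_message("16yo gayteen chat hot"))  # ['underage_stated']
```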

[Bonus] Existing Open-access Models to Detect Grooming

In addition to those datasets and grooming features, two models have already been trained to detect grooming. Both classify conversations according to whether they contain grooming (label=1) or not (label=0).

Conclusion

As we just saw, detecting grooming in data is a very challenging task. Not only does grooming appear in chat data, which consists of very short texts with very little context, but there are also few open-access resources around this sensitive subject online. PAN12 and PJZC are two datasets you can use for your grooming detection task: they contain Perverted Justice’s Archives of one-on-one conversations between real groomers and volunteers pretending to be victims.

To process chat messages, many configurations are possible. The important decisions are to (i) choose between a rule-based solution, a classification model, both, or some other automation, (ii) decide whether to process entire chats or single messages, and (iii) predict the presence/absence of grooming or some more detailed features. For instance, contact information/requests, sexual content, age-related conversations, and real-life approaches may be key markers of grooming. These are at least the steps of the communication process where underage users face huge risks.

We hope this blog post helped you understand grooming and how to start your detection project! In the next blog posts, we will share two examples of grooming detection and their results: Part. 2 presents a rule-based matching solution we tried, and Part. 3 will jump into a BERT model experiment. Stay tuned!

References

[1] Escalante et al. (2013). Sexual predator detection in chats with chained classifiers. [paper]

[2] Milon-Flores and Cordeiro (2022). How to take advantage of behavioural features for the early detection of grooming in online conversations. [paper]

[3] Vogt, Leser and Akbik (2021). Early Detection of Sexual Predators in Chats. [paper]

[4] Lykousas and Patsakis (2020). Large-scale analysis of grooming in modern social networks. [paper]

[5] Gunawan et al. (2016). Detecting online child grooming conversation. [paper]
