Detecting Predatory Conversations: Leveraging the NPS Internet Chatroom Corpus for Online Safety

Mohamad Mahmood
Published in Lexiconia
3 min read · Jun 18, 2024

Detecting predatory conversations means using natural language processing and machine learning to automatically identify potentially harmful or abusive interactions in online chat, messaging, and other digital communication platforms. The key objectives are to analyze linguistic patterns, recognize sequences of predatory behavior, and classify conversations as benign, suspicious, or high-risk, so that platforms can intervene in time and protect vulnerable users such as minors. Annotated datasets like the NPS Internet Chatroom Conversations corpus provide linguistic and contextual information about real chat, which researchers can use to develop models that detect the indicators and progression of online predatory behavior.

.

(Further reading: Online grooming detection: A comprehensive survey of child exploitation in chat logs)

.

[1] The NPS Internet Chatroom Conversations

The NPS Internet Chatroom Conversations corpus consists of 10,567 English chat posts (45,068 tokens) gathered from age-specific online chat rooms in October and November 2006. Each file contains a text recording from one of these chat rooms for a specific day. The posts are annotated with part-of-speech tags from the Penn Treebank and dialog act tags. All usernames have been anonymized to protect privacy. This corpus was developed by researchers at the Naval Postgraduate School and is one of the first publicly available text chat corpora annotated with linguistic information.

.

The corpus was compiled at the Department of Computer Science at the Naval Postgraduate School. The posts were gathered from various age-specific online chat rooms across different chat services, and each file is a text recording from one of these chat rooms over a short period on a specific day. This initial release is a subset of the approximately 500,000 chat posts that were collected; at the time of release, future versions were planned to include more of this wider data set as the research project progressed.

.

The NPS Internet Chatroom Conversations corpus includes metadata and linguistic annotations for each post. The filenames encode the date the posts were collected, the target age group of the chat room, and the number of posts in the file. For example, “10-19-20s_706posts.xml” indicates that the file contains 706 posts from the 20s age group chat room collected on October 19th. All usernames have been anonymized and replaced with generic “UserN” identifiers. The posts were initially part-of-speech tagged using a tagger trained on the Penn Treebank corpus. They were then further annotated with dialog act tags, such as “Statement”, “whQuestion”, or “System”, to capture the communicative function of each message. These annotations make the corpus a valuable resource for researchers interested in natural language processing of computer-mediated communication.
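The filename convention described above can be unpacked with a few lines of Python. This is only a minimal sketch: it assumes every file follows the month-day-agegroup_Nposts.xml pattern of the example (written with plain hyphens), and the names ChatFileInfo and parse_chat_filename are hypothetical helpers, not part of any library.

```python
import re
from typing import NamedTuple


class ChatFileInfo(NamedTuple):
    """Metadata carried by a corpus filename (hypothetical helper type)."""
    month: int
    day: int
    age_group: str
    post_count: int


# Pattern assumed from the example filename in the text:
# <month>-<day>-<age group>_<N>posts.xml
_FILENAME_RE = re.compile(r"^(\d{1,2})-(\d{1,2})-([A-Za-z0-9]+)_(\d+)posts\.xml$")


def parse_chat_filename(name: str) -> ChatFileInfo:
    """Split a filename into its date, age-group, and post-count parts."""
    match = _FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized filename: {name!r}")
    month, day, age_group, count = match.groups()
    return ChatFileInfo(int(month), int(day), age_group, int(count))


info = parse_chat_filename("10-19-20s_706posts.xml")
print(info.month, info.day, info.age_group, info.post_count)
```

Raising an error on names that do not match keeps a silent mislabel from slipping into later analysis.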

.

The NPS Internet Chatroom Conversations corpus has been widely used by researchers to develop and test a variety of natural language processing applications for computer-mediated communication domains. Researchers have used the corpus to explore tasks such as conversation thread topic detection, author profiling, named entity identification, and social network analysis. The annotated linguistic information has also proven valuable for creating and evaluating micro-text classification methods, particularly in the context of military chat applications. Additionally, the corpus has been used as a training dataset to build programs that automatically detect the age group of chat participants, which can be useful for identifying potentially suspicious online behavior. Overall, this annotated corpus of real-world chat data gives researchers a shared resource for developing and evaluating language analysis methods for computer-mediated communication.
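To make the age-group use case a little more concrete, the sketch below shows one possible way to turn tokenized posts into labelled training examples. It is an illustration only, not the method of any cited work: the bag-of-words features, the helper names bag_of_words and labelled_examples, and the toy posts are all assumptions, and the age group is read from filenames that follow the month-day-agegroup_Nposts.xml pattern.

```python
from collections import Counter


def bag_of_words(tokens):
    """Lower-cased token counts; a deliberately simple feature set."""
    return Counter(token.lower() for token in tokens)


def labelled_examples(posts_by_file):
    """Yield (features, age_group) pairs.

    `posts_by_file` maps a corpus-style filename to a list of tokenized posts.
    The age group is the third dash-separated field before the underscore.
    """
    for filename, posts in posts_by_file.items():
        age_group = filename.split("_")[0].split("-")[2]
        for tokens in posts:
            yield bag_of_words(tokens), age_group


# Toy data invented for illustration; not taken from the corpus.
toy = {
    "10-19-20s_706posts.xml": [["hi", "all"], ["anyone", "here", "?"]],
    "11-09-teens_706posts.xml": [["lol", "LOL"]],
}
examples = list(labelled_examples(toy))
for features, label in examples:
    print(label, dict(features))
```

With NLTK installed, the real mapping could be built from nps_chat.fileids() and nps_chat.posts(fileid), and the resulting pairs passed to a classifier of your choice.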

.

(Further reading: “The Nps Chat Corpus”, https://web.archive.org/web/20201229175705/https://bond-lab.github.io/Corpus-Linguistics/pdf/corpora/2015-NPS-chat.pdf)

.

(Further reading: “Lexical and Discourse Analysis of Online Chat Dialog”, https://ieeexplore.ieee.org/document/4338328, https://web.archive.org/web/20240503022904/https://core.ac.uk/download/pdf/36731948.pdf)

.

[2] NLTK NPS Chat Corpus

.

NLTK ships the NPS Chat Corpus, Release 1.0, which consists of over 10,000 posts from age-specific chat rooms that have been anonymized, POS-tagged, and dialogue-act tagged. (Further reading: https://www.nltk.org/howto/corpus.html#:~:text=nps_chat)

.

import nltk
from nltk.corpus import nps_chat
nltk.download('nps_chat')  # fetch the corpus data (only needed once)

# Total number of posts
print(len(nps_chat.xml_posts()))
# 10567

# Flat list of tokens
print(nps_chat.words())
# ['now', 'im', 'left', 'with', 'this', 'gay', 'name', ...]

# (token, POS tag) pairs
print(nps_chat.tagged_words())
# [('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]

# One list of (token, POS tag) pairs per post
print(nps_chat.tagged_posts())
# [[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ('with', 'IN'), ('this', 'DT'), ('gay', 'JJ'), ('name', 'NN')], [(':P', 'UH')], ...]
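The dialogue-act tags can be inspected in a similar way. The sketch below counts dialogue-act labels over a small hand-written XML sample, so that it runs without downloading anything. The sample layout (a Post element whose class attribute holds the dialogue act) is an assumption that imitates the corpus files, the posts are invented, and dialog_act_counts is a hypothetical helper.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hand-written sample imitating the corpus XML layout (an assumption):
# each <Post> carries its dialogue act in the "class" attribute.
SAMPLE_XML = """
<Session>
  <Posts>
    <Post class="Greet" user="UserA">hi everyone</Post>
    <Post class="whQuestion" user="UserB">where are you from</Post>
    <Post class="Statement" user="UserA">i live near the coast</Post>
    <Post class="Statement" user="UserB">nice</Post>
  </Posts>
</Session>
"""


def dialog_act_counts(posts):
    """Count how often each dialogue-act label occurs in <Post> elements."""
    return Counter(post.get("class") for post in posts)


sample_posts = ET.fromstring(SAMPLE_XML).iter("Post")
counts = dialog_act_counts(sample_posts)
print(counts.most_common())
```

Because nps_chat.xml_posts() returns XML elements, the same function should also work on the real corpus, for example dialog_act_counts(nps_chat.xml_posts()).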

.

Mohamad Mahmood
Lexiconia

Programming (Mobile, Web, Database and Machine Learning). Studies at the Center For Artificial Intelligence Technology (CAIT), FTSM, UKM, Malaysia.