Tracking Hackers with NLP and Machine Learning

A team of data science researchers from multiple universities has devised an automated approach for analyzing activity on underground cybercrime forums.

As technology evolves, so does cybercrime. Cybercriminals, especially blackhat hackers and identity thieves, depend on underground online forums to communicate, often to initiate transactions. They buy and sell a variety of illicit products and services, such as stolen credit cards, online credentials, compromised hosts, and hacking tools.

Cybercrime researchers and law enforcement need a broad understanding of the scale and scope of activity on these underground markets, but it takes human analysts a long time to read through entire forums. To speed up this process, a multi-university team of researchers including Damon McCoy, Assistant Professor of Computer Science and Engineering at NYU, has developed new natural language processing (NLP) tools that can be trained on forum-specific data to categorize posts and determine which products are being bought and sold, and at what prices. Annotations that would take a human analyst many hours to complete take the new NLP tools between five and fifteen minutes, depending on the forum.

But developing NLP tools for underground forums presents a unique set of challenges. Forum users do not adhere to conventional grammar, and their posts are sometimes incomprehensible. Furthermore, grammar and communication styles can vary widely from one forum to another. McCoy and his fellow researchers overcame this challenge by tuning their NLP tools to complete a precise set of tasks rather than to comprehend the full meaning of every post on a forum.
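To make the idea of a narrow task concrete, here is a minimal sketch in Python that pulls prices out of a post without trying to understand the rest of it. It is not the researchers' method: their tools are trained on forum data, whereas this stand-in uses a hand-written regular expression. The pattern, the currency list, and the function name `extract_prices` are all assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not from the research):
# either a currency symbol followed by a number, or a number
# followed by a short currency code.
PRICE_RE = re.compile(
    r"(?P<sym>[$€£])\s?(?P<a>\d+(?:\.\d+)?)"
    r"|(?P<b>\d+(?:\.\d+)?)\s?(?P<code>usd|eur|btc)\b",
    re.IGNORECASE,
)


def extract_prices(post):
    """Return a list of (amount, currency) pairs found in a post."""
    results = []
    for match in PRICE_RE.finditer(post):
        if match.group("sym"):
            results.append((float(match.group("a")), match.group("sym")))
        else:
            results.append((float(match.group("b")), match.group("code").lower()))
    return results


# Invented example post; the messy grammar does not matter to the extractor.
print(extract_prices("selling fresh cc $5 each or 0.01 btc"))
```

A hand-written pattern like this would break on spellings it has not seen, which is one reason a tool trained on each forum's own data is more practical.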

Their automated tools can identify post category, product, and price with a minimum accuracy rate of 80%. The accuracy rate can be even higher, sometimes near 100%, but machine learning-based methods degrade when they are applied to forums other than the ones on which they’ve been trained.
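The drop in accuracy across forums can be illustrated with a toy experiment. The Python sketch below trains a simple word-count classifier on posts from one forum and then scores it on posts from the same forum and from a second forum with different slang. Everything here is assumed for illustration: the posts are invented, the labels are simplified to "buy" and "sell", and the classifier is far cruder than the researchers' models.

```python
from collections import Counter, defaultdict


def train(posts):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in posts:
        for word in text.lower().split():
            counts[word][label] += 1
    return counts


def predict(counts, text, default="other"):
    """Let every known word vote for the labels it was seen with."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default


def accuracy(counts, posts):
    """Fraction of posts whose predicted label matches the true label."""
    correct = sum(predict(counts, text) == label for text, label in posts)
    return correct / len(posts)


# Invented data: forum A is used for training and for one test set.
forum_a_train = [
    ("selling fresh cards", "sell"),
    ("wts paypal accounts", "sell"),
    ("buying rdp access", "buy"),
    ("wtb bank logs", "buy"),
]
forum_a_test = [("selling fresh dumps", "sell"), ("wtb bank dumps", "buy")]

# Invented data: forum B uses slang the model has never seen.
forum_b_test = [
    ("h4ve cvv 4 sale", "sell"),
    ("got fullz hmu", "sell"),
    ("need logs asap", "buy"),
    ("looking 4 rdp", "buy"),
]

model = train(forum_a_train)
print("same forum:", accuracy(model, forum_a_test))
print("other forum:", accuracy(model, forum_b_test))
```

On this invented data the classifier gets every forum A post right but only half of the forum B posts, because the words that signal intent on forum B never appeared in training.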

These new NLP tools can be used by future researchers to gain a holistic understanding of the activity on underground forums, and they could be used by law enforcement to respond rapidly to large-scale cybercrime events.

With an automated approach, investigators can quickly spot sharp upticks in listings for products such as credit card numbers. A spike of this kind can indicate a data breach, so detecting it early allows a faster response.
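One simple way to flag such an uptick is to compare each day's listing count with a trailing baseline. The Python sketch below is an assumption about how that could be done, not a description of the researchers' tools: the daily counts are invented, and the window size, the threshold, and the function name `flag_upticks` are illustrative choices.

```python
from statistics import mean, stdev


def flag_upticks(daily_counts, window=7, threshold=3.0):
    """Return indexes of days whose count is far above the trailing baseline.

    A day is flagged when its count exceeds the mean of the previous
    `window` days by more than `threshold` standard deviations.
    """
    flagged = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat baseline: any increase at all counts as an uptick.
            if daily_counts[i] > mu:
                flagged.append(i)
            continue
        if (daily_counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Invented daily counts of credit card listings; the jump starts at index 9.
counts = [12, 15, 11, 14, 13, 12, 16, 14, 13, 95, 90]
print(flag_upticks(counts))
```

Only the first day of the jump is flagged here, because that day then becomes part of the baseline and inflates it. A production system would need to handle that, for example by excluding flagged days from later baselines.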

To continue their research, McCoy and other members of his team, along with Sam Bowman, Assistant Professor of Linguistics and Data Science at NYU, have received a seed grant from the Moore Sloan Data Science Environment, a cross-institutional initiative at NYU CDS. They plan to explore how private messages, where the actual transactions occur, affect prices and ultimately determine the revenue of underground markets.

By Paul Oliver