How Computational Linguistics Is Predicting Online Troll Attacks
Reddit, unsurprisingly, is the go-to platform for this subject. Browsing through an r/ thread stirs up a lot of emotions.
There are thoughtful and sweet comments, “Uhh…” threads, and self-aware threads.
Yeah, trolls. Trolls are inevitable and they’re everywhere. No matter what you do online, they’re fleas you can’t get rid of. To cut to the chase: because we can’t really stop trolls from trolling (especially not on Reddit), researchers and linguists are devising ways to predict whether posts will be trolled.
So, I’ll be looking at semantic analysis, a subfield of natural language processing (itself a branch of computational linguistics) that uses algorithms to digitally analyze the meaning of words and sentences. Now it’s being used to predict troll activity.
The Anatomy of an r/ Thread
Forums are arranged as trees: the original post (OP) is the root, and the replies branch off from it down to the last one (a leaf). In programming, we’d refer to these components as nodes (parent node = OP, child nodes = replies).
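To make the tree idea concrete, here’s a minimal Python sketch of a thread, where the OP is the root node and each reply is a child node. The class and field names are my own illustration, not anything from the studies.

```python
# A forum thread as a tree: the OP is the root; replies are child nodes.
class Node:
    def __init__(self, author, text):
        self.author = author
        self.text = text
        self.children = []  # replies to this comment

    def add_reply(self, reply):
        self.children.append(reply)
        return reply

    def depth(self):
        # A leaf (a comment with no replies) has depth 1;
        # each level of replies adds 1 to the deepest branch.
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children)

# Build a tiny thread: OP -> reply -> reply-to-reply (the leaf)
op = Node("user_a", "Original post")
reply = op.add_reply(Node("user_b", "First reply"))
reply.add_reply(Node("user_c", "Reply to the reply"))

print(op.depth())  # 3
```

Walking a tree like this is exactly what “reconstructing the conversation” means in the studies below.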
To illustrate the technique (which heavily relies on data), I’ll cite two studies trying to model and predict troll attacks.
The first is from UT Dallas researchers Mojica and Ng. To navigate the tree-like structure of Reddit forums, they used Lucene to build an inverted index over the comments and queried it for comments containing the word “troll” and close variations of it, assuming those comments would be reasonable candidates for real trolling attempts. They then categorized each candidate along four criteria.
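Mojica and Ng used Apache Lucene for this; as a rough illustration of what an inverted index and keyword query do, here’s a toy pure-Python version. The comments and the list of “troll” variants are made up for the example.

```python
import re
from collections import defaultdict

# Toy inverted index: maps each token to the IDs of comments containing it.
# (An illustration of the idea only; the actual study used Apache Lucene.)
comments = {
    0: "nice post, thanks for sharing",
    1: "obvious troll is obvious",
    2: "stop trolling the thread",
    3: "I disagree but fair point",
}

index = defaultdict(set)
for doc_id, text in comments.items():
    for token in re.findall(r"\w+", text.lower()):
        index[token].add(doc_id)

# Query for "troll" and close variations (variant list chosen for illustration)
variants = {"troll", "trolls", "trolling", "trolled", "troller"}
hits = set()
for word in variants:
    hits |= index.get(word, set())

print(sorted(hits))  # [1, 2]
```

The index makes the lookup cheap: instead of scanning every comment for every variant, you jump straight to the matching document IDs.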
Subtlety refers to how subtle the trolling was; Interpretation refers to how the users interpreted the trolling.
Reportedly, 44.3% of the comments were real trolling attempts.
So, they rebuilt the conversation trees: “For each retrieved comment, we reconstructed the original conversation tree it appears in, from the parent node to the [last of its] children.” That left them with a dataset of 1,000 conversations composed of 6,833 sentences and 88,047 tokens; the next step was to build a program that classifies the comments according to the criteria.
They trained four independent classifiers using logistic regression (via the Python machine learning library scikit-learn), using both a basic and an extended feature set to model their results.
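Here’s a sketch of what “four independent logistic-regression classifiers” looks like in scikit-learn: one binary classifier per annotation aspect, each trained on its own labels. The texts, labels, and aspect names below are toy placeholders, not the actual corpus or feature sets from the paper.

```python
# Sketch: one independent logistic-regression classifier per aspect,
# built with scikit-learn. All data below is made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "u mad bro? lol",
    "thanks, that was really helpful",
    "obvious bait, do not feed the troll",
    "great write-up, learned a lot",
]
# One binary label list per aspect (toy labels, one per comment above)
aspects = {
    "intention":      [1, 0, 1, 0],
    "disclosure":     [0, 0, 1, 0],
    "interpretation": [1, 0, 1, 0],
    "response":       [1, 0, 0, 0],
}

classifiers = {}
for aspect, labels in aspects.items():
    # Bag-of-words features feeding a logistic-regression model
    clf = make_pipeline(CountVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    classifiers[aspect] = clf

# Each classifier now predicts its own aspect independently for a new comment
print(classifiers["intention"].predict(["nice troll attempt, u mad?"]))
```

The point of keeping the classifiers independent is that a comment’s intention, its disclosure, how readers interpret it, and how they respond are predicted separately, each from its own labeled signal.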
Linguistic Features in Analyzing Trolls
GloVe embeddings (pretrained word vectors) and FrameNet (a lexical database of semantic frames) are resources too advanced to condense into a short summary. What’s important to know is that their analysis entailed looking at different everyday linguistic cues plus some visual companions (emojis). In the end, Mojica and Ng got suboptimal results, shown below (the gist of the table is to show which users were trolling based on intention, the subtlety of that intention, interpretation, and reaction).
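To give a feel for how word embeddings become classifier features, here’s a common trick: average the vectors of a comment’s words into one fixed-length vector. The 3-dimensional vectors below are toy stand-ins; real GloVe vectors are typically 50–300 dimensions, loaded from a pretrained file.

```python
import numpy as np

# Toy 3-d "embeddings" standing in for real pretrained GloVe vectors.
embeddings = {
    "you": np.array([0.1, 0.3, 0.0]),
    "mad": np.array([0.9, 0.1, 0.2]),
    "bro": np.array([0.5, 0.2, 0.7]),
}

def comment_vector(tokens):
    """Average the word vectors of known tokens into one feature vector."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

print(comment_vector(["you", "mad", "bro"]))  # ~ [0.5, 0.2, 0.3]
```

Averaging throws away word order, which is part of why embedding-based features alone struggle with subtle trolling, where word order and context carry the insult.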
TL;DR: The researchers’ accuracy in predicting online troll attacks was still less than optimal.
Before I get into the limitations, let’s quickly look at the second study, by Tsantarliotis et al., as a second line of support. They tried to define a “troll vulnerability” metric: how likely a post is to be trolled. To do this, they first needed a dataset in which to detect troll activity, so they collected 20 submissions from each of 18 subreddits chosen by popularity, resulting in 555,332 comments. They focused on the “anti-social part of trolls” (a.k.a. offensive remarks).
So, they modified a publicly available classifier produced in a Kaggle competition for detecting offensive content, detecting 9,541 troll attacks in their dataset. Their experimental method was actually more straightforward than Mojica and Ng’s. It’s pretty lengthy, so you can read the full text if you’re interested.
Instead of training their system on linguistic features and logistic regression, they used a troll vulnerability metric with three different conditions that the program had to satisfy. They constructed models that used features from the content and the history of the post for prediction. Their results “identified a large fraction of the troll-vulnerable posts”.
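To make the idea of a vulnerability score concrete, here’s one simple, purely illustrative version: the fraction of a post’s replies that an offensiveness classifier flags as troll attacks. This is my own toy formulation; the actual metric and its three conditions are defined in Tsantarliotis et al.’s paper.

```python
# Illustrative only: score a post's "troll vulnerability" as the fraction
# of its replies flagged as troll attacks. (Not the paper's exact metric.)
def troll_vulnerability(replies, is_troll):
    """replies: list of reply texts; is_troll: function flagging a reply."""
    if not replies:
        return 0.0
    flagged = sum(1 for r in replies if is_troll(r))
    return flagged / len(replies)

# Toy stand-in for an offensive-content classifier
def is_troll(text):
    return "idiot" in text.lower()

replies = ["great point", "you idiot", "thanks for this", "IDIOT take"]
print(troll_vulnerability(replies, is_troll))  # 0.5
```

A score like this is computed over posts that already have replies; the harder prediction problem the paper tackles is estimating it for a *new* post, before any trolls show up, from the post’s content and history.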
Limitations
Trolling without curse words / aggression
“Awww you’re so cute”, “chill bro” (in the context of trolling someone), or the like fit into this category. Subtle aggressions and insults are difficult to classify and require more advanced semantic analysis.
Saying curse words…but not to troll
Ruling someone a troll for saying something like “holy f***” in jest is more petty than troll-y (this is Reddit, not a family chat hosted by your grandmother). Classification systems also automatically flag controversy-associated words, such as “racism” and “black”, even when those words aren’t being used as insults.
Sarcasm, irony, and other figures of speech
One of the earliest arguments against thinking machines, made by the famed mathematician and philosopher Descartes centuries before AI existed, concerned their inability to recognize the underlying meanings behind language. Even when systems are trained, language still consists of infinite combinations (in cognitive science we call this the compositionality of language), and we haven’t developed an AI strong enough to handle it, since current systems are strictly rule-based and systematically isolated.
Mojica and Ng suggest that a solution to this problem may be the use of deeper semantics: using deep learning to represent comments and sentences in their logical form and infer the intended meaning. But carrying out deeper semantics well enough to eliminate these limitations…is another story. Another article to come…
Tools for Semantic Analysis Programming
If you read through the article and thought, WTF, don’t worry. Even I don’t know everything about semantics…it’s a deep learning problem the experts are still figuring out. The point of this was to show that it’s possible, albeit not 100% accurate, to predict which posts trolls will likely target.
Different software tools cater to different facets of text analysis: parsing, classification, word representations, streaming through grammar rules, and more. I suggest you start with Open Semantic Search, a collaborative search engine based on Apache Lucene (the same library Mojica and Ng used for finding troll-like comments!). It can evaluate and assess documents.
Based on a thesaurus, the semantic search engine will find synonyms, hyponyms, and aliases, too. Using heuristics like stemming, it will also find other forms of a word.
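Stemming is what lets a search for one word form match the others. Real engines use algorithms like Porter’s; this toy suffix-stripper, written just for illustration, shows the basic idea of reducing word forms to a common stem.

```python
# A toy suffix-stripping "stemmer": real search engines use full algorithms
# like Porter's; this simplified version handles only a few suffixes.
def stem(word):
    for suffix in ("ings", "ing", "ers", "ed", "er", "s"):
        # Only strip if a reasonably long stem (>= 4 chars) remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)]
    return word

words = ["troll", "trolls", "trolling", "trolled", "troller"]
print({w: stem(w) for w in words})
# all five forms reduce to the stem "troll"
```

This is the same reason a single query for “troll” plus its stemmed variants can sweep up “trolling” and “trolled” comments, as in the inverted-index querying described earlier.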
For text classification: scikit-learn (machine learning with Python, woohoo!)
Parsing (analyzing a sentence into its parts and describing their syntactic roles): the SEMAFOR parser
GitHub is also a great resource for text mining and analysis. Learn more about GloVe, the word-representation software. Also look at: tools for text analysis, mining, and analytics.
Or you could start from Square 1 and look at an Introduction to Natural Language Processing, which basically describes everything I just talked about.
While it may seem confusing now, I do plan to learn about it more myself. If you’re interested in joining me, follow me on Instagram (@nullastic) and we’ll chat about it.