Linguistic Data Analysis for OOT

Müge Yerdenler
Published in Opt Out Tools
Mar 24, 2020

A corpus linguistic analysis of online harassment

Warning: This piece contains strong and offensive language.

At Opt Out Tools (OOT), we aim to build a model and train it to detect online misogyny automatically. To do that, we first need a linguistic analysis that shows which linguistic patterns are observable in our data (see how we accessed our data: https://www.optoutools.com/tech).

Our data contains 318,765 word tokens, which is more than we can reasonably go through manually. For this reason, the OOT linguist team decided to analyze the data with AntConc, one of the most popular tools for corpus linguistic analysis. AntConc offers word counts, cluster/N-Gram extraction, concordance lists, keyword lists, collocation lists, and more.

We started our analysis with N-Grams. N-Grams are clusters of words that occur next to each other, regardless of where they appear in the sentence: at the beginning, in the middle, or at the end. In AntConc you choose how many words a cluster should contain, and the software lists the clusters from the most frequent to the least. Surprisingly, “not sexist but” is the most frequent cluster in our data and, ironically, the rest of the sentence usually contains a sexist statement, as in “@WizardryOfOzil I’m not sexist, but a female standup comedian has never successfully made me laugh” or “RT @bquinn18 I’m not sexist but all women suck at driving and they should not have the right to get behind the wheel”. Another cluster, this time four words long, is “call me sexist but”, which is also mostly followed by a sexist statement. One example is “@jaykyew Call me sexist, but hearing a woman do play by play for a football game just sounds off to me”.
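To make the idea of counting clusters concrete, here is a minimal Python sketch of an N-Gram count over the three tweets quoted above. It is only an illustration, not how AntConc works internally: the regex tokenizer and the function names `tokenize` and `ngram_counts` are our own assumptions, and a different token definition would give slightly different clusters.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep word-like tokens (assumed token
    definition: letters, digits, @, #, and apostrophes)."""
    return re.findall(r"[\w@#'’]+", text.lower())

def ngram_counts(texts, n):
    """Count every run of n adjacent tokens, tweet by tweet, so that
    no cluster spans two different tweets."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# The three example tweets quoted in the paragraph above.
sample_tweets = [
    "@WizardryOfOzil I’m not sexist, but a female standup comedian "
    "has never successfully made me laugh",
    "RT @bquinn18 I’m not sexist but all women suck at driving and they "
    "should not have the right to get behind the wheel",
    "@jaykyew Call me sexist, but hearing a woman do play by play "
    "for a football game just sounds off to me",
]

trigrams = ngram_counts(sample_tweets, 3)
fourgrams = ngram_counts(sample_tweets, 4)

print(trigrams["not sexist but"])       # 2
print(fourgrams["call me sexist but"])  # 1
```

Because punctuation is dropped before counting, “not sexist, but” and “not sexist but” fall into the same cluster, which is the behaviour we want when looking for recurring phrasings.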

After the N-Gram analysis, we used the concordance tool of AntConc. With the concordance tool, you enter one or more keywords and see how frequently they occur in your data. You can also sort the results by the words to the left or to the right of the keyword. This is a really efficient way to see which words or, more generally, which linguistic structures most often collocate with your keyword. For this part of the analysis, we entered offensive words as keywords, such as “bitch/es”, “feminazi”, “whore”, “cunt”, “hysterical”, “skank”, and “hoe”, and also “rape” as a potential threat word. Of course, not all tweets that include these words are necessarily misogynistic. Depending on the context, some of them can be jokes, quotations, or criticism of the word itself. To minimize the risk of labelling non-misogynistic content as misogynistic, we first looked at which words most often appear to the left and right of our keywords and analyzed which combinations are most likely to be misogynistic. For example, when we enter “bitch” as a keyword and look at the words to its right, it mostly appears as “bitch about”, which does not necessarily constitute hate speech (see the Figure below), whereas it is offensive when it is followed by “fuck you” or “fuck u”. When we look at the words to its left, it is mostly preceded by “a”, as in “a bitch”. Looking one word further to the left, “what a bitch” appears as another popular cluster and can carry misogynistic content, as in “NOOOOOOO FUCCCKING WAY, WHAT A BITCH!!!!! YOU THOUGHT ABOUT HOOKIN UP WITH ANOTHER GUY WHILE YOU HAVE A BOYFRIEND. YOU CUNT!!!”.

Figure: A screenshot from our corpus linguistic analysis with the AntConc tool.

When we examined the words around “rape”, those preceding it, such as “about” and “against”, and those following it, such as “victim” and “culture”, did not carry misogynistic content; they came from comments on rape or rape incidents in general. Rape threats, however, show up in the words preceding “rape”, mostly “I” or “I’ll”, although they are few in number: only four. Examples include “Rapping on twitter, get yo ass back in the kitchen before I rape that lil bitch in your profile pic” and “Ill rape women but Ill respect cow”.
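The left/right sorting we did in the concordance tool boils down to counting which words sit next to a keyword. Below is a minimal Python sketch of that count. It is an approximation for illustration only, not AntConc's implementation: the function names `tokenize` and `neighbour_counts` are our own, and the sample lines are invented placeholders of the non-misogynistic kind described above, not lines from our corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep word-like tokens (assumed token
    definition: letters, digits, @, #, and apostrophes)."""
    return re.findall(r"[\w@#'’]+", text.lower())

def neighbour_counts(texts, keyword, offset):
    """Count the words found `offset` positions away from `keyword`.
    offset=-1 is the word directly to the left, offset=1 directly to
    the right; larger offsets look further away."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            j = i + offset
            if token == keyword and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts

# Invented placeholder lines (NOT from the OOT corpus).
sample_lines = [
    "We need to talk about rape culture",
    "A campaign against rape",
    "Support for every rape victim",
    "Stop joking about rape",
]

left = neighbour_counts(sample_lines, "rape", -1)
right = neighbour_counts(sample_lines, "rape", 1)

print(left.most_common())   # left-hand neighbours, most frequent first
print(right.most_common())  # right-hand neighbours, most frequent first
```

On real data, a left-hand list dominated by words like “about” and “against” points to discussion of the topic, while first-person forms such as “I” or “I’ll” in that position are the ones worth reading in context.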

So far we have described a cross-section of our data analysis and highlighted our first results. The more we analyze, the more interesting the results get. There is still more to scrutinize, and the final step is to work with the data and engineering teams to bring our linguistic findings into our machine learning system.

Müge Yerdenler
Linguist, Language Teacher, Maintainer in OOT