Differentiating hateful messages from friendly ones

An attempt at detecting hostilities automatically on social networks

Charles Perriard
Empathic Labs
7 min read · Jul 29, 2020


Whether on the Internet or in real life, swear words are not always used for hateful purposes. Good friends, for instance, often insult each other without any intention of causing harm. Swear words can also appear in quotations of other messages, in which case they do not target anyone specifically. Another example of neutral content is an insult that a user addresses to himself.

Self-mockery
Friendly insult
Reporting / Quoting

Being able to automatically detect real hostilities on social networks, or on any messaging application, could be useful to prevent conflicts, hate speech or even cyberbullying. This was the goal of my Advanced Project, carried out during the second semester of my Master at the University of Applied Sciences of Western Switzerland (HES-SO). To do so, a classifier had to be created and a web application set up to collect and analyze messages from social networks.

Social network choice

The choice of the social network used to collect and analyze the messages came naturally, since Twitter is the only one that makes its data easily available. Indeed, an easy-to-use API allows anyone to retrieve tweets, and other information, as long as they are public.

Dataset creation

Building a classifier implies having data to train and test it. Unfortunately, I did not find any similar project, so I had to take on the most interesting task myself: labeling the data.

Using a small Python script, around 4'000 French tweets, all containing swear words such as “con” (jerk), “débile” (retard) or “pute” (bitch), were stored in a database. In total, 99 swear words were used as search terms. Finally, all tweets were labeled manually as “hateful” or as “neutral”.
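The collection script itself is not shown here; a minimal sketch of what it could look like, assuming the Tweepy library (v3-era API) and placeholder credentials, is the following:

```python
import sqlite3

import tweepy  # one possible client for the Twitter API

SWEAR_WORDS = ["con", "débile", "pute"]  # excerpt of the 99 search terms

# Placeholder credentials: replace with your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

db = sqlite3.connect("tweets.db")
db.execute("CREATE TABLE IF NOT EXISTS tweets"
           " (id TEXT PRIMARY KEY, text TEXT, label TEXT)")

for word in SWEAR_WORDS:
    # Fetch recent public French tweets containing the swear word;
    # the label column stays empty until manual annotation.
    for status in tweepy.Cursor(api.search, q=word, lang="fr",
                                tweet_mode="extended").items(40):
        db.execute("INSERT OR IGNORE INTO tweets VALUES (?, ?, NULL)",
                   (status.id_str, status.full_text))
db.commit()
```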

Tweet labeled as hateful
Tweet labeled as neutral

Feature engineering

With the dataset ready to be exploited, the feature creation phase could start. Different features were generated, and a few of them are presented below.

Presence of swear words

The first type of feature simply indicates the presence or absence of each swear word. This results in a sparse binary matrix with 99 columns, one per swear word.

Presence of swear words: feature example
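The implementation is not shown in the article; one straightforward way to obtain such a matrix is scikit-learn’s CountVectorizer with a fixed vocabulary (a sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

SWEAR_WORDS = ["con", "débile", "pute"]  # excerpt of the 99-word list

# binary=True yields presence/absence instead of counts; the fixed
# vocabulary guarantees exactly one column per swear word.
vectorizer = CountVectorizer(vocabulary=SWEAR_WORDS, binary=True)
presence = vectorizer.fit_transform(["Quel con celui-là", "Bonne journée"])
print(presence.toarray())  # [[1 0 0]
                           #  [0 0 0]]
```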

Number of swear words

Here, a single feature indicates the number of swear words that the tweet contains.

Number of swear words: feature example
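Continuing the previous sketch, dropping binary=True turns presence into occurrence counts, and summing each row gives this single feature:

```python
# Raw counts instead of presence/absence, then one sum per tweet.
counts = CountVectorizer(vocabulary=SWEAR_WORDS).fit_transform(
    ["Quel con, mais quel con !", "Bonne journée"])
n_swear_words = counts.sum(axis=1)  # one value per tweet: [[2], [0]]
```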

Pronouns used

The pronouns used in a sentence can give a good indication of the target of an insult. For example, first-person pronouns (je, j’, me, ma, mon, etc.) often imply self-mockery, while second-person ones (tu, t’, te, ta, ton, etc.) are often used to address other people.

To detect whether these pronouns were present in the different texts, a simple string-matching search was performed.

Pronouns used features example
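One possible implementation of this matching, using regular expressions to handle the elided forms (j’, t’), is sketched below:

```python
import re

# Word-boundary patterns for first- and second-person pronouns.
FIRST_PERSON = re.compile(r"\b(je|j'|me|m'|ma|mon|mes)\b", re.IGNORECASE)
SECOND_PERSON = re.compile(r"\b(tu|t'|te|ta|ton|tes)\b", re.IGNORECASE)

def pronoun_features(text):
    """Return two binary features: first- and second-person pronoun presence."""
    return [int(bool(FIRST_PERSON.search(text))),
            int(bool(SECOND_PERSON.search(text)))]

print(pronoun_features("T'es vraiment con"))    # [0, 1]
print(pronoun_features("Je suis trop débile"))  # [1, 0]
```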

Presence of smileys and Internet slang

Some smileys such as 😂 or 😜 have a positive connotation, while 😠 or 🖕 have a negative one. The same can be said of Internet slang, with words like “lol” or “mdr” (“mort de rire”, French for “dying of laughter”). Punctuation must also be taken into account, for example a repetition of exclamation marks.

To find out whether those symbols could help the model differentiate hateful messages from neutral ones, the following two features were added.

Smileys and Internet slang: feature example
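As a sketch, these two features could be computed from hand-made symbol lists; the lists below are hypothetical and the project’s exact lists may differ:

```python
# Hypothetical connotation lists (smileys, slang, punctuation).
POSITIVE = ["😂", "😜", "lol", "mdr"]
NEGATIVE = ["😠", "🖕", "!!!"]

def connotation_features(text):
    """Count positive and negative symbols in a tweet."""
    lowered = text.lower()
    return [sum(lowered.count(s) for s in POSITIVE),
            sum(lowered.count(s) for s in NEGATIVE)]

print(connotation_features("T'es trop con mdr 😂"))  # [2, 0]
```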

Preprocessing, word embedding and TF-IDF

In order to compute the word embedding and TF-IDF features, the texts were first preprocessed: lowercase conversion, stop word removal and lemmatization were applied. The following example illustrates the transformation.

Preprocessing
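The article does not name the tools used; with spaCy’s French model, for instance, the three steps can be chained as follows:

```python
import spacy

# Requires: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

def preprocess(text):
    """Lowercase, remove stop words and punctuation, lemmatize."""
    doc = nlp(text.lower())
    return " ".join(token.lemma_ for token in doc
                    if not token.is_stop and not token.is_punct)

print(preprocess("Les enfants jouaient dehors"))  # e.g. "enfant jouer dehors"
```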

Word embedding is a technique that converts each word into a vector of length n representing the word’s position in a vector space. The goal of this transformation is to obtain close positions for similar words. For example, the words “good” and “great”, which are often interchangeable, should end up more or less close to each other.

If all tweets contained the same number w of words, n*w features could be generated. Since this is not the case, vector averaging can be applied instead: the n-dimensional vectors of all the words of a tweet are averaged into a single vector of length n.

Here, each tweet has been transformed into a vector of size 10, as shown in the following example.

Example of word embedding
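A sketch of this averaging, assuming a Word2Vec model trained with gensim (version 4 naming) on the preprocessed tweets and the vector size of 10 used in the project:

```python
import numpy as np
from gensim.models import Word2Vec

# Tokenized, preprocessed tweets (toy corpus for illustration).
corpus = [["espèce", "con"], ["bon", "journée"], ["gros", "débile"]]

# vector_size=10 matches the 10 features per tweet mentioned above.
model = Word2Vec(sentences=corpus, vector_size=10, min_count=1, seed=1)

def tweet_vector(tokens):
    """Average the word vectors of a tweet; zeros if no word is known."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(10)

print(tweet_vector(["espèce", "con"]).shape)  # (10,)
```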

The TF-IDF value of a word represents how important the word is to a document in a collection. It increases with the number of occurrences of the word in the document, in our case the tweet, and decreases with the number of documents in the dataset that contain it.

The computation of the TF-IDF values resulted in the creation of 10'079 features.
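With scikit-learn, for instance, computing these values is nearly a one-liner; the number of columns equals the vocabulary size of the preprocessed tweets (10'079 on the real dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

preprocessed_tweets = ["espèce con", "bon journée", "gros débile con"]

# One column per word of the corpus vocabulary; on the real
# dataset this step produced 10'079 features.
tfidf = TfidfVectorizer().fit_transform(preprocessed_tweets)
print(tfidf.shape)  # (3, 6)
```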

Evaluation

Once the features were created, several tests were run to evaluate the performance of different models and to select the best features.

What emerged from the results of this phase is how much weight the swear words themselves carry. Indeed, as one might assume, some swear words tend to be used more often in a hateful context than in a neutral or positive one.

Rate of appearance in hateful texts, by swear word

As can be seen, the word “baleine” (whale) appears most of the time in neutral texts. This is because its meaning depends a lot on the context: the word is not always used as an insult. On the other hand, the words “abruti” (moron) and “avorton” (runt) are used to insult a third party most of the time.

This phenomenon misled the machine learning models. Using only the presence or absence of swear words as features, the following scores were reached with a random forest algorithm and a neural network:

Results using only the presence / absence of swear words

Although way better than those of a random classification, these scores are still far from perfect.
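As an indication, this kind of comparison can be reproduced with scikit-learn; X and y below are placeholders, and the exact hyper-parameters used in the project are not specified in the article:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((200, 99))    # placeholder feature matrix
y = rng.integers(0, 2, 200)  # placeholder hateful (1) / neutral (0) labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              MLPClassifier(max_iter=500, random_state=42)):
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, f"{accuracy:.2%}")
```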

Using all the features created, the scores improved slightly:

Results using all the features

Finally, a selection based on the importance of the features made it possible to reach an accuracy of 72.76%, as can be seen below.

Accuracy evolution as a function of the number of features (sorted by feature importance)
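Roughly, this selection can be reproduced by ranking the features with the forest’s importance scores and retraining on the top k; a sketch reusing the placeholder arrays from the previous snippet:

```python
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Indices of the k most important features (400 in the project).
k = min(400, X_train.shape[1])
top_k = np.argsort(forest.feature_importances_)[::-1][:k]

# Retrain and evaluate using only the selected columns.
selected = RandomForestClassifier(n_estimators=100, random_state=42)
selected.fit(X_train[:, top_k], y_train)
print(selected.score(X_test[:, top_k], y_test))
```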

Among the 400 selected features, most were TF-IDF features, which thus continued to indicate indirectly to the model which swear words were present in the texts. This explains why the two following texts are wrongly classified.

Text wrongly classified as hateful because of the “fils de pute” insult
Text wrongly classified as neutral because of the “tapette” word

Conclusion

The final model built is far from perfect, and there are several reasons that may explain the relatively poor performance. First of all, classifying whether a text is hateful or not is anything but trivial: it depends on so many parameters that even a human could be mistaken. The same sentence could be considered friendly in one context and hateful in another. Furthermore, the fact that most people do not bother to write properly on Twitter does not help either when it comes to analyzing textual content.

It is also possible that the model currently in use relies too much on the statistics of the various insults. For example, if an insult appears predominantly in positive samples rather than negative ones, the model will have difficulty correctly predicting samples that contain this insult but whose class should be negative. Of course, the opposite is also true.

Although the scores obtained so far can certainly be improved, it seems difficult to imagine this classification task ever coming close to perfection. Many of the messages posted online relate to events from the users’ private lives: they may simply be the continuation of a discussion started at work or of an altercation at school, and this information is not accessible. But after all, maybe the real solution is to teach people to communicate without swearing at or insulting each other.

Thanks for reading!

Web application

A small demo of the application can be viewed on YouTube.

If you’re interested in text analysis, I suggest reading Luke’s article about datetime extraction in chatbot messages, and if empathy in HCI in general is your thing, all the Empathic Labs articles are listed here. Enjoy!

