Week 4: Hate Speech Detection on Social Media

Ege Çınar
Published in bbm406f19 · 2 min read · Dec 22, 2019

Team Members: Ege ÇINAR, Gökhan ÖZELOĞLU, Yiğit Barkın ÜNAL

Last week we discussed the Naive Bayes classifier and explained why it is convenient to use. This week we first preprocessed our Twitter data, then implemented our model and obtained our first results.

Our Twitter data contained mentions, hashtags, HTTP links, and emojis. We discarded those parts of the tweets and used the resulting dataset in our Multinomial Naive Bayes model.
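The cleaning step described above can be sketched with simple regular expressions. This is a minimal illustration, not our exact preprocessing code; the example tweet and the ASCII-based emoji stripping are assumptions for demonstration.

```python
import re

def clean_tweet(text):
    """Strip mentions, hashtags, links, and emojis from a tweet (illustrative sketch)."""
    text = re.sub(r"@\w+", "", text)                 # remove Twitter mentions
    text = re.sub(r"#\w+", "", text)                 # remove hashtags
    text = re.sub(r"https?://\S+", "", text)         # remove HTTP(S) links
    text = text.encode("ascii", "ignore").decode()   # crude emoji / non-ASCII removal
    return " ".join(text.split())                    # collapse leftover whitespace

print(clean_tweet("@user check this out #hate http://t.co/xyz"))
# check this out
```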

Stop words are commonly used words in a language, such as "the", "is", "are", and "on". They are usually discarded because they appear so frequently, which lets us focus on the more informative words instead.

An n-gram groups n adjacent words together, letting us discover whether words are more meaningful as a unit than individually.
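As a small illustration of the idea, a helper that produces the 2-grams of a tokenized sentence (the example sentence is made up):

```python
def ngrams(tokens, n):
    """Return the list of n-grams: tuples of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("i really hate mondays".split(), 2))
# [('i', 'really'), ('really', 'hate'), ('hate', 'mondays')]
```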

TF-IDF is an acronym for Term Frequency-Inverse Document Frequency. It is a technique used to measure how important a word is to a document within a collection of documents.

The dataset we use has three labels: hate, offensive, and neither. We used three feature extraction techniques: n-grams, stop word removal, and TF-IDF. We also experimented with 2-grams, 3-grams, and 4-grams.

We had memory problems while storing the text in an array, so we reduced the size of our data by ignoring terms whose document frequencies are lower than 0.0005 for 2-grams, 3-grams, and 4-grams. We split the data into training and test sets, 70% and 30% respectively. These are our results.
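The overall setup can be sketched with scikit-learn, which we assume here; the tweets and labels below are made-up placeholders, not our dataset, and the `min_df` threshold plays the role of the 0.0005 document-frequency cutoff mentioned above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Placeholder data (illustrative only): 0 = hate, 1 = offensive, 2 = neither
tweets = ["i hate you", "have a nice day", "you are awful",
          "what a lovely morning", "everyone like you should disappear",
          "great game last night"]
labels = [0, 2, 1, 2, 0, 2]

# TF-IDF features with stop word removal and n-grams; min_df drops terms
# whose document frequency falls below the threshold to save memory
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", min_df=1)
X = vectorizer.fit_transform(tweets)

# 70% training / 30% test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0)

model = MultinomialNB().fit(X_train, y_train)
print(model.score(X_test, y_test))
```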

We had the best accuracy with unigrams. The reason could be the length of tweets, which are relatively short.

That’s it for today. See you again in the upcoming weeks.
