Week 3-Hate Speech Detection on Social Media

Yiğit Barkın Ünal
bbm406f19
Dec 15, 2019

Team Members: Yiğit Barkın ÜNAL, Gökhan ÖZELOĞLU, Ege ÇINAR

Introduction

This week, we have focused on how to preprocess our data before using it and which model we should use.

Our data contains a lot of unnecessary content, such as account names, symbols, links, and emojis, and we have to eliminate it before feeding the data to our model, to make the model more accurate and more efficient. To do that, we have used the regex library in Python. After preprocessing the data, we looked at possible models. In last week's update, we discussed related work. One of the works we covered ("Learning from Bullying Traces in Social Media") used a Naïve Bayes classifier, an SVM with a linear kernel, an SVM with an RBF kernel, and logistic regression, with unigrams, unigrams+bigrams, and POS-colored unigrams+bigrams as features. This week, we have focused on the Naïve Bayes classifier with unigrams.
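As a rough sketch of this kind of cleaning (the exact patterns below are illustrative assumptions, not our final script), removing mentions, links, symbols, and emojis with Python's `re` module could look like this:

```python
import re

def clean_tweet(text):
    """Strip account names, links, symbols, and emojis from a tweet.
    The patterns here are illustrative, not our exact preprocessing rules."""
    text = re.sub(r"@\w+", "", text)                    # account names (@mentions)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # links
    text = re.sub(r"[^A-Za-z\s]", "", text)             # symbols, digits, emojis
    return re.sub(r"\s+", " ", text).strip().lower()    # collapse whitespace

print(clean_tweet("@user Check this out!!! https://t.co/xyz 😡 #hate"))
# → check this out hate
```

Note that the character-class filter also strips hashtag symbols while keeping the hashtag's word, which is often what you want for bag-of-words features.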

Before going into the details, let's discuss the Naïve Bayes (NB) classifier. NB is a generative classifier: instead of modeling the posterior probability directly, it works with the likelihood and the prior probabilities. It uses the Bag of Words representation, which means the model ignores where a word appears in the text, and it assumes the word probabilities are conditionally independent given the class. We are using unigrams in our classification, meaning we consider the words one by one.
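To make the idea concrete, here is a minimal unigram Naïve Bayes sketch with add-one (Laplace) smoothing. The function names and the toy documents are our own illustration, not the dataset or implementation we will report results with:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate priors P(c) and per-class unigram counts from whitespace-tokenized docs."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc.split())
    vocab = {w for cnt in counts.values() for w in cnt}
    return priors, counts, vocab

def predict_nb(doc, priors, counts, vocab):
    """Pick the class maximizing log P(c) + sum over words of log P(w|c),
    using add-one smoothing; unseen words are ignored."""
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(counts[c].values())
        score = math.log(prior)
        for w in doc.split():
            if w in vocab:
                score += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Tiny made-up example, just to show the mechanics:
docs = ["you are awful", "i hate you", "have a nice day", "good morning friend"]
labels = ["hate", "hate", "clean", "clean"]
priors, counts, vocab = train_nb(docs, labels)
print(predict_nb("i hate this", priors, counts, vocab))  # → hate
```

Because the position of each word never enters the computation, this is exactly the Bag of Words assumption described above: only unigram counts matter.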

The reason we have focused on the Naïve Bayes classifier is that it is one of the most fundamental models, it is easy to implement, and it tends to give accurate predictions on small datasets. We decided to start with one of the simpler, less accurate models and work our way toward more accurate ones.

This is it for this week, thank you for reading and stay tuned for other updates :)

Previous Weeks

Week 2- Hate Speech Detection on Social Media

Week 1- Hate Speech Detection on Social Media
