Hate Speech Detection with Machine Learning
Introduction
Hate speech is a persistent problem on social media platforms such as Twitter and Facebook, and it often surfaces in heated political discussion. If you want to learn how to build a hate speech detection model with machine learning, this article is for you. I’ll walk you through building a hate speech detection system in Python, step by step.
Defining hate speech can be complex, as it often involves subjective interpretation and can vary across different contexts. However, the United Nations broadly defines hate speech as any form of verbal, written, or behavioral communication that targets or employs discriminatory language against individuals or groups based on their identity, including factors such as religion, ethnicity, nationality, race, color, ancestry, gender, or other identity markers.
Now that we have a working definition of hate speech, it’s clear that social media platforms have an important role in detecting it and limiting its spread. In the following sections, I’ll walk through hate speech detection with machine learning in Python, as one small part of moderating online discourse and keeping digital spaces safer.
📚Sections
Introduction
Import Libraries
Data Loading
Data Preprocessing
Data Splitting
Model Training
Model Testing
📚 Import Libraries
Hate Speech Detection using Python
The dataset I’m using for the hate speech detection task is downloaded from Kaggle. This dataset was originally collected from Twitter and contains the following columns:
- index
- count
- hate_speech
- offensive_language
- neither
- class
- tweet
!pip install nltk
NLTK, or Natural Language Toolkit, is a comprehensive Python library for natural language processing (NLP) tasks. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. In this project we only need two pieces of it: the English stopword list and the Snowball stemmer.
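To make the idea of stemming concrete, here is a minimal sketch of the Snowball stemmer on a few words. The word list is made up for illustration; only the stemmer itself is used later in the article.

```python
from nltk.stem import SnowballStemmer

# The same English Snowball stemmer that the cleaning step relies on
stemmer = SnowballStemmer("english")

# Hypothetical example words: inflected forms collapse to a shared stem
words = ["running", "runs", "protesting", "protests"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Collapsing inflected forms like this keeps the vocabulary smaller, so the model sees "protesting" and "protests" as the same feature.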
import re
import string

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Download the stopword list once, then build the helpers used during cleaning
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))
📚Data Loading
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv("/content/drive/MyDrive/Datasets (1)/Hate Speech/twitter.csv")
print(data.head())
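Before cleaning anything, it can help to look at how the labels are distributed. The sketch below uses a tiny made-up DataFrame in place of the real CSV, and it assumes that the class column encodes 0 = hate speech, 1 = offensive language, 2 = neither. That encoding is an assumption about this dataset, so check it against the dataset description before relying on it.

```python
import pandas as pd

# Toy stand-in for the real CSV (hypothetical rows, same column names)
toy = pd.DataFrame({
    "class": [0, 1, 1, 2, 1],
    "tweet": ["a", "b", "c", "d", "e"],
})

# Assumed meaning of the numeric class codes; verify against the dataset docs
label_names = {0: "hate speech", 1: "offensive language", 2: "neither"}
toy["label"] = toy["class"].map(label_names)

# How many tweets fall under each label
counts = toy["label"].value_counts()
print(counts)
```

On the real data you would run the same two steps on data instead of toy. A heavily skewed distribution is worth knowing about before judging any accuracy number.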
📚Data Preprocessing
Now I will create a function to clean the texts in the tweet column:
def clean(text):
    """Lowercase a tweet, strip noise, drop stopwords, and stem what is left."""
    text = str(text).lower()
    text = re.sub(r'\n', '', text)                      # line breaks
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>+', '', text)                  # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                # tokens containing digits
    # Remove stopwords first, then reduce each remaining word to its stem
    words = [word for word in text.split(' ') if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)
data["tweet"] = data["tweet"].apply(clean)
data
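To see what the regular expressions in the cleaning step actually do, here is a small sketch that applies them one at a time to a made-up tweet. It uses only the standard library and leaves out stopword removal and stemming, so it illustrates the regex steps rather than reproducing the full function.

```python
import re
import string

# Hypothetical tweet containing a URL, an HTML-like tag, punctuation, and a digit token
sample = "Check THIS out: https://example.com <b>wow</b> 2day!!!"

step = sample.lower()
step = re.sub(r"https?://\S+|www\.\S+", "", step)                 # drop URLs
step = re.sub(r"<.*?>+", "", step)                                # drop HTML-like tags
step = re.sub("[%s]" % re.escape(string.punctuation), "", step)   # drop punctuation
step = re.sub(r"\w*\d\w*", "", step)                              # drop tokens containing digits

tokens = step.split()
print(tokens)  # ['check', 'this', 'out', 'wow']
```

Note that the digit rule removes the whole token "2day", not just the digit, which is the intended behavior of the pattern.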
📚Data Splitting
Now let’s split the dataset into training and test sets and train a machine learning model for the task of hate speech detection:
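x = np.array(data["tweet"])
# Note: in the usual copy of this dataset, hate_speech stores how many annotators
# flagged the tweet as hate speech, so the model below predicts that count;
# the majority label is stored in the "class" column.
y = np.array(data["hate_speech"])

cv = CountVectorizer()
X = cv.fit_transform(x)  # learn the vocabulary and build the bag-of-words matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)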
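If the bag-of-words representation is new to you, the following minimal sketch shows what CountVectorizer produces for three made-up documents. Each column is one vocabulary word and each cell is how often that word occurs in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not taken from the dataset
docs = ["good day", "bad day", "good good day"]

vec = CountVectorizer()
matrix = vec.fit_transform(docs)

# vocabulary_ maps each word to its column index; sorting gives column order
vocab = sorted(vec.vocabulary_)
print(vocab)             # ['bad', 'day', 'good']
print(matrix.toarray())  # one row per document, one column per word
```

The real matrix has the same shape of information, just with one row per tweet and thousands of columns, which is why it is stored as a sparse matrix.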
📚Model Training
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)
📚Model testing
y_pred_DT = clf.predict(X_test)
DT_Acc = clf.score(X_test, y_test)  # mean accuracy on the held-out test set
print('Accuracy score= {:.4f}'.format(DT_Acc))
#Accuracy score= 0.7409
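Accuracy alone can be misleading when one label dominates the data. The sketch below uses made-up labels (not results from this model) to show how a classifier that always predicts the majority class still reaches 80% accuracy, and how a confusion matrix and per-class report expose the problem.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: eight examples of class 0, two of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A "model" that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(acc)  # fraction of correct predictions
print(cm)   # rows are true classes, columns are predicted classes
# zero_division=0 avoids warnings for the class that is never predicted
print(classification_report(y_true, y_pred, zero_division=0))
```

On the real model, you could pass y_test and y_pred_DT to the same functions to see which labels the decision tree gets wrong.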
Now let’s test this machine learning model to see if it detects hate speech or not:
sample = "Let's unite and kill all the people who are protesting against the government"
# Apply the same cleaning used for training before vectorizing the new text
sample_vec = cv.transform([clean(sample)])
print(clf.predict(sample_vec))
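One way to make sure new text always goes through the same steps as the training data is to bundle the vectorizer and the classifier into a single scikit-learn pipeline. This is an alternative arrangement, not the approach used above, and the tiny training set and its labels are invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 1 = hostile, 0 = harmless
texts = ["i hate you", "you are awful", "have a nice day", "what a lovely day"]
labels = [1, 1, 0, 0]

# The pipeline vectorizes raw strings and feeds the result to the tree
model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit(texts, labels)

# Raw strings go straight in; no separate transform call is needed
pred = model.predict(["have a lovely day"])
print(pred)
```

In the real project, passing the cleaning function as CountVectorizer(preprocessor=clean) would apply the same cleaning at both training and prediction time.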