Tweet Analysis using Naive Bayes Algorithm in R

Apoorva Jain · Published in Analytics Vidhya · Feb 7, 2020 · 3 min read

In this blog, we analyze the sentiment of tweets about the August 2016 Presidential debate in Ohio. We perform both data categorization and content analysis to answer whether a tweet was relevant, which candidate was mentioned most, and what the sentiment of each tweet was.

Read the data into R. The data is also available in a SQL database, but here we load the CSV file into R.

data = read.csv("Sentiment.csv")
head(data)

Structuring the data is the most vital part of this process. The “Sentiment” dataset contains various other columns that are not relevant, so we select only the “text” and “sentiment” columns and drop the rest.

library(tidyverse)
datas = data %>% select(text, sentiment)
head(datas)
round(prop.table(table(datas$sentiment)), 2)

Output after structuring the data:

Data Cleaning:


library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(datas$text))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
as.character(corpus[[1]])

Output after cleaning the text:

To count the frequency of each word across the whole collection, we build a document-term matrix, which represents the corpus in a numerical form.

dtm = DocumentTermMatrix(corpus)
dtm
dim(dtm)
dtm = removeSparseTerms(dtm, 0.999)
dim(dtm)

Below is the list of words from our text that appear at least 100 times.
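The original post does not show the call that produces this list; a sketch using `tm`’s `findFreqTerms` on a tiny demo corpus (hypothetical data, standing in for the debate tweets) looks like this:

```r
library(tm)

# Toy corpus standing in for the debate tweets (hypothetical data)
docs <- VCorpus(VectorSource(c("great debate tonight",
                               "great candidate",
                               "great great answer")))
dtm_demo <- DocumentTermMatrix(docs)

# Terms whose total frequency across all documents is at least 3
findFreqTerms(dtm_demo, lowfreq = 3)
```

Applied to the matrix built above, the call would be `findFreqTerms(dtm, lowfreq = 100)`.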

We visualize the text data with word clouds, which give a quick view of the most-used words in each sentiment class.

# install.packages("RColorBrewer")   # wordcloud requires RColorBrewer
library(wordcloud)
library(ggplot2)

positive = subset(datas, sentiment == "Positive")
wordcloud(positive$text, max.words = 100, colors = "blue")
negative = subset(datas, sentiment == "Negative")
wordcloud(negative$text, max.words = 100, colors = "purple")
neutral = subset(datas, sentiment == "Neutral")
wordcloud(neutral$text, max.words = 100, colors = "turquoise")

Further, we use the machine-learning algorithm Naive Bayes for prediction, discussed below.

# Naive Bayes expects binary (present/absent) features,
# so convert each word count into a "No"/"Yes" factor
convert <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}
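The conversion can be sanity-checked on a toy count vector (hypothetical input, not from the dataset):

```r
# Same binary conversion as above: counts > 0 become "Yes", zeros become "No"
convert <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("No", "Yes"))
}

convert(c(0, 2, 1, 0))  # No Yes Yes No
```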

datanaive = apply(dtm, 2, convert)

dataset = as.data.frame(as.matrix(datanaive))
dataset$Class = datas$sentiment
str(dataset$Class)

Data Splitting


set.seed(31)
split = sample(2,nrow(dataset),prob = c(0.75,0.25),replace = TRUE)
train_set = dataset[split == 1,]
test_set = dataset[split == 2,]

prop.table(table(train_set$Class))
prop.table(table(test_set$Class))

The proportion of the Train Dataset
The proportion of the Test Dataset
# naive bayes
install.packages("e1071")
library(e1071)
library(caret)

# Note: trainControl()/tuneLength belong to caret::train(); e1071::naiveBayes()
# silently ignores them, so the model relies on Laplace smoothing alone.
# (Strictly, the Class column should also be dropped from the predictors.)
system.time(
  classifier_nb <- naiveBayes(train_set, train_set$Class, laplace = 1)
)

Output of the model:

Evaluating the Naive Bayes model on the test set gives a predicted accuracy of 98.67%, as shown below:

# model evaluation

nb_pred = predict(classifier_nb, type = "class", newdata = test_set)
confusionMatrix(nb_pred, test_set$Class)

Statistics of the Naive Bayes model:

Thank you for reading my article. I hope you gained some insight into how machine-learning algorithms work for content analysis.
