Sentiment Analysis (Part-1)

Pavithra Reddy · Published in TechJingles · Apr 25, 2021

Hello everyone! I’m Pavithra, a 3rd-year student pursuing a Bachelor of Technology in Computer Science. As part of my coursework, I worked on a project titled “SENTIMENT ANALYSIS USING MACHINE LEARNING AND LEXICAL ANALYSIS” along with my teammates Nikhita, Pavana, and Rutuja (who will interact with you in the upcoming posts).

The main aim of this project was to build a model that determines the tone (positive or negative) of a piece of text. To do this, we train the model on existing data (train.csv). The resulting model then has to classify new texts (test data that were not used to build the model) with maximum accuracy.

Do check out the slideshow below, which gives a short overview of the project.

Importing Libraries and Reading Data

So first, we start by importing all the necessary libraries, such as NumPy, pandas, Matplotlib, seaborn, NLTK, string, PorterStemmer, etc.
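The import block for the libraries listed above might look like this (a minimal sketch; only the libraries named in the text are included):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import string
from nltk.stem import PorterStemmer  # used later for stemming tweet tokens
```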

The dataset for the project was taken from Kaggle (the Sentiment140 dataset). It contains 1,600,000 tweets extracted using the Twitter API, each annotated for sentiment (0 = negative, 4 = positive). You can download the dataset from here.

If you look at the dataset closely, each column has its own meaning:
Column 0 has the polarity of the tweet (0 = negative, 4 = positive),
Column 1 has the id of the tweet,
Column 2 has the date of the tweet,
Column 3 has the query (lyx); if there is no query, this value is NO_QUERY,
Column 4 has the name of the user that tweeted, and lastly
Column 5 has the text of the tweet.
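Since the CSV ships without a header row, the six columns above have to be named when reading it. A sketch with pandas (the column names and the tiny in-memory sample are ours; in practice you would pass the path of the downloaded Kaggle file instead):

```python
import io
import pandas as pd

# Illustrative stand-in for the real file, which has the same
# six-column layout but no header row.
sample = io.StringIO(
    '0,1,"Mon Apr 06 22:19:45 PDT 2009",NO_QUERY,someuser,"is sad today"\n'
    '4,2,"Mon Apr 06 22:19:49 PDT 2009",NO_QUERY,otheruser,"loving this!"\n'
)
cols = ['sentiment', 'id', 'date', 'query', 'user', 'text']
df = pd.read_csv(sample, header=None, names=cols)
print(df[['sentiment', 'text']])
```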

Reading the content from the dataset
Few sentiment = 0 (Negative) tweets from the dataset
Few sentiment = 4 (Positive) tweets from the dataset

Training the Data

Now we draft our data dictionary; this will be further updated in later posts.

Next, we plot the distribution of the length of the strings in each entry.

Graphical distribution of length of strings
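A minimal sketch of how such a length histogram can be produced (the tiny sample series and file name are our own; the real plot uses the tweet text column of the full dataset):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the dataset's text column
texts = pd.Series(['good day', 'terribly long tweet about nothing much', 'ok'])

lengths = texts.str.len()
lengths.plot(kind='hist', bins=10, title='Distribution of tweet lengths')
plt.xlabel('characters per tweet')
plt.savefig('length_dist.png')
```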

Data Cleaning and Data Preparation

Now it's time for some real cleaning!

We have written a data-cleaning function that is applied to the whole dataset.

In data cleaning, we have added a few more commands for cleaning the data thoroughly and preparing it for sentiment analysis:

1. HTML decoding
2. Removing @ mentions
3. Removing URL links
4. Removing the byte order mark (BOM)
5. Removing hashtags and numbers

Now we finally clean and parse all the data.
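The five steps above can be sketched as a single cleaning function. This is our own illustration, not the project's exact code: the function name, regular expressions, and final lowercasing are assumptions on top of the steps listed in the text.

```python
import html
import re

def clean_tweet(text):
    """Apply the cleaning steps listed above to one tweet (illustrative)."""
    # 1. HTML decoding (&amp; -> &, &lt; -> <, etc.)
    text = html.unescape(text)
    # 4. Strip a byte-order-mark artefact if one leaked into the text
    text = text.replace('\ufeff', '').replace('ï»¿', '')
    # 2. Remove @mentions
    text = re.sub(r'@\w+', '', text)
    # 3. Remove URL links
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # 5. Remove hashtags, numbers, and other non-letter characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse whitespace and lowercase for downstream analysis
    return ' '.join(text.split()).lower()

print(clean_tweet('@bob check https://t.co/xyz &amp; #fun 123 GREAT!!'))
```

Applying it to the whole dataset is then a single `df['text'].apply(clean_tweet)` call.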

Saving the Cleaned Dataset

All the cleaned data is stored in a new CSV file named clean_tweet_texts.csv which will be used for further analysis.
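Saving the cleaned tweets could look like the sketch below (the two-row DataFrame and its column names are placeholders for the real cleaned dataset):

```python
import pandas as pd

# Placeholder for the cleaned dataset built in the previous step
clean_df = pd.DataFrame({'text': ['check fun great', 'loving this'],
                         'target': [0, 4]})

# Persist for the analysis in the next post
clean_df.to_csv('clean_tweet_texts.csv', index=False)

# Quick sanity check that the file round-trips
reloaded = pd.read_csv('clean_tweet_texts.csv')
```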

So I will conclude the first part of the project by saving the cleaned dataset. In the upcoming post, we will create word clouds for positive and negative sentiments and apply Zipf’s law, along with some data visualization.

Thanks for reading the post; I hope you found it useful. See you all in the upcoming posts!

Link to the next post (Sentiment Analysis — Part 2): https://medium.com/techjingles/sentiment-analysis-part-2-e72fe28af19a
