This is Part 2 of the Sentiment Analysis implementation series. If you haven’t read Part 1, I suggest reading “Sentiment Analysis of a Youtube Video” first. In this part we will clean our text data to get meaningful insights.
We are performing sentiment analysis, part by part, on the YouTube video “The uprising of India’s farmers: What’s behind the protests?”, published by Global News.
This video is about “Farmers in India have been rallying for months against three agricultural laws enacted Sept. 20 by Prime Minister Narendra Modi’s government. The Indian government has argued the changes will give farmers more freedom, but farmers are concerned the new laws will drive down their products’ prices with no safeguards to protect them against corporate takeovers and exploitation. But the crisis in India’s agricultural sector is nothing new as the industry has been suffering for decades.”
PART 2: CLEANSING THE DATA
Cleansing text data is very important, as it can contain many unwanted characters. This cleaning can go on forever: unlike numerical data, where cleaning typically means removing null values and duplicates or dealing with outliers, text has no obvious stopping point.
There are some common data cleaning techniques for text data, also known as text pre-processing techniques.
Common data cleaning steps on all text:
- Make text all lower case
- Remove punctuation
- Remove numerical values
- Remove common non-sensical text (e.g. newline characters, \n)
- Remove emojis
In the previous article we successfully scraped all the comments on the above YouTube video through the YouTube API. In this part we will cleanse that data to get meaningful insights from it.
Step 1: Import the data from the “response.json” file
We got our response in JSON format, and I saved it in a file named “response.json”.
We import that JSON data into a “response” variable. This step is not mandatory; I saved the data so that I could keep working on the same data I fetched in the previous article. If I ran the API again, it would fetch the newest comments. You can go directly to Step 3.
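A minimal sketch of this save-and-reload round trip is shown below. The sample response here is a hypothetical stand-in shaped like the YouTube Data API commentThreads response from Part 1; only the `items → snippet → topLevelComment` path matters for the next steps.

```python
import json

# Hypothetical sample shaped like a YouTube Data API commentThreads
# response; in the real workflow this came from the API call in Part 1.
sample_response = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {
            "authorDisplayName": "user1",
            "textDisplay": "Great video! 👍",
            "likeCount": 3,
        }}}},
    ]
}

# Save the response once so later runs reuse the exact same data
with open("response.json", "w", encoding="utf-8") as f:
    json.dump(sample_response, f)

# Import the saved JSON back into the `response` variable
with open("response.json", "r", encoding="utf-8") as f:
    response = json.load(f)

print(len(response["items"]))  # number of comment threads loaded
```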
Step 2: Putting response data into Data Frame
Remember: this is the same code we used in Step 3 of the Google API section in the Part 1 article.
We will get our data frame “df” as shown below:
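A sketch of this flattening step, under assumptions: the “Comments” column name matches the one the article references later, while the other column names and the inline `response` stub are hypothetical.

```python
import pandas as pd

# Hypothetical response stub; in the article, `response` comes from
# Step 1 / the Part 1 API call.
response = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {
            "authorDisplayName": "user1",
            "textDisplay": "Great video! 👍",
            "likeCount": 3,
        }}}},
    ]
}

# Flatten each comment thread into one row; "Comments" matches the
# column the article uses, the other names are assumptions.
rows = []
for item in response["items"]:
    top = item["snippet"]["topLevelComment"]["snippet"]
    rows.append({
        "Name": top["authorDisplayName"],
        "Comments": top["textDisplay"],
        "Likes": top["likeCount"],
    })

df = pd.DataFrame(rows)
print(df.head())
```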
Step 3: Cleaning — Round 1
In our “Comments” column, comment authors used several emojis that needed to be removed; line 3 deletes those emojis.
There were also some unnecessary numbers present in the comments, so we removed them in line 4.
Some special characters also needed to be removed, such as:
: , @ , * , ) , period (.), $ , ! , ? , comma (,) , % , “
We removed these characters in line 5.
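The original code isn’t reproduced here, but a sketch of what those three lines might do is below. The emoji regex covers only the main emoji block, so symbols outside it (like ♥) survive into round 2; the article’s exact regexes may differ.

```python
import re

def clean_round1(text):
    # Line 3 (sketch): strip characters in the main emoji block; symbols
    # outside this range (e.g. ♥) survive, which is why a round 2 exists
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)
    # Line 4 (sketch): remove numbers
    text = re.sub(r"\d+", "", text)
    # Line 5 (sketch): remove the listed special characters
    text = re.sub(r'[:@*().$!?,%"]', "", text)
    return text

# Applied to the data frame: df["Comments"] = df["Comments"].apply(clean_round1)
print(clean_round1("Wow 😎 100% great!"))
```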
After performing round 1 we inspected the data and found it still contained some special characters, so we need a second round of cleaning.
Step 4: Cleaning — Round 2
Some emojis were still left after round 1 of cleaning: 💁🌾 😎 ♥ 🤷♂. Line 2 of the code below removes those.
There were also some leftover special characters, which line 3 removes.
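A sketch of round 2, under assumptions: the emoji list comes from the article, but since it doesn’t say which special characters line 3 targeted, brackets and curly quotes are used here purely as an example.

```python
import re

def clean_round2(text):
    # Line 2 (sketch): remove the specific leftover emojis, including
    # BMP symbols like ♥ and the zero-width joiner used in 🤷♂
    text = re.sub(r"[💁🌾😎♥🤷♂\u200d\ufe0f]", "", text)
    # Line 3 (sketch): the article doesn't list the leftover special
    # characters; brackets and curly quotes are an illustrative guess
    text = re.sub(r"[\[\]“”]", "", text)
    return text

print(clean_round2("so true 🤷♂ ♥ [edited]"))
```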
After performing round 2, we found some newline characters in the comments, along with a few remaining special characters. We remove these in round 3.
Step 5: Cleaning — Round 3
Line 2 removes the newline characters present in some of the comments.
Line 3 finally removes the ‘ , 🇵🇰 , ; , ! special characters left over from both earlier rounds.
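A sketch of round 3, assuming simple regex substitutions like the earlier rounds; the article’s actual code may differ.

```python
import re

def clean_round3(text):
    # Line 2 (sketch): replace newline characters with a space
    text = re.sub(r"\n", " ", text)
    # Line 3 (sketch): remove the leftover characters ' 🇵🇰 ; !
    # (the flag emoji is two regional-indicator code points; a
    # character class removes each of them individually)
    text = re.sub(r"['🇵🇰;!]", "", text)
    return text

print(clean_round3("proud farmers 🇵🇰\ngreat video!"))
```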
We got our dataframe shown below.
But we forgot the first step on our cleaning list: make the text lowercase. Let’s do it in another round.
Step 6: Cleaning — Round 4
So far we have cleaned our data by removing unnecessary comments, special characters, numbers and emojis.
The only thing left is to convert all the comments to lower case.
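This round is a one-liner with pandas’ vectorized string methods; the toy data frame below stands in for the `df` produced by the earlier rounds.

```python
import pandas as pd

# Toy data frame standing in for the cleaned `df` from earlier rounds
df = pd.DataFrame({"Comments": ["Farmers Deserve Better", "GREAT video"]})

# Round 4: lower-case every comment in one vectorized step
df["Comments"] = df["Comments"].str.lower()
print(df["Comments"].tolist())
```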
Bingo! Now our data frame is ready for the next part.
Let’s assemble all the code from the Part 1 article and this article.
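As a sketch of what the assembled cleaning pipeline might look like end to end: the four rounds collapse into one function. Note that removing everything except word characters and whitespace covers emojis and special characters in one pass, so this condensed version is simpler than the round-by-round code; the column name and sample comment are assumptions.

```python
import re
import pandas as pd

def clean_comment(text):
    # Round 1: remove numbers
    text = re.sub(r"\d+", "", text)
    # Round 3: replace newlines with spaces
    text = re.sub(r"\n", " ", text)
    # Rounds 1-3 combined: drop everything that is not a word
    # character or whitespace (emojis, punctuation, symbols)
    text = re.sub(r"[^\w\s]", "", text)
    # Round 4: lower case
    return text.lower()

# Hypothetical sample comment exercising every cleaning round
df = pd.DataFrame({"Comments": ["Modi's 3 laws 😎 are BAD!\nSupport farmers"]})
df["Comments"] = df["Comments"].apply(clean_comment)
print(df["Comments"][0])
```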
This completes the data cleaning. I have applied only the most common cleansing steps that everyone uses. You can apply other cleaning steps, such as removing stop words. To achieve better results, try to clean your data as thoroughly as possible; it will enhance your analysis.
In the next part we will look at the concepts of subjectivity and polarity and how they are used. We will also build word clouds to visualize the words most commonly used in the comments by different users.
Stay tuned for next part!!