Performing Sentiment Analysis on Twitter and Facebook (Part 3 — Data Analysis)

Hung Mai
Nov 28, 2017


Hello, welcome back!

We will do more cleaning and some sentiment analysis using Python, Excel, and Tableau.

Just to refresh your memories, we’ve gone from Twitter:

to an Excel file that contains all the tweets:

to a list of words that matter:

Cool cool cool :) Obviously some words will appear multiple times, and that is worth knowing: the more often a word appears, the more people are talking about it, and the more people associate that term with Clark University.

So first we need to put everything into a dataframe. I named it good_df and renamed the column that contains all the words:

import pandas as pd

good_df = pd.DataFrame(lower_list)
good_df = good_df.rename(columns={0: 'word'})

Now let’s do some math. The screenshot above is just a quick look at the dataframe, but I can already see that terms like ‘clarkuniversity’, ‘commencement’, ‘today’, etc. appear multiple times, for an obvious reason. So in order to count how many times each word appears, we need to group the rows by word and sum up the occurrences, a.k.a. a groupby clause.

In order to do the group by, I assigned a value of one to each word, so that when we group the words together, the ones add up. For example:

this

will become this:

good_df.loc[:, 'NewCol'] = 1
good_df = good_df.rename(columns={'NewCol': 'count'})
grouped_df = good_df.groupby('word')['count'].sum()

I was saying to Python: hey, add a new column called ‘NewCol’ and set every value in that column to 1. Also, I don’t like the name ‘NewCol’, so rename it to ‘count’. Then, Python, please group all the rows by word and sum up the 1s for each word, which tells us how many times each word appears.
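Since the before/after screenshots may not carry over, here is a tiny made-up example of what that groupby does (the words and counts below are invented purely for illustration):

import pandas as pd

# A made-up three-row dataframe: 'clarku' appears twice, 'commencement' once
demo = pd.DataFrame({'word': ['clarku', 'commencement', 'clarku'],
                     'count': [1, 1, 1]})
print(demo.groupby('word')['count'].sum())
# word
# clarku          2
# commencement    1
# Name: count, dtype: int64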

Now that we have performed the groupby, we know that there are 5363 words in our bag. Some of the words appear more than once, and many carry no emotional value, such as “anyway”, “clarku”, “httpworcestermag…”, “clarkuniversity”, etc. We need to get rid of them, just like we did in part 2 with stopwords.

If you forgot, these are the words and punctuation marks that we previously got rid of.

That was part 2.

First, grouped_df is still the raw result of the groupby clause (a Series, not a table), so we need to transform it back into a dataframe. I convert everything to dataframes over and over again, and this step may well be redundant, but it doesn’t affect anything, so why not.
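The conversion line itself isn’t shown in the original, so here is my best guess at it: to_frame() turns the Series that groupby returned back into a one-column dataframe, which is what the filtering step below expects.

# My assumption of the missing conversion step (not shown in the original)
grouped_df = grouped_df.to_frame()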

popular_words = grouped_df[grouped_df['count'] > 20]

Here I picked out the words that appear over 20 times and put them into a new table (yes, table after table, I do have OCD). Then I imported matplotlib, a library for data visualization.

import matplotlib.pyplot as plt
%matplotlib inline

And made a horizontal bar graph with all the words from popular_words:

popular_words.plot(kind='barh', figsize=(15, 25))
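One optional tweak, my suggestion rather than something from the original: sorting by count first makes the horizontal bars much easier to scan.

# Sort ascending so the most frequent words end up at the top of the barh plot
popular_words.sort_values('count').plot(kind='barh', figsize=(15, 25))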

As you can see, we still have more cleaning to do, so:

popular_words.drop(['clarkuniversity', 'clark', 'worcester', 'university', 'college', 'via', 'day', 'get', 'mt', 'campus', 'students', 'back', 'see', 'time', 'check', 'go', 'clarku'], inplace=True)
popular_words.drop(['us', 'tonight', 'clarkie', 'one', 'year', 'atlanta', 'class', 'school', 'home', 'semester', 'show'], inplace=True)

So I have eliminated/dropped the words that don’t carry much meaning for our project. Now let’s extract the remaining words into an Excel/CSV file.

popular_words.to_csv('list_of_words', sep=',')

list_of_words is the name of the file, while sep=',' separates the words and their counts with commas. Without this argument, it will look something like this:

With the sep argument, if you open the list as text (.txt), you should get:
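Roughly, you should see one word,count pair per line, something like this (the words and numbers here are invented for illustration):

word,count
admissions,25
commencement,112
proud,48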

I then opened the file in Excel. It should be a smooth ride from here: just make a table and do whatever visualizations you like.

For me, a horizontal bar chart of what people are saying isn’t enough. I want to present the data so that someone who hasn’t read the last 2 parts and doesn’t code can still understand it. I made a word cloud in Tableau, my favorite data visualization tool.

In Tableau, connect to the Excel file.

Then create a new sheet, and drag the column ‘word’ onto “Label” and ‘count’ onto “Size”.

Here you can see a clear separation. If necessary, change the Mark type from Automatic to Text.

To add color, drag the same dimension to Color on the Marks card. You should get something like this:

To hide the “Word” box on the right:

Then drag the sheet into a dashboard, and you should get:

As you can see, this is the beauty of Tableau: it transforms all the words into something understandable and interactive. You can hover over a word and it tells you how many times people mentioned it.

Sweet. I then uploaded the dashboard to Tableau Public.

Final result:

Tadada! Now you can share your work with other people. Here’s mine.

Awesome, but for me this isn’t enough yet. You could easily say that people have a positive perception of Clark University on Twitter based on these words, but I wanted to apply some more science and go the extra mile.

Because training nltk to distinguish whether a word is positive or negative is pretty complicated, and especially since we already know the answer, I decided to go the more efficient route and use a website: http://text-processing.com/demo/sentiment/

Just paste your words (from the Excel table) into the program.

It tells you whether the words carry a positive or negative emotion. In this case, it confirms our hypothesis. Great, now we are good!
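If you’d rather score the words locally instead of pasting them into a website, here is a minimal sketch using NLTK’s built-in VADER analyzer. This is my alternative, not what I used above, and it assumes popular_words still holds the filtered dataframe:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the sentiment lexicon

sia = SentimentIntensityAnalyzer()
text = ' '.join(popular_words.index)  # join the surviving words into one string
print(sia.polarity_scores(text))  # prints neg/neu/pos/compound scores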

We just did a ton of cleaning and a bit of sentiment analysis. I hope the tutorial helped. Part 4 will be about performing sentiment analysis on Facebook, and it will just cover extracting Facebook posts; the remaining steps after that are the same as parts 2 and 3 of this Twitter series.

Thanks for being here, enjoy and stay tuned!
