Sentiment Analysis

goncalo pereira
Jul 1, 2013

Introduction

For the last few weeks I’ve been taking an online course about Data Science. It was a great introduction to what we can do with the large amounts of data we generate through the use of our APIs.

One of the exercises was about Sentiment Analysis: understanding trends on social media around specific terms. This is quite a hot topic, as it is currently used to study the relationship between social media and the stock market or global natural disasters.

All the work done for this project is available on my GitHub account.

A lot of the ideas were taken from Sentiment Analysis of Twitter Data.

Cleaning up Tweets

As a goal, I decided to look at the Twitter search results we get around 7digital and see what I could do with them.

The first step here is to pre-process the data:

I downloaded a few days’ worth of search results for the word 7digital and started getting them into a state where they could be searched for usable data.

Each tweet is filtered through several steps, such as removing 7digital-related accounts and dropping words with two or fewer characters.
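A minimal sketch of this kind of filtering, assuming each tweet arrives as an account name plus its text. The account list, the function name and the rule for skipping links are illustrative assumptions, not the code from the project.

```python
# Hypothetical list of company accounts to drop; not the project's real list.
COMPANY_ACCOUNTS = {"7digital", "7digital_us"}


def clean_tweet(account, text):
    """Return the words worth scoring, or None if the tweet should be dropped."""
    # Drop tweets posted by 7digital-related accounts.
    if account.lower() in COMPANY_ACCOUNTS:
        return None
    words = []
    for token in text.lower().split():
        # Assumption: links carry no sentiment, so skip them.
        if token.startswith("http"):
            continue
        # Keep only tokens longer than two characters.
        if len(token) > 2:
            words.append(token)
    return words
```

Note that a plain length check like this one also discards short tokens such as emoticons, so where it sits in the pipeline matters.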

After that I generated a list of slang words and used it to replace some of the non-English words and acronyms. Some of the replacements from the weekend’s data include:

chk-check
2-too
chk-check
2-too
atl-atlanta
atl-atlanta
thx-thank you
bday-birthday
bday-birthday
api-application program interface
luv-love
ur-your
gonna-going to
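The replacement step can be sketched as a dictionary lookup per token. The pairs below come from the list above; the function name is hypothetical, and running this before any length filter (so that short tokens such as ur survive long enough to be replaced) is an assumption.

```python
# A few pairs taken from the list above; the project's full list is longer.
SLANG = {
    "chk": "check",
    "thx": "thank you",
    "bday": "birthday",
    "luv": "love",
    "ur": "your",
    "gonna": "going to",
}


def expand_slang(text):
    """Replace whole-word slang tokens with their plain-English form."""
    expanded = []
    for token in text.lower().split():
        # Tokens that are not in the slang list are kept unchanged.
        expanded.append(SLANG.get(token, token))
    return " ".join(expanded)
```

For example, expand_slang("thx mates") gives "thank you mates".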

I also generated a list of emoticons based on Wikipedia to add some extra data to each sentence.

The tweets are now closer to written English. Using a list that gives each word a value from 5 (good sentiment) to -5 (bad sentiment), I now have an approximate value for each tweet: the sum of the values of its individual words.
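The scoring itself is a sum over a lookup table. Here is a small sketch, assuming a dictionary of word values. The entries below are made up for illustration and are not copied from the list used in the project, so the totals will not match the ones shown later in this post.

```python
# Tiny hand-made word list for illustration only; the values are assumptions.
WORD_SCORES = {"good": 3, "nice": 3, "free": 1, "trouble": -2, "damn": -4}


def tweet_score(text):
    """Sum the sentiment value of every known word; unknown words count as 0."""
    total = 0
    for token in text.lower().split():
        # Strip surrounding punctuation so "weekend," matches "weekend".
        word = token.strip(".,!?:;\"'")
        total += WORD_SCORES.get(word, 0)
    return total
```

With these assumed values, "Have a nice weekend, everyone!" scores 3, because only "nice" is in the list.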

Filtering the tweets

Having cleaned up the data and mapped the sentiment values in memory, I was able to create new filters to extract new information from the original bulk data.

Sorting

One of the easiest ideas is simply to sort the data. We can tell when something amazing has happened and get involved by retweeting or chatting with the user. And if a bug is trending, we can find it fast by noticing when tweets with very low values appear.
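As a sketch, sorting comes down to ordering rows by their score and taking both ends. The row layout (account, score, text) mirrors the samples in this post; the function name and the default of three rows are assumptions.

```python
def best_and_worst(scored_tweets, n=3):
    """Return the n highest-scoring and n lowest-scoring rows.

    scored_tweets is a list of (account, score, text) tuples.
    """
    # Sort by the score column, highest first.
    ranked = sorted(scored_tweets, key=lambda row: row[1], reverse=True)
    return ranked[:n], ranked[-n:]
```

The first list is where to look for people to thank or retweet; the second is where a trending bug would show up.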

The following rows include account name, total sentiment value and original text:

“Best of” the weekend

Richard Peterson, 12, RT @7digital_US: Free MP3s this week from @bsmsblues, @BosnianRainbows, @Selebrities and @bandofnymph. Have a nice weekend, everyone! http:…
BSMS, 12, RT @7digital_US: Free MP3s this week from @bsmsblues, @BosnianRainbows, @Selebrities and @bandofnymph. Have a nice weekend, everyone! http:…
Chris, 9, RT @RobAshton: 7digital’s @ChrisAnnODell is running a good session on their overall dev strategy in the other room :)
James Tryand, 9, RT @RobAshton: 7digital’s @ChrisAnnODell is running a good session on their overall dev strategy in the other room :)
Chris, 9, RT @westleyl: Good discussion. “@rammesses: @ChrisAnnODell ‘s 7digital review is now in the Q/A phase - lots of interesting discussion. #dd…
‘Hurricane Rob’, 9, 7digital’s @ChrisAnnODell is running a good session on their overall dev strategy in the other room :)
Anthony Steele, 9, RT @westleyl: Good discussion. “@rammesses: @ChrisAnnODell ‘s 7digital review is now in the Q/A phase - lots of interesting discussion. #dd…

“Worst of” the weekend

ballitsa, 0, I’m at @7digital HQ (London) http://t.co/bSFeOjmaAs
world_on_mars, 0, RROT’S -Good Night- http://t.co/RyOywZeON1 http://t.co/U8qyui72r4 http://t.co/7dPjaEFuOP #music #info #Japan
mars_project_on_amer, 0, RROT’S -Good Night- http://t.co/a0sV94rqnO http://t.co/8DeGgs0Mwe http://t.co/adM10tH6Tf #music #info #Japan
Calum Hale, 0, Editors interview on 7digital UK http://t.co/byFUxl7y6R
Calum Fairweather, -1, Black Ocean (2013) | Skies of Fire | MP3 Downloads 7digital United Kingdom http://t.co/h6hsDF7FCT
Christine Oram, -1, Rock Hard & Co. now available on 7digital http://t.co/tyKMe6ibnN
Calum Fairweather, -2, Black Ocean by Skies of Fire http://t.co/2MElNvGJU6

Averages

Because Twitter may carry several conversations in parallel, it is hard to tell whether the sentiment is actually shifting in real time because something is happening or whether it’s a one-off event.

As a very simple way to tackle this problem, I added a moving average to each tweet. Now, for a tweet to stand out, it needs either to stand out a lot on its own or to sit in the middle of several tweets with high sentiment.
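A minimal sketch of a moving average over tweet scores. The post does not state the window size or type, so the trailing window of three tweets used here is an assumption, as is the function name.

```python
from collections import deque


def moving_average(scores, window=3):
    """Return the trailing average for each score, in arrival order."""
    recent = deque(maxlen=window)  # holds at most `window` latest scores
    averages = []
    for score in scores:
        recent.append(score)
        # Early entries are averaged over however many scores exist so far.
        averages.append(sum(recent) / len(recent))
    return averages


print(moving_average([9, 9, 6, 0, 0, 12]))
# [9.0, 9.0, 8.0, 5.0, 2.0, 4.0]
```

A single tweet scoring 12 after two zeros only lifts the average to 4.0, while three good tweets in a row keep it at 8.0 or above.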

Here’s some of the information extracted from the weekend’s tweets this way:

39
Chris, 9, RT @westleyl: Good discussion. “@rammesses: @ChrisAnnODell ‘s 7digital review is now in the Q/A phase - lots of interesting discussion. #dd…
40
Chris, 9, RT @RobAshton: 7digital’s @ChrisAnnODell is running a good session on their overall dev strategy in the other room :)
41
Chris, 6, RT @apwestgarth: Enjoying @ChrisAnnODell’s session on Continuous Delivery at 7Digital at #dddea http://t.co/16cAZkVs5Q
42
∞Θs¢αя Ĵιмєηєz∞, 0, #tiesto by #7digital http://t.co/IkY2jYU2I8
43
BSMS, 0, @7digital_US thx mates
44
BSMS, 12, RT @7digital_US: Free MP3s this week from @bsmsblues, @BosnianRainbows, @Selebrities and @bandofnymph. Have a nice weekend, everyone! http:…
45
●▬๑۩georgia m.۩๑▬●, 0, RT @DukeboxJohnny: Please download The Enemies new single ‘Smile’ all proceeds for Down Syndrome Ireland http://t.co/uiDwcBA1TX
46
Johnny Crean, 0, Please download The Enemies new single ‘Smile’ all proceeds for Down Syndrome Ireland http://t.co/uiDwcBA1TX
47
Peter Shaw, 6, RT @apwestgarth: Enjoying @ChrisAnnODell’s session on Continuous Delivery at 7Digital at #dddea http://t.co/16cAZkVs5Q
48
‘Hurricane Rob’, 9, 7digital’s @ChrisAnnODell is running a good session on their overall dev strategy in the other room :)
49
Anthony Steele, 9, RT @westleyl: Good discussion. “@rammesses: @ChrisAnnODell ‘s 7digital review is now in the Q/A phase - lots of interesting discussion. #dd…
50

By applying the moving average to the unsorted tweets and adding a counter to the unfiltered list, we can spot ongoing conversations by noticing high-sentiment tweets arriving within short intervals.
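One way to turn that observation into a filter is to report the counter ranges where the moving average stays high. This is only a sketch: the threshold of 5.0 and the function name are assumptions, and the project may have found these stretches differently.

```python
def hot_stretches(averages, threshold=5.0):
    """Return (start, end) counter ranges where the average is at or above threshold."""
    stretches = []
    start = None
    for counter, value in enumerate(averages):
        if value >= threshold and start is None:
            start = counter  # a stretch begins here
        elif value < threshold and start is not None:
            stretches.append((start, counter - 1))  # the stretch just ended
            start = None
    # Close a stretch that runs to the end of the data.
    if start is not None:
        stretches.append((start, len(averages) - 1))
    return stretches
```

Long ranges suggest an ongoing conversation; a range of length one is more likely a single enthusiastic tweet.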

Conclusions

My first conclusion (as I noticed several times during the course) is that preparing the data is a problem on its own. Once we have a model, we face a new problem: how to infer knowledge from the new data set.

In the current problem of filtering tweets we can see several issues:

It’s very hard to understand the true sentiment of a tweet just by adding up the sentiment values of its words. The sum misses the context of the sentence, so something like “wicked good”, which means something very good, is seen as a bad word and a good word cancelling each other out.

Another problem is the sheer number of words in any language. Although there are ways to infer sentiment for new words based on existing data (out of scope for this project), none of the freely available dictionaries contains more than a couple of thousand examples.

Yet another problem with this specific project is that a lot of the tweets are music related, and the filtering will try to infer sentiment from the names of albums or bands, which are completely unrelated to the true value of the tweet.
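The “wicked good” issue can be reproduced in a few lines. The word values below are illustrative assumptions, chosen so that the two words cancel exactly; they are not taken from the dictionary used in the project.

```python
# Illustrative values only, chosen so the two words cancel exactly.
WORD_SCORES = {"wicked": -3, "good": 3}


def naive_score(text):
    """Add up per-word values with no knowledge of word order or context."""
    return sum(WORD_SCORES.get(word, 0) for word in text.lower().split())


print(naive_score("good"))         # 3
print(naive_score("wicked good"))  # 0
```

The stronger phrase ends up with a lower score than the weaker one, which is exactly the kind of error a per-word sum cannot avoid.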

Bonus!

I also applied the same algorithms to the company’s internal IRC channels.

For privacy reasons I have to keep the example data to a minimum.

Some slang replacements

uat-User Acceptance Testing
uat-User Acceptance Testing
yea-yeah
dont-don’t
dont-don’t
uat-User Acceptance Testing
dns-Domain Name System
tc-take care
api-application program interface
api-application program interface
admin-administrator
admin-administrator
meh-whatever
prod-product
fyi-for your information
alright-all right
ok-okay
5-oh
tb-text back
4-for
tb-text back
ok-okay
irc-internet relay chat

“Best of” the weekend

<1>, 6, <1> it has a certain amount of beer in it but there’s a fair amount of free space
<2>, 4, <2> WOAH! AMAZING
<3>, 4, <3> wooo 6 nodes again.
<4>, 4, <4> access is fun in itself.

“Worst of” the weekend

<1>, -4, <1> Who manages ***? Lots of build warnings “Failed to delete empty directory: ***” etc
<2>, -4, <2> damn you, ***!
<3>, -6, <3> Windows crashes and then blames we, *** *** Windows

Some moving average data

<1>, -2, <1> we’re having trouble with timeouts
37
<2>, -4, <2> guys, something is killing disk IO for *** *** VM and this blocks us from fixing our tests (deploys fail) - could you take a look at what’s going on with the host for that machine?
38
<2>, 0, <2> Is there any list of systems tools that I could use for self-help? Like: vms->physical mappings to get to zabbix stats from there?

As each channel has a theme and an order, it is much easier to understand the conversation and to make a new problem or query stand out.

The large amount of technical vocabulary makes the standard available dictionaries less useful.
