5 Simple Steps to Create a Simple Sentiment Analysis of Movie Reviews Using Natural Language Processing!

Welcome to my very first post! I hope this will be an informative blog and a great start for me! Feel free to leave your comment below :)

Natural language processing (NLP) is the application of computational techniques to the analysis and synthesis of natural language and speech. NLP has become a major trend, and movie reviews are a classic dataset for demonstrating a simple bag-of-words model. In this post, I would like to use NLP to determine whether a given movie review is positive or negative, using a dataset of 25,000 reviews.

Let’s start! :)

First of all, let's have a look at one of the reviews, which I extracted from Kaggle:

‘“The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature\’s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey — the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against one nature\’s most fearsome predators. Furthermore a third Sabretooth more dangerous and slow stalks its victims.<br /><br />The movie delivers the goods with lots of blood and gore as beheading…”’

I will use Python as my analytic tool to perform the sentiment analysis. Here I will only go through the steps at a glance; I would like to discuss the details of coding them in Python in my next post. (Look forward to my future posts if you like it. :) )

1. Remove tags (example: <br />) and punctuation.

Tags and punctuation are removed first so that we are left with only words to deal with. I use the BeautifulSoup package to remove the tags, then the re package to remove the punctuation with regular expressions.
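To make this step concrete, here is a minimal sketch of how it could look. The function name `strip_tags_and_punctuation` and the sample string are made up for illustration, and the sketch assumes that anything which is not a letter counts as punctuation (so digits are dropped too). If BeautifulSoup is not installed, it falls back to a rough regular expression for the tags.

```python
import re

try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:  # fall back to a rough regex so the sketch still runs
    BeautifulSoup = None


def strip_tags_and_punctuation(raw_review):
    """Return the review text with HTML tags and non-letter characters removed."""
    if BeautifulSoup is not None:
        # get_text() drops the markup; the separator keeps the words on either
        # side of a tag such as <br /> from being glued together.
        text = BeautifulSoup(raw_review, "html.parser").get_text(separator=" ")
    else:
        text = re.sub(r"<[^>]+>", " ", raw_review)
    # Assumption: every character that is not a letter becomes a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(letters_only.split())


sample = "The movie delivers the goods.<br /><br />Lots of blood and gore!"
print(strip_tags_and_punctuation(sample))
# The movie delivers the goods Lots of blood and gore
```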

2. Lowercase all words.

This makes sure all the words are in the same form :)

3. Remove stop words (example: [‘i’, ‘me’, ‘my’…])

Stop words are removed as they don’t carry much meaning. I recommend using the nltk package to download a list of stop words and remove them from the movie reviews dataset.
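Lowercasing and stop-word removal could be sketched together like this. To keep the example runnable without any downloads, it uses a tiny hand-written stop-word set as a stand-in, which is an assumption for illustration only; the comment shows how the full nltk list would be loaded instead. The function name `lowercase_and_remove_stop_words` is also made up.

```python
# Tiny stand-in stop-word list, for illustration only. With nltk installed,
# the full English list is available after nltk.download("stopwords") via:
#   from nltk.corpus import stopwords
#   STOP_WORDS = set(stopwords.words("english"))
STOP_WORDS = {"i", "me", "my", "the", "a", "of", "and", "is", "it"}


def lowercase_and_remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it on whitespace, and drop the stop words."""
    words = text.lower().split()
    return [word for word in words if word not in stop_words]


cleaned = lowercase_and_remove_stop_words(
    "The movie delivers the goods Lots of blood and gore"
)
print(cleaned)
# ['movie', 'delivers', 'goods', 'lots', 'blood', 'gore']
```

Using a set rather than a list for the stop words keeps each lookup fast, which starts to matter with 25,000 reviews.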

4. Represent the words of each review in vector form using scikit-learn

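For instance, take these two sentences:

Sentence 1: “The cat sat on the hat”

Sentence 2: “The dog ate the cat and the hat”

From these two sentences, our vocabulary is:

{ the, cat, sat, on, hat, dog, ate, and }

To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, “the” appears twice, and “cat”, “sat”, “on”, and “hat” each appear once, so the feature vector for Sentence 1 is:

Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the feature vector for Sentence 2 is:

Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1 }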
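The counting can be checked by hand with a few lines of plain Python. This is only a sketch of the idea, not the scikit-learn implementation: it uses the two example sentences “The cat sat on the hat” and “The dog ate the cat and the hat”, which are consistent with the counts above, and the helper names `build_vocabulary` and `to_count_vector` are made up for illustration.

```python
def build_vocabulary(sentences):
    """Collect each distinct word, in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_count_vector(sentence, vocabulary):
    """Count how many times each vocabulary word occurs in the sentence."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]


sentences = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocabulary = build_vocabulary(sentences)
vectors = [to_count_vector(sentence, vocabulary) for sentence in sentences]

print(vocabulary)  # ['the', 'cat', 'sat', 'on', 'hat', 'dog', 'ate', 'and']
print(vectors[0])  # [2, 1, 1, 1, 1, 0, 0, 0]
print(vectors[1])  # [3, 1, 0, 0, 1, 1, 1, 1]
```

On a full dataset, scikit-learn's `CountVectorizer` does the same kind of counting. Note that it sorts the vocabulary alphabetically and, with its default settings, skips one-letter tokens, so its columns will not come out in the order shown here.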

5. Train a random forest classifier on the vectors and WE ARE DONE!

• Do note that the random forest algorithm often gives good accuracy, but it can overfit on some datasets.
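As a rough sketch of this last step, here is scikit-learn's `RandomForestClassifier` fitted on a toy set of count vectors. Everything about the data is invented for illustration: the four-word vocabulary, the counts, and the labels are assumptions, and the real input would be the bag-of-words vectors built from the 25,000 reviews. The choice of 100 trees is also just an assumption.

```python
from sklearn.ensemble import RandomForestClassifier  # third-party: scikit-learn

# Toy bag-of-words counts for a made-up vocabulary: [great, fun, boring, awful]
train_vectors = [
    [2, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 0],
    [0, 0, 1, 1],
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = positive review, 0 = negative review

# n_estimators is the number of trees; random_state makes the run repeatable.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_vectors, train_labels)

# Predict the sentiment of two unseen count vectors.
new_reviews = [[1, 1, 0, 0], [0, 0, 1, 1]]
predictions = forest.predict(new_reviews)
print(predictions)
```

On the real data, one way to spot the overfitting mentioned above is to hold out part of the reviews and compare the accuracy on the training set with the accuracy on the held-out set; a large gap between the two is a warning sign.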

See you in the next post!