Scoring Opinions and Sentiments

Mitusha Arya · Published in Analytics Vidhya · Dec 25, 2019

What is opinion mining?

Opinion mining, also known as sentiment analysis, refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source materials — Source.

What is Natural Language Processing?


As human beings, we count understanding language among our first achievements, and associating words with their meaning seems natural. Computers don't have the ability to understand human language naturally, but they can rely on NLP, a field of computer science concerned with language understanding and language generation between a machine and a human being.

How Does a Machine Understand?


Before a computer can do anything with text, it must be able to read the text in some manner. The example given below shows how we can prepare data to represent categorical variables, such as a feature representing a color (for example, whether an example relates to the color red, pink, or purple). Categorical data is a type of short text that you represent using binary variables, that is, variables coded as one or zero depending on whether a certain value is present in the categorical variable.
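For instance, here is a minimal sketch of this kind of encoding using pandas (the column name and color values are just illustrative):

import pandas as pd

# A toy categorical feature with three possible colors (illustrative values)
colors = pd.DataFrame({'color': ['red', 'pink', 'purple', 'pink']})

# get_dummies creates one binary (0/1) column per color
print(pd.get_dummies(colors, columns=['color'], dtype=int))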

Therefore, just as you transform a categorical color variable, having values such as red, pink, and purple, into three binary variables, each one representing one of the three colors, so you can transform a phrase like “Two driven jocks help fax my big quiz” using eight binary variables, one for each distinct word that appears in the text. (Scikit-learn’s CountVectorizer lowercases text by default, so “Two” and “two” count as the same word.) This is the Bag of Words form of representation.

In its simplest form, Bag of Words shows whether a certain word is present in the text by identifying a specific feature in the dataset.

Let’s start with an example

Take a look at an example using Python and its Scikit-learn package. The input data has three phrases, sentence_1, sentence_2, and sentence_3, placed in a list, corpus.
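Here is a minimal sketch of that setup; the three sentences are reconstructed from the vocabulary and vectors printed later in this article:

sentence_1 = 'Two driven jocks help fax my big quiz'
sentence_2 = 'The five boxing wizards jump quickly'
sentence_3 = 'Your dog is so lazy that it sleeps all the day'

# The corpus is simply a list holding the three documents
corpus = [sentence_1, sentence_2, sentence_3]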

A corpus can be defined as a collection of text documents. It can be thought of as just a bunch of text files in a directory, often alongside many other directories of text files — Source

When you need to analyze text using a computer, you load the documents and place each of them into a string variable. If you have multiple documents, you store them all in a list, the corpus. When you have a single document, you can split it using chapters, paragraphs, or simply the end of each line.

After splitting the document, place all its parts into a list and apply analysis as if the list were a corpus of documents.

Now that you have a corpus, you use a class from the feature_extraction module in Scikit-learn, CountVectorizer, which easily transforms texts into Bag Of Words like this:
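A minimal sketch of that step, using binary=True so that the vectorizer records only the presence (1) or absence (0) of each word:

from sklearn.feature_extraction.text import CountVectorizer

# binary=True marks only whether a word is present, not how often it appears
vectorizer = CountVectorizer(binary=True)

# fit learns the vocabulary; transform turns each document into a row of 0/1 values
vectorized_text = vectorizer.fit_transform(corpus)
print(vectorized_text.todense())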

This gives the following binary output:

[[0 1 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0]
[1 0 0 1 1 0 0 0 0 1 1 0 0 1 0 0 0 1 1 1 1 0 0 1]]

The CountVectorizer class learns the corpus content using the fit method and then turns it (using the transform method) into a list of lists.

A list of lists is nothing more than a matrix, so what the class returns is actually a matrix made of three rows (the three documents, in the same order as the corpus) and 24 columns representing the content.

The Bag of Words representation turns words into the column features of a document matrix, and these features take a nonzero value when the corresponding word appears in the processed text.

For example, consider the word lazy. The following code shows its representation in the Bag of Words:
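The code behind this is essentially a sketch that prints the vocabulary the vectorizer has learned, in which every word maps to its column index:

# vocabulary_ maps every learned word to its column index in the matrix
print(vectorizer.vocabulary_)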

The output now looks like:

{'two': 21, 'driven': 5, 'jocks': 11, 'help': 8, 'fax': 6, 'my': 14, 'big': 1, 'quiz': 16, 'the': 20, 'five': 7, 'boxing': 2, 'wizards': 22, 'jump': 12, 'quickly': 15, 'your': 23, 'dog': 4, 'is': 9, 'so': 18, 'lazy': 13, 'that': 19, 'it': 10, 'sleeps': 17, 'all': 0, 'day': 3}

Printing the vocabulary that CountVectorizer learned from the text shows that it associates dog with the number 4, which means that dog occupies the fifth column of the Bag of Words representation (indices start at zero). The fifth element of the third document's list has a value of 1 because dog appears only in that sentence; for the other two documents, that position holds a 0.

Considering basic processing tasks:

Instead of marking the presence or absence of a phrase element (technically called a token), you can count how many times it occurs, as shown in the following code:
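The following is a sketch of that step; the article doesn't show the exact fourth phrase, so the one used here (with dog repeated twice) is only an assumption, and binary=True is dropped so that the vectorizer counts occurrences:

# Hypothetical fourth phrase containing the word 'dog' twice
sentence_4 = 'A black dog just passed by but my dog is brown'
corpus = [sentence_1, sentence_2, sentence_3, sentence_4]

# Without binary=True, the vectorizer counts how many times each token occurs
vectorizer = CountVectorizer()
vectorized_text = vectorizer.fit_transform(corpus)

# Show only the row for the newly added document
print(vectorized_text.todense()[-1])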

The output now looks like:

[[0 0 1 1 1 1 0 0 2 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0]]

This code modifies the previous example by adding a new phrase with the word dog repeated two times.

The code appends the new phrase to the corpus and retrains the vectorizer, but it omits the binary=True setting this time.

The resulting vector for the newly added document clearly shows a value of 2 in the ninth position, meaning that the vectorizer counted the word dog twice.

Counting tokens helps make important words stand out. Yet it's easy to repeat phrase elements, such as articles, that aren't important to the meaning of the expression. In the next section, you discover how to exclude less important elements, but for the time being, the example underweights them using the term frequency-inverse document frequency (TF-IDF) transformation.

The TF-IDF transformation is a technique that, after counting how many times a token appears in a phrase, divides the value by the number of documents in which the token appears. Using this technique, the vectorizer deems a word less important, even if it appears many times in a text, when it also finds that word in other texts. In the example corpus, a word such as the appears in more than one phrase, so in a classification problem it does little to distinguish between texts. A word such as fax appears in only one phrase, which makes it an important classification term.

The following example demonstrates how to complete the previous example using a combination of normalization and TF-IDF.
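A sketch of that step, using TfidfTransformer with L1 normalization; the fourth phrase added to the corpus is recovered from the bigram vocabulary shown further below:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [sentence_1, sentence_2, sentence_3,
          'Crazy Fredrick bought many very exquisite opal jewels']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# norm='l1' rescales each row so that its TF-IDF values sum to 1.0
tfidf = TfidfTransformer(norm='l1')
tfidf_mtx = tfidf.fit_transform(counts)

phrase = 3  # inspect the last phrase in the corpus
total = 0
for word, pos in vectorizer.vocabulary_.items():
    value = tfidf_mtx.toarray()[phrase, pos]
    if value != 0:
        print('%s: %0.3f' % (word, value))
        total += value
        print('Summed values of a phrase: %0.1f' % total)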

The output looks like:

crazy: 0.125
Summed values of a phrase: 0.1
fredrick: 0.125
Summed values of a phrase: 0.2
bought: 0.125
Summed values of a phrase: 0.4
many: 0.125
Summed values of a phrase: 0.5
very: 0.125
Summed values of a phrase: 0.6
exquisite: 0.125
Summed values of a phrase: 0.7
opal: 0.125
Summed values of a phrase: 0.9
jewels: 0.125
Summed values of a phrase: 1.0

Using this new TF-IDF model rescales the values of important words and makes them comparable across the texts in the corpus. To recover part of the word order that the BoW transformation discards, adding n-grams is also useful. The following example uses CountVectorizer to model n-grams in the range (2, 2), that is, bigrams.
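A sketch of the bigram version, reusing the four-phrase corpus from the TF-IDF sketch above:

# ngram_range=(2, 2) keeps only two-word sequences (bigrams)
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit(corpus)
print(bigrams.vocabulary_)

The output looks like: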

{'two driven': 25, 'driven jocks': 6, 'jocks help': 14, 'help fax': 11, 'fax my': 8, 'my big': 18, 'big quiz': 1, 'the five': 24, 'five boxing': 9, 'boxing wizards': 3, 'wizards jump': 27, 'jump quickly': 15, 'your dog': 28, 'dog is': 5, 'is so': 12, 'so lazy': 21, 'lazy that': 16, 'that it': 22, 'it sleeps': 13, 'sleeps all': 20, 'all the': 0, 'the day': 23, 'crazy fredrick': 4, 'fredrick bought': 10, 'bought many': 2, 'many very': 17, 'very exquisite': 26, 'exquisite opal': 7, 'opal jewels': 19}

Setting different ranges lets you use both unigrams (single tokens) and n-grams in your NLP analysis. For example, the setting ngram_range=(1, 3) creates all single tokens, all bigrams, and all trigrams. You rarely need more than trigrams in an NLP analysis; going beyond them brings only a slight benefit, and sometimes the gains taper off even after bigrams, depending on the corpus size and the NLP problem.

Stemming and removing stop words:
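The code for this step can be sketched as follows, with NLTK's PorterStemmer and word_tokenize plugged into CountVectorizer as a custom tokenizer; the training sentence ('Tommy loves swimming so he swims all the time') and the test sentence ('George loves swimming too!') are assumptions reconstructed from the prose and output below:

import nltk
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')  # tokenizer models used by word_tokenize

stemmer = PorterStemmer()

def tokenize(text):
    # Split the text into tokens and reduce each token to its stem
    return [stemmer.stem(token) for token in word_tokenize(text)]

# Hypothetical training sentence; stop words such as 'so', 'he', 'all',
# and 'the' are removed by stop_words='english'
vocab = ['Tommy loves swimming so he swims all the time']
vect = CountVectorizer(tokenizer=tokenize, stop_words='english')
vect.fit(vocab)

# Hypothetical test sentence scored against the learned, stemmed vocabulary
test_sentence = vect.transform(['George loves swimming too!'])
print(vect.get_feature_names())  # use get_feature_names_out() on newer scikit-learn
print(test_sentence.toarray())

The output looks like: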

[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\Mitusha\AppData\Roaming\nltk_data...
['love', 'tommy', 'swim', 'time']
[[1 0 1 0]]

The first output shows the stemmed words. Notice that the list contains only swim, not swimming or swims. All the stop words are missing as well; for example, you don't see the words so, he, all, or the. The second output shows how many times each stemmed word appears in the test sentence. In this case, a love variant appears once and a swim variant appears once as well. The words tommy and time don't appear in the second sentence, so those values are set to 0.

Thanks for reading :)
