Sentiment Analysis of Movie Reviews pt.3
pt.3 — n-gram
link to my Github for more code: https://github.com/charliezcr/Sentiment-Analysis-of-Movie-Reviews/blob/main/sa_p3.ipynb
N-gram
In part 1's text preprocessing step, we tokenized the words in reviews one by one. For example, 'Very boring movie' is tokenized as ['very', 'boring', 'movie'].
This kind of model is called a unigram model, because we take only one token at a time. An n-gram model offers other ways of tokenizing the text: we can take a sequence of n tokens at a time. For example, in a bigram (2-gram) model, 'Very boring movie' is tokenized as ['very boring', 'boring movie'].
In a trigram (3-gram) model, 'Very boring movie' is tokenized as the single token 'very boring movie'.
N-gram models are helpful in our sentiment analysis because sequences of words can carry semantics that matter for classification. For example, the unigram 'very' does not contain any sentiment per se, while 'boring' shows that the reviewer dislikes the movie. However, 'very boring' conveys that the reviewer really hates the movie, more so than just 'boring'. 'very boring' should be treated differently from 'boring' because it carries a stronger sentiment. Therefore, we need to find a good n-gram model for our sentiment analysis.
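To make the three tokenizations above concrete, here is a minimal sketch (not part of the pipeline below) using NLTK's ngrams helper:

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize('Very boring movie'.lower())
print(list(ngrams(tokens, 1)))  # [('very',), ('boring',), ('movie',)]
print(list(ngrams(tokens, 2)))  # [('very', 'boring'), ('boring', 'movie')]
print(list(ngrams(tokens, 3)))  # [('very', 'boring', 'movie')]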
Tuning parameter
In scikit-learn's TfidfVectorizer, we can choose the n-gram model by passing the ngram_range parameter, a tuple of the minimum n and maximum n. For example, (1,1) means that we are only using the unigram model, since the minimum n and maximum n are both 1. (1,3) means that we are using the unigram, bigram, and trigram models together, so 'Very boring movie' is tokenized as ['very', 'boring', 'movie', 'very boring', 'boring movie', 'very boring movie'].
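As a quick sanity check (a small sketch, assuming scikit-learn 1.0+ for get_feature_names_out), fitting a TfidfVectorizer with ngram_range=(1,3) on just this sentence shows exactly those six features; note that the vocabulary is stored in alphabetical order:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectorizer.fit(['very boring movie'])
print(vectorizer.get_feature_names_out())
# ['boring' 'boring movie' 'movie' 'very' 'very boring' 'very boring movie']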
Therefore, we can refine the preprocess and classify functions from part 1 as below:
from nltk.stem import PorterStemmer # stem the words
from nltk.tokenize import word_tokenize # tokenize the sentences into tokens
from string import punctuation
from sklearn.feature_extraction.text import TfidfVectorizer # vectorize the texts
from sklearn.model_selection import train_test_split # split the testing and training sets

def preprocess(path, ngram):
    '''generate the cleaned dataset

    Args:
        path (string): the path of the file of labelled data
        ngram (tuple (min_n, max_n)): the range of the n-gram model

    Returns:
        X_train (list): the list of features of the training data
        X_test (list): the list of features of the test data
        y_train (list): the list of targets of the training data ('1' or '0')
        y_test (list): the list of targets of the test data ('1' or '0')
    '''
    # text preprocessing: iterate through the original file and
    # record every sentence and its label
    labels = []
    preprocessed = []
    with open(path, encoding='utf-8') as file:
        for line in file:
            # get the sentence and its label
            sentence, label = line.strip('\n').split('\t')
            labels.append(int(label))
            # remove punctuation and numbers
            for ch in punctuation + '0123456789':
                sentence = sentence.replace(ch, ' ')
            # tokenize the words and stem them
            words = []
            for w in word_tokenize(sentence):
                words.append(PorterStemmer().stem(w))
            preprocessed.append(' '.join(words))
    # vectorize the texts with the chosen n-gram range
    vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf=True, ngram_range=ngram)
    X = vectorizer.fit_transform(preprocessed)
    # split the testing and training sets
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    return X_train, X_test, y_train, y_test

from sklearn.metrics import accuracy_score
def classify(clf, todense=False):
    '''classify the data using a machine learning model

    Args:
        clf: the model chosen to analyze the data
        todense (bool): whether to make the sparse matrix dense

    Returns:
        accuracy (float): the accuracy on the test set
    '''
    # X_train, X_test, y_train and y_test are the variables returned by preprocess()
    if todense:
        # some models cannot handle sparse matrices
        X_tr, X_te = X_train.todense(), X_test.todense()
    else:
        X_tr, X_te = X_train, X_test
    clf.fit(X_tr, y_train)
    y_pred = clf.predict(X_te)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
Naive Bayes Classifier
Because the Multinomial Naive Bayes classifier was fast and accurate in part 1, we will use MultinomialNB as our baseline model and tune its parameters. We can pass in different ngram_range tuples, from (1,1) to (3,3), and record the performance in a Pandas dataframe as below:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd

# create a dictionary to record the accuracy for each ngram_range
d = {}
# iterate through each ngram_range
for ngram in [(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)]:
    X_train, X_test, y_train, y_test = preprocess('imdb_labelled.txt', ngram)
    d[str(ngram)] = [classify(MultinomialNB())]
df = pd.DataFrame(data=d)
We can see that we must include unigrams: (1,1), (1,2), and (1,3) all achieve good results, while (2,2)'s performance is mediocre and the accuracy of (2,3) and (3,3) drops to about 0.5, meaning they are useless.
Smoothing
In the MultinomialNB model, we can tune the smoothing parameter α of Laplace smoothing to search for a better result. For a more detailed introduction to Laplace smoothing, please refer to this article. We can choose α from the list [0.1, 0.5, 1, 1.5, 2, 2.5] and the n-gram range from (1,1), (1,2), and (1,3), then run the sentiment analysis and record the accuracy in a Pandas dataframe. In this way, we can find the best pair of parameters.
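Briefly, additive smoothing estimates the likelihood of a token t in class c as P(t | c) = (N_tc + α) / (N_c + α·|V|), where N_tc is the count of t in class c, N_c is the total token count in c, and |V| is the vocabulary size. α = 1 corresponds to classic Laplace smoothing and is scikit-learn's default; smaller values smooth less, larger values smooth more.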
alpha_list = [0.1, 0.5, 1, 1.5, 2, 2.5]
d = {'alpha': alpha_list}
for ngram in [(1,1), (1,2), (1,3)]:
    acc = []
    for value in alpha_list:
        X_train, X_test, y_train, y_test = preprocess('imdb_labelled.txt', ngram)
        acc.append(classify(MultinomialNB(alpha=value)))
    d[ngram] = acc
df = pd.DataFrame(data=d)
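With the dataframe in hand, we can read off the best pair of parameters, for example with a small sketch like this (assuming df is built exactly as above):

# locate the (ngram_range, alpha) pair with the highest recorded accuracy
scores = df.set_index('alpha')
best_ngram = scores.max().idxmax()        # ngram_range column with the highest accuracy
best_alpha = scores[best_ngram].idxmax()  # alpha value that achieves it
print(best_ngram, best_alpha, scores[best_ngram].max())

Note that preprocess splits the data randomly on every call (no fixed random_state), so the exact accuracies, and therefore the winning pair, can vary from run to run.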