Using Machine Learning to Predict Movie Reviews

8 min read · May 5, 2024

📚Introduction:

In the ever-evolving landscape of cinema, understanding audience sentiments towards movies is paramount. Movie reviews offer valuable insights into audience perceptions, but manually analyzing a large volume of reviews is time-consuming and often subjective. Enter Sentiment Analysis in Natural Language Processing (NLP), a powerful tool that automates the process of gauging sentiments from textual data.

Sentiment analysis allows computers to understand the underlying subjective tone of a piece of writing. This is something that humans have difficulty with, and as you might imagine, it isn’t always so easy for computers, either. But with the right tools and Python, you can use it to better understand the sentiment of a piece of writing.

Why would you want to do that? There are a lot of uses for sentiment analysis, such as understanding how stock traders feel about a particular company by using social media data or aggregating reviews, which you’ll get to do by the end of this tutorial.

In this blog, we delve into the realm of movie reviews, exploring how Sentiment Analysis revolutionizes the way we perceive and analyze cinematic feedback.

📚Table of Contents

Import necessary libraries
Load Dataset
Data Preprocessing
Model Training
Model Testing
Model saving
Performance Checking

📚Import necessary libraries


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re # for regex
from nltk.corpus import stopwords
# stopwords corpus within NLTK contains a collection of common words that are often considered irrelevant for analysis
# and are thus typically removed from text data during preprocessing.
from nltk.tokenize import word_tokenize  # splits text into individual words (tokens)
from nltk.stem import SnowballStemmer  # reduces words to their root form with the Snowball algorithm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score
import pickle

The CountVectorizer from scikit-learn’s feature extraction module converts a collection of text documents into a matrix of token counts, representing the frequency of each word in the corpus, thereby enabling machine learning models to process textual data.

The stopwords corpus within NLTK contains a collection of common words that are often considered irrelevant for analysis and are thus typically removed from text data during preprocessing.

The pickle module in Python provides functionality for serializing and deserializing Python objects, allowing for easy storage and retrieval of data structures, such as lists or dictionaries, in a binary format.

📚Load Dataset

from google.colab import drive
drive.mount('/content/drive')

This Python code mounts the Google Drive to a Colab notebook, enabling access to files and directories stored on Google Drive within the notebook environment.

data = pd.read_csv('/content/drive/MyDrive/Courses /Data Science /NLP/Datasets/IMDB-Dataset.csv')
print(data.shape)
data.head()


data.info()

The info() method prints a summary of the pandas DataFrame: the column names, the data type of each column, the non-null counts, and the memory usage.

data.sentiment.value_counts()

This command returns the frequency count of different sentiment categories present in the ‘data’ object, aiding in the analysis of sentiment distribution within the dataset.


# Map the text labels to integers; assigning the result back to the column
# avoids relying on inplace=True, which newer pandas versions warn about
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})
data.head(10)

This preprocessing step is called “Label Encoding”: the categorical sentiment labels ‘positive’ and ‘negative’ are replaced with the numerical values 1 and 0, and the encoded column is written back to the ‘data’ object.
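As a quick sanity check on the encoding, the sketch below applies a label mapping to a toy DataFrame and counts any labels that failed to map. The rows and the names toy and label_map are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the IMDB data (hypothetical rows)
toy = pd.DataFrame({
    "review": ["loved it", "awful", "great cast"],
    "sentiment": ["positive", "negative", "positive"],
})

label_map = {"positive": 1, "negative": 0}
toy["sentiment"] = toy["sentiment"].map(label_map)

# map() yields NaN for any label missing from label_map, so this catches typos
unmapped = int(toy["sentiment"].isna().sum())
encoded = toy["sentiment"].tolist()
print(encoded, unmapped)  # [1, 0, 1] 0
```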

📚Pre-processing Steps

Any sentiment analysis workflow begins with loading data. But what do you do once the data’s been loaded? You need to process it through a natural language processing pipeline before you can do anything interesting with it. The necessary steps include (but aren’t limited to) the following:

Remove HTML tags
Remove special characters
Convert everything to lowercase
Remove stopwords
Stemming

All these steps serve to reduce the noise inherent in any human-readable text and improve the accuracy of your classifier’s results. There are lots of great tools to help with this, such as the Natural Language Toolkit (NLTK), TextBlob, and spaCy. For this tutorial, you’ll use NLTK.

1. Remove HTML tags

def clean(text):
    cleaned = re.compile(r'<.*?>')
    return re.sub(cleaned,'',text)

data.review = data.review.apply(clean)
data.review[0]

This Python function clean utilizes regular expressions to remove HTML tags from the input text and returns the cleaned text.
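The sketch below runs the same non-greedy pattern on a made-up sample string. It also shows a variant (an assumption, not part of the tutorial's pipeline) that substitutes a space, because deleting a tag outright can glue the neighbouring words together.

```python
import re

TAG_RE = re.compile(r"<.*?>")  # non-greedy: each match stops at the first '>'

sample = "Great film!<br /><br />Loved the <b>ending</b>."

no_tags = TAG_RE.sub("", sample)
print(no_tags)  # Great film!Loved the ending.

# Variant: substitute a space so words on either side of a tag stay separate,
# then collapse the repeated whitespace
spaced = " ".join(TAG_RE.sub(" ", sample).split())
print(spaced)  # Great film! Loved the ending .
```

In this tutorial the next step replaces punctuation with spaces anyway, so the difference mostly matters where a tag sits directly between two words.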

2. Remove special characters


def is_special(text):
    rem = ''
    for i in text:
        if i.isalnum():
            rem = rem + i
        else:
            rem = rem + ' '
    return rem

data.review = data.review.apply(is_special)
data.review[0]

This Python function is_special replaces every non-alphanumeric character in the input text with a space and returns the modified text.
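For illustration, the same character-by-character rule can be written as a single join. This is a sketch with a hypothetical function name, replace_non_alnum, and a made-up sample string.

```python
def replace_non_alnum(text: str) -> str:
    """Replace every non-alphanumeric character with a single space."""
    return "".join(ch if ch.isalnum() else " " for ch in text)

sample = "It's great, 10/10!"
replaced = replace_non_alnum(sample)
print(repr(replaced))  # 'It s great  10 10 '

# Optional: collapse the runs of spaces left behind
collapsed = " ".join(replaced.split())
print(collapsed)  # It s great 10 10
```

Note that the apostrophe in "It's" also becomes a space, which leaves a stray "s" token behind; stop word removal usually cleans that up.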

3. Convert everything to lowercase

def to_lower(text):
    return text.lower()

data.review = data.review.apply(to_lower)
data.review[0]

4. Remove stopwords

Stop words are words that may be important in human communication but are of little value for machines. NLTK comes with a default list of English stop words that you can customize. Here, you’ll tokenize each review and drop every token that appears in that list:

import nltk
nltk.download('stopwords')

This Python code imports the Natural Language Toolkit (NLTK) library and downloads the stopwords corpus, which contains common words like “the,” “is,” and “and,” used in text processing tasks for filtering out irrelevant words.

import nltk
nltk.download('punkt')

This Python code downloads the Punkt tokenizer models, which word_tokenize uses to split text into sentences and words. Recent NLTK releases may ask for the punkt_tab resource instead; if word_tokenize raises a LookupError, download the resource named in the error message.

def rem_stopwords(text):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    return [w for w in words if w not in stop_words]

data.review = data.review.apply(rem_stopwords)
data.review[0]

This Python function rem_stopwords removes stopwords from the input text using NLTK’s English stopwords corpus and tokenizes the text into words, returning a list of words excluding the stopwords.
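The filtering logic itself is small. The sketch below uses a tiny hand-written stop word list and str.split() as stand-ins for NLTK's corpus and word_tokenize, so it runs without any downloads. The set is built once at module level rather than on every call, which helps when the function is applied to every row of a DataFrame.

```python
# Tiny stand-in stop word list (hypothetical); the tutorial uses
# stopwords.words('english') from NLTK instead.
STOP_WORDS = frozenset({"the", "is", "and", "a", "of", "this"})

def remove_stopwords(text, stop_words=STOP_WORDS):
    # str.split() is a simplification of word_tokenize for already-cleaned text
    return [w for w in text.split() if w not in stop_words]

tokens = remove_stopwords("this is a story of love and loss")
print(tokens)  # ['story', 'love', 'loss']
```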

5. Stem the words

def stem_txt(text):
    ss = SnowballStemmer('english')
    return " ".join([ss.stem(w) for w in text])
data.review = data.review.apply(stem_txt)
data.review[0]

data.head()

This Python function stem_txt stems the words in the input text using the Snowball Stemmer for English and returns a string where each word is replaced with its stem.
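To see what stemming does to individual words, here is a minimal sketch with a few made-up inputs. Stems are not always dictionary words: "movies" becomes "movi".

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

words = ["running", "movies", "actors"]
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['run', 'movi', 'actor']
```

This is fine for a bag-of-words model, since all that matters is that different forms of a word map to the same token.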

📚Model Training

1. Creating Bag of Words (BOW)


y = np.array(data.sentiment.values)  # target labels (1 = positive, 0 = negative)
cv = CountVectorizer(max_features = 1000)  # keep the 1000 most frequent tokens
X = cv.fit_transform(data.review).toarray()  # dense matrix of token counts
print("X.shape = ",X.shape)
print("y.shape = ",y.shape)

This Python code segment converts the ‘review’ column of the ‘data’ object into a bag-of-words representation using CountVectorizer with a maximum of 1000 features, assigns it to ‘X’, and extracts the ‘sentiment’ column into ‘y’, then prints the shapes of ‘X’ and ‘y’.

Creating a Bag of Words (BOW) involves representing text data as a collection of unique words and their frequencies, disregarding grammar and word order, essentially converting text into numerical vectors for machine learning tasks.
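The idea can be sketched without any library. The example below builds a vocabulary from two toy, already-stemmed reviews (hypothetical data) and counts words per document. It also shows that shuffling the words of a review gives the same vector, which is what "disregarding word order" means in practice.

```python
from collections import Counter

docs = ["good movi good plot", "bad movi"]  # already-stemmed toy reviews

# Vocabulary: every unique token, sorted so each word gets a stable column
vocab = sorted({tok for doc in docs for tok in doc.split()})

def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]  # Counter returns 0 for absent words

vectors = [to_vector(d) for d in docs]
print(vocab)    # ['bad', 'good', 'movi', 'plot']
print(vectors)  # [[0, 2, 1, 1], [1, 0, 1, 0]]

# Same words in a different order produce the same vector
shuffled = to_vector("plot good movi good")
print(shuffled == vectors[0])  # True
```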

2. Train test split


trainx,testx,trainy,testy = train_test_split(X,y,test_size=0.2,random_state=9)
print("Train shapes : X = {}, y = {}".format(trainx.shape,trainy.shape))
print("Test shapes : X = {}, y = {}".format(testx.shape,testy.shape))
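With test_size=0.2, one fifth of the rows is held out for testing. The sketch below shows this on a toy array. The data is hypothetical, and stratify is an optional argument that the call above does not use; it keeps the class balance the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 3 features, balanced labels (hypothetical)
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=9, stratify=y_toy
)
print(X_tr.shape, X_te.shape)  # (8, 3) (2, 3)
print(sorted(y_te.tolist()))   # [0, 1]
```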

3. Defining the models and Training them


gnb,mnb,bnb = GaussianNB(),MultinomialNB(alpha=1.0,fit_prior=True),BernoulliNB(alpha=1.0,fit_prior=True)
gnb.fit(trainx,trainy)
mnb.fit(trainx,trainy)
bnb.fit(trainx,trainy)

📚Model Testing

4. Prediction and accuracy metrics to choose best model

ypg = gnb.predict(testx)
ypm = mnb.predict(testx)
ypb = bnb.predict(testx)

print("Gaussian = ",accuracy_score(testy,ypg))
print("Multinomial = ",accuracy_score(testy,ypm))
print("Bernoulli = ",accuracy_score(testy,ypb))

Gaussian =  0.7843
Multinomial =  0.831
Bernoulli =  0.8386
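Accuracy is the fraction of test reviews whose predicted label matches the true label. A minimal sketch on hypothetical labels shows that accuracy_score agrees with computing that fraction by hand:

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])  # hypothetical labels
y_pred = np.array([1, 0, 0, 1, 0])  # hypothetical predictions

manual = float(np.mean(y_true == y_pred))  # 4 of 5 correct
library = float(accuracy_score(y_true, y_pred))
print(manual, library)  # 0.8 0.8
```

By this measure, the Bernoulli model is the most accurate of the three on the test split, which is why it is the one saved in the next step.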

📚Model saving

# Use a context manager so the file is closed once the model is written
with open('model1.pkl','wb') as f:
    pickle.dump(bnb,f)
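A saved model is only useful together with the vectorizer that produced its input columns. The sketch below is one possible way to handle that, under stated assumptions: it trains on a tiny hypothetical dataset, pickles the fitted vectorizer and model in one file (the file name sentiment_bundle.pkl and the dictionary keys are illustrative), reloads them, and checks that predictions are unchanged.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Tiny hypothetical training set standing in for the processed reviews
texts = ["great fun love", "love great act", "bad bore wast", "wast bad plot"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
model = BernoulliNB().fit(vec.fit_transform(texts), labels)

# Save both objects together in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")
with open(path, "wb") as fh:
    pickle.dump({"vectorizer": vec, "model": model}, fh)

# Load them back, as a separate script or web app would
with open(path, "rb") as fh:
    bundle = pickle.load(fh)

new_text = ["love great plot"]
before = model.predict(vec.transform(new_text))
after = bundle["model"].predict(bundle["vectorizer"].transform(new_text))
print((before == after).all())  # True
```

Only unpickle files you created yourself or trust, since loading a pickle can execute arbitrary code.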

📚Performance Checking

rev =  """Terrible. Complete trash. Brainless tripe. Insulting to anyone who isn't an 8 year old fan boy. Im actually pretty disgusted that this movie is making the money it is - what does it say about the people who brainlessly hand over the hard earned cash to be 'entertained' in this fashion and then come here to leave a positive 8.8 review?? Oh yes, they are morons. Its the only sensible conclusion to draw. How anyone can rate this movie amongst the pantheon of great titles is beyond me.

So trying to find something constructive to say about this title is hard...I enjoyed Iron Man? Tony Stark is an inspirational character in his own movies but here he is a pale shadow of that...About the only 'hook' this movie had into me was wondering when and if Iron Man would knock Captain America out...Oh how I wished he had :( What were these other characters anyways? Useless, bickering idiots who really couldn't organise happy times in a brewery. The film was a chaotic mish mash of action elements and failed 'set pieces'...

I found the villain to be quite amusing.

And now I give up. This movie is not robbing any more of my time but I felt I ought to contribute to restoring the obvious fake rating and reviews this movie has been getting on IMDb."""

f1 = clean(rev)
f2 = is_special(f1)
f3 = to_lower(f2)
f4 = rem_stopwords(f3)
f5 = stem_txt(f4)

# Save the fitted vocabulary (token -> column index) for later reuse
word_dict = cv.vocabulary_
with open('bow.pkl','wb') as f:
    pickle.dump(word_dict,f)

# Vectorize the review with the fitted CountVectorizer so that each count
# lands in the same column the model saw during training
inp = cv.transform([f5]).toarray()
y_pred = bnb.predict(inp)

print(y_pred)

