Apply a simple natural language model to predict movie review sentiment using Python!

Low Wei Hong
3 min read · Sep 2, 2017


Please read my previous post before this one: https://medium.com/@lowweihong/5-simple-steps-to-create-a-simple-sentiment-analysis-on-movie-reviews-using-natural-language-69c7cba4b79b

import re # import the regular expression package
import nltk # import the Python NLTK package (run nltk.download("stopwords") once if the corpus is missing)
import pandas as pd # import the pandas data frame package
from bs4 import BeautifulSoup # import BeautifulSoup to remove HTML tags
from nltk.corpus import stopwords # import the stop word list
from sklearn.feature_extraction.text import CountVectorizer # import scikit-learn's bag-of-words tool

After importing all those Python packages, we first load the data with pandas and create a function to clean up the movie review training data. The data set can be downloaded here: https://github.com/M140042/nlp_dataset

train = pd.read_csv("C:/Users/low/Desktop/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

1. delimiter="\t" means the fields are separated by tabs.

2. quoting=3 tells pandas to ignore double quotes (csv.QUOTE_NONE).

3. header=0 tells pandas that the first row of the file holds the column names.
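As a quick sanity check (not in the original post), you can inspect what was loaded; the shape and column names below assume the Kaggle labeled training set:

print(train.shape) # number of rows and columns, e.g. (25000, 3)
print(train.columns.values) # expected columns: id, sentiment, review
print(train["review"][0][:100]) # peek at the start of the first review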

def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by spaces,
    #    and return the result
    return " ".join(meaningful_words)

After creating the function, the next step is to vectorize the words; in this case, only the 5,000 most frequently used words are kept as the vocabulary of the data set.
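One step the original snippets skip: fit_transform below expects a list of cleaned strings called clean_train_reviews, which is never built explicitly. A minimal sketch that creates it by running review_to_words over every review:

num_reviews = train["review"].size

# Apply the cleaning function to every review in the training set
clean_train_reviews = []
for i in range(num_reviews):
    clean_train_reviews.append(review_to_words(train["review"][i]))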

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag-of-words tool
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# NumPy arrays are easy to work with, so convert the result to an
# array
train_data_features = train_data_features.toarray()
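To see which words made it into the 5,000-word vocabulary, you can sum each column of the feature matrix (a small sketch that is not in the original post; vectorizer.vocabulary_ maps each word to its column index):

import numpy as np

# Total count of each vocabulary word across all training reviews
dist = np.sum(train_data_features, axis=0)

# Print the first ten vocabulary words alphabetically with their counts
for word, idx in sorted(vectorizer.vocabulary_.items())[:10]:
    print(word, dist[idx])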

Last but not least, feed all the training data into a random forest classifier!

from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)

# Fit the forest to the training set, using the bag of words as
# features and the sentiment labels as the response variable
#
# This may take a few minutes to run
forest = forest.fit(train_data_features, train["sentiment"])
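The post stops at fitting. To get a rough sense of how well the model does, here is a hedged sketch (not in the original) that holds out 20% of the labeled reviews, retrains the forest on the rest, and scores accuracy on the held-out part:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the labeled reviews for validation
X_train, X_val, y_train, y_val = train_test_split(
    train_data_features, train["sentiment"], test_size=0.2, random_state=42)

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)

# Accuracy on the held-out reviews
print(accuracy_score(y_val, forest.predict(X_val)))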

In an upcoming post, I will cover another machine learning model. :)
