MACHINE LEARNING | NATURAL LANGUAGE PROCESSING
Simple movie review classifier in 3 steps!
Build a positive/negative movie review classifier in just a few lines of code
While there are many approaches you can take to decide whether a movie review is positive or negative, we can build a simple classifier with the help of Bag of Words and Naive Bayes in just 3 easy steps! Let’s get straight to the code. We are going to use the IMDB movie review dataset.
1. Importing and Preprocessing!
We will be using sklearn’s Naive Bayes classifier and its CountVectorizer, which implements the Bag of Words model, to create the features dictionary and the vectors.
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
Now it’s time to preprocess our corpus. We need to clean the text so that no invalid token ends up in the features dictionary when it is created.
reviews_train = []
for line in open('full_train.txt', encoding="utf8"):
    reviews_train.append(line.strip().lower())

# Strip the HTML line-break tags (<br>, <br/>, <br />) left in the raw reviews
for i in range(len(reviews_train)):
    reviews_train[i] = re.sub(r'<br\s*/?>', '', reviews_train[i])

reviews_test = []
for line in open('full_test.txt', encoding="utf8"):
    reviews_test.append(line.strip().lower())

for i in range(len(reviews_test)):
    reviews_test[i] = re.sub(r'<br\s*/?>', '', reviews_test[i])
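As a quick sanity check (this snippet is just an illustration and not part of the walkthrough itself), you can confirm that the cleanup removes the break tags that IMDB reviews contain:

sample = "what a great movie!<br /><br />loved every minute of it."
print(re.sub(r'<br\s*/?>', '', sample))
# what a great movie!loved every minute of it.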
2. Vectorization and Labeling
In our dataset, the first 12500 reviews are positive and the remaining 12500 are negative, for both the train and test data. We will convert our data (training data and test data) into vectors with the help of CountVectorizer. You can learn more about CountVectorizer here. It creates a features dictionary and, with its help, we can convert our reviews into vectors.
training_labels = [1 if i < 12500 else 0 for i in range(25000)]
test_labels = [1 if i < 12500 else 0 for i in range(25000)]
# Defining bow_vectorizer:
bow_vectorizer = CountVectorizer()
# Defining training_vectors:
training_vectors = bow_vectorizer.fit_transform(reviews_train)
# Defining test_vectors:
test_vectors = bow_vectorizer.transform(reviews_test)
The .fit_transform() method creates the features dictionary from the data it is given and then transforms that data into vectors. The .transform() method, on the other hand, only converts its input into vectors using the features dictionary that has already been built.
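To see the difference on a toy corpus (a standalone illustration, separate from our review data; get_feature_names_out is available in recent scikit-learn versions, older ones use get_feature_names), fit the vectorizer on two tiny “reviews” and then transform a new sentence. Words that were not in the fitted vocabulary are simply ignored:

from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer()
toy_train = ["the movie was great", "the movie was boring"]
# fit_transform builds the vocabulary AND vectorizes the training sentences
toy_vectors = toy_vectorizer.fit_transform(toy_train)
print(toy_vectorizer.get_feature_names_out())  # ['boring' 'great' 'movie' 'the' 'was']
print(toy_vectors.toarray())

# transform reuses the existing vocabulary; 'truly' is not in it, so it is dropped
new_vector = toy_vectorizer.transform(["a truly great movie"])
print(new_vector.toarray())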
3. Prediction!
It’s time to train our Naive Bayes model and make predictions. We will feed in the training reviews in the form of vectors along with their labels. After training, we need to take our input and convert it into a vector before feeding it to the model for prediction.
classifier = MultinomialNB()

def pos_or_not(label):
    return "Positive" if label else "Negative"

classifier.fit(training_vectors, training_labels)
accuracy = classifier.score(test_vectors, test_labels)
print("Accuracy: ", accuracy*100, "%")

sentence = input().strip().lower()
input_vector = bow_vectorizer.transform([sentence])
print("Probability for review being Negative:", classifier.predict_proba(input_vector)[0][0])
print("Probability for review being Positive:", classifier.predict_proba(input_vector)[0][1])
predict = classifier.predict(input_vector)
print(pos_or_not(predict[0]))
Let’s test our model by predicting the label for a review: “I liked the scene at the fashion show, when Jesse walked on the catwalk, looking at a blue neon triangle in front of her.” Below is the output of predicting the label for our review, along with the probabilities.
If we change the review to: “I did not like the scene at the fashion show, when Jesse walked on the catwalk, looking at a blue neon triangle in front of her.”, we will get the following output.
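If you prefer to reproduce this test without the interactive input() call, you can pass the two reviews to the trained classifier directly. This snippet is a convenience sketch; the exact probability values will depend on your copy of the dataset.

test_reviews = [
    "i liked the scene at the fashion show, when jesse walked on the catwalk, "
    "looking at a blue neon triangle in front of her.",
    "i did not like the scene at the fashion show, when jesse walked on the catwalk, "
    "looking at a blue neon triangle in front of her.",
]
for review in test_reviews:
    vector = bow_vectorizer.transform([review])
    proba = classifier.predict_proba(vector)[0]
    print(review)
    print("Probability Negative:", proba[0], "| Probability Positive:", proba[1])
    print("Prediction:", pos_or_not(classifier.predict(vector)[0]))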
This is one of the simplest movie review classifiers you can build with Naive Bayes. The above code and the dataset can be found here. You can find me here on LinkedIn. Happy coding!