Text Classification Using TF-IDF

Shraddha Anala · Published in The Startup · Jul 21, 2020 · 4 min read

Classifying reviews from multiple sources using NLP

I’m expanding with more posts on ML concepts + tutorials over at my blog!

Hi there, here’s another tutorial from my random dataset challenge series, where I build Machine Learning models on datasets hosted at the UCI Machine Learning Repository.

This series is a continuous effort to improve my data science skills by playing with different datasets: numerical, categorical and even, as you’ll see in this tutorial, text data. So if you’d like to check out some interesting techniques, keep reading, and also take a peek at my previous articles.

Photo by Dmitry Ratushny on Unsplash

About the Dataset:

This tutorial uses the Sentiment Labelled Sentences Dataset, a collection of user reviews and ratings pulled from 3 sites: Amazon, Yelp and IMDB. Each review is labelled 0 for negative sentiment or 1 for positive sentiment, reflecting the user’s experience of a product, film or place.

Acknowledgements —

This dataset was created for the paper ‘From Group to Individual Labels Using Deep Features’, Kotzias et al., KDD 2015.

Term Frequency-Inverse Document Frequency: TF-IDF determines how important a word is by weighing its frequency of occurrence in a document against how often the same word occurs across the other documents. If a word occurs many times in a particular document but rarely in others, it is likely highly relevant to that particular document and is therefore assigned more weight.

1) Data Preprocessing —

There are 3 separate datasets, one for each site, and in the first preprocessing step I’ve combined them into one giant dataset. There are only 2 columns: ‘reviews’ and ‘ratings’. Two of the reviews do not have a sentiment rating, so I simply assigned them the score 1, but feel free to drop those reviews if you’re implementing this tutorial.
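The original gist isn’t embedded here, but the combining step might look like the sketch below. The UCI files are tab-separated (one review and one 0/1 label per line); I simulate them with `StringIO` so the snippet is self-contained — swap in the real file paths (e.g. `amazon_cells_labelled.txt`) when running locally:

```python
import io
import pandas as pd

# Simulated stand-ins for the three tab-separated UCI files.
sources = {
    "amazon": io.StringIO("Great phone.\t1\nBattery died fast.\t0\n"),
    "imdb":   io.StringIO("Loved the acting.\t1\nDull plot.\t0\n"),
    "yelp":   io.StringIO("Tasty food.\t1\nSlow service.\t\n"),  # missing label
}

# Read each source and stack them into one DataFrame.
frames = [
    pd.read_csv(f, sep="\t", header=None, names=["reviews", "ratings"])
    for f in sources.values()
]
df = pd.concat(frames, ignore_index=True)

# A couple of reviews have no rating; assign 1
# (or use df.dropna() if you'd rather drop them).
df["ratings"] = df["ratings"].fillna(1).astype(int)
print(df.shape)  # → (6, 2)
```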

The next preprocessing step involves cleaning up the reviews themselves using NLP techniques.

This is done to make sure that special characters and commonly occurring words are removed as they do not contain any useful information for the machine learning algorithm to learn.

Lemmatizing is also done here to convert the different inflected forms of a word to its base form (e.g. happily, happiness -> happy). Again, this helps retain context-based information about a word without increasing the dimensionality of the TF-IDF matrix.

2) EDA with Word Clouds —

Word Clouds are fun, little graphs that tell us what words are commonly occurring in a corpus. Generating word clouds for each of the 3 datasets plus the big, complete one seems like a good way to explore the most common words in each of the 3 different areas.

Word Cloud of the complete dataset. Image by the Author.

To start off, the Word Cloud object needs a corpus of text, which in our case simply means merging all the reviews into one giant string variable.

Then we can instantiate the WordCloud class and set different parameters such as the background colour, maximum font size, the colour map used to display the words, etc.

Word Cloud of the Yelp Reviews. Image by the author.

And here are the word clouds for the other 2 datasets.

The word cloud of the complete dataset is a mixture of the top-occurring words from all three sources. Words such as ‘movie’ and ‘acting’ clearly relate to the IMDB dataset, while words such as ‘product’ and ‘restaurant’ correspond to the Amazon and Yelp reviews.

Word Cloud of the Amazon Reviews. Image by the Author.
Word Cloud of the IMDB Reviews. Image by the Author.

3) Model, Predictions & Performance Evaluation —

Now that the preprocessing and the exploratory data analysis steps are done, the next step is to split the dataset into training & testing subsets.

Then the classification model is fitted with the training data and predictions are obtained with the test dataset.

For this dataset, I found that the Multinomial Naive Bayes classifier showed the best performance compared to the other classifiers. Additionally, according to its documentation, this classifier is suitable for our use case of text classification with word counts.
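The modelling steps above — split, vectorize, fit, predict, evaluate — can be sketched as follows. The tiny hand-written reviews are placeholders for the cleaned dataset, and the split/classifier parameters are my own choices:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Stand-in data; in the tutorial, X is the cleaned "reviews" column
# and y the 0/1 "ratings" column of the combined DataFrame.
X = [
    "great movie loved it", "terrible acting boring plot",
    "excellent product works well", "awful battery broke fast",
    "friendly staff tasty food", "slow service cold food",
    "amazing film great cast", "worst purchase ever",
]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# Split into training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Vectorize with TF-IDF, fitting the vocabulary on the training data only.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Fit Multinomial Naive Bayes and predict on the held-out reviews.
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

`classification_report` is what surfaces the per-class precision and recall scores alongside accuracy.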

I was able to achieve a classification accuracy of 81% with similar precision and recall scores while labelling reviews as either positive (1) or negative sentiments (0).

If you’re able to achieve higher metrics with your model, then let me know in the comments below.

I’ve come across other sentiment-based datasets previously in my random dataset series, so if you’d like to, you can read about Sports Articles and Objectivity or take a look at the analysis of articles covering the Strauss-Kahn allegations.

Thank you so much for reading and I’ll see you next time!
