Approaching My NLP Data Set From Four Different Models

Tim McAleer
Published in CodeX
5 min read · Mar 5, 2021


In a recent NLP project, I took a data set from Kaggle and investigated it from several different angles. In the process, I implemented four different modeling techniques, each for a different purpose. I plan to shed light on each technique and show how one project can be split into several mini-projects, each with its own goal, to achieve results.

The Data

The data set contains 5 million podcast reviews, each with both a written review and a numerical rating. The data is updated monthly and organized very well. The only feature I would ask to have added is a User ID for the consumers leaving the reviews. The data is available here.
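A rough sketch of the shape of the data (the column names here are hypothetical; the actual Kaggle export may differ):

```python
import pandas as pd

# Hypothetical columns standing in for the Kaggle schema:
# a podcast name, the written review, and the numerical rating.
reviews = pd.DataFrame({
    "podcast_name": ["Show A", "Show A", "Show B"],
    "review_text": ["Love it!", "Great show.", "Too many ads."],
    "rating": [5, 5, 2],
})

# A quick look at how the ratings are distributed.
print(reviews["rating"].value_counts())
```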

Our First Model — Logistic Regression

Every entry in the data set has both a written review and a numerical rating, making it ideal for labeled classification. When approaching the data set, it's important to set your standards and goals. Our data set is large, and the goal of our first model is simply to eliminate the easily predicted reviews. We begin with a simple model and accept a large error; the goal of the next model will be to deal with that error.

Our standard first steps for NLP apply:

  • Clean the data
  • Tokenize
  • Lemmatize
  • Count Vectorize
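The steps above can be sketched with scikit-learn on a couple of toy reviews. Lemmatization is omitted here because it typically requires an NLTK or spaCy model download; in the real pipeline it would run between tokenizing and vectorizing:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean(text):
    # Lowercase and strip everything except letters and whitespace.
    return re.sub(r"[^a-z\s]", " ", text.lower())

docs = ["Loved this podcast!!!", "Too many ads, not great."]
cleaned = [clean(d) for d in docs]

# CountVectorizer tokenizes on word boundaries and builds the
# document-term count matrix in one step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)
print(X.shape)  # (number of documents, number of unique tokens)
```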

The biggest discovery in this routine process was the shape of the data itself: it was hugely imbalanced in favor of the 5 rating.

Purple represents our 5 class

It’s important to inspect the confusion matrix of a model for a massively imbalanced data set like this one to ensure that our model isn’t picking the majority class every time.
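A minimal sketch of that check, using toy imbalanced data rather than the actual review vectors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Toy imbalanced data: roughly 90% of labels are class 5, 10% class 1.
X = rng.normal(size=(500, 4))
y = np.where(rng.random(500) < 0.9, 5, 1)

clf = LogisticRegression().fit(X, y)
cm = confusion_matrix(y, clf.predict(X), labels=[1, 5])

# Row/column order is [1, 5]; a degenerate model puts every
# prediction in the 5 column, which the heatmap makes obvious.
print(cm)
```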

Logistic regression confusion matrix heatmap

I chose a Logistic Regression classifier for our initial classification model. As the confusion matrix shows, it had pretty lousy results for any class that was not 5. But it's important here to remember our established goal. This model tends to pick the majority class, but when it doesn't, it tends to be pretty accurate. In fact, calculating the error shows each misclassified data point is wrong by an average of 1.57 ratings.

In terms of a numerical rating system, what actually is the difference between an item rated 2 and an item rated 3? A 4 versus a 5? Ratings are highly subjective, so the goal of this model is to predict within one rating of the actual value. I remove all correctly predicted data points, along with all predictions within one rating of being correct.
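On toy labels, the error calculation and the "within one rating" filter might look like this (a sketch, not the project's actual code):

```python
import numpy as np

y_true = np.array([5, 5, 1, 3, 4, 2])
y_pred = np.array([5, 4, 5, 3, 1, 2])

# Mean absolute error over the misclassified points only.
wrong = y_true != y_pred
avg_err = np.abs(y_true[wrong] - y_pred[wrong]).mean()

# Keep only the "hard" reviews: those misclassified by more
# than one rating. Everything else is dropped before model two.
hard = np.abs(y_true - y_pred) > 1
print(avg_err, hard.sum())
```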

The Second Model — Random Forest

Now we've whittled our data set down to only the items that are difficult to predict, and it's time to get serious about modeling and optimizing parameters. The goal this time is accuracy. Several models were tried, and Random Forest classification showed great success.
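A minimal sketch of the parameter-optimization step, using synthetic features in place of the vectorized reviews:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in features for the vectorized "hard" reviews.
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# A tiny grid illustrates the idea; the real search would cover
# more values and run on the held-out review vectors.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```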

Random Forest confusion matrix heatmap

What a stark difference. Only one class is predicted correctly less than 90% of the time, and most of the misclassification points at the opposite extreme. This suggests the NLP process may have trouble distinguishing between a 1 and a 5, which gives me an area to focus on when I return to earlier steps to improve the modeling.

The Third Model — LDA/K Means Clustering

The goal for the next two steps of our process is to create a Collaborative Recommender System. To do that, I import the surprise library, which specializes in user-based recommender systems. It has three requirements: User IDs, Item IDs, and ratings. The ratings are clear and have already been used to create our classification models. The Item IDs are the names of the podcasts being rated. User IDs, however, are missing. So let's use clustering to predict them!

I’m lumping two unsupervised models together for this step, as they are used here to complement each other. First, I used K Means clustering to predict clusters based on the written text of the reviews. It’s important to choose the proper number of clusters. Too few and they will all be centered around our imbalanced majority class. Too many and we spend extra time and processing power to produce increasingly redundant information. After some experimentation, 500 clusters were chosen. Now we can assign the data points in our set based on their clustering, and we have users! As a note for future projects, assigning new users to a predetermined cluster is a possible solution to the common ‘cold start’ problem.
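A toy sketch of the idea, with two clusters instead of 500:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great show love it", "love this great podcast",
    "boring waste of time", "boring and too long",
]

# Vectorize the review text, then cluster; the cluster label of
# each review becomes its pseudo User ID. (The article used 500
# clusters on the full data set; 2 suffices for this toy example.)
X = TfidfVectorizer().fit_transform(docs)
user_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(user_ids)
```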

LDA — Each small blob is a different predicted topic. Topics are grouped by predicted similarity.

To visualize the clustering, I turn to a different unsupervised learning process: Latent Dirichlet Allocation (LDA). LDA works on text data like ours and organizes it into topics. For our data set, an expected outcome is topics driven by the sentiment of the reviews: topics built around highly positive reviews may sit close together, far from topics extracted from negative reviews. Note: this is not the same process K Means uses to assign clusters, but the two can be assumed to behave similarly, making LDA a helpful visual and a useful way to explore the text data.
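A minimal LDA sketch with scikit-learn (the original project may have used a different implementation, such as gensim):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "amazing funny hosts great episodes",
    "funny great show amazing",
    "boring ads annoying skip",
    "ads boring skip annoying episodes",
]

# LDA operates on raw token counts, so CountVectorizer feeds it.
X_counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a document's mixture over the 2 topics (sums to 1).
doc_topics = lda.fit_transform(X_counts)
print(doc_topics.shape)  # (4, 2)
```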

The Fourth Model

The fourth model comes directly from our previously discussed goal: to build the recommender system. I import the SVD model from the surprise library and supply our predicted clusters in place of the missing User IDs. Voila! We’ve created a user-based recommender system without any users!

My simple recommender system.

Conclusion

It is important to note that every step of this process had room for improvement. But dividing each step cleanly and identifying new project goals along the way makes it far easier to return to the beginning and apply what we've learned in future passes. The project can then be polished and optimized in a way that would be much more difficult if we had spent the time and energy getting each step perfect before moving on. Organization sometimes means ignoring perfection.

Thanks for reading, and I hope you've taken heed of some of this advice!
