Multiclass Text Classification From Start To Finish

Rob Salgado
14 min read · Mar 31, 2019

So you have some text and you want to classify it. So you have multiple classes for your text and you want to classify it. Well, what are you waiting for?

I’ll be using Python and scikit-learn and, as always, my Jupyter notebooks can be found on GitHub along with the original dataset.

Data

Text classification is a supervised learning technique, so we’ll need some labeled data to train our model. I’ll be using this public news classification dataset. It’s a manually labeled dataset of news articles, each of which fits into one of four classes: Business, SciTech, Sports or World.

This is what the dataset looks like:

Exploratory Data Analysis & Text Processing

Let’s look at how many articles we have per class:
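A minimal sketch of that count using pandas, run on a tiny stand-in DataFrame rather than the real dataset. The column names `content` and `category` are assumptions made for illustration and may not match the actual file.

```python
import pandas as pd

# Toy stand-in for the news dataset; the column names are assumptions.
df = pd.DataFrame({
    "content": [
        "stocks rally", "new chip unveiled", "team wins final", "summit begins",
        "merger announced", "probe lands", "coach resigns", "election results",
    ],
    "category": [
        "Business", "SciTech", "Sports", "World",
        "Business", "SciTech", "Sports", "World",
    ],
})

# Number of articles in each class.
class_counts = df["category"].value_counts()
print(class_counts)
```

On the real data, the same `value_counts()` call on the label column gives the per-class totals.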

All of the classes are perfectly balanced, which is something you will almost never find in the wild, so I will subsample the Business and Sports categories to make the dataset imbalanced (i.e. more realistic). I’ll keep 1K articles from Business and 800 from Sports.
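One way to do this kind of subsampling is sketched below. The helper `downsample_classes`, the `category` column name, the fixed seed, and the synthetic 1,500-per-class data are all hypothetical and chosen only to make the example runnable; only the 1K and 800 targets come from the text above.

```python
import pandas as pd

def downsample_classes(df, sizes, label_col="category", seed=42):
    """Randomly keep sizes[label] rows for each label listed in sizes.

    Classes not listed in sizes are left untouched.
    """
    parts = []
    for label, group in df.groupby(label_col):
        if label in sizes:
            group = group.sample(n=sizes[label], random_state=seed)
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)

# Synthetic balanced data: 1,500 rows per class (an arbitrary size for the demo).
labels = ["Business", "SciTech", "Sports", "World"]
balanced = pd.DataFrame({
    "content": [f"article {i}" for i in range(6000)],
    "category": [labels[i % 4] for i in range(6000)],
})

imbalanced = downsample_classes(balanced, {"Business": 1000, "Sports": 800})
print(imbalanced["category"].value_counts())
```

Fixing `random_state` keeps the subsample reproducible between notebook runs.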

I’ll also hold out 5 articles from each category to use for predictions at the end. How well the classifiers do on this unseen data is the true test.
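The hold-out step could look roughly like the sketch below, which assumes pandas 1.1 or later (for `DataFrameGroupBy.sample`) and again uses synthetic data with an assumed `category` column. It is an illustration of the idea, not necessarily how the notebook does it.

```python
import pandas as pd

# Synthetic stand-in data: 20 articles per class.
labels = ["Business", "SciTech", "Sports", "World"]
df = pd.DataFrame({
    "content": [f"article {i}" for i in range(80)],
    "category": [labels[i % 4] for i in range(80)],
})

# Draw 5 random articles from each category for the final predictions.
holdout = df.groupby("category").sample(n=5, random_state=0)

# Everything else stays available for training and validation.
remaining = df.drop(holdout.index)

print(holdout["category"].value_counts())
print(len(remaining), "articles left for training")
```

Dropping the held-out rows by index guarantees that none of them leak into the training data.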

Let’s visually inspect the JSON file:
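A quick way to eyeball a JSON file is to load it and pretty-print the first record. The sketch below writes a small made-up file to a temporary directory so that it runs on its own; the file name, the field names, and the assumption that the file is a single JSON array of records are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; the real file's fields may differ.
sample = [
    {"category": "Sports", "content": "team wins final"},
    {"category": "World", "content": "summit begins"},
]

# Write the sample to a temporary file so the example is self-contained.
path = Path(tempfile.mkdtemp()) / "sample_articles.json"
path.write_text(json.dumps(sample), encoding="utf-8")

# Load the file and pretty-print the first record.
with path.open(encoding="utf-8") as f:
    records = json.load(f)

print(len(records), "records")
print(json.dumps(records[0], indent=2))
```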
