Text Classification Using Machine Learning
Diversity: the art of thinking independently together. — Malcolm Forbes
The amount of data being generated has risen exponentially over the past decade. In fact, 90% of the data we have today has been generated in the last three years, and a majority of that data is text based.
Text based data appears on virtually every platform, and it needs to be analysed and stored efficiently.
In this post, we will build a text classifier using the 20 Newsgroups dataset, originally collected by Ken Lang, to classify documents into different categories based on their content.
Install the following packages before starting to write the actual code:
pip install scikit-learn
pip install numpy
Note that the PyPI package is named scikit-learn, even though it is imported as sklearn.
These are the imports needed to build our classifier.
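A minimal set of imports covering every step below might look like this (MultinomialNB and numpy are only used in later steps):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np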
The sklearn.datasets package provides access to scikit-learn's predefined datasets. The sklearn.feature_extraction.text package provides the methods needed for feature extraction and tf-idf transformation. The last import, sklearn.pipeline, is used to build the pipeline that is essential for our model to work in a scalable manner.
The dataset can also be downloaded from the GitHub link provided at the bottom of the article.
The first part of the code deals with fetching the data and splitting it into a training set and a test set. scikit-learn offers two ways to get the data: sklearn.datasets.load_files() can read a manually downloaded copy from disk, while fetch_20newsgroups() fetches it directly.
The categories variable holds the class labels, that is, the fields into which our documents will be sorted.
The predefined method fetch_20newsgroups() loads the text data into a scikit-learn Bunch object. A Bunch exposes predefined attributes such as .data, which holds the raw documents, and .target_names, which lists the categories or labels.
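For example, the data can be fetched like this; the four categories below are an illustrative choice, and any subset of the twenty labels works:

# An example subset of the 20 available newsgroup labels
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

# Download the training split for the chosen categories
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

print(twenty_train.target_names)  # the labels, via the Bunch attribute
print(len(twenty_train.data))     # number of training documents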
Next, we will use the CountVectorizer() method to convert our text data into feature vectors. This is required because scikit-learn's algorithms cannot work with raw text and need numerical feature vectors as input.
Its .fit_transform() method learns the vocabulary from the training documents and converts the text into a sparse matrix of token counts.
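A sketch of this step, using the training data fetched above:

count_vect = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)  # (n_documents, n_vocabulary_terms)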
Next, we deal with the very frequent but uninformative words that can hinder our classifier and give us erroneous results. We do so with the TfidfTransformer() method included in the sklearn package, which rescales the raw counts by term frequency-inverse document frequency (tf-idf) so that such words carry less weight.
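Applied to the count matrix from the previous step:

tfidf_transformer = TfidfTransformer()
# Rescale raw counts so frequent, uninformative terms are down-weighted
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)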
Now we move on to training a classifier model. We are going to use the Naive Bayes algorithm (specifically the multinomial variant, which suits word counts) to train our model. Naive Bayes is one of the most widely used classification algorithms in real-world applications. The docs_new variable holds a couple of test sentences for the model, and the predicted values are stored in predicted. This is a simple sanity check of our classifier on two test values.
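A sketch of this step; the two sentences in docs_new are placeholder examples, and any short strings would do:

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

# Two placeholder test sentences to sanity-check the classifier
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)        # transform, not fit_transform
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))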
We will also build a pipeline, a scikit-learn utility that chains the vectorizer, transformer and classifier into a single object. We can change the arguments of the pipeline to swap out the classifier or transformer for a particular model, which improves the scalability of our code by leaps and bounds.
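The same three steps, chained into one estimator:

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
# Fit the whole chain on raw text in a single call
text_clf.fit(twenty_train.data, twenty_train.target)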
Now our model is ready to be evaluated on a full-size test set so that we can assess its performance and accuracy. So, we test our model on the held-out test split.
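One way to score the pipeline on that split, reporting the fraction of correct predictions:

twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
predicted = text_clf.predict(twenty_test.data)
print(np.mean(predicted == twenty_test.target))  # accuracy on the test set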
We now have a finished text classifier ready to be applied to any kind of dataset. The classifier can be adapted to the dataset at hand by swapping in a different algorithm, such as a support vector machine (SVM). Here is the code for an SVM implementation of our model.
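A sketch using SGDClassifier with hinge loss, which trains a linear SVM; the hyperparameters here are common starting values, not tuned ones:

from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # hinge loss + l2 penalty makes this a linear SVM trained with SGD
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])
text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
print(np.mean(predicted_svm == twenty_test.target))  # SVM accuracy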
Various optimization techniques, such as RMSProp and Nesterov momentum, could also be explored to improve the accuracy of the classifier.
Cheers!!