Text Classification Using Machine Learning
Diversity: the art of thinking independently together. — Malcolm Forbes
The amount of data being generated has risen exponentially over the past decade. In fact, 90% of the data we have today has been generated in the last three years, and a majority of that data is text based.
Text based data appears on virtually every platform, and it needs to be analysed and stored efficiently.
In this post, we will build a text classifier using the 20 Newsgroups dataset, originally collected by Ken Lang, to classify documents into different categories based on their content.
Install the following packages before starting to write the actual code:
pip install scikit-learn
pip install numpy
Note that the PyPI package is named scikit-learn, even though it is imported as sklearn.
These are the imports needed to build our classifier.
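A minimal set of imports covering every step below might look like this (MultinomialNB and numpy are only used in later steps):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np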
The sklearn.datasets package provides access to scikit-learn's predefined datasets. The sklearn.feature_extraction.text package provides the methods needed for feature extraction and tf-idf transformation. The last import, sklearn.pipeline, is used to build the pipeline that is essential for our model to work in a scalable manner.
The dataset can also be downloaded from the GitHub link provided at the bottom of the article.
The first part of the code deals with fetching the data and splitting it into a training set and a test set. scikit-learn offers two ways to get the data: sklearn.datasets.load_files() can read a manually downloaded copy from disk, while fetch_20newsgroups() fetches it directly.
The categories variable holds the class labels, that is, the fields into which our documents will be sorted.
The predefined method fetch_20newsgroups() loads the text data into a scikit-learn Bunch object. A Bunch exposes predefined attributes such as .data, which holds the raw documents, and .target_names, which lists the categories or labels.
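For example, the data can be fetched like this; the four categories below are an illustrative choice, and any subset of the twenty labels works:

# An example subset of the 20 available newsgroup labels
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

# Download the training split for the chosen categories
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

print(twenty_train.target_names)  # the labels, via the Bunch attribute
print(len(twenty_train.data))     # number of training documents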
Next, we will use the CountVectorizer() method to convert our text data into feature vectors. This is required because scikit-learn's algorithms cannot work with raw text and need numerical feature vectors as input.
Its .fit_transform() method learns the vocabulary from the training documents and converts the text into a sparse matrix of token counts.
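A sketch of this step, using the training data fetched above:

count_vect = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)  # (n_documents, n_vocabulary_terms)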
Next, we deal with the very frequent but uninformative words that can hinder our classifier and give us erroneous results. We do so with the TfidfTransformer() method included in the sklearn package, which rescales the raw counts by term frequency-inverse document frequency (tf-idf) so that such words carry less weight.
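Applied to the count matrix from the previous step:

tfidf_transformer = TfidfTransformer()
# Rescale raw counts so frequent, uninformative terms are down-weighted
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)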
Now we move on to training a classifier model. We are going to use the Naive Bayes algorithm (specifically the multinomial variant, which suits word counts) to train our model. Naive Bayes is one of the most widely used classification algorithms in real-world applications. The docs_new variable holds a couple of test sentences for the model, and the predicted values are stored in predicted. This is a simple sanity check of our classifier on two test values.
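A sketch of this step; the two sentences in docs_new are placeholder examples, and any short strings would do:

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

# Two placeholder test sentences to sanity-check the classifier
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)        # transform, not fit_transform
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))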
We will also build a pipeline, a scikit-learn utility that chains the vectorizer, transformer and classifier into a single object. We can change the arguments of the pipeline to swap out the classifier or transformer for a particular model, which improves the scalability of our code by leaps and bounds.
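The same three steps, chained into one estimator:

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
# Fit the whole chain on raw text in a single call
text_clf.fit(twenty_train.data, twenty_train.target)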
Now our model is ready to be evaluated on a full-size test set so that we can assess its performance and accuracy. So, we test our model on the held-out test split.
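One way to score the pipeline on that split, reporting the fraction of correct predictions:

twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
predicted = text_clf.predict(twenty_test.data)
print(np.mean(predicted == twenty_test.target))  # accuracy on the test set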
We now have a finished text classifier ready to be applied to any kind of dataset. The classifier can be adapted to the dataset at hand by swapping in a different algorithm, such as a support vector machine (SVM). Here is the code for an SVM implementation of our model.
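A sketch using SGDClassifier with hinge loss, which trains a linear SVM; the hyperparameters here are common starting values, not tuned ones:

from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # hinge loss + l2 penalty makes this a linear SVM trained with SGD
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])
text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
print(np.mean(predicted_svm == twenty_test.target))  # SVM accuracy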
Various optimization techniques, such as RMSProp and Nesterov momentum, could also be explored to improve the accuracy of the classifier.
Cheers!!