Classifying Documents with Sklearn's Count/Hash/Tf-idf Vectorizers
The Sklearn library provides several powerful tools that can be used to extract features from text. In this article, I will show you how easy it can be to classify documents based on their content using Sklearn.
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

You will need to import pandas (of course) and CountVectorizer. We will use the 20newsgroups sample dataset, which contains sample text and category labels for over 11,000 news articles, although this approach could be used on any large enough set of texts.
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

For the sake of simplicity, we will work with articles that fall into just four categories. The 20newsgroups dataset comes pre-divided into a training set and a testing set for your convenience.
cvec = CountVectorizer()
cvec.fit(data_train.data)
cvecdata = cvec.transform(data_train.data)
df = pd.DataFrame(cvecdata.todense(), columns=cvec.get_feature_names())

This code block takes your set of texts and turns it into a dataframe where each column holds the number of times one word appears in each document. We initialize an empty CountVectorizer object and fit it to our data, which creates the feature columns. We then transform our data, which actually populates those columns with word counts. Finally, we store the result as a dense dataframe that we can use for building models. In this example, we will use logistic regression to categorize our articles, although many other classification methods would work as well.
from sklearn.linear_model import LogisticRegression

y = data_train.target
model=LogisticRegression()
log_model = model.fit(df, y)

Now we can use our CountVectorizer object 'cvec' to transform any text into its word counts (for words that appeared in our original training set), and we can use 'log_model' to predict the category.
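The predict step works the same way on any vectorizer-plus-classifier pair. Here is a hedged, self-contained sketch of the pattern on a tiny invented training set (the texts and label scheme are hypothetical, not from the 20newsgroups data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini training set: 0 = space, 1 = graphics
texts = ["the rocket launch reached orbit",
         "nasa launched a satellite into orbit",
         "render the image with opengl shaders",
         "the graphics card renders polygons"]
labels = [0, 0, 1, 1]

cvec = CountVectorizer()
X = cvec.fit_transform(texts)           # fit vocabulary and count words

model = LogisticRegression().fit(X, labels)

# A new text must go through the SAME fitted vectorizer before predicting;
# words not seen during fitting are simply ignored
new_X = cvec.transform(["a satellite in orbit"])
print(model.predict(new_X))
```

The key point is that `transform` (not `fit_transform`) is used on new text, so the columns line up with the vocabulary the model was trained on.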
X_test=cvec.transform(data_test.data)
y_test=data_test.target
log_model.score(X_test, y_test)

This exceedingly simple code was able to categorize our documents with 73% accuracy, which I found very impressive. The Sklearn library also includes a HashingVectorizer and a term frequency-inverse document frequency (Tf-idf) vectorizer, which can be used in exactly the same way. I have included the import code below.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
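As a brief sketch of how these drop-in replacements behave (on a toy corpus invented for illustration): TfidfVectorizer has the same fit/transform interface but fills the matrix with tf-idf weights instead of raw counts, while HashingVectorizer is stateless, hashing words into a fixed number of columns, so it needs no fit step at all.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration
docs = ["the cat sat", "the dog sat", "cats and dogs"]

# TfidfVectorizer: same fit/transform interface as CountVectorizer,
# but entries are tf-idf weights rather than raw counts
tvec = TfidfVectorizer()
X = tvec.fit_transform(docs)
print(X.shape)          # (3 documents, 7 vocabulary words)

# HashingVectorizer is stateless: words are hashed into a fixed
# number of columns, so there is no fit step and no stored vocabulary
hvec = HashingVectorizer(n_features=2**10)
Xh = hvec.transform(docs)
print(Xh.shape)         # (3 documents, 1024 hashed columns)
```

The trade-off: hashing scales to huge vocabularies and streams of text, but you lose the ability to map columns back to words.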