Introduction to Text Classification in Python

Siddhant Sadangi · Published in Analytics Vidhya · Sep 4, 2019 · 15 min read


Natural Language Processing (NLP) is a huge and ever-growing field with innumerable applications, ranging from sentiment analysis and Named-Entity Recognition (NER) to text classification and more.

This article is intended to be a beginner’s guide to basic text classification using Python. Basic machine learning experience with Python is preferable as a prerequisite as we won’t be discussing commonly used libraries, data structures, and other Python functionalities.

We will be using the News Category Dataset from Kaggle. The kernel used is available here.

Let’s jump right in!

First, importing the libraries…
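The import cell appears as an image in the original kernel; judging from what the article uses below, it likely looks something like this:

import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

nltk.download('stopwords')  # fetch the NLTK stopwords corpus (only needed once)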

Natural Language ToolKit (NLTK) is the backbone for NLP in Python. It provides a variety of text processing functions and corpora to make any data scientist’s work a lot easier! Find the official documentation here.

CountVectorizer converts the corpus to something called a Bag-of-Words (BoW). It is one of the simplest methods of representing text data for machine learning algorithms. It basically puts all the words in the corpus together and creates a matrix with the count of each word in each document (or, in our case, each news story) of the corpus. An example from the official documentation:
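A condensed version of that example (note that newer scikit-learn versions rename get_feature_names() to get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]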

Here, vectorizer.get_feature_names() gives us the bag of words i.e., all the distinct words in the corpus. The matrix can be visualized as:

CountVectorizer

It is called a ‘bag of words’ as it puts all the words together, without taking into account their position in the document. In this method, “this is the first document” and “the first is this document” will have the same representation. There are methods which take the position of words into account, but we won’t be discussing those in this article.

An important parameter of CountVectorizer() is ‘ngram_range’. The simplest definition of an ‘n-gram’ is a sequence of n words; a bigram, for example, is a sequence of 2 words. ngram_range specifies the lower and upper boundary of the range of n-grams to be extracted from the corpus. For example, with an ngram_range of (1,2), we will extract all the unigrams and bigrams.

This is how the sentence “This is the first document” will be tokenized if we use an ngram range of (1,2):
‘This’, ‘is’, ‘the’, ‘first’, ‘document’, ‘This is’, ‘is the’, ‘the first’, ‘first document’.
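You can verify this with CountVectorizer’s build_analyzer (note that the default analyzer also lower-cases the text):

from sklearn.feature_extraction.text import CountVectorizer

analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
print(analyze('This is the first document'))
# ['this', 'is', 'the', 'first', 'document', 'this is', 'is the', 'the first', 'first document']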

The advantage of using higher ranges is that they help the model learn from the sequence of words, thereby increasing model accuracy. This information would otherwise be lost if only unigrams were used. The tradeoff is an increase in the feature space, and thereby in the time and computational power required. Note that the sentence “This is the first document” produces only 5 tokens with unigrams, but 5+4=9 with bigrams, and 5+4+3=12 with trigrams. An ngram_range greater than 3 is seldom used.

Then, we load the dataset into a pandas dataframe:
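The loading code is also shown as an image; assuming the JSON file from Kaggle (the file name below is the v2 file and may differ for you), it amounts to a one-liner:

df = pd.read_json('News_Category_Dataset_v2.json', lines=True)  # one JSON record per line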

Some Exploratory Data Analysis (EDA) on the data:

df.head()
News Category Dataset

The ‘category’ column will be our target column, and we will be using just the ‘headline’ and ‘short_description’ columns as our features for now.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
category 200853 non-null object
headline 200853 non-null object
authors 200853 non-null object
link 200853 non-null object
short_description 200853 non-null object
date 200853 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB

There are no NULLs in this dataset, which is good. However, this is often not the case with real-world datasets, where NULLs will need to be handled as part of preprocessing, either by dropping the NULL rows or replacing them with a blank string (‘’).

Now let us see the different categories present in the dataset…

labels = list(df.category.unique())
labels.sort()
print(labels)
['ARTS', 'ARTS & CULTURE', 'BLACK VOICES', 'BUSINESS', 'COLLEGE', 'COMEDY', 'CRIME', 'CULTURE & ARTS', 'DIVORCE', 'EDUCATION', 'ENTERTAINMENT', 'ENVIRONMENT', 'FIFTY', 'FOOD & DRINK', 'GOOD NEWS', 'GREEN', 'HEALTHY LIVING', 'HOME & LIVING', 'IMPACT', 'LATINO VOICES', 'MEDIA', 'MONEY', 'PARENTING', 'PARENTS', 'POLITICS', 'QUEER VOICES', 'RELIGION', 'SCIENCE', 'SPORTS', 'STYLE', 'STYLE & BEAUTY', 'TASTE', 'TECH', 'THE WORLDPOST', 'TRAVEL', 'WEDDINGS', 'WEIRD NEWS', 'WELLNESS', 'WOMEN', 'WORLD NEWS', 'WORLDPOST']

We see that there are a few categories which can be merged together, like ‘ARTS’, ‘ARTS & CULTURE’, and ‘CULTURE & ARTS’. Let us do that:
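The merging code is shown as an image in the kernel. Judging by the labels remaining in the final classification report, the mapping was probably something like the sketch below (the exact pairs are an inference, not the author’s code):

df['category'] = df['category'].replace({
    'ARTS': 'ARTS & CULTURE',
    'CULTURE & ARTS': 'ARTS & CULTURE',
    'STYLE': 'STYLE & BEAUTY',
    'PARENTS': 'PARENTING',
    'THE WORLDPOST': 'WORLDPOST',
})
df['category'].value_counts()  # check the (im)balanced label distribution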

This looks better. We have reduced the number of labels from 41 to 36. Also, we see that the dataset is pretty imbalanced. We have around 35,000 POLITICS stories, but fewer than 1,000 EDUCATION stories (pretty much sums up the current state of affairs too tbh :p). We generally want a balanced dataset to train our model, but most real-world datasets will almost never be balanced. There are augmentation and sampling techniques available to balance out a dataset, but these are beyond the scope of this article.

Now to the preprocessing, by far the most important step!

This is a standard text preprocessing user-defined function (UDF) I use. Let us go through it in detail.
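The function itself appears as an image in the original kernel, so here is a minimal sketch of what it might look like, reconstructed from the steps discussed next. The variable names mirror the snippets quoted below, but the exact regexes, the processing order, and the choice of the Porter stemmer are assumptions:

import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocessing(col, h_pct, l_pct):
    # lower-case the corpus
    lower = col.apply(str.lower)
    # strip HTML tags (BeautifulSoup is a more robust alternative)
    rem_html = lower.apply(lambda x: re.sub(r'<[^>]+>', ' ', x))
    # replace punctuation and symbols with a space, then drop numbers
    rem_punct = rem_html.apply(lambda x: re.sub(r'[^\w\s]', ' ', x))
    rem_num = rem_punct.apply(lambda x: re.sub(r'\d+', ' ', x))
    # remove words of length 1 (e.g. the stray 's' left behind by "Doe's" -> "Doe s")
    rem_lngth1 = rem_num.apply(lambda x: re.sub(r'\b\w\b', ' ', x))
    # remove NLTK English stopwords
    stops = set(stopwords.words('english'))
    rem_stop = rem_lngth1.apply(lambda x: ' '.join(w for w in x.split() if w not in stops))
    # stem every remaining word
    stemmer = PorterStemmer()
    stemmed = rem_stop.apply(lambda x: ' '.join(stemmer.stem(w) for w in x.split()))
    # drop the most and least frequent words; h_pct and l_pct are expressed
    # as a percentage of the *total* word count of the corpus
    words = pd.Series(' '.join(stemmed).split())
    counts = words.value_counts()
    high_freq = counts.head(int(words.count() * h_pct / 100))
    low_freq = counts.tail(int(words.count() * l_pct / 100))
    to_remove = set(high_freq.index) | set(low_freq.index)
    cleaned = stemmed.apply(lambda x: ' '.join(w for w in x.split() if w not in to_remove))
    return cleaned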

lower = col.apply(str.lower)

This converts the corpus to lower-case, because otherwise the CountVectorizer would consider ‘hello’, ‘HELLO’, and ‘hElLo’ to be different words, which is not a good idea.

This removes HTML tags from the corpus. This is very important if the corpus is scraped from a web page. The BeautifulSoup library provides a more refined method of doing this. You can check it out here.

Stemming ‘is the process of producing morphological variants of a root/base word’. A stemming algorithm chops trailing characters off a word to ‘try’ to reach the root, so words like “chocolates” and “chocolatey” are reduced towards a common root, and “retrieval”, “retrieved”, and “retrieves” are all reduced to the same stem. As a result, the root word might not be a dictionary word. The main advantage of stemming is a reduction in the feature space, i.e., in the number of distinct words in the corpus for the model to train on. Another way of reaching the root word is lemmatization. Unlike stemming, lemmatization follows a dictionary-based approach, so words are more often than not reduced to their actual dictionary roots. The trade-off for this is processing speed. Learn more about stemming and lemmatization here.
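A quick way to see the difference, using NLTK’s PorterStemmer and WordNetLemmatizer (exact outputs can vary slightly between NLTK versions):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # the lemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'), '|', lemmatizer.lemmatize('studies'))           # studi | study
print(stemmer.stem('retrieved'), '|', lemmatizer.lemmatize('retrieved', 'v'))  # retriev | retrieve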

Stopwords are common words which generally don’t add much meaning to the data. Removing stopwords from the corpus reduces the size of the feature space by a good amount. However, stopword lists cannot be used blindly. Some words in the NLTK stopwords corpus might hold significance in your dataset. For example, you wouldn’t want to remove the word ‘not’ (which is an NLTK stopword) from a corpus you are doing sentiment analysis on, since doing so would make sentences like ‘It is a good movie’ and ‘It is not a good movie’ look the same.

In our case, removing stopwords improved model performance, so we will be going ahead with doing that.

rem_lngth1 = rem_num.apply(lambda x: re.sub(r'\b\w\b', ' ', x))

Here we remove all words with a length of 1, as they generally don’t add meaning to the corpus. Words like ‘a’ will be removed. Which other words have length 1, you might ask? Remember that earlier we replaced punctuation with a space? That turns something like “John Doe’s kernel” into “John Doe s kernel”. The ‘s’ here does not add any meaning, and after we remove words with a length of 1, we’ll be left with “John Doe kernel”. So unless ownership is of importance in your corpus, this is a good thing.

We do this using regular expressions (regex) via Python’s re module. Regex is widely used across NLP for tasks like information extraction (email addresses, phone numbers, zip codes, etc.), data cleaning, and so on. The official Python documentation on the re module is a great place to get familiar with regex.
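For example, a deliberately simple (and far from exhaustive) pattern for pulling email addresses out of text:

import re

text = 'Reach out at siddhant.sadangi@gmail.com or support@example.org'
print(re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text))
# ['siddhant.sadangi@gmail.com', 'support@example.org']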

This can be a bit tricky to understand. Let’s take a deeper look into this.

h_pct is the percentage of the most frequent words in the corpus which we want to remove. l_pct is the percentage of the least frequent words which we want to remove.

counts = pd.Series(''.join(df.short_description).split()).value_counts()
counts
the 166126
to 111620
of 95175
a 94604
and 89678
...
catch!" 1
big-day 1
incarcerates 1
323-square-foot 1
co-trustee, 1
Length: 208227, dtype: int64

These are the number of times each word occurs in our dataset. Our dataset has 208,227 distinct words. With an h_pct of 1.0, we remove the most frequent words from the corpus, where the number of words to remove is calculated as 1% of the total word count:

high_freq = counts[:int(pd.Series(''.join(df.short_description).split()).count()*1/100)]
high_freq
the 166126
to 111620
of 95175
a 94604
and 89678
...
butternut 5
NGO 5
Mary, 5
songwriter, 5
distracted, 5
Length: 39624, dtype: int64

These are the most frequent words which will be removed from the dataset.

The intuition behind doing this is that since these words are so common, we expect them to be spread across multiple unrelated documents (or news in our case), and so, they wouldn’t be of much use in classifying the text.

low_freq = counts[:-int(pd.Series(''.join(df.short_description).split()).count()*1/100):-1]
low_freq
co-trustee, 1
323-square-foot 1
incarcerates 1
big-day 1
catch!" 1
..
Brie. 1
non-plant 1
fetus? 1
Techtopus” 1
Look). 1
Length: 39623, dtype: int64

These are the least frequent words, selected the same way (the bottom 1%). All of these words occur only once in the corpus, and thus don’t carry much significance and can be removed.

As with stopwords, there is no hard and fast number of words which can be removed. This depends on your corpus, and you should ideally experiment with different values which work the best for you. This is exactly what I’ve done next.

Checking for optimum h_pct and l_pct combination

This finds the optimal values of h_pct and l_pct. We start with integer values from 0 to 10 and, based on the results, can further tune the percentages in 0.5% steps. Keep in mind that this is very time-consuming: the model will be trained i*j times, where i is the number of h_pct values and j is the number of l_pct values. So for h_pct and l_pct values between 0 and 10 (both inclusive), my model was trained a total of 121 times.
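The search loop itself is shown as an image in the kernel. A minimal sketch of such a loop, calling the prep_fit_pred function defined in the next section (the bookkeeping and print format here are assumptions):

best_acc, best_h, best_l = 0, 0, 0
for h_pct in range(0, 11):        # 11 candidate values for h_pct
    for l_pct in range(0, 11):    # 11 candidate values for l_pct -> 121 training runs
        _, acc, _ = prep_fit_pred(df, h_pct, l_pct, LinearSVC(), verbose=False)
        if acc > best_acc:
            best_acc, best_h, best_l = acc, h_pct, l_pct
print(f'SVC max: {best_acc * 100}%, pct:{best_h}|{best_l}')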

For me, the first iteration returned values of 0.0 and 1.0 for h_pct and l_pct respectively. Below are the results when running for values between 0.0 and 0.5% for h_pct and 0.5 and 1.5% for l_pct:

SVC max: 63.79560061555173%, pct:0.0|1.0

We see that the optimal values are still 0.0 and 1.0 for h_pct and l_pct respectively. We’ll go ahead with these values.

df.loc[df.short_description.str.len()==df.short_description.str.len().max()]
df.loc[58142]['short_description']

'This week the nation watched as the #NeverTrump movement folded faster than one of the presumptive nominee\'s beachfront developments. As many tried to explain away Trump\'s reckless, racist extremism, a few put principle over party. The wife of former Republican Senator Bob Bennett, who died on May 4, revealed that her husband spent his dying hours reaching out to Muslims. "He would go to people with the hijab [on] and tell them he was glad they were in America," she told the Daily Beast. "He wanted to apologize on behalf of the Republican Party." In the U.K., Prime Minister David Cameron called Trump\'s proposal to ban Muslims from entering the U.S., "divisive, stupid and wrong." Trump\'s reply was that he didn\'t think he and Cameron would "have a very good relationship." The press is also doing its part to whitewash extremism. The New York Times called Trump\'s racism "a reductive approach to ethnicity," and said Trump\'s attitude toward women is "complex" and "defies simple categorization," as if sexism is suddenly as complicated as string theory. Not everybody\'s going along. Bob Garfield, co-host of "On the Media," warned the press of the danger of normalizing Trump. "Every interview with Donald Trump, every single one should hold him accountable for bigotry, incitement, juvenile conduct and blithe contempt for the Constitution," he said. "The voters will do what the voters will do, but it must not be, cannot be because the press did not do enough."'

This is the longest story in our dataset. We will use this as reference to see how our preprocessing function works.

Model building

This one function takes the dataframe, the values of h_pct and l_pct, the model and a verbosity flag as inputs, and returns the predictions, model accuracy, and the trained model.

Let’s dissect the components:

df['short_description_processed'] = preprocessing(df['short_description'],h_pct,l_pct)
df['concatenated'] = df['headline'] + '\n' + df['short_description_processed']
df['concat_processed'] = preprocessing(df['concatenated'],0,0)

First, we run the preprocessing function on the ‘short_description’ column with our values of h_pct and l_pct and store the result in ‘short_description_processed’.
Then we add the ‘headline’ to this column and store the result in ‘concatenated’.
Finally, we run the preprocessing function again on ‘concatenated’, but this time without removing any of the most and least frequent words, and store the result in ‘concat_processed’. Not removing any words in this second pass gives more weight to words which occur in the headline relative to those which occur only in the story.

X = df['concat_processed']
y = df['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

bow_xtrain = bow.fit_transform(X_train)
bow_xtest = bow.transform(X_test)

We use ‘concat_processed’ as our feature column and ‘category’ as our target.

Then we generate the Bag-of-Words for the train and test corpora using the bow CountVectorizer object. As a rule of thumb, the CountVectorizer is fit and transformed on the train set, but only transformed on the test set, so that the model does not learn anything from the test set.

model.fit(bow_xtrain,y_train)
preds = model.predict(bow_xtest)

The model is trained on the training BoW, and predictions are generated for the test BoW.
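Assembled from these pieces, prep_fit_pred might look roughly like the sketch below. The real function prints much richer verbose output (corpus word counts and a sample story, as seen in the next section), so treat the metrics part here as an assumption:

def prep_fit_pred(df, h_pct, l_pct, model, verbose=False):
    # preprocess the story, append the headline, then preprocess again
    # without any frequency-based trimming
    df['short_description_processed'] = preprocessing(df['short_description'], h_pct, l_pct)
    df['concatenated'] = df['headline'] + '\n' + df['short_description_processed']
    df['concat_processed'] = preprocessing(df['concatenated'], 0, 0)

    X = df['concat_processed']
    y = df['category']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42, stratify=y)

    # Bag-of-Words: fit on the train set only, transform both sets
    bow = CountVectorizer()  # the kernel's exact parameters (e.g. ngram_range) aren't shown
    bow_xtrain = bow.fit_transform(X_train)
    bow_xtest = bow.transform(X_test)

    model.fit(bow_xtrain, y_train)
    preds = model.predict(bow_xtest)

    acc = accuracy_score(y_test, preds)
    if verbose:
        print(classification_report(y_test, preds))
        print(f'Accuracy: {acc:.2%}')
    return preds, acc, model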

Putting it together

Running the prep_fit_pred function on the dataframe with 0 and 1 as h_pct and l_pct (as found earlier), the LinearSVC() model, and verbose set to True:

preds_abc, acc_abc, abc = prep_fit_pred(df, 0, 1, LinearSVC(), verbose=True)

Number of words in corpus before processing: 3985816
Number of words in corpus after processing: 2192635 (55.0%)
Number of words in final corpus: 3498319 (88.0%)
Raw story:
This week the nation watched as the #NeverTrump movement folded faster than one of the presumptive nominee's beachfront developments. As many tried to explain away Trump's reckless, racist extremism, a few put principle over party. The wife of former Republican Senator Bob Bennett, who died on May 4, revealed that her husband spent his dying hours reaching out to Muslims. "He would go to people with the hijab [on] and tell them he was glad they were in America," she told the Daily Beast. "He wanted to apologize on behalf of the Republican Party." In the U.K., Prime Minister David Cameron called Trump's proposal to ban Muslims from entering the U.S., "divisive, stupid and wrong." Trump's reply was that he didn't think he and Cameron would "have a very good relationship." The press is also doing its part to whitewash extremism. The New York Times called Trump's racism "a reductive approach to ethnicity," and said Trump's attitude toward women is "complex" and "defies simple categorization," as if sexism is suddenly as complicated as string theory. Not everybody's going along. Bob Garfield, co-host of "On the Media," warned the press of the danger of normalizing Trump. "Every interview with Donald Trump, every single one should hold him accountable for bigotry, incitement, juvenile conduct and blithe contempt for the Constitution," he said. "The voters will do what the voters will do, but it must not be, cannot be because the press did not do enough."
Processed story:
week nation watch nevertrump movement fold faster one presumpt nomine beachfront developments mani tri explain away trump reckless racist extremism put principl party wife former republican senat bob bennett die may reveal husband spent die hour reach muslims would go peopl hijab tell glad america told daili beast want apolog behalf republican party u k prime minist david cameron call trump propos ban muslim enter u divisive stupid wrong trump repli think cameron would veri good relationship press also part whitewash extremism new york time call trump racism reduct approach ethnicity said trump attitud toward women complex defi simpl categorization sexism sudden complic string theory everybodi go along bob garfield co host media warn press danger normal trump everi interview donald trump everi singl one hold account bigotry incitement juvenil conduct blith contempt constitution said voter voter must cannot becaus press enough
Adding additional columns to story:
Sunday Roundup
week nation watch nevertrump movement fold faster one presumpt nomine beachfront developments mani tri explain away trump reckless racist extremism put principl party wife former republican senat bob bennett die may reveal husband spent die hour reach muslims would go peopl hijab tell glad america told daili beast want apolog behalf republican party u k prime minist david cameron call trump propos ban muslim enter u divisive stupid wrong trump repli think cameron would veri good relationship press also part whitewash extremism new york time call trump racism reduct approach ethnicity said trump attitud toward women complex defi simpl categorization sexism sudden complic string theory everybodi go along bob garfield co host media warn press danger normal trump everi interview donald trump everi singl one hold account bigotry incitement juvenil conduct blith contempt constitution said voter voter must cannot becaus press enough
Final story:
sunday roundup week nation watch nevertrump movement fold faster one presumpt nomin beachfront develop mani tri explain away trump reckless racist extrem put principl parti wife former republican senat bob bennett die may reveal husband spent die hour reach muslim would go peopl hijab tell glad america told daili beast want apolog behalf republican parti u k prime minist david cameron call trump propo ban muslim enter u divis stupid wrong trump repli think cameron would veri good relationship press also part whitewash extrem new york time call trump racism reduct approach ethnic said trump attitud toward women complex defi simpl categor sexism sudden complic string theori everybodi go along bob garfield co host media warn press danger normal trump everi interview donald trump everi singl one hold account bigotri incit juvenil conduct blith contempt constitut said voter voter must cannot becaus press enough
Predicted class: POLITICS
Actual class: POLITICS
Classification report
precision    recall  f1-score   support

ARTS & CULTURE 0.56 0.47 0.51 1280
BLACK VOICES 0.59 0.40 0.48 1494
BUSINESS 0.51 0.48 0.49 1959
COLLEGE 0.48 0.42 0.45 377
COMEDY 0.48 0.43 0.45 1708
CRIME 0.57 0.59 0.58 1124
DIVORCE 0.85 0.72 0.78 1131
EDUCATION 0.43 0.31 0.36 331
ENTERTAINMENT 0.64 0.75 0.69 5299
ENVIRONMENT 0.67 0.26 0.37 437
FIFTY 0.37 0.15 0.22 462
FOOD & DRINK 0.64 0.73 0.68 2055
GOOD NEWS 0.40 0.20 0.27 461
GREEN 0.41 0.37 0.39 865
HEALTHY LIVING 0.35 0.33 0.34 2209
HOME & LIVING 0.75 0.72 0.73 1384
IMPACT 0.44 0.26 0.33 1141
LATINO VOICES 0.66 0.29 0.40 373
MEDIA 0.55 0.46 0.50 929
MONEY 0.56 0.32 0.41 563
PARENTING 0.66 0.76 0.71 4169
POLITICS 0.71 0.84 0.77 10804
QUEER VOICES 0.79 0.69 0.74 2084
RELIGION 0.55 0.50 0.53 843
SCIENCE 0.59 0.47 0.53 719
SPORTS 0.68 0.74 0.71 1612
STYLE & BEAUTY 0.78 0.81 0.80 3928
TASTE 0.37 0.16 0.22 692
TECH 0.58 0.41 0.48 687
TRAVEL 0.69 0.76 0.73 3263
WEDDINGS 0.80 0.78 0.79 1205
WEIRD NEWS 0.41 0.26 0.32 881
WELLNESS 0.63 0.74 0.68 5883
WOMEN 0.41 0.29 0.34 1152
WORLD NEWS 0.51 0.17 0.26 718
WORLDPOST 0.56 0.59 0.57 2060

accuracy 0.64 66282
macro avg 0.57 0.49 0.52 66282
weighted avg 0.63 0.64 0.62 66282

Accuracy: 63.83%

The ‘short_description’ column has 3985816 words. This is reduced to 2192635 (a 45% reduction) after applying the preprocessing function.
The final corpus, after adding the headline and running preprocessing again, has 3498319 words.

We get to see the longest story in its raw and final formats. You can see that the final version is all lowercase, is free of punctuation, symbols and numbers, and is considerably shorter as words have been removed.

The classification report shows that the F1 score is lowest for FIFTY and TASTE, both of which have a pretty low number of news stories, and highest for STYLE & BEAUTY, which has a large number of news stories.

The average accuracy is 63.83%.
While this may not look appealing, for 36 labels, random guessing would have an accuracy of just 2.78%. So our model is 23 times better than random guessing. Sounds much better this way! 😄

This concludes this (short?) introduction. Hopefully, you’ll be able to start with your own text classification projects now.

Your next steps should be trying different models (which can be done easily just by passing a different model to the prep_fit_pred function), exploring and playing around with the preprocessing steps, feature engineering (is there any relation between the length of the story and its label?), and going into more detail of why around 40% of stories have been misclassified (hint: 20% of EDUCATION stories have been classified as POLITICS).

Once you are confident with the basics, you might want to follow some of the techniques used by some of the top Kagglers in their NLP submissions. The good folks at Neptune.ai have you covered.

As already mentioned in the beginning, both the dataset and the code are available on Kaggle. They are also available on GitHub if you want to try this on your local or Google Colab.

Thanks for sticking around till here. Any feedback will be more than welcome!

You can reach out to me at siddhant.sadangi@gmail.com, and/or connect with me on LinkedIn.

Medium still does not support payouts to authors based out of India. If you like my content, you can buy me a coffee :)
