Step-by-Step Text Classification

Jonathan Adiel Pranoto
Published in Tokopedia Data
Mar 4, 2019

In a previous article, Niko explained what Text Analytics is and the process behind it. If you are not yet familiar with Text Analytics, you can read it before we continue. You can find it here:

Preprocessing in Text Analytics

Source: Pixabay

What Text Classification is

Simply put, Text Classification is the process of categorizing or tagging raw text based on its content. It can be applied to almost anything, from labeling news topics to sentiment analysis of user reviews.

For example:

“Phone was terrible. Super slow. Had some serious bloatware and pop up ads like no other. I do not recommend this phone to anyone”

From the text above, a classification model can assign a category or tag that is relevant to our needs, which in this case is negative review.

How Text Classification Works

There are several ways to create automatic Text Classification models.

A. Rule-Based

A rule-based model applies rules derived from relevant elements or text patterns to determine a text's category. In other words, when a new text is given to the model as input, the model decides the appropriate category based on rules that were created beforehand. Say we want to categorize user reviews for sentiment analysis: first, we set up a list of what defines a positive review (compliments, satisfaction, good experience, etc.) and a negative review (complaints, criticism, bad experience, etc.). Then we count the words that define each category to decide whether a review is positive or negative. A rule-based model may achieve good accuracy, but creating those rules requires deep analysis and numerous rounds of testing, which consumes a lot of time. This approach is also not scalable and is hard to maintain, because new data may require new rules that can affect behavior on existing data.
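As a rough illustration (not from the original article), here is a minimal sketch of a rule-based sentiment classifier; the keyword lists are made up purely for the example:

import re

# Illustrative keyword lists only; a real rule set would be far larger
POSITIVE_WORDS = {"great", "love", "excellent", "recommend", "satisfied"}
NEGATIVE_WORDS = {"terrible", "slow", "bloatware", "broken", "refund"}

def rule_based_sentiment(text):
    # Count how many positive and negative keywords appear in the text
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(word in POSITIVE_WORDS for word in words)
    neg = sum(word in NEGATIVE_WORDS for word in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(rule_based_sentiment("Phone was terrible. Super slow."))  # negative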

B. Machine Learning-Based

With the rapid growth of machine learning, it is now much easier to create a model, feed it data, and wait until training is complete. With a machine learning model, classifying input text into categories is much easier and faster. One important step in using machine learning is feature extraction: we transform text into a numerical representation in the form of a vector. One way of doing this is the bag-of-words approach, where we simply count every word in a text; another is tf-idf (term frequency-inverse document frequency).
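As a quick toy illustration of the bag-of-words idea (not part of this article's pipeline), scikit-learn's CountVectorizer turns each sentence into a row of word counts:

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each sentence becomes one row of word counts
corpus = ["the phone was slow", "the phone was great"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)   # mapping from word to column index
print(counts.toarray())         # bag-of-words matrix, one row per sentence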

The picture above shows a simple flow of Text Classification using machine learning. At the first stage, we use text input as training data. Then we perform feature extraction to convert the text into a numerical representation, because most machine learning algorithms only understand numerical features. After getting the features we want, we feed them into the machine learning algorithm along with the predefined tags/categories. When training finishes, we have our classification model.

Once we have our classification model, we can take the data we want to predict, run it through the same flow (feature extraction), and feed it into the model. When that is done, we get the predicted class, as sketched below.
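To make the prediction flow concrete, here is a hedged sketch, assuming a fitted vectorizer named Tfidf (as built later in this article) and a fitted classifier that I will simply call model:

# Sketch only: `Tfidf` and `model` are assumed to be already fitted
new_titles = ["Samsung unveils new Galaxy smartphone"]
new_features = Tfidf.transform(new_titles)   # same feature extraction as training
predicted_class = model.predict(new_features)
print(predicted_class)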

Using machine learning makes text classification much easier and faster, with higher accuracy too.

Playground — Try it yourself!

OK, now let's try it on a real case. Here we have a dataset of news headlines (you can download it here).

TASK

Classify news based on its title

STEP 1 : ‘Load csv data into our dataframes’

import pandas as pd

# Load the news aggregator dataset
df = pd.read_csv('uci-news-aggregator.csv')
df.head()

The 'TITLE' column will be the input, and the 'CATEGORY' column will be the output we want to predict.

STEP 2 : ‘Cleaning the data’

2a. First, we remove unnecessary columns and add the category names to our dataframe

# b: business, t: science and technology, e: entertainment, m: health
category_name = pd.DataFrame({'id': ['b', 't', 'e', 'm'],
                              'cat_name': ['business', 'science and technology',
                                           'entertainment', 'health']})
df = df[['TITLE', 'CATEGORY']]
df = pd.merge(df, category_name, left_on='CATEGORY', right_on='id')[['TITLE', 'cat_name']]

2b. Observe the data distribution

# Plot the number of titles per category
df.groupby('cat_name').TITLE.count().plot.bar(color='blue')

We can see that the data is imbalanced. Data imbalance can have negative effects: the minority class tends to be ignored and carries less weight when deciding the class. This can be a big problem in some cases, such as fraud detection or medical classification. It can be addressed in several ways, such as oversampling or undersampling the data to balance it (a rough sketch follows below). However, in this exercise the imbalance will be ignored.
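For reference, a rough sketch of oversampling with sklearn.utils.resample (not applied in the rest of this article) could look like this:

from sklearn.utils import resample
import pandas as pd

# Oversample every category up to the size of the largest one (sketch only)
max_size = df.groupby('cat_name').size().max()
balanced_parts = [resample(group, replace=True, n_samples=max_size, random_state=50)
                  for _, group in df.groupby('cat_name')]
df_balanced = pd.concat(balanced_parts)
df_balanced.groupby('cat_name').size()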

2c. Remove empty data and useless punctuation

import re

def remove_punc(sentence):
    # Lowercase and keep only alphabetic characters
    sentence = sentence.lower()
    sentence = re.sub('[^a-z]+', ' ', sentence)
    return sentence

df = df.dropna(subset=['TITLE'])              # drop empty rows
df['TITLE'] = df['TITLE'].apply(remove_punc)  # strip punctuation and digits

2d. Remove stop words that are not needed in our news titles

from nltk.corpus import stopwords

sw = stopwords.words('english')   # note: the language name must be lowercase

def stop_word(sentence):
    new_sentence = []
    for word in sentence.split():
        if word not in sw:
            new_sentence.append(word)
    return " ".join(new_sentence)

df['TITLE'] = df['TITLE'].apply(stop_word)
df.head(10)

2e. Stem the words using the NLTK SnowballStemmer

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def stem(sentence):
    new_sentence = []
    for word in sentence.split():
        word = stemmer.stem(word)
        new_sentence.append(word)
    return " ".join(new_sentence)

df['TITLE'] = df['TITLE'].apply(stem)
df.head(10)

You may notice that some words lose a few letters. That is expected, because stemming is the process of reducing words to their 'stem'.

Why not use lemmatization instead? Unlike stemming, which keeps reducing a word until it reaches its 'stem', lemmatization depends on correctly identifying the word's intended part of speech and meaning, so it returns a proper dictionary form.

NLTK also provides lemmatization, so let's give it a try.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize(sentence):
    new_sentence = []
    for word in sentence.split():
        word = lemmatizer.lemmatize(word)
        new_sentence.append(word)
    return " ".join(new_sentence)

df['TITLE'] = df['TITLE'].apply(lemmatize)
df.head(10)

Looks way better than stemming.

STEP 3 : ‘Feature Extraction’

As mentioned above, for a machine learning algorithm to understand and process the data we feed it, we need to transform the text into a numerical representation.

One way of doing that is tf-idf, which stands for term frequency-inverse document frequency. It measures how important a word is to a document in a corpus. The tf-idf weight grows with the number of times a word appears in a document, but is offset by the number of documents in the corpus that contain the word, since some words simply appear more often than others.

We will use 'sklearn.feature_extraction.text.TfidfVectorizer' to calculate tf-idf values for each news title.

from sklearn.feature_extraction.text import TfidfVectorizer

# Keep unigrams and bigrams that appear in at least 5 titles
Tfidf = TfidfVectorizer(min_df=5, ngram_range=(1, 2))
tfidf_features = Tfidf.fit_transform(df.TITLE)
tfidf_features.shape

(422419, 97542)

From 422,419 rows of data, it produces 97,542 features, each representing the tf-idf score of a unigram or bigram.

Let's find the top 5 terms for each category and each n-gram size using the chi-square test.

from sklearn.feature_selection import chi2
import numpy as np

N = 5
Number = 1
for category in df['cat_name'].unique():
    # Chi-square score of each feature against membership in this category
    features_chi2 = chi2(tfidf_features, df['cat_name'] == category)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(Tfidf.get_feature_names())[indices]
    unigrams = [x for x in feature_names if len(x.split(' ')) == 1]
    bigrams = [x for x in feature_names if len(x.split(' ')) == 2]
    print("{}. {} :".format(Number, category))
    print("\t# Unigrams :\n\t. {}".format('\n\t. '.join(unigrams[-N:])))
    print("\t# Bigrams :\n\t. {}".format('\n\t. '.join(bigrams[-N:])))
    Number += 1

1. business :
# Unigrams :
. fed
. oil
. china
. bank
. stocks
# Bigrams :
. in march
. home sales
. bank of
. wall street
. us stocks
2. science and technology :
# Unigrams :
. facebook
. microsoft
. samsung
. apple
. google
# Bigrams :
. climate change
. xbox one
. google glass
. galaxy s5
. samsung galaxy
3. entertainment :
# Unigrams :
. thrones
. movie
. season
. kim
. kardashian
# Bigrams :
. kanye west
. miley cyrus
. game of
. of thrones
. kim kardashian
4. health :
# Unigrams :
. virus
. study
. mers
. cancer
. ebola
# Bigrams :
. ebola virus
. linked to
. west africa
. west nile
. ebola outbreak

STEP 4 : ‘Feed features into Machine Learning Algorithm’

We will use several models that sklearn already provides.

  1. Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

DTClass = DecisionTreeClassifier(criterion="gini", splitter="best", random_state=77)
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, df['cat_name'],
                                                    test_size=1/5, random_state=50)
DTClass.fit(X_train, y_train)
prediction = DTClass.predict(X_test)
print("accuracy score:")
print(accuracy_score(y_test, prediction))

accuracy score:
0.9134747407793191

2. Linear Support Vector Classification

from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

svc = LinearSVC()
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, df['cat_name'],
                                                    test_size=1/5, random_state=50)
svc.fit(X_train, y_train)
prediction = svc.predict(X_test)
print("accuracy score:")
print(accuracy_score(y_test, prediction))

accuracy score:
0.9575067468396383

3. Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

RFClass = RandomForestClassifier(n_estimators=500, criterion="gini", random_state=77)
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, df['cat_name'],
                                                    test_size=1/5, random_state=50)
RFClass.fit(X_train, y_train)
prediction = RFClass.predict(X_test)
print("accuracy score:")
print(accuracy_score(y_test, prediction))

accuracy score:
0.9435751148146394

4. Stochastic Gradient Descent

from sklearn.linear_model import SGDClassifier

SGDC = SGDClassifier()
# Note: the label column is 'cat_name'; the original snippet referenced a
# non-existent 'cat_id' column
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, df['cat_name'],
                                                    test_size=1/5, random_state=50)
SGDC.fit(X_train, y_train)
prediction = SGDC.predict(X_test)
print("accuracy score:")
print(accuracy_score(y_test, prediction))

accuracy score:
0.9263765920174234

LinearSVC has the highest performance, with an accuracy of about 0.96, which makes it the best of these algorithms for our task. To increase the score further, we can do more feature engineering: with better features, the model can learn better and improve its accuracy (one hedged example follows below). If you want to know more about feature engineering, you can follow our blog.
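As one hedged example of extra feature engineering (not covered in this article), the TfidfVectorizer can also be run over character n-grams and the two feature sets stacked together, which sometimes helps on short, noisy titles:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Sketch only: combine word-level and character-level tf-idf features
word_tfidf = TfidfVectorizer(min_df=5, ngram_range=(1, 2))
char_tfidf = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), min_df=5)
combined_features = hstack([word_tfidf.fit_transform(df.TITLE),
                            char_tfidf.fit_transform(df.TITLE)])

Whether this actually improves accuracy would need to be checked on the same train/test split as above.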
