Rosetta Stone

Text Classification Using Scikit-learn, PyTorch, and TensorFlow

Donglin Chen
Published in The Startup
9 min read · Jul 3, 2020

Text classification has been widely used in real-world business processes like email spam detection, support ticket classification, or content recommendation based on text topics.

Popular machine learning and deep learning libraries such as scikit-learn, PyTorch, and TensorFlow make it straightforward to build text classification models.

I wanted to learn how the three frameworks compare with each other, but I had not found working examples that build text classifiers with all of them and compare their performance. Here I build multi-class text classifiers using these libraries and see how they perform against each other. I will not go into detail about NLP, machine learning models, or deep learning network structures; there is already plenty of relevant information available online.

The source code I used to build and train the text classifiers in this article is available at https://github.com/donglinchen/text_classification.

Gather Data

The first step in training a model is to gather data that can be used for training. For example, if we were to build a support ticket classifier that automatically assigns tickets to a support team based on the problem description, we would gather the problem descriptions of past support cases together with the queue or class category that maps each one to a support team.

For demonstration purposes I used the BBC articles full-text and category data set, which is freely available online. After downloading the data and loading it into a pandas data frame, we can take a look at the first 5 rows of data.

(source code available at: https://github.com/donglinchen/text_classification/blob/master/gather_explore_data.ipynb)

There are 2225 rows in total, and the number of samples in each category is:

Category_Name    Record_Count
sport            511
business         510
politics         417
tech             401
entertainment    386

We can see that the data set is well balanced: each category has a similar number of samples. In practice we might find that most samples belong to just a few top categories; in that case we may want to drop some samples from those top categories to rebalance the data.
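As a rough illustration of the rebalancing idea, here is a minimal sketch that downsamples every category to the size of the smallest one. The toy data frame is invented for the example; only the column names mirror the ones used in this article, and `groupby(...).sample` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Toy, deliberately imbalanced data (not the BBC data set).
df = pd.DataFrame({
    "category": ["sport"] * 6 + ["tech"] * 3 + ["politics"] * 2,
    "text": [f"article {i}" for i in range(11)],
})

# Size of the smallest class: every class is cut down to this many rows.
smallest = df["category"].value_counts().min()

# Randomly keep `smallest` rows per category.
balanced = df.groupby("category").sample(n=smallest, random_state=42)

counts = balanced["category"].value_counts().to_dict()
print(counts)
```

Downsampling throws data away, so it is only one option; the `class_weight='balanced'` setting used later in the scikit-learn pipeline is another way to handle imbalance without discarding samples.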

Next we want to check the number of words in each row; the word count statistics are shown below. Across the 2225 records, each text has 390 words on average, with a minimum of 90 words, a maximum of 4492 words, and a median of 337 words.

count    2225.000000
mean      390.295281
std       241.753128
min        90.000000
25%       250.000000
50%       337.000000
75%       479.000000
max      4492.000000

The samples/words-per-sample ratio is 2225 / 390 ≈ 5.7. According to this article (https://developers.google.com/machine-learning/guides/text-classification/step-2-5), when the value of this ratio is small (<1500), small multi-layer perceptrons that take “bag of words” input perform better than, or at least as well as, sequence models. So when I build deep learning models with PyTorch and TensorFlow I will choose an MLP model instead of a sequence model.
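The ratio is simple enough to compute directly. The sketch below uses a hypothetical helper, `samples_per_word_ratio`, and assumes the median word count is an acceptable denominator; the article uses the mean. Either choice gives a value far below the 1500 threshold for this data set.

```python
from statistics import median

def samples_per_word_ratio(texts):
    """Number of samples divided by the median number of words per sample."""
    word_counts = [len(t.split()) for t in texts]
    return len(texts) / median(word_counts)

# Toy check: word counts are 3, 2 and 4, so the median is 3 and the ratio is 3 / 3.
toy_ratio = samples_per_word_ratio(["a b c", "d e", "f g h i"])

# Plugging in the statistics reported above for the BBC data set.
ratio_mean = 2225 / 390.295281   # using the mean word count
ratio_median = 2225 / 337        # using the median word count
print(toy_ratio, round(ratio_mean, 1), round(ratio_median, 1))
```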

Label encoding

Before the data is fed to a deep learning model for training, the text and the label category need to be converted to numeric data. Converting the label category to numeric values can be done using scikit-learn’s LabelEncoder.

df['encoded_category'] = LabelEncoder().fit_transform(df["category"])
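To see which integer each category receives, a small sketch on a made-up list of labels: LabelEncoder sorts the distinct labels and uses their position as the code, and `inverse_transform` maps codes back to names, which is handy when reading model predictions.

```python
from sklearn.preprocessing import LabelEncoder

categories = ["tech", "sport", "business", "sport", "politics", "entertainment"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)

# classes_ holds the original labels in sorted order; the index is the code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into readable labels.
decoded = list(encoder.inverse_transform(encoded))
print(decoded)
```

Keeping the fitted encoder around (rather than creating it inline and discarding it) is what makes the reverse mapping possible later.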

Feature extraction

During data exploration we learnt that we can use a “bag of words” approach to extract input features from text. Here I chose to convert the collection of raw documents to a matrix of TF-IDF (term frequency-inverse document frequency) features; TF-IDF weighs how relevant a particular word is to a document relative to the rest of the corpus. Applying TF-IDF encoding to text can be done using scikit-learn’s TfidfVectorizer class. The TfidfVectorizer below uses all default parameters, which apply standard tokenization and lowercasing to the input texts.

https://github.com/donglinchen/text_classification/blob/master/feature_extraction.ipynb

vectorizer = TfidfVectorizer()
x_train_transformed = vectorizer.fit_transform(x_train)
x_test_transformed = vectorizer.transform(x_test)
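The split between `fit_transform` on the training texts and `transform` on the test texts matters: the vocabulary and IDF weights come from the training data only. A minimal sketch with invented sentences shows the effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["the match ended in a draw", "shares fell as the market closed"]
test_texts = ["the market rallied after the match"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_texts)        # reuses them, learns nothing new

vocab = vectorizer.vocabulary_
print(train_matrix.shape, test_matrix.shape)

# Words unseen during fitting ("rallied", "after") have no column and are dropped.
print("rallied" in vocab, "market" in vocab)

# Only "the", "market" and "match" survive in the test row.
print(test_matrix.nnz)
```

Both matrices are SciPy sparse matrices with one column per vocabulary term, so the test matrix always has the same width as the training matrix.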

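Alternatively, Keras provides Tokenizer and pad_sequences to convert text sentences into a matrix of integer sequences. In addition, we can apply word embedding, which allows words with similar meaning to have a similar representation.

# TF 2.x import paths; these utilities are deprecated in newer Keras releases
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 20000  # every sequence is padded (or truncated) to this length
tokenizer = Tokenizer(num_words=50000, oov_token='<oov>')
tokenizer.fit_on_texts(x_train)  # build the word index from the training texts only
word_index = tokenizer.word_index
x_seq = tokenizer.texts_to_sequences(x_train)
train_padded = pad_sequences(x_seq, padding='post', maxlen=max_len)
test_padded = pad_sequences(tokenizer.texts_to_sequences(x_test), padding='post', maxlen=max_len)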

After transforming the 1780 training texts with TfidfVectorizer, the sample data became a 1780 by 26501 matrix: one column per vocabulary term, holding a floating point TF-IDF weight for each term that occurs in the text. For example, the first training sample:

tv future in the hands of viewers with home theatre systems  plasma high-definition tvs and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time...

became:

[0.0953176 , 0.0616137 , 0.07514611, 0.05589101, 0.06576305, ...,
0.13084726, 0.04227424, 0.03356176]

On the other hand, if we used the Keras Tokenizer and pad_sequences, the training sample would be converted into something like:

[4353, 2571,  361,  629, 3112, 1528, 3331, 4353, 2572,  653, ...,
0, 0, 0, 0, 0]
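To make the integer encoding less mysterious, here is a pure-Python sketch of the idea. It is not the Keras implementation: the helper names are hypothetical, punctuation handling is skipped, and long texts are cut at the end, whereas the Keras defaults may differ. It only mimics the convention of reserving 0 for padding and 1 for out-of-vocabulary words.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer: 1 = OOV token, then most frequent words first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_padded_sequences(texts, index, max_len, oov_token="<OOV>"):
    """Replace words by their index, then right-pad with zeros up to max_len."""
    rows = []
    for text in texts:
        seq = [index.get(w, index[oov_token]) for w in text.lower().split()]
        seq = seq[:max_len]  # simplification: keep the start of long texts
        rows.append(seq + [0] * (max_len - len(seq)))
    return rows

index = fit_word_index(["tv future in the hands of viewers", "the future of tv"])
padded = to_padded_sequences(["the future of radio"], index, max_len=6)
print(padded)
```

The word "radio" never appeared while fitting, so it maps to the OOV index, and the trailing zeros are the "post" padding seen in the sample output above.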

Build, train, and evaluate models

Before we build models we need to split the data into train and test data sets, so we can train a model on the train set and then measure its accuracy on the test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['category'], test_size=.2, stratify=df['category'], random_state=42)
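The `stratify` argument keeps the class proportions the same in both splits. A minimal sketch on an invented, imbalanced data frame illustrates this; with 20 rows and `test_size=0.2`, the 4 test rows follow the 50/25/25 class mix of the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 sport, 5 tech and 5 politics documents.
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(20)],
    "category": ["sport"] * 10 + ["tech"] * 5 + ["politics"] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"],
    test_size=0.2, stratify=df["category"], random_state=42)

test_counts = y_test.value_counts().to_dict()
print(len(X_train), len(X_test), test_counts)
```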

1. Build classification models using scikit-learn

Let’s first build traditional machine learning models using scikit-learn, the popular open source machine learning library.

I built two machine learning models using scikit-learn: a linear classifier trained with Stochastic Gradient Descent (SGD) and a Support Vector Machine. For simplicity I used Pipeline, which combines feature extraction and model definition into a single estimator. Below is the pipeline that combines TfidfVectorizer with SGDClassifier.

https://github.com/donglinchen/text_classification/blob/master/model_scikit_learn.ipynb

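from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("tfidf_vector_com", TfidfVectorizer(
        input="content",     # raw strings; "array" is not an accepted value
        norm="l2",
        max_features=None,
        sublinear_tf=True,   # use 1 + log(tf) instead of raw term counts
        stop_words="english")),
    ("clf", SGDClassifier(
        loss="log_loss",     # logistic regression; spelled "log" before scikit-learn 1.1
        penalty="l2",
        class_weight='balanced',
        tol=0.001))])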

To train the model:

pipeline.fit(X_train, y_train)

After training completed, I used the model to predict labels for the test data set and compared them with the true labels.

pred_test = pipeline.predict(X_test)
pred_train = pipeline.predict(X_train)
print("test accuracy", str(np.mean(pred_test == y_test)))
print(metrics.classification_report(y_test, pred_test))

The result shows that the model achieved a test accuracy of 98.2%, with the metrics report below.

               precision    recall  f1-score   support

     business       0.98      0.97      0.98       102
entertainment       0.97      1.00      0.99        77
     politics       0.98      0.96      0.97        84
        sport       1.00      1.00      1.00       102
         tech       0.97      0.97      0.97        80

     accuracy                           0.98       445
    macro avg       0.98      0.98      0.98       445
 weighted avg       0.98      0.98      0.98       445
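The article mentions a Support Vector Machine model but only shows the SGD pipeline. As a hedged sketch of what an SVM variant could look like, the snippet below swaps in LinearSVC. The hyper-parameters and the tiny corpus are assumptions for illustration, not the settings or data from the author's notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny stand-in corpus so the snippet runs on its own.
texts = [
    "the striker scored a late goal", "the team won the cup final",
    "shares rose on strong profits", "the bank cut interest rates",
]
labels = ["sport", "sport", "business", "business"]

svm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True, stop_words="english")),
    ("clf", LinearSVC(class_weight="balanced")),  # assumed settings, not the author's
])
svm_pipeline.fit(texts, labels)

prediction = svm_pipeline.predict(["profits at the bank rose"])[0]
print(prediction)
```

Because the classifier is just the last step of the pipeline, switching models means changing one line while the feature extraction stays the same.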

2. Build deep learning classification model using PyTorch

As above, we can use TF-IDF to extract features from the sample input text. However, we have to convert both the TF-IDF encoded sparse matrix and the label-encoded values into PyTorch tensors, as below:

https://github.com/donglinchen/text_classification/blob/master/model_pytorch.ipynb

import torch

# Densify the SciPy sparse TF-IDF matrices before handing them to PyTorch
x_train = torch.tensor(x_train.toarray()).float()
x_test = torch.tensor(x_test.toarray()).float()
# y_train / y_test hold the integer-encoded categories
y_train = torch.tensor(y_train.values)
y_test = torch.tensor(y_test.values)
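A self-contained sketch of the same conversion on two invented sentences makes the resulting shapes and dtypes visible. The label values here are placeholders.

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the match ended in a draw", "shares fell as the market closed"]
sparse_features = TfidfVectorizer().fit_transform(texts)  # SciPy CSR matrix

# toarray() returns a plain NumPy array; float32 matches nn.Linear's default dtype.
dense_features = torch.from_numpy(sparse_features.toarray()).float()

# Integer class ids should be int64 (torch.long) for NLLLoss / CrossEntropyLoss.
labels = torch.tensor([0, 1], dtype=torch.long)

print(dense_features.shape, dense_features.dtype, labels.dtype)
```

Densifying is fine at this scale: a 1780 by 26501 float32 matrix takes roughly 190 MB. For much larger corpora, converting one mini-batch at a time would be a safer choice.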

Then we can build a simple neural network. There are many optimizers available in the PyTorch API; here I chose the popular Adam optimizer and found that it worked pretty well. Below is the neural network structure I constructed.

from torch import nn, optim

model = nn.Sequential(
    nn.Linear(x_train.shape[1], 64),
    nn.ReLU(),
    nn.Linear(64, df['category'].nunique()),
    nn.LogSoftmax(dim=1))
# Define the loss (negative log-likelihood, paired with LogSoftmax above)
criterion = nn.NLLLoss()
# Forward pass: returns log-probabilities
logps = model(x_train)
# Calculate the loss from the log-probabilities and the labels
loss = criterion(logps, y_train)
loss.backward()
# Optimizers need parameters to optimize and a learning rate
optimizer = optim.Adam(model.parameters(), lr=0.002)
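A side note on the loss: a LogSoftmax output layer followed by NLLLoss computes the same quantity as CrossEntropyLoss applied to raw scores. The small check below uses random numbers, not the article's data, to confirm the equivalence.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 5)            # 4 samples, 5 classes (raw scores)
targets = torch.tensor([0, 3, 1, 4])  # integer class ids

# Variant used in the article: explicit log-softmax followed by NLLLoss.
loss_nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# Equivalent shortcut: CrossEntropyLoss applied directly to the raw logits.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

losses_match = torch.allclose(loss_nll, loss_ce)
print(losses_match)
```

Keeping LogSoftmax in the model, as done here, has the small convenience that `torch.exp` of the output gives class probabilities directly.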

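To train the neural network for 50 epochs:

epochs = 50
for e in range(epochs):
    model.train()          # switch back to training mode after any evaluation
    optimizer.zero_grad()  # clear gradients left over from the previous step
    output = model(x_train)
    loss = criterion(output, y_train)
    loss.backward()
    optimizer.step()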

To evaluate the model while training, we can use the following:

with torch.no_grad():
    model.eval()
    log_ps = model(x_test)
    test_loss = criterion(log_ps, y_test)
    ps = torch.exp(log_ps)
    top_p, top_class = ps.topk(1, dim=1)
    equals = top_class == y_test.view(*top_class.shape)
    test_accuracy = torch.mean(equals.float())

The training process took about 3 seconds to complete. After 13 epochs the model had reached 90% test accuracy, and it continued to climb to 98.2% at 50 epochs.

3. Build deep learning classification model using TensorFlow

So far I have used TF-IDF to extract features from the input text. We can do the same with TensorFlow, or we can use padded sequences and word embedding. I am going to implement both approaches so we can compare the performance of the three libraries and see how to train a model with different feature extraction methods.

3.1. Use TfidfVectorizer to extract features

https://github.com/donglinchen/text_classification/blob/master/model_tensorflow_tfidf.ipynb

After transforming the input texts into a TF-IDF encoded matrix, we can construct a neural network with two hidden layers with “relu” activation functions and an output layer with a “softmax” activation function:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(48, activation='relu', input_shape=(x_train.shape[1],)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(df['category'].nunique(), activation='softmax')
])

Dropout was added after each hidden layer to reduce overfitting. The model uses “sparse_categorical_crossentropy” as the loss function because this is a multi-class problem whose labels are integer-encoded rather than one-hot encoded. I also chose the popular “Adam” optimizer.

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

And here is the model summary:

Model: "sequential_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_62 (Dense)             (None, 48)                1286208
_________________________________________________________________
dropout_45 (Dropout)         (None, 48)                0
_________________________________________________________________
dense_63 (Dense)             (None, 24)                1176
_________________________________________________________________
dropout_46 (Dropout)         (None, 24)                0
_________________________________________________________________
dense_64 (Dense)             (None, 5)                 125
=================================================================
Total params: 1,287,509
Trainable params: 1,287,509
Non-trainable params: 0
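The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. In the sketch below, the input width of 26795 is not stated in the article; it is inferred from the first layer's count, (1286208 - 48) / 48.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 26795  # inferred from the summary, not stated in the article

layer_params = [
    dense_params(n_features, 48),  # first hidden layer
    dense_params(48, 24),          # second hidden layer
    dense_params(24, 5),           # output layer, one unit per category
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

Almost all of the weights sit in the first layer, which is typical when a wide bag-of-words input feeds a small network.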

Then we can train the model for 10 epochs:

history = model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test), verbose=2)

Feeding TF-IDF vectorized input into a network with two hidden layers, and applying dropout after each hidden layer to prevent overfitting, we can achieve 98.4% test accuracy in just 8 epochs. It took 10 seconds to run through all 10 epochs.

3.2 Use padded sequence and word embedding for feature extraction

https://github.com/donglinchen/text_classification/blob/master/model_tensorflow.ipynb

vocab_size = 20000
tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(x_train)
word_index = tokenizer.word_index
x_seq = tokenizer.texts_to_sequences(x_train)
padding_type = 'post'
max_len = 5000
x_train = pad_sequences(x_seq, padding=padding_type, maxlen=max_len)
x_test = pad_sequences(tokenizer.texts_to_sequences(x_test), padding=padding_type, maxlen=max_len)

The network structure below includes a word embedding layer, so each token in the input sequences is mapped to a dense multi-dimensional vector that is learned during training. GlobalAveragePooling1D then averages these vectors over the sequence dimension, which reduces the input size for efficient computation. Additionally, dropout was added to reduce overfitting.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32, input_length=x_train.shape[1]),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(df['category'].nunique(), activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Below is the model summary.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_11 (Embedding)     (None, 5000, 32)          640000
_________________________________________________________________
global_average_pooling1d_8 ( (None, 32)                0
_________________________________________________________________
dropout_23 (Dropout)         (None, 32)                0
_________________________________________________________________
dense_24 (Dense)             (None, 32)                1056
_________________________________________________________________
dropout_24 (Dropout)         (None, 32)                0
_________________________________________________________________
dense_25 (Dense)             (None, 5)                 165
=================================================================
Total params: 641,221
Trainable params: 641,221
Non-trainable params: 0
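What the embedding and pooling layers do to the shapes can be mimicked in a few lines of NumPy. This is an illustration of the arithmetic with toy sizes and random weights, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, max_len = 20, 4, 6  # toy sizes; the article uses 20000, 32, 5000

# An embedding layer is a lookup table with one trainable row per token id.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

padded_batch = np.array([[4, 3, 5, 1, 0, 0],
                         [7, 2, 0, 0, 0, 0]])  # shape (batch, max_len)

embedded = embedding_table[padded_batch]  # shape (batch, max_len, embed_dim)
pooled = embedded.mean(axis=1)            # shape (batch, embed_dim)

print(embedded.shape, pooled.shape)

# The embedding row in the summary above: one 32-dim vector per vocabulary entry.
embedding_params = 20000 * 32
print(embedding_params)
```

Note that a plain average like this also includes the vectors looked up for the padding id 0. As far as I can tell, the model above behaves the same way unless masking is enabled on the embedding layer, which may partly explain why very long padding hurts.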

The training process took about 41 seconds to complete. After 26 epochs the test accuracy reached 95%, and after 32 epochs the accuracy stayed at 96%. Without TF-IDF encoding, the accuracy is slightly worse than that of the previous models.

Conclusion

All three popular machine learning / deep learning frameworks can be used to build multi-class text classification models. In this experiment, all 3 frameworks gave similar model accuracy of about 98%. Using TF-IDF to encode text into a floating point matrix provided better model accuracy than padded sequences with word embedding.

Scikit-learn is relatively simple to use and contains a great number of ready-to-use estimators. In addition, it provides a model_selection module to automatically tune the hyper-parameters of an estimator; however, an exhaustive grid search may take a long time.
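As a sketch of the tuning workflow, the snippet below runs GridSearchCV over a small pipeline. The corpus and the parameter grid are invented for illustration and are not what was used in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = [
    "the striker scored a late goal", "the team won the cup final",
    "the coach praised the young keeper", "fans cheered the winning goal",
    "shares rose on strong profits", "the bank cut interest rates",
    "the market fell as profits dropped", "investors sold bank shares",
]
labels = ["sport"] * 4 + ["business"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=42)),
])

# Hypothetical grid: "<step name>__<parameter>" addresses a parameter inside the pipeline.
param_grid = {
    "tfidf__sublinear_tf": [True, False],
    "clf__alpha": [1e-4, 1e-3],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_, search.best_score_)
```

Even this tiny grid means 4 candidates times 2 folds, so 8 fits plus a final refit. The count multiplies with every added parameter, which is why exhaustive searches get slow.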

On the other hand, deep learning frameworks like PyTorch and TensorFlow are a little more complex to use, in my opinion. But these frameworks make it easy to continue training from previously trained model weights, and they allow us to apply transfer learning to save training effort and time. They also allow us to increase model complexity by adding more network layers.
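Continuing from saved weights can be sketched in PyTorch with a state dict. The helper `build_model` and the layer sizes are made up for the example, and an in-memory buffer stands in for a file on disk.

```python
import io

import torch
from torch import nn

def build_model(n_features=10, n_classes=5):
    """Hypothetical helper that recreates the same architecture on demand."""
    return nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_classes))

trained = build_model()

# Serialize only the weights; a file path would work in place of the buffer.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

# Later: rebuild the same architecture, restore the weights, and keep training.
resumed = build_model()
resumed.load_state_dict(torch.load(buffer))

same = all(torch.equal(a, b) for a, b in zip(trained.state_dict().values(),
                                             resumed.state_dict().values()))
print(same)
```

From here, creating a new optimizer over `resumed.parameters()` and running the usual training loop picks up where the earlier run left off.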

In this specific text classification example the training data set is relatively small and the categories are well balanced, so using scikit-learn to build classification models was simple and worked very well. However, if we have a large number of complex texts, we might want to try building a deep learning network for the classification.

Scikit-learn, PyTorch, or TensorFlow: which would you choose for your next text classification project?
