Text Classification with Machine Learning vs Deep Learning

Hernan Matzner
19 min read · Jan 15, 2023


How a simple Machine Learning model can outperform an overconfident Neural Network when precision is key

Image created with DALL-E

In this article, we will be doing an in-depth exploration of text classification, a very popular task in natural language processing, by examining a real-world project. We will delve into key concepts and techniques used in this field, as well as the challenges and considerations involved in implementing it in a practical setting.

The ability to accurately categorize and classify text based on its content is crucial in today’s world of vast amounts of unstructured data, so in this opportunity we will be exploring the intricacies of labeling data that includes a diverse array of URLs and HTML files. By using cutting-edge techniques and algorithms, we will compare the performance of several machine learning and deep learning models to predict the themes present within these texts.

Topics:

  1. Motivation
  2. Extracting the relevant text
  3. The approaches
  4. How to preprocess the text
  5. First approach: Bag of Words and TF-IDF + Machine Learning
  6. What was misclassified and why
  7. Precision vs Recall
  8. Second approach: DistilBERT + Machine Learning
  9. Third approach: DistilBERT + Deep Learning
  10. Conclusions

1. Motivation

The goal of this project is to create a pipeline that, given a URL or HTML file as input, can handle both scenarios and predict its subject matter.

The six classes that we are going to work with are Webinar, Event, Press release, Article, Blog, and MISC, the last one being a catch-all category comprising all the texts that don’t belong to any of the other five classes.

To keep things simple, we will refer to both URLs and HTML files as simply URLs, regardless of which of the two was provided. When we later encounter a column containing URLs, any row where the URL was not provided will hold a missing value instead. This helps avoid confusion and keeps the naming consistent throughout the article.

2. Extracting the relevant text

We want to gather all the text from within a webpage that will help our model classify its theme. But a problem we may face when web scraping the URLs is that much of the extracted text is irrelevant: unrelated content, such as links to other articles, won’t contribute to the classification task and may even hurt the model.

So we need some type of filter, and for this we will use the newspaper3k library, which will do the work for us. This module is mostly geared towards newspaper texts, and it provides useful tools to extract the main text from an HTML.
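Although we won’t dwell on this step, here is a minimal sketch of what the extraction could look like, covering both input scenarios. Note that extract_main_text is a hypothetical helper name, and the sketch assumes newspaper3k’s Article API together with its input_html argument to download():

from newspaper import Article


def extract_main_text(url: str = None, html: str = None) -> str:
    # newspaper3k expects a URL even when parsing raw HTML,
    # so we fall back to an empty string in that case.
    article = Article(url or '')
    # Passing input_html skips the network request and parses
    # the provided HTML directly.
    article.download(input_html=html)
    article.parse()
    return article.text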

In this article, we are going to focus on the modeling part and the metrics of the different algorithms, without delving deep into the preprocessing of the text.

3. The approaches

We will be exploring three different models for predicting the correct label for our URLs:

  1. Bag of Words (BoW) with a weighting of term frequency-inverse document frequency (TF-IDF). This weighting considers the importance of a word based on its frequency in a document and its rarity across other documents. We will then evaluate the model’s performance using cross-validation and compare the results of different classifiers such as Logistic Regression, SVM, KNN, Multinomial NB, and Random Forest. The one with the highest accuracy will be selected and we will assess its performance on the test data.
  2. The best path to take may be intuitively making a model interpret a task the same way humans would, and the previous approach lacks something crucial for that, which is context. To capture the relationship between words, we will be using Transformers with their attention mechanism. Our model will be the pre-trained DistilBERT base model from HuggingFace, which will be fine-tuned with our data. We will then use a machine learning model, selected through cross-validation in the same way explained before, and check its metrics on the test data.
  3. A similar insight to the second one, but now we will aim to uncover more complex patterns through the use of Deep Learning and neural networks. We will be applying this method to predict the correct labels for our URLs and make a final comparison between all three models.

4. How to preprocess the text

The preprocessing part of the pipeline is a very important step, as it can greatly impact the model’s performance. Depending on which model will be used, the original text may need to be modified into the most appropriate format to feed the model.

When using Bag of Words, we want all similar words (e.g. we and us, or read and reading) to be reduced to their base form, so the model’s dimensionality shrinks by collapsing inflected and related forms into a single feature, which results in a more efficient model. To do this, we will extract the lemma of every token in the text and remove all stop words and every symbol that won’t contribute to the model, which translates into lemmatization and cleaning of the text.
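For illustration, a lemmatization and cleaning step along these lines could produce such a base-form version of the text. This is a sketch assuming spaCy and its small English model (en_core_web_sm); lemmatize_text is a hypothetical helper, not the project’s exact preprocessing:

import spacy

nlp = spacy.load('en_core_web_sm')


def lemmatize_text(text: str) -> str:
    doc = nlp(text)
    # Keep alphabetic, non-stop-word tokens and reduce each one to its lemma.
    return ' '.join(token.lemma_.lower() for token in doc
                    if token.is_alpha and not token.is_stop)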

On the other hand, if the context of the text is what we aim to focus on, then different words should not be merged into a single base form. DistilBERT was trained on a huge corpus of text that includes the whole English Wikipedia and Toronto Book Corpus and is able to interpret all available information in a sentence. This is why it is more appropriate to keep the text the same way it has been written and take it as input for the model.

5. First approach: Bag of Words and TF-IDF + Machine Learning

As a result of all the preprocessing steps needed for the text, we will start here with a df variable representing a pandas DataFrame. Its shape is (857, 4), where 857 is the total number of URLs to classify and 4 is the number of columns:

  • url, a pandas Series of strings that contains all URLs,
  • text, a pandas Series of strings containing the relevant text extracted from each URL,
  • lemmatized_text, a pandas Series of strings of a lemmatized and cleaned version of the text column, mainly useful for this first model, and
  • label, a pandas Series of integers that represent an encoded version of the true labels.

Let’s write our first code cell, assigning these four strings to variables, since they will be used repeatedly later on, and encoding our labels so we can work with numbers instead:

URL = 'url'
TEXT = 'text'
LEMMATIZED = 'lemmatized_text'
TARGET = 'label'

labels_encoded = {'Article': 0, 'Blog': 1, 'Event': 2,
                  'Webinar': 3, 'PR': 4, 'MISC': 5}

labels_decoded = {y: x for x, y in labels_encoded.items()}

num_labels = len(labels_encoded)

df[TARGET] = df[TARGET].replace(labels_encoded)

Now it’s time to split our data into train and test sets, stratifying by label to make sure every class is well represented in both training and testing, since we don’t have much data. Let’s write a function for this that will also be useful later:

import pandas as pd
from sklearn.model_selection import train_test_split
from typing import List, Dict, Tuple, Union, Any


def split_data(df: pd.DataFrame, column: str, test_size: float = 0.2,
               val_size: float = None, random_state: int = None
               ) -> Union[Tuple[pd.DataFrame, pd.DataFrame,
                                pd.Series, pd.Series],
                          Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame,
                                pd.Series, pd.Series, pd.Series]]:

    X = df[[column]]
    y = df[TARGET]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )

    if val_size:
        # The validation split is taken out of the remaining train data,
        # so its fraction is rescaled relative to (1 - test_size).
        X_train, X_val, y_train, y_val = train_test_split(
            X_train, y_train,
            test_size=val_size / (1 - test_size),
            stratify=y_train,
            random_state=random_state
        )

        return X_train, X_val, X_test, y_train, y_val, y_test

    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = split_data(df, LEMMATIZED,
                                              test_size=0.2, random_state=0)

To conclude the preprocessing part of our first model, we will use TF-IDF to vectorize our data using only unigrams:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_train_tr = vectorizer.fit_transform(X_train[LEMMATIZED])
X_test_tr = vectorizer.transform(X_test[LEMMATIZED])

We could also have used bigrams by setting ngram_range=(1, 2), or even trigrams, but this would have greatly increased the output’s dimensionality, making the model more complex, and, as was also tested for this particular task, it actually worsened the metrics.

To determine the most effective classifier for this first approach, we will experiment with a variety of them and select the one that performs with the highest accuracy.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

clfs = [
    ('LogisticRegression', LogisticRegression(max_iter=3000,
                                              class_weight='balanced')),
    ('RandomForest', RandomForestClassifier(max_depth=18,
                                            n_estimators=75,
                                            random_state=0)),
    ('KNN 5', KNeighborsClassifier(n_neighbors=5)),
    ('SVM C1', SVC(C=1, class_weight='balanced')),
    ('MultinomialNB', MultinomialNB()),
]

We will use StratifiedKFold from sklearn.model_selection and print the accuracy of each of the five splits for every classifier, along with their average, which will let us decide which one to choose.

Let’s write the functions to do so:

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score


def print_val_scores(scores: List[float]) -> None:
    print(f'Cross validation scores: mean: {np.mean(scores):.3f}, '
          f'all: {[round(score, 3) for score in scores]}')


def print_stratified_kfold(clfs: List[Tuple[str, Any]], X_train: pd.DataFrame,
                           y_train: pd.Series, n_splits: int = 5) -> None:

    for clf in clfs:
        print(f'\nStratifiedKFold - classifier: {clf[0]}:\n')
        skf = StratifiedKFold(n_splits=n_splits)

        # Pass the StratifiedKFold object as cv so cross_val_score
        # uses exactly these stratified splits.
        scores = cross_val_score(clf[1],
                                 X_train,
                                 y_train,
                                 cv=skf)

        print_val_scores(scores)


print_stratified_kfold(clfs, X_train_tr, y_train)

Output:

Cross-validation results

Logistic Regression outperforms every other classifier with an average accuracy of 0.78, so this is the model that will be used for the test set.

When performing cross-validation, we choose our model based on accuracy just to simplify the task, since it is still a good metric to pay attention to. But because this is a multiclass classification problem with imbalanced data, we’ll focus mostly on the model’s f1-score when evaluating its performance on the test data.
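As a side note, if we wanted the model selection itself to be driven by f1 instead, cross_val_score accepts a scoring argument; a macro-averaged f1 is a natural choice for imbalanced classes. A quick sketch, not what was done in the project:

# Macro-averaged f1 weighs every class equally, regardless of its size.
scores = cross_val_score(clfs[0][1], X_train_tr, y_train,
                         cv=5, scoring='f1_macro')
print_val_scores(scores)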

Let’s see our results on the test set with our selected model:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

clf = LogisticRegression(
    max_iter=3000,
    class_weight='balanced',
)

clf.fit(X_train_tr, y_train)

y_pred = clf.predict(X_test_tr)
y_probs = clf.predict_proba(X_test_tr)
accuracy = np.mean(y_pred == y_test)

y_test_labeled = [labels_decoded[x] for x in y_test]
y_pred_labeled = [labels_decoded[x] for x in y_pred]

ConfusionMatrixDisplay.from_predictions(y_test_labeled, y_pred_labeled)
plt.title(f'Logistic Regression - acc {accuracy:.3f}', size=15)
plt.show()

Output:

Confusion Matrix of BoW + ML approach

print(classification_report(y_test_labeled, y_pred_labeled))

Output:

Classification report of BoW + ML approach

Here we have the confusion matrix and classification report of our first model. The accuracy is 0.79 and the f1-score is 0.78. We can observe which classes look alike to the model, such as articles and press releases, or webinars and events.

When a model gets confused between different classes, it’s important to inspect the data to understand why this happens. It could be that the algorithm is not properly working, or that the nature of the data is such that confusion between certain classes is to be expected.

In this particular case, when visually inspecting the misclassified samples (which will be covered in the next step), there are cases where articles, blogs, and press releases, or webinars and events, are difficult to tell apart even for humans, so most of the model’s mistakes are understandable.

6. What was misclassified and why

Proceeding to the next step, we will examine the mistakes our model made when trying to predict the correct label. For this, we will create a pandas DataFrame with shape (n, 6), where n is the number of misclassified samples and 6 the number of columns:

  1. URL is an array of strings containing the relevant URLs,
  2. LEMMATIZED is an array of strings containing the preprocessed texts that were input to the model,
  3. y_true contains the true label of each misclassified row,
  4. conf_true gives the model’s confidence in each true (but not predicted) label,
  5. y_pred has the predicted label of the model on each misclassified sample, and
  6. conf_pred gives the model’s confidence in each predicted label, which is the highest across all six classes.

Let’s translate this into a function:

def create_df_mistakes(df: pd.DataFrame, column: str,
                       X_test: pd.DataFrame, y_test: pd.Series,
                       y_pred: Union[np.ndarray, pd.Series],
                       y_probs: np.ndarray) -> pd.DataFrame:

    if isinstance(y_test, pd.Series):
        y_pred = pd.Series(y_pred, index=y_test.index)

    elif isinstance(y_test, np.ndarray):
        y_pred = pd.Series(y_pred)
        y_test = pd.Series(y_test)

    mask = y_pred != y_test

    df2 = X_test.copy()[mask]
    df2['y_true'] = y_test[mask].replace(labels_decoded)
    df2['y_pred'] = y_pred[mask].replace(labels_decoded)

    assert (df2['y_true'] != df2['y_pred']).all()

    df_mistakes = pd.merge(df2, df[[URL, column]], on=column)
    df_mistakes.index = df2.index

    # Map the labels back to integers to index into the probability matrix.
    df_confidences = df_mistakes[['y_true', 'y_pred']]\
        .applymap(lambda x: labels_encoded[x])

    confidence_pred = y_probs[mask, df_confidences['y_pred']]
    confidence_true = y_probs[mask, df_confidences['y_true']]

    df_mistakes['conf_true'] = confidence_true.round(2)
    df_mistakes['conf_pred'] = confidence_pred.round(2)

    df_mistakes = df_mistakes[[URL, column, 'y_true',
                               'conf_true', 'y_pred', 'conf_pred']]

    return df_mistakes


df_mistakes = create_df_mistakes(df, LEMMATIZED, X_test,
                                 y_test, y_pred, y_probs)

df_mistakes.drop(columns=[URL, LEMMATIZED]).head()

Due to non-disclosure agreement reasons, the URL and text input columns have been removed from the view of the pandas DataFrame using .drop(columns=[URL, LEMMATIZED]).

Output:

View of five misclassified samples

7. Precision vs Recall

This is perhaps the most interesting part of the project and the one we should pay the most attention to. Our first results are quite good, but when carrying out a real-world project, it’s essential to focus on the specific needs and requirements of our customer, who could be a colleague from our department, another department within the company, or an external client from another company. This ensures that the project meets their expectations and delivers the desired results.

In the original task that inspired this article, being certain that every classified sample is correctly classified is significantly more important than correctly classifying every existing sample. This means: precision is much more crucial than recall.

Given this situation, we could classify only the samples where the model’s confidence is above a certain threshold (for example 60%) and expect the classifier’s precision to improve, since the model’s confidence should be higher on the samples it classifies correctly than on the ones it gets wrong.
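To make the idea concrete, here is a minimal sketch of how both metrics could be computed under such a confidence filter. metrics_at_threshold is a hypothetical helper: precision here means the fraction of confidently classified samples that are correct, and recall the fraction of all samples that end up both kept and correct, matching how the terms are used in the rest of the article:

def metrics_at_threshold(y_test: pd.Series, y_pred: np.ndarray,
                         y_probs: np.ndarray, threshold: float = 0.6
                         ) -> Tuple[float, float]:

    confidence = y_probs.max(axis=1)
    kept = confidence >= threshold  # samples the model is allowed to classify
    correct = y_pred == np.asarray(y_test)

    # Precision: of the samples we kept, how many are right?
    precision = correct[kept].mean() if kept.any() else float('nan')
    # Recall: of all samples, how many are both kept and right?
    recall = (correct & kept).sum() / len(correct)

    return precision, recall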

Let’s write some code to plot a histogram of the model’s confidences and the cumulative distribution in both scenarios:

import seaborn as sns


def plot_distributions_of_confidence(y_test: pd.Series, y_pred: np.ndarray,
                                     y_probs: np.ndarray,
                                     print_statistical_measures: bool = False
                                     ) -> None:

    sns.set_theme()

    mask = y_test != y_pred

    wrong_conf_pred = np.max(y_probs[mask], axis=1)
    right_conf_pred = np.max(y_probs[~mask], axis=1)
    assert y_probs.shape[0] == wrong_conf_pred.shape[0] + right_conf_pred.shape[0]

    if print_statistical_measures:
        print(f'Confidence of incorrectly classified samples \t- Median: '
              f'{np.median(wrong_conf_pred):.4f}, Mean: {np.mean(wrong_conf_pred):.4f}.')
        print(f'Confidence of correctly classified samples \t- Median: '
              f'{np.median(right_conf_pred):.4f}, Mean: {np.mean(right_conf_pred):.4f}.\n')

    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

    ax[0, 0].hist(wrong_conf_pred, bins=np.linspace(0, 1, 11), color='r', alpha=0.6)
    ax[0, 0].set_title('Incorrectly classified samples', size=16)
    ax[0, 0].set_xlabel('Confidence')
    ax[0, 0].set_ylabel('Number of samples')

    ax[0, 1].hist(right_conf_pred, bins=np.linspace(0, 1, 11), color='g', alpha=0.6)
    ax[0, 1].set_title('Correctly classified samples', size=16)
    ax[0, 1].set_xlabel('Confidence')
    ax[0, 1].set_ylabel('Number of samples')

    ax[1, 0].hist(wrong_conf_pred, bins=np.linspace(0, 1, 11), density=True,
                  color='r', alpha=0.6, cumulative=1)
    ax[1, 0].set_xlabel('Confidence')
    ax[1, 0].set_ylabel('Cumulative distribution')

    ax[1, 1].hist(right_conf_pred, bins=np.linspace(0, 1, 11), density=True,
                  color='g', alpha=0.6, cumulative=1)
    ax[1, 1].set_xlabel('Confidence')
    ax[1, 1].set_ylabel('Cumulative distribution')

    plt.tight_layout()
    plt.show()

    sns.reset_orig()


plot_distributions_of_confidence(y_test, y_pred, y_probs)

Output:

Distributions of confidence with BoW + ML approach

From our perspective, there are both positive and negative aspects to consider when analyzing the plot.

Let’s examine the graphs on the left. We can see that when the model makes an error in predicting a sample, its confidence is typically quite low (most fall between 20–30%). If we set a threshold as previously mentioned, at 60%, we would effectively eliminate 100% of misclassified samples, resulting in a precision of 100% as every sample we label would be accurate.

Let’s now turn our attention to the graphs on the right. We can see that by implementing this strategy, our recall rate will be significantly lower. This is because the model is not always highly confident in its correct classifications either, and by setting a threshold of 60% we will be disregarding roughly 75% of the correctly classified samples. And that’s a lot…

Precision is more important than recall, but usually we can’t afford such a loss in the latter. The model’s overall low confidence levels prevent us from having better results, so let’s now inspect our second approach and see if the results are more promising.

8. Second approach: DistilBERT + Machine Learning

Time to put some context into the equation. These variables will be used for both our Machine Learning and Deep Learning approaches:

from transformers import AutoTokenizer, TFAutoModel

distilbert_model = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(distilbert_model)

tf_model = TFAutoModel.from_pretrained(distilbert_model)

We will start this approach by writing and using three functions that we need for creating a DatasetDict object from our train and test datasets, tokenizing the text data in a batch, and getting its hidden states:

from datasets.dataset_dict import DatasetDict
from datasets import Dataset


def create_dataset_dict(X_train: pd.DataFrame, X_test: pd.DataFrame,
                        y_train: pd.Series, y_test: pd.Series,
                        X_val: pd.DataFrame = None, y_val: pd.Series = None
                        ) -> DatasetDict:

    datasets = {
        'train': Dataset.from_dict(
            {TEXT: X_train[TEXT],
             TARGET: y_train}
        ),
        'test': Dataset.from_dict(
            {TEXT: X_test[TEXT],
             TARGET: y_test}
        )
    }

    if X_val is not None and y_val is not None:
        datasets['validation'] = Dataset.from_dict(
            {TEXT: X_val[TEXT],
             TARGET: y_val}
        )

    return DatasetDict(datasets)


def tokenize(batch: Dict[str, Any]) -> Dict[str, Any]:
    return tokenizer(batch[TEXT], padding=True, truncation=True)


def get_hidden_states(batch: Dict[str, Any]) -> Dict[str, Any]:

    inputs = tokenizer(
        batch[TEXT],
        padding=True,
        truncation=True,
        return_tensors='tf',
    )

    outputs = tf_model(inputs)

    # Use the last hidden state of the [CLS] token as the sequence embedding.
    return {'hidden_state': outputs.last_hidden_state[:, 0].numpy()}


X_train, X_test, y_train, y_test = split_data(df_text, column=TEXT,
                                              test_size=0.2, random_state=0)

dataset = create_dataset_dict(X_train, X_test, y_train, y_test)

dataset_encoded = dataset.map(
    tokenize,
    batched=True,
    batch_size=None,
)

dataset_encoded.reset_format()

dataset_hidden = dataset_encoded.map(
    get_hidden_states,
    batched=True,
    batch_size=16,
)

X_train_hidden = np.array(dataset_hidden['train']['hidden_state'])
y_train_hidden = np.array(dataset_hidden['train'][TARGET])

X_test_hidden = np.array(dataset_hidden['test']['hidden_state'])
y_test_hidden = np.array(dataset_hidden['test'][TARGET])

We have completed the necessary preprocessing for the second model. To select it, we will perform cross-validation once more, this time excluding Multinomial NB from the clfs variable, as it is not compatible with negative values:

clfs.pop() # Multinomial NB was our last element

print_stratified_kfold(clfs, X_train_hidden, y_train)

Output:

Logistic Regression is once again the winner among all classifiers, and it is going to be the chosen model:

clf = LogisticRegression(
    max_iter=3000,
    class_weight='balanced',
)

clf.fit(X_train_hidden, y_train_hidden)

y_pred = clf.predict(X_test_hidden)
y_probs = clf.predict_proba(X_test_hidden)
accuracy = np.mean(y_pred == y_test)

y_test_labeled = [labels_decoded[x] for x in y_test]
y_pred_labeled = [labels_decoded[x] for x in y_pred]

ConfusionMatrixDisplay.from_predictions(y_test_labeled, y_pred_labeled)
plt.title(f'Logistic Regression - acc {accuracy:.3f}', size=15)
plt.show()

Output:

Confusion Matrix of DistilBERT + ML approach

print(classification_report(y_test_labeled, y_pred_labeled))

Output:

Classification report of DistilBERT + ML approach

While our accuracy and f1-score using DistilBERT embeddings are slightly improved compared to those achieved with Bag of Words and TF-IDF, they are far from being game-changing.

And what about the model’s confidence in its predictions? Let’s take a look:

plot_distributions_of_confidence(y_test, y_pred, y_probs)

Output:

Distributions of confidence with DistilBERT + ML approach

Upon examination, we can observe a significant difference when comparing this graph with our previous one. Let’s dig into it.

Similar to before, we will start by analyzing the graphs on the left. We can see that the model is now more certain when it misclassifies a sample, which is not ideal, but expected given that the model’s confidence with Bag of Words was typically low in every scenario. However, we also notice that the number of incorrectly classified samples decreases as we increase the confidence level. And this is definitely what we want to see.

On the other hand, the graphs on the right show the opposite trend: the model has a high level of confidence when it correctly classifies a sample. This is very positive as it allows us to set a threshold and still retain a high level of precision while only losing a small percentage of correctly classified samples.

We can experiment with different thresholds to observe how precision and recall are affected, and we could even plot an ROC curve for a more detailed analysis. Here are some results we would obtain by setting different thresholds of confidence:

Correlation threshold and metrics with DistilBERT + ML approach

As we can see, there is a clear trade-off between precision and recall. In simple terms, the higher the precision (the more certain we want the model to be when making a prediction), the lower the recall (given the risk of the model not classifying a sample at all due to lack of confidence).

To understand where the values shown come from, let’s take the example of a threshold of 90%. We can see in the previous Distributions of confidence plot that we would be losing all misclassified samples except one, while still keeping 63 correctly classified samples. This translates to a 63/64 * 100% = 98.4% precision. On the other hand, we would be correctly classifying only 63 out of 172 total samples, resulting in a 63/172 * 100% = 36.6% recall.

The optimal threshold for this particular case will depend on the relative importance of precision versus recall for the end-user of the project, which is beyond the scope of this article.

In conclusion, while the accuracy and f1-score of the two models we’ve seen so far are relatively similar, there is a considerable difference in their level of confidence. And we will leverage this to significantly improve the relevant metrics of this second approach.

It’s time to test the performance of our third and last model by comparing what we did so far with the mighty neural networks.

9. Third approach: DistilBERT + Deep Learning

Let’s get into it. Up until now, we have been working without a fixed validation set, relying on cross-validation instead. Moving forward, we won’t need to compare the metrics of different models to choose the optimal one, so to train the neural network we will create a dedicated validation set that gives us a more consistent measure of the model’s performance.

For the preprocessing phase of this approach, we will now create a function that generates a tensorflow.data.Dataset object for each of the train, validation, and test sets:

import tensorflow as tf
from transformers import DataCollatorWithPadding


def create_tf_dataset(dataset_encoded: DatasetDict, tokenizer: AutoTokenizer,
                      batch_size: int = 16) -> Tuple[tf.data.Dataset,
                                                     tf.data.Dataset,
                                                     tf.data.Dataset]:

    tokenizer_columns = tokenizer.model_input_names

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer,
                                            return_tensors='tf')

    tf_train_dataset = dataset_encoded['train'].to_tf_dataset(
        columns=tokenizer_columns,
        label_cols=[TARGET],
        shuffle=True,
        batch_size=batch_size,
        collate_fn=data_collator,
    )

    tf_val_dataset = dataset_encoded['validation'].to_tf_dataset(
        columns=tokenizer_columns,
        label_cols=[TARGET],
        shuffle=False,
        batch_size=batch_size,
        collate_fn=data_collator,
    )

    tf_test_dataset = dataset_encoded['test'].to_tf_dataset(
        columns=tokenizer_columns,
        label_cols=[TARGET],
        shuffle=False,
        batch_size=batch_size,
        collate_fn=data_collator,
    )

    return tf_train_dataset, tf_val_dataset, tf_test_dataset

We are ready to carry out the whole preprocessing of our data so that it can be fed to our model:

X_train, X_val, X_test, y_train, y_val, y_test = split_data(
    df_text, column=TEXT, test_size=0.2, val_size=0.1, random_state=0
)

dataset = create_dataset_dict(X_train, X_test, y_train, y_test, X_val, y_val)

dataset_encoded = dataset.map(
    tokenize,
    batched=True,
    batch_size=None,
)

tf_train_dataset, tf_val_dataset, tf_test_dataset = create_tf_dataset(
    dataset_encoded, tokenizer
)

To fine-tune our pre-trained model, we will need to create the following three functions:

import transformers
from transformers import DistilBertConfig
from tensorflow.keras.callbacks import EarlyStopping


def create_distilbert_config(dropout: float = 0.1,
                             attention_dropout: float = 0.1
                             ) -> transformers.DistilBertConfig:

    config = DistilBertConfig(
        dropout=dropout,
        attention_dropout=attention_dropout,
        output_hidden_states=True,
        num_labels=num_labels,
    )

    return config


def compile_model(tf_model: tf.keras.Model, learning_rate: float = 5e-6
                  ) -> tf.keras.Model:

    tf_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=tf.metrics.SparseCategoricalAccuracy()
    )

    return tf_model


def train_model(tf_model: tf.keras.Model, tf_train_dataset: tf.data.Dataset,
                tf_val_dataset: tf.data.Dataset, epochs: int = 100,
                patience: int = 3) -> tf.keras.Model:

    # Stop training when the validation loss stops improving and
    # roll back to the best weights seen so far.
    callback = EarlyStopping(
        monitor='val_loss',
        patience=patience,
        restore_best_weights=True
    )

    tf_model.fit(
        tf_train_dataset,
        validation_data=tf_val_dataset,
        epochs=epochs,
        callbacks=[callback]
    )

    return tf_model

  • create_distilbert_config() lets us regularize the model, helping it generalize better and avoid overfitting, through the dropout and attention_dropout parameters. The latter is a variant of the former applied to the attention mechanism of the transformer, randomly zeroing a fraction of the attention weights during training,
  • compile_model() lets us set the learning_rate parameter of the optimizer, which in this case will be Adam, and
  • train_model() expects two input parameters that determine when to stop training: epochs, the maximum number of training rounds, and patience, the number of epochs to wait before stopping if the validation loss does not improve.

Time to implement them and train our final model:

from transformers import TFAutoModelForSequenceClassification

config = create_distilbert_config(dropout=0.1, attention_dropout=0.1)

tf_model = TFAutoModelForSequenceClassification.from_pretrained(
    distilbert_model,
    config=config,
)

tf_model = compile_model(tf_model, learning_rate=1e-6)

tf_model = train_model(tf_model, tf_train_dataset, tf_val_dataset,
                       epochs=1000, patience=5)

Output:

Start of the training of the neural network

The number of training epochs is set high so the actual end of the training is determined by the patience parameter.

With the training complete, we can now evaluate the model’s performance by analyzing the results:

output_logits = tf_model.predict(tf_test_dataset).logits
y_pred = np.argmax(output_logits, axis=-1)
y_probs = tf.nn.softmax(output_logits).numpy()
accuracy = np.mean(y_pred == y_test)

y_test_labeled = [labels_decoded[x] for x in y_test]
y_pred_labeled = [labels_decoded[x] for x in y_pred]

ConfusionMatrixDisplay.from_predictions(y_test_labeled, y_pred_labeled)
plt.title(f'Deep Learning - acc {accuracy:.3f}', size=15)
plt.show()

Output:

Confusion Matrix of DistilBERT + DL approach

print(classification_report(y_test_labeled, y_pred_labeled))

Output:

Classification report of DistilBERT + DL approach

Our metrics had an interesting boost. The f1-score reaches a promising 0.83 and the metrics across the different classes are overall improved as well compared to the Logistic Regression model.

Nevertheless, we can see upon closer examination that there is still a clear confusion within the model when it comes to differentiating articles from press releases, for example. This is likely due to the limited number of samples of Article currently available, and the best solution would be to gather more data.

To arrive at a conclusion about which of the three models we’ve reviewed so far is the best, let’s examine the distributions of confidence produced by this model:

plot_distributions_of_confidence(y_test, y_pred, y_probs)

Output:

Distributions of confidence with DistilBERT + DL approach

Our latest plot reveals a striking contrast compared to the previous one. The model now appears to have a higher level of confidence in both correctly and incorrectly classified URLs, which is not desirable.

This is the opposite of what occurred with Bag of Words and TF-IDF, where the model was typically uncertain in any scenario. As a result of these contrasting behaviors, both of them share the drawback of making it quite challenging to adjust the threshold in order to boost the precision without greatly compromising the recall. Neither a model that is always uncertain nor one that is always confident is helpful for that.

These are the different metrics we got with various thresholds when using the DistilBERT embeddings combined with Logistic Regression (our second model):

Correlation threshold and metrics with DistilBERT + ML approach

Let’s compare them with the results we obtain with a neural network (our current model):

Correlation threshold and metrics with DistilBERT + DL approach

We can observe that when setting a threshold of 60%, our last approach results in a significantly better recall with the same precision. However, if our goal is to achieve a very high precision, this can only be achieved using our second approach with Logistic Regression.

When modifying the threshold, the impact on the precision using neural networks is almost insignificant, as the majority of samples are concentrated at a confidence level greater than 90% for both misclassified and correctly classified URLs, causing the remaining data to have little effect.
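As a rough illustration, the hypothetical metrics_at_threshold helper sketched in section 7 could be reused to tabulate this trade-off for any of the three models:

# Sweep several confidence thresholds and print the resulting metrics.
for threshold in (0.6, 0.7, 0.8, 0.9, 0.95):
    precision, recall = metrics_at_threshold(y_test, y_pred, y_probs, threshold)
    print(f'threshold {threshold:.0%} -> '
          f'precision {precision:.3f}, recall {recall:.3f}')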

10. Conclusions

We have explored a wide range of topics and models, and it is now time to determine the best approach for the task at hand.

Using Transformers with their attention mechanism has proven more reliable than Bag of Words and TF-IDF, a key factor being the model’s average confidence level when making a prediction. As a result, we can safely discard the initial approach and focus on the two approaches built on DistilBERT embeddings.

The question remains whether to use a simple machine learning model or a deep learning model, and the answer depends on the primary metric of concern. In this real-world task, where precision is prioritized far above recall, a logistic regression model is the best choice. However, if the importance of both metrics is more balanced, or if correctly classifying every sample is the most crucial factor, a neural network is more suitable.

Ultimately, it is important to have a clear understanding of the desired outcome in order to select the appropriate approach. In this situation, we end up with two models at our disposal that can effectively handle any scenario, depending on what we (or the end-user of the project) want to achieve.

For any inquiries, you are welcome to contact me on LinkedIn.
