Document classification with ELMo (Embeddings from Language Models)

Ahmet Taşdemir
7 min read · May 16, 2023

Word embeddings serve as the feature representation for words in various applications, including image caption generation and machine translation. However, these tasks typically involve the integration of different learning models, such as Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) models, or combinations of LSTM models. To explore a practical application of word embeddings, let’s focus on a simpler task: document classification.

Document classification is a widely used objective in Natural Language Processing (NLP). It holds immense value for individuals dealing with vast amounts of data, such as news websites, publishers, and universities. Consequently, it becomes intriguing to examine how the learning of word vectors can be adapted for real-world tasks like document classification. This adaptation involves the concept of embedding entire documents rather than focusing solely on individual words.

Dataset

To tackle this task, we will utilize a pre-organized collection of text files comprising news articles from the BBC. Each document within this dataset is categorized into one of five distinct groups: Business, Entertainment, Politics, Sports, or Technology.

Our initial step involves downloading and loading the data into memory. We will employ an existing function, download_data(), for data retrieval. Furthermore, we will make slight modifications to the read_data() function. In addition to returning a list of articles, with each article represented as a string, it will now also provide a list of filenames. These filenames will serve the purpose of creating labels for our classification model.
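The notebook's read_data() is not reproduced here. As a rough sketch of a reader with the same return shape, assuming one sub-folder per category and plain .txt articles, something like the following could work. The name read_articles() and the latin-1 encoding are assumptions made for illustration, not the original implementation:

```python
import os


def read_articles(data_dir):
    """Hypothetical stand-in for the modified read_data(): returns the
    article texts and, in the same order, the paths they were read from."""
    articles, filenames = [], []
    for root, _dirs, files in sorted(os.walk(data_dir)):
        for name in sorted(files):
            if not name.endswith(".txt"):
                continue  # skip READMEs and other non-article files
            path = os.path.join(root, name)
            # latin-1 can decode any byte sequence; the real encoding is an assumption
            with open(path, encoding="latin-1") as f:
                articles.append(f.read())
            filenames.append(path)
    return articles, filenames
```

Keeping the two lists in the same order is what later lets each filename act as the key for its article's label.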

Subsequently, we will create and train a tokenizer using the data, similar to our previous procedures. This tokenizer will assist in processing the text effectively.

Moving forward, our focus shifts to generating labels. Since our objective is to train a classification model, we require both inputs and corresponding labels. Our inputs will consist of document embeddings, which we will explore how to compute shortly. The labels will be represented as ID numbers ranging from 0 to 4, with each category (e.g., business, technology) assigned a distinct label. Leveraging the information contained in the filenames, which include the category as a folder, we can extract the necessary label ID for each document.

We will use the pandas library to create the labels. First we will convert the list of filenames to a pandas Series object using:

labels_ser = pd.Series(filenames, index=filenames)

An example entry in this series could look like data/bbc/tech/127.txt. Next, we will split each item on the path separator (the "/" character here), which returns the list ['data', 'bbc', 'tech', '127.txt']. We will also set expand=True, which transforms our Series object into a DataFrame by turning each item in the list of tokens into a separate column. In other words, our pd.Series object will become an [N, 4]-sized pd.DataFrame with one token in each column, where N is the number of files:

labels_ser = labels_ser.str.split(os.path.sep, expand=True)

In the resulting data, we only care about the third column, which has the category of a given article (e.g. tech). Therefore, we will discard the rest of the data and only keep that column:

labels_ser = labels_ser.iloc[:, -2]

These steps can be chained into a single statement. The chained version below also adds a final map() call that converts each category name into its label ID:

labels_ser = (
    pd.Series(filenames, index=filenames)
    .str.split(os.path.sep, expand=True)
    .iloc[:, -2]
    .map({'business': 0, 'entertainment': 1, 'politics': 2, 'sport': 3, 'tech': 4})
)
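To see what each step produces, here is a small self-contained sketch run on three made-up file paths. The paths are hypothetical; only the folder layout follows the one described above:

```python
import os

import pandas as pd

# Hypothetical file paths in the layout the article describes
filenames = [
    os.path.join("data", "bbc", "tech", "127.txt"),
    os.path.join("data", "bbc", "sport", "004.txt"),
    os.path.join("data", "bbc", "business", "311.txt"),
]

category_to_id = {"business": 0, "entertainment": 1, "politics": 2, "sport": 3, "tech": 4}

# Split each path into its parts; expand=True gives one column per part
parts = pd.Series(filenames, index=filenames).str.split(os.path.sep, expand=True)

# The second-to-last column is the category folder; map it to an integer ID
labels = parts.iloc[:, -2].map(category_to_id)

print(parts.shape)      # (3, 4)
print(labels.tolist())  # [4, 3, 0]
```

A folder name that is missing from the dictionary would be mapped to NaN, so checking labels.isna().any() is a cheap sanity test before training.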

With that, we move on to the next important step, i.e. splitting the data into train/test subsets. When training a supervised model, we generally need three datasets:

  • A training set — This is the dataset the model will be trained on.
  • A validation set — This will be used during the training to monitor model performance (e.g. signs of overfitting).
  • A testing set — This will not be exposed to the model at any time during training. It will only be used after training to evaluate the model on unseen data.

In this exercise, we will only use the training set and the testing set. This will help us to keep our conversation more focused on embeddings and keep the discussion about the downstream classification model simple. Here we will use 67% of the data as training data and use 33% of data as testing data. Data will be split randomly:

from sklearn.model_selection import train_test_split
train_labels, test_labels = train_test_split(labels_ser, test_size=0.33)
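Because the labels are indexed by filename, the index of each subset records which documents ended up in it. The sketch below, on synthetic labels and placeholder features, shows how that index could be used to split any per-document table the same way. The random_state argument is an addition for repeatability and is not part of the original code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 100 documents, 20 per category, indexed by filename
filenames = [f"doc_{i}.txt" for i in range(100)]
labels_ser = pd.Series(np.repeat(np.arange(5), 20), index=filenames)

# random_state (an addition, not in the article) makes the split repeatable
train_labels, test_labels = train_test_split(labels_ser, test_size=0.33, random_state=0)

# Any per-document table that shares the filename index can be split the same way
doc_features = pd.DataFrame(np.random.rand(100, 8), index=filenames)  # placeholder features
train_inputs = doc_features.loc[train_labels.index]
test_inputs = doc_features.loc[test_labels.index]

print(len(train_labels), len(test_labels))  # 67 33
```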

Now we have a training dataset to train the model and a test dataset to test it on unseen data. We will now see how we can generate document embeddings from token or word embeddings.

Generating document embeddings

Let’s first remind ourselves how we stored the embeddings for the skip-gram, CBOW, and GloVe algorithms: each was saved as a table with one row per token, holding that token’s word vector.

ELMo embeddings are an exception to this. Since ELMo generates contextualized representations for all tokens in a sequence, we instead stored one mean embedding vector per document, obtained by averaging all the vectors generated for that document.

To compute the document embeddings from skip-gram, CBOW, and GloVe embeddings, let us write a function.

The generate_document_embeddings() function takes the following arguments:

  • texts — A list of strings, where each string represents an article
  • filenames — A list of filenames corresponding to the articles in texts
  • tokenizer — A tokenizer that can process texts
  • embeddings — The embeddings as a pd.DataFrame, where each row represents a word vector, indexed by the corresponding token.

The initial step of the function involves preprocessing the texts. This preprocessing includes converting the strings into sequences and then converting them back into a list of strings. This allows us to leverage the tokenizer’s built-in preprocessing capabilities for text cleaning. Subsequently, each preprocessed string is split using spaces, resulting in a list of tokens. These tokens are then used to index the corresponding positions in the embeddings matrix. Finally, the mean vector for the document is computed by taking the average of all the selected embedding vectors.
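The steps above can be sketched as follows. This is an illustrative version, not the notebook's code: WhitespaceTokenizer is a hypothetical stand-in that only mimics the texts_to_sequences() and sequences_to_texts() methods of a Keras-style tokenizer, and the tiny embeddings table is made up:

```python
import numpy as np
import pandas as pd


class WhitespaceTokenizer:
    """Hypothetical stand-in that mimics two methods of a Keras-style tokenizer."""

    def __init__(self, texts):
        words = sorted({w for t in texts for w in t.lower().split()})
        self.word_index = {w: i + 1 for i, w in enumerate(words)}
        self.index_word = {i: w for w, i in self.word_index.items()}

    def texts_to_sequences(self, texts):
        return [[self.word_index[w] for w in t.lower().split() if w in self.word_index]
                for t in texts]

    def sequences_to_texts(self, sequences):
        return [" ".join(self.index_word[i] for i in seq) for seq in sequences]


def generate_document_embeddings(texts, filenames, tokenizer, embeddings):
    """Average the word vectors of each document's tokens (a sketch)."""
    # Round-trip through the tokenizer to reuse its text cleaning
    cleaned = tokenizer.sequences_to_texts(tokenizer.texts_to_sequences(texts))
    rows = []
    for text in cleaned:
        # Keep only tokens that have a row in the embeddings table
        tokens = [t for t in text.split(" ") if t in embeddings.index]
        rows.append(embeddings.loc[tokens].mean(axis=0).to_numpy())
    # One row per document, indexed by filename so it lines up with the labels
    return pd.DataFrame(np.vstack(rows), index=filenames, columns=embeddings.columns)


# Made-up inputs for illustration
texts = ["Shares rose", "Team won the match"]
filenames = ["business/001.txt", "sport/001.txt"]
embeddings = pd.DataFrame(
    [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]],
    index=["shares", "rose", "team", "won"],
)

tokenizer = WhitespaceTokenizer(texts)
doc_emb = generate_document_embeddings(texts, filenames, tokenizer, embeddings)
print(doc_emb.loc["business/001.txt"].tolist())  # [2.0, 1.0]
```

Tokens without a row in the embeddings table ("the" and "match" here) are simply skipped in this sketch. Whether the original function handles unknown tokens the same way is not stated in the text.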

Once these steps are completed, we proceed to load the embeddings from various algorithms such as skip-gram, CBOW, and GloVe. We then compute the document embeddings using these loaded embeddings. Although we will only demonstrate the process for the skip-gram algorithm, it can be easily extended to the other algorithms since they share similar inputs and outputs.

Classifying documents with document embeddings

We will be training a simple multi-class (or a multinomial) logistic regression classifier on this data.

It’s a very simple model with a single layer, where the input is the embedding vector (e.g. a 128-element-long vector), and the output is a 5-node softmax layer that will output the likelihood of the input belonging to each category, as a probability distribution.
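To make the shape of this model concrete, here is a minimal NumPy sketch of a single linear layer followed by a softmax. The weights are random placeholders rather than trained values, and the input is a random vector standing in for a document embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim, num_classes = 128, 5

# Randomly initialised parameters stand in for trained weights
W = rng.normal(scale=0.01, size=(embedding_dim, num_classes))
b = np.zeros(num_classes)


def predict_proba(x):
    """Single linear layer followed by a softmax over the 5 categories."""
    logits = x @ W + b
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


x = rng.normal(size=embedding_dim)  # placeholder document embedding
probs = predict_proba(x)
print(probs.shape, round(float(probs.sum()), 6))  # (5,) 1.0
```

The predicted category is then the index of the largest probability, i.e. int(np.argmax(probs)).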

We will be training several models, as opposed to a single run. This will give us a more consistent result on the performance of the model. To implement the model, we’ll be using a popular general-purpose machine learning library called scikit-learn (https://scikit-learn.org/stable/). In each run, a multi-class logistic regression classifier is created with the sklearn.linear_model.LogisticRegression object. Additionally, in each run:

  1. The model is trained on the training inputs and targets
  2. The model predicts the class (a value from 0 to 4) for each test input, where the class of an input is the one that has the maximum probability from all classes
  3. The model computes the test accuracy using the predicted classes and true classes of the test set
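The loop described in these three steps might look like the sketch below. It runs on synthetic vectors that stand in for document embeddings, so the accuracies it prints say nothing about the BBC data. It also leaves out the multi_class argument: with the default lbfgs solver, recent scikit-learn versions already fit a multinomial model for multi-class targets, and newer releases deprecate that argument.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for document embeddings: 5 classes, 16-dimensional vectors
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 60)
centers = rng.normal(scale=3.0, size=(5, 16))
inputs = centers[labels] + rng.normal(size=(labels.size, 16))

train_x, test_x, train_y, test_y = train_test_split(
    inputs, labels, test_size=0.33, random_state=0
)

accuracies = []
for run in range(5):
    model = LogisticRegression(max_iter=1000)
    model.fit(train_x, train_y)                           # 1. train on the training set
    predicted = model.predict(test_x)                     # 2. most probable class per input
    accuracies.append(accuracy_score(test_y, predicted))  # 3. test accuracy

print(accuracies)
```

On a fixed split, the lbfgs solver is deterministic, so the five runs in this sketch give essentially the same number. Re-drawing the train/test split inside the loop would be one way to get run-to-run variation.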

By setting multi_class='multinomial', we make sure it’s a multi-class logistic regression model (or a softmax classifier). This will output:

Skip-gram accuracies: [0.882…, 0.882…, 0.881…, 0.882…, 0.884…]

When you follow the procedure for all of the skip-gram, CBOW, GloVe, and ELMo algorithms, you can compare the results in a box plot. As performance is quite similar between trials, you won’t see much variation in the diagram.

The results reveal that the skip-gram algorithm achieves approximately 88% accuracy (consistent with the accuracies printed above), closely followed by CBOW, which exhibits comparable performance. Surprisingly, GloVe falls significantly behind skip-gram and CBOW, achieving an accuracy of only around 66%. This observation suggests a potential limitation in the GloVe loss function. Unlike skip-gram and CBOW, which consider both positive (observed) and negative (unobserved) target-context pairs, GloVe solely focuses on observed pairs. This discrepancy may hinder GloVe’s ability to generate effective word representations.

Notably, ELMo outperforms all other models with an accuracy of around 98%. However, it’s important to note that ELMo has been trained on a much larger dataset than the BBC dataset, whereas the other embeddings were not. It would therefore be unfair to compare ELMo with the other models on this number alone.

In this article, we explored the extension of word embeddings into document embeddings and their use in a downstream classifier model for document classification. Initially, we delved into word embeddings using various algorithms like skip-gram, CBOW, and GloVe. We then proceeded to create document embeddings by averaging the word embeddings of all the words present in each document. This approach was applied to the skip-gram, CBOW, and GloVe algorithms. However, with the ELMo algorithm, document embeddings were directly inferred from the model.

Subsequently, these document embeddings were employed to classify BBC news articles across different categories such as entertainment, technology, politics, business, and sports.

Thank you for your time and for reading. I hope you found my article useful.

For Notebook: https://github.com/AhmetTasdemir/NLP_with_TensorFlow/blob/master/Ch04-Advance-Word-Vectors/ch4_document_classification.ipynb

Please don’t hesitate to contact me if you notice anything wrong or would like to provide feedback:
ahmettsdmr1312@gmail.com
https://www.linkedin.com/in/ahmet-tasdemir/
