By Zhanna Terechshenko, Vishakh Padmakumar, and Megan A. Brown
At CSMaP, we are committed to supporting open and accessible science. For us, this means promoting the creation and use of open-source software, providing high-quality replication materials for our publications, and contributing to existing open-source tools and frameworks. This series is an extension of that work: We’ll publish explainers on new open-source packages, methods, and publications by our Social Media and Political Participation Lab.
In this post, we show how to classify textual data using SMaBERTa — a wrapper for a stable version of RoBERTa language models. The code we provide was adapted from version 0.6 of simpletransformers. It uses the Simple Transformers library, which is built on top of the Transformers library by Hugging Face.
To install the package for this tutorial, run:
pip install smaberta
You can download and follow along via this notebook.
Automated text classification has become a staple of the computational social scientist’s toolkit, which is why keeping up with state-of-the-art models in Natural Language Processing (NLP) is necessary for improving classification tasks. However, because NLP moves so quickly, code bases can change daily, making long-term projects tricky and code less usable for replication or extensions of existing projects. To this end, we focus on a specific type of model, the transformer, and created “smaberta,” a Python package for using existing transformer models.
Transformer models are a class of deep neural network models. They have made large strides in improving the state-of-the-art scores on a wide array of standard NLP tasks, primarily by leveraging large-scale pre-training and transfer learning. Hugging Face provides pre-trained models to the open-source community for a variety of transformer architectures, and we can use these models to perform any specific classification task. They have shown promise in improving results on tasks with a small amount of labeled data, a regime common in the social sciences. For a comparison of transformer-based and traditional machine learning models across various datasets, check out our research paper on transfer learning language models for politics.
First, we load the libraries and set the random seeds for replicability.
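import pickle
import random
import warnings

import numpy as np
import pandas as pd
import torch
from smaberta import TransformerModel

warnings.filterwarnings('ignore')

# Seed every random number generator for replicability
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)

# Use the full option name; the short 'precision' is ambiguous in newer pandas
pd.set_option('display.precision', 0)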
For the classification task, we need paired data: free-form texts, each accompanied by a supervised label for the task at hand.
For the purpose of this post, we are using a sample from the New York Times Front Page Dataset (Boydstun, 2014). Here the NYT headlines are coded according to the Comparative Agendas Project topics codebook. We have a total of 25 possible labels, each one associated with different topics and represented by a separate number.
train_df = pd.read_csv("nyt_train.csv")
test_df = pd.read_csv("nyt_test.csv")
# here’s what this dataset looks like
train_df.head()
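Before training, it can help to confirm that the labels look the way the classifier will expect. The sketch below is not part of smaberta: `check_labels` is a hypothetical helper, and it assumes labels are integers running from 0 to num_labels - 1, which is what a cross-entropy classification head in PyTorch expects. It works on any plain list of labels, such as `train_df['label'].tolist()`; the toy labels here are made up for illustration.

```python
from collections import Counter

def check_labels(labels, num_labels):
    """Return per-label counts after checking every label is an int in [0, num_labels)."""
    bad = [y for y in labels if not isinstance(y, int) or not 0 <= y < num_labels]
    if bad:
        raise ValueError(f"labels outside 0..{num_labels - 1}: {bad[:5]}")
    return Counter(labels)

# Toy labels for illustration only (not taken from the NYT data).
toy_labels = [0, 3, 3, 24, 7, 3]
counts = check_labels(toy_labels, num_labels=25)
print(counts.most_common(2))
```

The counts also give a quick view of class imbalance, which is worth knowing before interpreting accuracy.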
At the training stage, we have two parameters to be tuned: 1) learning rate, the step size at each iteration, and 2) the number of epochs, the number of passes through the entire training dataset. We set some sample values for the purposes of the tutorial. But we would recommend performing a grid search or random search cross-validation to find the best parameters for the model for your task.
lr = 1e-5
epochs = 5
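The grid search recommended above can be sketched in a few lines. This is a minimal sketch, not a smaberta feature: `grid_search`, `score_fn`, and `dummy_score` are hypothetical names, and `score_fn(lr, epochs)` stands in for whatever you use to train a model with those settings and return a validation score (for example, accuracy on a held-out split of the training data). The candidate values are illustrative, not recommendations.

```python
from itertools import product

def grid_search(score_fn, learning_rates, epoch_options):
    """Try every (lr, epochs) pair and return the best pair with its score."""
    best_params, best_score = None, float("-inf")
    for lr, epochs in product(learning_rates, epoch_options):
        score = score_fn(lr, epochs)
        if score > best_score:
            best_params, best_score = (lr, epochs), score
    return best_params, best_score

# Stand-in scorer so the sketch runs on its own;
# replace it with real training plus validation.
def dummy_score(lr, epochs):
    return -abs(lr - 1e-5) * 1e4 - abs(epochs - 5) * 0.01

best_params, best_score = grid_search(dummy_score, [1e-5, 2e-5, 5e-5], [3, 5, 10])
print(best_params, best_score)
```

Each call to `score_fn` would train a full model, so the cost grows with the product of the two grid sizes; random search samples a fixed number of pairs instead.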
Initializing The Model
Let’s go through the main arguments needed for the model initialization:
- The first argument indicates which architecture to use. In this case, we use the Roberta architecture (alternatives include Bert, XLNet and others as provided by Hugging Face). It also specifies the correct tokenizer and classification head.
- The second argument provides an initialization point as provided by Hugging Face. In this case, since we are using roberta, the correct initialization point is roberta-base. Other examples are
- The number of labels, which is needed to initialize the classification head appropriately, is specified below. You would change this to match your classification task. In the example training set above, we have 25 labels, each corresponding to a Comparative Agendas Project topic, so we set num_labels=25. If we were doing a binary classification task, we would set num_labels=2.
- For the number of epochs and learning rate, we use the parameters determined above: 1e-5 for the learning rate and 5 for the number of epochs.
- fp16 refers to half-precision (16-bit) floating point, which you set according to the GPUs available to you. It shouldn’t affect the classification result; rather, it will just affect speed and memory use. If you are not running the model on a GPU, you do not need to set this flag.
- Finally, we specify where the model logs and outputs will be saved and indicate that we would like to overwrite the output directory if it already exists. And if you’re rerunning the same experiment with different parameters, you might not want to reprocess the input every time, so the cache in the output directory allows you to reuse the same tokenization from different experiments. However, if your corpus changes, or you have new vocabulary, you will likely want to rerun the tokenization step.
model = TransformerModel('roberta', 'roberta-base', num_labels=25,
                         reprocess_input_data=True, num_train_epochs=epochs,
                         learning_rate=lr, output_dir='./saved_model/',
                         overwrite_output_dir=True, fp16=False)
Finally, we train the model!
The classification model is the Roberta transformer with a sequence classification head (a simple linear layer with dropout) on top. Similar to a traditional classifier, at training time it fits the sequences to the labels passed as arguments to the train function. The transformer first encodes the sentences with its tokenizer, then runs a forward pass through the neural network and an optimization step on the cross-entropy loss. Here’s the underlying code.
model.train(train_df['text'], train_df['label'])
>>> Starting Epoch: 0
>>> Starting Epoch: 1
>>> Starting Epoch: 2
>>> Starting Epoch: 3
>>> Starting Epoch: 4
>>> Training of roberta model complete. Saved to ./saved_model/.
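To make the training description more concrete, here is a minimal NumPy sketch of a classification head (dropout followed by a linear layer) and the cross-entropy loss it is trained on. It follows the description above rather than the actual smaberta or Hugging Face code, which runs in PyTorch and may differ in detail. The function names, the toy 8-dimensional encodings, and the random values are all assumptions for illustration.

```python
import numpy as np

def classification_head(features, weights, bias, rng, dropout_p=0.1, train=True):
    """Apply (inverted) dropout to the encoder features, then a linear layer."""
    if train and dropout_p > 0:
        keep = rng.random(features.shape) >= dropout_p
        features = features * keep / (1.0 - dropout_p)
    return features @ weights + bias

def cross_entropy(logits, labels):
    """Mean cross-entropy between logits (n, num_labels) and integer labels (n,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy shapes: 4 sentences, 8-dimensional encodings, 25 labels.
rng = np.random.default_rng(1)
features = rng.normal(size=(4, 8))
weights = rng.normal(size=(8, 25)) * 0.01
bias = np.zeros(25)
labels = np.array([0, 3, 24, 7])

logits = classification_head(features, weights, bias, rng)
loss = cross_entropy(logits, labels)
print(logits.shape, float(loss))
```

During training, the optimizer adjusts the weights (of both the head and the transformer underneath) to reduce this loss.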
To see more in-depth logs, set the flag show_running_loss=True on the function call of
Inference from model
After training, the model is saved to the output directory that was passed in at initialization. We can either keep using the same model object or load the model from the directory where it was saved. This example loads it from disk to illustrate how you would do that. This is helpful when you want to train and save a classifier once and then use it sporadically. For example, in an online setting where you have some labelled training data, you would train and save a model, and then load and use it to classify tweets as your collection pipeline progresses.
model = TransformerModel('roberta', 'roberta-base', num_labels=25, location="./saved_model/")
Evaluate on test set
At inference time we have access to the model outputs (also saved to the output directory you specified when initializing the model), which we can use to make the kind of predictions shown below. You can run any empirical analysis on these outputs before or after saving them, and you would then save the results for replication purposes. You can use the model outputs as you would those of a normal PyTorch model, including for subsequent analysis run asynchronously. By extension, you could calculate the class probabilities by applying a softmax operation over the model outputs, but here we just show label predictions and accuracy.
result, model_outputs, wrong_predictions = model.evaluate(test_df['text'], test_df['label'])

# Predicted label = index of the largest model output in each row
preds = np.argmax(model_outputs, axis=1)

correct = 0
labels = test_df['label'].tolist()
for i in range(len(labels)):
    if preds[i] == labels[i]:
        correct += 1
accuracy = correct / len(labels)
print("Accuracy: ", accuracy)
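The softmax step mentioned above can be sketched as follows. This assumes the model outputs form a 2-D array of raw scores with one row per text and one column per label, which is consistent with the `np.argmax(..., axis=1)` call in the evaluation code. The `softmax` helper and the toy values are illustrative and not part of smaberta.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy outputs for 2 texts and 3 classes (illustrative values only).
toy_outputs = np.array([[2.0, 1.0, 0.1],
                        [0.5, 0.5, 3.0]])
probs = softmax(toy_outputs)

# Softmax preserves the ordering within each row, so the predicted
# label is the same whether argmax is taken before or after it.
toy_preds = probs.argmax(axis=1)
toy_labels = np.array([0, 1])
toy_accuracy = float(np.mean(toy_preds == toy_labels))
print(probs.round(3), toy_preds, toy_accuracy)
```

The vectorized comparison `np.mean(toy_preds == toy_labels)` computes the same accuracy as the explicit loop, in one line.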
We can save the outputs so we can perform an analysis later:
# The with-block closes the file once the outputs are written
with open("../model_outputs.pkl", "wb") as f:
    pickle.dump(model_outputs, f)
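To pick the analysis back up later, the saved outputs can be read with `pickle.load`. The sketch below round-trips a toy list through a temporary file so it runs on its own; the toy values and the temporary path are placeholders, and in practice you would open the file you saved above.

```python
import os
import pickle
import tempfile

# Toy stand-in for model_outputs (illustrative values only).
toy_outputs = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]

path = os.path.join(tempfile.mkdtemp(), "model_outputs.pkl")

# The with-blocks close the file even if dump or load raises.
with open(path, "wb") as f:
    pickle.dump(toy_outputs, f)

with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded == toy_outputs)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.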
If we want to make predictions on a set of new text documents that we do not yet know the labels for, we can use the model.predict function. Note that this function returns the list of model predictions and the list of raw model outputs, where the predictions are the single-class output, and model outputs can be used for further analysis.
texts = test_df['text'].tolist()
preds, model_outputs = model.predict(texts)
test_df['predicted_labels'] = preds
Below is the table of model predictions for the first five records in the holdout set. You can see the classifier predicts the true label for each of the texts except for the second row, which was misclassified.
Boydstun, Amber E. (2014). “New York Times Front Page Dataset.” www.comparativeagendas.net. Accessed April 26, 2019.
Terechshenko, Zhanna and Linder, Fridolin and Padmakumar, Vishakh and Liu, Fengyuan and Nagler, Jonathan and Tucker, Joshua Aaron and Bonneau, Richard, A. (2020). “Comparison of methods in political science text classification: Transfer learning language models for politics.” Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3724644