WAI203: Part 1 — Fine-tuning our model

Tomás Fernandes
Warwick Artificial Intelligence
13 min read · Jan 20, 2022


“[…] financial analytics firms are turning to natural language processing to parse textual data hundreds of thousands of times faster and more accurately than humans can”
Mikey Shulman, Head of Machine Learning at Kensho and MIT Sloan lecturer

Modern language models are trained on very large text corpora. This allows these models to develop a generalized “understanding” of language which can then be leveraged for many purposes, including sentiment analysis. This is called Transfer Learning.

In this tutorial, we will use transfer learning to create our very own model for financial sentiment analysis. In particular, we’re using a Language Perceiver.

The Perceiver (more precisely, the Perceiver IO) is a very exciting architecture developed at DeepMind. It can be applied to multiple data modalities (including textual data!) individually or simultaneously (e.g. processing audio and images together for transcription), and it includes improvements over the Transformer (another very popular architecture).

The Perceiver can be used for optical flow estimation, useful for autonomous driving (DeepMind)

Side note: If you’re interested in the ClimateHack, SatFlow, a piece of software developed by OpenClimateFix, can leverage the Perceiver IO for satellite optical flow to predict future satellite images from current and past ones.

Let’s get started!

Setting up the environment

We have set up a Google Colaboratory notebook to make this tutorial more interactive.

Click 📓here to open the notebook.

First, let’s change our notebook environment to a TPU environment. Select “Change runtime type” and then “TPU” as the hardware accelerator.

Finally, run the first cell to install the required dependencies. This tutorial uses the HuggingFace Transformers library with PyTorch to train the model, and the HuggingFace Datasets library to load our data.

Loading our data

We’re using the Financial PhraseBank by Malo et al. (2014) to train our model. This dataset consists of financial news sentences categorised by sentiment polarity (positive, negative, or neutral), annotated by 16 people with background knowledge of financial markets.

Loading the data using 🤗Datasets is as simple as calling load_dataset with the name of the dataset (you can explore other datasets here). In this tutorial a mirror of the original dataset is used.

from datasets import load_dataset

dataset = load_dataset(
    'warwickai/financial_phrasebank_mirror',
    split='train'
)

The original dataset includes different configurations depending on the percentage of agreement between the annotators: ≥ 50%, ≥ 66%, ≥ 75% or 100% agreement. In this case, we’re using all sentences with ≥ 50% agreement (sentences_50agree).
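If you would like to experiment with a stricter agreement threshold, the original dataset on the HuggingFace Hub exposes these configurations by name. As a hedged example (the configuration names below follow the original dataset card):

from datasets import load_dataset

# Load only the sentences where at least 66% of the annotators agreed.
dataset_66 = load_dataset(
    'financial_phrasebank',
    'sentences_66agree',
    split='train'
)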

Note that this dataset does not include training and testing splits, and the train split includes the entire dataset. We will split our dataset in the next section.

Dataset exploration

If we plot the frequency of each class or sentiment, we can see that our dataset is imbalanced (the majority of the sentences are neutral).
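The notebook includes a plotting cell for this. As a rough sketch of how such a plot can be produced (assuming, as in the original dataset, a ClassLabel column named label):

from collections import Counter
import matplotlib.pyplot as plt

# Count how many sentences belong to each sentiment class.
label_counts = Counter(dataset['label'])
names = [dataset.features['label'].int2str(i) for i in sorted(label_counts)]
counts = [label_counts[i] for i in sorted(label_counts)]

plt.bar(names, counts)
plt.xlabel('Sentiment')
plt.ylabel('Number of sentences')
plt.show()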

This imbalance may be caused by data collection issues (bias or errors), or it may be a property of the domain of the problem at hand.

This dataset consists of annotated sentences from financial news articles, and most sentences in an article only provide context. The annotators were also asked to classify the sentences according to their possible influence on the stock price, so this imbalance is likely a property of the domain of the problem we are trying to solve.

You can find out more about imbalanced classification here.

Dataset processing

Before we can train our model, we need to do some pre-processing.

With similar language models based on the Transformer architecture, we would need to tokenize our sentences before training the model. With the Perceiver IO, this is much simpler: we simply convert each sentence to its raw UTF-8 bytes.

Note that the IDs created by the Perceiver tokenizer are 6 higher than the raw UTF-8 decimal value. This is because the first 6 IDs are reserved for special tokens.

After tokenizing (i.e. encoding into raw UTF-8 bytes), we also need to pad the encoded data so that all sentences are 2048 bytes long. Let’s translate this into code.

from transformers import PerceiverTokenizer, DataCollatorWithPadding

tokenizer = PerceiverTokenizer.from_pretrained('deepmind/language-perceiver')
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='max_length')

In the first line, we instantiate a Language Perceiver tokenizer (note that deepmind/language-perceiver is the tokenizer from the model that we will fine-tune) so that we can tokenize our data later on.

After that, we also need to instantiate a data collator. Data collators are used to create batches and may apply some transformations to our data. In this case, we need to pad our tokenized data, so we use a data collator with padding, DataCollatorWithPadding. When we instantiate the data collator, it takes a tokenizer (needed because different tokenizers represent padding differently; the Perceiver uses 0) and a padding strategy. We use the tokenizer we instantiated before and set the padding strategy to max_length, meaning that all sentences are padded so that they are 2048 bytes long.
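As a quick sanity check of the two details mentioned above, the +6 ID offset and the 0 used for padding, here is a small snippet using the tokenizer we just instantiated (the exact special tokens added may vary between library versions):

# 'H' is byte 72 in UTF-8, so with the offset of 6 we expect to see 78
# among the token IDs (alongside any special tokens the tokenizer adds).
print(tokenizer('H').input_ids)

# The padding token should map to ID 0.
print(tokenizer.pad_token_id)  # expected: 0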

Now that we have our tokenizer ready, we can use it to tokenize the sentences in our dataset:

tokenized_dataset = dataset.map(
    lambda examples: tokenizer(
        examples['sentence'],
        truncation=True
    ),
    batched=True
)

map applies the tokenizer to our dataset in batches. Note that examples represents a batch, and we select the sentence column, which contains the sentences we want to classify. Finally, truncation is set to true so that sentences longer than 2048 bytes (the maximum size) are truncated.

Our dataset is now ready! Almost…

As a last step, we need to, as usual, split our data into training and testing. This can be done using the 🤗Datasets built-in method train_test_split.

tokenized_splits = tokenized_dataset.train_test_split(
    train_size=0.90,
    test_size=0.10,
    shuffle=True
)

We use a 90% training, 10% testing split and the data is also shuffled to avoid splits which do not accurately represent our data.

Metrics

One last thing before we dive in and train our model: metrics.

In general, we may refer to the accuracy of our model; however, there are other useful metrics that we can use to measure its performance.

Precision

Confusion matrix
TP (True positive — correct predictions), FP (False positive — predictions for a sentiment with a different true sentiment)

The precision is the ratio of correct classifications for a given class to the overall number of predictions made for that class. Using the confusion matrix from before, the precision for the positive sentiment would be

127 / (127 + 5 + 0) = 0.962....

For this hypothetical confusion matrix, this would mean that when our model predicts that a sentence is positive, it is correct 96% of the time.

Note how we calculated the precision for an individual sentiment. What if we want to calculate an aggregate precision for all sentiments? There are two ways of doing this: macro precision and micro precision.

Micro precision is calculated by counting the total number of correct (true positive) and incorrect (false positive) predictions across all sentiments. However, in a problem where each example (each sentence, in this case) has exactly one classification (a sentence can only be positive, negative, or neutral), this is in fact the same as calculating the accuracy.

Hence, we are using the macro precision, which is the mean of the per-sentiment precisions. Here’s a simplified example to help you understand why this metric is useful.

Suppose we have an imaginary model which predicts whether a sentence is positive. Using the imaginary model we run predictions for 20 sentences, 7 positive and 13 non-positive. The model predicts that 8 sentences are positive and only 5 of those were actually positive. It also predicts that 12 sentences are non-positive and 10 of those sentences were actually non-positive.

Calculating the accuracy, we arrive at (10 + 5) / 20 = 0.75. If we only consider the accuracy, 75% might not seem bad. What happens if we calculate the precision? We obtain 5 / (5 + 3) = 0.625. This means that while our model has an accuracy of 75%, when it classifies sentences as positive, only 62.5% of those are actually positive.
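To make these numbers concrete, here is a short scikit-learn snippet reproducing them (1 marks a positive sentence, 0 a non-positive one; this is just an illustration, not part of the notebook):

from sklearn.metrics import accuracy_score, precision_score

# Reconstruct the imaginary model's predictions:
# 5 true positives, 3 false positives, 2 false negatives, 10 true negatives.
y_true = [1] * 5 + [0] * 3 + [1] * 2 + [0] * 10
y_pred = [1] * 5 + [1] * 3 + [0] * 2 + [0] * 10

print(accuracy_score(y_true, y_pred))   # 0.75
print(precision_score(y_true, y_pred))  # 0.625 (precision for the positive class)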

Recall

TP (True Positive — correct predictions), FN (False Negative — examples of a given sentiment that were incorrectly classified)

The recall is similar to precision. It is the ratio of correct classifications for a given sentiment to the number of examples (sentences) of that sentiment. Using the confusion matrix from before, the recall for the positive sentiment would be

127 / (127 + 12 + 2) = 0.9007....

For this hypothetical confusion matrix, this would mean that positive sentences in our dataset are predicted correctly 90% of the time.

Again, we might want to calculate an aggregate recall for all sentiments, and, again, we can calculate the aggregate recall using a macro or a micro average.

In this tutorial, we are using the macro recall, which is simply the mean recall over all sentiments.
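Later on, the Trainer will call a compute_metrics function to calculate these metrics on the testing split. The notebook already provides this function; as a rough sketch of what such a function typically looks like (our assumption of a typical implementation, not necessarily the exact code in the notebook):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) tuple for the evaluation split.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'precision': precision_score(labels, predictions, average='macro'),
        'recall': recall_score(labels, predictions, average='macro'),
    }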

Training

In order to train our model, we are using the Transformers library, and we are analysing our results using Weights & Biases.

Weights & Biases is a fantastic tool that allows us to visualize our experiments in real time and improves collaboration by making progress easy to share. It is used by OpenAI and GitHub.

Credit: Weights & Biases

In order to use W&B, you’ll need to first create an account 🔗here. After creating an account, you should be ready to go. While this step is not necessary, it is recommended.

You can also choose to upload your model to the HuggingFace Hub once it has finished training. This allows you to easily share your model. Additionally, it allows you (or anyone) to try out your model on a web interface (you can try ours 🔗here). Again, this step is not necessary, but it is recommended and will be useful if you attend the next session of WAI203.

Our model on the HuggingFace Hub

In order to use the HuggingFace Hub, create an account 🔗here.

Finally, in order to finalize the setup process for the two services above, tick the “Upload Model?” or “Use Weights & Biases?” boxes if you wish to do so, name your model (this step is required), and run the code cell.

After running the code cell you will be prompted for a HuggingFace account token. Follow the instructions displayed and click “Login”.

Token Prompt in the notebook
Create a token on your HuggingFace account settings

Now, onto the actual training code!

First things first, we need to load the pre-trained model which will be fine-tuned.

from transformers import PerceiverForSequenceClassification

model = PerceiverForSequenceClassification.from_pretrained(
    'deepmind/language-perceiver',
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
)

We specify the name of the model we want to load, which in this case is the Language Perceiver from DeepMind. We also need to specify how many labels we are classifying, along with two mappings (dictionaries). id2label maps an ID to the name of a label, for example, 0 -> negative, 1 -> neutral, 2 -> positive. label2id is similar, but maps a label to an ID: negative -> 0, neutral -> 1, positive -> 2.
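For reference, these mappings can be built from a simple list of label names. The order below matches the 0 -> negative, 1 -> neutral, 2 -> positive encoding mentioned above (check the dataset’s features if in doubt):

labels = ['negative', 'neutral', 'positive']
id2label = {i: label for i, label in enumerate(labels)}   # {0: 'negative', 1: 'neutral', 2: 'positive'}
label2id = {label: i for i, label in enumerate(labels)}   # {'negative': 0, 'neutral': 1, 'positive': 2}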

We’re now ready to fine-tune the pre-trained model. In this tutorial, we are using the HuggingFace Trainer API. This API allows us to focus on training the model, abstracting other details that may not be necessarily relevant.

We start by instantiating a TrainingArguments object which contains the hyperparameters and other configurations.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=model_name,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    learning_rate=2e-5,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    logging_strategy='steps',
    logging_first_step=True,
    logging_steps=5,
    push_to_hub=upload_model_tokenizer
)

If you have trained other models before, most if not all parameters here will be familiar. Let’s break this down.

  • output_dir is the directory where we want to save our model. In this case, we set this to the name of the model which you have defined before.
  • per_device_train_batch_size is the batch size per GPU/TPU for training. We found that a batch size of 16 works well.
  • per_device_eval_batch_size is the batch size per GPU/TPU for evaluation. The same batch size as for training is used (16).
  • num_train_epochs is the total number of training epochs. We have found that 3 to 4 epochs tend to work well.
  • learning_rate is the initial learning rate, which is then adjusted by AdamW, an improved version of Adam.
  • evaluation_strategy controls how often we evaluate our model. epoch is used so that the model is only evaluated after each training epoch.
  • save_strategy controls how often a model checkpoint is saved. Again, epoch is used so that a checkpoint is only saved after each training epoch.
  • logging_strategy is set to steps so that the loss, learning rate and current epoch are logged every logging_steps steps, which is set to 5.
  • logging_first_step is set to true so that the first step while training the model is logged.
  • push_to_hub controls whether our model should be automatically uploaded to the HuggingFace Hub. It is set to upload_model_tokenizer, which reflects the “Upload Model?” checkbox from earlier.

Finally, we put everything together!

from transformers import Trainer

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_splits['train'],
    eval_dataset=tokenized_splits['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

We instantiate a Trainer object which takes our pre-trained model model, the tokenizer tokenizer, the training arguments training_args we have just defined, our training dataset tokenized_splits['train'], our evaluation dataset tokenized_splits['test'], the data collator data_collator which will pad our tokenized sentences, and a compute_metrics function which will calculate the accuracy, macro precision, and macro recall when the model is evaluated on the testing split.

We call the train method, and, at last, our model is now being trained. If everything goes well, you should see something similar to the following.

In the Google Colab notebook, using the TPUs, it takes about 15 minutes to fine-tune the model. Now, how do we visualize the training progress using Weights & Biases?

First, head over to your profile on the Weights & Biases’ website. Click on “Projects” > “huggingface”. Here, you will be able to see different plots for the training loss, evaluation metrics (recall, accuracy, and so on) and other information.

Without Weights & Biases, sharing your results may involve creating separate documents or manually logging the outcome of your experiments using spreadsheets.

Using Weights & Biases, if you decide, for example, to train the model with different hyperparameters, a new run is created in the same project, allowing you to compare results in a single location. All of this is done automatically.

Imagine, for example, that you also want to share your results. Weights & Biases allows you to create reports, linked to this project, which can include plots from the training and evaluation process, code snippets, or images, allowing you to clearly explain how your model works to others.

(You can see our Weights & Biases project for our pre-trained model for financial sentiment analysis, FINPerceiver, 🔗here)

Once the model finishes training, the final results are uploaded to Weights & Biases, and your model is saved to the HuggingFace Hub (uploading your model will take around 5 minutes; a progress bar will be shown).

Model Analysis

We can briefly analyse our model by plotting the confusion matrix of our testing data. By running the two cells under “Model Analysis”, you should see something similar to the following.

Confusion Matrix for the testing data
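If you want to reproduce this plot outside the notebook, here is a hedged sketch using the trainer and splits defined earlier (the notebook’s own plotting cells may be implemented differently):

import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

# Run predictions on the testing split with the fine-tuned model.
predictions = trainer.predict(tokenized_splits['test'])
y_pred = np.argmax(predictions.predictions, axis=-1)
y_true = predictions.label_ids

# Plot the confusion matrix with human-readable sentiment names.
ConfusionMatrixDisplay.from_predictions(
    y_true,
    y_pred,
    display_labels=[id2label[i] for i in range(len(labels))]
)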

We have also added an “Inference” section where you can try your model.
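For a rough idea of what those inference cells do, here is a minimal sketch of classifying a new sentence with the fine-tuned model (the example sentence is made up, and the notebook’s own inference code may differ):

import torch

sentence = 'Operating profit rose compared with the previous quarter.'  # made-up example

# Move the model to the CPU for a simple forward pass outside the Trainer.
model = model.cpu()
model.eval()

encoding = tokenizer(sentence, padding='max_length', truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(inputs=encoding.input_ids, attention_mask=encoding.attention_mask)

predicted_id = outputs.logits.argmax(dim=-1).item()
print(id2label[predicted_id])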

Next Steps

We have successfully fine-tuned a Language Perceiver for financial sentiment analysis! However, you may ask: how do we apply this model in a real-world context?

Join us for the final WAI203 session in week 7 (date TBC, join the WAI203 course 🔗here to be notified), where we will create an application that analyses tweets regarding different financial instruments and displays this information on a nice dashboard.

You will be given the opportunity to use the model you have trained today or use our pre-trained Perceiver for financial sentiment analysis, FINPerceiver, which can be found 🔗here.

If you’re interested in further applications of the Perceiver, we recommend the following 📝article from HuggingFace for a Perceiver deep dive (includes several notebooks).

Looking for a challenge? Check out possible extensions in the following section.

Extensions

K-fold cross-validation

In this tutorial, we train our model once, on a randomly selected 90/10 split of the original dataset. However, this is only one of many possible dataset splits, and our model may perform differently on another 90/10 split.

In order to better evaluate the performance of our model, we can use cross-validation, more precisely, 10-fold cross-validation. Using this technique, we create 10 different 90/10 splits and train our model on each of them. Finally, we can summarize the results (for example, using the mean) to obtain a better evaluation of our model.

As an extension we propose modifying your existing code to perform 10-fold cross-validation. You may find the following 📝article and parts of our code in this repository useful.
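As a starting point, here is a rough skeleton for 10-fold cross-validation over the tokenized dataset (a fresh model, TrainingArguments and Trainer should be created inside each fold; the details are left for you to fill in):

import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_results = []

for fold, (train_idx, test_idx) in enumerate(kfold.split(np.arange(len(tokenized_dataset)))):
    train_split = tokenized_dataset.select(train_idx)
    test_split = tokenized_dataset.select(test_idx)

    # Re-create the model and Trainer here so that each fold starts from the
    # same pre-trained weights, then train and evaluate on this fold.
    # trainer.train()
    # fold_results.append(trainer.evaluate(test_split))

# Summarize the per-fold metrics, for example by taking their mean.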

Population-based Training Hyperparameter Optimisation

In this tutorial, we did not perform any hyperparameter optimisation.

The RayTune library provides multiple state-of-the-art algorithms and integrations with libraries such as Transformers, TensorBoard and Weights & Biases, just to name a few.

In particular, it includes an implementation of Population-based Training (PBT), a computationally efficient method developed by DeepMind (🔗 link) for hyperparameter optimisation. Inspired by genetic algorithms, it “trains many neural networks in parallel with random hyperparameters” and “uses information from the rest of the population to refine the hyperparameters and direct computational resources to models which show promise”.

“Population Based Training of neural networks starts like random search, but allows workers to exploit the partial results of other workers and explore new hyperparameters as training progresses” / DeepMind

As an extension, we propose using RayTune’s implementation of PBT to optimise the hyperparameters of your model. You may find the following tutorial and documentation useful.
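To give you a head start, here is a rough, hedged sketch of how PBT could be wired up through the Trainer’s hyperparameter_search method with the Ray backend (the search space, metric and trial count below are illustrative assumptions, not tuned values, and Ray Tune must be installed first):

from ray.tune.schedulers import PopulationBasedTraining
from transformers import Trainer

# hyperparameter_search needs a model_init function instead of a model,
# so that a fresh model is created for every trial.
def model_init():
    return PerceiverForSequenceClassification.from_pretrained(
        'deepmind/language-perceiver',
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id
    )

# PBT scheduler: 'objective' is the metric reported by the Trainer's Ray
# integration (by default a combination of the evaluation metrics).
pbt = PopulationBasedTraining(
    metric='objective',
    mode='max',
    hyperparam_mutations={
        'learning_rate': [1e-5, 2e-5, 3e-5, 5e-5],
        'per_device_train_batch_size': [8, 16, 32],
    },
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_splits['train'],
    eval_dataset=tokenized_splits['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(
    backend='ray',
    direction='maximize',
    n_trials=4,
    scheduler=pbt,
)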
