Tracking and Monitoring Transformers with MLFoundry

Efficient tracking and monitoring of Transformer models for Financial Sentiment Analysis using MLFoundry, by TrueFoundry

Tezan Sahu
Analytics Vidhya


Source: Catchpoint Digital Monitoring: Offering lowest cost options without compromising quality

Natural language processing (NLP) has become increasingly popular in financial applications in recent years. These applications include stock and forex market forecasting, volatility modeling, asset allocation, business taxonomy construction, credit scoring, and initial public offering (IPO) valuation, among others. One common task in this domain (or rather a subtask, since it can provide features for some of the applications above) is Financial Sentiment Analysis (FSA). The goal of FSA is to classify financial text as expressing bullish or bearish opinions on specific arguments.

In this new era of NLP, tasks like FSA have also been shaped by the dominance of transformers. Models like FinBERT perform significantly better than previous approaches.

In this tutorial, we will explore how to easily fine-tune any pretrained transformer for the FSA task using the Simple Transformers library. Then, we will experiment iteratively by changing parameters and base models, tracking the relevant information using MLFoundry. Finally, we will create a custom demo of our model using the MLFoundry Web App, which can be shared with others or showcased to users.

So, let’s dive in!

TL;DR: If you only wish to see the working implementation of everything covered in this article, please refer to this notebook. Although GitHub gists are used as code snippets throughout this article, they may not work as intended if copied directly. Refer to the notebook if you face such issues.


We first create a virtual environment for this project and install the necessary libraries:
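Once the libraries are installed, a quick sanity check can confirm that the environment is ready. The sketch below uses only the standard library and lists any module used in this tutorial that cannot be imported. The module names are assumptions based on the libraries mentioned here (simpletransformers, mlfoundry) and their usual companions (torch, pandas, sklearn); adjust them if your setup differs.

```python
import importlib.util

# Assumed import names for the libraries used in this tutorial.
REQUIRED_MODULES = ["simpletransformers", "mlfoundry", "torch", "pandas", "sklearn"]


def missing_modules(names):
    """Return the top-level modules from `names` that are not importable here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required modules are importable.")
```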


Note: Access to a GPU is advisable for training the transformer models, because they are large and take considerable time to train otherwise. Google Colab is a viable option.

Next, we need to create an IPython notebook and import the required libraries. Optionally, we can also clear the cache in CUDA.

We do not need to set the device in PyTorch explicitly because the simpletransformers library takes care of that and uses the GPU by default. To run on a CPU instead, pass use_cuda=False when creating the model.
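The optional cache-clearing step can be sketched as follows. This is a minimal helper, not necessarily what the notebook does: it collects garbage first and then calls torch.cuda.empty_cache() only when PyTorch and a CUDA device are present. The function name clear_cuda_cache is ours.

```python
import gc


def clear_cuda_cache():
    """Release cached GPU memory if PyTorch with CUDA is available.

    Returns True when the CUDA cache was cleared, False otherwise.
    """
    gc.collect()  # drop unreachable Python objects so their tensors can be freed
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


print("CUDA cache cleared:", clear_cuda_cache())
```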

Sentiment Analysis for Financial News using Simple Transformers

Exploring & Processing the Dataset

For this tutorial, we use the processed FinancialPhraseBank data available from Kaggle. This dataset contains sentiments for financial news headlines as seen through the eyes of a retail investor. The all-data.csv file contains two columns, Sentiment and News Headline. The sentiment can be negative, neutral, or positive. Thus, the FSA task is treated as a multi-class classification problem.

Since the Simple Transformers library requires data to be in Pandas DataFrames with at least two columns, named text (of type str) and labels (of type int), we do the required processing before splitting the data into training & evaluation sets.
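The processing can be sketched with the standard library alone. The example below assumes a two-column layout (sentiment, headline) without a header row, an integer encoding of our own choosing, and an 80/20 split; none of these are confirmed by the dataset description above. The sample rows are made up purely to exercise the functions.

```python
import csv
import io
import random

# Assumed label encoding; any consistent mapping to integers works.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}


def read_examples(csv_file):
    """Turn (sentiment, headline) CSV rows into records with `text` and `labels` keys."""
    return [
        {"text": headline, "labels": LABEL_TO_ID[sentiment]}
        for sentiment, headline in csv.reader(csv_file)
    ]


def split_examples(examples, eval_fraction=0.2, seed=42):
    """Shuffle reproducibly and return (train, eval) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Made-up rows in the assumed layout, used only for illustration.
sample_csv = io.StringIO(
    'neutral,"The company will publish its interim report in August"\n'
    'positive,"Operating profit rose compared with the previous year"\n'
    'negative,"Net sales decreased and the firm issued a profit warning"\n'
    'neutral,"The contract covers deliveries over a two-year period"\n'
    'positive,"Order intake grew in all business areas"\n'
)
examples = read_examples(sample_csv)
train_examples, eval_examples = split_examples(examples)
print(len(train_examples), "training and", len(eval_examples), "evaluation examples")
```

With pandas installed, pd.DataFrame(train_examples) and pd.DataFrame(eval_examples) give frames with the text and labels columns that Simple Transformers expects. When reading the real file, you may need to pass an explicit encoding such as latin-1.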

Training a BERT Model for FSA

We define a simple training function that takes in the specifications of the model to create a ClassificationModel, along with the training hyperparameters to perform the training loop. We also evaluate the model using Micro-F1 and Accuracy scores by leveraging scikit-learn.

As a wrapper around the Hugging Face Transformers library, Simple Transformers makes all of this extremely simple by abstracting away the heavy lifting. Thus, we can create, train & evaluate our transformer with just 3–4 statements.

In the constructor of ClassificationModel, the first parameter is the model_type, the second is the model_name, and the third is the number of labels in the data (set to 3 because we have three sentiments). The model_type must be one of the architectures that Simple Transformers currently supports (for example, bert or roberta). The model_name can be any compatible model available on Hugging Face.

After training the model, we evaluate it using eval_model(). This function returns the following:

  • A dict containing the performance metrics on the evaluation dataset (Matthews correlation coefficient and loss by default, along with micro-F1 and accuracy defined by us)
  • A list of model outputs for each evaluation instance
  • A list of inputs for which the model predicts incorrectly

The trained model checkpoints are stored in training_args['output_dir'].
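Put together, the training function might look like the sketch below. It is a stand-in for trainModel(), not the notebook's exact code: the name train_and_evaluate, the model_cls parameter (added so the function can be dry-run with a dummy class), and the lazy imports are our own choices. The calls to ClassificationModel, train_model(), and eval_model() follow the Simple Transformers API; num_labels is passed by keyword so its position does not matter.

```python
def micro_f1(labels, preds):
    """Micro-averaged F1, computed with scikit-learn."""
    from sklearn.metrics import f1_score  # lazy import: only needed during evaluation
    return f1_score(labels, preds, average="micro")


def accuracy(labels, preds):
    """Plain accuracy, computed with scikit-learn."""
    from sklearn.metrics import accuracy_score
    return accuracy_score(labels, preds)


def train_and_evaluate(train_df, eval_df, model_params, training_args, model_cls=None):
    """Fine-tune a classifier and return the metrics dict produced by eval_model()."""
    if model_cls is None:
        from simpletransformers.classification import ClassificationModel as model_cls

    model = model_cls(
        model_params["model_type"],   # e.g. "bert"
        model_params["model_name"],   # e.g. "bert-base-uncased"
        num_labels=3,                 # negative, neutral, positive
        args=training_args,
    )
    model.train_model(train_df)

    # Extra keyword arguments are metric functions called as metric(labels, preds).
    result, model_outputs, wrong_predictions = model.eval_model(
        eval_df, f1=micro_f1, acc=accuracy
    )
    return result
```

The returned dict then holds the default metrics plus the f1 and acc entries, keyed by the keyword names passed to eval_model().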

There are many hyperparameters that you can change based on your requirements. The full list, with default values, is available in the Simple Transformers documentation.

Before actually supplying the trainModel() function with the required parameters, we will introduce MLFoundry to set up the tracking required for our experimentation.

Introducing MLFoundry for Tracking and Monitoring

MLFoundry is the ML Monitoring & Experiment Tracking solution created by TrueFoundry that allows users to track their experiments, models, metrics, data & features. Each experiment, with a unique combination of parameters, dataset, and metrics, is considered a run, and multiple such runs can be grouped logically as a project. Each run has a unique run_id, but can also be given a run_name for easier referencing. Later, these runs can be inspected and compared using an interactive dashboard.

Logging Experiment Details with MLFoundry

First, we import MLFoundry and initialize the API. Then, we create our project (called financial-sentiment-analysis) with our first run named bert_3epochs. This experiment will involve fine-tuning a BERT model for three epochs.
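A hedged sketch of this step follows. The entry points mlf.get_client() and create_run(project_name=..., run_name=...) are written from our recollection of the MLFoundry client at the time and are assumptions to verify against the version you install. The helper start_run and its client parameter are hypothetical, added so the flow can be tried with a dummy client.

```python
def start_run(project_name, run_name, client=None):
    """Create a run inside a project and return the run object."""
    if client is None:
        # Assumed MLFoundry entry point; check your installed version.
        import mlfoundry as mlf
        client = mlf.get_client()
    # Assumed signature: keyword arguments naming the project and the run.
    return client.create_run(project_name=project_name, run_name=run_name)


# Usage (requires mlfoundry to be installed):
# mlf_run = start_run("financial-sentiment-analysis", "bert_3epochs")
```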

This creates an mlf/ folder in the project directory that will contain all the information across the various runs logged by MLFoundry.

Now, we modify our trainModel() function to accept a run as input and log all the required information (parameters, dataset, metrics, and dataset stats).

  • To log the training & evaluation datasets, we use log_dataset()
  • We log the model specifications (type and name), along with the hyperparameters, as a dictionary using log_params()
  • The dictionary containing the performance metrics on our evaluation set (accuracy and micro-F1) is logged using log_metrics()
  • Various metrics related to our dataset, along with statistics like counters, summaries, histograms, and most frequent values, are estimated automatically using whylogs when logged using log_dataset_stats()
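The parameter and metric logging described above can be sketched as below. Only log_params() and log_metrics() are shown, each receiving a dictionary as the text describes; log_dataset() and log_dataset_stats() are left out because their arguments are not given here. RecordingRun is a hypothetical stand-in for a real run object so that the example runs without MLFoundry, and the metric values are the ones reported for the bert_3epochs run.

```python
def log_experiment(run, model_params, training_args, eval_result):
    """Log model specs, hyperparameters, and evaluation metrics to a run-like object."""
    # Model type/name and hyperparameters go into a single parameters dictionary.
    run.log_params({**model_params, **training_args})
    # Keep only the metrics we defined ourselves during evaluation.
    run.log_metrics({"accuracy": eval_result["acc"], "micro_f1": eval_result["f1"]})


class RecordingRun:
    """Hypothetical stand-in for an MLFoundry run; it just remembers what was logged."""

    def __init__(self):
        self.params = {}
        self.metrics = {}

    def log_params(self, params):
        self.params.update(params)

    def log_metrics(self, metrics):
        self.metrics.update(metrics)


run = RecordingRun()
log_experiment(
    run,
    model_params={"model_type": "bert", "model_name": "bert-base-uncased"},
    training_args={"num_train_epochs": 3},
    eval_result={"acc": 0.82, "f1": 0.82},
)
print(run.params)
print(run.metrics)
```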

Now, we set the hyperparameters in the form of a dictionary along with the model specifications and call the trainModel() function for the run that we defined previously.

This loads weights from the pre-trained bert-base-uncased model and fine-tunes it based on the supplied hyperparameters for three epochs.
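For illustration, the two dictionaries could look like this. The keys are standard Simple Transformers arguments, but apart from the three epochs and the bert-base-uncased checkpoint mentioned above, every value is an illustrative choice of ours rather than the notebook's setting.

```python
# Model specification: architecture type plus the pretrained checkpoint name.
model_params = {
    "model_type": "bert",
    "model_name": "bert-base-uncased",
}

# Training hyperparameters passed to Simple Transformers via `args`.
training_args = {
    "num_train_epochs": 3,                 # three epochs for the first run
    "train_batch_size": 16,                # illustrative
    "learning_rate": 4e-5,                 # illustrative
    "max_seq_length": 128,                 # illustrative; headlines are short
    "output_dir": "outputs/bert_3epochs",  # illustrative checkpoint folder
    "overwrite_output_dir": True,          # allow re-running into the same folder
}

# The training function is then called with these dictionaries and the run, e.g.:
# trainModel(mlf_run, model_params, training_args)   # argument order is an assumption
print(model_params["model_name"], "for", training_args["num_train_epochs"], "epochs")
```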

Navigating around the MLFoundry Dashboard

Once the training is complete, we can view all the logged information in the MLFoundry Dashboard by running the command mlfoundry ui from within the project folder (the one containing the mlf/ folder). This starts the dashboard on localhost:4200 by default (we can change the port by running export STREAMLIT_SERVER_PORT=<your-preferred-port> before launching the dashboard).

Under the Single Run View, select financial-sentiment-analysis as the Project Name and bert_3epochs as the Run Name. Now, we can inspect all the information about that specific run that has been tracked, using the different tabs:

  • The Model Health section shows the performance metrics of the current model on the evaluation dataset. These include a confusion matrix (since this is a multiclass classification task) along with other relevant plots.
    Figure: Model Health section showing the various user-generated and auto-generated metrics for the bert_3epochs run
  • The Data Health section contains various stats related to our dataset, which can be used to understand the data quality and compare it against other datasets if there is a data change later.
  • The Feature Health section shows the numerical distribution of labels and predicted values based on input features. For our case, there is only one input feature named headline, containing the financial news headline.
    Figure: Feature Health section showing the numerical feature distribution of classes for the labels and predictions for the bert_3epochs run
  • The Run Details section displays all the parameters and metrics logged for the run and also allows users to view the datasets and other artifacts related to the run that were tracked.

We see that after three epochs of training, our fine-tuned BERT model achieves an accuracy of 0.82 and a micro-F1 score of 0.82.
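The two scores coincide for a reason. In single-label multiclass classification, every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro-averaged precision, recall, and F1 all equal accuracy. The small pure-Python check below illustrates this on toy labels of our own; it is not the evaluation data.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def micro_f1(labels, preds):
    """Micro-averaged F1: pool TP, FP, and FN over all classes before computing F1."""
    tp = fp = fn = 0
    for cls in set(labels) | set(preds):
        tp += sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp += sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn += sum(l == cls and p != cls for l, p in zip(labels, preds))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 0 = negative, 1 = neutral, 2 = positive (8 of 10 predictions correct).
labels = [0, 1, 2, 1, 1, 2, 0, 1, 2, 1]
preds = [0, 1, 2, 1, 2, 2, 1, 1, 2, 1]

print("accuracy:", round(accuracy(labels, preds), 4))
print("micro-F1:", round(micro_f1(labels, preds), 4))
```

Because of this equivalence, a macro-averaged or per-class F1 would be the more informative companion to accuracy when the classes are imbalanced.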

Efficient Tracking and Comparison of Multiple Experiments with MLFoundry

Now, we will see how easy it is to experiment with different models and parameters for a specific task while being able to track all the relevant information for reference by creating different runs within the same project.

To demonstrate experimentation by changing hyperparameters, we create a new run called bert_5epochs and pass it to the trainModel() function along with the training_args dictionary (this time, the value of num_train_epochs in this dictionary is set to 5 instead of 3) and model_params (no change).

To illustrate the use of different transformer architectures for experimentation, we create a new run called roberta. This time, we keep the training_args unchanged from the previous run (with 5 epochs) and change the model_params to use the weights of pre-trained RoBERTa instead of BERT.
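One way to keep the three experiments organized is to describe them as data and derive each variant from a shared base, as in the sketch below. The roberta-base checkpoint name and the per-run output folders are our assumptions; the run names and epoch counts come from the text.

```python
# Shared hyperparameters; each run gets its own copy with targeted overrides.
base_args = {"num_train_epochs": 3, "overwrite_output_dir": True}

bert = {"model_type": "bert", "model_name": "bert-base-uncased"}
roberta = {"model_type": "roberta", "model_name": "roberta-base"}  # assumed checkpoint

experiments = [
    {"run_name": "bert_3epochs", "model_params": bert,
     "training_args": dict(base_args)},
    {"run_name": "bert_5epochs", "model_params": bert,
     "training_args": dict(base_args, num_train_epochs=5)},
    {"run_name": "roberta", "model_params": roberta,
     "training_args": dict(base_args, num_train_epochs=5)},
]

for exp in experiments:
    # A separate folder per run keeps checkpoints from overwriting each other.
    exp["training_args"]["output_dir"] = f"outputs/{exp['run_name']}"
    print(exp["run_name"], exp["model_params"]["model_name"],
          exp["training_args"]["num_train_epochs"])
```

Each entry can then be turned into a run and passed to the training function in a loop, which keeps the difference between runs explicit and easy to review.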

Once the training of these models completes, we again go to the MLFoundry Dashboard to inspect the tracked information by selecting the specific run_name under Single Run View. Now that we have more than one run, we can also compare them under Run Comparison.

The runs to be compared can be selected using either their run_id or run_name, the latter being more convenient. The comparison shows the following:

  • When trained for more epochs, the BERT model showed better accuracy
  • The RoBERTa model gives the best accuracy of 0.866

Scrolling down, we can also see the consolidated plot of all the performance metrics tracked for each run.

Model Demo using MLFoundry Web App

Having understood the process of iterative experimentation, tracking, and comparison of transformer models with MLFoundry, we now wish to demonstrate the performance of our best model (i.e. RoBERTa) to a broader audience. This is where the MLFoundry Web App comes in handy.

For this, we can create a standalone web app file that will be registered with the appropriate run using log_webapp_file(). Since we want to make the demo for our RoBERTa model, we name this file (it could be named anything else as well). In this file, we first need to write a function that can load a saved Simple Transformer model and predict the sentiment using the loaded model for an input financial news headline. Then, we need to initialize the MLFoundry client, create a run, and call webapp() from the run by supplying it with the prediction function, type(s) of inputs and outputs (here, we have just one input and one output, each of type text). This defines a model demo interface on the dashboard.
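The prediction part of such a file can be sketched as follows. The checkpoint folder outputs/roberta and the id-to-label mapping are assumptions that must match how the model was trained. The call to webapp() is not shown because its exact arguments are not given here. ClassificationModel and predict() are used as in the Simple Transformers API, where predict() takes a list of texts and returns the predicted class ids together with the raw model outputs.

```python
# Assumed mapping; it must mirror the label encoding used during training.
ID_TO_LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def load_model(checkpoint_dir="outputs/roberta", use_cuda=False):
    """Load a fine-tuned Simple Transformers classifier from a saved checkpoint folder."""
    from simpletransformers.classification import ClassificationModel
    return ClassificationModel("roberta", checkpoint_dir, use_cuda=use_cuda)


def predict_sentiment(headline, model):
    """Return the sentiment label predicted for a single financial news headline."""
    predictions, raw_outputs = model.predict([headline])
    return ID_TO_LABEL[int(predictions[0])]


# Usage (requires simpletransformers and a saved checkpoint):
# model = load_model()
# print(predict_sentiment("Operating profit rose compared with the previous year", model))
```

Loading the model once and reusing it across calls avoids reloading the checkpoint for every headline typed into the demo.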

Back in our notebook where we trained our models, we can register this file with mlf_run_3, which tracks all the information for our RoBERTa-based experiment.

In the dashboard, when we select the roberta run and open the Model Demo section, we see an interactive demo (similar to the Model Card in Hugging Face) that leverages our saved RoBERTa checkpoint to predict the sentiment of a financial news headline.

Example of a financial news headline being classified correctly as ‘positive’ by our fine-tuned RoBERTa model as seen in the MLFoundry Web App for the run named ‘roberta’

Since this web app is built using Streamlit, we can import streamlit into the file and use it to customize the model demo dashboard with additional elements (for example, explainability views).

Concluding Remarks

In this tutorial, we first saw how to easily train transformer models for an NLP task like Financial Sentiment Analysis using Simple Transformers. We then used MLFoundry to track the parameters, metrics, datasets, and stats for the different experiments that we performed by varying some hyperparameters and the model architecture. Finally, we created a web app to demonstrate how our trained model predicts the sentiment of financial news headlines.

Feel free to check out the References section below for more details regarding the documentation of the libraries mentioned in this tutorial. Please leave any feedback or suggestions in the comments section below.

TrueFoundry is building one of the fastest frameworks for ML pipelines that relies on open standards, saving 30–40% of a data science team’s time through its automated post-model pipeline. Feel free to sign up for early access to TrueFoundry’s ML monitoring and auto-scaling solution!


