Fine-tuning a large language model on Kaggle Notebooks for solving real-world tasks — part 2
Hands-on fine-tuning for financial sentiment analysis
In this hands-on tutorial on fine-tuning a Llama 2 model on Kaggle Notebooks, we will deal with sentiment analysis of financial and economic information, showing how to handle such a task with limited and commonly available resources. Sentiment analysis of financial and economic information is highly relevant for businesses for several key reasons, ranging from market insights (gaining valuable insights into market trends, investor confidence, and consumer behavior) to risk management (identifying potential reputational risks) to investment decisions (by gauging the sentiment of stakeholders, investors, and the general public, businesses can assess the potential success of various investment opportunities).
Before getting into the technicalities of fine-tuning a large language model like Llama 2, we had to find the right dataset to demonstrate the potential of fine-tuning.
Particularly within finance and economic texts, annotated datasets are notably rare, with many exclusively reserved for proprietary purposes. In 2014, scholars from the Aalto University School of Business introduced a set of approximately 5,000 sentences to address the issue of insufficient training data (Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P., 2014, “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the Association for Information Science and Technology, 65[4], 782–796 - https://arxiv.org/abs/1307.5336). This collection aimed to establish human-annotated benchmarks, serving as a standard for evaluating alternative modeling techniques. The involved annotators (16 people with adequate background knowledge of financial markets) were instructed to assess the sentences solely from an investor's perspective, evaluating whether the news potentially holds a positive, negative, or neutral impact on the stock price.
The FinancialPhraseBank dataset is a comprehensive collection that captures the sentiments of financial news headlines from the viewpoint of a retail investor. Comprising two key columns, “Sentiment” and “News Headline,” the dataset effectively classifies sentiments as negative, neutral, or positive. This structured dataset is a valuable resource for analyzing and understanding the complex dynamics of sentiment in financial news. It has been used in various studies and research initiatives since its inception in the paper published in the Journal of the Association for Information Science and Technology in 2014.
The data is available under the CC BY-NC-SA 3.0 DEED license, and it can be found, complete with detailed descriptions and instructions, at https://huggingface.co/datasets/financial_phrasebank. There are also a couple of Kaggle Datasets mirrors. In our example, from all the available data (4,840 sentences from English-language financial news categorized by sentiment), we sample 900 examples for training and 900 for testing. The training and testing sets are balanced, containing the same number of positive, neutral, and negative examples. We also use a sample of about one hundred examples, mainly drawn from the remaining positive and neutral examples (not many negative examples were left), for evaluation purposes during training (we use the evaluation only for monitoring; no decision is taken based on such a sample).
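If you want to reproduce the sampling yourself, a minimal sketch (assuming the Hugging Face datasets library and the sentences_50agree configuration of the dataset; the notebook's exact splitting code may differ) could look like this:

from datasets import load_dataset
import pandas as pd

# Load the FinancialPhraseBank data; "sentences_50agree" keeps the sentences on
# which at least 50% of the annotators agreed (other configurations are stricter).
dataset = load_dataset("financial_phrasebank", "sentences_50agree")
df = dataset["train"].to_pandas()  # columns: "sentence" and "label" (0=negative, 1=neutral, 2=positive)

# Draw 300 examples per class for training and 300 per class for testing,
# so that both sets are balanced (900 examples each).
train_parts, test_parts = [], []
for label in sorted(df["label"].unique()):
    class_df = df[df["label"] == label].sample(frac=1.0, random_state=42)
    train_parts.append(class_df.iloc[:300])
    test_parts.append(class_df.iloc[300:600])

train_df = pd.concat(train_parts).sample(frac=1.0, random_state=42)
test_df = pd.concat(test_parts).sample(frac=1.0, random_state=42)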
Without much ado, we just point to the Kaggle notebook, where all the cells are commented step by step, showing how to structure the analysis:
In this article, we will instead illustrate the logical steps of fine-tuning. From a larger perspective, as in any machine learning project, you:
- retrieve data
- arrange data for training, validation, and testing
- instantiate your model
- evaluate your model as it is
- fine-tune (train) your model
- evaluate your model
When dealing with LLMs, however, it makes sense to first evaluate the model guided only by prompt engineering, in order to establish a baseline that is meaningful for your work (if your LLM is already skillful enough at the desired task, you actually do not need to perform any further fine-tuning).
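As an illustration of what such a baseline check can look like (a generic sketch rather than the notebook's exact prompt, and it assumes the model and tokenizer have been loaded as shown later in this article), you can ask the untuned model to complete a classification prompt and compare its answer with the gold label:

from transformers import pipeline

# Zero-shot baseline: ask the untuned model to classify one headline.
# The instruction wording and the example headline are illustrative.
prompt = (
    "Analyze the sentiment of the news headline enclosed in square brackets, "
    "determine if it is positive, neutral, or negative, and return the answer "
    "as the corresponding sentiment label.\n"
    "[Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in 2007.] ="
)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
answer = generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]
print(answer)  # repeat over the test set and count correct labels to get a baseline accuracy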
Let’s now delve into the practicalities of instantiating and fine-tuning your model. First, you need to define what LLM you are going to tune.
model_name = "../input/llama-2/pytorch/7b-hf/1"  # path of the Llama 2 7B-hf weights attached to the Kaggle notebook
Our choice fell on Llama 2 7b-hf, the 7B pre-trained model from Meta, converted to the Hugging Face Transformers format. Llama 2 constitutes a series of pretrained and fine-tuned generative text models, varying in size from 7 billion to 70 billion parameters. Employing an enhanced transformer architecture, Llama 2 operates as an auto-regressive language model. Its fine-tuned iterations involve both supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF), ensuring conformity with human standards for helpfulness and safety. Apart from being an already well-performing LLM, the choice of this model rests on the fact that it is the most nimble of the Llama family and, thus, the most suitable to demonstrate how even smaller LLMs are good choices for fine-tuning on specialized tasks.
Our next step is defining the BitsAndBytes configuration.
import torch
from transformers import BitsAndBytesConfig

compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the model weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # normalized float 4 quantization
    bnb_4bit_compute_dtype=compute_dtype,   # computations are run in float16
    bnb_4bit_use_double_quant=False,        # no nested (double) quantization
)
Bitsandbytes is a Python package developed by Tim Dettmers that acts as a lightweight wrapper around custom CUDA functions, particularly 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions. It allows running models stored in 4-bit precision: while bitsandbytes stores the weights in 4 bits, computation still happens in 16- or 32-bit precision, and any combination can be chosen (float16, bfloat16, float32, and so on). The idea behind bitsandbytes has been formalized in the QLoRA paper by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer (see https://arxiv.org/abs/2305.14314).
You can think of it as a compressor for the LLM that allows us to store it safely both on disk and in the memory of a standard computer or server: the network weights are stored in 4-bit precision (normalized float 4, which offers the better performance), potentially saving a lot compared to the typical 32-bit precision. Additionally, to increase the compression, one can opt for bnb_4bit_use_double_quant (we don't in our example), which applies a second quantization after the first one, saving an additional 0.4 bits per parameter. However, when computing on the network, computations are executed at the bnb_4bit_compute_dtype we defined, which is 16-bit precision, a numeric precision that provides both fast and accurate computations. This dequantization phase may take more time, depending on the amount of compression previously obtained.
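To get a rough idea of the savings, here is a back-of-the-envelope estimate of the memory needed to hold 7 billion parameters at different precisions (it ignores the non-quantized layers and the small overhead of the quantization constants):

# Approximate memory footprint of 7B parameters at different precisions.
n_params = 7e9
for name, bits in [("float32", 32), ("float16", 16), ("nf4 (4-bit)", 4)]:
    gigabytes = n_params * bits / 8 / 1024**3
    print(f"{name:>12}: ~{gigabytes:.1f} GB")
# Roughly: float32 ~26 GB, float16 ~13 GB, nf4 ~3.3 GB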
The next step, once the bitsandbytes configuration has been initialized, is to load our model using the Hugging Face (HF) AutoModelForCausalLM class and its tokenizer:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)
model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Here, apart from the quantization_config (the bitsandbytes compression) and device_map set to "auto" so you can leverage whatever you have on your system (CPU or GPUs), we have to note, as specifics for this model, the pretraining_tp parameter set to one (a value that, as stated in the HF documentation, is necessary to ensure exact reproducibility of the pretraining results) and use_cache set to False (it controls whether the model should return the last key/value attentions, which is not necessary here). On the tokenizer side, the pad token is set equal to the eos token (the end-of-sequence token used to indicate the end of a sequence of tokens), and the padding side is set to the right, which is commonly indicated as the correct side to use when working with Llama models.
After instantiating the model, we have to prepare the training phase, which requires a LoRA strategy that updates only a reduced number of parameters in order to adapt the original LLM to our task (see the previous article for more details).
from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
The LoRA config specifies the parameters for PEFT (parameter-efficient fine-tuning). The parameters we use are explained below (a small sketch after the list shows their effect on the number of trainable parameters):
- r: The rank of the LoRA update matrices. The rank represents a trade-off: the lower it is, the less memory is consumed, but at the price of a coarser approximation of the weight updates.
- lora_alpha: The scaling factor for the LoRA update matrices (the updates are scaled by lora_alpha / r). As a rule of thumb, it is often suggested to set it to double the r value.
- lora_dropout: The dropout probability for the LoRA update matrices.
- bias: Whether bias parameters should be trained. The possible values are none, all, and lora_only. We go for none, which does not train any bias parameters and so keeps the number of trainable parameters, and the resulting adapter, smaller.
- task_type: The type of task that the model is being trained for. PEFT supports several task types (among others, CAUSAL_LM and SEQ_2_SEQ_LM). Many say it doesn't make much difference, but CAUSAL_LM is the right choice for an auto-regressive model such as Llama 2.
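To see what this configuration implies in practice, you can wrap the model with the LoRA adapters and print how many parameters are actually trainable. This is purely illustrative: when you pass peft_config to the SFTTrainer, as we do later, the wrapping happens internally.

from peft import get_peft_model

# Illustrative only: attach the LoRA adapters defined in peft_config and report
# how many parameters would actually be updated during fine-tuning.
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()
# Prints something like "trainable params: ... || all params: ... || trainable%: ...",
# with the trainable parameters being a small fraction of the 7B total.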
Separately, we have to go for the training parameters:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="logs",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="tensorboard",
    evaluation_strategy="epoch",
)
The training_arguments object specifies the parameters for training the model. The following are some of the most important parameters:
- output_dir: The directory where the training logs and checkpoints will be saved.
- num_train_epochs: The number of epochs to train the model for.
- per_device_train_batch_size: The number of samples in each batch on each device.
- gradient_accumulation_steps: The number of batches over which gradients are accumulated before the model parameters are updated (together with the batch size, this determines the effective batch size; see the quick computation after this list).
- optim: The optimizer to use for training the model. Our choice is the paged_adamw_32bit optimizer, a paged variant of the AdamW optimizer that keeps its optimizer states in 32-bit precision and pages them out of GPU memory when it runs short. This avoids memory spikes during training and makes it feasible to fine-tune larger models on a single GPU.
- save_steps: The number of steps after which to save a checkpoint.
- logging_steps: The number of steps after which to log the training metrics.
- learning_rate: The learning rate for the optimizer.
- weight_decay: The weight decay parameter for the optimizer.
- fp16: Whether to use 16-bit floating-point precision. Training on GPU with fp16 set to True, as we do, can roughly halve memory usage and noticeably speed up training (and thus reduce its cost). However, it can also reduce numerical accuracy and make training less stable.
- bf16: Whether to use BFloat16 precision (not supported by our GPU).
- max_grad_norm: The maximum gradient norm, a hyperparameter used to clip the magnitude of gradient updates during training. It is relevant because it helps prevent the model from becoming unstable due to exploding gradients and overly large updates.
- max_steps: The maximum number of steps to train the model for (-1 means the number of steps is derived from num_train_epochs instead).
- warmup_ratio: The proportion of the training steps used to warm up the learning rate, i.e., the proportion of steps during which the learning rate is gradually increased from 0 to its target value. The warm-up can help improve the model's stability and performance.
- group_by_length: Whether to group the training samples by length to minimize padding applied and be more efficient.
- lr_scheduler_type: The type of learning rate scheduler to use. Our choice is the cosine scheduler: after the warm-up phase, it gradually decreases the learning rate following a cosine curve, which helps the model converge to a better solution toward the end of training.
- report_to: The tools to report the training metrics to. Our choice is to use TensorBoard.
- evaluation_strategy: The strategy for evaluating the model during training. By deciding on "epoch", we get an evaluation on the eval dataset at the end of every epoch, which helps us figure out whether the training and evaluation measures are diverging.
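As mentioned in the list above, it helps to work out what these settings imply in terms of optimizer steps. A quick back-of-the-envelope computation, assuming our 900 training examples on a single GPU, looks like this:

# Rough training schedule implied by our settings (exact step counts may differ
# slightly depending on how the last incomplete batch is handled).
train_examples = 900
effective_batch_size = 1 * 8                              # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = train_examples // effective_batch_size  # 112
total_steps = steps_per_epoch * 3                         # ~336 steps over num_train_epochs=3
warmup_steps = int(total_steps * 0.03)                    # ~10 steps of learning rate warm-up
print(effective_batch_size, steps_per_epoch, total_steps, warmup_steps)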
Finally, we can define the training itself, which is entrusted to the SFTTrainer from the trl package. trl is a library by Hugging Face providing a set of tools to train transformer language models with reinforcement learning and related methods, from the supervised fine-tuning step (SFT) and the reward modeling step (RM) to the proximal policy optimization step (PPO).
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    max_seq_length=1024,
)
The SFTTrainer object is initialized with the following arguments:
- model: The model to be trained.
- train_dataset: The training dataset.
- eval_dataset: The evaluation dataset.
- peft_config: The PEFT configuration.
- dataset_text_field: The name of the text field in the dataset (we use the Hugging Face Dataset implementation; a sketch of how such a field can be formatted follows this list).
- tokenizer: The tokenizer to use.
- args: The training arguments we previously set.
- packing: Whether to pack multiple short training samples into a single sequence for efficiency (we keep it disabled).
- max_seq_length: The maximum sequence length.
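As anticipated in the list above, the "text" field is expected to already contain the complete prompt the model learns from. The exact template is defined in the notebook; purely as an illustration (assuming train_data is a Hugging Face Dataset with "sentence" and "sentiment" columns), each training example could be formatted along these lines, with the gold sentiment appended as the completion:

# Illustrative only: build the "text" field for each training example.
def format_example(sentence: str, sentiment: str) -> str:
    return (
        "Analyze the sentiment of the news headline enclosed in square brackets, "
        "determine if it is positive, neutral, or negative, and return the answer "
        "as the corresponding sentiment label.\n"
        f"[{sentence}] = {sentiment}"
    )

train_data = train_data.map(
    lambda row: {"text": format_example(row["sentence"], row["sentiment"])}
)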
This basically completes our work: all that is left to do is run the training itself and then save the updated model (that is, the LoRA adapter weights) to disk:
trainer.train()
trainer.model.save_pretrained("trained-model")
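If you later want to use the fine-tuned model for inference (a hedged sketch: the generation settings and the example headline are illustrative), you can reload the quantized base model and attach the saved LoRA adapter with peft:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the 4-bit base model and attach the trained LoRA adapter saved above.
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)
finetuned_model = PeftModel.from_pretrained(base_model, "trained-model")

prompt = (
    "Analyze the sentiment of the news headline enclosed in square brackets, "
    "determine if it is positive, neutral, or negative, and return the answer "
    "as the corresponding sentiment label.\n"
    "[Sales decreased by 8% compared with the previous year.] ="
)
inputs = tokenizer(prompt, return_tensors="pt").to(finetuned_model.device)
output = finetuned_model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))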
This completes our tour of the steps for fine-tuning an LLM such as Meta's Llama 2 in Kaggle Notebooks (it can work on consumer hardware, too). As with many machine learning problems, after grasping the technicalities of running the training, everything boils down to a good understanding of the problem, proper data preparation, and some experimentation to adapt your tools to the problem (and vice versa if necessary).