Fine-tuning a large language model on Kaggle Notebooks for solving real-world tasks — part 3

Luca Massaron
4 min read · Dec 18, 2023

Reprising hands-on fine-tuning for financial sentiment analysis with Mistral 7B Instruct v0.2 and Phi-2

credits: DALL·E 3

After fine-tuning Llama 2 7B on a dataset for financial sentiment analysis using consumer-grade, easily accessible, and free GPUs (https://www.kaggle.com/code/lucamassaron/fine-tune-llama-2-for-sentiment-analysis/), you can reuse the very same code to fine-tune more recently released large language models, such as Mistral 7B Instruct v0.2 and Phi-2.

In this article, I will present the most interesting characteristics of these new large language models and show how to modify the original Llama fine-tuning code to adapt it to each of them.

The new models

Mistral 7B Instruct v0.2 builds upon the foundation of its predecessor, Mistral 7B Instruct v0.1, introducing refined instruction-tuning techniques that elevate its capabilities. Everything starts from Mistral 7B, developed by Mistral AI, a Paris-based startup founded by former Google DeepMind and Meta employees, which aims to compete with OpenAI in building, training, and applying large language models and generative AI.

This 7.3B-parameter model stands out among its counterparts, consistently surpassing Llama 2 13B on all benchmarks and matching Llama 1 34B on numerous tasks. It even approaches CodeLlama 7B's proficiency in code-related areas while maintaining its excellence on English-language tasks (and it handles the major European languages remarkably well).

To achieve this remarkable level of performance, Mistral 7B employs two innovative techniques: Grouped-query attention (GQA) for accelerated inference and Sliding Window Attention (SWA) for efficiently handling lengthy sequences at a lower cost.

GQA streamlines inference by letting groups of query heads share the same key and value heads, which shrinks the key-value cache and cuts memory bandwidth, reducing computational time and enhancing overall speed. SWA, on the other hand, tackles the challenge of operating on lengthy sequences by restricting each token's attention to a fixed-size window of the most recent tokens; information can still travel across longer distances through the stacked layers, while computation and memory consumption stay bounded.
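To give an intuition of the sliding-window idea, here is a toy sketch of the masking rule in PyTorch. This is only an illustration, not Mistral's actual implementation, and the window size of 4 is an arbitrary example value:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to position j only if
    j <= i (causal) and i - j < window (inside the sliding window)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)   # key positions, row vector
    return (j <= i) & ((i - j) < window)

# Toy example: 8 tokens, window of 4.
# Each row shows which past tokens that position is allowed to see.
print(sliding_window_causal_mask(8, 4).int())
```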

The Mistral 7B Instruct model is designed to be fine-tuned for specific tasks, such as instruction following, creative text generation, and question answering, which shows how amenable Mistral 7B is to fine-tuning. As a caveat, it has no built-in moderation mechanism to filter out inappropriate or harmful content.

Phi-2 is instead a small language model developed by Microsoft Research. It has only 2.7 billion parameters, significantly fewer than other LLMs. Its training is based on a corpus similar to that of Phi-1 and Phi-1.5, focusing on “textbook-quality” data, including subsets of Python code from The Stack v1.2, Q&A content from StackOverflow, competition code from code_contests, and synthetic Python textbooks and exercises generated by gpt-3.5-turbo-0301. Phi-2 has also not undergone fine-tuning through reinforcement learning from human feedback, hence there is no filtering of any kind.

Code adjustments

Setting Mistral 7B Instruct to work is a breeze (no pun intended 😄). All you need to consider is that, to leverage its instruction fine-tuning, you have to enclose your prompt between [INST] and [/INST] markers. That’s all! After the fine-tuning process, the model reaches top performance for all the sentiment classes on our test set:

Accuracy: 0.868
Accuracy for label negative: 0.977
Accuracy for label neutral: 0.743
Accuracy for label positive: 0.883
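For reference, here is a minimal sketch of how a sentiment prompt could be wrapped between the [INST] and [/INST] markers at inference time; the instruction wording, the example headline, and the generation settings are illustrative assumptions, not the notebook's exact code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

headline = "Operating profit rose clearly compared with the previous year."
# The Mistral Instruct format expects the instruction enclosed in [INST] ... [/INST].
prompt = (
    "[INST] Analyze the sentiment of the news headline enclosed in square brackets "
    "and answer with a single word: positive, neutral, or negative.\n"
    f"[{headline}] [/INST]"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
# Decode only the newly generated tokens, i.e. the predicted sentiment.
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer.strip())
```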

Phi-2 requires more work because it follows instructions less strictly and displays very peculiar behavior. It tends to treat the question as a quiz and to return unrequested material from the texts it saw during its original training: after evaluating the sentiment of a phrase, it may eruditely launch into a discussion about the Mughal empire. The most effective way to obtain answers from the model is to cap the response at a handful of tokens (at least 3, so that the extra spaces and answer letters that inevitably precede the prediction still fit) and to structure the prompt as:

The sentiment of the following phrase: ‘…’

Solution: The correct option is ...
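Putting the two tricks together, here is a sketch of how the Phi-2 prompt could be assembled and how generation could be capped at a few new tokens; the phrasing and parameters are my assumptions, not the notebook's verbatim code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

phrase = "The company expects its net sales to decline this year."
# Quiz-style prompt that nudges Phi-2 toward answering with a single option.
prompt = (
    f"The sentiment of the following phrase: '{phrase}'\n\n"
    "Solution: The correct option is"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# A tiny max_new_tokens budget leaves room for the stray spaces and letters that
# precede the label, while cutting off Phi-2's urge to keep lecturing afterwards.
output = model.generate(**inputs, max_new_tokens=3, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```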

Another essential fact about Phi-2 is that you need to declare the target modules you want to adapt when setting the parameters for LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique that injects small, trainable low-rank update matrices into selected layers of a transformer while keeping the original weights frozen. Here, we found it necessary to explicitly specify “Wqkv” and “out_proj”, the modules of Phi-2’s attention blocks that handle the query/key/value projections and the attention output projection.
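With the peft library, this amounts to listing the two module names in the LoRA configuration; the rank, alpha, and dropout values below are illustrative placeholders rather than the notebook's exact settings:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", trust_remote_code=True, device_map="auto"
)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices (placeholder)
    lora_alpha=32,                        # scaling of the update (placeholder)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["Wqkv", "out_proj"],  # Phi-2's fused QKV and attention output projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```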

Wqkv is the fused linear projection that generates the attention mechanism’s query, key, and value vectors in a single pass. These vectors are then used to compute the attention scores, which determine how relevant each token in the sequence is to every other token.

out_proj is the linear layer that projects the attention output back into the model’s hidden dimension, recombining the information gathered by the attention heads before it is passed on to the rest of the block.
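To make the roles of the two modules concrete, here is a deliberately simplified, single-head sketch of how a fused QKV projection and an output projection interact; the hidden size is a toy value and the real Phi-2 attention splits the work across many heads:

```python
import torch
import torch.nn as nn

hidden_size = 64                                  # toy hidden size, for illustration only
Wqkv = nn.Linear(hidden_size, 3 * hidden_size)    # fused query/key/value projection
out_proj = nn.Linear(hidden_size, hidden_size)    # maps the attention output back to the hidden size

x = torch.randn(1, 10, hidden_size)               # (batch, sequence, hidden)
q, k, v = Wqkv(x).chunk(3, dim=-1)                # split the fused projection into Q, K, V
scores = torch.softmax(q @ k.transpose(-2, -1) / hidden_size ** 0.5, dim=-1)
y = out_proj(scores @ v)                          # same shape as the input: (1, 10, hidden_size)
print(y.shape)
```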

In the context of the Phi-2 model, these are the modules adapted by LoRA during fine-tuning. By adjusting them, the model learns to better understand and respond to the instructions for our sentiment task.

By doing so, the results are slightly below Mistral Instruct’s but better than Llama 2’s, and they are obtained with a much smaller model:

Accuracy: 0.856
Accuracy for label negative: 0.973
Accuracy for label neutral: 0.743
Accuracy for label positive: 0.850

Luca Massaron

Data scientist molding data into smarter artifacts. Author on AI, machine learning, and algorithms for Wiley, Packt, Manning. 3x Kaggle Grandmaster.