Augmenting Classification Datasets with Mistral Large for Deeper Reasoning

Wing Lian
6 min read · Mar 5, 2024


As the AI landscape continues to evolve, the capabilities of large language models become increasingly evident, especially to companies looking to leverage their potential. We often hear that the key to the effectiveness of these models is the quality of the underlying datasets on which they are trained. However, in most scenarios, existing datasets, particularly those used for classification tasks, lack the ability to elicit the underlying reasoning these models are capable of in more complex real-world scenarios. In light of this, we are going to explore how we can leverage a high-quality model such as Mistral Large to augment and enhance a classification dataset so that we can interact with the finetuned model in a more conversational manner.

For this walkthrough, we’re going to start with a fairly common ML use case: sentiment analysis. What we often find with sentiment analysis datasets is that they only include the sentiment labels, typically positive, neutral, and negative. We are going to use the financial_phrasebank dataset available on HuggingFace, which consists of sentences from financial news that have been labeled by multiple annotators (which I assume to mean human annotators). This particular dataset has multiple subsets based on the level of agreement among the annotators: 50%, 75%, and 100% agreement.

Upon inspecting this dataset (available in the dataset viewer), we see there are only two fields, sentence and label. The labels are numeric, mapping to 0 => negative, 1 => neutral, and 2 => positive. In order to augment this dataset, we can leverage the existing labels as a guide for the teacher model, in this case Mistral Large, to generate reasoning for the sentiment label while knowing the correct answer, or ground truth. For the training dataset, we will use the sentences_allagree subset, which contains ~2200 rows. The prompt that we use to generate the reasoning is:

prompt = f"""Sentence: {row["sentence"]}

Ground Truth: {label_map[row["label"]]}

Analyze the sentence above and determine if the sentiment of the sentence
is neutral, positive or negative. Explain your reasoning in detail before
stating your conclusion of the sentiment. The conclusion should simply be
the string value of the sentiment. Your conclusion must match the provided
ground truth.

Respond with JSON with fields `reasoning` and `conclusion`, i.e.
{{ "reasoning": "...", "conclusion": "neutral|positive|negative" }}.
"""
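The row and label_map referenced in the prompt come from the HuggingFace dataset and a simple mapping from the numeric labels back to strings. A minimal sketch of that setup (variable names here are illustrative, not necessarily those in the published code):

from datasets import load_dataset

# the 100%-agreement subset used for the training split
# (newer versions of datasets may require trust_remote_code=True here)
dataset = load_dataset("financial_phrasebank", "sentences_allagree", split="train")

# map the numeric labels back to their string values
label_map = {0: "negative", 1: "neutral", 2: "positive"}

row = dataset[0]
print(row["sentence"], label_map[row["label"]])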

To simplify parsing the response into its parts, reasoning and conclusion, we can use the instructor library to parse the response according to a structure that we define with pydantic.

from typing import Literal

from pydantic import BaseModel

class AugmentedLabels(BaseModel):
    reasoning: str
    conclusion: Literal["neutral", "positive", "negative"]
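Even on its own, this pydantic model can validate the JSON the teacher returns, which is roughly what instructor does for us when we pass it as the response_model. A quick sanity check of the schema:

raw = '{"reasoning": "Profits increased year over year.", "conclusion": "positive"}'

labels = AugmentedLabels.model_validate_json(raw)  # raises ValidationError on malformed output
print(labels.conclusion)  # -> positive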

While we could simply point at Mistral’s API directly for inference, I opted to use litellm as an abstraction so that we could also leverage OpenRouter. OpenRouter acts as a proxy to various hosted models and, depending on the model you wish to inference against, can find the best pricing available. In our case, Mistral Large is really only available from Mistral, but adding this layer now allows us to easily swap out the underlying model for augmentation with few additional code changes.

MODEL_NAME = "openrouter/mistralai/mistral-large"
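A single call through litellm and OpenRouter then looks roughly like the following sketch, assuming the OPENROUTER_API_KEY environment variable is set (the actual code wires this call up through instructor):

import os
from litellm import completion

# litellm routes "openrouter/..." model names through OpenRouter,
# which expects OPENROUTER_API_KEY to be set in the environment
assert "OPENROUTER_API_KEY" in os.environ

response = completion(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # raw JSON string from Mistral Large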

From here, we simply need to loop over the dataset with our prompt, inference against Mistral Large, create an updated dataset with the reasoning, and upload the new dataset to HuggingFace.
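Putting the pieces together, the augmentation loop looks roughly like this. It is a sketch: the build_prompt helper, column names, error handling, and destination repo id are illustrative rather than exactly what the published code does.

from datasets import Dataset

def build_prompt(row: dict) -> str:
    # hypothetical helper that fills in the ground-truth prompt template shown earlier
    return (
        f'Sentence: {row["sentence"]}\n\n'
        f'Ground Truth: {label_map[row["label"]]}\n\n'
        "Analyze the sentence above and determine if the sentiment of the sentence "
        "is neutral, positive or negative. ..."  # remainder of the template above
    )

records = []
for row in dataset:
    response = completion(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": build_prompt(row)}],
    )
    try:
        labels = AugmentedLabels.model_validate_json(response.choices[0].message.content)
    except Exception:
        continue  # skip rows the teacher failed to answer in the expected JSON format
    records.append(
        {
            "sentence": row["sentence"],
            "label": label_map[row["label"]],
            "reasoning": labels.reasoning,
            "conclusion": labels.conclusion,
        }
    )

# hypothetical repo id; the published dataset lives at winglian/financial_phrasebank_augmented
Dataset.from_list(records).push_to_hub("your-username/financial_phrasebank_augmented")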

For the evaluation split, I opted to use the sentences_75agree subset, generate reasoning and conclusion labels from Mistral Large, and only include rows where Mistral Large’s conclusion matched the annotators’ majority label. In this case, the prompt was simply:

prompt = f"""Sentence: {row["sentence"]}

Analyze the sentence above and determine if the sentiment of the sentence
is neutral, positive or negative. Explain your reasoning in detail before
stating your conclusion of the sentiment. The conclusion should simply be
the string value of the sentiment.

Respond with JSON with fields `reasoning` and `conclusion`, i.e.
{{ "reasoning": "...", "conclusion": "neutral|positive|negative" }}.
"""

These augmented training and validation datasets have been made public on HuggingFace.
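One detail worth noting before finetuning: the Axolotl config below reads these datasets with type: sharegpt, so each row is stored as a short conversation, a human turn containing the sentence and question, and a gpt turn containing the reasoning that ends in the conclusion. A hypothetical sketch of that conversion (the exact wording of each turn in the published datasets may differ):

def to_sharegpt(record: dict) -> dict:
    # convert one augmented row into Axolotl's sharegpt conversation format
    user_turn = (
        f'Sentence: {record["sentence"]}\n\n'
        "Analyze the sentence above and determine if the sentiment of the sentence "
        "is neutral, positive or negative."
    )
    gpt_turn = (
        f'{record["reasoning"]} '
        f'Therefore, the sentiment of the sentence is {record["conclusion"]}.'
    )
    return {
        "conversations": [
            {"from": "human", "value": user_turn},
            {"from": "gpt", "value": gpt_turn},
        ]
    }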

Next, we move on to finetuning a base model to respond this way, using Axolotl. Axolotl is an LLM finetuning tool that allows anyone to quickly and easily define the various configuration and hyperparameter settings needed to finetune a model. For this particular experiment, I used Mistral Instruct v0.2 because there seemed to be an issue downloading the base Mistral v0.1 model from HuggingFace.

While the actual YAML used to finetune this model and dataset is available on Weights & Biases, here are the important parameters:

base_model: mistralai/Mistral-7B-Instruct-v0.2
load_in_8bit: false

chat_template: chatml
datasets:
  - path: winglian/financial_phrasebank_augmented
    type: sharegpt
    split: train
strict: false
test_datasets:
  - path: winglian/financial_phrasebank_augmented-validation
    type: sharegpt
    split: train
strict: false

adapter: lora
sequence_len: 768

lora_r: 32
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00001

train_on_inputs: false
bf16: auto
gradient_checkpointing: true
flash_attention: true
warmup_steps: 10

special_tokens:
  eos_token: "<|im_end|>"
tokens:
  - "<|im_start|>"

You can see from the YAML that we opted to finetune a LoRA adapter for this dataset, as I wanted to use an RTX 4090 so that this sort of work can be reproduced in a more accessible manner. In fact, this single finetuning job took less than 30 minutes to run, although I did experiment with a few hyperparameters to ensure these settings were reasonable. The various experiments were tracked and are available on Weights & Biases.

We ran the experiment on a cloud RTX 4090 GPU on RunPod using the winglian/axolotl-cloud:main-latest Docker image. Once we SSH’ed into the container, we simply ran

accelerate launch -m axolotl.cli.train mistral-sentiment.yml

from the CLI. Because this particular training run produces a LoRA adapter, we needed to merge the adapter into the base model after training with

python -m axolotl.cli.merge_lora mistral-sentiment.yml --save_safetensors=True

The merged model was uploaded to HuggingFace and is publicly available at https://huggingface.co/winglian/financial-phrasebank-sentiment-reasoning.

Here is a quick test of our model using some unseen data from the 50% agree subset of the financial_phrasebank dataset.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("winglian/financial-phrasebank-sentiment-reasoning", attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("winglian/financial-phrasebank-sentiment-reasoning")

sentence = "With the new production plant the company would increase its capacity to meet the expected increase in demand and would improve the use of raw materials and therefore increase the production profitability."

messages = [
{"role": "user", "content": f"Sentence: {sentence}\n\nAnalyze the sentence above and determine if the sentiment of the sentence is neutral, positive or negative."},
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The finetuned model responds with its reasoning before stating the sentiment:
The sentence describes the positive outcomes of the company constructing a 
new production plant. Specifically, it mentions an increase in capacity,
a more efficient use of raw materials, and an improvement in production
profitability as benefits of the new plant. The sentence does not contain
any negative words or phrases, nor does it mention any problems or
challenges that the company is facing. Therefore, the sentiment of the
sentence is positive.

As demonstrated, we’ve been able to take an existing classification dataset, enrich it with reasoning from Mistral Large as a teacher model, and finetune a model that explains its reasoning for classification tasks. As many may be aware, this is very similar to Microsoft’s Orca paper and is the basis for the production of the OpenOrca dataset that I helped work on. Of course, augmentation of such datasets is not limited to classification tasks, but this should serve as an example of how we can use existing general models like GPT-4, Claude, and others to enrich these datasets.

The code used to augment the dataset is available at https://github.com/winglian/financial-phrasebank-augmentation.

I’d like to thank all of the contributors and supporters of Axolotl for helping get Axolotl to where it is today. Additionally, I’d like to thank a16z for supporting the OSS AI community and Axolotl with their AI grant. If you’d like to learn more about Axolotl, join our Discord community or reach out on X/Twitter @winglian.
