Put the Human Back In The Loop with Small Language Models

Robert Dargavel Smith
Published in Clarity AI Tech · 8 min read · Jun 19, 2024

As language models have become larger and larger, we have gone from adapting pre-trained models to our use cases by applying transfer learning to being able to use them directly with careful prompting, perhaps providing just a handful of examples. This, in turn, has made the training of Large Language Models (LLMs) more economically viable, as the costs can be recouped across many use cases without the need for retraining. Beyond a certain size, these models have started to demonstrate emergent capabilities that were not explicitly taught during training. However, this trend towards larger Foundation Models has made it more difficult to specialize them for tasks where we have large amounts of labeled data. In this article, we discuss some techniques that allow us to do just that.

Putting the Human Back In The Loop

At Clarity AI, we maintain exceptionally high data quality standards and believe that models are still a long way from being able to fully automate our data collection and curation. As a result, we have built agile processes that optimally combine models and Subject Matter Experts (SMEs). The SMEs (or Humans In The Loop) not only validate the outputs of the models, but also feed informative examples back to train the models in a cycle of continuous improvement.

Source: Author

Models may come and go, but data, processes, and subject matter expertise are our competitive edge.

So, how can we leverage the emergent capabilities of LLMs while at the same time putting the Human Back In The Loop?

Supervised Fine-Tuning

People learn language through exposure (unsupervised learning) and then go on to specialize in subjects by studying for exams (supervised learning). Similarly, Large Language Models are trained to predict the next word in a sentence (unsupervised or self-supervised learning) and can then be specialized with Supervised Fine-Tuning (SFT) on a labeled dataset. In contrast, we can think of In-Context Learning (ICL) as an open book exam, in which the model is given relevant information at the time of generating the responses.

ICL can range from few-shot learning, where the model is provided with a few examples, to many-shot learning, where thousands of examples are used. Compared to SFT, ICL has the advantage that it does not require retraining the model. However, utilizing longer context windows for ICL is more resource-intensive in terms of memory and computation at inference time.

QLoRA

Thanks to techniques like quantization and Low-Rank Adaptation (LoRA), it is possible to fine-tune “small” language models like Llama 3 (with “only” 8 billion parameters) using affordable and readily available hardware. These models can also be deployed very efficiently with serving frameworks such as vLLM and Hugging Face’s Text Generation Inference.

According to Tim Dettmers, the author of bitsandbytes, given a fixed amount of GPU VRAM, the best results are obtained by cramming in the largest possible model using 4-bit quantization.
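
To make this concrete, here is a minimal sketch of a QLoRA setup using transformers, peft, and bitsandbytes. The rank, alpha, and target modules below are illustrative choices, not the exact configuration used for the experiments in this post.

```python
# A minimal QLoRA setup sketch: 4-bit base weights plus LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",              # NormalFloat4 from the QLoRA paper
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank update
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # a tiny fraction of the 8B weights
```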

The LoRA paper claims that results as good as (or even better than) full fine-tuning are achievable using only a fraction of the trainable model weights. Motivated by the paper “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”, the authors hypothesize that the model weight updates due to fine-tuning on a downstream task can be effectively approximated by a low-rank decomposition (the product of a matrix with few columns and a matrix with few rows).

Source: https://arxiv.org/pdf/2106.09685
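
A quick back-of-the-envelope calculation makes the savings tangible. The dimensions below correspond to a single 4,096 × 4,096 projection matrix; the rank is an arbitrary illustrative choice.

```python
# Parameter count for a low-rank update dW = B @ A on one projection matrix.
d, k, r = 4096, 4096, 16

full = d * k                  # updating W directly: ~16.8M parameters
low_rank = d * r + r * k      # B is (d x r), A is (r x k): ~131k parameters
print(full, low_rank, full / low_rank)  # 128x fewer trainable weights
```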

This approach also has the advantage that the model weight adapters for different tasks can be quickly swapped in and out of a base model (and served with LoRAX, for example), or even combined in a single model.
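
As a sketch of what this swapping looks like with the peft library (the adapter paths and names here are hypothetical placeholders):

```python
# Swapping per-task LoRA adapters on a single base model with peft.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = PeftModel.from_pretrained(base, "adapters/squad", adapter_name="squad")
model.load_adapter("adapters/summarization", adapter_name="summarization")

model.set_adapter("squad")          # route requests to the QA adapter
# ... generate ...
model.set_adapter("summarization")  # switch task without reloading the base
```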

These adapted models can truly be considered Small Language Models in that they have a number of trainable parameters in the millions.

SQuAD

As an example, we are going to consider the SQuAD 2.0 (Stanford Question Answering Dataset) task, but this could be any task for which you have a high quality dataset of several thousand labeled examples.

For the SQuAD task, the answers to the questions must be derived from the given context. However, in a large proportion of the cases, the context is not sufficient and no answer should be given. This is a good test for Language Models, which have a propensity to give plausible answers (or to “hallucinate”) when none are available.
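
For reference, the dataset is available through Hugging Face’s datasets library; unanswerable questions are the ones with an empty list of answers.

```python
# Load SQuAD 2.0 and check the fraction of unanswerable questions.
from datasets import load_dataset

squad = load_dataset("squad_v2")
example = squad["train"][0]
print(example["question"])
print(example["context"][:100])

unanswerable = sum(len(ex["answers"]["text"]) == 0 for ex in squad["validation"])
print(unanswerable / len(squad["validation"]))  # roughly half of the dev set
```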

Think step by step

Thanks to a growing number of frameworks such as TRL, torchtune, and LLaMA Factory, it is relatively straightforward to fine-tune a model like Llama 3. The key is to only apply the cross-entropy in the loss function to the tokens that make up the answer, as opposed to the whole conversation.
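
A minimal sketch of this masking, assuming we already have the tokenized prompt and answer (Hugging Face loss functions ignore positions labeled -100, and frameworks like TRL provide collators that apply this masking automatically):

```python
# Mask the prompt tokens so cross-entropy is only computed on the answer.
import torch

def build_labels(prompt_ids: list[int], answer_ids: list[int]) -> dict:
    input_ids = torch.tensor(prompt_ids + answer_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = -100   # no loss on the prompt/conversation
    return {"input_ids": input_ids, "labels": labels}
```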

The model learns to directly provide the response in the required format, but it refrains from giving any reasoning, as none was provided in the training examples. On the other hand, it has been observed that the accuracy of LLM responses increases with Chain-of-Thought prompting, in which the model is encouraged to break the problem down step by step. A possible explanation is that, because response tokens are generated auto-regressively, a longer generation involves more computation. Anthropomorphically, we can say that the model has to think for longer before answering.

In an ideal world, we would not only have a labeled dataset of answers, but we would also have the ground-truth reasoning on which we could train the model. We could generate such a dataset synthetically using an LLM such as GPT-4, in which case we would be effectively distilling GPT-4 into a smaller model like Llama 3. But we can do something better, something that beats GPT-4. First, let’s consider another way of fine-tuning Language Models called Prefix Tuning.

Prefix Tuning

Source: https://arxiv.org/pdf/2101.00190

Remember that tokens are converted into vectors via an embedding matrix. Instead of providing a prefix to a prompt like “Translate from Spanish to English” in tokens, we can directly provide vectors which represent this task, similar to an abstract “thought” that cannot be put into words. In practice, we form a prefix from tokens newly added to the vocabulary and train only the corresponding part of the embedding matrix. Llama 3, for example, has a hidden dimension of 4,096, so each prefix token adds only 4,096 trainable weights, which makes for very Parameter-Efficient Fine-Tuning (PEFT).
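
This scheme is close to what the peft library calls prompt tuning: train only a handful of virtual token embeddings prepended to the input. A minimal sketch (the model choice is illustrative):

```python
# Prompt tuning sketch: only the virtual token embeddings are trainable.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=5,          # 5 x 4,096 = 20,480 trainable weights
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```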

Reasoning Tokens

We can adapt the idea of Prefix Tuning to represent instead the “reasoning” the model performs before generating the response. For example:

messages:
  - role: "system"
    content: |
      You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
      If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
  - role: "user"
    content: |
      Extract from the following context the minimal span word for word that best answers the question. Think step by step and explain your reasoning. Then give the answer in JSON format as follows:
      ```json
      {
        "answer": ...
      }
      ```
      If the answer is not in the context, the answer should be "?".
      Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
      Question: When did Beyonce start becoming popular?
  - role: "assistant"
    content: |
      <blah_0><blah_1><blah_2><blah_3><blah_4>```json
      {
        "answer": "in the late 1990s"
      }
      ```

Notice the <blah> reasoning tokens in the response. These are added to the vocabulary and the input embedding matrix is extended to include the corresponding weights. The output embedding matrix, on the other hand, is not modified, so the model is actually incapable of generating these tokens. In any case, we do not apply cross-entropy to the outputs corresponding to the <blah> tokens, as we do not actually care about the intermediate results (nor do we know what they should be). What we are interested in is whether they help improve the accuracy of the responses.

By fine-tuning the embedding weights for the reasoning tokens before the LoRA weights, we are forcing the model to “learn to think” effectively, before “learning to speak”.
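
A minimal sketch of this setup, assuming the Hugging Face stack (the initialization details and names are illustrative, not the exact training code):

```python
# Add <blah_i> tokens, extend only the *input* embedding matrix, and zero
# gradients everywhere except the new rows.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_tokens = [f"<blah_{i}>" for i in range(5)]
tokenizer.add_tokens(new_tokens)

old_emb = model.get_input_embeddings()
old_size, dim = old_emb.weight.shape

# Extend the input embeddings only; the output embeddings stay untouched,
# so the model can consume <blah> tokens but can never generate them.
new_emb = torch.nn.Embedding(old_size + len(new_tokens), dim)
new_emb.weight.data[:old_size] = old_emb.weight.data
model.set_input_embeddings(new_emb)

# Freeze everything, then let gradients flow only into the new rows.
for p in model.parameters():
    p.requires_grad = False
new_emb.weight.requires_grad = True

mask = torch.zeros_like(new_emb.weight)
mask[old_size:] = 1.0
new_emb.weight.register_hook(lambda grad: grad * mask)
```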

Results

We calculated the percentage of Exact Matches of model responses against the ground-truth answers in the SQuAD 2.0 test dataset, counting a prediction as correct when the model rightly abstained on an unanswerable question (a simplified scoring sketch follows the list below). Amongst other models, we compared GPT-3.5 Turbo, GPT-4 and Llama 3 8B, both out-of-the-box and fine-tuned, and obtained the following results:

  • GPT-3.5 Turbo: 47.60%
  • GPT-4: 63.50%
  • Llama 3 8B: 51.85%
  • Llama 3 8B (SFT): 70.03% (176M trainable parameters)
  • Llama 3 8B (5 <blah>): 75.15% (20,480 trainable parameters)
  • Llama 3 8B (5 <blah> + SFT): 80.13% (176M trainable parameters)
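
As a rough illustration of the scoring, here is a simplified exact-match check; the official SQuAD 2.0 metric additionally strips punctuation and articles before comparing. The "?" convention for abstention comes from the prompt shown earlier.

```python
# Simplified exact-match scoring, crediting correct abstention with "?".
def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    pred = prediction.strip().lower()
    if not gold_answers:                 # unanswerable question
        return pred == "?"               # credit correct abstention
    return any(pred == g.strip().lower() for g in gold_answers)
```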

As you can see, Llama 3 8B performs significantly worse out-of-the-box than GPT-3.5 Turbo and GPT-4, as is to be expected given its diminutive size. Nevertheless, simply training embedding vectors for 5 new <blah> tokens and prepending them to the response beats GPT-4 by some margin. By going on to fine-tune the model as well, we were able to achieve an accuracy of 80.13%.

Of course, this is far from the state of the art for SQuAD 2.0 of 90.9% (achieved with ensembles of encoder models). But in practice, we don’t usually have the luxury of 150,000 labeled examples that cover all the cases we are interested in, and we need to leverage the knowledge and reasoning capabilities of Foundation Models. Indeed, we have been able to obtain results superior to GPT-4 on some of our tasks by fine-tuning Llama 3 8B with only a few thousand labeled examples.

The full table of results, more details on how the models were trained (including weights) and the accompanying GitHub repo can be found here:

References

The approach is similar to the one taken in the paper “Think before you speak”, except that we use several distinct reasoning tokens (or “pause tokens”, as they call them) and we first train the new embedding weights and then the LoRA layers, as opposed to fully fine-tuning the model or pre-training it from scratch.
