Guiding (small) LLMs for better performance without fine-tuning

Julian Lo Dico
JellyfishLabs
8 min read · Feb 28, 2024

LLMs* have taken the world by storm since OpenAI announced ChatGPT and made interacting with the model public for everyone. These models are being used for a wide variety of tasks, from simple entertainment to building great products, and not only by big companies: small companies and individuals are interacting with language models on a daily basis. There are many amazing examples of how language models bring value to businesses across a wide variety of domains.

The Catch

This all sounds wonderful! However, there is a major drawback to most language models we see being used these days, namely their sheer size. A big part of the amazing performance of these models seems to correlate strongly with the number of parameters and the amount of data used for training. Model size causes plenty of problems, but for most small companies and individuals the main problem is cost.

Larger models require A LOT of VRAM, both for training and for inference. This means that anybody who wants to deploy their own LLMs either has to buy very expensive hardware or is forced to pay for pricey cloud computing solutions. Another option is to interact with externally hosted LLMs through their APIs, just like OpenAI provides for ChatGPT. For this option you also pay a high price, usually calculated based on the number of tokens used as input and the number of tokens generated by the model. These options are viable when prototyping and testing your LLM-based application, but if you want to serve your application to end-users, it will turn out to be a very costly endeavour.

*In this blogpost, whenever I talk about LLMs I am specifically talking about instruct-based LLMs, which are LLMs trained to follow user instructions and can be used as the chat interface for dialogue systems.

“Small” Models are the answer!

Or at least… partly. Many researchers have created amazing techniques to get as much performance out of smaller models as possible (QLoRA training, quantization, knowledge distillation, etc.), which enables more people and companies to make use of LLMs. These “smaller” models range from a couple hundred million to a couple billion parameters. Compared to models such as GPT-3.5 and GPT-4, which use 175 billion and (allegedly) 1.75 trillion parameters respectively, we can indeed consider these models small. Of course, scaling down models, even with amazing knowledge-retaining techniques, comes at a cost. This time it’s not a monetary cost, but a cost in terms of performance and accessibility. There are not many APIs out there to directly query good-quality small models, so in this case hosting the model yourself is the best option. Luckily, because of the smaller model size, self-hosting becomes a lot more viable. However, we are still left with the performance issue.
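
To make the quantization point concrete, here is a minimal sketch of loading an instruct model with 4-bit quantization via the Hugging Face transformers and bitsandbytes libraries. The model name is just an example, and this is a sketch rather than a complete setup:

```python
# Minimal sketch: loading an instruct LLM with 4-bit quantization.
# Requires the transformers, accelerate and bitsandbytes packages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in half precision
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # spread layers over the available GPU(s)
)
```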

Small models, big problems?

The performance issues of small models are usually exaggerated versions of the issues already present in big models. Some of the most well-known problems with LLMs are biases, hallucination (the generation of untrue information) and toxicity, amongst others. These are all fundamental issues, related to problems found in the training data, the learning objective and the model architecture. Solving them is still a very active field of research, so I will not pretend to have the solution for these problems here. I do, however, want to present a simple framework to guide (small) LLMs to better follow task-specific instructions in a cheap and easy way. This leads to a more pleasant, more effective and potentially safer interaction in dialogue applications that use small LLMs as chat models.

Guiding Framework + example

The idea behind the guiding framework is to use the output of a task-specific model as a guiding stimulus for the LLM. We call the task-specific model the “guiding model”, while the LLM is used as the chat model. The general framework is shown in figure 1. In this blogpost I will walk through a very simple toy example to show the capabilities of the framework and how it works.

Figure 1: Left shows the general framework, right shows a specific example
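
In code, one turn of the framework boils down to a small wrapper. The sketch below is purely illustrative; the function and argument names are mine, not a published API:

```python
def guided_reply(dialogue, guiding_model, chat_llm, build_prompt):
    """One turn of the guiding framework (illustrative sketch).

    dialogue:      list of messages, most recent user message last
    guiding_model: any task-specific model, e.g. a classifier
    chat_llm:      any instruction-tuned chat model (local or behind an API)
    build_prompt:  maps the guiding signal + user message to an LLM prompt
    """
    user_message = dialogue[-1]                  # latest user turn
    signal = guiding_model(user_message)         # task-specific prediction
    prompt = build_prompt(signal, user_message)  # inject the signal
    return chat_llm(prompt)                      # generate the guided reply
```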

In this toy example a sentiment classification model is used as the guiding model, specifically RoBERTa-base (125M parameters) fine-tuned on the go_emotions dataset. For the LLM-based chat model I am using Mistral-7B-Instruct-v0.2 (7B parameters) with 4-bit quantization.
The goal of this toy example is to create a chatbot which always responds to the user in the opposite emotion. I call this chatbot the “Uncooperative bot”, which is indeed a very useful use-case… For comparison, I implemented this bot both with and without the guiding framework.
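
For those who want to follow along, here is a minimal sketch of loading such a guiding model with the transformers pipeline API. I am using the publicly available SamLowe/roberta-base-go_emotions checkpoint as a stand-in; it may not be the exact fine-tune used in the experiment:

```python
from transformers import pipeline

# A RoBERTa-base checkpoint fine-tuned on go_emotions, acting as the
# guiding model. Returns the top emotion label for each input.
classifier = pipeline(
    "text-classification",
    model="SamLowe/roberta-base-go_emotions",
)

prediction = classifier("I can't wait for the weekend!")[0]
print(prediction["label"])  # e.g. an emotion label such as "excitement"
```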

Without the guiding framework the LLM chatbot simply takes the full dialogue as input, including the following prompt:

“Respond in 2–4 sentences to user message by exclusively using language according to the emotion opposite to the user’s emotion, even if it’s not the proper emotion to use in the conversation. User: <user_message>”

Where <user_message> is the final message of the user in the dialogue.
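
Assembling this unguided prompt is a simple string template, for example:

```python
UNGUIDED_TEMPLATE = (
    "Respond in 2-4 sentences to user message by exclusively using language "
    "according to the emotion opposite to the user's emotion, even if it's "
    "not the proper emotion to use in the conversation. User: {user_message}"
)

prompt = UNGUIDED_TEMPLATE.format(user_message="I just got promoted at work!")
```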

When using the guiding framework, we first use the sentiment classification model to predict the sentiment of the final user message in the dialogue, which is then mapped to the opposite emotion using a hard-coded mapping. This opposite emotion is then used in the prompt for the LLM as follows:

“Respond in 2–4 sentences to user message by exclusively using language according to the emotion: <opposite_emotion> to the user’s emotion, even if it’s not the proper emotion to use in the conversation. User: <user_message>”

Where <user_message> is again the final message of the user in the dialogue and <opposite_emotion> is the opposite of the user emotion predicted by the guiding model.
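
A sketch of the hard-coded mapping and the guided prompt construction, reusing the classifier from earlier (the emotion pairs shown are illustrative examples, not the full mapping used in the experiment):

```python
# Illustrative opposite-emotion pairs; only a few examples are shown here.
OPPOSITE_EMOTION = {
    "joy": "sadness",
    "sadness": "joy",
    "optimism": "disappointment",
    "anger": "caring",
    "curiosity": "apathy",
}

GUIDED_TEMPLATE = (
    "Respond in 2-4 sentences to user message by exclusively using language "
    "according to the emotion: {opposite_emotion} to the user's emotion, even "
    "if it's not the proper emotion to use in the conversation. "
    "User: {user_message}"
)

def build_guided_prompt(user_message, classifier):
    predicted = classifier(user_message)[0]["label"]       # guiding model output
    opposite = OPPOSITE_EMOTION.get(predicted, "neutral")  # hard-coded mapping
    return GUIDED_TEMPLATE.format(
        opposite_emotion=opposite, user_message=user_message
    )
```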

In figure 2 you can find a single turn of an example conversation I had, with a direct comparison of the responses of the guided and the unguided LLM.

Figure 2: A single turn comparing the guided and unguided model

The left chat shows a chatbot using the guiding framework, whereas the right chat uses only the prompted LLM. In the guided (left) chat, after each user input you can see the emotion predicted by the guiding model and the mapped opposite emotion used for prompting.

Analysis

While it is hard to come up with a good automatic quantitative metric for this use-case, I have interacted a lot with both implementations and was able to spot some interesting patterns. While the conversation in figure 3 shows that both chatbots understand the assignment quite well, it is clear that the guided bot is able to respond to much more fine-grained emotion types, because we are able to provide these directly through the prompt. The unguided LLM seemed to interpret “opposite” in much broader terms: it mostly differentiates between positive and negative emotions, but not between more intricate types of emotion, such as curiosity and apathy. This indicates that we gain a lot more control over the output of the LLM by guiding it with the output of a specialised sentiment model. We could simply alter the emotion mapping and the bot would respond accordingly, while in the unguided version we are always relying on the LLM’s own ability to correctly identify an emotion by itself.

Figure 3: Fine-grained control

Furthermore, I have found some other side-effects in the unguided model which we don’t see as often in the guided version. The unguided model often explicitly notifies the user that it is trying to reply in the opposite emotion, by saying something like: “I am feeling so sad. (I am replying in the opposite emotion of the user, who seemed to be happy)”. I’ve also seen the unguided model respond by explicitly mentioning the used emotion in brackets at the end of a sentence, such as: “Wow, that is very unfortunate (sad)”. Finally, I have seen the unguided model negating the user’s emotion instead of simply replying in the opposite emotion. An example of this happening can be found in figure 4. The unguided model simply acts as if the user is in fact not pessimistic, which is not what we asked the LLM to do.

Figure 4: The unguided model negates the user’s feeling instead of replying using the opposite emotion.

It is hard to determine exactly why these problems are solved by guiding the LLM. My hypothesis is that the guided prompt is more direct and easier for the LLM to ‘understand’ and act upon. The unguided prompt asks the model to perform a much harder task: analyse the user sentiment, reverse it, and respond in that emotion. The guided model only has to do one of those things, namely respond according to the emotion specified in the prompt.

While the impact of the guiding framework is mostly positive, I have also encountered situations where the guiding model acts as a bottleneck, sometimes even degrading performance. In figure 5 the guiding model made a mistake in its sentiment prediction, causing the LLM to be prompted with the wrong emotion. Another consideration when using this framework is the trade-off between control and LLM freedom. By adding control through a set of predefined mapped emotions, we lose the LLM’s ability to respond with an emotion outside of this set.

Figure 5: The guiding model predicted the emotion “joy”, which was obviously wrong.

Conclusion

Guiding a (small) instruct-based LLM with guiding models provides multiple benefits: it gives the programmer more fine-grained control over the LLM’s behaviour, and it mitigates some unforeseen side-effects that occur when prompting small LLMs with more complex prompts. Another big advantage is that the framework is very modular: you can swap the guiding model for any type of ML model and the LLM for any type of instruct model. Besides, the framework does not require any access to the model parameters, making it possible to boost LLM performance even when the LLM is in a black-box environment behind an API.
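
To illustrate that last point: the chat model can just as well sit behind a hosted API, since only the finished prompt crosses the wire. A hedged sketch using the OpenAI Python client (the model name and the build_guided_prompt helper from earlier are example choices, not requirements):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def guided_reply_via_api(user_message, classifier):
    # The guiding model runs locally; only the finished prompt is sent over
    # the API, so no access to the chat model's parameters is needed.
    prompt = build_guided_prompt(user_message, classifier)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any hosted instruct model would do
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```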

However, always bear in mind what your intended use-case of your LLM is and if the negative effects mentioned previously are worth the trade-off.

This blogpost only shows a very simple use-case example, so I invite everyone to be creative and think of use-cases where guiding models might help to boost LLM task specific performance.

(Tip: Think of using a keyword extractor to boost the summarisation capabilities of LLMs.)
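
As a quick hedged sketch of that tip, here is what a keyword-guided summarisation prompt could look like, using the KeyBERT library as the guiding model (KeyBERT is just one possible choice of keyword extractor):

```python
from keybert import KeyBERT

kw_model = KeyBERT()  # guiding model: a keyword extractor

def keyword_guided_summary_prompt(document):
    # Extract the top keywords and inject them into the prompt so the LLM
    # knows which concepts the summary should cover.
    keywords = [kw for kw, _ in kw_model.extract_keywords(document, top_n=5)]
    return (
        "Summarise the following text in 3-4 sentences, making sure to "
        f"cover these key concepts: {', '.join(keywords)}.\n\n"
        f"Text: {document}"
    )
```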
