Proxy-Tuning: A Breakthrough in Customizing Large Language Models

Kevin François
Published in neoxia
4 min read · Mar 18, 2024


Introduction

Large Language Models (LLMs), such as ChatGPT and Llama 2 70B, have revolutionized natural language processing, demonstrating remarkable capabilities across many domains. However, achieving specific behaviors or adapting these models to task-specific requirements often entails resource-intensive fine-tuning. Fine-tuning is essential in scenarios involving question-answering chatbots tailored to specific themes or professions (such as healthcare, sports, or mathematics), and it becomes particularly challenging when model weights are private or inaccessible.

Traditional approaches to adapting LLMs involve alignment phases, prompting, and sometimes costly training. Despite advancements like prompt engineering and optimization methods such as QLoRA, challenges persist in efficiently customizing these powerful models for specific tasks.

Enter “proxy-tuning,” a groundbreaking concept introduced by researchers from the University of Washington and the Allen Institute for AI. Proxy-tuning is a lightweight, decoding-time algorithm designed to streamline the adaptation of large pretrained language models without requiring direct access to their weights.

Background of Transformer-based LLMs:

Transformer-based LLMs operate by predicting the next token in a sequence based on the context of the previous tokens. At each step, the model computes logits for every token in the vocabulary, representing its raw predictions. These logits are then transformed into probabilities through the softmax function, from which the next token is sampled. The generation process repeats until a stop condition is met. This decoding step is where proxy-tuning intervenes.
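The decoding step described above can be sketched in a few lines. The vocabulary and logit values below are purely illustrative (a real model has tens of thousands of tokens), and greedy decoding is used for simplicity:

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits over a tiny vocabulary for the next token.
vocab = ["cat", "dog", "car", "tree"]
logits = np.array([2.0, 1.0, 0.1, -1.0])

# Softmax turns raw logits into a probability distribution.
probs = softmax(logits)

# Greedy decoding simply picks the most probable token; sampling
# strategies (temperature, top-k, ...) would draw from `probs` instead.
next_token = vocab[int(np.argmax(probs))]
```

In practice, decoding strategies such as temperature scaling or top-k sampling draw from this distribution rather than always taking the argmax, but the logits-to-probabilities pipeline is the same.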

Figure 1: Example of next-word probabilities

Proxy-Tuning Overview:

Proxy-tuning is designed as a decoding-time algorithm that operates on black-box LLMs, aiming to achieve results similar to direct model tuning while only accessing predictions over the output vocabulary. The core idea involves tuning a smaller language model and then applying the predictive differences between the small-tuned and an untuned model to shift the original predictions of the base model toward the desired tuning goal. The novelty lies in leveraging the benefits of larger, more comprehensive models without directly modifying their parameters.

Figure 2: Proxy-tuning “tunes” a large pretrained model without accessing its internal weights, by steering it using an “expert” (a small tuned model) and its corresponding “anti-expert” (the small model, untuned). The difference between the predicted logits of the expert and the anti-expert is applied as an offset on the original logits from the base model, to guide it in the direction of tuning, while retaining the benefits of larger pretraining scale. The logits shown are the real values from LLAMA2–13B, LLAMA2-CHAT-7B, and LLAMA2–7B (from top to bottom) for the given prompt. [4]

Methodology:

The proxy-tuning methodology assumes a large pre-trained language model (M) with inaccessible weights and a smaller pre-trained language model (M-) that can be fine-tuned. The fine-tuned version of the smaller model is denoted M+. At decoding time, the logits of the large model are shifted by adding the difference between the logits of M+ and M-. This adjustment guides the predictions of the base model in the direction of tuning without modifying its parameters. In [4], the authors show that under this scheme, the probability of a token being chosen during decoding depends on two factors: the probability the base LM assigns to it, and the ratio of the probabilities assigned by the expert and anti-expert models.
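A minimal sketch of this decoding-time adjustment is shown below. It assumes we can obtain next-token logits from all three models over a shared vocabulary; the toy logit values are invented for illustration and the function name is hypothetical:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a logit vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by the expert/anti-expert difference.

    All three logit vectors must be defined over the same output vocabulary.
    """
    adjusted = base_logits + (expert_logits - antiexpert_logits)
    return softmax(adjusted)

# Toy next-token logits over a shared 4-token vocabulary (illustrative only).
base = np.array([3.0, 2.5, 0.5, 0.0])        # large model M (weights inaccessible)
expert = np.array([1.0, 2.5, 0.2, 0.0])      # small tuned model M+
antiexpert = np.array([2.0, 1.0, 0.5, 0.0])  # small untuned model M-

probs = proxy_tuned_distribution(base, expert, antiexpert)
```

Because adding logit differences corresponds to multiplying probabilities, in probability space this multiplies each of the base model's token probabilities by the expert/anti-expert ratio before renormalizing, which is exactly the two-factor decomposition the authors describe.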

Experimental Results:

The effectiveness of proxy-tuning is demonstrated through experiments, particularly with LLAMA2-70B on the AlpacaFarm and GSM benchmarks. LLAMA2-70B-BASE achieves only a 3.7% win rate on AlpacaFarm and 9.6% accuracy on GSM. Proxy-tuning the 70B base model improves performance dramatically, to 88.0% on AlpacaFarm and 32.0% on GSM. On AlpacaFarm, this is only 2.4 points short of the directly tuned LLAMA2-CHAT model.

Figure 3: Results for instruction-tuning. For each model size, Base refers to the pretrained LLAMA2 model, Directly tuned refers to LLAMA2-CHAT, and the Proxy-tuned model always uses LLAMA2–7B-CHAT as the expert and LLAMA2–7B as the anti-expert. Overall, proxy-tuning dramatically improves performance over the base model, on average closing 91.1% and 88.1% of the gap with the corresponding CHAT model at 13B and 70B size, respectively. Moreover, proxy-tuning a larger model outperforms the small expert alone in all scenarios except a 0.1% difference in ToxiGen, showing that the method also improves over the expert by reaping the benefits of large pretraining scale. [4]

Applications and Versatility:

Proxy-tuning exhibits versatility beyond general language modeling, extending to domains such as code adaptation and task-specific fine-tuning for question answering and mathematical problems. The method’s adaptability and efficiency suggest its potential for scaling across diverse applications. Regarding business-oriented applications, proxy-tuning is by nature suited to expert fields where semantic similarity alone is not sufficient to provide accurate and reliable results. RAG, for example, although seen as a simple yet powerful approach in such cases, can suffer from information loss (several factors come into play, such as chunk-size limitations or the choice of similarity function). In niche domains, RAG can also be prone to incoherence due to the specific lexical field and the different meanings words take on in the application domain.

Therefore, proxy-tuning is well suited to expert fields, such as specific scientific specialties (mathematics, medicine, the pharmaceutical industry) or industries with constrained lexical domains (fraud detection, supply chain, luxury goods…).

Conclusion:

In conclusion, proxy-tuning stands out as a promising and resource-efficient approach to customize large language models. By tuning predictions at decoding time and avoiding direct modifications to model parameters, proxy-tuning demonstrates impressive results in closing performance gaps. Its applications in domain adaptation and task-specific fine-tuning further showcase its potential for widespread use. As the field continues to explore efficient ways to tailor language models, proxy-tuning emerges as a groundbreaking technique with far-reaching implications.

References

  1. https://promptengineering.org/the-black-box-problem-opaque-inner-workings-of-large-language-models/
  2. https://medium.com/@masteringllm/demystifying-the-temperature-parameter-a-visual-guide-to-understanding-its-role-in-large-language-d9e8ea4b9956
  3. https://medium.com/ai-advances/tune-a-large-language-model-without-accessing-its-weights-a1427632d9f5
  4. Liu, Alisa, Han, Xiaochuang, Wang, Yizhong, et al. “Tuning Language Models by Proxy.” arXiv preprint arXiv:2401.08565, 2024.
