Understanding LoRA and QLoRA — The Powerhouses of Efficient Finetuning in Large Language Models

Murali Manohar
Aug 8, 2023

Background

Large Language Models (LLMs) are currently a hot topic in machine learning. Imagine you're an ML engineer whose company has access to GPUs and open-source LLMs like LLaMA or Falcon. You're tasked with building tools for your customers, each with unique needs. You finetune a copy of the model for each customer, and everyone is satisfied.

But what happens when you have thousands of customers? Deploying thousands of full-size, GPU-hungry LLMs isn't feasible unless you have an enormous supply of GPUs, and storing a complete copy of the model per customer quickly exhausts your storage. You need a strategy that adapts the model to each customer without breaking the bank or overloading your disks. This is where LoRA and QLoRA come into play, as the sketch below illustrates.
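To make the idea concrete, here is a minimal sketch of what that deployment strategy can look like, assuming the Hugging Face transformers and peft libraries; the model name and adapter paths are placeholders, not from the original article:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the large base model once; its weights are shared across all customers.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach one customer's LoRA adapter (each adapter is only a few MB on disk).
model = PeftModel.from_pretrained(base_model, "adapters/customer_a")

# Hot-swap another customer's adapter without reloading the base model.
model.load_adapter("adapters/customer_b", adapter_name="customer_b")
model.set_adapter("customer_b")
```

The point of this design is that the billions of base weights are loaded and stored once, while each customer contributes only a small adapter, so serving thousands of customers no longer means storing thousands of full models.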

Brief introduction to gradient descent

On a very abstract level, an LLM is essentially a function that takes some input, processes it, and produces an output. We can represent it as f(x, W) = y, where x is the input sequence, y is the output sequence, and W is the set of model weights learned during training. W is the black box that does the magic.
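As a toy illustration of f(x, W) = y, here is a sketch in PyTorch where W is a single weight matrix; this is not an actual LLM, just the same shape of computation:

```python
import torch

# A toy f(x, W) = y: the "model" is one learned matrix W applied to the input.
W = torch.randn(8, 4, requires_grad=True)  # the weights learned during training

def f(x: torch.Tensor) -> torch.Tensor:
    # In a real LLM, W is billions of parameters spread across many layers.
    return x @ W

x = torch.randn(1, 8)  # input
y = f(x)               # output
```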

These weights are large matrices. For instance, GPT-3 has 175 billion parameters. What makes a perfect W? In other words, how do you find the right combination of parameters in W? You train the model on a dataset, adjusting the weights in W to minimize the difference between the model's output and the target output.
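In code, that adjustment is the familiar gradient-descent loop. Here is a minimal sketch in PyTorch, reusing the toy f(x, W) from above; the learning rate and the squared-error loss are illustrative choices, not anything prescribed by the article:

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 4, requires_grad=True)   # the weights we want to learn

x = torch.randn(16, 8)         # a batch of inputs
y_target = torch.randn(16, 4)  # the outputs we want the model to produce

lr = 0.01
for step in range(100):
    y = x @ W                            # forward pass: f(x, W)
    loss = ((y - y_target) ** 2).mean()  # difference between output and target
    loss.backward()                      # compute the gradient dLoss/dW
    with torch.no_grad():
        W -= lr * W.grad                 # nudge W against the gradient
        W.grad.zero_()                   # reset the gradient for the next step
```

Full finetuning of an LLM is this same loop, except that W is billions of parameters, which is exactly the cost that LoRA and QLoRA set out to avoid.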
