
Creating Large Language Models on Your Laptop

Making Fine-Tuning Possible on Your Personal Computer

Intel(R) Neural Compressor
Dec 5, 2023


Xinyu Ye, Zhe Wang, Haihao Shen, Yu Luo, and Hanwen Chang, Intel Corporation

QLoRA is an approach that reduces the memory usage of large language model (LLM) fine-tuning. It backpropagates gradients through a frozen, quantized LLM into low-rank adapters (LoRA). We have developed an API to support QLoRA on ordinary CPUs, with 4-bit NormalFloat (NF4), Float4 (FP4), INT4, and INT8 as supported data types for LLM quantization. Combined with gradient checkpointing, QLoRA can run on consumer systems. In this article, we show how to leverage QLoRA on modest computers, such as laptops, while achieving metrics comparable to LoRA.

Methods

LoRA and QLoRA

LoRA freezes the weights of the pretrained model and injects pairs of trainable rank-decomposition matrices into each layer of the Transformer architecture (Figure 1). This greatly reduces the number of trainable parameters for downstream tasks, thereby accelerating the training of large models while consuming less memory.

Figure 1. LoRA architecture (source: image from LoRA)
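To make this concrete, below is a minimal PyTorch sketch of a LoRA-augmented linear layer: the pretrained weight is frozen, and a trainable pair of low-rank matrices (a down-projection A and an up-projection B) is added to its output. The class name LoRALinear and its hyperparameters are illustrative only; in practice the PEFT library used later in this article handles this injection for you.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: y = base(x) + (alpha / r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weight
            p.requires_grad = False
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_B.weight)  # adapters start as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

Only lora_A and lora_B receive gradients, which is why the number of trainable parameters drops so sharply.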

QLoRA, on the other hand, quantizes the pretrained language model first, then backpropagates gradients through this frozen, quantized pretrained language model into LoRA. In this way, the memory requirements for storing the weights of the pretrained language model are greatly reduced. QLoRA has one low-precision storage data type, usually 4-bit, and one computation data type that is usually higher precision (e.g., BFloat16). During the model’s forward and backward passes, whenever a quantized weight tensor is used, it is dequantized from the storage data type to the computation data type, then used in the computation.
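The following sketch illustrates this storage/computation split with a hypothetical QuantizedLinear module: low-precision integer codes and a per-channel scale are stored, and the weights are dequantized to the computation data type only inside the forward pass. For readability it uses simple symmetric 4-bit quantization rather than the NF4 scheme used in our experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    """Sketch of QLoRA-style weight storage: low-precision codes plus a scale
    are kept in memory; the compute-dtype weight exists only during forward."""
    def __init__(self, weight: torch.Tensor, compute_dtype=torch.bfloat16):
        super().__init__()
        self.compute_dtype = compute_dtype
        scale = weight.abs().amax(dim=1, keepdim=True) / 7.0   # per-output-channel scale
        codes = torch.clamp((weight / scale).round(), -8, 7).to(torch.int8)
        self.register_buffer("codes", codes)   # storage data type (4-bit range)
        self.register_buffer("scale", scale)

    def forward(self, x):
        # dequantize from the storage dtype to the computation dtype, then compute
        w = self.codes.to(self.compute_dtype) * self.scale.to(self.compute_dtype)
        return F.linear(x.to(self.compute_dtype), w)

layer = QuantizedLinear(torch.randn(64, 128))
y = layer(torch.randn(4, 128))   # weights are dequantized on the fly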

Gradient Checkpointing

Gradient checkpointing is a technique to reduce the memory footprint during the training of deep neural networks, at the cost of more computation. Normally, all activations from the model's forward pass are saved to compute the gradients during the backward pass, which can become a large memory overhead. Alternatively, one could throw away all activations during the forward pass and recompute them when needed in the backward pass, but this would significantly increase computation and slow down training. Gradient checkpointing is a compromise between these two approaches: it saves only strategically selected activations, so only a small portion of the activations needs to be recomputed for the gradients.
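PyTorch exposes this technique through torch.utils.checkpoint; the short sketch below applies it to a stack of hypothetical blocks, so that each block's inner activations are discarded after the forward pass and recomputed during the backward pass.

import torch
from torch.utils.checkpoint import checkpoint

# A toy stack of Transformer-like blocks (placeholders for illustration)
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU()) for _ in range(8)]
)

def forward_with_checkpointing(x):
    for block in blocks:
        # activations inside `block` are not kept; they are recomputed in backward
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(4, 512, requires_grad=True)
loss = forward_with_checkpointing(x).sum()
loss.backward()   # recomputes each block's forward as its gradients are needed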

As mentioned above, QLoRA needs to dequantize the quantized weight tensors to the computation data type during the model's forward and backward passes throughout training. Without gradient checkpointing, this means that, besides the quantized weights, we would also need to store the dequantized weights as well as the activations to compute the gradients, leading to larger memory requirements than normal LoRA fine-tuning in the computation data type. This makes gradient checkpointing essential for QLoRA: with it, we can throw away not only part of the activation tensors but also the dequantized weights, which makes the memory requirements of QLoRA fine-tuning much smaller than those of LoRA.

Implementing QLoRA on a CPU

Intel Extension for Transformers internally implements a BLAS acceleration library called Jblas that supports different bit-width data types and most instruction set architectures of Intel CPUs. Jblas achieves high performance through optimized thread parallelism, instruction parallelism, data parallelism, and cache reuse. It also provides a weight-only quantization function that supports multiple data types (such as NF4/INT4/INT8), significantly reducing the memory and bandwidth overhead of LLM inference. For data types that are not supported by the hardware (such as the 4-bit NF4 type used in QLoRA), Jblas uses SIMD instructions such as gather to treat each 4-bit value as an index and quickly converts it, through a lookup table, into a type the hardware can handle (such as BF16), thereby preserving the efficiency of the overall calculation.
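Conceptually, the lookup-table step works like the Python sketch below: each stored byte holds two 4-bit codes, each code is used as an index, and a 16-entry table translates it into the computation data type. The table values here are placeholders; in Jblas the table holds the actual NF4 quantization levels, and the index/lookup is performed with SIMD gather instructions.

import torch

# Hypothetical 16-entry lookup table for the 4-bit codes
# (in Jblas this table would contain the NF4 quantization levels).
lut = torch.linspace(-1.0, 1.0, 16, dtype=torch.bfloat16)

def dequantize_4bit(packed: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit codes per byte and translate them through the table."""
    low = packed & 0x0F            # lower nibble -> index 0..15
    high = (packed >> 4) & 0x0F    # upper nibble -> index 0..15
    codes = torch.stack([low, high], dim=-1).flatten().long()
    return lut[codes]              # table lookup into the compute dtype

packed = torch.randint(0, 256, (8,), dtype=torch.uint8)
print(dequantize_4bit(packed))    # 16 BF16 values recovered from 8 bytes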

At the same time, we found that PyTorch's native dropout has an unreasonably large performance gap between CPUs and GPUs. This is because the dropout operator on the CPU separately samples a matrix that follows a Bernoulli distribution, writes the matrix to memory, and then scales or zeroes the activation based on this matrix. Obviously, this introduces redundant I/O overhead. Therefore, we adopted the idea of operator fusion to optimize the calculation of dropout: when sampling the Bernoulli distribution, we directly scale/zero the activation with the sampled value in SIMD vector registers. In addition, generating a Bernoulli sample requires first generating a uniform sample, which is then thresholded according to the probability, P. We implemented a SIMD random number generator with low computational overhead and a long period (about 4 x 10¹⁹ numbers per cycle) to sample the uniform distribution, which reduces the computational cost of dropout.
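The sketch below contrasts the two approaches at the Python level. It only illustrates the logic: the actual optimization lives in a fused SIMD kernel, where the mask never leaves the vector registers.

import torch

def dropout_two_pass(x: torch.Tensor, p: float) -> torch.Tensor:
    """Baseline: sample a Bernoulli mask, materialize it, then scale/zero x."""
    mask = torch.bernoulli(torch.full_like(x, 1.0 - p))   # extra tensor round-trip
    return x * mask / (1.0 - p)

def dropout_fused_idea(x: torch.Tensor, p: float) -> torch.Tensor:
    """Fused idea: draw uniform samples and scale/zero x in a single pass,
    without keeping a separate mask tensor around."""
    u = torch.rand_like(x)
    return torch.where(u >= p, x / (1.0 - p), torch.zeros_like(x))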

Example

The code snippet below shows how to get the QLoRA model through Intel Extension for Transformers. The complete fine-tuning example is available here. For a usage guide, please refer to the readme.

import torch
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM

# Load the pretrained model with 4-bit weight-only quantization
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf', torch_dtype=torch.float32, load_in_4bit=True,
)

from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Enable gradient checkpointing and attach the LoRA adapters
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model.gradient_checkpointing_enable()
peft_config = LoraConfig(r=8, task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, peft_config)
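From here, the prepared model can be fine-tuned like any other PEFT model. The fragment below is an illustrative continuation using the Hugging Face Trainer; the dataset slice, preprocessing, and hyperparameters are placeholders rather than the exact configuration of our experiments (see the linked example for the full script).

from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
tokenizer.pad_token = tokenizer.eos_token

# Placeholder instruction-tuning data; the linked example covers Alpaca formatting
dataset = load_dataset('tatsu-lab/alpaca', split='train[:2000]')
dataset = dataset.map(lambda s: tokenizer(s['text'], truncation=True, max_length=512),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='qlora-out', per_device_train_batch_size=16,
                           num_train_epochs=1, learning_rate=1e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()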

Results

We ran some QLoRA experiments on a consumer-level CPU (an Intel Core i9-12900 processor with 32 GB RAM). We chose the pretrained meta-llama/Llama-2-7b-hf LLM and a subset of the Alpaca dataset (about 2,000 samples) as the instruction-tuning data, set the training batch size to 16, the computation data type to Float32, and the storage data type to NF4. We used the TruthfulQA-MC dataset to evaluate the fine-tuned model. With these settings, QLoRA finished fine-tuning within 21 hours, with training memory peaking at 16.97 GB (Figure 2). LoRA with gradient checkpointing needed 19 hours for fine-tuning and 34.1 GB peak memory.

Figure 2. QLoRA’s RAM usage during model loading and one training step

Model quality is measured using single-true (MC1) and multi-true (MC2) multiple-choice metrics, which are defined here. MC1 and MC2 of the fine-tuned QLoRA model on the TruthfulQA-MC dataset are 0.2901 and 0.4304, respectively. MC1 and MC2 of the fine-tuned LoRA model on the TruthfulQA-MC dataset are 0.2827 and 0.4211, respectively. For comparison, we also did experiments for QLoRA on an Nvidia GPU with the same conditions and achieved similar accuracy: 0.2925 for MC1 and 0.4290 for MC2.

Summary

We support QLoRA on Intel CPUs in Intel Extension for Transformers and encourage you to try it out to fine-tune your own chatbots. Please add a star to support this effort. You are also welcome to create pull requests or submit issues to the repository. Feel free to contact us if you have any questions.
