Techniques for Efficient Inference of LLMs (I/IV)

Andrei Apostol
MantisNLP
Oct 4, 2023

Introduction

Recent years have witnessed great advances in the capabilities of language models, particularly when scaled up. Models such as OpenAI’s GPT-3, Anthropic’s Claude and Google’s Bard have taken the world by storm and allowed many disruptive startups to emerge and create products that just one year ago would not have been possible. Moreover, ChatGPT became the fastest-adopted consumer technology to date, reaching over 100M users less than two months after its launch.

While this model-as-a-service approach has brought a surge of interest in the field, it brings with it several challenges, such as:

  • Privacy concerns. User conversations may be used to improve the product via training. In some cases, this may lead to unwanted leaks of private information, such as with the recent Samsung leak and subsequent internal ban on ChatGPT usage. This poses a clear security risk.
  • Closed source. Neither the model code nor the training data is publicly available. This poses a challenge for companies that want to do internal R&D, since the service has to be treated as a black box and no meaningful modifications to its behavior can be made.
  • Opaque guardrails. Related to the above point, but worth mentioning separately. While these services typically implement guardrails preventing the model from exhibiting toxic or unwanted behavior, jailbreaking prompts, i.e. prompts that trick the model into bypassing these restrictions, have recently been discovered by users. (WARNING: previous link may contain sensitive or toxic conversations between users and the AI model). This, again, poses a risk for companies that incorporate these services into their products.
  • Cost. Last but not least, cost is a serious consideration when using such models, especially when requiring fine-tuned models. While pricing on many of these services is attractive, mid-sized and large companies may find a better return on investment deploying their own models. This, of course, depends on expected traffic.

Together, these concerns may lead companies to deploy their own models rather than use centralized services. In the next part of this blog post, I will explain why larger models are preferable to smaller ones in such a case, and then we will dive into techniques for accelerating inference of these models.

Importantly, the techniques presented in this blog post are orthogonal to each other: one may apply any subset of them to their own model, depending on their needs. This is the first installment in a four-part series.

Why Large Models?

Numerous papers have shown that there exists a power-law relationship between model size, dataset size, compute and the empirical performance of the model [1]. Moreover, it is apparent that larger models are more parameter efficient, in that they can reach a lower loss using the same number of tokens. Practitioners may, thus, opt to train a larger model rather than scale up the dataset, especially in scenarios where collecting more data is prohibitively expensive.
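Schematically, the scaling laws in [1] take a power-law form. For model size, for example, the test loss behaves roughly as

L(N) ≈ (N_c / N)^(α_N)

where N is the number of parameters and N_c, α_N are empirically fitted constants, with analogous laws for dataset size and compute.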

Figure from [1]

You may wonder, however, whether this is worth it. Why should one increase the model size and/or dataset by 10x for a relatively small improvement in the loss? It turns out that LLMs exhibit what’s known as emergent abilities [2]. Specifically, these are tasks that language models struggle with at small scales (i.e. their performance is close to random), but on which performance spikes past a certain scale threshold.

Fig. 2 from [2]

These curves look closer to a step function than to the linear or logarithmic increase one might expect. As such, for certain applications, small language models are simply not feasible.

Of course, larger models pose their own problems. For example, running a 40B Falcon model, one of the best open-source models available, in full FP32 precision will require roughly:

  • (4 bytes * 40B parameters) = 160 GB for storing the model weights, and
  • another (4 bytes * 40B) ≈ 160 GB as a rough budget for the activations

This comes out to 320 GB of VRAM, or 4 NVIDIA A100 80 GB GPUs (each costing ~$15k). Clearly, this is not an attractive ROI proposition.
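As a quick back-of-the-envelope check (an illustrative sketch only; real memory usage also depends on batch size, sequence length, the KV cache and framework overhead), the arithmetic looks like this:

# Rough VRAM estimate: weights plus an equally-sized budget for activations,
# following the crude estimate above. Not a substitute for real profiling.
def rough_vram_gb(num_params: float, bytes_per_param: float) -> float:
    weights = num_params * bytes_per_param
    activations = num_params * bytes_per_param
    return (weights + activations) / 1e9

print(rough_vram_gb(40e9, 4))  # FP32: ~320 GB
print(rough_vram_gb(40e9, 2))  # FP16: ~160 GB
print(rough_vram_gb(40e9, 1))  # int8: ~80 GB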

Luckily, however, there are many optimization techniques that have emerged in recent years that alleviate this problem. We will go over some of them in the next sections.

Quantization

Floating point formats

To understand quantization, we will first explain the commonly used FP32 format. Simply put, it represents floating point numbers using:

  • 1 bit for the sign
  • 8 bits for the exponent
  • 23 bits for the mantissa (fractional part of the number)

This is illustrated below:

Fig. from [3]

The floating point number can be calculated from its FP32 representation using:
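value = sign * 2^(E - 127) * (1 + b_1/2 + b_2/4 + … + b_23/2^23)

where the sign factor is +1 if the sign bit is 0 and -1 otherwise, E is the 8-bit exponent field read as an unsigned integer (127 is the bias), and b_1 … b_23 are the mantissa bits from left to right.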

In other words, the exponent field is a regular base-2 integer, while the mantissa is a sum of inverse powers of 2. In the above example:

  • the sign factor is +1 (i.e. the number is positive)
  • E = 124, so the exponent term is 2^(124 - 127) = 2^(-3)
  • the mantissa term (the rightmost factor) comes out to 1.25

Thus the number represented above is:

2^(-3) * 1.25 = 0.15625
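If you want to double-check this yourself, a few lines of Python are enough to pull apart the bit fields (a quick illustrative snippet using only the standard library):

import struct

# Reinterpret the 4 bytes of an FP32 value as an unsigned integer.
bits = struct.unpack(">I", struct.pack(">f", 0.15625))[0]

sign = bits >> 31                      # 0 -> the sign factor is +1
E = (bits >> 23) & 0xFF                # 124
fraction = (bits & 0x7FFFFF) / 2**23   # 0.25

value = (-1) ** sign * 2 ** (E - 127) * (1 + fraction)
print(E, 1 + fraction, value)          # 124 1.25 0.15625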

Of course, FP32 is not the only format that can be used for representing numbers. It turns out that in machine learning we don’t actually need this level of precision. In fact, training using half precision (FP16) has been a longstanding practice [4]. Representing the activations and gradients using a lower bit-width offers significant speedups and memory savings, due to reduced data movement and better arithmetic throughput. Moreover, efficient tensor cores have been developed that are specialized for such operations.

Many different floating point formats have been developed, each striking a different balance between the representable range (how many bits are allocated to the exponent) and the precision (how many bits are allocated to the mantissa). Below are a few examples:
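  • FP32: 1 sign bit, 8 exponent bits, 23 mantissa bits
  • FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits
  • BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits
  • TF32: 1 sign bit, 8 exponent bits, 10 mantissa bits (19 bits total, used internally by NVIDIA tensor cores)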

One should choose the appropriate data format depending on the magnitude of the gradients as well as the precision required to maintain accuracy. Training LLMs, for instance, is notoriously unstable using FP16, with researchers opting for BF16 instead.
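In practice, picking a half-precision format is a one-liner in most frameworks. With the transformers library, for instance, the torch_dtype argument controls the precision the weights are loaded in (a minimal example, using the same model as the snippets below):

# pip install transformers accelerate
import torch
from transformers import AutoModelForCausalLM

model_id = "bigscience/bloom-1b7"

# Load the weights directly in bfloat16 (half the memory of FP32).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)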

8-bit quantization for inference

The previous section introduced different floating point formats. Going from 4 bytes down to 2 is already a nice saving. For inference, however, where having good gradient estimates is not a concern, it is possible to go down to a single byte.

This is not done with yet another floating point representation, but rather by approximating the numbers via rounding. Quantization is usually applied per tensor slice (i.e. per row or column of the weight matrix): for each slice, a scaling factor is computed and used to convert the numbers to their quantized (integer) representation. Thus, each row/column has its own scaling factor.

The advantage is that this lets us perform the matrix multiplications using integer arithmetic, which is much faster.

For more details on how quantization generally works, we refer the reader to this documentation page. We illustrate below how the process looks conceptually:
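As a rough sketch of the idea (absmax quantization of a single row, for illustration only; real implementations are considerably more careful), the round trip looks like this:

import torch

def quantize_row_absmax(x: torch.Tensor):
    # Scale so that the largest absolute value maps to 127 (the int8 maximum).
    scale = 127.0 / x.abs().max().clamp(min=1e-8)
    return torch.round(x * scale).to(torch.int8), scale

def dequantize_row(x_int8: torch.Tensor, scale: torch.Tensor):
    return x_int8.to(torch.float32) / scale

row = torch.tensor([0.1, -0.4, 2.0, 0.05])
q, scale = quantize_row_absmax(row)
print(q)                        # tensor([  6, -25, 127,   3], dtype=torch.int8)
print(dequantize_row(q, scale)) # close to the original values, up to rounding error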

It is important to note, however, that up until recently, traditional quantization techniques have struggled to maintain the same level of accuracy when applied to LLMs.

LLM.int8()

In LLM.int8() [6], the authors discover that outlier features are the cause for this underperformance. Specifically, these are features that skew the min and max values of the float vector and cause the quantization process to be excessively noisy. Moreover, these errors accumulate as we perform the forward pass, leading to unacceptable accuracy degradation.

As such, the proposed solution is as follows (direct quote from [7]):

  1. From the input hidden states, extract the outliers (i.e. values that are larger than a certain threshold) by column.
  2. Perform the matrix multiplication of the outliers in FP16 and the non-outliers in int8.
  3. Dequantize the non-outlier results and add both outlier and non-outlier results together to receive the full result in FP16.

The process is illustrated below:
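To make the three steps above more concrete, here is a heavily simplified sketch of the decomposition (for illustration only; the actual bitsandbytes kernels are far more sophisticated, run the int8 path on specialized hardware, and keep the high-precision path in FP16 rather than the FP32 used here for simplicity):

import torch

def int8_mixed_matmul(X: torch.Tensor, W: torch.Tensor, threshold: float = 6.0):
    # X: (batch, d_in) hidden states, W: (d_in, d_out) weights.
    # 1. Extract outlier columns of X (features whose magnitude exceeds the threshold).
    outliers = X.abs().amax(dim=0) > threshold
    # 2a. The outlier part stays in higher precision.
    out_hi = X[:, outliers] @ W[outliers, :]
    # 2b. The non-outlier part is quantized: per-row scales for X, per-column scales for W.
    Xs, Ws = X[:, ~outliers], W[~outliers, :]
    sx = Xs.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    sw = Ws.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0
    Xq = torch.round(Xs / sx).to(torch.int8)
    Wq = torch.round(Ws / sw).to(torch.int8)
    # Int8 matmul with int32 accumulation (emulated on CPU here), then dequantize.
    out_lo = (Xq.to(torch.int32) @ Wq.to(torch.int32)).float() * (sx * sw)
    # 3. Add both partial results back together to get the full output.
    return out_hi + out_lo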

In terms of benchmarks, quantizing in this way shows zero accuracy degradation across a wide range of models and tasks.

In terms of speed, it is 15% to 23% slower than FP16, with the main advantage being the reduced memory usage. This is benchmarked below:

For more benchmarks and a tutorial on how to use it with the transformers library, we refer the reader to the official blog post [7].

Since this method came out, using it within the HuggingFace transformers framework has become very straightforward. All you need to do is specify the “load_in_8bit” flag:

# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-1b7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

More recently, methods for running inference in 4-bit precision (so roughly 8x less memory than FP32) have emerged as well, and are equally simple to use:

# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-1b7"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True)
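To see the savings for yourself, the loaded model exposes a helper that reports the memory taken up by its parameters:

# Memory occupied by the model's parameters and buffers, in bytes.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")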

Wrapping Up

In this first part of our blog series on efficient inference of LLMs, we discussed the challenges associated with centralized model-as-a-service approaches: privacy concerns, closed source code, opaque guardrails, and cost.

At the same time, larger models have been shown to follow a power-law relationship between dataset size, model size, compute and performance. Larger models are more parameter efficient and can achieve superior results, making them preferable in many cases to simply scaling up the dataset.

To get the best of both worlds, then, the ideal would be to deploy an open-source LLM. This, however, presents considerable scalability challenges.

By representing numbers in lower-precision formats, as is done in the LLM.int8() method, we can drastically reduce memory usage without sacrificing accuracy.

In the second part of this series, we will delve into pruning and how that can help. Stay tuned!
