
Reduce the Carbon Footprint of Large Language Models

Greener Generative AI with Intel Neural Compressor and Intel Extension for Transformers

Intel(R) Neural Compressor
5 min read · Oct 16, 2023


Tai Huang, Hanwen Chang, Wenxin Zhang, Liang Lv, and Haihao Shen, Intel Corporation

As large language models (LLMs) become increasingly prevalent, the energy consumption and carbon emissions associated with them are significant and have raised concerns about their environmental impact. One study suggested that the carbon emissions associated with AI are equivalent to those of the entire airline industry, and LLMs like ChatGPT are among AI’s most computationally expensive applications. Another study contends that a single ChatGPT request can consume a hundred times more energy than one Google search.

Training and inference are the two main processes in LLMs. Although there is still debate over whether training or inference consumes more energy, it is clear that inference is used more frequently as LLMs serve millions of users. Facebook estimates that the carbon footprint of their Transformer-based Universal Language Model for text translation is dominated by inference. Google estimates that 60% of the energy required by their AI workloads is used for inference.

Many researchers are thinking about the energy and environmental costs of AI and exploring ways to reduce the energy consumption and carbon footprint of LLMs. This article introduces related efforts in Intel Neural Compressor and Intel Extension for Transformers. We share practices for reducing the carbon footprint of LLMs and the rationale behind them, and we demonstrate their effectiveness by benchmarking the inference carbon footprint of optimized models against unoptimized ones.

Estimating the Carbon Footprint of LLM Inference

Before we dig into the details, let’s define how the carbon footprint of LLM inference is estimated. The formula is simple:

Carbon footprint = E * C

where E is the total energy consumption of inference (in kWh) and C is the carbon intensity of electricity (in kgCO₂e/kWh), i.e., the amount of CO₂ equivalent (CO₂e) emitted to produce one unit of electricity.

Total energy consumption can be either measured or estimated. Measuring relies on an external power meter or server hardware support, so we will use a simpler estimation that assumes LLM inference is the only workload on the server:

E = P * T * N

where P is the server power consumption (in Watts), T is token latency (in seconds), which refers to the time it takes for the LLM to process a single token during inference, and N is the number of tokens processed during inference.

If we’re using the CPU for inference, the CPU and memory subsystem are the two major components contributing to server power consumption:

P = Pcpu + Pmem * M

where Pcpu is the CPU power consumption (with TDP being a reasonable approximation), Pmem is the power consumption per unit of memory (in Watts/GB), and M is memory usage (in GB). For example, suppose we are running LLM inference on an Intel Xeon Platinum 8480+ CPU (350 W TDP) that generates 32 tokens in one inference with a token latency of 0.47 s. Memory usage is about 60 GB, DDR5 Pmem is about 0.1 W/GB, and the carbon intensity in this geographic region is 0.56 kgCO₂e/kWh.

E = P * T * N = (350 W + 0.1 W/GB * 60 GB) * 0.47 s * 32 = 5354.24 J ≈ 1.49e-3 kWh

Carbon footprint = E * C = 1.49e-3 kWh * 0.56 kgCO₂e/kWh ≈ 8.33e-4 kgCO₂e
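
To make it easy to plug in your own numbers, here is a minimal Python sketch of the same estimation. The function and argument names are just illustrative, and the values below are the example figures from this section, not measurements to reuse as-is.

def inference_carbon_footprint(p_cpu_w, p_mem_w_per_gb, mem_gb,
                               token_latency_s, num_tokens,
                               carbon_intensity_kgco2e_per_kwh):
    """Estimate the CO2e emissions of a single LLM inference request."""
    p_server_w = p_cpu_w + p_mem_w_per_gb * mem_gb        # P = Pcpu + Pmem * M
    energy_j = p_server_w * token_latency_s * num_tokens  # E = P * T * N (in joules)
    energy_kwh = energy_j / 3.6e6                         # 1 kWh = 3.6e6 J
    return energy_kwh * carbon_intensity_kgco2e_per_kwh   # footprint = E * C

# Example from this article: Xeon Platinum 8480+, 32 output tokens
print(inference_carbon_footprint(350, 0.1, 60, 0.47, 32, 0.56))  # ~8.3e-4 kgCO2e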

You can try this simple carbon emission calculator to estimate the emissions of your own model.

How Intel Neural Compressor Helps Reduce Carbon Footprint

Carbon footprint reduction flow

Compared to traditional INT8 quantization of both activations and weights, weight-only quantization (WOQ) offers a better tradeoff between performance and accuracy for LLMs. Intel Neural Compressor supports WOQ with state-of-the-art approaches like GPTQ, AWQ, and TEQ, as well as the simple yet effective round-to-nearest (RTN) approach.

With 4-bit WOQ, model size can be reduced by up to 8x, allowing the model to run on cheaper hardware and/or with higher speed because both memory footprint and bandwidth usage are much smaller. Memory power consumption decreases proportionally according to the approximation formula above.
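
To make the RTN idea concrete, here is a minimal NumPy sketch of group-wise, symmetric 4-bit round-to-nearest quantization (group size 32, matching the benchmark configuration later in this article). It only illustrates the technique; the function names are made up, and this is not the Intel Neural Compressor implementation.

import numpy as np

def rtn_quantize_int4(weight, group_size=32):
    """Group-wise symmetric round-to-nearest (RTN) quantization to the INT4 range."""
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    scales = np.abs(w).max(axis=-1, keepdims=True) / 7.0      # map max |w| in each group to 7
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # 4-bit values (held in int8 here)
    return q.reshape(out_f, in_f), scales

def rtn_dequantize(q, scales, group_size=32):
    out_f, in_f = q.shape
    return (q.reshape(out_f, in_f // group_size, group_size) * scales).reshape(out_f, in_f)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = rtn_quantize_int4(w)
print(np.abs(w - rtn_dequantize(q, s)).mean())  # reconstruction error stays small

Each group of 32 weights shares one scale, so per-weight storage drops from 32 bits to roughly 4 bits plus a small overhead for the scales, which is where the up-to-8x size reduction comes from.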

Although 4-bit WOQ dramatically decreases memory consumption, the actual data type used in computation is INT8 because there is currently no hardware acceleration support for 4-bit data types. Computational cost is reduced by INT8 hardware acceleration instructions, which make inference several times faster than with the original FP32 model. The carbon footprint is significantly reduced due to the shorter inference time of the quantized model.
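
The split between 4-bit storage and INT8 computation can be sketched as follows: two signed 4-bit values are packed into one byte for storage and unpacked to int8 just before the INT8 matrix multiplication runs. This is only a schematic with made-up helper names; the real kernels do the unpacking in registers with hardware instructions.

import numpy as np

def pack_int4(q):
    """Pack pairs of signed 4-bit values (held in int8) into single bytes."""
    nib = q.astype(np.uint8) & 0x0F                 # keep the two's-complement low nibble
    return nib[..., 0::2] | (nib[..., 1::2] << 4)   # storage is halved compared to int8

def unpack_int4(packed):
    """Unpack back to int8 so the matmul can use INT8 hardware instructions."""
    low = (packed & 0x0F).astype(np.int8)
    high = ((packed >> 4) & 0x0F).astype(np.int8)
    low = np.where(low >= 8, low - 16, low).astype(np.int8)     # sign-extend 4-bit values
    high = np.where(high >= 8, high - 16, high).astype(np.int8)
    out = np.empty(packed.shape[:-1] + (2 * packed.shape[-1],), dtype=np.int8)
    out[..., 0::2], out[..., 1::2] = low, high
    return out

q = np.random.randint(-8, 8, size=(64, 64), dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(q)), q)  # packing is lossless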

How Intel Extension for Transformers Helps Reduce Carbon Footprint

WOQ effectively reduces the carbon footprint, although we did observe additional emissions from format conversion and communication between operations. Intel Extension for Transformers integrates low-precision kernels that eliminate this overhead by passing dequantized data through registers for computation instead of shared memory. Moreover, it supports operator fusion to minimize memory usage between operators and introduces an advanced memory allocator to conserve activation memory. In addition to the memory reduction, implementing the kernels with hardware instructions guarantees peak performance, so computational cost is substantially diminished. With the combined advantages of reduced memory usage and enhanced computational efficiency, we can confidently say that Intel Extension for Transformers is an environmentally conscious accelerator.
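
As a usage-level sketch, Intel Extension for Transformers exposes a Hugging Face Transformers-style API for loading a model with weight-only INT4 quantization so that these optimized kernels are used during generation. The model name below is only an example, and exact argument names can vary between releases, so check the repository README for your installed version.

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"   # example model; any supported causal LM works
prompt = "Once upon a time,"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# load_in_4bit applies weight-only INT4 quantization backed by the optimized CPU kernels
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))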

Carbon Footprint Benchmark Results

We benchmarked the inference carbon footprint of several popular LLMs with both the original FP32 models and the INT4 WOQ models. The carbon footprint reduction ratio ranges from 2x to 19x; in general, the larger the model, the larger the reduction:

Inference carbon footprint results of nine LLMs

System Configuration

CPU: Intel Xeon Platinum 8480+ (base frequency 2.8 GHz, maximum frequency 3.2 GHz, all-core maximum frequency 3.0 GHz), 56 cores per socket, TDP 350 W.
Memory: 256 GB (16 x 16 GB DDR5 4800 MT/s).
OS: CentOS Stream 8, kernel 5.16.0-rc1-intel-next-00543-g5867b0a2a125.
Test configuration: batch size 1, 56 cores per instance, 32 output tokens, group size 32 for the INT4 WOQ model.

Concluding Remarks

We are committed to greener AI and long-term sustainability through state-of-the-art model compression and runtime optimization techniques. This work demonstrates the effectiveness of Intel Neural Compressor and Intel Extension for Transformers in reducing the carbon footprint of generative AI. We encourage you to try these and other tools in the Intel oneAPI AI Kit. Please add a star to the Intel Neural Compressor and Intel Extension for Transformers repositories to receive notifications about our latest optimizations.
