Offloading the optimal number of model layers for a given LLM and GPU card

Sean Ryan
2 min read · Oct 22, 2023
Tuning the machine for best performance [Image generated via Adobe Firefly]

When deploying an LLM, we want to minimize the inference time, given the available hardware.

Goal: for a given LLM on the available GPU hardware, determine the optimal number of layers to offload to the GPU.

Generally, we want to offload as many model layers as possible to the GPU to improve inference time (performance in terms of execution time), because GPUs tend to be much better than CPUs at performing matrix operations in parallel.

The number of layers we can offload from the CPU onto the GPU depends on the hardware (dedicated GPU RAM, not shared, at least when hosting via the Python ctransformers library) and on the LLM itself. Even the same model can have different layer sizes if, for example, it is quantized to different precisions.
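As a concrete illustration, here is a minimal sketch of loading a quantized GGUF model with ctransformers and offloading a chosen number of layers via the gpu_layers parameter. The repo id, model file and layer count below are assumptions for illustration only; substitute your own model and a count that fits your card's dedicated RAM (a CUDA-enabled build of ctransformers is also required for GPU offloading).

```python
from ctransformers import AutoModelForCausalLM

# Load a quantized model and offload some of its layers to the GPU.
# Repo id, model file and layer count are illustrative assumptions.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGUF",            # example Hugging Face repo id
    model_file="llama-2-7b.Q4_K_M.gguf",   # example 4-bit quantized file
    model_type="llama",
    gpu_layers=32,                         # number of layers offloaded to the GPU
)

# Run a short generation to confirm the model loaded and responds.
print(llm("The capital of France is", max_new_tokens=16))
```

The remaining layers (if any) stay on the CPU, so the same script works on machines with little or no dedicated GPU memory simply by lowering gpu_layers.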

Offloading too many layers to the GPU usually results in an out-of-memory exception. Worse, in some cases and with more recent Nvidia drivers, it can apparently cause severe performance degradation instead.
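One way to reduce the guesswork is to check how much dedicated GPU memory is actually free before loading the model. Below is a small sketch using the nvidia-ml-py (pynvml) bindings, which report the card's dedicated memory rather than shared system memory; the GPU index is an assumption for a single-GPU machine.

```python
import pynvml

# Query the dedicated memory of the first GPU (index 0 is an assumption;
# adjust for multi-GPU machines).
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Dedicated GPU memory: {mem.total / 1024**3:.1f} GiB total, "
      f"{mem.free / 1024**3:.1f} GiB free")
pynvml.nvmlShutdown()
```

Comparing the free figure against the size of the quantized model file gives a rough upper bound on how many layers will fit, though the real footprint also includes the context, so leaving some headroom is prudent.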

Typically, the highest layer count that does not result in an out-of-memory exception is the best performer. However, this depends on how the LLM is loaded and executed (which library, how that library was built and configured, which Nvidia drivers…).
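In practice, then, it is easiest to measure. Below is a rough sketch of such a search: it tries a few candidate layer counts, times a short generation for each, and skips any candidate that fails. The repo id, model file, prompt and layer range are assumptions for illustration, and the sketch assumes failures (such as running out of GPU memory) surface as Python exceptions, which may not hold for every build of the library.

```python
import time
from typing import Optional

from ctransformers import AutoModelForCausalLM


def time_generation(gpu_layers: int, prompt: str = "Hello, world",
                    tokens: int = 32) -> Optional[float]:
    """Load the model with the given layer count and time a short generation.

    Returns None if loading or generation fails (e.g. out of GPU memory).
    """
    try:
        llm = AutoModelForCausalLM.from_pretrained(
            "TheBloke/Llama-2-7B-GGUF",           # example repo id
            model_file="llama-2-7b.Q4_K_M.gguf",  # example quantized file
            model_type="llama",
            gpu_layers=gpu_layers,
        )
        start = time.perf_counter()
        llm(prompt, max_new_tokens=tokens)
        return time.perf_counter() - start
    except Exception:
        return None


# Try a range of layer counts and keep the timings of the ones that work.
results = {}
for layers in range(40, 0, -8):
    elapsed = time_generation(layers)
    print(f"gpu_layers={layers}: "
          f"{'failed' if elapsed is None else f'{elapsed:.2f}s'}")
    if elapsed is not None:
        results[layers] = elapsed

if results:
    best = min(results, key=results.get)
    print(f"Best layer count: {best} ({results[best]:.2f}s)")
```

A coarse sweep like this is usually enough to find the sweet spot; once a promising count is known, it can be refined with smaller steps around it.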
