Maximizing Efficiency: A Comprehensive Guide to GPU and Memory Selection for Training, Tuning, and Serving Large Language Models

Suresh Pawar
Apr 15, 2024


Introduction:

In recent years, large language models (LLMs) have revolutionized natural language processing tasks, from text generation to summarization. However, training and serving these models efficiently poses significant challenges, especially for resource-intensive workloads. In this blog, we’ll explore practical strategies for optimizing GPU resource allocation during both the training and serving phases of LLMs. We’ll delve into example calculations for fine-tuning an LLM using Parameter-Efficient Fine-Tuning (PEFT), full fine-tuning, and serving an LLM to a large number of concurrent users.

Note: In this blog, I’ve utilized Mixtral Instruct 7B as an illustrative example to provide a general understanding of the concepts discussed. Feel free to substitute it with any other large language model (LLM) of your choice, using this blog as a reference for guidance and calculations.

Training Large Language Models:

Training LLMs involves fine-tuning pre-trained models on specific tasks or domains. Let’s consider two approaches: Parameter-Efficient Fine-Tuning (PEFT) and full fine-tuning.

Calculating GPU requirements for Large Language Models (LLMs) based on model parameters involves several factors, including the model architecture, batch size, memory requirements, and computational complexity. Here’s a general approach to estimate GPU requirements based on model parameters:

  1. Determine Model Architecture: Different LLM architectures have different computational requirements. Examples include GPT (Generative Pre-trained Transformer) models and BERT (Bidirectional Encoder Representations from Transformers) models. Each architecture may have different memory and computational demands.
  2. Understand Memory Requirements: LLMs typically require significant GPU memory (VRAM) to store model parameters and intermediate activations during computation. The required memory depends on the model size, number of layers, hidden dimensions, and sequence length. For large models, you may need GPUs with high VRAM capacity.
  3. Consider Batch Size: Training LLMs with larger batch sizes can improve GPU utilization and training efficiency. However, larger batch sizes also require more memory. You may need to adjust the batch size based on the available GPU memory.
  4. Estimate Computational Complexity: The computational complexity of LLMs depends on factors like the number of parameters, attention heads, layers, and sequence length. Larger models with more parameters and deeper architectures require more computational power.

Memory Requirements for LLMs:

  1. Model Parameters Memory:

The size of the LLM’s parameters directly affects the memory requirements. Each parameter, such as weights and biases in neural network layers, occupies memory in the GPU.

Calculation: Total model parameters * size of each parameter (typically in bytes) = Memory required for storing parameters.

For example, the Mixtral Instruct 7B model has approximately 7 billion parameters.

In a trained model, various parameter types are utilized, each with its own memory footprint:

Data type and memory per parameter:

  • float32 (32-bit): 4 bytes per parameter
  • float16 / bfloat16 (16-bit): 2 bytes per parameter
  • int8 (8-bit): 1 byte per parameter
  • int4 (4-bit): 0.5 bytes per parameter
  • Assuming each parameter occupies 4 bytes (32 bits), the memory required for 7B parameters = 7e9 * 4 bytes ≈ 28 GB.

This refers to the memory required to store the model’s parameters and buffers. The estimate is 28.00 GB for float32 precision.
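To make this concrete, here is a minimal Python sketch (the helper name parameter_memory_gb and the 7e9 parameter count are illustrative assumptions) that turns the byte sizes above into memory estimates for different precisions:

```python
# Rough estimate of the memory needed just to store model parameters.
# Assumes ~7e9 parameters, the illustrative figure used throughout this post.

BYTES_PER_PARAM = {
    "float32": 4,
    "float16/bfloat16": 2,
    "int8": 1,
    "int4": 0.5,
}

def parameter_memory_gb(num_params: float, dtype: str) -> float:
    """Memory (in GB) to hold num_params parameters at the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 7e9
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>18}: {parameter_memory_gb(num_params, dtype):.1f} GB")

# float32 -> 28.0 GB, float16/bfloat16 -> 14.0 GB, int8 -> 7.0 GB, int4 -> 3.5 GB
```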

2. Gradient calculation:

This stage computes the gradients of the loss with respect to the model’s parameters. The memory usage is similar to the model size, 28.00 GB for float32.

3. Backward pass:

During training, LLMs compute and store intermediate activations (output of each layer) for backpropagation. These activations consume additional memory.

Calculation: Number of layers * size of activations per layer * batch size = Memory required for storing intermediate activations.

This stage computes the gradients of the loss with respect to the inputs, which requires storing the intermediate activations and gradients. According to Hugging Face, the estimated memory usage for Mixtral Instruct 7B is 54.98 GB for float32, which is roughly twice the model size. (In the formula above, the documentation lists 32 layers; the size of the activations per layer is not specified, and the batch size can vary.)

4. Optimizer step: This stage updates the model’s parameters using the gradients and optimizer-specific state (e.g., moment estimates for Adam). According to Hugging Face, the estimated memory usage is 109.96 GB for float32, which is roughly four times the model size.
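Putting the four stages together, here is a hedged back-of-the-envelope sketch of a float32 training memory budget. The multipliers (one copy of the parameters, one copy of the gradients, two extra Adam state tensors, and an activation term assumed at roughly twice the model size) are common rules of thumb rather than measured figures, so they will not reproduce the Hugging Face estimates above exactly:

```python
def training_memory_gb(num_params: float,
                       bytes_per_param: int = 4,       # float32
                       optimizer_states: int = 2,      # Adam keeps two moment tensors
                       activation_factor: float = 2.0  # assumed multiple of model size
                       ) -> dict:
    """Back-of-the-envelope float32 training memory budget in GB."""
    model = num_params * bytes_per_param / 1e9
    gradients = model                        # one gradient value per parameter
    optimizer = model * optimizer_states     # extra optimizer state tensors
    activations = model * activation_factor  # very rough; depends on batch and sequence length
    return {"model": model, "gradients": gradients, "optimizer": optimizer,
            "activations": activations,
            "total": model + gradients + optimizer + activations}

print(training_memory_gb(7e9))
# ~28 + 28 + 56 + 56 = ~168 GB before batch-size and sequence-length effects
```

For a 7B-parameter model this lands well above the capacity of a single 40 GB or 80 GB GPU, which is why full-precision training is typically sharded across several devices.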

Pre-Training a Large Language Model from Scratch:

To understand the amount of memory and the number of GPUs needed to pre-train a given model, i.e. Mixtral Instruct 7B, we consider factors such as batch size, training time, and resource availability.

The selection formula for determining the number of GPUs needed for pre-training typically involves considering the following factors:

Batch Size: Determine the desired batch size for training. This depends on factors such as GPU memory capacity, training efficiency, and throughput requirements.

Training Time: Estimate the total training time required to complete the pre-training process. This includes factors such as the number of training epochs, the learning rate schedule, and convergence criteria.

Resource Availability: Assess the availability of GPUs and other resources. Consider factors such as the number of available GPUs, memory capacity, and compute power.

Throughput Requirement: Determine the desired throughput for the pre-training process, i.e., the number of training iterations completed per unit time.

Based on these factors, the number of GPUs needed for pre-training can be calculated by balancing throughput requirements, training time constraints, and resource availability.

Example Calculation for Pre-Training:

This is just an example to show how a model is pre-trained; the actual Mixtral 7B would have been trained for many hours or days, over many iterations, and would likely have needed many more GPUs.

The figures below are purely for understanding. Let’s assume we have 16 GPUs available (a randomly picked number). Again, the choice of GPU type depends on the model you are pre-training; Mixtral 7B was likely pre-trained on very powerful GPUs such as the NVIDIA A100, which has 40 GB of memory.

Given Parameters:

  • Mixtral Instruct 7B model parameters: 7 billion
  • Batch size: 8
  • Training epochs: 10
  • Total available GPUs: 16
  • Training time per epoch: 2 hours

Calculation Steps:

Throughput Requirement: Determine the desired throughput, i.e., the number of training iterations completed per hour.

  • Assuming each GPU can complete 2 training iterations per hour:
  • Total throughput = 16 GPUs * 2 iterations/GPU/hour = 32 iterations/hour.

Total Training Iterations: Calculate the total number of training iterations needed to complete training:

  • Total iterations = Training epochs * Number of iterations per epoch = 10 * 32 = 320 iterations.

Training Time: Estimate the total training time required:

  • Total training time = Total iterations / Total throughput = 320 iterations / 32 iterations/hour = 10 hours.

Number of GPUs: Determine the number of GPUs needed based on the desired throughput and training time (the sketch after these steps repeats the same arithmetic in code):

  • Number of GPUs = Total throughput / Desired throughput per GPU
  • Number of GPUs = 32 iterations/hour / 2 iterations/GPU/hour = 16 GPUs.
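The same arithmetic as a small Python sketch; every input below is one of the illustrative numbers assumed above, not a measured value:

```python
import math

# Illustrative inputs from the example above.
gpus_available = 16
iterations_per_gpu_per_hour = 2
epochs = 10
iterations_per_epoch = 32   # taken equal to the cluster's hourly throughput in this example

total_throughput = gpus_available * iterations_per_gpu_per_hour          # 32 iterations/hour
total_iterations = epochs * iterations_per_epoch                         # 320 iterations
training_hours = total_iterations / total_throughput                     # 10 hours
gpus_needed = math.ceil(total_throughput / iterations_per_gpu_per_hour)  # 16 GPUs

print(total_throughput, total_iterations, training_hours, gpus_needed)
```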

In the scenario outlined, pre-training a model for 10 hours with 320 iterations requires 16 GPUs to achieve the desired throughput. As the training duration and the number of iterations increase, the demand for computational resources rises accordingly, so the number of GPUs must be scaled up to keep training efficient and meet performance targets. Scaling the infrastructure in this way ensures optimal resource utilization and delivers the desired training outcomes within the specified timeframes.

Large models usually need training for weeks, so we can imagine how expensive it is to pre-train a model.

Parameter Efficient Fine-Tuning (PEFT):

PEFT involves fine-tuning only a small subset of the model’s parameters (in this example, the output layer) while keeping the rest fixed. This approach is suitable for tasks where the model’s general knowledge is relevant and only minor adjustments are needed.

Memory Requirements Calculation:

  • Mixtral 7B model parameters: 7 billion
  • Fine-Tuning Parameters: For PEFT, only the output layer parameters need to be fine-tuned. Let’s assume this accounts for 1% of the total parameters.

Example Calculation:

  • Mixtral 7B model parameters: 7 billion
  • Fine-tuning parameters (output layer): 1% of 7 billion = 0.01 * 7 billion = 70 million
  • Assuming each parameter occupies 4 bytes:

Memory required for fine-tuning parameters = 70 million * 4 bytes = 280 MB

When performing Parameter Efficient Fine-Tuning (PEFT), although only a subset of parameters (typically the output layer) is fine-tuned, the entire model still needs to be loaded into memory during the fine-tuning process. Therefore, we do need to consider the memory required to store the entire model, including both the fixed parameters and the fine-tuning parameters.

To calculate the memory required to store the whole model after PEFT, we need to sum the memory requirements of all the parameters, including the fixed parameters and the fine-tuning parameters. Let’s denote:

  • P_fixed: Memory required for storing the fixed parameters of the model before fine-tuning.
  • P_fine-tuning: Memory required for storing the fine-tuning parameters (e.g., the output layer).
  • P_total: Total memory required for storing the whole model after fine-tuning.

The calculation for P_total can be expressed as:

P_total = P_fixed + P_fine-tuning

Given that P_fixed is the memory required to store the entire model before fine-tuning, it’s typically calculated using the same method used to estimate the memory footprint of the original model.

Let’s say P_fixed for the Mixtral 7B model is 28 GB (as explained above). Then, the memory required to store the entire model after PEFT (P_total) can be calculated as:

P_total = 28 GB + 280 MB ≈ 28.28 GB

This calculation provides an estimate of the total memory required to store the entire model after performing PEFT.
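As a quick sketch, here is the same PEFT estimate in Python; the function name and the 1% trainable fraction are the assumptions made above:

```python
def peft_memory_gb(num_params: float,
                   trainable_fraction: float = 0.01,   # assumed 1% of parameters
                   bytes_per_param: int = 4) -> dict:  # float32
    """Memory (GB) for the frozen base model plus the trainable subset."""
    p_fixed = num_params * bytes_per_param / 1e9
    p_fine_tuning = num_params * trainable_fraction * bytes_per_param / 1e9
    return {"P_fixed": p_fixed,
            "P_fine_tuning": p_fine_tuning,
            "P_total": p_fixed + p_fine_tuning}

print(peft_memory_gb(7e9))
# {'P_fixed': 28.0, 'P_fine_tuning': 0.28, 'P_total': 28.28}
```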

However, the exact number of GPUs needed for PEFT depends on various factors such as the size of the fine-tuning parameters, the batch size, the training duration, training epochs and the desired throughput. With fewer parameters to update compared to full fine-tuning, PEFT may be more resource-efficient and could potentially require fewer GPUs to achieve similar training objectives.

Full Fine-Tuning:

For full fine-tuning, where all parameters of the Mixtral Instruct 7B model are fine-tuned, the selection formula for determining the number of GPUs needed is similar to pre-training but considers the following additional factors:

Model Size: Assess the total memory requirement for loading the entire model into GPU memory during training. This depends on the size of the model parameters and any additional resources needed for training.

Batch Size and Throughput: Similar to pre-training or PEFT, determine the optimal batch size and throughput requirements for training. Adjustments may be needed based on the larger memory footprint of the full model.

Resource Constraints: Evaluate resource constraints such as GPU memory capacity, compute power, and training time limitations. Ensure that the selected number of GPUs can efficiently handle the full fine-tuning process within these constraints.

Example Calculation for Full Fine-Tuning:

1. Model: This refers to the memory required to store the model’s parameters and buffers. The estimate is 27.49 GB for float32 precision.

2. Gradient calculation: This stage computes the gradients of the loss with respect to the model’s parameters. The memory usage is similar to the model size, 27.49 GB for float32.

3. Backward pass: This stage computes the gradients of the loss with respect to the inputs, which requires storing the intermediate activations and gradients. The estimated memory usage is 54.98 GB for float32, which is roughly twice the model size.

4. Optimizer step: This stage updates the model’s parameters using the gradients and optimizer-specific state (e.g., moment estimates for Adam). The estimated memory usage is 109.96 GB for float32, which is roughly four times the model size.
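Taking the largest of those stage estimates as the peak requirement, here is a minimal sketch of how many GPUs of a given size would be needed just to hold it. It assumes the float32 figures quoted above and an NVIDIA A100 40 GB, and it ignores activations for large batches, sharding, and communication overhead:

```python
import math

# Hugging Face float32 estimates quoted above, in GB.
stage_estimates_gb = {
    "model": 27.49,
    "gradient_calculation": 27.49,
    "backward_pass": 54.98,
    "optimizer_step": 109.96,
}

peak_gb = max(stage_estimates_gb.values())     # treat the largest stage as the peak need
gpu_memory_gb = 40                             # e.g. NVIDIA A100 40 GB (assumption)

min_gpus = math.ceil(peak_gb / gpu_memory_gb)  # lower bound only
print(f"Peak ~{peak_gb:.0f} GB -> at least {min_gpus} x {gpu_memory_gb} GB GPUs")
# Peak ~110 GB -> at least 3 x 40 GB GPUs
```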

With the given parameters and desired training duration, you can perform example calculations to estimate the total number of fine-tuning iterations, training time, and number of GPUs needed for conducting full fine-tuning of the Mixtral Instruct 7B model. Adjustments may be needed based on specific dataset characteristics, hardware constraints, and optimization strategies to achieve the best results within your resource limitations.

Again, for full fine-tuning we still need to train the model for hours, often with multiple runs, which adds to the overall cost. As we can imagine, it gets expensive as well.

Serving Large Language Models:

Designing a performant and efficient model serving system with scaling capabilities involves several key considerations and best practices. Here’s a step-by-step guide:

1. Choose an Efficient Serving Framework:

  • Select a serving framework optimized for high throughput and low latency, such as TensorFlow Serving or TorchServe.
  • Ensure the serving framework supports GPU acceleration for inference tasks.

2. Containerization:

  • Containerize your model serving application using Docker or other containerization technologies.
  • Containers provide a lightweight and portable way to package and deploy applications, facilitating scalability and reproducibility.

3. Load Balancing:

  • Implement a load balancer to distribute incoming inference requests across multiple instances of the model serving application.
  • Use dynamic load balancing algorithms to efficiently utilize available resources and minimize response times.

4. Auto Scaling:

  • Set up auto-scaling policies to automatically adjust the number of serving instances based on the incoming workload.
  • Monitor metrics such as CPU utilization, memory usage, and request latency to trigger scaling events dynamically.

5. Caching and Memoization:

  • Implement caching mechanisms to store frequently accessed inference results and reduce redundant computations.
  • Use memoization techniques to cache intermediate results within the inference pipeline and improve overall throughput.

6. Batch Processing:

  • Batch multiple inference requests together to leverage GPU parallelism and reduce overhead.
  • Optimize batch size dynamically based on the input data distribution and available resources to maximize throughput.

7. Asynchronous Processing:

  • Design the serving architecture to handle requests asynchronously, allowing the server to process multiple requests concurrently.
  • Use asynchronous frameworks or event-driven architectures to improve responsiveness and scalability.

8. Model Compression and Quantization:

  • Apply model compression techniques such as pruning, quantization, and distillation to reduce the memory footprint and improve inference speed.
  • Use optimized model formats like TensorFlow Lite for deployment on resource-constrained devices.

9. Monitoring and Alerting:

  • Implement robust monitoring and alerting systems to track key performance metrics such as request throughput, latency, and error rates.
  • Set up alerts to notify operators of performance degradation or anomalies, allowing for proactive intervention and troubleshooting.

10. Continuous Optimization:

  • Continuously monitor and optimize the serving infrastructure based on real-time performance data and user feedback.
  • Regularly review and update auto-scaling policies, load balancing configurations, and caching strategies to adapt to changing workloads and requirements.

When serving a fine-tuned Mixtral model using a GPU endpoint, the GPU requirement may vary depending on factors such as the number of concurrent users, inference batch size, and the complexity of the model. Let’s outline example calculations to estimate the GPU requirement for serving the model:

Example Calculations for Serving a Fine-Tuned Mixtral Instruct 7B Model with Load Balancing and Scaling:

Here are some common GPU types along with their respective GPU memory capacities:

  • NVIDIA T4: 16 GB
  • NVIDIA L4: 24 GB
  • NVIDIA A10: 24 GB
  • NVIDIA V100: 16 GB / 32 GB
  • NVIDIA A100: 40 GB / 80 GB
  • NVIDIA H100: 80 GB

When determining how much GPU memory is needed to serve a Large Language Model (LLM) for inference, several factors need to be considered:

  1. Model Size: The size of the LLM model, including its parameters and any additional resources it requires, such as embeddings and intermediate activations.
  2. Inference Batch Size: The batch size used during inference, which affects the amount of memory required per inference instance. Larger batch sizes can improve throughput but may require more memory.
  3. Concurrency: The number of concurrent inference requests the system needs to handle simultaneously. This impacts the total memory requirement, as each inference instance consumes GPU memory.
  4. Model Complexity: The computational complexity of the inference process, which depends on the specific architecture and operations involved in the model. More complex models may require additional memory.
  5. Additional Overhead: Any additional memory overhead required by the serving framework, such as memory for input/output buffers, intermediate computations, and overhead for handling concurrent requests.

To calculate the GPU memory needed for each instance of the serving node, you can use the following formula:

Memory per Instance = Model Size + (Inference Batch Size × Memory per Inference) + Additional Overhead

Where:

  • Model Size: Total memory required to load the LLM model into GPU memory.
  • Inference Batch Size: Number of inference instances processed simultaneously during each batch.
  • Memory per Inference: Additional memory required per inference instance, accounting for intermediate computations and overhead.
  • Additional Overhead: Any additional memory required by the serving framework and concurrent request handling.

Once you have the memory requirement per instance, you can determine the total memory needed for serving by multiplying it by the number of serving nodes and adjusting for load balancing and scaling factors.

It’s essential to monitor GPU memory usage during inference to ensure that the system operates within its memory constraints and can handle the expected workload effectively. Additionally, consider factors such as peak load, fault tolerance, and resource scaling strategies to design a robust and scalable serving infrastructure for LLM inference.

Example Calculations for Mixtral 7B:

Let’s break down the example calculations for serving a Mixtral 7B model with 100 concurrent users:

Given Parameters:

  • Model Size (Mixtral 7B): Assume 28 GB (for illustrative purposes).
  • Inference Batch Size: 1 (assuming each user sends one inference request at a time).
  • Memory per Inference: Additional memory required per inference instance, including overhead (e.g., 2 GB).
  • Additional Overhead: Any additional memory required by the serving framework and concurrent request handling (e.g., 2 GB).

Example Calculations:

Memory per Instance:

Memory per Instance = Model Size + (Inference Batch Size × Memory per Inference) + Additional Overhead

Memory per Instance = 28 GB + (1 × 2 GB) + 2 GB = 28 GB + 2 GB + 2 GB = 32 GB

Total Memory for 100 Concurrent Users:

Total Memory = Memory per Instance × Number of Concurrent Users

Total Memory = 32 GB × 100 = 3200 GB

So, to serve a Mixtral 7B model with 100 concurrent users, you would need approximately 3200 GB of GPU memory across all serving nodes, considering the model size, inference batch size, and additional overhead. If we choose the NVIDIA A100, which has 40 GB of memory, we will need 80 GPUs.
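Here is the same serving estimate as a small Python sketch. The function name is hypothetical, the per-inference and overhead figures are the illustrative assumptions above, and the one-replica-per-concurrent-user model is deliberately simplistic:

```python
import math

def serving_gpu_estimate(model_size_gb: float, batch_size: int,
                         memory_per_inference_gb: float, overhead_gb: float,
                         concurrent_users: int, gpu_memory_gb: float) -> dict:
    """Naive serving estimate: one model replica per concurrent user."""
    per_instance = model_size_gb + batch_size * memory_per_inference_gb + overhead_gb
    total = per_instance * concurrent_users
    return {"memory_per_instance_gb": per_instance,
            "total_memory_gb": total,
            "gpus_needed": math.ceil(total / gpu_memory_gb)}

print(serving_gpu_estimate(model_size_gb=28, batch_size=1,
                           memory_per_inference_gb=2, overhead_gb=2,
                           concurrent_users=100, gpu_memory_gb=40))
# {'memory_per_instance_gb': 32, 'total_memory_gb': 3200, 'gpus_needed': 80}
```

In practice, a single replica can serve many users through request batching (see the batch processing notes above), so this one-replica-per-user figure should be read as a deliberately conservative upper bound.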

Please note that these are simplified example calculations for illustrative purposes. In a real-world scenario, you may need to adjust the parameters and consider additional factors such as peak load, fault tolerance, and resource scaling strategies to design a robust and scalable serving infrastructure.

Scaling and Load Balancing:

To handle the load of concurrent users efficiently, implement load balancing and scaling strategies. This can also bring down the overall GPU cost, as the system will be scaled up and down based on load (a small sketch of this effect follows the bullets below).

  • Load balancer: Distributes incoming inference requests across multiple instances of the model serving application.
  • Scaling: Dynamically adjust the number of model serving instances based on the incoming workload to maintain optimal performance and resource utilization.
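As a toy illustration of the cost effect, here is a sketch that only provisions replicas for the current load rather than for the 100-user peak at all times; all figures are the assumptions from the example above:

```python
import math

MEMORY_PER_INSTANCE_GB = 32   # from the serving example above
GPU_MEMORY_GB = 40            # e.g. NVIDIA A100 40 GB (assumption)
USERS_PER_INSTANCE = 1        # the naive one-user-per-replica assumption

def gpus_for_load(concurrent_users: int) -> int:
    """GPUs needed if we only run enough replicas for the current load."""
    instances = math.ceil(concurrent_users / USERS_PER_INSTANCE)
    return math.ceil(instances * MEMORY_PER_INSTANCE_GB / GPU_MEMORY_GB)

for users in (10, 50, 100):
    print(f"{users} concurrent users -> {gpus_for_load(users)} GPUs")
# 10 -> 8 GPUs, 50 -> 40 GPUs, 100 -> 80 GPUs
```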

Conclusion:

This blog provides a comprehensive guide to optimizing GPU resource allocation for training and serving large language models. Whether you’re fine-tuning a model with PEFT, performing full fine-tuning, or serving a model to a large number of users, these example calculations and strategies will help you navigate the complexities of working with LLMs effectively.

Efficiently training and serving large language models require careful consideration of resource allocation and optimization strategies. By understanding example calculations for both training and serving phases, practitioners can make informed decisions to maximize efficiency and performance when working with LLMs.


Suresh Pawar

Architect | AI ML Generative AI | Software | Cloud | Enterprise