Analyzing the Dual Impact: Batch Size and Mixed Precision on DistilBERT’s Performance in Language Detection

Drishti Sushma
3 min read · Sep 12, 2023


Introduction

In deep learning, batch size and numerical precision are two pivotal parameters that can significantly influence a model’s training speed and accuracy. In this study, we examine the individual and combined effects of batch size and precision, comparing standard 32-bit (fp32) training with mixed-precision (fp16) training, using DistilBERT (distilbert-base-multilingual-cased) on a language detection task.

Batch Size: What and Why?

Batch size refers to the number of training samples processed in one iteration, i.e., one forward/backward pass before a weight update. Its choice can have profound implications (see the sketch after this list):

  • Small Batch Sizes: Can lead to a more accurate, better-generalizing model, but are computationally less efficient.
  • Large Batch Sizes: Often speed up training but may compromise accuracy and can converge to sharp minima, hurting generalization.
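
To make the trade-off concrete, here is a minimal PyTorch sketch of how the batch size sets the number of optimizer steps per epoch; the dataset size of 10,000 is purely illustrative and not from the experiment.

```python
# Minimal sketch: batch size determines how many optimizer steps one epoch takes.
import torch
from torch.utils.data import DataLoader, TensorDataset

num_samples = 10_000  # illustrative dataset size
dataset = TensorDataset(torch.zeros(num_samples, 1),
                        torch.zeros(num_samples, dtype=torch.long))

for batch_size in [4, 8, 16, 32, 64, 128]:
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    # Larger batches mean fewer (but less noisy) gradient updates per epoch.
    print(f"batch_size={batch_size:>3} -> {len(loader)} steps per epoch")
```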

Mixed Precision (fp16): A Primer

Mixed-precision training performs most operations in 16-bit floats (fp16) while keeping numerically sensitive values, such as the master copy of the weights, in traditional 32-bit floats (fp32). Key benefits include (a minimal training-step sketch follows this list):

  • Speed: Faster arithmetic on GPUs with native fp16 support (such as the T4’s Tensor Cores).
  • Memory Efficiency: Activations and gradients take roughly half the memory, allowing larger batch sizes or models.
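
The sketch below shows the core of a mixed-precision training step using PyTorch’s torch.cuda.amp utilities; the model, optimizer, loss function, and batch are placeholders rather than the article’s actual code. The loss scaling is what keeps fp16’s smaller dynamic range from making small gradients underflow.

```python
# Minimal mixed-precision training step (PyTorch AMP); model/optimizer/batch
# are placeholders, not the article's exact code.
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # rescales the loss so small fp16 gradients don't underflow

def train_step(model, optimizer, loss_fn, batch, labels):
    optimizer.zero_grad()
    with autocast():                 # forward pass runs mostly in fp16
        outputs = model(batch)
        loss = loss_fn(outputs, labels)
    scaler.scale(loss).backward()    # backward pass on the scaled loss
    scaler.step(optimizer)           # unscales gradients, updates fp32 master weights
    scaler.update()                  # adjusts the loss scale for the next step
    return loss.item()
```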

Experimental Setup

  1. All experiments were performed on an NVIDIA T4 GPU.
  2. Model: distilbert-base-multilingual-cased
  3. Dataset: language-detection-dataset
  4. Task: Language Detection
  5. BATCH_SIZE = [4,8,16,32,64,128]
  6. MAX_INPUT_LENGTH = 144
  7. LEARNING_RATE = 2e-5
  8. NUM_OF_EPOCHS = 5

We trained DistilBERT at each batch size from 4 to 128, once with fp16=True and once with fp16=False, to gauge the impact of mixed precision. A sketch of this sweep appears below.
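
The following is a rough sketch of how such a sweep can be wired up with the Hugging Face Trainer API; the tiny in-memory dataset, the label count of 4, and the output paths are stand-ins for illustration, not the experiment’s actual data-loading code.

```python
# Rough sketch of the batch-size x precision sweep with the Hugging Face Trainer.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-multilingual-cased"
BATCH_SIZES = [4, 8, 16, 32, 64, 128]
MAX_INPUT_LENGTH = 144

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=MAX_INPUT_LENGTH)

# Placeholder data: a handful of sentences in four languages (labels 0-3).
raw = Dataset.from_dict({
    "text": ["hello world", "bonjour le monde", "hola mundo", "hallo welt"] * 8,
    "label": [0, 1, 2, 3] * 8,
}).train_test_split(test_size=0.25, seed=0)
encoded = raw.map(tokenize, batched=True)
train_ds, eval_ds = encoded["train"], encoded["test"]

for use_fp16 in (False, True):          # fp16=True requires a CUDA GPU (a T4 here)
    for bs in BATCH_SIZES:
        model = AutoModelForSequenceClassification.from_pretrained(
            MODEL_NAME, num_labels=4)   # set num_labels to your language count
        args = TrainingArguments(
            output_dir=f"out/bs{bs}_fp16-{use_fp16}",
            per_device_train_batch_size=bs,
            per_device_eval_batch_size=bs,
            learning_rate=2e-5,
            num_train_epochs=5,
            evaluation_strategy="epoch",
            fp16=use_fp16,              # toggles mixed-precision training
        )
        trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                          train_dataset=train_ds, eval_dataset=eval_ds)
        trainer.train()
        print(f"bs={bs} fp16={use_fp16}:", trainer.evaluate())
```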

The Dual Impact on DistilBERT

A) Training Metrics

Training Metrics corresponding to different batch sizes and fp16=False
Training Metrics corresponding to different batch sizes and fp16=True

i) Training Loss (Epoch 5):

  • Under 32-bit precision, training loss followed an increase-decrease-increase pattern as batch size grew.
  • Mixed precision showed a similar trend, though often with marginally lower values.

ii) Validation Loss (Epoch 5):

  • With pure 32-bit precision, validation loss trended upward as batch size increased.
  • Mixed precision, however, showcased more consistency, especially in the mid-range batch sizes (32–64).

iii) Accuracy (Epoch 5):

  • For both precisions, accuracy was commendably high, with mixed precision showing a slight edge in specific batch sizes.

iv) Training Time:

  • Mixed precision consistently outperformed 32-bit precision in terms of speed, a benefit more pronounced with larger batch sizes.

B) Evaluation Metrics

Evaluation Metrics corresponding to different batch sizes and fp16=False
Evaluation Metrics corresponding to different batch sizes and fp16=True

i) Evaluation Loss:

  • 32-bit precision showed an escalating loss with batch size, while mixed precision remained more stable, especially up to a batch size of 32.

ii) Performance Metrics:

  • Both precisions achieved near-perfect scores for smaller batch sizes, with only minor dips as batch size increased.

iii) Evaluation Speed:

  • Mixed precision’s benefits shone again, consistently surpassing 32-bit precision in speed, particularly with larger batch sizes.

Key Takeaways

i) Batch Size Sensitivity: DistilBERT’s performance, in terms of loss and accuracy, is sensitive to batch size, showcasing optimal performance at mid-range batch sizes.

ii) Precision’s Interplay with Batch Size: Mixed precision amplifies the efficiency benefits as batch size increases, offering a blend of speed and performance. This synergy can be particularly beneficial in large-scale deployments where both time and accuracy are crucial.

Conclusion

The analysis reveals a nuanced interplay between batch size and training efficiency. Increasing the batch size tends to shorten training time per epoch, since each epoch involves fewer optimizer steps and larger, better-parallelized batches. However, performance metrics such as training and validation loss deteriorate subtly at very large batch sizes, suggesting that the fewer, larger gradient updates may settle into sharper minima that generalize less well. Furthermore, fp16 improves computational speed across batch sizes while maintaining comparable accuracy metrics, showcasing its effectiveness for model optimization.
