The Latest YOLOv8 & YOLOv9 Guide for Hyperparameter Tuning and Data Augmentation (2024)

ProspexAI
May 24, 2024

Training your own YOLO model can be very beneficial for real-world performance. The process can be divided into three simple steps: (1) Model Selection, (2) Training, and (3) Testing.

This article covers the best practices for optimizing YOLO model performance across all three steps.


Model Selection

The first step to building a successful computer vision model is defining the problem and selecting the appropriate model.

Task

YOLOv8 is available for five different tasks:

  • Classify: Assign a class label to an entire image.
  • Detect: Identify objects and their bounding boxes in an image.
  • Segment: Predict pixel-level masks for the objects in an image.
  • Track: Follow objects and their bounding boxes across video frames.
  • Pose: Identify pose keypoints, such as human joints, in an image.

Make sure to choose the appropriate task for your problem.

Model

The second step is choosing an appropriate model. Most common YOLO models are available in five sizes, from nano (n) to extra-large (x) [1]:

(Image by Author: comparison of the YOLOv8 detection models.)

The example above shows the sizes, speeds, and accuracy of the YOLOv8 object detection models. Depending on the hardware and task, choose an appropriate model and size. The Ultralytics framework can also be used to build your own custom model architectures.
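As a quick sketch of how this looks in code, the Ultralytics API loads any of these variants by weight name, a size letter plus an optional task suffix (the weight files below are downloaded automatically):

from ultralytics import YOLO

# Detection weights range from yolov8n.pt (fastest) to yolov8x.pt (most accurate).
model = YOLO("yolov8s.pt")

# Other tasks use a suffix on the same size scheme:
cls_model = YOLO("yolov8s-cls.pt")    # classification
seg_model = YOLO("yolov8s-seg.pt")    # segmentation
pose_model = YOLO("yolov8s-pose.pt")  # pose estimation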

Training

Training is essential for optimizing the model for your specific task. Thankfully, Ultralytics makes the training process simple, and the default arguments often produce satisfactory results. The following best practices will help you get optimal results.

Dataset

The more data, the better. As a baseline, these are the guidelines [1]:

  • Images per class: Over 1500 images per class.
  • Instances per class: Over 10,000 instances per class.
  • Image variety: Train on data similar to real-world data. Creating your own custom dataset can be very beneficial for increasing accuracy. Consider different conditions, such as weather, seasons, lighting, angles, cameras, and RGB versus grayscale.
  • Label consistency: Labels must closely match the actual ground truth. For object detection, for example, bounding boxes must closely enclose each object. No object should be missing a label.
  • Background images: Background images are images without objects. They are added to the dataset to reduce false positives, such as a pole falsely classified as a person. It’s recommended that the dataset contain 0–10% background images, and they require no annotations.
  • Dataset split: Training and validation sets must be provided for training. However, an additional test dataset is beneficial to avoid overfitting the validation data. The usual split ratio for Train-Validation-Test is 80–10–10.
  • Data leakage: The train, validation, and test sets must not contain the same images, or images that closely resemble those in another set. To avoid leakage, record images at different scenes under different conditions and split the dataset by scene, as in the sketch after this list.
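Here is a minimal sketch of a scene-level split, assuming a hypothetical layout with one group of images per recorded scene:

import random

# Hypothetical layout: one folder of images per recorded scene.
scenes = [f"scene_{i:02d}" for i in range(20)]
random.seed(42)
random.shuffle(scenes)

# 80-10-10 split by scene, so near-duplicate frames never cross sets.
n_train, n_val = int(0.8 * len(scenes)), int(0.1 * len(scenes))
train_scenes = scenes[:n_train]
val_scenes = scenes[n_train : n_train + n_val]
test_scenes = scenes[n_train + n_val :]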

For a guide on dataset format and creation, see “Fine-tuning YOLOv9”, a step-by-step guide for training and fine-tuning YOLOv9 on custom datasets in Google Colab.

Settings

  • Epochs: The number of epochs is highly dependent on the dataset used for training. It’s recommended to start with 300 epochs. If overfitting occurs, you can reduce the number of epochs or use early stopping.
  • Image size: Training images are assumed to be square and default to imgsz=640. Ultralytics automatically scales images down, keeping the original aspect ratio, and pads them to the desired size using letterboxing. Make sure to train at the image size you intend to use for real-world inference. Tip: if you have high-resolution images, you can use a sliding-window approach, splitting each high-resolution image into smaller chunks for training.
  • Batch size: It’s recommended to use the largest batch_size that the hardware allows for.
  • Early stopping: To avoid overfitting, use early stopping via the patience parameter, which stops training after n epochs without improvement on the validation metrics.
  • Hyperparameters: In addition to the dataset, hyperparameter tuning is one of the most important aspects of optimizing your model. We will cover this in detail further down in the article.
  • Data augmentation: Ultralytics applies several types of data augmentation to improve performance. Some techniques are more beneficial for certain problems, so make sure the augmentations reflect the conditions you expect in the real world. This will be covered in detail later in the article.
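Putting these settings together, a typical training call might look like the following sketch (data.yaml is a placeholder for your own dataset config):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# patience=50 stops training after 50 epochs without validation improvement.
model.train(
    data="data.yaml",  # placeholder dataset config
    epochs=300,
    imgsz=640,
    batch=16,          # use the largest batch size your GPU memory allows
    patience=50,
)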

Tuning

Do you want the best performance without manually testing different hyperparameters and data augmentation techniques? The Ultralytics tuner can help. The tuner takes a model and finds the optimal hyperparameters through weighted random mutations [2].

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.tune(data="coco8.yaml", epochs=30, iterations=300, optimizer="AdamW", imgsz=640)

  • Epochs: Specifies the number of epochs to train each model.
  • Iterations: Specifies how many models you want to train to find the optimal hyperparameters and augmentations.
  • Optimizer: Specifies the optimizer; optimizer="AdamW" is a recommended starting point.

Remember to include your regular training parameters, such as imgsz, when performing hyperparameter tuning.

Keep in mind that the process can be very time-consuming. Assuming you want to train 100 models that take 4 hours each, the total time would be approximately 400 hours. So, when using the tuner, attempt to reduce the number of epochs as much as possible.

Hyperparameters

There are lots of hyperparameters to consider when training a neural network. The default parameters work well for most tasks, but if you want to optimize the model, make sure to edit the following:

  • optimizer="auto": Selects the optimizer (e.g., SGD, Adam, AdamW), which affects convergence speed and final performance.
  • cos_lr=False: Enables a cosine learning-rate schedule, which makes convergence smoother by decaying the learning rate along a cosine curve.
  • lr0=0.01: Initial learning rate, which specifies how fast the model weights are updated. Increasing the value can cause oscillations, while decreasing it can slow convergence.
  • lrf=0.01: Final learning rate, calculated as (lr0 * lrf).
  • momentum=0.937: Momentum helps the optimizer escape local minima; tweaking the value can improve convergence.
  • weight_decay=0.0005: Weight decay is a form of regularization used to combat overfitting.
  • dropout=0.0: Dropout is yet another regularization technique. However, it doesn’t play nice with weight decay, so choose your regularization technique carefully.
  • warmup_epochs=3.0: A number of warmup epochs are performed at the start of training, during which the learning rate is gradually ramped up to its initial value, stabilizing early training.
  • warmup_momentum=0.8: Initial momentum for the warmup phase. The momentum is slowly adjusted to converge to the set value during the warmup.
  • warmup_bias_lr=0.1: Initial learning rate for the bias parameter in the warmup phase.
  • label_smoothing=0.0: Softens hard target labels toward a uniform distribution, which can improve generalization.
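These hyperparameters are passed directly to train(). As a hedged starting point, assuming the same placeholder dataset config as before:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="data.yaml",     # placeholder dataset config
    epochs=300,
    optimizer="AdamW",
    cos_lr=True,          # cosine learning-rate schedule
    lr0=0.01,
    lrf=0.01,             # final learning rate = lr0 * lrf
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3.0,
)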

Data Augmentation

When training a YOLO model, Ultralytics automatically augments the images to improve the training process.

The following data augmentation techniques are available [3]:

  • hsv_h=0.015: Randomly shifts the hue (the H in HSV). The HSV settings help the model generalize across conditions such as lighting and environment.
  • hsv_s=0.7: Adjusts the saturation of the image.
  • hsv_v=0.4: Adjusts the brightness of the image.
  • degrees=0.0: Rotates the image randomly within a specified range.
  • translate=0.1: Moves the image horizontally or vertically, which helps the model detect partially visible objects.
  • scale=0.5: Scales the image, which is good for detecting objects at different distances.
  • shear=0.0: Shears the image, slanting it so that parallel edges are no longer perpendicular. Improves the model for objects viewed at an angle.
  • perspective=0.0: Applies a random perspective to the image.
  • flipud=0.0: Flips the image upside down.
  • fliplr=0.0: Flips the image from left to right.
  • bgr=0.0: Randomly swaps the image channels from RGB to BGR, guarding against incorrect channel ordering. Don’t use it if your dataset and real-world data are known to be ordered correctly.
  • mosaic=1.0: Combines four images into a 2x2 image. Great for improving scene understanding and performance.
  • mixup=0.0: Creates a composite image by combining two images and their labels.
  • copy_paste=0.0: Copies objects from one image and pastes them into another (segmentation tasks).
  • erasing=0.4: Randomly erases a portion of the image (classification training), making the model better at predicting hidden or partially visible objects.
  • crop_fraction=1.0: Crops the classification image to a fraction of its size, emphasizing central features.

If a default value is 0, the augmentation technique is disabled; increase the value to enable it. For many settings, the value is the probability that the augmentation is applied: fliplr=0.5 means the image is flipped left to right 50% of the time. Also keep the task in mind; for classification, mosaic augmentation is generally less beneficial than for object detection.

Adjust the data augmentation techniques depending on the use case. For example, if you’re training on grayscale images, you can omit hsv_h, hsv_s, hsv_v, and bgr, as in the sketch below.
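A grayscale-friendly training call might look like this (a sketch that disables the color-space augmentations while keeping geometric ones; data.yaml is again a placeholder):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="data.yaml",  # placeholder dataset config
    epochs=300,
    hsv_h=0.0, hsv_s=0.0, hsv_v=0.0, bgr=0.0,  # color jitter off for grayscale
    fliplr=0.5,        # flip left to right 50% of the time
    degrees=10.0,      # random rotation up to +/-10 degrees
    translate=0.1,
    scale=0.5,
)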

For easy experimentation, use the Ultralytics hyperparameter tuner.

Loss Settings

If you’re working with object detection, the following losses are applicable:

  • box=7.5: Weight of the bounding-box loss in the total loss function. Specifies the importance of accurately predicted box coordinates.
  • cls=0.5: Classification loss in the total loss function. Specifies the importance of predicting the correct class relative to the other losses.
  • dfl=1.5: Weight of the distribution focal loss (DFL), which refines bounding-box boundary predictions.

For pose estimation, the corresponding weights are pose=12.0 (keypoint loss) and kobj=2.0 (keypoint objectness loss).
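These weights are regular train() arguments. For instance, to emphasize localization over classification (illustrative values, not a tuned recommendation):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="data.yaml",  # placeholder dataset config
    box=10.0,          # weight box regression more heavily
    cls=0.3,           # weight classification less
    dfl=1.5,
)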

Miscellaneous Settings

There are a few more settings that could be beneficial depending on the use case:

  • pretrained=True: It’s recommended to start training from a pretrained model. If pretrained=False, the weights are initialized randomly.
  • single_cls=False: If you’re working with a multi-class dataset but want to treat all classes equally, use this parameter. The model will treat all classes as the same class.
  • rect=False: Enables rectangular training, which batches images with minimal padding rather than square letterboxing.
  • close_mosaic=10: Disables mosaic augmentation for the last N epochs. The parameter can improve model accuracy towards the end of training.
  • fraction=1.0: Specifies how much of the dataset should be used for training. Reducing the value will speed up training, usually at the cost of accuracy. It could be used when tuning hyperparameters.
  • freeze=None: Freezes the first N layers of the model, reducing the number of trainable parameters and the training time. Great for fine-tuning a model; see the sketch below.
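A fine-tuning sketch combining some of these settings (data.yaml is a placeholder; the layer count for freeze is an assumption that roughly covers the YOLOv8 backbone):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="data.yaml",  # placeholder dataset config
    pretrained=True,
    freeze=10,         # freeze the first 10 layers (roughly the backbone)
    fraction=0.5,      # train on half the dataset for quicker experiments
    close_mosaic=10,   # disable mosaic for the final 10 epochs
)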

Testing

Including a separate test set is good practice to ensure the model doesn’t overfit the validation data. Ultralytics YOLO doesn’t provide a separate mode for evaluating test data. However, there’s an easy workaround:
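Run validation mode on the test split instead; the val() call accepts a split argument. A sketch, assuming your data.yaml defines a test: path and that best.pt is the checkpoint from your training run:

from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # your trained weights

# Evaluate on the held-out test split instead of the default validation split.
metrics = model.val(data="data.yaml", split="test")
print(metrics.box.map)  # mAP50-95 on the test set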

References

[1] Ultralytics Documentation: YOLOv8 models and “Tips for Best Training Results” (docs.ultralytics.com).
[2] Ultralytics Documentation: “Hyperparameter Tuning” guide (docs.ultralytics.com).
[3] Ultralytics Documentation: train mode augmentation settings (docs.ultralytics.com).
