Detailed Explanation of YOLOv8 Architecture — Part 1

Juan Pedro
6 min read · Dec 3, 2023


YOLO (You Only Look Once) is one of the most popular model families for real-time object detection and image segmentation, currently (as of late 2023) considered state of the art (SOTA). YOLO is a convolutional neural network that predicts bounding boxes and class probabilities for an image in a single evaluation.

Despite the undeniable efficiency of this tool, it is important to bear in mind that it was developed for a generalist context, aiming to serve as many applications as possible. For more specific cases requiring higher quality, higher speed, or handling of non-standard images, it is advisable to understand the architecture and, when possible, customize it to suit the task’s needs. To help computer vision developers explore this further, this article is part 1 of a series that delves into the architecture of the YOLOv8 algorithm.

Main Blocks

The first step to understanding the YOLO architecture is recognizing that the algorithm is organized into three essential blocks, where everything happens: the Backbone, the Neck, and the Head. The function of each block is described below.

Backbone:

Function: The backbone, also known as the feature extractor, is responsible for extracting meaningful features from the input.
Activities:
- Captures simple patterns in the initial layers, such as edges and textures.
- Builds representations at multiple scales as the network deepens, capturing features at different levels of abstraction.
- Provides a rich, hierarchical representation of the input.

Neck:

Function: The neck acts as a bridge between the backbone and the head, performing feature fusion operations and integrating contextual information. In essence, the neck assembles feature pyramids by aggregating the feature maps produced at different stages of the backbone.
Activities:
- Performs concatenation or fusion of features at different scales so that the network can detect objects of different sizes.
- Integrates contextual information to improve detection accuracy by considering the broader context of the scene.
- Reduces the spatial resolution and dimensionality of the feature maps to ease computation, which increases speed but can also reduce model quality.

Head:

Function: The head is the final part of the network and is responsible for generating the network’s outputs, such as bounding boxes and confidence scores for object detection.
Activities:
- Generates bounding boxes associated with possible objects in the image.
- Assigns a confidence score to each bounding box, indicating how likely it is that an object is present.
- Classifies the objects in the bounding boxes according to their categories.
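
To make this division of responsibilities concrete, below is a minimal, purely illustrative PyTorch sketch of how a backbone, neck, and head compose. The class name and layer choices are hypothetical and far simpler than the real YOLOv8 blocks.

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    # Illustrative skeleton only; the real YOLOv8 blocks are far richer.
    def __init__(self, num_classes=80):
        super().__init__()
        # Backbone: extracts increasingly abstract feature maps.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Neck: fuses/refines backbone features (a single conv here for brevity).
        self.neck = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.SiLU())
        # Head: predicts 4 box coordinates + class scores per spatial location.
        self.head = nn.Conv2d(32, 4 + num_classes, 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

preds = TinyDetector()(torch.randn(1, 3, 64, 64))
print(preds.shape)  # torch.Size([1, 84, 16, 16])
```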

Main Structures of the Main Blocks

Figure 1 — image adapted from: https://blog.roboflow.com/whats-new-in-yolov8/ accessed on 12/03/2023.

Conv:

The YOLO architecture adopts a local feature analysis approach instead of examining the image as a whole; the main objective of this strategy is to reduce computational effort and enable real-time detection. To extract feature maps, convolutions are used many times throughout the algorithm.

Convolution is a mathematical operation that combines two functions to create a third. In computer vision and signal processing, convolution is often used to apply filters to images or signals, highlighting specific patterns. In convolutional neural networks (CNNs), convolution is used to extract features from inputs such as images. Convolutions are parameterized by kernels (k), strides (s), and padding (p).

For a visual explanation of convolution, watch the video linked in the references.

Figure 2, originally from the Nvidia developer website, presents a real case of applying convolution to extract a feature.

Figure 2 — From: https://developer.nvidia.com/discover/convolution accessed on 12/03/2023.

Kernel:

The kernel, also known as the filter, is a small matrix of numbers that is slid across the input (image or signal) during the convolution operation. The goal is to apply local operations to the input to detect specific characteristics. Each element of the kernel represents a weight that is multiplied by the corresponding value in the input during convolution. Figure 3 below shows some examples.

Figure 3 — From: https://intellipaat.com/community/11105/why-are-inputs-for-convolutional-neural-networks-always-squared-images accessed on 12/03/2023.
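
As a small illustration (the kernel and toy image below are hypothetical examples, not YOLOv8 weights), the snippet slides a classic edge-detection kernel over a tiny image with PyTorch. Note that deep learning frameworks actually compute cross-correlation, i.e., convolution without flipping the kernel.

```python
import torch
import torch.nn.functional as F

# A classic 3x3 edge-detection kernel, reshaped to (out_ch, in_ch, H, W).
kernel = torch.tensor([[-1., -1., -1.],
                       [-1.,  8., -1.],
                       [-1., -1., -1.]]).view(1, 1, 3, 3)

# A toy 5x5 "image": a bright square on a dark background.
img = torch.zeros(1, 1, 5, 5)
img[..., 1:4, 1:4] = 1.0

response = F.conv2d(img, kernel, padding=1)
print(response.squeeze())  # strong responses along the square's edges
```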

Stride:

Stride is the amount of displacement the kernel undergoes as it moves across the input during convolution. A stride of 1 means the kernel moves one position at a time, while a stride of 2 means the kernel moves two positions with each step, skipping one. Stride directly influences the spatial dimensions of the convolution output: larger strides decrease the size of the output, while smaller strides retain more spatial information. Larger strides also reduce computational effort and thereby increase the speed of the operation, which can directly impact quality. In Figure 4 below, a kernel in red traverses the image’s pixel map with a stride of 1.

Figure 4 — Example of stride.
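
The effect of stride on the output size follows the standard relation output = ⌊(n + 2p − k) / s⌋ + 1, where n is the input size, k the kernel size, p the padding, and s the stride. A quick sketch to confirm this (random input, arbitrary toy sizes):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)  # an 8x8 single-channel input
for s in (1, 2):
    y = nn.Conv2d(1, 1, kernel_size=3, stride=s, padding=0)(x)
    # output size = floor((8 + 2*0 - 3) / s) + 1
    print(f"stride={s}: {tuple(y.shape[-2:])}")  # stride=1: (6, 6), stride=2: (3, 3)
```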

Padding:

“Padding” refers to adding extra pixels (typically zeros) around the edges of the input image before applying convolution operations. This is done so that information at the edges of the image is treated in the same way as information at the center during convolution.

When a filter (kernel) is applied to an image, it typically slides across the image pixel by pixel. If no padding is applied, pixels at the edges of the image are covered by fewer kernel positions than pixels at the center, which can lead to a loss of information in these regions. Figure 5 below shows an example of padding.

Figure 5 — Example of zero padding.
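
A short sketch (toy sizes, illustrative only) showing how one ring of zero padding keeps the output the same size as the input with a 3x3 kernel:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)
no_pad = nn.Conv2d(1, 1, kernel_size=3, padding=0)(x)    # edge pixels contribute to fewer windows
zero_pad = nn.Conv2d(1, 1, kernel_size=3, padding=1)(x)  # one ring of zeros on each side
print(tuple(no_pad.shape[-2:]), tuple(zero_pad.shape[-2:]))  # (6, 6) (8, 8)
```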

In terms of probability, the convolution operation can be understood as a weighted sum of random events. Consider two independent random variables X and Y with probability distributions p_X(x) and p_Y(y). The distribution of their sum is the convolution of p_X and p_Y, given by Equation 1:

(p_X ∗ p_Y)(z) = ∫ p_X(x) · p_Y(z − x) dx (Equation 1)
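
A classic worked example of Equation 1 in the discrete case: the distribution of the sum of two independent fair dice is the convolution of their individual distributions.

```python
import numpy as np

die = np.full(6, 1 / 6)    # PMF of one fair die over faces 1..6

# Sum of two independent dice: values 2..12, PMF = convolution of the PMFs.
two_dice = np.convolve(die, die)
print(two_dice.round(4))   # peaks at 6/36 ~= 0.1667 for a sum of 7
print(two_dice.sum())      # ~1.0: still a valid distribution
```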

Specifically, in the YOLOv8 Conv block:

2D Convolution:

During the 2D convolution operation, a filter is applied to the input to extract local features. Each position in the resulting feature map is a weighted linear combination of the values in the input's local region.

BatchNorm 2D:

Normalization of the activations resulting from the convolution. This involves computing per-channel means and variances across the batch to stabilize the distribution of activations.
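
A small sketch verifying this per-channel normalization against PyTorch's BatchNorm2d (affine scaling disabled so that only the normalization itself is compared):

```python
import torch

x = torch.randn(8, 16, 32, 32)                  # (batch, channels, height, width)
mu = x.mean(dim=(0, 2, 3), keepdim=True)        # per-channel mean over the batch
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)       # normalized activations

bn = torch.nn.BatchNorm2d(16, affine=False)     # training mode: uses batch statistics
print(torch.allclose(x_hat, bn(x), atol=1e-5))  # True
```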

Application of the SiLU Function:

After convolution and (optionally) Batch Normalization, the SiLU activation function is applied to the output. SiLU is defined as SiLU(x) = x · σ(x), where σ is the logistic (sigmoid) function.
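
A one-line check that this definition matches PyTorch's built-in implementation:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(torch.allclose(x * torch.sigmoid(x), F.silu(x)))  # True: SiLU(x) = x * sigmoid(x)
```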

Propagation to Subsequent Layers:
The output of the SiLU function (or of Batch Normalization, if it is applied after the activation) is then propagated to the subsequent layers of the neural network. The nonlinearity introduced by SiLU is crucial for learning nonlinear representations of the data.
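
Putting the steps together, here is a minimal sketch of a Conv block in this style (Conv2d → BatchNorm2d → SiLU). It illustrates the pattern described above, not the exact Ultralytics implementation, which adds details such as automatic padding.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv2d -> BatchNorm2d -> SiLU: the pattern described above (a sketch)."""
    def __init__(self, c_in, c_out, k=3, s=1, p=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)  # BN makes the bias redundant
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        # Convolution -> normalization -> nonlinearity -> on to the next layer.
        return self.act(self.bn(self.conv(x)))

y = ConvBlock(3, 16)(torch.randn(1, 3, 32, 32))
print(y.shape)  # torch.Size([1, 16, 32, 32])
```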

In the next articles, the remaining blocks will be explained…

Bibliographical References

https://docs.ultralytics.com/pt/ accessed on 12/03/2023.

https://docs.ultralytics.com/yolov5/tutorials/architecture_description/ accessed on 12/03/2023.

https://www.youtube.com/watch?app=desktop&v=HQXhDO7COj8 accessed on 12/03/2023.

https://developer.nvidia.com/discover/convolution accessed on 12/03/2023.

https://blog.roboflow.com/whats-new-in-yolov8/ accessed on 12/03/2023.

Kiran et al., "State-of-the-Art Object Detection: An Overview of YOLO Variants and their Performance," IEEE, 2023.


Juan Pedro

Engineer working in the field of new technologies, technological development, artificial intelligence, and related research.