What is YOLO? A step-by-step walkthrough.
I am currently working on a project to build an AI that takes aerial footage of trees and counts them; the goal is eventually to train this software to identify different species’ health and age.
This article aims to explain, in normal human terms, what a Convolutional Neural Network (CNN) is and how it works, more specifically YOLO (you only live once… no, You Only Look Once), the specific CNN architecture I will be using for my project.
I am going to keep this fairly short but hopefully very simple. This is the first coding project I have ever done, so this article covers the theory behind it. A follow-up article will be coming soon that breaks the code down into what it is doing, so you can understand how to tackle a problem like this.
What is a CNN?
A CNN can detect and classify objects in images. A basic CNN model can distinguish objects such as animals in an image: show it a photo of a dog and it will output the same image with the word ‘dog’ written underneath and a confidence score between 0 and 1.
This is cool but won’t work for my case: if I used the same code and input aerial photos of trees, it would just output ‘trees’, 0.95. I need something that can detect individual trees and count them.
Luckily there is a way to do this! The key difference is BOUNDING BOXES: the CNN looks at an image and detects where in the frame each object lies. Input a picture of a dog in a field and it will output the same photo with a box around the dog and the classification ‘dog’, 0.97.
Now for the more technical bit — How does it work?
I am going to walk through how a CNN, and more specifically a YOLO algorithm, works in the same order that it processes information.
1. Input layer
Pretty simple: just the image or video input.
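As a tiny sketch of what that means in code (assuming the Pillow and NumPy packages, and a hypothetical file name):

```python
# Load an image and turn it into the numeric array a CNN actually sees.
# "aerial_trees.jpg" is a placeholder file name.
import numpy as np
from PIL import Image

img = Image.open("aerial_trees.jpg").resize((416, 416))  # YOLO commonly uses square inputs
x = np.asarray(img, dtype=np.float32) / 255.0            # scale pixels to the 0-1 range
print(x.shape)  # (416, 416, 3) -> height, width, RGB channels
```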
2. Convolution layers
What does it do? It scans an image and, by applying mathematical operations, produces a new matrix (a feature map) with enhanced features that help with object detection.
How does the convolution layer do this? By sliding filters over the image and computing a convolution (element-wise multiplication and summation) to output a feature map. The CNN learns a set of filters during training, so it knows what patterns to look for as it improves. The feature map will have detected edges, corners and textures within the image, and this enhancement helps the next stage.
Imagine you could only see through a 3x3 square but you wanted to see a 64x64 image: by moving the square along the picture you would get a sense of what the picture is. The computer is doing the same thing, just assigning numerical values to the output through convolution (element-wise multiplication and summation) of the filter and the image.
Why does it need to do this? The filter (the small square) holds a pattern of values, so passing it over the image and performing the maths detects where the edges or corners of the object are within the image. This helps work out both what it is and where it is!
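To make the sliding-square idea concrete, here is a minimal sketch of a 2D convolution in plain NumPy. The Sobel filter is a classic hand-made edge detector; a real CNN learns its filter values instead:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply
    element-wise with the window underneath and sum the result."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]        # the 3x3 "square" we see through
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# A hand-made vertical-edge filter (Sobel); real CNNs learn these values.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
image = np.random.rand(64, 64)            # stand-in for a 64x64 grayscale image
print(convolve2d(image, sobel_x).shape)   # (62, 62) feature map
```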
Backpropagation → This is what makes an AI ‘learn’
It works by computing the loss between the network’s predictions and the ground-truth labels (the pre-labelled input images), then computing the gradients of that loss with respect to the weights and adjusting the weights to reduce it. The more the program is trained, the more accurate the weights become and the better the predictions get.
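Here is what one backpropagation step looks like in practice, as a minimal sketch using PyTorch; the tiny linear model and random data are hypothetical stand-ins for a real detector:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                   # toy "network": 10 features in, 2 classes out
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(4, 10)                # a batch of 4 fake examples
labels = torch.tensor([0, 1, 0, 1])        # their ground-truth labels

predictions = model(inputs)
loss = loss_fn(predictions, labels)        # how wrong are we?
loss.backward()                            # gradients of the loss w.r.t. every weight
optimizer.step()                           # nudge the weights to reduce the loss
optimizer.zero_grad()                      # reset gradients for the next batch
```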
3. Batch normalisation
The purpose of this is to help the CNN learn more quickly without relying entirely on backpropagation to change the weights. It does this by standardising the mean and standard deviation across the batch of images, so the network can learn quickly even if images of different sizes or quality are used.
How does it do this? By subtracting the batch mean of the previous layer’s output and dividing by the batch standard deviation, then scaling and shifting the result, which keeps the batch of images balanced/regularised.
If you imagine you own a bakery and want to make a cake, you will weigh out all your ingredients: sugar, flour, eggs and milk. Then you start making the cake. However, you may run into an issue the next time you want to make the same cake, because you have only weighed the ingredients; if one ingredient is heavier or lighter than the original (e.g. the egg), you will have a problem. This wouldn’t happen with batch normalisation, as each ingredient would be scaled and shifted to have a consistent mean and standard deviation. Obviously this would be impossible when actually baking, but the point is that by standardising with both the mean and standard deviation, the algorithm doesn’t rely so heavily on the weights.
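In code, batch normalisation boils down to a few lines. This sketch uses plain NumPy and scalar gamma/beta values; in a real network, gamma (scale) and beta (shift) are learned per channel:

```python
import numpy as np

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardise each column across the batch, then scale and shift."""
    mean = batch.mean(axis=0)                  # batch mean, per feature
    std = batch.std(axis=0)                    # batch standard deviation, per feature
    normalised = (batch - mean) / (std + eps)  # subtract mean, divide by std
    return gamma * normalised + beta           # learned scale and shift

batch = np.array([[200.0, 3.0],   # "ingredients" measured on very different scales
                  [180.0, 2.5],
                  [220.0, 3.5]])
print(batch_norm(batch))          # each column now has mean ~0 and std ~1
```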
4. ReLU activation layer
ReLU (Rectified Linear Unit) is an activation function that introduces non-linearity into the neural network, allowing it to learn more complex patterns and relationships in the data. By discarding negative inputs and keeping positive inputs unchanged, ReLU helps to reduce noise and focus the network’s attention on the most important features in the image, leading to faster and more accurate learning. This improves the performance of the network on various tasks, including object detection in the YOLO architecture.
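ReLU itself is a one-liner:

```python
import numpy as np

def relu(x):
    """Discard negative inputs, keep positive inputs unchanged."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```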
5. Downsampling and max-pooling layer
Max pooling is a technique used in Convolutional Neural Networks (CNNs) to downsample the feature maps produced by the previous convolutional layers. By reducing the spatial dimensions of the feature maps, max pooling reduces the computation required in the later stages of the CNN.
Additionally, max pooling is important because it helps the network build spatially invariant features, meaning that the CNN can recognize an object regardless of its position in the input image. YOLO uses downsampling to create a feature pyramid that enables object detection at different scales.
When feature maps are downscaled through max pooling, the algorithm retains only the important information and discards the redundant information. The process works by finding the highest number in each filter window and outputting it as the corresponding feature map value.
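As a minimal NumPy sketch of 2x2 max pooling (assuming the feature map’s sides divide evenly by the window size):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep only the highest value in each non-overlapping size x size window."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))    # the max of each window

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 5, 8]])
print(max_pool(fm))   # [[6 4]
                      #  [7 9]] -> half the size, highlights kept
```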
6. Fully connected layer + Softmax layer
This layer consists of the weights, biases and neurons that connect to everything processed in the previous convolutional layers. Its main job is to flatten the matrices of information into a single array so that the output can be determined.
Fully connected layers in YOLO are among the last layers in the neural network; they take the high-level features extracted from the image and transform them into predictions for object classes and bounding boxes. During training, the weights and biases of these layers are adjusted to minimize the difference between the predicted and actual labels of the training data, making the algorithm more accurate, which is why the more training data you have, the better your final AI will work.
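A stripped-down sketch of what “flatten, then fully connect” means, with made-up sizes:

```python
import numpy as np

feature_maps = np.random.rand(13, 13, 32)  # stand-in for the last conv layer's output
flat = feature_maps.flatten()              # 13 * 13 * 32 = 5408 values in one array
weights = np.random.rand(5408, 2)          # fully connected: every input to every output
biases = np.zeros(2)
logits = flat @ weights + biases           # one raw score per class
print(logits.shape)                        # (2,)
```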
Softmax layer: this is similar to the ReLU layer earlier in that it is a mathematical function, but it sits at the end, just before the output, and provides the probability of each answer. A CNN will never be 100% sure it has correctly identified the object in an image, so the softmax function produces a list of probabilities, one per output node, and the highest probability becomes the final output. For example, code that tells the difference between a cat and a dog may output Dog: 0.95, meaning it is 95% sure the input image is a dog.
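Softmax is also only a couple of lines; this sketch reproduces the dog/cat example with made-up raw scores:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

scores = np.array([4.2, 1.3])            # made-up raw outputs for "dog" and "cat"
print(softmax(scores))                   # ~[0.95 0.05] -> 'dog', 0.95
```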
7. Three in One: Detection layer, Anchor boxes and IoU
Detection Layer: The detection layer in YOLO is responsible for predicting object bounding boxes and class probabilities. This layer takes the feature maps from the last convolutional layer and generates a fixed number of bounding boxes for each grid cell. Each bounding box is represented by its centre coordinates, width, height, and confidence score. The confidence score represents how likely it is that the bounding box contains an object, and the class probabilities represent the probability that the object belongs to each class.
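For a sense of scale, the original YOLOv1 paper used a 7x7 grid, 2 boxes per cell and 20 classes, which gives the familiar output shape below:

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes (YOLOv1 figures)
output = np.zeros((S, S, B * 5 + C))    # 5 = (centre x, centre y, width, height, confidence)
print(output.shape)                     # (7, 7, 30)
```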
Anchor Boxes: anchor boxes are pre-defined boxes of different aspect ratios and sizes that are used to better capture objects of different shapes and sizes. They are used in YOLO to generate the initial set of bounding boxes for each grid cell in the detection layer. The network then adjusts these boxes based on the features in the input image to better fit the objects in the scene.
IoU Layer: The IoU (Intersection over Union) layer is used to calculate the overlap between the predicted bounding boxes and the ground truth bounding boxes for each object in the training data. This layer computes the IoU score for each predicted box and the ground truth box and then assigns the predicted box to the object with the highest IoU score. This process helps the network learn to better predict accurate bounding boxes and improve its object detection performance.
A good analogy for this: if you are a detective, the first detection stage is where you are trying to find the object in the image, so you start by guessing where in the image the object may be. Anchor boxes are like using a magnifying glass, zooming in on different parts of the image to help identify exactly where the object is. Finally, IoU is like a second person checking the detective’s work and comparing the answers to the actual location of the object.
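To make the “second person checking” step concrete, here is a minimal IoU calculation for two boxes given as corner coordinates; the example boxes are made up:

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if the boxes overlap at all)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union

predicted = (50, 50, 150, 150)       # hypothetical predicted box
ground_truth = (60, 60, 160, 160)    # hypothetical labelled box
print(round(iou(predicted, ground_truth), 2))  # 0.68 -> a decent overlap
```

A score of 1.0 means a perfect match; a common rule of thumb is to count a detection as correct when IoU is above 0.5.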
8. Output
The final output combines the bounding boxes with everything learned in the convolutional layers: the image is returned with a box around each detected object, along with the object’s name and a confidence score.
Using this output you can then apply it to real-world applications, e.g. counting trees, object detection in driverless cars, and even medical imaging to detect cancerous cells.
YOLO was created by Joseph Redmon at the University of Washington. There are now multiple versions available, from YOLOv1 to YOLOv8. If you want to see how it works in real time, you can watch this TED Talk:
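And if you want to try it yourself, the off-the-shelf ultralytics package makes running YOLOv8 a few lines. This is a sketch, not my project code: the image name is a placeholder, and for trees you would fine-tune on labelled aerial images rather than rely on the stock weights:

```python
# pip install ultralytics
from ultralytics import YOLO

model = YOLO("yolov8n.pt")             # small pretrained YOLOv8 model
results = model("aerial_trees.jpg")    # run detection on one image (placeholder name)
print(f"Objects detected: {len(results[0].boxes)}")  # each box is one detection
```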
Other object detection models are used for slightly different things:
- Faster Region-based Convolutional Neural Networks (Faster R-CNN) → Although Faster R-CNN has the word ‘Faster’ in its name, it’s actually one of the slowest, working at around 7 FPS (frames per second). However, it is very accurate, with tight bounding boxes.
- You Only Look Once (YOLO) → YOLO is the quickest and runs in real time, so it is the most likely to be used in autonomous vehicles; however, some accuracy is lost, with slightly sloppier bounding boxes.
- Single Shot Detectors (SSDs) → Single Shot Detectors are slightly slower than a YOLO architecture, but they are better at dealing with grainy images and with objects that are much smaller in the frame.
They are all similar in that they achieve the same goal of object detection; they just trade off speed and accuracy differently.
Thank you for reading to this point. I hope this article is clear and you have learnt something about the world of AI and how it works, and if not, I hope the dog pictures made up for it!
I will be back soon with an article talking through the code.