YOLO v1 : Part 1

Divakar Kapil
Escapades in Machine Learning
3 min read · May 4, 2018


YOLO, short for You Only Look Once, is a convolutional neural network architecture designed for object detection. At the time of writing there are three versions of YOLO: version 1, version 2 and version 3. The latter two are improvements on the first. In this multi-part series I will cover the YOLO v1 paper. Please note that this series highlights the main points of the paper with simplified explanations.


Object detection is the problem of localizing and classifying specific objects in an image that contains multiple objects. Prior to YOLO, image classifiers were repurposed for detection by scanning the entire image to locate the object. The scan starts with a window of a pre-defined size; at each position the classifier produces a boolean result that is true if the specified object is present in the scanned section of the image and false if it is not. After the window has covered the entire image, its size is increased and the image is scanned again. Systems like the deformable parts model (DPM) use this technique, which is called Sliding Window.
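
To make the scanning procedure concrete, here is a minimal sketch of a multi-scale sliding window in Python. It is only an illustration and not code from DPM or any other system: the function names, the window sizes, the half-window stride and the `contains_object` callback (a stand-in for the image classifier) are all assumptions made for the example.

```python
from typing import Callable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def sliding_windows(img_w: int, img_h: int, win: int, stride: int) -> Iterator[Box]:
    """Yield every square window of side `win` that fits inside the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, win, win)


def detect(
    img_w: int,
    img_h: int,
    contains_object: Callable[[Box], bool],  # hypothetical classifier: True/False per window
    sizes: Tuple[int, ...] = (64, 128, 256),  # assumed window sizes, smallest first
    stride_frac: float = 0.5,                 # assumed step of half a window
) -> List[Box]:
    """Scan the image once per window size and collect the windows flagged as hits."""
    hits: List[Box] = []
    for win in sizes:
        stride = max(1, int(win * stride_frac))
        for box in sliding_windows(img_w, img_h, win, stride):
            if contains_object(box):
                hits.append(box)
    return hits
```

Note that the classifier runs once per window and per window size, which is why this approach becomes expensive on large images.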

Other detection methods like R-CNN and Fast R-CNN are primarily image classifier networks that are used for object detection with the following steps:

  1. Use a region proposal method to generate potential bounding boxes in an image
  2. Run the classifier on these boxes
  3. After classification, perform post-processing to tighten the bounding boxes and remove duplicate detections
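
The duplicate-removal part of step 3 can be sketched as follows. The post does not say how duplicates are removed; this example assumes greedy non-maximum suppression based on intersection over union (IoU), which is a common choice. The function names and the 0.5 threshold are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corner coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def remove_duplicates(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # assumed overlap above which two boxes count as duplicates
) -> List[int]:
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

For example, two heavily overlapping boxes around the same object collapse to the one with the higher classifier score, while a box elsewhere in the image is left untouched.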

These pipelines are complex, bulky and hard to optimize because each component needs to be trained separately. Such a pipeline is also often very slow during inference.

How is YOLO different?

YOLO differs from all these methods in that it treats object detection as a regression problem rather than a classification problem and uses a single convolutional neural network to perform all of the tasks mentioned above. Unifying these independent tasks into one network has the following benefits:

  1. SPEED: YOLO is extremely fast compared to its predecessors because it uses a single convolutional network to detect objects. The network is run on the entire input image only once to yield the predictions.
  2. FEWER BACKGROUND MISTAKES: YOLO sees the whole image rather than sections of it, so it encodes contextual information about the classes and their appearance. Because it reasons globally rather than locally, it makes fewer mistakes in predicting background patches as objects.
  3. HIGHLY GENERALIZABLE: YOLO learns generalizable representations of objects, so it is less likely to break down when applied to new domains or unexpected inputs.

YOLO, however, lags behind state-of-the-art object detection systems like Faster R-CNN in accuracy. Its inference speed comes at the cost of precise localization, especially for small objects or groups of small objects.

Network Design

YOLO is implemented as a convolutional neural network and has been evaluated on the PASCAL VOC detection dataset. It consists of a total of 24 convolutional layers followed by 2 fully connected layers. The layers are organized by function as follows:

  1. The first 20 convolutional layers, followed by an average pooling layer and a fully connected layer, are pre-trained on the ImageNet 1000-class classification dataset
  2. The pre-training for classification is performed on images at a resolution of 224 x 224
  3. The layers consist of 1x1 reduction layers followed by 3x3 convolutional layers
  4. The last 4 convolutional layers, followed by 2 fully connected layers, are added to train the network for object detection
  5. Object detection requires more fine-grained detail, hence the input resolution is increased from 224 x 224 to 448 x 448
  6. The final layer predicts the class probabilities and bounding boxes
Fig 1. YOLO v1 architecture. Image source https://arxiv.org/pdf/1506.02640.pdf

The final layer uses a linear activation function, whereas all the other layers use the leaky ReLU activation.
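
The two activations can be written out in a few lines of Python. The slope of 0.1 for negative inputs is the value given in the YOLO v1 paper, not in this post; the function names are only illustrative.

```python
def leaky_relu(x: float, slope: float = 0.1) -> float:
    """Leaky ReLU: pass positive inputs through, scale the rest by `slope`.

    The default slope of 0.1 is the value stated in the YOLO v1 paper.
    """
    return x if x > 0 else slope * x


def linear(x: float) -> float:
    """Linear (identity) activation, as used by the final layer."""
    return x
```

Unlike a plain ReLU, the leaky variant keeps a small non-zero output for negative inputs, so those units still receive a gradient during training.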

The input is a 448 x 448 image and the output encodes the predicted bounding boxes together with the class predictions for the objects they enclose.
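
As a rough preview of what that output looks like, the sketch below computes the size of the prediction tensor. The numbers are taken from the YOLO v1 paper rather than from this post: on PASCAL VOC the paper uses a 7 x 7 grid, 2 boxes per grid cell and 20 classes, and each box carries 5 values (x, y, width, height and a confidence score). The function name is hypothetical.

```python
from typing import Tuple


def yolo_v1_output_shape(
    grid_size: int = 7,       # S in the paper
    boxes_per_cell: int = 2,  # B in the paper
    num_classes: int = 20,    # C for PASCAL VOC
) -> Tuple[int, int, int]:
    """Shape of the prediction tensor: S x S x (B * 5 + C)."""
    values_per_box = 5  # x, y, width, height, confidence
    depth = boxes_per_cell * values_per_box + num_classes
    return (grid_size, grid_size, depth)


print(yolo_v1_output_shape())  # (7, 7, 30)
```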

The next parts of the series, YOLO v1 Part 2 and Part 3, will cover the working and the limitations of the network respectively, so stay tuned :)

If you liked the post or found it helpful, please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.





Divakar Kapil

4th year CE undergrad at University of Waterloo | Machine Learning enthusiast :)