YOLO V1 Architecture

Ankush Sharma · Jun 25, 2020


Object detection and classification are core computer vision tasks for artificial intelligence in real-world applications: a system needs to locate and identify objects such as cats, cars, or people. In short, detection answers the question "what objects are where?"

Around 20 years ago, all object detection models were based on handcrafted features. The Viola-Jones detector was the first to perform face detection in real time with extremely good results. It slid a window across the image, searching for a bounding box that enclosed a face, but it performed poorly on faces in unusual orientations. In 2005, the Histogram of Oriented Gradients (HOG) detector arrived, which used gradient-based features and was scale-invariant.

Around 2012, deep learning won the ImageNet challenge and the world saw the rebirth of convolutional neural networks. In 2014, R-CNN (Region-based Convolutional Neural Network) became widely popular. R-CNN first uses selective search to find a manageable number of candidate bounding-box object regions (ROIs) and then extracts CNN features from each region for classification. Over time, other versions followed, such as Fast R-CNN and Mask R-CNN, but all of these are computationally expensive and time-consuming. They are also very difficult to train, as each component has to be trained separately.

In 2015, Joseph Redmon came up with a new architecture called YOLO (You Only Look Once), which reframes detection not as a classification problem but as a regression problem. YOLO uses a single neural network to predict bounding boxes and their associated class probabilities.

YOLO V1

YOLO V1 is an extremely fast object detector that processes images in real time at 45 frames per second. The pipeline is simple: a single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

Unlike detection techniques that slide a window over the image to extract features, YOLO sees the entire image during training and test time, so it can encode contextual information about the classes as well as their appearance. As a result, YOLO makes less than half the number of background errors that Fast R-CNN does.

YOLO also learns generalizable representations of objects: when trained on natural images and tested on artwork, it outperforms top detection methods like DPM and R-CNN by a wide margin.

YOLO V1 Architecture

YOLO V1 uses features from the entire image and predicts all bounding boxes simultaneously. The image is divided into an S × S grid, and each grid cell predicts B bounding boxes along with confidence scores for those boxes. A confidence score reflects both how confident the model is that the box contains an object and how accurate it thinks the predicted box is; the paper defines it as Pr(Object) × IOU between the predicted box and the ground truth. Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell, while the width and height are predicted relative to the whole image. Each cell also predicts C conditional class probabilities, which are shared across that cell's B boxes.
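To make this encoding concrete, here is a minimal NumPy sketch of how one grid cell of the output tensor could be decoded, assuming the paper's PASCAL VOC settings (S = 7, B = 2, C = 20, 448 × 448 input). The tensor is random stand-in data rather than real network output, and decode_cell is a hypothetical helper, not code from the paper.

```python
import numpy as np

# Paper settings for PASCAL VOC: 7x7 grid, B=2 boxes per cell, C=20 classes.
S, B, C = 7, 2, 20

# Stand-in for one network output: an S x S x (B*5 + C) tensor holding,
# per cell, B boxes of (x, y, w, h, confidence) plus C class probabilities.
pred = np.random.rand(S, S, B * 5 + C)

def decode_cell(pred, row, col, img_w, img_h):
    """Convert one grid cell's raw predictions into image-space boxes."""
    cell = pred[row, col]
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
        # (x, y) locate the box center inside the cell; (w, h) are
        # fractions of the whole image, as described above.
        cx = (col + x) / S * img_w
        cy = (row + y) / S * img_h
        boxes.append((cx, cy, w * img_w, h * img_h, conf))
    class_probs = cell[B * 5 :]  # one set of class scores, shared by all B boxes
    return boxes, class_probs

boxes, class_probs = decode_cell(pred, row=3, col=4, img_w=448, img_h=448)
print(len(boxes), class_probs.shape)  # 2 boxes, (20,) class probabilities
```

Note that the class probabilities appear once per cell rather than once per box, which is exactly the constraint discussed under the limitations below.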

In the paper's visualization, each image is divided into cells of equal size and the predicted bounding boxes are drawn with line widths indicating their confidence scores.

The YOLO model was evaluated on the PASCAL VOC detection dataset. The initial convolutional layers of the network extract features from the image, while the fully connected layers predict the output probabilities and coordinates. The network has 24 convolutional layers followed by 2 fully connected layers.
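The sketch below shows that overall shape in PyTorch. The backbone here is a hypothetical stand-in (the paper's 24 convolutional layers, which alternate 1 × 1 reduction layers with 3 × 3 convolutions, are collapsed into a few layers plus pooling); the point is the head, where two fully connected layers map the convolutional features to the S × S × (B·5 + C) prediction tensor.

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # PASCAL VOC settings from the paper

# Hypothetical stand-in for the paper's 24-layer convolutional backbone,
# which reduces a 448x448x3 image to a 7x7 feature map.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.LeakyReLU(0.1),
    nn.MaxPool2d(2),
    # ...the remaining convolutional layers are omitted in this sketch...
    nn.AdaptiveAvgPool2d((7, 7)),
)

# The two fully connected layers that predict coordinates and probabilities.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 64, 4096),  # the real network flattens 7x7x1024 features
    nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (B * 5 + C)),  # 7*7*30 = 1470 outputs
)

x = torch.randn(1, 3, 448, 448)
out = head(backbone(x)).view(-1, S, S, B * 5 + C)
print(out.shape)  # torch.Size([1, 7, 7, 30])
```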

Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions, since each grid cell predicts only two boxes and can have only one class. This limits the number of nearby objects the model can detect: two objects whose centers fall in the same cell compete for the same set of predictions. The model therefore struggles with small objects that appear in groups, such as flocks of birds.

Resources

The original paper on YOLO: Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, "You Only Look Once: Unified, Real-Time Object Detection" (arXiv:1506.02640).
