Computer Vision: YOLO: Grid Cells and Anchor boxes

Ayush Yajnik
5 min read · Sep 7, 2023

Welcome back to our odyssey through the realm of YOLO (You Only Look Once), where we’re about to embark on a voyage through the intricate inner workings of object detection. In today’s chapter, we’ll unravel two pivotal aspects of YOLO’s architecture that make it a true game-changer: Grid Cells and Anchors, and Predictions and Confidence Scores. Get ready for an enlightening journey into the heart of real-time object detection! 🌟🔍

The Canvas of Prediction: Grid Cells

Before diving into the specifics of YOLO’s grid cells, let’s envision how an image becomes the canvas for object detection:

In YOLO, the input image is divided into a grid. Each cell in this grid takes on the responsibility of scrutinizing a specific region of the image. This division allows YOLO to consider multiple locations simultaneously, transforming object detection into an efficient, parallel process.
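To make this concrete, here's a minimal sketch of the grid idea: given an object's center in pixel coordinates, we find which cell of an S×S grid is "responsible" for it. The function name and the 7×7 grid size are illustrative (YOLOv1 used 7×7, but the grid size varies by version).

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the S x S grid cell that contains
    an object's center (cx, cy), given in image pixels."""
    col = int(cx / img_w * S)   # which column of the grid
    row = int(cy / img_h * S)   # which row of the grid
    # clamp in case the center sits exactly on the right/bottom edge
    return min(row, S - 1), min(col, S - 1)

# An object centered at (320, 240) in a 640x480 image with a 7x7 grid
# falls in the middle cell:
print(responsible_cell(320, 240, 640, 480))  # (3, 3)
```

Each cell then only has to answer a local question: "is there an object whose center falls in my patch of the image?" That's what makes the search parallel instead of exhaustive.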

Understanding Grid Cells:

Grid cells serve as the foundation of YOLO’s spatial understanding. Each grid cell acts as an anchor point, and within it, YOLO seeks to identify objects. By systematically assessing each grid cell, YOLO can pinpoint objects’ locations with precision.

The Ingenious Role of Anchor Boxes:

Have you ever tried to study an object detection algorithm? Like going really in-depth about it? If so, you might have stumbled across a concept called anchor boxes!

Now, let’s explore the ingenious concept of anchor boxes. Imagine objects of various shapes, sizes, and orientations. Anchor boxes are predefined bounding boxes that serve as reference points for YOLO. They come in different shapes and sizes, strategically chosen to encompass the wide variability of real-world objects.

Intuitively, how would you predict a bounding box? The first, most obvious technique, is the sliding window. You define a window of arbitrary size, and “slide” it through the image. At each step, you classify whether the window contains your object of interest.

This is probably what you pictured, right? Well, an anchor box is the "Deep Learning" version of that idea. It's faster, and also more precise.

You see the problem, don't you? A single window size can't cover everything: some windows should be incredibly small, and others should be bigger. Similarly, some boxes should be vertical, like those of pedestrians, and others should be horizontal, like buses.

This is where the concept of anchor boxes comes into play: we predefine a set of plausible box shapes for the classes we want to detect, and then make our predictions relative to them.
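As a toy sketch of "predefining plausible shapes", here is a small set of anchor priors and a helper that matches a ground-truth box to the best-fitting anchor by shape alone (centers aligned). The anchor values are illustrative, not taken from any trained model.

```python
# A toy set of anchor priors (width, height) in pixels, covering
# tall shapes (pedestrian-like) and wide shapes (bus-like).
anchors = [(30, 90), (90, 30), (60, 60), (150, 80), (40, 120)]

def shape_iou(wh_a, wh_b):
    """IoU of two boxes compared by shape only, as if their centers coincide."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def best_anchor(gt_w, gt_h):
    """Index of the anchor whose shape best matches a ground-truth box."""
    return max(range(len(anchors)),
               key=lambda i: shape_iou(anchors[i], (gt_w, gt_h)))

# A tall 35x100 box (pedestrian-like) picks a tall anchor:
print(best_anchor(35, 100))  # 0  -> the (30, 90) prior
```

Real pipelines choose these priors more carefully (YOLOv2 famously ran k-means clustering on the training boxes), but the matching idea is the same.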

Why an anchor box is different from a bounding box

There is a concept we however must understand:

An Anchor Box is not a Bounding Box!

You want to predict a bounding box, and for this, you’ll use an anchor box as a helper. So let me show you how.

First, we will take one anchor box, and put it everywhere on the image, just like for sliding windows:

But notice that none of these anchor boxes can be our final bounding box, and we've only used a single anchor shape. Placing anchors exhaustively on the raw image simply doesn't work!

So let’s step back:

When you’re looking at the architecture that allows for box generation, you have something like this:

Which means that:

  1. Anchor Boxes are used on top of feature maps, not on images
  2. Anchor Boxes are used to generate bounding boxes, but they aren’t the bounding boxes[1]
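Point 2 is easiest to see in code. Here's a hedged sketch of the YOLOv2/v3-style decoding: the network predicts raw offsets (tx, ty, tw, th) per anchor per feature-map cell, and the bounding box is generated by combining those offsets with the anchor and the cell position. The specific anchor size and stride below are just example values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Turn raw offsets plus an anchor into a bounding box,
    following the YOLOv2/v3-style parameterization:
        bx = (sigmoid(tx) + cell_x) * stride   # center x, in pixels
        by = (sigmoid(ty) + cell_y) * stride   # center y, in pixels
        bw = anchor_w * exp(tw)                # width scales the anchor
        bh = anchor_h * exp(th)                # height scales the anchor
    The grid lives on the feature map; `stride` maps it back to image pixels."""
    bx = (sigmoid(tx) + cell_x) * stride
    by = (sigmoid(ty) + cell_y) * stride
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh

# Zero offsets leave the anchor centered in its cell, unchanged in size:
print(decode_box(0, 0, 0, 0, cell_x=3, cell_y=3,
                 anchor_w=90, anchor_h=30, stride=32))
# (112.0, 112.0, 90.0, 30.0)
```

Notice the anchor never appears in the output directly; it only parameterizes the prediction. That's exactly why an anchor box is not a bounding box.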

IoU: The Measure of Overlap:

A crucial element in this process is IoU (Intersection over Union). It’s a metric used to measure the overlap between predicted bounding boxes and ground-truth boxes. IoU plays a significant role in determining the accuracy of object detection and in the selection of the most suitable anchor box.
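The metric itself is short enough to write out. Below is a standard IoU implementation for axis-aligned boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # corners of the overlap rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # max(0, ...) handles boxes that don't overlap at all
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 100x100 boxes shifted by 50 px overlap on a 50x50 patch:
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # 1/7 ≈ 0.1429
```

An IoU of 1 means a perfect match, 0 means no overlap; during training, the anchor with the highest IoU against a ground-truth box is typically the one made responsible for predicting it.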

The Fusion of Grid Cells and Anchor Boxes:

Grid cells and anchor boxes are intrinsically linked within YOLO’s architecture. By systematically evaluating each grid cell and selecting the most appropriate anchor boxes, YOLO can detect objects of varying sizes, shapes, and orientations with remarkable accuracy.

Predictions and Confidence Scores:

Within each grid cell of YOLO’s canvas, a symphony of predictions unfolds. These predictions are YOLO’s way of deciphering the visual cues within its scope. Let’s break down the components of these predictions:

  1. Bounding Box Predictions: YOLO aims to locate objects within each grid cell by predicting bounding box coordinates. These coordinates specify the position of the bounding box’s center (x, y) relative to the grid cell, as well as its width (w) and height (h). These predictions are crucial in defining the precise location of an object.
  2. Class Probabilities: YOLO doesn’t stop at just locating objects; it’s equipped to classify them as well. For each bounding box, YOLO predicts a set of class probabilities. These probabilities represent the likelihood of the object within the bounding box belonging to various predefined classes. In essence, YOLO assigns labels to objects.
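To see how these pieces pack together, here's a sketch of the classic YOLOv1-style output layout: each grid cell carries B boxes of (x, y, w, h, confidence) plus C class probabilities. The random tensor stands in for real network output; S, B, and C are example values.

```python
import numpy as np

S, B, C = 7, 2, 20   # grid size, boxes per cell, classes (illustrative)

# Fake raw output: for each cell, B boxes of (x, y, w, h, conf)
# followed by C class probabilities.
pred = np.random.rand(S, S, B * 5 + C)

cell = pred[3, 3]                      # predictions for one grid cell
boxes = cell[:B * 5].reshape(B, 5)     # each row: x, y, w, h, confidence
class_probs = cell[B * 5:]             # class distribution for this cell

print(boxes.shape, class_probs.shape)  # (2, 5) (20,)
```

Note that in this v1-style layout the class probabilities are shared per cell; later YOLO versions predict a full class vector per anchor box instead.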

Confidence Scores: Gauging Objectness:

To further refine its detections, YOLO introduces the concept of confidence scores. These scores represent the network’s degree of confidence that an object indeed exists within a given bounding box. But how are these scores calculated?

Confidence Score Calculation:

The confidence score (often denoted as “conf”) is determined through a combination of factors:

  • Objectness Score: YOLO’s network learns to assign an objectness score for each bounding box. This score signifies the likelihood of a significant object’s presence within the box.
  • Class Confidence: Additionally, the confidence score takes into account the class probabilities assigned to the bounding box. The highest class probability contributes to the overall confidence score.
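Combining the two bullets above, the class-specific confidence usually reported for a detection is objectness times the best class probability. A tiny sketch (the function name is mine):

```python
def detection_confidence(objectness, class_probs):
    """Class-specific confidence as commonly defined for YOLO:
    objectness * P(class | object), reported for the best class."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return best, objectness * class_probs[best]

# objectness 0.9 and a best class probability of 0.8
# give class index 2 with confidence ~0.72:
print(detection_confidence(0.9, [0.1, 0.1, 0.8]))
```

A box can therefore score low either because the network doubts anything is there (low objectness) or because it can't decide what the object is (flat class probabilities); thresholding this product filters out both kinds of weak detections.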

Conclusion:

As we conclude this chapter, our journey through YOLO’s architecture brings us closer to mastering the art of object detection. We’ve unlocked the secrets of grid cells, anchor boxes, predictions, and confidence scores.

In our next chapter, we’ll explore the real-world implications of YOLO’s predictive prowess, unveiling the beauty of real-time object detection and its transformative potential.

Stay curious, keep exploring, and let’s continue our quest to redefine possibilities in the realm of #ComputerVision! 🚀🔍 #AI #DeepLearning #YOLO #TechSeries

Reference:

[1]: https://www.thinkautonomous.ai/blog/anchor-boxes/
