A Quick Reference for Bounding Boxes in Object Detection

Rajdeep Singh
7 min read · Jan 17, 2024


Object detection stands at the core of many computer vision-related tasks, and the simple yet powerful concept of bounding boxes occupies an important role within it.

This article is meant to be a no-BS, straightforward reference on a few topics I frequently turn to the internet for. The information presented here is not groundbreaking and can be found scattered online. My intention is to provide it in a curated and practical manner. You will find this article beneficial if:

  • You’re new to computer vision and looking for a brief guide on bounding boxes, their usage, and the various ways to represent them.
  • You use Keras or Albumentations and are seeking a refresher on the bounding-box options available within them, along with the reasoning behind the naming conventions they use.
  • You plan to create and label your own object detection dataset and want to follow the conventions already established by the community.

With the context established, let’s get into the actual content.

What is a Bounding Box?

A bounding box, or bbox, is simply a rectangle drawn on an image to highlight the presence of an object of interest at that spatial location. That’s it. There’s nothing more to it than that.

Photo by Mateusz Wacławek on Unsplash

A bounding box can be represented in multiple ways:

  • Two (x, y) coordinate pairs representing the top-left and bottom-right corners (or any other pair of opposite corners) of the rectangle.
  • A set of (x, y) coordinates, along with the width and height, defining the rectangle.
  • Some conventions also express the previously mentioned coordinates in a normalized form, based on the image’s width and height.

Note: the x, y, width, and height values above are measured in image pixels.
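To make the representations above concrete, here is a minimal sketch that writes the same box in all three styles. The 640x480 image and the box values are made up purely for illustration:

image_width, image_height = 640, 480

# 1) two opposite corners: (x_min, y_min) and (x_max, y_max)
corners = (100, 150, 300, 400)

# 2) one corner plus width and height
x_min, y_min, x_max, y_max = corners
xywh = (x_min, y_min, x_max - x_min, y_max - y_min)  # (100, 150, 200, 250)

# 3) the same values normalized by the image dimensions
normalized = (
    x_min / image_width,
    y_min / image_height,
    (x_max - x_min) / image_width,
    (y_max - y_min) / image_height,
)

print(xywh)        # (100, 150, 200, 250)
print(normalized)  # (0.15625, 0.3125, 0.3125, 0.5208...)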

Bounding boxes serve the primary purpose of labeling objects in training datasets: they are drawn first by human annotators and later predicted by neural networks or other machine learning models. Along with the coordinates, a bbox may also carry the object class and the confidence score associated with that class.

Now that we’ve completed the basic introduction, let’s proceed to explore their application in real-world datasets and libraries.

Dataset based formats

Thanks to the incredible efforts of the research community, several datasets are available on the internet, each containing a vast number of labeled images for detection and segmentation tasks. As you might have already anticipated, these datasets use bounding boxes to indicate the objects of interest along with their ground truth classes.

(P.S. Many of these dataset communities host various challenges annually, serving as significant sources of innovation in the field of computer vision. If you haven’t already, be sure to explore them.)

Several datasets within the community have grown immensely popular, and the formats they’ve chosen for denoting bounding boxes have become de facto standards, often named after the respective dataset itself. Let’s delve into a few examples.

COCO

COCO stands for the Common Objects in Context dataset, which includes around 330K images labeled for detection and segmentation tasks. For more detailed information about this dataset, you can refer to it here.

The format used by the COCO dataset is [x, y, width, height] for each annotation, where:

  • x and y are measured from the top-left image corner and are 0-indexed.
  • width and height give the box’s extent in pixels, measured from x and y.
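For illustration, here is roughly what a single annotation entry in a COCO-style JSON file looks like, and how to recover corner coordinates from it. The IDs and numbers are made up, and the exact set of fields can vary:

# a hand-written, illustrative COCO-style annotation entry
annotation = {
    "image_id": 42,
    "category_id": 18,                    # e.g. "dog" in the COCO label map
    "bbox": [73.0, 128.0, 210.0, 160.0],  # [x, y, width, height] in pixels
    "area": 210.0 * 160.0,
    "iscrowd": 0,
}

# converting to corner coordinates is simple arithmetic
x, y, w, h = annotation["bbox"]
x_min, y_min, x_max, y_max = x, y, x + w, y + h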

PASCAL VOC

PASCAL stands for Pattern Analysis, Statistical Modelling, and Computational Learning, while VOC stands for their Visual Object Classes dataset. They conducted a series of challenges from 2005 to 2012, and many industry-defining object detection architectures originated from these competitions. SSD, R-CNN, and their variants are among the notable architectures that emerged. For more information about this dataset, you can visit here.

The format used by PASCAL VOC is [x_min, y_min, x_max, y_max] for each annotation box, where:

  • x_min and y_min are the coordinates of the top-left corner of the bounding box.
  • x_max and y_max are the coordinates of the bottom-right corner of the bounding box.
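Since the COCO and PASCAL VOC formats differ only in how the second pair of values is expressed, converting between them is simple arithmetic. Here is a minimal sketch; the helper names are mine, not part of either dataset's tooling:

def voc_to_coco(box):
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]

def coco_to_voc(box):
    x, y, w, h = box
    return [x, y, x + w, y + h]

print(voc_to_coco([120, 80, 420, 330]))  # [120, 80, 300, 250]
print(coco_to_voc([120, 80, 300, 250]))  # [120, 80, 420, 330]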

YOLO

YOLO stands for You Only Look Once. It isn’t a dataset but rather a family of neural network-based architectures designed for single-pass, real-time object detection. Its variants have been among the most widely used real-time detectors for quite some time now.

The bounding box format chosen by YOLO diverges slightly from the relatively simple format used by COCO or PASCAL VOC and employs normalized values for all the coordinates. An annotation is represented as [x_center, y_center, width, height], where:

  • x_center and y_center are the normalized coordinates of the center of the bounding box.
  • width and height are also normalized, by dividing by the image’s width and height respectively.

This one requires a little more explanation than the previous ones, so I’ll provide a concrete example. Let’s assume the following:

  • height of the image = 1800, width of the image = 2400
  • height of the bounding box = 1500, width of the bounding box = 750
  • x coordinate of the center of the bounding box = 550
  • y coordinate of the center of the bounding box = 950

Then the YOLO representation works out to:

  • x_center = 550 / 2400 = 0.229166
  • y_center = 950 / 1800 = 0.5277
  • width_bbox = 750 / 2400 = 0.3125
  • height_bbox = 1500 / 1800 = 0.8333

So the final coordinates become: (0.229166, 0.5277, 0.3125, 0.8333)
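If you prefer to see this as code, here is the same arithmetic as a small helper. The function and argument names are mine, chosen only for this sketch:

def to_yolo(x_center, y_center, box_width, box_height, image_width, image_height):
    # pixel values in, normalized [x_center, y_center, width, height] out
    return (
        x_center / image_width,
        y_center / image_height,
        box_width / image_width,
        box_height / image_height,
    )

print(to_yolo(550, 950, 750, 1500, image_width=2400, image_height=1800))
# (0.229166..., 0.527777..., 0.3125, 0.833333...)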

Library based formats

Albumentations

Albumentations is an excellent image augmentation library written in Python. It offers a comprehensive set of augmentation methods that integrate easily with most deep learning workflows, including Keras. If you haven’t explored it yet, I highly recommend checking it out here.

When building augmentation pipelines that involve bounding boxes, Albumentations provides first-class support for transforming the boxes along with the image. This is achieved through the bbox_params parameter of Compose, which accepts an instance of the BboxParams class.

The BboxParams class uses its format parameter (referred to as source_format in the internal conversion helpers) to determine the bounding box structure. You might wonder what values this parameter can take. Upon inspecting the source code, you'll find the following:

Args:

source_format: format of the bounding box. Should be ‘coco’, ‘pascal_voc’, or ‘yolo’.

If you connect the dots backwards, these are the dataset names we reviewed earlier.

Let’s look at an example that horizontally flips an image while taking care of all the bounding boxes present in it:

import albumentations as A

# flip horizontally with probability 1 and keep the coco-format bboxes in sync;
# each bbox should carry its class label, either appended as the last element
# or supplied separately via label_fields
flipping_transform = A.Compose([
    A.HorizontalFlip(p=1.0),
], bbox_params=A.BboxParams(format='coco', min_area=1024))

transformed = flipping_transform(image=sample_image, bboxes=sample_bboxes)

transformed_image = transformed['image']
transformed_bboxes = transformed['bboxes']

As you can see in the above example, the library flips the bounding box coordinates along with the image. Additionally, the min_area option removes any box whose area (in pixels) falls below the given threshold after the transformation.

Keras

Keras is a high-level API for designing neural networks that supports multiple backends such as TensorFlow, PyTorch, and JAX, all while maintaining a consistent and simple API. It also offers a library called KerasCV to streamline the implementation of various computer vision components.

When working with KerasCV to augment or preprocess your images, a significant amount of complexity related to bounding boxes is taken care of for you. This includes handling conversions between different bounding box formats and managing coordinates when the images are resized.

Unlike Albumentations, KerasCV doesn’t rely on dataset names for bounding box formats. Instead, it supports the following bounding box formats across its APIs. You can use either the class constant or its string equivalent from the list below:

  1. keras_cv.bounding_box.XYXY or “xyxy”
  2. keras_cv.bounding_box.REL_XYXY or “rel_xyxy”
  3. keras_cv.bounding_box.CENTER_XYWH or “center_xywh”
  4. keras_cv.bounding_box.XYWH or “xywh”
  5. keras_cv.bounding_box.REL_XYWH or “rel_xywh”
  6. keras_cv.bounding_box.YXYX or “yxyx”
  7. keras_cv.bounding_box.REL_YXYX or “rel_yxyx”

Let’s take a super simple example where we convert bounding box coordinates from a PASCAL VOC-like format to a COCO-like format using KerasCV:

import numpy as np
import keras_cv

# convert_format expects data in the shape [batch_size, num_boxes, 4],
# so here we have a single pascal_voc-like bbox with the following:
# x_min: 200, y_min: 200, x_max: 500, y_max: 700
bbox = np.array([[[200, 200, 500, 700]]])

# let's convert to a coco-like format with x, y, width and height
formatted_box = keras_cv.bounding_box.convert_format(
    boxes=bbox, source="xyxy", target="xywh",
)

print(formatted_box.numpy())  # output: [[[200. 200. 300. 500.]]]
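The REL_* formats are the normalized counterparts of the pixel-based ones. My understanding is that converting to or from them requires passing the images (or their shape) so KerasCV knows what to normalize by; here is a sketch under that assumption, with a dummy image tensor standing in for real data:

import numpy as np
import keras_cv

# a dummy batch of one 1800x2400 image, used only so the conversion
# knows the height and width to normalize by
dummy_images = np.zeros((1, 1800, 2400, 3), dtype="float32")

# the coco-like box from the previous example: x, y, width, height in pixels
bbox = np.array([[[200.0, 200.0, 300.0, 500.0]]])

rel_box = keras_cv.bounding_box.convert_format(
    boxes=bbox, source="xywh", target="rel_xywh", images=dummy_images,
)
print(rel_box)  # roughly [[[0.0833 0.1111 0.125 0.2778]]]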

Ending

That’s all, folks! If you use any other library that follows a different convention and would like it added for quick reference, please feel free to comment, and I’ll include it.
