3D object detection

Elven Kim
Aug 21, 2023


3D object detection is a computer vision task that involves identifying and localizing objects in a 3D space from a given input such as images, LiDAR data, or a combination of both.

Problem Statement:

Given an image, estimate the 3D shape and 3D pose of all object instances.

Should the pose angle be allocentric or egocentric?

  • Allocentric — pose of the camera with respect to the object (chosen here)
  • Egocentric — pose of the object with respect to the camera
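The two angle conventions are related through the object's position in the camera frame. Below is a minimal sketch using the KITTI convention, where the allocentric observation angle alpha is derived from the egocentric yaw rotation_y; the function name is an assumption for illustration, not from the article:

```python
import math

def egocentric_to_allocentric(rotation_y, x, z):
    """Convert a global (egocentric) yaw to an observation (allocentric) angle.

    Follows the KITTI convention: alpha = rotation_y - arctan2(x, z),
    where (x, z) is the object's position in the camera frame.
    """
    alpha = rotation_y - math.atan2(x, z)
    # Wrap the result to (-pi, pi]
    while alpha <= -math.pi:
        alpha += 2 * math.pi
    while alpha > math.pi:
        alpha -= 2 * math.pi
    return alpha
```

An object straight ahead of the camera (x = 0) has identical egocentric and allocentric angles; the difference grows as the object moves off-center.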

Breakdown of tasks

Data Collection: Gather a dataset that includes both images and corresponding 3D data, such as point clouds from LiDAR sensors, depth maps, or stereo images. The dataset should be annotated with the 3D bounding box information of the objects we want to detect.

Decide on the sensors we will use to capture the 3D data. Common options include LiDAR sensors, stereo cameras, RGB-D cameras (like Microsoft Kinect), and depth sensors. Each sensor type has its advantages and limitations, so choose the one that suits the application.

Annotate our data by labeling the objects of interest with their corresponding 3D bounding boxes. This involves specifying the object’s dimensions (length, width, height) and its position in 3D space (usually represented by the object’s centroid or a corner of the bounding box). Tools like Labelbox, VGG Image Annotator (VIA), or custom scripts can assist in this process.
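An annotation of this form (centroid, dimensions, yaw) can be expanded into the eight box corners for visualization or overlap checks. A hypothetical sketch, assuming a z-up coordinate frame and rotation about the vertical axis:

```python
import numpy as np

def box3d_corners(center, dims, yaw):
    """Return the 8 corners of a 3D bounding box as an (8, 3) array.

    center: (x, y, z) of the box centroid
    dims:   (length, width, height)
    yaw:    rotation about the vertical (z) axis, in radians
    """
    l, w, h = dims
    # Axis-aligned corner offsets around the origin
    x = np.array([ l,  l,  l,  l, -l, -l, -l, -l]) / 2.0
    y = np.array([ w,  w, -w, -w,  w,  w, -w, -w]) / 2.0
    z = np.array([ h, -h,  h, -h,  h, -h,  h, -h]) / 2.0
    corners = np.stack([x, y, z])                    # shape (3, 8)
    # Rotate about z, then translate to the centroid
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (rot @ corners).T + np.asarray(center)    # shape (8, 3)
```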

Preprocessing: Prepare our data for training. This might involve normalizing the images and data, aligning the images with the corresponding 3D data, and augmenting the data to increase the diversity of the training set.
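For LiDAR input, a common first preprocessing step is cropping the point cloud to the region of interest and shuffling the surviving points. A minimal sketch; the function name and ranges are illustrative assumptions:

```python
import numpy as np

def crop_and_shuffle(points, x_range, y_range, z_range, seed=0):
    """Keep only points inside the region of interest, then shuffle them.

    points: (N, 4) array of x, y, z, reflectance values.
    """
    mask = (
        (points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1])
        & (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1])
        & (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1])
    )
    cropped = points[mask]
    # Shuffle with a fixed seed for reproducible training runs
    rng = np.random.default_rng(seed)
    return cropped[rng.permutation(len(cropped))]
```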

Architecture Selection: Choose an architecture suitable for 3D object detection. There are various neural network architectures designed for this purpose, such as PointNet, Frustum PointNets, PIXOR, VoxelNet, and SECOND. These architectures can handle different types of input data like point clouds or voxel grids.
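As an illustration of how voxel-based architectures such as VoxelNet prepare their input, here is a hypothetical sketch of grouping a point cloud into voxels with a cap on points per voxel (VoxelNet randomly subsamples within full voxels; this sketch simply truncates):

```python
import numpy as np

def voxelize(points, voxel_size, origin, max_points=35):
    """Group points into voxels, the first stage of VoxelNet-style models.

    Returns a dict mapping voxel index (i, j, k) -> list of up to
    max_points points that fall inside that voxel.
    """
    idx = np.floor((points[:, :3] - origin) / voxel_size).astype(int)
    voxels = {}
    for p, (i, j, k) in zip(points, idx):
        bucket = voxels.setdefault((i, j, k), [])
        if len(bucket) < max_points:   # cap the points kept per voxel
            bucket.append(p)
    return voxels
```

Each non-empty voxel would then be passed through a learned feature encoder before the detection head.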

One example is the VoxelNet architecture (see the video guide in reference 1).

Data augmentation for VoxelNet includes rotating and shifting individual ground-truth bounding boxes (together with the points inside them), as well as global scaling and rotation of the entire point cloud.
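The global part of that augmentation can be sketched as follows; a simplified illustration in which the same z-axis rotation and uniform scale are applied to the points and the box centers (the function name is an assumption):

```python
import numpy as np

def global_rotate_scale(points, box_centers, angle, scale):
    """Apply a global z-axis rotation and uniform scaling to a point cloud
    and the centers of its ground-truth boxes.

    points:      (N, >=3) array; only the xyz columns are transformed.
    box_centers: (M, 3) array of ground-truth box centroids.
    Returns the transformed xyz coordinates and box centers.
    """
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pts = (points[:, :3] @ rot.T) * scale
    ctrs = (box_centers @ rot.T) * scale
    return pts, ctrs
```

A full pipeline would also rotate each box's yaw angle by the same amount and scale its dimensions.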

Network Training: Train the selected architecture on our dataset. This typically involves optimizing for both the classification of objects and the regression of their 3D bounding box parameters. We need to define appropriate loss functions that take into account both the object presence and the accuracy of the predicted bounding boxes.
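A simplified sketch of such a combined loss, using binary cross-entropy for object presence and smooth L1 for box residuals on positive anchors; the function names, the regression weight, and the 7-parameter box encoding (x, y, z, l, w, h, yaw) are illustrative assumptions:

```python
import numpy as np

def smooth_l1(diff, beta=1.0):
    """Smooth L1 (Huber) loss, commonly used for box regression."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def detection_loss(cls_prob, cls_target, box_pred, box_target, reg_weight=2.0):
    """Combined objectness + box-regression loss (a simplified sketch).

    cls_prob:   predicted object probability per anchor, in (0, 1)
    cls_target: 1 for positive anchors, 0 for negatives
    box_pred:   predicted box residuals per anchor, shape (N, 7)
    Regression is only counted for positive anchors.
    """
    eps = 1e-7
    bce = -(cls_target * np.log(cls_prob + eps)
            + (1 - cls_target) * np.log(1 - cls_prob + eps))
    pos = cls_target == 1
    reg = smooth_l1(box_pred[pos] - box_target[pos]).sum(axis=-1)
    return bce.mean() + reg_weight * (reg.mean() if pos.any() else 0.0)
```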

Post-processing: After inference, the network will provide us with predictions. Apply a post-processing step to filter out false positives, refine the bounding boxes, and associate objects across frames if our application requires tracking.

Evaluation: Measure the performance of our model using appropriate metrics such as Intersection over Union (IoU), Average Precision (AP), and mean Average Precision (mAP) to understand how well it is performing.
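For axis-aligned 3D boxes, IoU reduces to a ratio of overlap volume to union volume. A minimal sketch (benchmarks such as KITTI actually evaluate with rotated boxes; the function name is an assumption):

```python
def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax) tuples."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:           # no overlap along this axis
            return 0.0
        inter *= hi - lo
    def vol(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    return inter / (vol(a) + vol(b) - inter)
```

Two unit cubes overlapping in half their volume have IoU 0.5 / 1.5 = 1/3, which is below the 0.5 or 0.7 thresholds most benchmarks count as a true positive.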

Fine-tuning and Optimization: Iteratively fine-tune our model and hyperparameters to improve its performance. We might need to adjust learning rates, augmentation strategies, and other training parameters.

Deployment: Once our model meets the desired performance levels, deploy it to our application or platform. Ensure it can handle real-time or near-real-time processing if necessary.

Remember that 3D object detection can be challenging due to the complexity of the task and the variety of data sources involved. It’s important to thoroughly understand the algorithms and techniques we’re using and tailor them to our specific application and dataset.

References

(1) Video guide: VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. https://youtu.be/LyOC2TkYnS8

(2) Video guide: CVPR18: Session 2–1A: Object Recognition & Scene Understanding II. https://youtu.be/Jl1NeziAHFY


Elven Kim

I am a researcher in the field of Robotics, Computer Vision and Artificial Intelligence.