Should the computer vision industry continue using bounding box annotations?
In this post, I will share some ideas about image annotation that I accumulated during my PhD research. Specifically, I will discuss the current state-of-the-art annotation methods, their trends, and future directions. Finally, I will briefly talk about the annotation software we are building and give a short preview of our company, SuperAnnotate.
- Introduction to Image Annotation
- Mainstream Annotation Methods: Bounding Box
- Pixel-Precision in Image Annotation
- About SuperAnnotate
1. Introduction to Image Annotation
Image annotation is the process of selecting objects in images and labeling them by name. It is the backbone of computer vision AI: for self-driving car software to accurately identify an object in an image, say a pedestrian, it must be trained on hundreds of thousands to millions of annotated pedestrians. Other use cases include drone/satellite footage analytics, security and surveillance, medical imaging, e-commerce, online image/video analytics, AR/VR, etc.
The growth in image data and computer vision applications creates demand for a huge amount of training data. Data preparation and engineering tasks account for over 80% of the time spent on AI and machine learning projects. As a result, many data annotation services and tools have been created over the last few years to serve this market. Data labeling became a $1.5B market in 2018 and is expected to grow to $5B by 2023.
2. Mainstream Annotation Methods: Bounding Box
The most common annotation technique is the bounding box: fitting a tight rectangle around the target object. It is the most widely used approach because bounding boxes are relatively straightforward and many object detection algorithms were developed with this method in mind (YOLO, Faster R-CNN, etc.). Consequently, all annotation companies offer solutions for bounding box annotation (services or software). However, box annotation suffers from major drawbacks:
- One needs a relatively large number of bounding boxes (usually on the order of hundreds of thousands) to reach over 95% detection accuracy. For example, the autonomous driving industry generally gathers millions of bounding boxes of cars, pedestrians, street lights, lanes, cones, etc.
- Bounding box annotation usually doesn’t allow reaching superhuman detection accuracy, no matter how much data you use. This is mainly because the box area includes additional noise around the object.
- Detection becomes extremely complicated for occluded objects. In many cases, the target object covers less than 20% of the bounding box area, so the rest of the box is noise that makes it harder for the detection algorithm to find the right object (see the example in a green box below).
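To make the noise argument concrete, here is a minimal Python sketch (an illustration only, not taken from any annotation tool) that measures what fraction of a tight bounding box is actually covered by the object. The toy mask and the function names `tight_bounding_box` and `box_fill_ratio` are hypothetical and chosen just for this example.

```python
def tight_bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the tightest box
    around the truthy pixels of a 2D mask given as a list of rows."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask contains no object pixels")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


def box_fill_ratio(mask):
    """Fraction of the tight bounding box area that belongs to the object."""
    top, left, bottom, right = tight_bounding_box(mask)
    box_area = (bottom - top + 1) * (right - left + 1)
    object_area = sum(1 for row in mask for value in row if value)
    return object_area / box_area


# Hypothetical toy mask: a thin diagonal object (1 = object, 0 = background).
diagonal = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]

ratio = box_fill_ratio(diagonal)
print(f"object fills {ratio:.0%} of its box")  # object fills 17% of its box
```

For this thin diagonal object, only 6 of the 36 pixels in the box belong to the object. Everything else is background that a box-based detector still receives as part of the "object".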
3. Pixel-Precision in Image Annotation
The above issues with bounding boxes can be solved with pixel-accurate annotation. Yet the most common tools for such annotation rely heavily on slow point-by-point object selection, where the annotator has to trace the edges of each object. This is not only extremely time-consuming and costly but also very prone to human error. For comparison, such annotation tasks usually cost around 10x more than bounding box annotation, and it can take 10x more time to annotate the same amount of data pixel-accurately. As a result, bounding boxes remain the most common annotation type for various applications.
However, deep learning algorithms have progressed substantially over the last seven years. While in 2012 the state-of-the-art algorithm (AlexNet) was only able to categorize images, current algorithms can already identify objects accurately at the pixel level (see the image below). For such accurate object detection, pixel-perfect annotation is the key.
3.1. AI/segmentation based approaches
There have been approaches that use segmentation-based solutions (e.g. SLIC superpixels, GrabCut-based segmentation) for pixelwise annotation. However, these approaches segment the image based on pixel colors and often give unsatisfactory results in real-life scenarios such as autonomous driving. Hence, they are not commonly used for such annotation tasks.
Over the last 3 years, NVIDIA has done extensive research with the University of Toronto on pixel-accurate annotation solutions. Their research mainly concentrates on generating pixel-accurate polygons from a given bounding box and includes the papers Polygon RNN, Polygon RNN++, and Curve-GCN, published at CVPR in 2017, 2018, and 2019, respectively. In the best-case scenario, generating a polygon with these tools requires at least two precise clicks (i.e. drawing a bounding box), after which one hopes that the polygon captures the target object accurately. However, the proposed polygons are usually inaccurate, and the annotation can take much more time than expected (see the example below).
Another problem with such polygon-based approaches is the difficulty of selecting donut-like objects (topologically speaking), where one needs at least two polygons to describe the object.
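The donut problem can be illustrated with a small Python sketch. It is only an illustration under simple assumptions: the shape is described by two hypothetical polygons, `outer` and `hole`, and the helper names are made up for this example. A single outline cannot express the hole, so both the area and the inside test need the second polygon.

```python
def polygon_area(points):
    """Absolute area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def point_in_polygon(point, polygon):
    """Ray-casting test for a point inside a simple polygon.
    Points lying exactly on the boundary are not handled specially."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_in_donut(point, outer, hole):
    """A point belongs to the donut if it is inside the outer polygon
    but not inside the hole polygon."""
    return point_in_polygon(point, outer) and not point_in_polygon(point, hole)


# Hypothetical donut: a 10x10 square with a 4x4 square hole in the middle.
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(3, 3), (7, 3), (7, 7), (3, 7)]

donut_area = polygon_area(outer) - polygon_area(hole)
print(donut_area)                             # 84.0
print(point_in_donut((1, 1), outer, hole))    # True: on the ring
print(point_in_donut((5, 5), outer, hole))    # False: in the hole
```

With only the outer polygon, the point (5, 5) would be labeled as part of the object and the area would be overestimated as 100 instead of 84.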
3.2. A novel approach to pixelwise annotation
The easiest and fastest way to do pixelwise annotation would be to select objects with just one click. I worked specifically on this problem during my PhD research at KTH in Sweden. By the end of my PhD in November 2018, we had prototyped a simple tool that allowed selecting objects with just a click. Our initial experiments showed that pixelwise annotation can be accelerated by 10–20x without compromising the selection quality. Here is an example of how it works on the same image presented above.
We also carefully analyzed the advantages of our solution compared to other AI or segmentation-based approaches:
- The speed of our algorithm allows segmenting and annotating images of up to 10 megapixels in real time.
- Unlike SLIC superpixels, our segmentation solution accurately generates non-homogeneous regions, allowing users to select both large and small objects with just one click.
- Our software allows changing the number of segments instantly, which enables selecting even the smallest objects.
- The self-learning feature of our algorithm improves the segmentation accuracy even further. Even with a few hundred annotations, dramatic changes in the segmentation accuracy can be observed. This further accelerates the annotation process.
- Compared to the box-to-polygon techniques discussed above, our software allows selecting donut-style objects with just a click.
- Most importantly, as the amount of annotated data increases, our software allows automatic pixel-accurate annotation.
Even compared to basic bounding box annotation, which requires at least 2 precise clicks to annotate one object, we need only 1 approximate click within the segment, making it even faster than generating a bounding box.
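The general idea of one-click selection can be sketched in a few lines of Python. This is not our actual algorithm: it assumes that some segmentation step has already produced a label map (the hypothetical `segment_map` below), and it simply flood-fills the segment under the click. The function name `select_segment` is also made up for this illustration.

```python
from collections import deque


def select_segment(labels, click):
    """Return the set of (row, col) pixels in the segment under the click.

    Runs a 4-connected flood fill over pixels that share the clicked label.
    """
    height, width = len(labels), len(labels[0])
    row, col = click
    if not (0 <= row < height and 0 <= col < width):
        raise ValueError("click is outside the image")
    target = labels[row][col]
    selected = {(row, col)}
    queue = deque([(row, col)])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < height and 0 <= nc < width
                    and (nr, nc) not in selected
                    and labels[nr][nc] == target):
                selected.add((nr, nc))
                queue.append((nr, nc))
    return selected


# Hypothetical 5x5 segment map, assumed to come from a prior segmentation:
# 0 = background, 1 = ring-shaped object, 2 = the hole inside the ring.
segment_map = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

ring = select_segment(segment_map, (1, 2))  # one approximate click on the ring
print(len(ring))       # 8
print((2, 2) in ring)  # False: the hole is not selected
```

Any click that lands inside the ring returns the same 8 pixels, so the click does not have to be precise, and the hole is excluded without drawing a second shape.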
With one-click selection, we bring the cost of pixelwise annotation down to the level of bounding boxes, while making it possible to reach superhuman detection accuracy that is not reachable with bounding boxes.
Furthermore, since pixel-precise annotation doesn’t include noise, one would need at least 10x less data to reach a given level of accuracy compared to bounding box annotations.
As our software hits the mainstream (launching in June 2019), we expect that the demand for bounding boxes will eventually disappear. Pixel-accurate annotation will become the new norm.
4. About SuperAnnotate
We are a venture-backed team with investors including Berkeley SkyDeck, Plug and Play, and SmartGateVC (backed by Tim Draper). Our team consists of PhD researchers from top US, European, and Asian universities who came together to provide new approaches in the field of image and video annotation and make “human in the loop” tasks up to 100x more efficient at the highest level of accuracy.