High-Precision, Semi-Automatic Image Annotator by FORMCEPT

for Building High-quality Datasets to Train Object Detection Models

Ghanshyam Chodavadiya
FORMCEPT
7 min read · Nov 13, 2019


If you are looking to convert your images, videos, or 3D point clouds into pixel-perfect, accurately annotated training data, your search ends here.

Computer Vision is one of the most important emerging fields in AI. It cuts across diverse areas of science, engineering, and technology. At the engineering level, it mimics the human visual system to accomplish wonders. It spans a wide array of tasks such as image recognition, object detection, image segmentation, pose estimation, image generation, image super-resolution, and so on.

Among these, Object Detection is concerned with detecting and identifying key elements such as people and things in images, stock video footage, or live video streams. This is achieved by broadly dividing an image or video frame into objects, background, and noise, and then diving deeper into the segmentation of the objects.

For example, on a road, Object Detection could encompass applications like pedestrian detection, face recognition, vehicle detection, etc.

Image Source: https://github.com/facebookresearch/detectron2

Image Annotation for Object Detection:

When we look at an image, the human visual system can:

  • Detect the elements in an image
  • Detect where each element is located in a particular scene (in case it is moving, the location keeps changing)
  • Compare colour, size and shape across elements
  • Recognize the points of similarity between 2 or more images
  • Recognize the points of difference between 2 or more images
  • Understand the continuity in images, if any (for example, two frames of the same scene in a video captured at different points in time).

Image Annotation for Object Detection helps machines attain these capabilities by training an algorithm on an image database that is annotated and labelled with high precision. Image annotation is the process of marking images with identifiers/labels. The more complex the task, the more accurately annotated data is required for training. Further, the training needs to accommodate real-time feedback to reduce errors and augment the capability of the Object Detection algorithm.

How Object Detection Works:

Broadly speaking, an Object Detection algorithm produces two specific outputs:

  1. Object name
  2. Object position

The position of an object in an image can be shown by drawing a 2D or 3D bounding box around the object. The various ways in which images need to be annotated are listed below, followed by a sketch of what a single annotation record might look like:

  • Labelling objects and backgrounds using bounding boxes, polygons, cuboids, lines and points
  • Labelling the entire image without drawing boxes (e.g. generation of ‘alt-text’ for images by Microsoft PowerPoint)
  • Customizing labels and matching them to object attributes
  • Enabling search function across images (search by labels and attributes)
  • Enabling auto-suggest for quick selection of object names
  • Converting videos to non-redundant image frames
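
To make this concrete, a single annotated image is often stored as a label plus pixel coordinates for each object. The record below is a minimal, hypothetical sketch in Python; the field names are illustrative (loosely following the COCO-style [x, y, width, height] box convention) and are not FORMCEPT’s actual format:

```python
# A minimal, hypothetical annotation record for one image.
# Field names are illustrative; real formats (e.g. COCO) use similar structures.
annotation = {
    "image_id": "000042.png",
    "objects": [
        {
            "label": "pedestrian",
            "bbox": [412, 180, 36, 90],   # [x_min, y_min, width, height] in pixels
            "attributes": {"occluded": False, "truncated": False},
        },
        {
            "label": "car",
            "bbox": [128, 200, 220, 110],
            "attributes": {"occluded": True, "truncated": False},
        },
    ],
}
```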

2D object detection is used for use cases like searching for entities in multimedia files, vehicle detection, license plate recognition (LPR), pedestrian counting, digitization of handwritten text, and many more. A significant use case of identifying object name and position is the Autonomous Driving System. It is designed with the help of 3D Object Detection to scan and scope the various static and moving objects around the autonomous vehicle so that the vehicle can drive safely on the road. It helps the system understand the direction of, and the distance between, the self-driving car and other cars, pedestrians, traffic lights, and other traffic/non-traffic objects. Other use cases of 3D Object Detection include Parking Lot Management, Smart City Planning, Traffic Optimization, and Vehicle Lifecycle Management.

Various Data Formats in Image Annotation:

The common formats for images are .jpeg and .png. Medical images (X-ray scans, CT scans, MRIs) are often stored in more complex formats like DICOM. Common video data formats include .mov, .avi and .mp4.
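
As a rough sketch of how these formats are typically read into pixel arrays before annotation (assuming the Pillow, pydicom, and OpenCV libraries, which are common choices but not necessarily part of the stack described here):

```python
import numpy as np
from PIL import Image        # .jpeg / .png
import pydicom               # DICOM medical images
import cv2                   # .mov / .avi / .mp4 video

# Standard image formats decode directly to pixel arrays.
rgb = np.array(Image.open("scan.png"))

# DICOM files bundle pixel data with acquisition metadata.
ds = pydicom.dcmread("scan.dcm")
xray = ds.pixel_array

# Videos are read frame by frame; each frame is annotated like an image.
cap = cv2.VideoCapture("drive.mp4")
ok, frame = cap.read()       # frame is a BGR numpy array when ok is True
cap.release()
```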

High-quality machine learning models can only be created from high-quality training data that covers all possible scenarios and outcomes.

This is where we come in.

Use Case: Autonomous Vehicles

Now that the self-driving car is no longer science fiction, innovators across the world are competing to outsmart each other and build cars that make drivers truly redundant. Today, almost all autonomous vehicle companies are targeting level-5 autonomy (i.e. 100% driverless).

Autotech companies use a setup of LiDARs calibrated with cameras, working in sync, to perceive the world around the vehicle. As a result, these companies are seeing a huge surge in demand for data to train the deep learning systems built around these sensors. Over time, LiDAR sensors have also improved to capture this information in real time.

Calibration of LiDAR data points with the camera image
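
To make the calibration step concrete, here is a minimal sketch of projecting LiDAR points onto a camera image using KITTI-style calibration matrices. The matrix names follow the KITTI convention; treat this as illustrative, not as the production pipeline:

```python
import numpy as np

def project_lidar_to_image(points, Tr_velo_to_cam, R0_rect, P2):
    """Project Nx3 LiDAR points into pixel coordinates (KITTI convention).

    Tr_velo_to_cam: 3x4 rigid transform from LiDAR to camera frame
    R0_rect:        3x3 rectifying rotation of the reference camera
    P2:             3x4 projection matrix of the left color camera
    """
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])   # Nx4 homogeneous points
    cam = R0_rect @ (Tr_velo_to_cam @ pts_h.T)     # 3xN in rectified camera frame
    cam_h = np.vstack([cam, np.ones((1, n))])      # 4xN homogeneous
    img = P2 @ cam_h                               # 3xN image-plane coordinates
    img = img[:2] / img[2]                         # perspective divide -> pixels
    in_front = cam[2] > 0                          # keep points ahead of camera
    return img.T[in_front]                         # Mx2 pixel coordinates
```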

This LiDAR data is processed by a deep learning system, and deep learning models require a huge amount of annotated data for training. However, 3D annotation of images with the help of LiDAR data is a tedious task: it is manual, time-consuming, and expensive.

Image source: https://media.giphy.com/media/xT1XH1NoZlqskNb2fu/giphy.gif

Key Challenges

There are many tools in the market today for annotating LiDAR data and camera images, but they are inefficient and expensive because they depend entirely on human annotators. On the other hand, while automatic image annotation techniques can produce annotations through deep learning models and are relatively faster, they are often less accurate than annotations made by human annotators.

FORMCEPT’s Solution: A Self-Learning Semi-Automatic Annotator That Annotates Your Image and Video Dataset in Real-Time at Scale

As explained above, both human annotators and automatic annotation systems have their pros and cons. By combining the two approaches, we have devised a solution that incorporates the best of both worlds. Our semi-automatic image annotation uses LiDAR-based 3D Object Detection to help human annotators work faster by suggesting annotations to them. It also lets annotators give the system feedback on the suggested annotations, and the system self-learns from this feedback to further improve the accuracy of its suggestions.
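
The loop itself is simple to state. The sketch below captures the idea; `detector` and `review_fn` are hypothetical placeholders standing in for the detection model and the human review step, not an actual API:

```python
def annotate_with_feedback(detector, review_fn, samples, retrain_every=1000):
    """Semi-automatic loop: the model suggests boxes, a human verifies them,
    and the verified labels are fed back to improve the model.

    Hypothetical interfaces:
      detector.suggest(sample)  -> proposed 3D boxes for one sample
      detector.fine_tune(data)  -> updates the model from verified labels
      review_fn(sample, boxes)  -> boxes accepted/corrected by the annotator
    """
    verified, since_retrain = [], 0
    for sample in samples:                       # sample = (image, LiDAR points)
        suggestions = detector.suggest(sample)   # model-proposed annotations
        labels = review_fn(sample, suggestions)  # human accepts / corrects / adds
        verified.append((sample, labels))
        since_retrain += 1
        if since_retrain >= retrain_every:       # periodic self-learning step
            detector.fine_tune(verified)         # learn from annotator feedback
            since_retrain = 0
    return verified
```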

Self-Learning Semi-Automatic Annotation System

What Sets Us Apart: Faster and Continuously Learning Annotations with High Precision

All the human-annotated images are stored in the dataset, and our deep learning models learn from that data. We explored the many LiDAR-based 3D object detection research papers and identified the shortcomings of their methods, which led us to the approach most suitable for a semi-automatic annotator. The shortcomings of two representative methods are discussed below:

  • Complex-YOLO is a LiDAR-only 3D object detector, but it does not give good results in terms of bounding-box precision because it assumes a fixed object height for each class.
  • Frustum ConvNet combines 2D object detection with a pipeline of parallel frustums of LiDAR points. However, because it depends on the 2D detection result, it can suffer from error propagation.

We now evaluate our annotator on the KITTI dataset (used for research purposes only).

Camera Image:

Camera Image (Source: KITTI dataset)

3D Boxes:

3D bounding box on camera image (Source: KITTI dataset)

3D Boxes on LiDAR Points:

3D bounding box on Velodyne LiDAR points (Source: KITTI dataset)

Annotators spend most of their time drawing bounding boxes around objects, which is time-consuming, and they may not be able to spot tiny objects in the LiDAR point cloud. With our solution, annotators get suggested boxes for the objects, so they can verify those bounding boxes faster and make changes on the go if required; otherwise, they don’t have to spend time on them at all. Typically, annotators only need to make changes in scenarios with clutter, occlusion, or sparse LiDAR points on objects.

Evaluation:

We have evaluated our annotation system on the publicly available KITTI dataset. Here is the average precision (AP) for the three classes (car, pedestrian, cyclist) on the KITTI dataset’s easy, moderate, and hard samples; a sketch of the IoU check that underlies AP follows the results.

AP of 2D Bounding Box (on camera image)

AP of Bird’s Eye View Bounding Box (on LiDAR points)

AP of 3D Bounding Box (on camera image)
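
In KITTI’s protocol, a detection counts as correct when its intersection-over-union (IoU) with a ground-truth box exceeds a class-dependent threshold (0.7 for cars, 0.5 for pedestrians and cyclists in the 2D benchmark), and AP summarizes precision across recall levels. A minimal IoU computation for axis-aligned 2D boxes looks like this (illustrative only):

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0
```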

Time taken for one sample (image + LiDAR): 0.16 seconds

Hardware: 1 Tesla T4 GPU with 16 GB of GPU RAM
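
For reference, per-sample GPU latency of this kind can be measured along these lines (a sketch assuming a PyTorch model; `model` and `sample` are placeholders):

```python
import time
import torch

def time_one_sample(model, sample, warmup=5):
    """Wall-clock inference time for one (image + LiDAR) sample on the GPU."""
    with torch.no_grad():
        for _ in range(warmup):          # warm up CUDA kernels first
            model(sample)
        torch.cuda.synchronize()         # flush pending GPU work
        start = time.perf_counter()
        model(sample)
        torch.cuda.synchronize()         # wait for the forward pass to finish
        return time.perf_counter() - start
```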

Concluding Note:

FORMCEPT’s versatile, semi-automatic image annotator supports hundreds of classes and helps train your Computer Vision models accurately by incorporating feedback and continuously improving model precision at scale. You can also enable recursive feedback from multiple users to analyze and capture object details in a scene more accurately. Partner with us to shorten your R&D cycle and reduce the go-to-market time of your Computer Vision algorithms. We can also custom-build capabilities like semantic segmentation by pixel and super-pixel, based on your unique needs.

Sound interesting? Write to us at contactus@formcept.com and tell us about your requirements to get the ball rolling. Image annotation is just one of the many cutting-edge AI solutions that we offer at FORMCEPT. To know more about our flagship product MECBot, please visit: https://www.mecbot.ai/
