⏳Human-in-the-loop for object detection with Supervisely and YOLO v3

Supervise.ly · Jan 7, 2019 · 7 min read

Manual data annotation is a bottleneck that greatly slows down the development of AI products. In this post we show how to leverage a pre-trained detection model to speed up the labeling process.

What we are fighting for is to minimize human labor spent on:

  1. Searching for images where objects of interest are present
  2. Putting a bounding box around each object

To be more concrete, let’s consider a self-driving related task. Suppose we need to label the following objects with bounding boxes:

  • Cars
  • Persons
  • Traffic lights

As an input, we have this video:

As an output, we expect to get a set of frames which:

  • Are diverse enough, since a lot of near-identical images in the training set will not help to achieve better performance
  • Contain a bounding box around each object of interest

To get the job done, we will walk through the following steps:

  1. Upload a video file
  2. Attach a machine with GPU on board
  3. Apply YOLO v3
  4. Keep classes of interest
  5. Evaluate performance
  6. Correct automatic detections

After steps 1–6 are completed, we describe ways to extend and scale up our human-in-the-loop pipeline.

Let’s start …

1. Upload a video file

First, from the Import page, just drag & drop a video file to process. At this stage, we also specify the “skip frame” parameter, which controls the diversity of the extracted frames. In our case, we set “skip frame” to 60, which means that a frame is extracted every 2 seconds (assuming the video was recorded at 30 fps).
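For readers curious what the “skip frame” setting amounts to, below is a minimal sketch of the same logic using OpenCV. It is an illustration only, not the Supervisely Videos plugin itself; the input file name drive.mp4 and the output folder are hypothetical.

```python
# Illustrative sketch of the "skip frame" setting: keep every 60th frame,
# which at 30 fps corresponds to one frame every 2 seconds.
import os
import cv2

SKIP_FRAMES = 60
os.makedirs("frames", exist_ok=True)

cap = cv2.VideoCapture("drive.mp4")  # hypothetical input video
frame_idx, kept = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % SKIP_FRAMES == 0:
        cv2.imwrite(f"frames/frame_{kept:06d}.jpg", frame)
        kept += 1
    frame_idx += 1
cap.release()
print(f"Kept {kept} of {frame_idx} frames")
```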

Figure 1. Video importing process.

After the video is imported, a new project and dataset are created containing the extracted frames (figure 1).

Note: if you are importing a video for the first time, you will need to go to the Explore page, click Plugins, and then click the Add button for the “Supervisely/Videos” plugin.

2. Attach a machine with GPU on board

Then, on the Cluster page, we need to attach a GPU machine to Supervisely. Just click Add agent and execute the generated command in your Linux terminal. After that, your GPU machine will be attached to Supervisely and available for computations (figure 2).

Figure 2. Attaching GPU machine to the platform

3. Apply YOLO v3

In order to apply the detection model to the extracted frames, we go to the Explore page and then to Models. In this post, we use YOLO v3 (pretrained on the COCO dataset) as the detection model, so we need to add it to my models (figure 3).

Figure 3. Adding YOLO v3 to my models

Now we can use the model to process all the extracted frames. We click the Test button, choose the source and target projects, keep the default inference configuration and run the task (figure 4).

Figure 4. YOLO v3 is applied to the frames extracted

After the task is completed, the target project contains extracted frames along with automatically detected objects.
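Inside Supervisely this step runs through the model plugin, but as a rough point of reference, a similar pass over one extracted frame could be reproduced with OpenCV’s DNN module and the publicly released Darknet YOLO v3 weights. The file names, the 416×416 input size and the 0.5 confidence threshold below are illustrative choices, not the platform’s defaults.

```python
# Hedged sketch: run the public YOLO v3 COCO weights on one extracted frame
# with OpenCV's DNN module (this is not how Supervisely invokes the model).
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
image = cv2.imread("frames/frame_000000.jpg")
h, w = image.shape[:2]

blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

detections = []
for output in outputs:
    for det in output:                 # det = [cx, cy, bw, bh, objectness, 80 class scores]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:           # illustrative threshold
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            detections.append((class_id, confidence,
                               (int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))))
print(f"{len(detections)} raw detections (before non-maximum suppression)")
```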

4. Keep classes of interest

Since our detection model was trained on the COCO dataset, it detects 80 COCO classes. In our case, we are only interested in detecting cars, persons and traffic lights, so we can go to the Classes page and remove all the classes we are not interested in.
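The class filtering itself is trivial to express in code. A minimal sketch, assuming detections are simple (class name, confidence, box) tuples rather than Supervisely’s internal annotation format:

```python
# Keep only the classes of interest; class names follow the COCO label list.
TARGET_CLASSES = {"car", "person", "traffic light"}

def keep_classes_of_interest(detections, targets=TARGET_CLASSES):
    """detections: list of (class_name, confidence, (x, y, w, h)) tuples."""
    return [d for d in detections if d[0] in targets]

filtered = keep_classes_of_interest([
    ("car", 0.91, (40, 60, 200, 120)),
    ("dog", 0.77, (300, 40, 80, 60)),        # dropped: not a target class
    ("traffic light", 0.58, (10, 5, 20, 45)),
])
print([d[0] for d in filtered])  # ['car', 'traffic light']
```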

5. Evaluate performance

Let’s look at the statistics first.

Figure 5. Number of images where target classes were detected

From Figure 5 we see that cars appear in 95% of images while persons appear in only 6.5%. Given that we are in Texas, this seems normal. On the contrary, the fact that traffic lights appear in 12.5% of images looks strange. More than likely, a number of these are false positive detections. The reason is that traffic lights are small objects, and detection errors for small objects tend to be higher.

Figure 6. Total number of objects for target classes

From figure 6 we can see the total number of objects detected for each class. The proportions between persons, cars and traffic lights roughly mirror the image-level counts from figure 5.

Finally, let’s take a look at the areas occupied by the objects (figure 7).

Figure 7. Area occupied by each class

We see that the largest area is occupied by cars (9% of total pixels). The area of persons (0.08%) is larger than the area of traffic lights (0.03%). These proportions seem reasonable.
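For reference, here is a hedged sketch of how statistics like those in figures 5–7 could be computed outside the platform, assuming a simple per-frame annotation dict of the form {frame_name: [(class_name, (x, y, w, h)), ...]} (an illustrative format, not Supervisely’s JSON schema):

```python
from collections import Counter

def dataset_stats(annotations, image_area):
    """Return (fraction of images containing each class, fraction of total
    pixels covered by each class), given per-frame box annotations."""
    images_with_class = Counter()
    pixels_per_class = Counter()
    for frame, objects in annotations.items():
        classes_in_frame = set()
        for class_name, (x, y, w, h) in objects:
            classes_in_frame.add(class_name)
            pixels_per_class[class_name] += w * h
        images_with_class.update(classes_in_frame)
    n_images = len(annotations)
    total_pixels = n_images * image_area
    image_fraction = {c: cnt / n_images for c, cnt in images_with_class.items()}
    area_fraction = {c: px / total_pixels for c, px in pixels_per_class.items()}
    return image_fraction, area_fraction
```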

Now let’s visually inspect the results. First, we will look at the cars (figure 8).

Figure 8. Visualisation of cars detection

As we can see, most cars are found and there are not too many false positive detections. However, in some cases, small cars were not spotted.

Next, we will take a look at the quality of person detection.

Figure 9. Visualisation of persons detection

It looks like more than 75% (a rough estimate) of persons are detected. It’s a good starting point, but manual labeling is still required to get high-quality training data.

Let me emphasise a very important point here.

It’s not that hard to put a box around an object. What is much harder is to identify the images where the boxes should be put. For our video, only 6.5% of extracted frames contain persons, and we were able to identify these frames just by applying the appropriate filter.

Generalising the previous point, we can say:

In general, the larger the dataset and the rarer the object class, the more challenging it is to collect a training set of the required size (without suffering from the class imbalance problem).
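In code, that filter amounts to selecting frames with at least one detection of the rare class. A minimal sketch, reusing the hypothetical annotation dict from the statistics example above:

```python
def frames_with_class(annotations, class_name):
    """Return names of frames containing at least one object of `class_name`."""
    return [frame for frame, objects in annotations.items()
            if any(cls == class_name for cls, _ in objects)]

# e.g. select the ~6.5% of frames that contain persons:
# person_frames = frames_with_class(annotations, "person")
```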

We still have one class left to visualise. Let’s take a look at traffic light detection. Recall that, according to the statistics calculated above, we expect to see a lot of false positives (figure 10).

Figure 10. Visualisation of traffic lights detection

Statistically, out of the 27 images where traffic lights were detected, 9 do not actually contain them (an image-level precision of roughly 67%). For the images that do contain traffic lights, more than 70% of the traffic lights are correctly detected.

The reason we provide these numbers is not to show that the detection model works great (it actually does not, and moreover it does not have to, since it was trained on a completely different dataset), but to illustrate the following idea:

Suppose you have terabytes of unlabelled videos, some detection model and a requirement to improve that model. What you can do is process all the videos with the current model and automatically identify frames with the objects of interest. As a nice bonus, you may expect that more than half of the objects will be automatically detected. So there is no need to watch the videos and label everything from scratch!

6. Correct automatic detections

The only thing left is to manually correct the automatically generated detections. The process is straightforward: we just add missing boxes or slightly adjust the generated ones. Figure 11 illustrates the procedure.

Figure 11. Correction process of automatically generated detections

Now that we have completed the human-in-the-loop pipeline, let’s say a few words about possible extensions.

Possible extensions

There are a number of ways to extend the pipeline described:

  • Use a more appropriate model. In our opinion, YOLO v3 is the best detection model available today. For this tutorial, we used a model trained on the COCO dataset. To achieve much higher accuracy on our video, models trained on datasets like Cityscapes or Mapillary should be used instead.
  • Apply sliding window inference mode. If we applied the detection model in a sliding window manner, the detection quality, especially for small objects, would increase significantly. Sliding window mode is available in Supervisely for all detection models (a minimal sketch of the idea follows this list).
  • Annotate at scale with Labeling Jobs. While human-in-the-loop techniques leverage AI to speed up the labeling process, Labeling Jobs make it possible for many workers to be simultaneously involved in annotating the same set of images.
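As an aside, the sliding window idea itself is easy to sketch (this is the general technique, not Supervisely’s built-in implementation): run the detector on overlapping crops and shift the resulting boxes back into full-image coordinates, so that small objects occupy a larger fraction of the network input.

```python
def sliding_window_detect(image, detect_fn, window=608, stride=480):
    """Run `detect_fn(crop) -> [(class_name, confidence, (x, y, w, h)), ...]`
    on overlapping crops and map boxes back to full-image coordinates.
    Simplified: the right/bottom edges may be only partially covered."""
    img_h, img_w = image.shape[:2]
    detections = []
    for top in range(0, max(img_h - window, 0) + 1, stride):
        for left in range(0, max(img_w - window, 0) + 1, stride):
            crop = image[top:top + window, left:left + window]
            for class_name, conf, (x, y, bw, bh) in detect_fn(crop):
                detections.append((class_name, conf, (x + left, y + top, bw, bh)))
    return detections  # in practice, follow with non-maximum suppression
```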

Conclusion

In this post, we have shown how the process of labeling with bounding boxes can be automated. We used the Supervisely platform to go all the way from a raw video to a diverse set of labeled images for a self-driving related task.

Build your AI product faster. Try Supervisely Community Edition for free or speak with us about an Enterprise solution for your business.

If you found this post interesting, then let’s help others too. More people will see it if you give it some 👏.
