Fast Annotation Net: A framework for active learning in 2018
Motivation
Machine learning models, specifically deep learning-based approaches for computer vision, require training data. Here are two examples:
Training data, or annotations, are what a deep learning model uses as ground truth in order to learn.
On the left we have a bounding box, a common type of annotation. On the right we have two polygons. Polygons take longer to draw as they represent the exact shape of the object.
That’s not too hard, right? Well, what if you want 10,000 images, or 14.2 million?
There are services that offer to do it for you, such as Scale API. Their posted rate for a semantic segmentation image is $6.40 USD, so that 10k-image dataset will cost you $64,000.
Scale API represents the market rate to deliver a quality annotation. The cost will go up over time as the annotations required get more complex and the knowledge and experience required of the human annotator increases.
Data is the #1 roadblock to building machine learning models
As machine learning models become easier to train and computational power improves, training data becomes the #1 roadblock to building applied machine learning models.
What if there was a way we could use a machine learning model to help reduce that cost? Here’s an example:
On the left, an image that was annotated by a human.
In the center is an image predicted by FAN that may simply be marked as “correct”, with no further annotation needed. We just went from having to draw 4 boxes to reviewing the image and marking it complete.
On the right is an example where the green marker was missed, so the human adds the green marker label in. Here we only had to add 1 annotation, or 1/4 of the work.
Here we see the heart of it: FAN is learning alongside you. As you annotate, you train FAN, and FAN gets better over time, helping you annotate.
Before we take a deeper dive into how this works, let’s look at some of the prior art.
A brief history
Active learning, the concept of a human in the loop, has been around for a while, even in the specific context of machine learning: Cohn et al. published Active Learning with Statistical Models in 1996. According to them:
“The goal of machine learning is to create systems that can improve their performance at some task as they acquire experience or data. … This “passive” approach ignores the fact that, in many situations, the learner’s most powerful tool is its ability to act, to gather data, and to influence the world it is trying to understand. Active learning is the study of how to use this ability effectively.” (emphasis added)
Other approaches
More recently, the approaches generally fall into two buckets: specific and general.
On the specific side, Coarse-to-Fine Annotation Enrichment for Semantic Segmentation Learning by Luo et al. and Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++ by Acuna, Ling, Kar, et al. are two examples.
Let’s take a closer look at Polygon-RNN++. Their approach is interesting in that they appear to predict the polygon points directly instead of predicting a per-pixel mask.
This reminds me of the way an object detector predicts the minimum and maximum points that form a box. It can be “off” by a significant number of pixels and still get a good result, whereas if you had to have every pixel correct, it would be a lot more difficult.
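To make that point concrete, here’s a quick illustrative calculation (my own example, not from the paper): a predicted box whose corners are each off by several pixels still scores a high intersection-over-union against the ground truth.

```python
# Illustration of the tolerance argument above (not from the Polygon-RNN++ paper):
# a box prediction can be "off" by several pixels at each corner and still score
# a high IoU, whereas a per-pixel mask is penalized for every wrong pixel.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

ground_truth = (100, 100, 300, 300)
prediction = (105, 95, 310, 305)  # each corner off by 5-10 pixels

print(round(iou(ground_truth, prediction), 2))  # ~0.89, still a comfortable match
```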
There have also been general approaches, for example UC Berkeley’s BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling.
60% reduction in annotation time
Here’s an example of results from UC Berkeley’s active learning approach:
“As shown in Fig. 3(a), the object detector is able to label 40% of bounding boxes at a minimal cost. On average, the time of drawing and adjusting each bounding box is reduced by 60%.” — Yu et al.
To put the statistic into context, here’s an example of a FAN network trained on a small portion of the Cityscapes dataset.
The image on the left has no pre-labels; you have to annotate every box yourself. The image on the right was pre-labeled with FAN (shown in dashed lines).
How it works
- A user annotates images or video
- A network is trained. Inference is performed and the results are fed back into the annotation system
- The user reviews, corrects, and/or adds new content (see the sketch below)
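Here is a minimal sketch of that loop in Python. Everything below is hypothetical placeholder code rather than FAN’s actual implementation; train_model, predict, and human_review are stubs standing in for a real detector, storage, and annotation UI.

```python
# Minimal sketch of the annotation loop described above.
# All function bodies are placeholders; a real system plugs in an actual model,
# data storage, and an annotation interface.

def train_model(annotated):
    """Placeholder: train (or fine-tune) a detector on human-verified annotations."""
    return {"trained_on": len(annotated)}  # stand-in for a real model object

def predict(model, images):
    """Placeholder: run inference and return pre-labels for each image."""
    return {image: [] for image in images}  # empty box lists as a stand-in

def human_review(image, pre_labels):
    """Placeholder: the annotator accepts, corrects, or adds to the pre-labels."""
    return pre_labels  # in a real tool this comes from the UI

unlabeled = [f"image_{i}.jpg" for i in range(100)]
annotated = {}
batch_size = 10

while unlabeled:
    batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]

    # 1. Train on everything the human has verified so far (skip on the first pass).
    model = train_model(annotated) if annotated else None

    # 2. Run inference and feed the pre-labels back into the annotation system.
    pre_labels = predict(model, batch) if model else {image: [] for image in batch}

    # 3. The human reviews, corrects, and/or adds annotations.
    for image in batch:
        annotated[image] = human_review(image, pre_labels[image])
```

The key point of the cycle is the cadence: each round of human review produces more verified ground truth, which should make the next round of pre-labels better.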
Example use cases
For annotations:
- Continually refine an existing large scale model at a significant reduction in annotations needed.
- Use an already high performing model to handle the majority of existing classes while annotating only new classes.
- Use the network built through FAN for your own processes.
Fast Annotation Net is an important piece of the puzzle for reducing the cost of annotations.
Limitations and failure cases
- This is still a very new concept and requires a certain general level of machine learning knowledge to get good results
- There has been less research on the true effectiveness of human + computer ground truth data. It’s a very ill-posed question, so we may never have a definitive answer. (A comparison would be: does a software developer write better-quality code with tab-based autocomplete features? Since software quality is an ambiguous concept, this is hard to answer beyond a general “probably”.)
- The time to correct an annotation can sometimes be just as long as doing it in the first place, so in the worst case it’s about the same amount of work.
- It takes some time and compute resources to train a FAN network and run inference.
I demonstrated some of this work on May 24, 2018 at the LDV Vision Summit. Here’s the video:
I’m working on making FAN available to everyone through Diffgram.
If you are interested in participating in the beta, sign up here.
And if you are interested in working with me on this please reach out to me on LinkedIn.
Thanks for reading!