When talking about AI (particularly about Computer Vision), I spend half of my time saying how much the field has progressed in the last few years, and the other half debunking and diminishing what’s possible today.
Recently, I came across an article by Pete Warden showing a plant disease classifier. It seems very accurate at detecting different types of diseases that the human eye would struggle to tell apart; however, when used on random non-plant pictures, it fails spectacularly in ways a human never would.
It seems that the capabilities of Computer Vision systems are usually very different from those of human intelligence, and this is what I decided to illustrate with a little quiz.
Here are 5 different Computer Vision problems. Try to guess which ones are easily solvable: can you tell what current AI systems are good at today?
Detect Diabetic Retinopathy
Input: Well-constrained picture of the retina
Output: 5 classes (healthy, and different stages / forms of the disease)
Webcam gesture recognition
Input: Short video sequence taken from a Webcam
Output: most probable action among 25 classes
Note: since then TwentyBN has released much richer datasets
Handbag detection on Instagram
Input: Picture from Instagram
Output: Bounding box around the handbag(s)
Pedestrian detection from camera
Input: Fixed Camera taking pictures
Output: Bounding boxes around pedestrians
Robotic Object Grasping
Input: Two images (see above) from fixed camera
Output: Robot Control Policy
Disclaimer: If you disagree with the answers I provide to these questions, I’m happy to discuss, as there is still plenty to learn in this field and I don’t think I have all the answers as of now!
Diabetic Retinopathy: Classification should be fairly Easy, as the problem is well constrained in terms of input and output (Google claims really good performance in their blog post). Difficulties arise when putting such a system into production: the UX and the way you handle the interaction with the doctor are key, as there could be a huge imbalance between the different classes.
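To make the class imbalance point concrete, here is a minimal sketch of why a single accuracy number can mislead once such a system is in production. The numbers are made up for illustration, and I collapse the 5 classes into two ("healthy" and "disease") to keep it short; `accuracy` and `recall_per_class` are hypothetical helper names, not part of any library.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred):
    """For each true class, the fraction of its examples that were found."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Illustrative screening set (assumed numbers): 95 healthy retinas, 5 diseased.
y_true = ["healthy"] * 95 + ["disease"] * 5
# A useless model that never flags anyone.
y_pred = ["healthy"] * 100

print(accuracy(y_true, y_pred))          # 0.95
print(recall_per_class(y_true, y_pred))  # {'healthy': 1.0, 'disease': 0.0}
```

The model looks 95% accurate while missing every sick patient, which is why per-class numbers (and how they are presented to the doctor) matter more than the headline figure.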
Webcam Gesture Recognition: The problem is rather well defined, but the variability makes it quite hard: webcam videos show people at varying distances, performing gestures of varying duration, etc. Furthermore, analyzing video brings its own engineering difficulties for training. I’d say this problem is Quite Hard but solvable.
Handbag detection on Instagram: The problem seems easy and solved, but the input domain is open/unconstrained (Instagram) and the class definition is wide (handbag could mean practically anything, there are no clear visual patterns associated with handbags). This makes this problem unexpectedly Very Hard, see for yourself…
If you put yourself in the model’s shoes, this is perfectly legitimate and expected: our training data obviously does not include “axe” pictures as negative handbag images, and the axe’s head fits quite well with the representation of a handbag that the model must have learnt. It is brownish, has a plausible handbag shape and size, and is held in the hand.
Are we doomed? No, it’s possible to solve this by active learning, which here means relabeling the model’s incorrect predictions and feeding these examples back into training. But it’s a hell of a challenge to make current techniques work perfectly in open domains such as Instagram.
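The relabel-and-retrain loop described above can be sketched in a few lines. This is only a toy under loud assumptions: the "model" is a nearest-centroid classifier I swapped in for a real detector, the three features (brownness, held in hand, long wooden handle) are invented for the axe example, and `fit_centroids`, `predict` and `feedback_round` are hypothetical names.

```python
from statistics import fmean

def fit_centroids(examples):
    """Toy 'training': average the feature vectors of each label."""
    rows_by_label = {}
    for features, label in examples:
        rows_by_label.setdefault(label, []).append(features)
    return {
        label: tuple(fmean(column) for column in zip(*rows))
        for label, rows in rows_by_label.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def feedback_round(train_set, reviewed_items):
    """Add every reviewed item the current model got wrong to the training set.

    reviewed_items holds (features, human_label) pairs, i.e. predictions
    that a person has already checked and relabeled.
    """
    centroids = fit_centroids(train_set)
    mistakes = [
        (features, label)
        for features, label in reviewed_items
        if predict(centroids, features) != label
    ]
    return train_set + mistakes, mistakes

# Assumed features: (brownness, held_in_hand, long_wooden_handle).
train_set = [
    ((0.9, 0.9, 0.0), "handbag"),
    ((0.8, 1.0, 0.0), "handbag"),
    ((0.1, 0.0, 0.0), "other"),
    ((0.2, 0.1, 0.0), "other"),
]
# Axe pictures, relabeled by a human as "other" after the model got them wrong.
axes = [((0.8, 0.9, 1.0), "other"), ((0.9, 0.8, 1.0), "other")]

train_set, first_mistakes = feedback_round(train_set, axes)
_, second_mistakes = feedback_round(train_set, axes)
print(len(first_mistakes), len(second_mistakes))  # 2 0
```

Before the feedback round, nothing in the training data says that a long wooden handle matters, so both axes land on the handbag side. Once they are fed back as negatives, the same pictures are classified correctly. In an open domain the hard part is that the list of such surprises never really ends.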
To us humans, diagnosing diabetic retinopathy sounds really difficult, while telling an axe from a handbag seems obvious. That’s mainly because knowing what an axe is belongs to the common knowledge about the world that we all share, and that knowledge lies beyond the data presented to the system.
Disclaimer: that’s part of what I do at Heuritech!
Pedestrian Detection from Camera: The problem is rather Easy: the input domain is quite constrained (fixed camera), and the class (pedestrian) is quite standard. There will be problems related to occlusions, but overall the problem is easily solvable (you could even do it without Deep Learning). However, change the problem scope slightly and it can become much harder: if the camera is moving (on a robot, in a car…), or if there are several points of view, angles or scales, the problem becomes more and more open and difficult.
Robotic Object Grasping: This problem is Very Hard. It goes beyond a standard classification or regression problem, as the output is a robotic policy, usually trained using reinforcement learning, which is much less mature than supervised learning. Moreover, objects vary in size and shape, and the way you have to grasp them may require semantic understanding. So even though a 2-year-old child solves this problem in a much less restrictive setting (the camera here is fixed and the background is always the same), we are still a long way from solving it!
Expectations in Computer Vision & AI
The notion of “difficulty” is very different for a Computer Vision system and for us humans. This is one of the main reasons for wrong expectations in the field of AI.
Engineers and researchers have to be realistic and educational about the performance of systems in open domains.
As the quiz shows, it is hard to judge the real progress of AI systems today. Take Autonomous Driving for instance: there is a vast difference between being able to drive in well-constrained conditions (e.g. motorways) and driving in an open domain, under any conditions (e.g. within cities, on small roads, …). Most of the industry today evaluates autonomous driving progress based on the number of miles driven without alerting the driver. This creates an incentive to put the cars in easy conditions, while we should instead have metrics that focus on broadening the scope in which they can operate correctly.
More generally, I think it’s time we stop saying generic nonsense such as “computer vision is solved”.
Very narrow problems may be solved, provided you have enough labeled data and well-defined, constrained classes. But incorporating commonsense knowledge of the world into computer vision systems is still a wide-open challenge.
Plenty of researchers acknowledge this, and several interesting research fields are booming these days, such as Visual Reasoning & Grounding, Laws of Physics Discovery, Representation learning through unsupervised/self-supervised learning, or even reinforcement learning tasks starting from raw pixels (if interested, see references below).
Finally, this was about Computer Vision, as this is where I have the most experience, but I believe the same reasoning applies to any area of Machine Learning, especially NLP based on Deep/Machine Learning.
Charles Ollion is CoFounder & Head of AI at Heuritech. He also teaches Deep Learning at Master Datascience (Ecole Polytechnique/Paris Saclay) and EPITA. Thanks to Hedi Ben Younes and Alexandre Ramé for helpful comments!
Bonus: Evaluate the complexity of your Computer Vision problem!
Answer 6 questions and count your complexity points! Disclaimer: The points awarded are arbitrary and based on my experience, and I simplified a lot. The survey is more accurate for simple CV tasks.
Computer Vision Evaluation Survey
- Between 1 and 20 points: Doable! Includes Diabetic Retinopathy (18) and Pedestrian detection (14)
- Between 20 and 30 points: Will require strong tinkering and a good amount of work to put into production. Includes Webcam Gesture detection (29)
- Between 30 and 45 points: Will require a strong team including both engineers and research engineers, huge datasets and a lot of time. Includes Handbag detection on Instagram (42)
- Between 45 and 60 points: An open research problem and/or a real engineering challenge. For instance, Robotic Grasping, Visual Reasoning or hard VQA problems could score > 45
- More than 60 points: Wait a few years before expecting to put a good system into production. Typically, fully autonomous driving would score > 80
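If you want to turn a point total into one of the bands above programmatically, a minimal sketch could look like this. The band edges overlap in the list (20 appears in two bands), so treating each upper bound as belonging to the lower band is my assumption here; `verdict`, `UPPER_BOUNDS` and `VERDICTS` are hypothetical names.

```python
import bisect

# Upper bounds of the first four bands. Assumption: a score exactly on a
# bound (20, 30, 45, 60) falls in the lower band.
UPPER_BOUNDS = [20, 30, 45, 60]
VERDICTS = [
    "Doable",
    "Strong tinkering and a good amount of work",
    "Strong team, huge datasets and a lot of time",
    "Open research problem and/or real engineering challenge",
    "Wait a few years",
]

def verdict(points):
    """Map a complexity score (1 or more points) to its difficulty band."""
    if points < 1:
        raise ValueError("the survey awards at least 1 point")
    # bisect_left counts how many upper bounds are strictly below the score.
    return VERDICTS[bisect.bisect_left(UPPER_BOUNDS, points)]

for task, points in [
    ("Diabetic Retinopathy", 18),
    ("Pedestrian detection", 14),
    ("Webcam Gesture detection", 29),
    ("Handbag detection on Instagram", 42),
]:
    print(f"{task} ({points}): {verdict(points)}")
```

The four scores quoted in the list land in the "Doable", "Doable", "Strong tinkering" and "Strong team" bands respectively, matching the examples given for each band.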
References
CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. Justin Johnson, et al. CVPR 2017.
Discovering causal signals in images. David Lopez-Paz, et al. CVPR 2017.
Interaction networks for learning about objects, relations and physics. Peter Battaglia, et al. NIPS 2016.
Iterative Visual Reasoning Beyond Convolutions. Xinlei Chen, et al. arXiv preprint, 2018.
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. Mehdi Noroozi, Paolo Favaro. ECCV 2016.
World Models. David Ha, Jürgen Schmidhuber. 2018.