Image recognition is not enough

As with language, photos need contextual intelligence

Ken Ryu
Sep 22, 2016 · 9 min read

At a recent Deep Learning Investor Conference, a panel of VC and angel investors was asked to discuss the state of Artificial Intelligence (machine learning, neural networks, or big data if you prefer those buzzwords). There was an interesting debate on how well machines handle natural language processing and visual recognition. The consensus was that machines are now very good at visual recognition. For example, if you show a machine a photo of a cat sitting on top of a turtle, the machine will recognize the green-shelled animal as a turtle, and the fuzzy, four-legged animal as a cat. Where the debate got spirited was over whether machines have cracked the natural language comprehension challenge.

After some back-and-forth, the panelists clarified the argument. They ended up agreeing that machines are quite adept at transcription, to the point where machines will soon outperform humans in this task. In the case of language comprehension, however, machines still have a long way to go.

Panelist Leonard Speiser used the peanut butter and jelly challenge to articulate the problem with a machine’s language comprehension.

Computer, put the peanut butter on the piece of bread.

The computer does this:

Make the same request of a 7-year-old. They will know exactly what you want. They may not do it, but that’s another problem altogether.

Let’s move on. The focus of this post is not the challenge of natural language comprehension. The goal is to show that, like language, image recognition requires more than simple object isolation and identification.

Object recognition, but what about context?

Let’s take a look at the above photo. If we were to feed a machine this photo, what does the machine see?

  • 5 young children (2 girls, 3 boys),
  • possibly the names of the children, if it can match their faces against other photos it has learned from (Awesome!),
  • different colored balloons,
  • different colored mugs,
  • a plate of food,
  • hats on the kids,
  • a background with a purple letter P, a yellow letter P, a green letter B, a yellow letter R, and a blue letter T.

Now let’s contrast that with how a human sees this photo.

  • It’s a picture from a birthday party.
  • The obscured sign in the back obviously says “Happy Birthday”.
  • The plate of food, although blurry, is most likely a plate of cookies.
  • The kids are smiling and seem to be having fun.
  • Although it is somewhat difficult to tell for sure, there appear to be ten colored balloons showing.
  • The tiara on the red-headed girl is difficult to make out, but it says “Happy Birthday”.

Perhaps the machine could fill in some of the blanks to guess the sign in the back says “Happy Birthday”, but if the machine is taking the images literally, it will only register “P P” “B R T” clearly.
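One way a machine could move from the literal “P P” and “B R T” toward “Happy Birthday” is to check which known phrases are consistent with the visible letters. The sketch below is only an illustration of that idea, not a description of how any real recognition system works. The candidate phrase list and the function names are assumptions made up for this example.

```python
def is_subsequence(fragment, word):
    """Return True if the letters of fragment appear in word in the same order."""
    remaining = iter(word)
    # `ch in remaining` consumes the iterator up to the first match,
    # so each letter has to be found after the previous one.
    return all(ch in remaining for ch in fragment)


def consistent_phrases(visible_groups, candidates):
    """Keep the candidate phrases that could contain every visible letter group."""
    matches = []
    for phrase in candidates:
        words = phrase.upper().split()
        # Require one word per visible letter group, and each group must be
        # an in-order subsequence of its word.
        if len(words) == len(visible_groups) and all(
            is_subsequence(group, word)
            for group, word in zip(visible_groups, words)
        ):
            matches.append(phrase)
    return matches


# Hypothetical candidate list; a real system would need a far larger vocabulary.
CANDIDATES = ["Happy Birthday", "Happy Holidays", "Merry Christmas", "Good Luck"]

# Letters read literally from the partially hidden sign.
visible = ["PP", "BRT"]
print(consistent_phrases(visible, CANDIDATES))
```

With this tiny list, only “Happy Birthday” survives the check. The balloons, hats, and smiling kids are the kind of context a human uses to narrow the candidates in the first place.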

Does not compute

Ask the computer what it sees in the image above, and it might answer: four missiles taking off and one cat. Ask a human, and they will see the logic problem immediately. You might argue that computers are better than humans at detecting fake images by analyzing pixel patterns. That is true. However, as with the peanut butter and bread problem, if the computer’s task is to tag major items, it will go about the task without considering the absurdity of the image in its entirety.

Depth perception

This photo has more going on than just a bride and her groom. A human will see this image and understand that the shot has been set up to take advantage of the angle and the position of the couple to create this optical illusion. Our understanding of the relative heights of the man and the woman allows us to decode this image.

The implied invisible elements

Wind, heat, and cold are camera shy. A human can look at a picture like the above and see more than a series of trees, bushes, a body of water, a road, and a building in the distance. The human sees a gusty windstorm.

Panicked groom, bride and best man? Maybe, but it looks more like the couple got hitched on an unseasonably hot day. Outdoor wedding? Look at the best man’s shadow. Looks like an outdoor wedding to me. The first names of the bride and groom likely start with “S” and “N”. Did the computer get all that info?

Object permanence problem

Peek-a-boo. Now you see me, now you don’t. Babies lack object permanence. With time, they learn that mommy didn’t magically disappear behind some hands. Once they gain object permanence, the game is no longer as fun. For still images, object permanence is not a problem for computers. Once you introduce videos or a series of photos into the mix, object permanence becomes important.

Nowhere is object permanence more literally a matter of life and death than in self-driving vehicles. Humans use peripheral vision and a sense of object permanence to detect changes in our driving environment. A squirrel we see out of the corner of our eye, seemingly out of harm’s way, suddenly decides to scramble across the road. We notice the critter changing its course, and we tap the brakes and swerve to the right to allow our bushy-tailed neighbor another day to collect acorns. Identifying objects frame by frame is not enough in this situation. The objects that could present a hazard need to be tagged and tracked until they are no longer a threat to or from the vehicle. The 4-year-old chasing a baseball across the street is the edge case that is sure to keep self-driving engineers up at night unless they are able to account for the object permanence challenge.
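The tag-and-track idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any production self-driving system works. Detections are assumed to arrive as (x, y) points, one list per frame, and the names `SimpleTracker`, `max_distance`, and `max_missed` are invented for this example. The point is that a track survives a few frames without a detection, so an object that slips out of view is not forgotten.

```python
import math


class Track:
    def __init__(self, track_id, position):
        self.track_id = track_id
        self.position = position  # last known (x, y)
        self.missed = 0           # consecutive frames without a detection


class SimpleTracker:
    """Greedy nearest-neighbor tracker that remembers objects that drop out of view."""

    def __init__(self, max_distance=50.0, max_missed=5):
        self.max_distance = max_distance  # assumed matching radius, in pixels
        self.max_missed = max_missed      # frames to keep an unseen object alive
        self.tracks = []
        self._next_id = 0

    def update(self, detections):
        """Match this frame's (x, y) detections to tracks and return the live tracks."""
        unmatched = list(detections)
        for track in self.tracks:
            # Find the closest detection that no earlier track has claimed.
            best = min(
                unmatched,
                key=lambda d: math.dist(d, track.position),
                default=None,
            )
            if best is not None and math.dist(best, track.position) <= self.max_distance:
                track.position = best
                track.missed = 0
                unmatched.remove(best)
            else:
                # Not seen this frame: keep the track instead of deleting it.
                track.missed += 1
        # Forget only the objects that have been gone for too long.
        self.tracks = [t for t in self.tracks if t.missed <= self.max_missed]
        # Anything left over is a newly seen object.
        for detection in unmatched:
            self.tracks.append(Track(self._next_id, detection))
            self._next_id += 1
        return self.tracks


tracker = SimpleTracker()
tracker.update([(100, 200)])            # squirrel first seen
tracker.update([])                      # hidden behind a parked car
tracks = tracker.update([(110, 205)])   # reappears nearby
print(tracks[0].track_id, tracks[0].missed)
```

The squirrel keeps the same track ID across the frame where it was hidden, which is the behavior frame-by-frame identification alone does not give you. A real tracker would also predict motion and use a more careful matching step than this greedy one.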

Peripheral vision

You are finally making time on the commute home on the 280 when you notice the cars to the side of you rapidly slowing down. The car you are following has yet to register any slowdown, but your spider-senses are telling you that there is trouble up ahead. By tracking the vehicles to the side, you instinctively begin your slowdown and cross your fingers that the vehicle you are following can hit the brakes in time to avoid a collision with the vehicles ahead. Self-driving cars need to do more than track the vehicle they are following, in case that lead car is driving carelessly. By analyzing the entire flow of traffic near, far, and in the periphery, the vehicle can better predict dangerous slowdowns and obstacles regardless of whether the lead vehicle is making the proper decisions.
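As a rough sketch of what watching the whole flow of traffic could mean in code, the example below decides to start slowing down when enough neighboring vehicles are braking hard, even if the lead car is not. Everything here is an assumption for illustration: the speed histories are presumed to come from some upstream tracking system, and the function names, sampling interval, and thresholds are made up rather than taken from any real vehicle.

```python
def mean_deceleration(speed_history, dt):
    """Average deceleration in m/s^2 (positive when slowing) over evenly spaced samples."""
    if len(speed_history) < 2:
        return 0.0
    elapsed = dt * (len(speed_history) - 1)
    return (speed_history[0] - speed_history[-1]) / elapsed


def should_pre_brake(lead_history, side_histories, dt=0.5, threshold=2.0, min_side=2):
    """Return True if the lead car, or enough cars in adjacent lanes, are braking hard.

    dt is the assumed time between speed samples in seconds.
    threshold and min_side are illustrative values, not tuned ones.
    """
    lead_braking = mean_deceleration(lead_history, dt) >= threshold
    side_braking = sum(
        mean_deceleration(history, dt) >= threshold for history in side_histories
    )
    return lead_braking or side_braking >= min_side


# The lead car holds a steady 30 m/s while two cars to the side slow sharply.
lead = [30.0, 30.0, 30.0]
sides = [[30.0, 27.0, 24.0], [29.0, 26.0, 22.0], [30.0, 30.0, 30.0]]
print(should_pre_brake(lead, sides))
```

A system that only watched `lead` would see nothing wrong in this example. The neighbors are what give the warning.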

Tesla case studies

Case 1: Joshua Brown fatal accident (Florida), May 7, 2016

Let’s consider the two highly publicized fatal Tesla accidents from early this year. In the case of Joshua Brown, a trailer was crossing Highway 27 in Florida. The Tesla Autopilot did not properly identify the trailer or calculate its height. It guessed the trailer was a highway overpass. As a result, the vehicle continued on its path without slowing down until it collided with the trailer. In this scenario, both the object identification and the depth perception of the cameras failed. The object permanence of the trailer, or lack thereof, was also misinterpreted. Consider the human reaction to the trailer blocking the freeway.

  1. Object identification. That looks like a trailer. I better slow down so I don’t run into it.
  2. Object permanence. Last time I looked ahead, I don’t recall seeing that object. Is that a moving object? I better slow down.
  3. Peripheral vision. Was there a movement on my left periphery a short while ago? Is that object a moving vehicle?
  4. Depth perception. That object seems to be right in my path right ahead. I better slow down.

Case 2: Gao Yaning fatal accident (China), January 20, 2016

In this case, it appears the Tesla was traveling in Autopilot mode, and was following a vehicle. The vehicle directly ahead changed lanes to avoid a street-sweeper. The Tesla continued in its lane and crashed into the street-sweeper without slowing down.

In this incident, the Tesla seemed to place too much faith in the vehicle it was directly following. Once that vehicle changed lanes, the Tesla was unable to recognize and respond in a split second to the object in its path.

It is less clear in this case than in the Florida accident whether a human driver would have avoided the fatal collision.

A human driver may have better scanned the horizon ahead of the vehicle in front and may have seen the street-sweeper. If there was traffic (unclear in the report), the human may have noticed a slowdown as the vehicles passed the street-sweeper. As the vehicle in front abruptly changed lanes, the human response would be to go on high alert. The dramatic action signals to the driver that something is very wrong, and that an immediate braking or lane-change action is required. Even if the driver was not able to avoid the collision, they would have recognized the object and attempted to brake ahead of the collision.

Since Tesla’s Autopilot depends heavily on the vehicle directly in front, the software should maintain a safe minimum following distance that allows the system to brake in time in a scenario like the accident in China. A sudden deceleration or abrupt lane change by the lead vehicle, coupled with a possible object in the pathway, should put the Autopilot on high alert and trigger a braking or lane-change decision.
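To put a rough number on a safe minimum following distance, the sketch below uses the standard constant-deceleration stopping formula: distance covered during the reaction delay plus v²/(2a) of braking. The default reaction time, deceleration, and car length are illustrative assumptions, not Tesla figures, and the worst case assumed here is a stationary obstacle sitting right where the lead car was when it swerved.

```python
def stopping_distance(speed_mps, reaction_time_s=0.5, decel_mps2=7.0):
    """Meters needed to stop from speed_mps, assuming constant deceleration.

    reaction_time_s and decel_mps2 are assumed values for illustration only.
    """
    reaction_distance = speed_mps * reaction_time_s
    braking_distance = speed_mps ** 2 / (2 * decel_mps2)
    return reaction_distance + braking_distance


def min_follow_gap_in_car_lengths(speed_mps, car_length_m=4.5, **stop_kwargs):
    """Following gap, in car lengths, that leaves room for a full stop."""
    return stopping_distance(speed_mps, **stop_kwargs) / car_length_m


speed = 30.0  # m/s, about 108 km/h
print(f"stopping distance: {stopping_distance(speed):.1f} m")
print(f"gap: {min_follow_gap_in_car_lengths(speed):.1f} car lengths")
```

Under these assumptions, a car at highway speed needs a gap of well over a dozen car lengths to stop for an obstacle that is revealed only when the lead car moves aside. That is why a sudden lane change ahead should count as a warning in its own right.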

NOTE: The results of the investigation into the Chinese fatality have been limited. There is still some uncertainty about whether the Autopilot was engaged or the vehicle was under human control. The reason it is believed to be an Autopilot accident is that a complete lack of any braking action is a very un-human reaction for this type of accident.

Image context and holistic comprehension are a work in progress

Computer image recognition has improved to the point where objects are properly recognized at rates in the high 90% range. A computer’s ability to check a human face against a library of millions of tagged photos is impossible for a human to replicate. Image recognition technology is being applied to advance medical research, power self-driving cars, detect fraud, and fight terrorism. The results are encouraging, and are only getting better. Still, as with natural language, we are at a point where humans are better than machines at making sense of the data. As we program learning algorithms for machines to decode dynamic image threads and non-visual clues, machines will eventually outperform humans in interpreting the context of images and videos. When that day comes, the applications for computer automation will be limitless.

Emergent // Future

Exploring frontier technology through the lens of artificial intelligence, data science, and the shape of things to come

Written by

Ken Ryu

CEO & Founder. Gogocater.
