How Cameras Are Learning To “See”

Raphaël de Courville · Published in NEEEU Tales · Apr 5, 2018 · 10 min read

What happens when artificial intelligence peers through the lens?

Remember the time when a phone was a device for making phone calls? Yeah, me neither… The smartphone has grown into a super versatile machine, and over time, its original purpose took a backseat to answering work e-mails, scrolling down your Instagram feed and playing Candy Crush.

Now the camera in your phone is going through a similar revolution. Yes, say goodbye to your boring camera app! The fairy godmother of artificial intelligence is bringing her to the ball and nothing will ever be the same.

“There are fundamental changes that will happen now that computer vision really works, (…) now that computers have opened their eyes.” — Jeff Dean, Head of Google Brain

Building artificial intelligence into the camera is not a new idea. With their powerful mobile processors, smartphone cameras already use computer vision to recognise faces, for example, but the scope of these features is relatively narrow and single-purpose. This is about to change.

Apple’s Portrait Mode uses face detection to selectively touch up the photograph
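As a taste of what even the “narrow” version of camera intelligence involves, here is a minimal face-detection sketch using the classic Haar cascade that ships with OpenCV. The image path is made up, and this is of course not how Apple implements Portrait Mode; it only illustrates the basic idea of finding faces in a frame.

```python
# Minimal face-detection sketch using OpenCV's bundled Haar cascade.
# Illustrative only: smartphone vendors use far more sophisticated models.
import cv2

# Load a pretrained frontal-face cascade that ships with opencv-python
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("portrait.jpg")             # hypothetical input photo
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces and draw a box around each one
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("portrait_faces.jpg", image)
```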

The iPhone already has a built-in AI chip. Future devices will come with dedicated hardware for machine vision. These chips will make computer vision faster and less of a drain on your battery, allowing designers and developers to further push the limits of the camera experience.

Machine vision has been making incredible progress over the last few years. We now have real-time algorithms that do a pretty amazing job at anything from recognizing objects to estimating human pose in a video.

Recognizing and labelling objects in real time with the YOLO library.
Real time 2D pose estimation
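For readers who want to see what running such a detector looks like in practice, here is a rough sketch using OpenCV’s dnn module with a pretrained YOLOv3 model. The config, weights, and class-name files are assumed to have been downloaded separately, and the input image is a placeholder; this is just one common way to run the detector, not the only one.

```python
# Sketch: object detection with a pretrained YOLOv3 model via OpenCV's dnn module.
# Assumes yolov3.cfg, yolov3.weights and coco.names were downloaded beforehand.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
with open("coco.names") as f:
    classes = [line.strip() for line in f]

image = cv2.imread("street.jpg")  # hypothetical input frame

# Prepare the image as a 416x416 normalized blob, as the model expects
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

# Run a forward pass through the network's output layers
layer_names = net.getLayerNames()
out_names = [layer_names[i - 1] for i in net.getUnconnectedOutLayers().flatten()]
outputs = net.forward(out_names)

# Print the labels of the most confident detections
for output in outputs:
    for det in output:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        if scores[class_id] > 0.5:
            print(classes[class_id], float(scores[class_id]))
```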

If camera intelligence were a dish, object recognition, gesture detection, and position tracking would be the raw ingredients. But the true power of camera intelligence boils down to something deeper: context. In other words, it gives the computer a rich understanding of what is going on right now.

Microsoft’s Seeing AI app narrates the world to blind people.

Soon, you will turn to your camera to find relevant information about your environment, and interact with the physical world too. Combining image understanding with other data like location and audio input, the camera app will turn into a powerful platform for search and contextual interaction.

Welcome to The Great Camera Awakening — the consumerization of advanced computer vision: the creation of platforms, software and services where the Camera comes to the forefront of user experience, and creates a bridge for the digital and physical world to co-exist. — Jacob Mullins, Shasta Ventures

To design products and services that take advantage of this new platform, we have to understand where the opportunities lie. Let’s have a look at some of the areas where camera intelligence will have the most impact.

“Hey Siri, Look…”

We get most of our information about our surroundings from visual cues (sighted people do, at least). If we want to give our devices a similar understanding of the world, the camera is an obvious place to start.

In Spike Jonze’s 2013 film Her, the protagonist Theodore Twombly (played by Joaquin Phoenix) builds a romantic relationship with an AI called Samantha (voiced by Scarlett Johansson). About halfway through the movie, Theodore takes Samantha to the beach to show her what the world is like. She looks through his smartphone’s camera, commenting on what she sees.

For now, virtual assistants feel less like a friendly sidekick you want to take on a sea escapade, and more like a clumsy helper you wouldn’t trust to keep an eye on your beach bag. Part of this is due to the heavy limitations of current language understanding, but another major roadblock is that Siri and her friends lack a sense of context. Imagine the questions you could ask if your virtual assistant could see what you see.

Google Lens uses computer vision to recognise objects, places, and more

With the introduction of Google Lens, Google acknowledged the importance of camera intelligence. Lens can identify plants, books, animals, and more. Though it is quite similar in scope to Google’s earlier Goggles app, the integration of Lens into Google Assistant means the recognized objects become part of a conversation. This enables more contextual interactions: for example, you can scan a music album and ask “who played drums on this one?”.

Opportunities

Creating meaningful interactions within the limitations of natural language processing and computer vision is still challenging today, but the technology is improving rapidly. Right now, camera chatbots can be a great fit for environments where the number of possible interactions is relatively limited, like parks, retail, or hospitality. In a shop, for example, a virtual shopping assistant could use the camera to identify products and answer questions about them, like “does this contain nuts?” or “what kind of wine would go well with that?”
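To make the idea concrete, here is a toy sketch of how a camera chatbot could turn a recognized product into an answer. The product database and the hard-coded label stand in for a real vision model and knowledge base; none of it reflects an actual product.

```python
# Toy sketch of a camera-driven shopping assistant.
# The recognised label would come from a vision model; here it is hard-coded.
PRODUCTS = {
    "granola_bar": {"contains_nuts": True,  "pairs_with": "a glass of milk"},
    "goat_cheese": {"contains_nuts": False, "pairs_with": "a dry Riesling"},
}

def answer(label: str, question: str) -> str:
    info = PRODUCTS.get(label)
    if info is None:
        return "Sorry, I don't recognise this product."
    if "nuts" in question:
        return "Yes, it contains nuts." if info["contains_nuts"] else "No nuts in this one."
    if "wine" in question or "go well" in question:
        return f"It would go well with {info['pairs_with']}."
    return "I'm not sure, but I can look it up for you."

print(answer("goat_cheese", "what kind of wine would go well with that?"))
```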

Real Augmented Reality… Really

In a previous article, I argued that Mobile Augmented Reality should be seen as a separate medium, with its own set of design principles and best practices. One thing I didn’t mention (as it was out of scope for that particular article) was how well augmented reality and artificial intelligence can work together.

The ongoing rebirth of mobile AR — through Apple’s ARKit and Google’s ARCore — is made possible by the same breakthroughs in computer vision and artificial intelligence mentioned in the introduction. The work that went into making AR work as well as it does on these devices is astounding, but just like voice assistants, what mobile AR is missing right now is a sense of context.

AERO app by NEEEU

With AR, your phone can make a virtual duck look like it is sitting in the middle of the street, but what if you wanted to show the rubber duck only when pointing your camera at a bathtub for example?

Camera intelligence will let apps recognize specific objects and summon the relevant AR overlay. It will be the glue that sticks relevant virtual content to physical landmarks, products, appliances, or locations (I discuss the broader case of visual positioning further down).
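In code, that glue can start out as something very simple: gate the virtual content on what the detector sees in the frame. In the sketch below, detect() and show_overlay() are placeholder stubs standing in for a real object detector and an AR rendering call (ARKit or ARCore would provide the latter); the overlay registry is equally hypothetical.

```python
# Sketch: summon an AR overlay only when the relevant object is in view.
# detect() and show_overlay() are placeholder stubs standing in for a real
# vision model and an AR rendering call.

OVERLAYS = {
    "bathtub": "rubber_duck.usdz",        # show the duck only near a bathtub
    "refrigerator": "grocery_list_panel",
    "washing_machine": "cycle_tutorial",
}

def detect(frame):
    """Placeholder: a real app would run an object detector on the frame."""
    return ["bathtub", "person"]

def show_overlay(asset, anchor):
    """Placeholder: a real app would attach the asset to an AR anchor."""
    print(f"Showing {asset} anchored to the {anchor}")

def update_ar_scene(frame):
    for label in detect(frame):
        asset = OVERLAYS.get(label)
        if asset:
            show_overlay(asset, anchor=label)

update_ar_scene(frame=None)
```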

In the future, Lens might be able to deliver information as an AR layer on top of the camera image

Opportunities

The combination of AR and AI is creating a new platform for interaction between mobile devices and physical spaces. Smart AR will bring digital content into the real world. Expect tutorials, ratings, and recommendations to make their way into physical spaces. Retail will certainly be among the first to take advantage of it, along with gaming and tourism.

Solving The Internet Of Things

“I love wall switches” may sound like the opening line to the most boring date ever, but hear me out. Wall switches are awesome. They are cheap, familiar, low-maintenance, and reliable. When it comes to remote control, anything more complicated than a wall switch needs a very good reason to exist.

Anyone working in the digital space knows the level of skill and determination needed to design smooth, two-dimensional experiences. When you move into our three-dimensional world, that complexity explodes. Potential interaction points increase exponentially. Then you add more than one person…

Connected devices are part of this complexity. Once you have installed the app, plugged in the hardware bridge, and configured everything… you have to start all over again for the next device. Where is the wall switch for the Internet of Things?

Using object recognition to summon a typical AR overlay, image by Vuforia

We already saw how your camera can recognize an object and overlay digital information on it, but why stop there? What if your camera could recognize any connected appliance and let you control it?

Real world example from Google: connecting to a WiFi router by reading the label

In the near future, phones could use their camera to recognise nearby appliances, and summon the appropriate interface on demand. The camera app would become a universal visual browser, displaying the right interface at the right moment. Think of this as a universal remote for the physical world.
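A minimal sketch of that flow might look like the following, using OpenCV’s built-in QR-code detector as a stand-in for richer appliance recognition; the device IDs and interface URLs are invented for illustration.

```python
# Sketch: point the camera at an appliance label and summon its interface.
# Uses OpenCV's QR detector as a simple stand-in for appliance recognition;
# the interface registry below is a hypothetical example.
import cv2

INTERFACES = {
    "router-ab12": "https://example.com/ui/router-setup",
    "washer-x9":   "https://example.com/ui/washer-controls",
}

image = cv2.imread("appliance_label.jpg")      # hypothetical camera frame
detector = cv2.QRCodeDetector()
device_id, points, _ = detector.detectAndDecode(image)

if device_id in INTERFACES:
    print("Opening control interface:", INTERFACES[device_id])
else:
    print("No known appliance found in view.")
```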

Reality Editor by MIT Fluid Interfaces

Imagine never having to read an instruction manual or deal with a clunky remote again. Just point your device at an unfamiliar appliance and get a friendly interface in your own language. No app, pairing, or configuration needed. Hey, it’s no wall switch, but it’s pretty darn close!

Not everyone shares my (admittedly slightly worrying) fondness for wall switches. I once worked for a summer in a building where the only way to close the blinds was to call IT on the phone and ask them to do it for you (they were on holiday… the blinds stayed open). To avoid this kind of issue, physical inputs should always be available as a fallback for essential functions. Advanced features can be reserved for users with the fancy hardware (think of this as progressive enhancement in the real world), while the core functionality stays accessible to all through tangible, minimal inputs.

Opportunities

This will be a new way to think about the relationship between hardware and software, and to create new kinds of products. Toy manufacturers are early adopters of such playful technology, and some, like Mekamon, have already started exploring the possibilities of mixed reality gameplay between physical robots and AR worlds.

Note how the particle effects follow the robot as it moves in this mixed reality gameplay video by Mekamon

Once we come to truly see the physical and the virtual layers of a product as one unique thing, we will start to create exciting new hybrids that blur the boundary between realities.

Visual Positioning

Satellite positioning is a fantastic technology, but accurate geolocation remains a big challenge even today. GPS is still fairly unreliable (location errors in cities can go up to 16.8 meters) and pretty much useless inside of buildings. Where will the next improvement in location accuracy come from? The solution, once again, will come from the camera.

Using a type of computer vision algorithm called SLAM (Simultaneous Localization and Mapping), a computer can look at a camera feed and triangulate the device’s position in three dimensions. This technique improves dramatically on the accuracy of GPS positioning. It is also more robust, and yes, it works indoors too.
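Full SLAM is far beyond a short snippet, but its geometric core (recovering the camera’s pose from known 3D points spotted in the image) can be sketched with OpenCV’s solvePnP. The landmark coordinates and camera intrinsics below are made-up illustration values, not data from a real map.

```python
# Sketch: the geometric core of visual positioning — recover the camera's
# position and orientation from known 3D landmarks matched in the image.
# Landmark coordinates and camera intrinsics are made-up illustration values.
import cv2
import numpy as np

# 3D positions of four known landmark points (e.g. corners of a mapped
# storefront sign), in metres, all lying in one plane
object_points = np.array([
    [0.0, 0.0, 0.0],
    [1.2, 0.0, 0.0],
    [1.2, 0.8, 0.0],
    [0.0, 0.8, 0.0],
], dtype=np.float32)

# Where those landmarks were detected in the camera image (pixels)
image_points = np.array([
    [320.0, 240.0],
    [480.0, 238.0],
    [478.0, 150.0],
    [322.0, 152.0],
], dtype=np.float32)

# Simplified pinhole camera intrinsics (focal length, principal point)
camera_matrix = np.array([
    [800.0,   0.0, 320.0],
    [  0.0, 800.0, 240.0],
    [  0.0,   0.0,   1.0],
])

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, None)
if ok:
    # tvec is the translation from world to camera coordinates
    print("Estimated camera translation:", tvec.ravel())
```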

Early demonstration of Google’s Visual Positioning System (VPS)

Unlike previous solutions, visual positioning doesn’t require any external infrastructure (markers, radio transmitters, satellites), which makes it infinitely scalable and flexible.

Visual positioning will be a vital part of the AR infrastructure. By building a shared 3D map of the world, it will allow creators to permanently link virtual experiences to unique locations, share them between multiple users, and trigger them on demand.

Opportunities

Smartphone geolocation enabled a whole wave of location-based startups like Uber, Foursquare, and Tinder. Visual positioning will not only help AR experiences integrate with real-world locations, it will also spawn a new generation of location-based services that take advantage of its robustness and accuracy to deliver highly personalized experiences.

Visual positioning will transform existing location-based services too. Imagine if Amazon Prime could deliver your package not to a fixed place, but to wherever you happen to be at the moment. With visual positioning, your phone will become your home address, and businesses will be able to deploy frictionless experiences that literally meet their clients where they are.

Beyond The Smartphone Camera

One interesting detail in Spike Jonze’s Her: Theodore uses a safety pin to prop up his phone at just the right height so that Samantha can get a good view of the world. This hints at a clear shortcoming of smartphone cameras: they spend most of their time in a pocket or a bag, blind to the world.

Wearable cameras could solve that, but they have had little success so far, whether as standalone devices like the defunct Narrative or built into glasses like Snap’s Spectacles. People think they are useless or creepy, sometimes both.

The Google Clips camera, image by the Verge

Now and again, tech giants test the water with products like Google Clips, a set-and-forget smart camera that chooses on its own when to take a picture. Though Clips is ostensibly marketed as NOT a wearable camera (probably to avoid repeating the PR disaster that was Google Glass), it only takes a little imagination to see how the form factor could work as a wearable.

The same qualities that make life-logging cameras creepy also make them an ideal platform for always-on, vision-based applications. With the rise of camera intelligence (and as the potential benefits become obvious), I predict the debate on wearable cameras in public spaces will make a comeback, even before AR headsets become a viable alternative to the smartphone and make the discussion moot with their multi-camera positioning.

Magic Leap’s AR headset looks like there was a 10 for 1 camera sale on Alibaba

Once camera intelligence takes over, a world of possibilities will open up, above and beyond what we’ve experienced so far. It will be an exciting and sometimes creepy world for sure, but a magical one too. You’ll have to see it to believe it.

🐸 Raphaël de Courville wrote this article. He is also a co-founder of NEEEU. You can find him on Twitter at @sableRaph or contact him directly at r@neu.io.

🤹🏻 Identifying which spatial technology has the potential to become a great product or service is a full-time job. It’s also not your job. Luckily for you, that’s exactly what we love doing at NEEEU. Wanna create delightful services and products that live in the real world? Get in touch at hello@neu.io
