Computer Vision is the next frontier for AI Home Assistants

Kuleen Mehta
6 min read · Jul 4, 2017


Advances in Speech Recognition and Natural Language Understanding have contributed to the popularity of Home Assistant devices like the Amazon Echo and Google Home. They have brought us closer to realizing the vision of ubiquitous computing that Mark Weiser and others put forth nearly three decades ago. This vision has regained popularity recently as Ambient Computing.

New input modalities like touchscreens (Echo Show, Essential Home) will lead to a rapid increase in the number of applications and capabilities Home Assistants support. However, there is a range of useful services that Home Assistants can't provide without the integration of Computer Vision (CV). This will require distributed video cameras in the home, along with significant advances in AI algorithms for video understanding, action labeling, and face and object recognition.

Use Cases for Vision based Home Assistants

There is virtually no limit to what developers can build when new AI capabilities are exposed to them. This is apparent in the thousands of skills/actions that are already available for voice-only Home Assistants. That said, below are a few categories of use cases I think will be most compelling for CV.

Smart Capture

This could be the killer app for families with kids or pets. One challenge is capturing the priceless moments when they do or say amazing things (e.g. first steps, first words). Distributed cameras in the home will make capturing these moments easier. AI can help by adjusting the camera to focus on the action. Video understanding can also help by automatically tagging the segments so they can be identified easily (for storage/sharing) within a transient (daily/weekly) video stream.
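As a rough sketch of how that transient-stream idea could work (the action labels and save_clip helper here are hypothetical, in Python), a rolling buffer plus an interest filter might look like this:

```
from collections import deque

BUFFER_SECONDS = 30
FPS = 15
frame_buffer = deque(maxlen=BUFFER_SECONDS * FPS)   # rolling window of recent frames

INTERESTING = {"first_steps", "first_words", "pet_trick"}   # hypothetical action labels

def save_clip(frames, tags):
    # Placeholder: write the clip and its tags to local storage for later sharing.
    print(f"saved a {len(frames)}-frame clip tagged {sorted(tags)}")

def on_frame(frame, labels):
    """Called for every frame with the labels a video-understanding model produced."""
    frame_buffer.append(frame)
    hits = INTERESTING & set(labels)
    if hits:
        save_clip(list(frame_buffer), tags=hits)
```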

In Home Safety

There are few options today for intelligently monitoring and keeping children or seniors safe at home. Vision systems can help prevent common household accidents by providing intelligent alerts to family members before accidents happen. Examples include a child near the edge of the stairs, near an electrical outlet, or playing with a sharp object, or an adult leaving the stove on.

These systems can also alert emergency services in the unfortunate event of a household accident. These types of services can be particularly useful for seniors who want to maintain their independence and continue living in their home.

Memory Augmentation

Facial and object recognition will allow systems to track the movement of people and objects around the house. This is useful for common scenarios like finding your phone or keys (without special sensors like Tile). It can be indispensable for people in the early stages of Dementia or Alzheimer's disease, who can be given contextual reminders for common tasks like taking medication, turning off the water or stove, or recognizing people.
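As an illustration, the "where are my keys?" case reduces to keeping a running index of object sightings; here is a minimal Python sketch with hypothetical room labels:

```
from datetime import datetime

last_seen = {}   # object label -> (location, timestamp)

def record_sighting(label, location):
    """Update the index whenever object recognition spots a tracked item."""
    last_seen[label] = (location, datetime.now())

def where_is(label):
    """Answer 'where are my keys?' style questions from the index."""
    return last_seen.get(label, ("unknown", None))

record_sighting("keys", "kitchen counter")
print(where_is("keys"))   # ('kitchen counter', <timestamp of the sighting>)
```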

Predictive and Personalized IoT

The current generation of IoT devices that integrate with Home Assistants require explicit commands to control them. Vision systems that can detect presence and identity will allow these devices to be predictive and personalized based on individual preferences. A simple example: family members may have different preferences for ambient lighting, or may want different apps pre-populated on their Smart TV. Some of these settings can also be changed automatically based on the activity detected, e.g. setting a different light level for reading vs. watching TV.
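As a minimal sketch (in Python, with hypothetical names and a commented-out placeholder for the smart-home call, not any real API), the mapping from recognized person and activity to a lighting preference could look something like this:

```
# Per-person, per-activity preferences; the vision system supplies (person, activity).
PREFERENCES = {
    ("alice", "reading"):     {"brightness": 80, "color_temp": 4000},
    ("alice", "watching_tv"): {"brightness": 20, "color_temp": 2700},
    ("bob",   "reading"):     {"brightness": 60, "color_temp": 3500},
}
DEFAULT = {"brightness": 50, "color_temp": 3000}

def update_lighting(person, activity):
    """Pick a lighting setting based on who was recognized and what they are doing."""
    setting = PREFERENCES.get((person, activity), DEFAULT)
    # set_light_level(setting)   # hypothetical smart-home call
    return setting

print(update_lighting("alice", "reading"))   # {'brightness': 80, 'color_temp': 4000}
```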

Recommender Systems

Most recommender systems act on a very narrow set of information collected within individual services, e.g. Netflix or Hulu don't know what you watch outside their service. A system that's aware of the shows you watch across these online services and on cable TV can provide better recommendations. A CV system that understands what you're watching can help improve your entertainment experience.

New types of recommender systems can be built based on CV. Amazon's Echo Look (for fashion) is a good example. Similarly, Houzz may want to create an app that recommends changes to interior design. Pinterest, Wayfair or Amazon could provide suggestions for updating a room's décor or furnishings, if requested.

Discovery and retention have been problems for new skills on Home Assistants. A vision system can help alleviate this by providing contextual suggestions for skills to use. For example, it could occasionally ask if you'd like to create or update your shopping list when you open your refrigerator or pantry.

Home Security

Intruder detection has seen a lot of advances in hardware and services in recent years, especially since it's easily monetized. Early versions of these services relied on motion- and sound-based alerts, which resulted in a lot of noise. Newer offerings like the Netatmo Welcome and Nest Cam IQ are starting to use facial recognition to reduce the false positives.

With CV capabilities like real-time action labeling, the next generation of these offerings should be able to provide smarter alerts, e.g. a person dropping off a package is not suspicious, whereas a stranger lurking around your house is.
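As an illustration only (not any vendor's actual logic), here is a minimal Python sketch of how action labels and a face-recognition result might be combined into an alert decision; the labels and rules are hypothetical:

```
KNOWN_FACES = {"alice", "bob", "mail_carrier"}   # people the system has learned

def classify_event(person_id, action):
    """Return 'ignore', 'notify' or 'alarm' for a detected front-door event."""
    if person_id in KNOWN_FACES:
        return "ignore"                      # family members and regular visitors
    if action == "dropping_off_package":
        return "notify"                      # worth a notification, not an alarm
    if action in {"lurking", "trying_door"}:
        return "alarm"                       # suspicious behaviour from a stranger
    return "notify"

print(classify_event("unknown", "lurking"))   # alarm
```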

The Challenges

Privacy

Privacy concerns could be one of the biggest blockers to adoption of CV in the home. Security cameras that currently stream to the cloud are usually pointed outside the house. A large segment of potential customers is likely to be concerned about having a network of cameras looking inside their home while they are in it, not to mention streaming this video to the cloud for processing or storage.

Strict privacy guarantees will need to be provided to overcome these concerns. One approach might be to restrict video from being streamed outside the home network. This will require some rethinking of the ML architectures so that inference and training (to a large extent) can happen within the home network, and video is only stored in the cloud at the user's explicit request.
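A minimal Python sketch of that idea, with placeholder functions standing in for the on-device model and the opt-in cloud upload:

```
def run_local_model(frame):
    # Placeholder for on-device inference (e.g. a compressed network on a home hub).
    return ["person", "first_steps"]

def upload_clip(frame):
    # Placeholder for a cloud upload that only happens on explicit request.
    print("uploading clip at the user's request")

def process_frame(frame, cloud_storage_opt_in):
    """All inference stays on the home network; video leaves it only by consent."""
    labels = run_local_model(frame)
    if cloud_storage_opt_in and "first_steps" in labels:
        upload_clip(frame)
    return labels
```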

Video Understanding

Image recognition is difficult as it is, because neural networks must account for variations in lighting, angles and occlusions. Video understanding is an order of magnitude more difficult, since it requires convolutions to be performed in both space and time. The difficulty of getting labelled data for training makes supervised learning even more challenging.
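To make "convolutions in space and time" concrete, here is a minimal sketch assuming PyTorch; the architecture and layer sizes are arbitrary, not a reference implementation:

```
import torch
import torch.nn as nn

class TinyVideoNet(nn.Module):
    """3D convolutions slide over height, width and time, so each kernel sees a short clip."""
    def __init__(self, num_actions=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # input: (batch, 3, frames, H, W)
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                       # pool away time and space
        )
        self.classifier = nn.Linear(32, num_actions)

    def forward(self, clip):
        return self.classifier(self.features(clip).flatten(1))

# e.g. two 16-frame RGB clips at 112x112 pixels:
logits = TinyVideoNet()(torch.randn(2, 3, 16, 112, 112))
```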

Advances in the following areas have brought us much closer to accomplishing real-time video understanding and classification:

CNNs, RNNs and LSTMs — The latest developments in these deep learning techniques have shown some promising results in real-time video classification.

Adversarial Networks — Researchers have been able to use adversarial networks to predict future frames in a video. GANs and advances in unsupervised learning are particularly promising, given the lack of labelled data.

AutoML — Google’s AutoML will lower the barrier to generating new neural nets by automating a significant portion of the process. This can be an effective way of creating models that are customized for individual homes.

Indoor location tracking — AR technologies like the Visual Positioning System can be used effectively for video understanding in the home as well. Granular information about the location of a person or an object makes it easier to understand the scene and/or actions.

Labelled datasets — Google has also released a large dataset of labelled videos that should help accelerate research in this area.

One advantage of applying CV in the home environment is that the background of the scene is relatively static. This should make it easier to draw inferences about what’s happening in the foreground.

Cost and Complexity

The hardware cost associated with instrumenting a home with a network of cameras will be higher than that associated with voice-based assistants. Installing these cameras to ensure they have the optimal field of view and coverage is more complex than placing a speaker (like the Amazon Echo) in a room.

However, new wireless camera systems like Netgear’s Arlo are much easier to install since they run on batteries. The cost of these systems will also come down as adoption increases. Business model innovation can also help: vendors can discount the hardware for end users and monetize some of the more compelling services (like Nest Aware and other home monitoring services do).

In Anticipation

I believe we are on the verge of seeing some exciting breakthrough products in this space. The next generation of smart cameras like the Nest Cam IQ and Lighthouse are targeting use cases beyond just home security. The integration of Google Lens into its mobile Assistant demonstrates how effective vision systems can be at solving real-world problems. The first few iterations of these products and services might be less than perfect, but just as autonomous vehicle, speech recognition and NLU algorithms got better with time and real-world data, so will Computer Vision systems.


Kuleen Mehta

Kuleen is a product owner for applied ML and cloud services. He is a believer in AI’s potential to improve the quality of life for everyone.