AI & Computer Vision

Libby Kinsey
6 min read · Jul 4, 2016

(Part one of a series of posts “AI & …” from Project Juno).

“Computer Vision is an engineering discipline: we are primarily motivated by the real-world concern of building machines that see.” (Simon J. D. Prince, Computer Vision: Models, Learning and Inference, 2012)

Let’s use ‘AI & Computer Vision’ to describe any application that uses visual information to inform or reason about the world.

This post aims to give a flavour of the breadth of applications for computer vision AI. It’s not a survey of techniques (that would be a veeeery long post); suffice it to say that they all take pixel data from still or video images as input, extract measurements, and then model and infer.
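
To make that pipeline concrete, here is a minimal sketch of the ‘pixels in, inference out’ loop for image classification. It assumes Python with PyTorch and torchvision installed, and a hypothetical photo.jpg; it illustrates the general recipe, not any particular product’s implementation.

```python
# Minimal sketch: pixel data in, an inferred label out.
# Assumes PyTorch + torchvision are installed and 'photo.jpg' exists (hypothetical).
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(pretrained=True)   # pre-trained ImageNet classifier
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),                 # pixels -> tensor of measurements
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("photo.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)     # add a batch dimension

with torch.no_grad():
    logits = model(batch)                  # model + infer
    probs = torch.softmax(logits, dim=1)
    confidence, class_idx = probs.max(dim=1)

print(f"Predicted ImageNet class {class_idx.item()} "
      f"with confidence {confidence.item():.2f}")
```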

Imagine a parent with a newly mobile baby entering a room for the first time and scanning it for danger — the sharp table edge, hot oven door, or eminently topple-able heavy vase — they are inferring risk based on visual information and their knowledge of the world (their ‘priors’). Similarly, a carer visiting an elderly person in their home may carefully note visual cues of deteriorating health, perhaps a clear loss of weight or of colour, a sorrowful or pained expression. More prosaically, we might want to sort photos of a particular friend to contribute to an album. The information that we can extract from visual input is remarkable. And whilst computers are not yet able to undertake the wide variety of ‘seeing’ tasks that humans do, they are able to do some things very well (even better than humans for some image classification tasks) in narrow realms.

Tell me what you see

I’ve talked complacently, above, as though all of us have the ability to process visual information, ignoring those who are blind or visually impaired. Here, computer-vision-enabled services can assist. Aipoly is one startup that enables visually impaired people to use their phone to identify objects and colours, whereas Orcam makes a portable device with a smart camera which “recognizes text, products and faces, and speaks to you through a mini earpiece”.

What’s more, current research will lead to massive improvements. I am thinking of the vision systems for obstacle detection and spatial contextual awareness that are being developed for autonomous vehicles and robots, and the recent deep learning successes in inferring what is important in a scene and describing it.

What’s in a face?

Facial recognition technology is already in widespread use to work out who is in an image — in your Facebook feed, at the airport, or in a casino. But faces can reveal more than ‘just’ identity.

You and I might agree, most of the time, about whether a person looks glum or happy, serious or silly. We do this without thinking, taking into account factors like facial expression and posture (and other non-visual cues such as vocal tone). We have learned over time what those emotions look like, and we adjust our interactions appropriately. Computers are learning to recognise emotion from images too, and this will contribute to much more natural interfaces, and allow marketers to take advantage of emotional response to design and sell products. So far, so (relatively) uncontroversial.
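
A common recipe here is detect-then-classify: find the face, crop it, and hand the crop to a trained classifier. The sketch below assumes Python with OpenCV installed; classify_emotion is a hypothetical stand-in for a trained model, and meeting_photo.jpg is an invented filename — this is not any particular vendor’s system.

```python
# Detect-then-classify sketch for emotion recognition.
# Assumes OpenCV (cv2) is installed; classify_emotion is a placeholder for a
# trained classifier, and 'meeting_photo.jpg' is a hypothetical input image.
import cv2

def classify_emotion(face_crop):
    # Stand-in for a trained model (e.g. a small CNN) that maps a 48x48
    # grayscale face crop to a label such as 'happy', 'glum' or 'serious'.
    return "neutral"

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("meeting_photo.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    crop = cv2.resize(gray[y:y + h, x:x + w], (48, 48))  # a common input size
    print(f"Face at ({x}, {y}): {classify_emotion(crop)}")
```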

Then there’s the idea that computers can infer personality type from faces alone. I am sure that I have as many biases as the next person, and work just as hard to avoid ‘judging a book by its cover’. Yet, alarmingly, judging by appearance is exactly what one startup, Faception, claims to do, marketing its technology to “efficiently detect and apprehend potential terrorists or criminals…”.

On the other hand, Face2Face, which re-renders pre-recorded videos of people in real time with another person’s facial expressions, demonstrably works and shows what is possible in CGI and video games. Wow:

Face2Face, Real-time Face Capture and Reenactment of RGB Videos (CVPR 2016 Oral)

Identifying disease

Clinicians scrutinise medical images like X-Rays and MRIs to make judgements about diagnosis and treatment efficacy. Their experience is expensive to acquire and is not readily scaled or tested. A number of computer vision startups aim to assist clinicians in detecting and predicting disease so that they can provide timely, accurate diagnoses. To name a few, Avalon, Enlitic, Kheiron Medical, SemanticMD, Visulytix, and Zebra Medical cover a variety of imaging and disease types.

Then there’s Butterfly Network (which gets a special mention because I’m a hardware geek), which is aiming to make an ultrasound-on-a-chip, along with deep learning algorithms to assist clinical decision-making.

Unobtrusive monitoring

I was very tickled to see the recent seed round for Nanit, a visual monitoring device for babies, because something similar was the subject of a past exam question that I revised last year. The specific problem there was to alert to breathing problems (a camera is less obtrusive than attaching sensors), whereas Nanit helps parents understand sleep patterns and captures important moments automatically. Having mused collectively with my classmates on how something like this might work, it’s great to see Nanit actually taking the challenge on!

Nanit: tracks and understands sleep patterns

For older people, an intriguing camera-based system by Heartfelt Technologies monitors cardiac health in the home. Again, the beauty is in its unobtrusiveness — it requires no explicit patient interaction or wearable device.

Vision systems are used to monitor other things too, such as scanning CCTV automatically for suspicious activity or predicting corn yields from satellite images. They can also estimate pose, which has applications in sports training, ergonomic design, and rehabilitation:

Convolutional Pose Machines: MPII validation set (related paper)

Autonomous Vehicles

Autonomous vehicles and mobile robots comprise multiple complex systems, but being able to ‘see’ where they are and what is around them is fundamental.

Vision companies that innovate in this space must interact with a technology and stakeholder ecosystem that combines driving policy, sensing, mapping, and navigation with legacy infrastructure, regulation, and public scepticism.

Oxbotica’s approach (I hope I recall correctly from their recent presentation at the London Machine Learning meetup) circumvents reliance on GPS and takes a pragmatic approach to automation, with “autonomy when offered, as opposed to on demand”. Its technology underpins the UK’s first autonomous car approved for public trials. I think it’s really exciting.

It’s a big opportunity, but these are hard problems and safety is paramount. NYSE-listed Mobileye, “the global leader in the development of vision and data analysis for Advanced Driver Assistance Systems and autonomous driving”, has noted that it employs 800 (!!) people to annotate images for its supervised learning tasks.

Image/scene enhancement

Berkeley computer scientist Richard Zhang has created a lovely application of deep learning, to colourise/colorize black and white images:

Magic Pony Technology (acquired by Twitter this month) has been doing some super-cool deep learning to intelligently ‘sharpen’ poor-quality images and video, enhance streaming video, and even improvise new images:

Of course there’s Augmented Reality too — when computer-generated video or graphics are inserted into our view of a physical, real-world environment, that’s computer vision at work:

Blippar: “Bank of England Reveals New 5 Pound Note Design with Augmented Reality”

Sell more stuff

The ability to search based on visual attributes is particularly useful in fashion retail, where customers might be searching for a particular look, shape, colour or style that is difficult to formulate as a natural-language search term. There are some great companies offering visual search products, such as Snap Fashion and Cortexica. This application fulfils my requirements perfectly, summarised as, ‘that, but in blue.’

Snap Fashion
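
One common way to support a ‘that, but in blue’ query is to embed catalogue images with a pre-trained network and rank them by similarity to the customer’s image. The sketch below assumes Python with PyTorch and torchvision; the file names are invented, and real products such as Snap Fashion or Cortexica will do something more sophisticated.

```python
# Sketch of visual search by embedding similarity (not any vendor's actual API).
# Assumes PyTorch + torchvision; the image file names are hypothetical.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()          # drop the classifier; keep features
backbone.eval()

def embed(path):
    """Map an image file to a feature vector describing its visual attributes."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return torch.nn.functional.normalize(backbone(x), dim=1)

catalogue = ["dress_01.jpg", "dress_02.jpg", "coat_01.jpg"]   # hypothetical
query = embed("customer_photo.jpg")

# Dot product of L2-normalised features = cosine similarity; highest first.
scores = [(path, float(query @ embed(path).T)) for path in catalogue]
for path, score in sorted(scores, key=lambda s: s[1], reverse=True):
    print(f"{path}: similarity {score:.3f}")
```

Because the features are L2-normalised, the dot product is the cosine similarity; at production scale the catalogue embeddings would be precomputed and searched with an approximate nearest-neighbour index.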

Computer vision has also been used, by Clevapi for example, to understand image content and object placement so as to seamlessly insert relevant ads.

Now what?

It’s a very varied list, isn’t it? That’s what’s amazing about computer vision: it’s so very broadly applicable. There’s masses of commercial opportunity too. Just in the last couple of years in the UK, vision companies like Vision Factory, Magic Pony Technology, Apical, and Seene/Obvious Engineering have been acquired.

There’s still much to do though. The ability to identify what is important in an image (attention learning); to describe it (captioning); to integrate with other data sources like language and sensor data (multi-modal learning); to unambiguously describe objects in busy scenes; and the sheer processing power to do useful things with pixels in a timely and cost-efficient fashion: these are all difficult problems which, if solved, open up even more opportunity!

Thanks to Oisin Mac Aodha for commenting on a draft of this blog.
