Cameras and Cats — Five Trends That Make Computer Vision Interesting Now

Why is computer vision interesting now? To answer that question, we’re going to talk about… cats.

Chang Xu
Upfront Insights
8 min read · Sep 25, 2018


The internet is full of pictures of cats, and cats have done a great service for computer vision. It's time to recognize their contribution.

What is computer vision, and why should we care? Because it's already part of your everyday life, and you probably don't even know it.

There are five trends — five reasons that computer vision is exploding now:

  1. Cameras and sensors
  2. Connectivity
  3. Data
  4. Algorithms
  5. Computing power

I'll tell you about them through cats. We laugh about it, but recognizing a cat in a photo is incredibly hard for a computer, and you'll see why.

Cameras and sensors

Cameras and sensors are being deployed everywhere because they are better, smaller, and cheaper than ever. There are more and more inputs from the physical world to computers, which means there are many, many cameras to take photos of cats. They are in your home, outside your door, on the drones that you fly, and on the manufacturing floor to detect defects. They're in your car on the dashboard, behind the rear-view mirror, on the sides, and at the back. In fact, your Tesla has 8 cameras, 12 ultrasonic sensors, and a radar to help it see its surroundings.

We have also benefitted hugely from the rise of smartphones over the past decade, which put a camera in everyone's pocket. Plus, the massive scale has made cameras and sensors a lot cheaper. I still remember the coolest phone you could have just over 10 years ago: the Motorola RAZR. It might have had a camera, but it was so bad. Nowadays, every smartphone can not only take beautiful high-definition photos but also capture stunning video, measure acceleration, pinpoint its geographic location, and do pretty much anything you want. How the world has changed! While your iPhone hasn't gotten much more expensive, the camera and sensors it packs have improved significantly over the generations.

Connectivity

These cameras and sensors are connected. Cameras capture photos at the "edge" and upload them to powerful servers in the cloud, which aggregate data from every device and bring far more computing power to bear. This constant feedback helps the system improve how it understands the world. Edge devices also share data with one another, which allows them to work together.
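To make that loop concrete, here is a minimal sketch in Python of an edge camera handing a frame to the cloud. The endpoint URL, the camera_id field, and the response format are all hypothetical stand-ins, not any real service's API:

```python
import requests

# Hypothetical cloud ingestion endpoint, not a real service.
CLOUD_ENDPOINT = "https://api.example.com/v1/frames"

def upload_frame(jpeg_bytes, camera_id):
    """Send one captured frame from an edge camera to the cloud,
    where it joins data from every other connected camera."""
    response = requests.post(
        CLOUD_ENDPOINT,
        files={"frame": ("frame.jpg", jpeg_bytes, "image/jpeg")},
        data={"camera_id": camera_id},
        timeout=10,
    )
    return response.json()  # e.g. labels the cloud model assigned

# Hypothetical capture from a doorbell camera.
with open("doorbell_frame.jpg", "rb") as f:
    result = upload_frame(f.read(), camera_id="front-door-01")
```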

inVia Robotics, for example, makes robots for e-commerce fulfillment warehouses. The robots pull bins from the shelves and bring them to human pickers, who then pick, pack, and ship the items. The robots navigate the warehouse autonomously by reading QR codes along the shelves and on the bins, and they function as a swarm: together, they batch orders and optimize routes. If one human picker is slow, more robots line up behind the faster pickers. The robots can also "defrag" the warehouse, moving frequently accessed bins closer to the pickers and less frequently accessed bins farther away. inVia robots coordinate based on what they see through their cameras.
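As a rough illustration of that perception step, here is how a robot might read a shelf's QR code from a camera frame using OpenCV's built-in detector. The image file and the marker text are hypothetical; this is a sketch of the idea, not inVia's actual system:

```python
import cv2

# OpenCV's built-in QR detector returns the decoded text plus the
# code's corner points, which also hint at angle and distance.
detector = cv2.QRCodeDetector()

frame = cv2.imread("shelf_view.jpg")  # hypothetical camera frame
text, corners, _ = detector.detectAndDecode(frame)

if text:
    print(f"Robot is at shelf marker: {text}")  # e.g. "aisle-3-bin-42"
```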

Data

More and more visual data is being shared over the web (YouTube, Facebook, Instagram, Pinterest, Snap, Flickr), powered by the camera in everyone's pocket and the many connected, distributed cameras and sensors. Over 300 hours of video are uploaded to YouTube every minute. Over 300 million photos are posted to Facebook every day. Over 50 million photos are posted to Instagram every day.

We have also benefitted from the autonomous car movement. The data collection started with Google's Street View effort for Google Maps, where Google sent cars to drive around streets everywhere. Then many companies launched self-driving car efforts, including Waymo, Uber, Cruise, and Zoox. As a result, they have gathered a huge amount of data about road conditions, cars, pedestrians, and road signs.

Algorithms

Understanding what is in an image is staggeringly complex, and only recently have computers been able to do it. We have developed algorithms that simplify the computation so that it doesn't take days or months to tell you whether you took a photo of a cat. The most foundational algorithm in computer vision is convolution (which I explain in this blog post). This is why you often hear the words Convolutional Neural Networks, CNNs, or ConvNets when people discuss computer vision.
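To give a feel for what convolution actually computes, here is a minimal NumPy sketch (strictly speaking, the cross-correlation variant, which is also what CNN libraries compute): slide a small kernel across the image and sum elementwise products at each position. The 3x3 edge-detection kernel is a standard toy example:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel over a 2D image, summing elementwise
    products at each position. A minimal 'valid' convolution with
    no padding or striding."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Toy edge-detection kernel: responds strongly where brightness changes.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

image = np.random.rand(8, 8)          # stand-in for a grayscale cat photo
feature_map = convolve2d(image, edge_kernel)
print(feature_map.shape)              # (6, 6)
```

A CNN stacks many such kernels, and it learns the kernel values from data instead of hand-coding them.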

Convolution is just one algorithm. Researchers have developed many other algorithmic innovations to further reduce the complexity and help computers recognize what’s in an image. Most are pretty esoteric and related to the inner workings of statistics and models. I’ll give two more examples that are easier to grasp.

There is data augmentation: distorting images slightly, in ways that don't change their meaning, so that you have more data to train your model. For example, you can create mirror images, crop randomly, rotate, shear, warp locally, shift the colors, and so on. Now you have eight pictures of cats when you started with just one.
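Here is a sketch of that idea using Pillow. The input file name is hypothetical, and the specific distortions are just one plausible set:

```python
from PIL import Image, ImageEnhance, ImageOps

def augment(photo):
    """Return several slightly distorted copies of one photo. Each
    variant still clearly shows the same cat, so all of them are
    fair game as extra training examples."""
    w, h = photo.size
    return [
        photo,                                               # 1. original
        ImageOps.mirror(photo),                              # 2. horizontal flip
        photo.rotate(15),                                    # 3. slight rotation
        photo.rotate(-15),                                   # 4. the other way
        photo.crop((w // 10, h // 10, w, h)).resize((w, h)), # 5. crop and rescale
        ImageEnhance.Color(photo).enhance(1.4),              # 6. color shift
        ImageEnhance.Brightness(photo).enhance(0.8),         # 7. dimmer
        ImageOps.mirror(photo).rotate(10),                   # 8. flip + rotate
    ]

cat = Image.open("cat.jpg")   # hypothetical input file
variants = augment(cat)
print(len(variants))          # 8 training images from 1 photo
```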

There is transfer learning: taking a model that someone else has trained on millions of images of cats and applying it to your problem. You'll probably need to tweak it slightly, because they trained only on cat photos taken in professional lighting, whereas you want to recognize cats in photos you took on your phone. This is much easier than starting from scratch and training your own model.
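A minimal sketch of that tweaking step with PyTorch and torchvision, assuming a pretrained ResNet-18 as the borrowed model and a two-class cat vs. not-cat problem; the training batch here is random stand-in data:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a model someone else trained on millions of ImageNet photos.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained layers: they already know edges, textures, fur.
for param in model.parameters():
    param.requires_grad = False

# Swap the final layer for our own two-class head: cat vs. not-cat.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head gets trained on our own photos.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One hypothetical training step on a batch of stand-in images.
images = torch.randn(4, 3, 224, 224)   # stand-ins for real phone photos
labels = torch.tensor([1, 0, 1, 1])    # 1 = cat, 0 = not a cat
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```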

As we develop better and better algorithms, we can recognize what’s shown in an image more accurately and quickly.

Computing power

The last piece I should touch on is computing power. The sections above have already hinted at why we need so much of it, and why we could always use more.

Traditionally, computing has relied on CPUs, or central processing units, which are great at taking a sequential list of instructions and executing them very quickly. Over time, we have kept making CPUs smaller, better, and faster, in line with Moore's Law.

Computer vision problems, on the other hand, are better suited to GPUs, or graphics processing units. Whereas CPUs have a handful of cores, GPUs have thousands of cores that operate in parallel, performing a multitude of identical simple jobs simultaneously. GPUs were developed for video games, where you need to render images on the screen efficiently. That turns out to be an analogous computation to image recognition: when you apply an algorithm like convolution, you want to apply it quickly all across the image.
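As a rough sketch of what that means in practice, here is the same convolution run on the CPU and, if one is available, on a GPU using PyTorch; the tensor sizes are arbitrary:

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 1024, 1024)   # one large RGB image
kernel = torch.randn(16, 3, 3, 3)       # 16 small learned filters

# CPU: a handful of cores work through the image region by region.
out_cpu = F.conv2d(image, kernel)

# GPU: thousands of cores each handle patches of the image at once.
if torch.cuda.is_available():
    out_gpu = F.conv2d(image.cuda(), kernel.cuda())
    # Same math, typically far faster on the GPU.
    print(torch.allclose(out_cpu, out_gpu.cpu(), atol=1e-4))
```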

This is why the GPU market has gotten a lot of attention of late. The market leader in GPUs today is Nvidia, but the other big tech companies, from Intel to Google to AWS, have all made serious moves to build their own chips tailored to machine learning tasks.

Major cloud providers have been adding GPUs and machine-learning-specific services to their offerings. Their capital expenditures don't specifically break out how many Nvidia GPUs they buy each year, but they show that this is an area of heavy investment.

Why? The public cloud market is expected to more than double, from $31B in 2016 to $72B in 2019. We all need more computing power to recognize cats more quickly and accurately.

There is a virtuous cycle of better sensors, increased connectivity, more data, better algorithms, and more computing power. This cycle yields better predictions and makes computer vision ever more capable and more pervasive in our world.

So what are cat photos to you?

Computer vision is not a hardware-specific gimmick. It is breaking down another wall between the digital and the physical, letting us interact across both, and it fundamentally changes how businesses operate in the physical world. Any business you run is either using computer vision or feeding computer vision.

What are the problems you have that computer vision can solve? I have a few ideas:

Identity & security — can your product become "smarter" if it can see exactly who or what it is interfacing with? This could apply to a digital product, like a phone's Face ID, or a physical product, like a door.

E-commerce — why am I still flipping through online catalogs and imagining clothes on me, only for the clothes that arrive to fit terribly or not be what I had in mind?

Change detection — so much of our work is keeping an eye on something.
When you're driving and want to switch lanes, you have to remember to check your blind spots. Why do we still have blind spots?

The parking police walk around the city all day to give tickets to illegally parked cars. Why not put cameras around the city and ticket cars automatically?

Or when you're in a factory making lots of electronic widgets, a machine could fall out of alignment and start making defective products. Why not have a camera stare at it and alert you if it sees something different?

A lot of diseases start with changes that are imperceptible to the untrained eye; slight tremors in your fingers, for example, might indicate Parkinson's. Why not have a camera as a silent observer in your home, with a direct line to your doctor?

Why not use computer vision to solve these issues?

Computer vision is changing the way we interact with the world. We'd be hard-pressed to find a business for which it is not relevant.

This was adapted from a talk at San Diego Venture Summit on August 16, 2017.

If you are interested in computer vision, check out my other posts.

If you are building a startup using computer vision, I’d love to talk to you. Shoot me an email at chang@upfront.com.
