Originally published: https://stevenjenkins.io/essays/foundations-for-solving-perception/
A baby is born with two high-resolution cameras that transmit 10 million bits per second to the brain and a recurrent neural network that resembles an LSTM. With the world as its dataset, it learns to perceive depth, shape, material, and texture. Therein lies the key to perhaps the most important computer vision task: the ability to look at the world and understand how to shape it. In this post, I will explain the foundations through which computers can learn to see as humans do and suggest how we can begin to make that a reality.
Foundation 1: Calibration and Localization
It’s no coincidence that humans evolved with two eyes. Mathematically, we need them to understand our 3D world. Look around for a recognizable object. I, for example, am looking at a Bird of Paradise. Now think of your eyes as cameras taking two pictures at the same time from two different locations. That object sits slightly to the right in the picture from your left eye and slightly to the left in the picture from your right eye. The two images below represent the same object taken from two cameras placed next to each other. Left image, left eye. Right image, right eye. While the object does not change, the position of the red dot along the x axis does. This is how we perceive depth.
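The link between that horizontal shift (usually called disparity) and depth can be sketched in a few lines of Python. This is a minimal illustration that assumes two identical, parallel cameras; the function name, focal length, baseline, and pixel coordinates are made-up values for the example, not measurements from the images described here.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Depth (in meters) of a point seen by two parallel, identical cameras.

    x_left, x_right: horizontal pixel coordinate of the same point in each image.
    focal_px: focal length in pixels (assumed identical for both cameras).
    baseline_m: distance between the two camera centers in meters.
    """
    disparity = x_left - x_right  # shift of the point between the two images
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


# Hypothetical numbers: 800 px focal length, 6.5 cm between the cameras.
near = depth_from_disparity(x_left=420.0, x_right=380.0, focal_px=800.0, baseline_m=0.065)
far = depth_from_disparity(x_left=405.0, x_right=395.0, focal_px=800.0, baseline_m=0.065)
print(near, far)
```

With these numbers, the point that shifts by 40 pixels comes out at roughly 1.3 m and the point that shifts by 10 pixels at roughly 5.2 m: the larger the shift, the closer the object.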
In epipolar geometry, the coplanarity constraint says that if a single point can be located on two image planes related by a known rotation and translation, the distance to that point can be calculated. In the case of humans (and most animals), these extrinsic parameters are effectively calibrated: the translation is the distance between our two eyes and the rotation is the angle between them. Put mathematically…
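The symbols defined in the list that follows match the standard pinhole camera projection model. Assuming that is the relationship intended here, it can be written as below, where s is an arbitrary scale factor (a symbol added for this sketch) and rᵢⱼ and tₖ are the entries of r and t:

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3
\end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
```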
- (X, Y, Z) are the coordinates of a 3D point in the world coordinate space
- (u, v) are the coordinates of the projection point in pixels
- fₓ and fᵧ are the focal lengths expressed in pixel units
- (cₓ, cᵧ) is the principal point, usually close to the image center
- r and t represent the calibrated rotation matrix and translation vector between camera and projector
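To make the symbols in this list concrete, here is a small projection routine in plain Python. It is a sketch under the standard pinhole assumptions (no lens distortion), and the function name and calibration values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point (X, Y, Z) to pixel coordinates (u, v)."""
    # Extrinsics: move the point from world coordinates into the camera frame.
    x_cam, y_cam, z_cam = (
        sum(r * p for r, p in zip(row, point_world)) + t
        for row, t in zip(rotation, translation)
    )
    if z_cam <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective divide, then scale by the focal length and
    # shift by the principal point.
    u = fx * x_cam / z_cam + cx
    v = fy * y_cam / z_cam + cy
    return u, v


# Hypothetical calibration: identity rotation, zero translation,
# 800 px focal lengths, principal point at the center of a 640x480 image.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point((0.5, -0.25, 2.0), identity, (0.0, 0.0, 0.0),
                     fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(u, v)  # prints 520.0 140.0
```

Running the same point through a second camera with a different translation gives a second (u, v), and the difference between the two is what makes depth recoverable.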
Localization and calibration are the foundation through which you and I have learned to understand and manipulate our world. They are, therefore, one of the most important foundations behind 3D computer vision and the basis through which computers can learn to see just as we did.
Foundation 2: Data
Data is essential for training computers to see. Unlike humans, computers are not born into an environment with photorealism and real-time rendering. They don’t have loving parents who label data, pointing to an object at a birthday party and saying “cake”. They don’t have a natural curiosity about the world or the innate drive to touch and feel.
Instead, computers need to be given data. With data, we can train algorithms (deep neural networks) to optimize for a specific task given a score (loss function) and a whole lot of compute power.
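The loop of data, score, and compute can be shown at toy scale. The sketch below is not a deep neural network: it assumes a two-parameter linear model and made-up labeled data, and uses plain gradient descent to drive a mean-squared-error loss toward zero. All names and values are illustrative.

```python
def mean_squared_error(weight, bias, samples):
    """Loss: average squared gap between predictions and labels."""
    return sum((weight * x + bias - y) ** 2 for x, y in samples) / len(samples)


def train(samples, learning_rate=0.05, steps=2000):
    """Fit y ~ weight * x + bias by plain gradient descent on the loss."""
    weight, bias = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to each parameter.
        grad_w = sum(2 * (weight * x + bias - y) * x for x, y in samples) / n
        grad_b = sum(2 * (weight * x + bias - y) for x, y in samples) / n
        # Step against the gradient to reduce the loss.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Toy labeled data generated from y = 2x + 1 (made up for illustration).
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
weight, bias = train(data)
print(weight, bias, mean_squared_error(weight, bias, data))
```

The learned weight and bias end up very close to 2 and 1. A real vision model swaps the line for millions of parameters and the five samples for a large labeled dataset, but the loop has the same shape.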
The problem is that no virtual dataset comes remotely close to the real-world dataset humans are given, so computers have nothing comparable from which to learn to understand, navigate, and manipulate their environments. Computers are at a natural disadvantage. But computer vision researchers are crafty, and one of my favorite examples is the use of Grand Theft Auto to generate data for self-driving cars. Here’s a video of the world’s most boring playback of GTA…
Rockstar Games eventually sent cease-and-desist orders, but it goes to show both how important data is and the lengths to which researchers will go to get it.
But imagine there were a virtual world that resembled ours, through which a computer could learn just as we did. Every object would be assigned metadata ranging from material properties and textures to physical properties and potential functions. These objects could be observed in view and segmented on either image plane. Physics could be simulated so that if something falls, it falls as it would in the real world. Light could be simulated so that it passes through transparent materials and bends according to their index of refraction (think glass) and reflects sharply off materials with low roughness (think mirror).
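As a rough sketch of what per-object metadata might look like, here is a hypothetical record type in Python along with a one-line free-fall calculation. The field names, value ranges, and the `fall_time_s` helper are all assumptions made for illustration; they do not describe any existing simulator.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One labeled object in a hypothetical simulated world."""
    name: str
    material: str
    roughness: float          # 0.0 = mirror-like, 1.0 = fully diffuse
    refractive_index: float   # about 1.5 for common glass
    transparent: bool
    mass_kg: float
    functions: list = field(default_factory=list)  # e.g. ["hold liquid"]


def fall_time_s(height_m, gravity=9.81):
    """Time for an object dropped from rest to fall height_m, ignoring air drag."""
    return (2.0 * height_m / gravity) ** 0.5


glass = SceneObject("drinking glass", "glass", roughness=0.05,
                    refractive_index=1.5, transparent=True, mass_kg=0.3,
                    functions=["hold liquid"])
print(glass.name, glass.functions, round(fall_time_s(1.0), 3))
```

Because every field is a label the simulator already knows, each rendered image of `glass` would come with ground truth for free, which is the appeal of the idea.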
A virtual world with infinite labeled data, fully customizable, as modular as our own, and governed by the laws of physics would immediately become the most important tool for any researcher in 3D computer vision. We could teach robots to move and manipulate objects. We could teach drones to navigate their surroundings. There would be massive improvements to manufacturing, healthcare, art, design, construction, and leisure. The world would never be the same.
This idea massively excites me. It’s something I could imagine working on for decades. If you share my excitement, I’d love to chat. At 3co, we are always looking for smart, driven people with a passion for understanding the real world and finding ways to make it better. Browse our job postings here and if you don’t see what you’re looking for, email me directly — email@example.com.