Foundations For Solving Perception

Steven Jenkins
Jun 25, 2019 · 5 min read

A baby is born with two high-resolution cameras that transmit roughly 10 million bits per second to the brain, and with something like a recurrent neural network, loosely comparable to an LSTM. With the world as its dataset, it learns to perceive depth, shape, material, and texture. Therein lies the key to perhaps the most important computer vision task: the ability to look at the world and understand how to shape it. In this post, I will explain the foundations through which computers can learn to see as humans do and suggest how we can begin to make that a reality.

Foundation 1: Calibration and Localization

In epipolar geometry, the coplanarity constraint says that if a single point can be located on two image planes related by a known rotation and translation, the distance to that point can be calculated. In the case of humans (and most animals), these extrinsic parameters are already calibrated: the translation is the distance between our two eyes and the rotation is the angle between them. Put mathematically…

  • (X, Y, Z) are the coordinates of a 3D point in the world coordinate space
  • (u, v) are the coordinates of the projection point in pixels
  • fₓ and fᵧ are the focal lengths expressed in pixel units
  • (cₓ, cᵧ) is the principal point, usually close to the image center
  • r and t are the calibrated rotation matrix and translation vector between camera and projector
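
In terms of these symbols, the relationship is commonly written as the pinhole projection below. This is a sketch of the standard form rather than a quote from the post: the scale factor s and the element names r₁₁…r₃₃ and t₁…t₃ are assumptions introduced here for illustration.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3
\end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
```

The first matrix holds the intrinsic parameters (focal lengths and principal point), the second holds the extrinsic rotation and translation, and s is the depth of the point in the camera frame, which is divided out to obtain pixel coordinates.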

Put visually…

Localization and calibration are the foundation on which you and I learned to understand and manipulate our world. They are therefore among the most important foundations of 3D computer vision, and the basis on which computers can learn to see just as we did.
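
As a minimal sketch of this idea, assume the simplest possible calibration: two identical pinhole cameras with no relative rotation, separated by a known baseline along the x axis. Depth then follows from the horizontal shift (disparity) of the same point between the two images. The focal length, principal point, baseline, and test point below are illustrative values, and `project` and `depth_from_disparity` are hypothetical helper names, not functions from any library.

```python
def project(point, fx, fy, cx, cy, cam_x=0.0):
    """Pinhole projection for a camera located at x = cam_x, with no rotation."""
    X, Y, Z = point
    u = fx * (X - cam_x) / Z + cx  # shift the point into this camera's frame
    v = fy * Y / Z + cy
    return u, v


def depth_from_disparity(u_left, u_right, fx, baseline):
    """Depth of a point seen by two parallel cameras a known baseline apart."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("point must be in front of both cameras")
    return fx * baseline / disparity


# Assumed intrinsics (pixels) and baseline (metres, roughly eye separation).
fx = fy = 800.0
cx, cy = 320.0, 240.0
baseline = 0.065

point = (0.1, -0.05, 2.0)  # a 3D point two metres in front of the left camera

u_l, v_l = project(point, fx, fy, cx, cy, cam_x=0.0)
u_r, v_r = project(point, fx, fy, cx, cy, cam_x=baseline)

depth = depth_from_disparity(u_l, u_r, fx, baseline)
print(f"left u={u_l:.1f}, right u={u_r:.1f}, recovered depth={depth:.3f} m")
```

With a general rotation between the two views the same principle holds, but the point is recovered by triangulating the two viewing rays rather than by this one-line disparity formula.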

Foundation 2: Data

Unlike a baby, a computer does not have the world as its dataset. Instead, computers need to be given data. With data, we can train algorithms (deep neural networks) to optimize for a specific task, given a score (loss function) and a whole lot of compute power.
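
To make the idea of optimizing against a loss function concrete, here is a toy sketch that uses plain gradient descent to fit a straight line, standing in for a deep network. The data, learning rate, and iteration count are made-up values chosen only so the example converges.

```python
# Toy dataset generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def loss(w, b):
    """Mean squared error between predictions w*x + b and the targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


w, b = 0.0, 0.0       # model parameters, starting from zero
learning_rate = 0.05

for _ in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.4f}, b={b:.4f}, loss={loss(w, b):.2e}")
```

A deep network follows the same loop with many more parameters, which is why both the amount of data and the amount of compute matter.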

The problem is that no virtual dataset comes remotely close to the real-world dataset humans are given, so computers have nothing comparable from which to learn to understand, navigate, and even manipulate their environments. Computers are at a natural disadvantage. But computer vision researchers are crafty, and one of my favorite examples is the use of Grand Theft Auto to generate data for self-driving cars. Here’s a video of the world’s most boring playback of GTA…

Rockstar Games eventually sent cease-and-desist orders, but it goes to show both how important data is and the lengths to which researchers will go to get it.

But imagine there were a virtual world that resembled ours, through which a computer could learn just as we did. Every object would be assigned metadata ranging from material properties and textures to physical properties and potential functions. These objects could be observed in view and segmented on either image plane. Physics could be simulated so that if something falls, it falls as it would in the real world. Light could be simulated so that it passes through transparent materials, bending according to their index of refraction (think glass), and bounces off materials with low roughness (think mirror).
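
A rough sketch of what one object in such a world might look like: a record carrying its own labels, plus a very simple physics step for free fall. Every name here (`SceneObject`, its fields, `step`) is hypothetical, and the physics is deliberately minimal (constant gravity, no air resistance, no bounce).

```python
from dataclasses import dataclass

GRAVITY = 9.81  # m/s^2, acting downward


@dataclass
class SceneObject:
    """One labeled object; the metadata doubles as ground truth for training."""
    name: str
    material: str
    roughness: float         # 0 = mirror-like, 1 = fully diffuse
    refractive_index: float  # about 1.5 for common glass
    mass_kg: float
    height_m: float          # height above the floor
    velocity_mps: float = 0.0  # vertical velocity, positive is up


def step(obj, dt):
    """Advance the object by dt seconds under gravity, stopping at the floor."""
    obj.velocity_mps -= GRAVITY * dt
    obj.height_m += obj.velocity_mps * dt
    if obj.height_m <= 0.0:
        obj.height_m = 0.0
        obj.velocity_mps = 0.0


glass = SceneObject("drinking glass", "glass", roughness=0.05,
                    refractive_index=1.5, mass_kg=0.3, height_m=1.0)

elapsed = 0.0
dt = 0.001
while glass.height_m > 0.0:
    step(glass, dt)
    elapsed += dt

print(f"{glass.name} hit the floor after about {elapsed:.2f} s")
```

The fall time can be checked against the closed-form value sqrt(2h/g), which is about 0.45 s for a 1 m drop; a real simulator would add collisions, rendering, and per-pixel segmentation labels on top of records like this.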

A virtual world with infinite labeled data, fully customizable, as modular as our own, and governed by the laws of physics would immediately become the most important tool for any researcher in 3D computer vision. We could teach robots to move and manipulate objects. We could teach drones to navigate their surroundings. There would be massive improvements to manufacturing, healthcare, art, design, construction, and leisure. The world would never be the same.

This idea massively excites me. It’s something I could imagine working on for decades. If you share my excitement, I’d love to chat. At 3co, we are always looking for smart, driven people with a passion for understanding the real world and finding ways to make it better. Browse our job postings here and if you don’t see what you’re looking for, email me directly —
