At Scape Technologies, we believe in a future in which humans and machines can intelligently co-exist. But for any phone, vehicle, headset, or delivery drone to be an intelligent agent, it will need a human-like understanding of the world around it.
Our mission is to create and unify this human-machine understanding, by connecting the digital and the physical world. We call the cloud infrastructure powering this centralized unification our “Vision Engine”. In this new, technical blog series, we will describe how we designed and built it.
Understanding the World
Let us start with some basics. We believe that a machine has to be able to answer three fundamental questions to achieve the first stage of understanding the physical world:
Where am I? (Location)
How is the world shaped? (Geometry)
What am I looking at? (Semantics)
As humans, we can answer questions about location, geometry, and semantics using nothing but a visual sensor: our eyes. Therefore, if we look at the problem from first principles, cameras should be wholly sufficient for any machine to do the same.
We started with the question “Where am I?”. This is because before any machine can make intelligent decisions on a global scale, it will need to know precisely where in the world it is.
Where am I?
When we go back to our hometown, we are able to recall where we are without the use of a phone or GPS. When we see Big Ben, we know we are in London. When we also see the London Eye in the distance, we know from what angle we are looking at Big Ben. Using the size of the two landmarks, our brain figures out exactly where in the world we are.
This means that if we would like to enable devices to answer the question “Where am I?” in an environment, we need to index the landmarks of that environment in a 3D representation. When we then describe that 3D representation in GPS coordinates, it becomes a 3D map.
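To make “a 3D representation described in GPS coordinates” concrete: a common way to anchor landmarks in a single global metric frame is to convert each one’s latitude, longitude, and altitude into Earth-centred, Earth-fixed (ECEF) coordinates. The sketch below uses the standard WGS84 ellipsoid constants; it illustrates the idea rather than our production pipeline:

```python
import math

# Standard WGS84 ellipsoid constants
A = 6378137.0           # semi-major axis (metres)
E2 = 6.69437999014e-3   # first eccentricity squared

def geodetic_to_ecef(lat_deg, lon_deg, alt_m):
    """Convert a GPS position (latitude, longitude, altitude) into
    Earth-centred Earth-fixed XYZ — a global metric frame in which
    3D landmarks can be indexed and compared."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    # Prime vertical radius of curvature at this latitude
    n = A / math.sqrt(1.0 - E2 * math.sin(lat) ** 2)
    x = (n + alt_m) * math.cos(lat) * math.cos(lon)
    y = (n + alt_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - E2) + alt_m) * math.sin(lat)
    return x, y, z
```

Every landmark stored this way lives in the same coordinate system, so two maps built from different image sets can be merged directly.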
The question, then: what do we index, and how?
Back to our hometown example: We look for identifiable features of an environment like shop signs, traffic lights, or Big Ben’s clock. These objects are unique and we expect them to be observable regardless of the time of day or the seasons. Consequently, these recognizable features are exactly what we want to have in our digital map.
We can teach a computer to detect similar recognizable features within an entire image. However, from a single image alone, we cannot interpret where those 2D features are in 3D space. To do that, we need to infer depth.
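To give a flavour of what “detecting features” means in practice, here is a minimal Harris-style corner response in plain NumPy. The 3×3 window and the constant k are textbook illustrative choices, not the detector we actually use:

```python
import numpy as np

def harris_response(img, k=0.05):
    """Compute a Harris-style corner response for a grayscale image.
    High values mark corner-like points, negative values mark edges,
    and flat regions score near zero."""
    # Image gradients along rows (Iy) and columns (Ix)
    Iy, Ix = np.gradient(img.astype(float))
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def box3(a):
        # Sum each value's 3x3 neighbourhood (zero-padded borders)
        p = np.pad(a, 1)
        h, w = a.shape
        return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

    Sxx, Syy, Sxy = box3(Ixx), box3(Iyy), box3(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace
```

Local maxima of this response are exactly the kind of stable, repeatable image points worth indexing in a map.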
The main way we humans interpret depth is by using our two eyes. Our eyes function as two separate, but overlapping, 2D sensors. So, if we create a rig with two cameras, we can use the overlap between the images to calculate the depth of visual features. Luckily, this is already a well-known technique in the field of computer vision: stereo vision.
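The geometry behind stereo vision reduces to a single relation: a feature’s depth Z follows from the focal length f, the baseline B between the cameras, and the disparity d (the feature’s horizontal shift between the two images) as Z = f·B / d. A minimal sketch, with illustrative parameter names:

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Classic stereo relation Z = f * B / d.
    focal_px     - focal length, in pixels
    baseline_m   - distance between the two cameras, in metres
    disparity_px - horizontal shift of the feature between images"""
    if disparity_px <= 0:
        raise ValueError("feature must shift between the two views")
    return focal_px * baseline_m / disparity_px
```

Note that the smaller the disparity, the farther the point — which is why stereo accuracy degrades rapidly with distance.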
There is one major limitation though: most cameras have only one sensor. This has two implications. First, we would need to capture all the data ourselves with a special stereo rig (like a Google Street View car). Second, we would have to repeat that cycle every time an environment changes. This is why companies like Google and Apple spend fortunes on updating their maps.
Thankfully, nature tells us that stereo vision is not the only solution to infer depth: If we close one of our eyes in an unfamiliar city, we can still make strong assumptions about its structure.
Though this method works for distant objects, it is significantly less reliable when objects are nearby. For example, if we close one of our eyes and have someone hold two pens in front of us, it is hard to estimate which of the two pens is farthest away.
Again, nature provides a beautiful solution, which can be found in chickens. A chicken’s eyes do not have the overlap necessary for stereo vision. Instead, the chicken moves its head back and forth to capture two overlapping measurements with a single eye.
Chickens use the motion of their 2D sensor to estimate the 3D structure of the environment.
This means that if we loosen the definitions of “motion” and “chicken”, we should be able to calculate depth from any two overlapping observations, from any two cameras.
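In computer-vision terms, the chicken is performing two-view triangulation: given a feature matched across two overlapping images and the relative pose of the two viewpoints, its 3D position can be recovered. Below is a minimal direct linear transform (DLT) sketch, assuming the two 3×4 projection matrices are already known:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.
    P1, P2 are 3x4 projection matrices; x1, x2 are the matching 2D
    observations (in coordinates consistent with P1 and P2)."""
    # Each observation contributes two linear constraints on the
    # homogeneous 3D point X: x * (P[2] @ X) = P[0] @ X, etc.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The point is the null vector of A, found via SVD
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

In the wild, the projection matrices themselves are unknown and must be estimated jointly with the 3D points — which is precisely the Structure from Motion problem.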
Using this approach to create our 3D maps, we can leverage every smartphone, digital camera, dash cam, body cam, drone camera, and ground vehicle camera out there today, in addition to the trillions of images that have already been captured. Most importantly, we can use the images that flow through our Vision Engine whenever devices use its services, thereby removing the need to proactively recapture environments.
Structure from Motion
Fortunately, we do not have to start from scratch. In the field of Computer Vision, ‘Structure from Motion’ has been an active topic of research for decades. There are even software packages able to do it for us.
So, to create a model of a city, we just take images and feed them into our favorite Structure from Motion package (Colmap!), right?
Whilst existing approaches are well suited to single locations, they are typically unable to cope with larger environments. Furthermore, they rely on an abundance of imagery, or on sets of images from a single HD camera. Finally, the resulting 3D models are usually optimized for aesthetics, rather than for machines to answer “Where am I?”.
To overcome the existing problems, we had to build a Structure from Motion pipeline from scratch. One that can create 3D maps at an unprecedented scale using a limited number of images from the cameras that we have in our pockets today.
How did we do it?
In the next blogs of this series, we will describe Structure from Motion in more detail. We will explain the existing algorithms, their problems, and how we overcame them.