A new idea on how to make an AI interpret its 3D surroundings — the natural way!

Artificial Intelligence is a major trend nowadays, even though the idea itself is old. It has fascinated people since the era of mythology, when the Greeks imagined forging a mind, as a god would, into an artificial, almost alchemical being. Today, as technology evolves so rapidly, new ideas and insights keep appearing on how to build a mind artificially.

This process obviously involves the much-discussed ethical issue of “giving life” to an artificial object. An important part of “living” is experiencing the surroundings: gathering sensory data and reaching conclusions while pursuing a goal. A good way to “experience” things quickly, precisely and with good quality is to make the machine learn.

In this post I will focus on the 3D vision problem a machine has to face. Much of the AI built until now, including the impressive machine learning algorithms Google has developed and uses to speed up image learning, may rely on a not-so-perfect system to reach the end goal. The actual process is indeed interesting, but it should evolve and maybe, just maybe, start to mimic nature, biology and its foundations, found everywhere around us, so that we begin to understand what is so special about our minds that lets us perceive and learn our surroundings after far fewer encounters with the same object. To teach a machine an object the usual way, we have to provide it with a million pictures of that object (which we recognize, but the machine does not) to make it “believe” that it is that specific kind of object, even when it is shown from another angle, in a different color, or in other combinations. How can a machine do this by itself, without intense training, as in the classic example of distinguishing a banana from a taxi? Magic? Not quite.

A perfect example of the mimicry I am talking about is our own eyes. We have them, we use them, we know so much about them, so what is stopping us from trying to mimic them, and I mean mimicking them in much greater detail?

“The accommodation reflex (or accommodation-convergence reflex) is a reflex action of the eye, in response to focusing on a near object, then looking at distant object (and vice versa), comprising coordinated changes in vergence, lens shape and pupil size (accommodation).” — as a great Wikipedia article puts it. Let’s take a look at the image below:

What we understand from this diagram is that whenever we look at an object near our eyes, the eyes converge toward each other until both of them point at the center of the object they are focusing on, each according to where it is located. When the object is far away, the eyes diverge from each other until they reach an almost-parallel state, at which we consider the object to be at an “infinite” distance. The conclusion is that the first angle we see in the image is larger than the second one, because the first object is closer than the second. We start to see a geometrical pattern here. Are we able to distinguish or calculate the distance of an object based on the accommodation reflex of our eyes? Yes and no. Yes, because we are used to telling whether an object is near or far, not only because we have had enough experience to learn it, but also because of the relative relations between objects in our everyday lives; it is a correlation between these two factors, or more, combined with the power of our minds to perceive, construct models, abstract and imagine different scenarios. No, because we do not consciously work with angles, trigonometry and such formulas, and we could not do it all the time even if we did. But this can be implemented as a very necessary and important first step for machines. Since we do not yet have a good solution to the intelligent self-awareness problem in AIs, at least we can try applying geometry to achieve this goal.
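To make that geometry concrete, here is a minimal sketch of the triangulation involved, assuming two cameras separated by a known baseline and each rotated inward by a measured vergence angle. The function and variable names are my own illustration, not part of the actual build.

```python
import math

def distance_from_vergence(baseline_m, angle_left_deg, angle_right_deg):
    """Estimate the perpendicular distance to the fixated object.

    baseline_m     -- distance between the two camera pivots, in metres
    angle_left_deg -- inward rotation of the left camera from parallel
    angle_right_deg-- inward rotation of the right camera from parallel
    """
    a_l = math.radians(angle_left_deg)
    a_r = math.radians(angle_right_deg)
    # Intersecting the two viewing rays gives:
    #   depth = b * cos(aL) * cos(aR) / sin(aL + aR)
    # As both angles approach 0 (parallel cameras), the distance tends to infinity.
    return baseline_m * math.cos(a_l) * math.cos(a_r) / math.sin(a_l + a_r)

# Example: cameras 10 cm apart, each turned 2 degrees inward
# -> roughly 1.43 m to the object.
print(distance_from_vergence(0.10, 2.0, 2.0))
```

Notice how the distance blows up as the angles shrink; this is exactly why near objects are much easier to range by vergence than far ones.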

I will focus even more on the distance calculation problem, rather than on the more obvious results of having two sensory points with a considerable distance between them, each with its own perspective, simulating stereo vision.

In electronics, engineering and other fields, we use different methods to measure the distance between two objects: a metric or imperial tape measure, a laser rangefinder, a highly sophisticated radar or a simple ultrasonic device, GPS, infrared, and so on. But we humans use none of them, and yet we can orient ourselves very well. Does the machine also need these kinds of sensors? The answer is no.

I started researching the possibility of simulating such a sensory-motor reflex using open-source electronics and code libraries.

As we see in this picture, I have built two camera modules, a registering device (which keeps track of the “eye calibrations” and their angles), and a reflex system, which is a small-form-factor computer running on a Cortex-A7 processor, trying to calculate the position of the object in front of the cameras.
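For context, grabbing a frame from each camera module on such a board can look roughly like the minimal sketch below; the device indices and file names are assumptions, as the post does not show the actual capture code.

```python
import cv2

# Open both camera modules; indices 0 and 1 are assumptions and may differ per board.
left_cam = cv2.VideoCapture(0)
right_cam = cv2.VideoCapture(1)

ok_left, frame_left = left_cam.read()
ok_right, frame_right = right_cam.read()

if ok_left and ok_right:
    # These two frames are what the "reflex" computer works with downstream.
    cv2.imwrite("camera1.jpg", frame_left)
    cv2.imwrite("camera2.jpg", frame_right)

left_cam.release()
right_cam.release()
```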

I use the OpenCV library, together with the awesome Python language. OpenCV is carefully tailored for almost every activity or purpose we might use images or video for, from image recognition and facial recognition up to 3D perception, which is what I am trying to achieve. The computer continuously tries to estimate the shared “extent” of the two images it gets from the cameras, i.e. how much the views overlap. Once it succeeds in that task, it starts to calculate the angles and, from them, concludes the distance to the object.
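The post does not say which OpenCV routines are used, but one plausible way to estimate the overlap between the two views is keypoint matching. A minimal sketch with ORB features follows; the file names and the use of the median horizontal shift are my own illustration.

```python
import cv2

# Load the two views (placeholder file names).
img1 = cv2.imread("camera1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("camera2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect ORB keypoints and descriptors in each image.
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors between the two views; good matches mark the shared region.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# The horizontal shift of matched points hints at how much the two views overlap.
shifts = [kp1[m.queryIdx].pt[0] - kp2[m.trainIdx].pt[0] for m in matches[:50]]
print("median horizontal shift (px):", sorted(shifts)[len(shifts) // 2])
```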

The “eyes” can rotate, mimicking the vergence process our own eyes perform when accommodating. You can see in the image below the eyes mounted on two stepper motors operating in half-step mode, which gives a precision of 4096 steps per 360 degrees of rotation. Pretty precise for a cheap experimentation phase.
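With 4096 half-steps per revolution, one step corresponds to roughly 0.088 degrees, so converting a desired vergence angle into motor steps is simple arithmetic. The sketch below only illustrates that conversion; it is not the actual motor driver code.

```python
STEPS_PER_REV = 4096                   # half-step mode, as described above
DEG_PER_STEP = 360.0 / STEPS_PER_REV   # ~0.0879 degrees per step

def steps_for_angle(angle_deg):
    """Number of half-steps needed to rotate a camera platform by angle_deg."""
    return round(angle_deg / DEG_PER_STEP)

# Example: turning one "eye" 2 degrees inward takes about 23 steps.
print(steps_for_angle(2.0))
```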

We can already see a slight rotation on each platform, so that each camera can center the object as well as it can in its own view.
Images from Camera 1 and Camera 2, respectively

Are you starting to get the bigger picture of this concept? By building an “extended image” out of the two images, we achieve 3D vision from two cameras, which lets us measure whatever we want.

In fact, this is not the one specific way a machine should measure distance. It is a tool, or module, to be used in a more specialized way; a great example would be to combine this module with an AI system, using a deep neural network to simulate feedback and machine learning algorithms to let the machine learn from its experiences and start measuring different kinds of information, including distance.

I should note that this is an improvement over the classic, already well-known computer stereo vision. The classic method has static cameras, perfectly parallel with each other, which limits the power of our real eye reflexes. This method gives the machine the ability to move each camera at a specific rate and to a specific position, to achieve the best outcome and calculate the best approximation of an object’s distance using just two pictures! Isn’t this amazing?
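For comparison, the classic fixed-parallel-camera approach mentioned above typically computes a disparity map and applies depth = focal length × baseline / disparity. A minimal OpenCV sketch of that baseline technique might look like the following; the file names, focal length and baseline values are placeholders, not measurements from this build.

```python
import cv2
import numpy as np

# Rectified left/right images from two fixed, parallel cameras (placeholder files).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block-matching stereo: disparity is how far each pixel shifts between the views.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

# Depth from disparity: Z = f * b / d (placeholder focal length in px, baseline in m).
focal_px, baseline_m = 700.0, 0.10
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = focal_px * baseline_m / disparity[valid]
print("median depth of valid pixels (m):", np.median(depth_m[valid]))
```

The contrast with the verging-camera approach is that here the cameras never move, so the geometry is fixed once and for all, whereas the “eye” platforms above adapt their angles to each object.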

This will surely become a key system for image recognition. It is no longer just about 2D images, but about bringing the world to life, merging virtuality and reality into one; that way we can be sure we do not forget what reality is like, now that the boom of Virtual and Augmented Reality is impacting people’s lives at a frightening pace. It could also help blind individuals, using current technology (in this case, a better version of this module) to project a silhouette of reality into their minds, making their lives easier and perhaps giving their brains a new neural path toward closing the gap left by blindness.