3D cameras in 2022: choosing a camera for a CV project

Anton Maltsev
12 min read · Jun 13, 2022


For Computer Vision, one of the main sources of data is cameras. The more data you get, the better. And what gives more information than a 3D camera?
In this article, I will tell you how to choose a 3D camera for your Computer Vision project in 2022.

This article is based on our team's experience in robotics and outdoor use. Here and here are some of our projects.

Image by the Author

Types of 3D cameras

Let’s first mark out the field of what can be called a 3D camera, to decide what we will talk about in this article:

  • Monocular 3D (classic cameras + a bit of magic) — often it’s not as bad as it seems, and may be enough for some applications
  • Stereo 3D cameras — the most classic way to get depth. Nature has been using it since the time of fish.
  • Active 3D / structured light — this approach improves on stereo.
  • Time-of-flight cameras — cameras that emit a light pulse like a flash, while the sensor estimates the return time
  • Lidars — laser scanning

Monocular vision

Do not underestimate monocular vision. It does not require special equipment; a conventional camera is enough.
There are several approaches to estimating 3D parameters using monocular vision:

  • Using SLAM or photogrammetry algorithms. These topics are clearly outside the scope of this article, so let’s leave them aside. Among the most modern achievements, I advise you to read about NeRF
  • Using knowledge about camera positioning and camera settings. For example, you can train a cuboid prediction net. Knowing the height and orientation angle of the camera, you can estimate the size of objects.
  • Using neural networks for depth estimation. Let’s talk about this in a little more detail. This approach is the most similar to conventional 3D cameras.

There are many depth estimation networks; you can view a lot of them here. Let’s note the disadvantages of this class of algorithms:

  • All distances are relative. If you have a dollhouse with realistic dolls, you will not be able to recover the real sizes.
  • Networks, like people, are subject to optical illusions. Maybe you remember this video. And this is what its processing through a depth estimation network looks like:
Image by the Author
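The "all distances are relative" point can be illustrated numerically. Below is a minimal sketch with toy data (not the output of any real network): a scene and a scaled-down "dollhouse" copy of it produce exactly the same normalized inverse-depth map, which is essentially what a monocular network predicts.

```python
import numpy as np

def normalized_inverse_depth(depth):
    """The scale-invariant representation many monocular networks predict."""
    inv = 1.0 / depth
    return (inv - inv.min()) / (inv.max() - inv.min())

# A toy "scene": depth in meters, and the same scene at one tenth the size
scene = np.array([[1.0, 2.0], [4.0, 8.0]])
dollhouse = 0.1 * scene

# Both give the exact same prediction target, so true scale is unrecoverable
assert np.allclose(normalized_inverse_depth(scene),
                   normalized_inverse_depth(dollhouse))
```

No amount of training fixes this from a single image; recovering absolute scale needs extra information, such as camera height or a known object size.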

Speaking of networks, the one I like most is MiDaS. Today it is one of the best open-source networks running in real time (on powerful devices).

Image by the Author, MiDaS prediction

Hardware for monocular 3D. Today there is only one camera that can provide a depth stream as a monocular camera: the OAK-1. I would not say that the quality is high, but for some tasks it may be enough.
But if you have a processing unit, you can take any camera.

Image by the Author

And here you can see what this scene will look like:

Image by the Author from OAK-1 camera

After I published the article, the guys from Labforge shared with me that their camera can also estimate monocular depth right on board. I have not tested it, but judging by the specifications, its chip is much more powerful than the OAK’s.

Stereo 3D

Image by the Author. Concept of stereo

The most classic approach to getting a depth map is to use two cameras. However, there is no “standard” algorithm: dozens have appeared over the past 30 years. Some are classic, such as semi-global matching as implemented in OpenCV; others are modern neural networks, such as HITNET.
As a result, there are dozens of vendors producing a variety of cameras using a variety of technologies. Let’s note what you need to look at first:

  • (Calculations are made on board the camera) vs (calculations are made on the host machine). The trade-off is clear: on the host machine you can use more modern algorithms, but you need a powerful machine, extra work, and computing power. If the camera has its own processing unit, the 3D quality may not be perfect.
  • Whether RGB or IR sensors are used. This is very important for many applications, as reflectivity can differ between wavelengths. For example, face recognition algorithms trained on the RGB spectrum do not work in IR.
  • The quality of the depth calculation algorithms. For some manufacturers these may be old, OpenCV-based algorithms; others have their own unique know-how.
  • The distance between lenses (the baseline). The greater the baseline, the farther the camera can see; the smaller it is, the closer the camera can work.

The main disadvantage of stereo cameras is that they do not work well in areas with no texture, or where the texture is highly periodic. There is nothing worse for a stereo camera than a uniform white wall.

Image by the Author, The Wall
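A toy block-matching experiment shows why. The sketch below (plain NumPy, a deliberately simplified SAD matcher, not any vendor's algorithm) matches a small patch along a scanline: on textured data the matching cost has a clear minimum at the true disparity, while on a uniform wall every candidate shift matches equally well.

```python
import numpy as np

def matching_costs(left_row, right_row, x, patch=3, max_disp=8):
    """SAD cost of matching the patch at x against candidate disparities."""
    ref = left_row[x : x + patch]
    return np.array([
        np.abs(ref - right_row[x - d : x - d + patch]).sum()
        for d in range(max_disp)
    ])

rng = np.random.default_rng(0)
textured = rng.uniform(0, 255, 64)     # a richly textured scanline
wall = np.full(64, 128.0)              # a uniform white wall

true_disp = 4
# Simulate the right view: the scene shifted by the true disparity
costs_tex = matching_costs(textured, np.roll(textured, -true_disp), x=16)
costs_wall = matching_costs(wall, np.roll(wall, -true_disp), x=16)

print(int(costs_tex.argmin()))    # 4: the true disparity is recovered
print(float(np.ptp(costs_wall)))  # 0.0: all shifts match equally on the wall
```

Real algorithms like SGM add smoothness constraints to propagate depth into such regions, but there they can only guess, which is exactly why active illumination projects a texture onto the scene.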

Consider a few examples of stereo cameras from different manufacturers:

White-label and other little-known assemblies. Just two cameras on the same board. For example: 1, 2, 3.
Such devices attract by their price: you can find a ready-made one for 30–60 USD. But the picture quality, the synchronization quality, and much, much more can be at a low level. For example, I have an ELP stereo camera that provides only 320×240 when you try to grab both streams. And the streams are not synchronized 🤦‍♂️
When choosing such a camera, you will have to run some stereo reconstruction algorithm yourself. Here, for example, is how HITNET works out of the box on an ELP camera. If you correct the distortion, it will, of course, be much better:

Image by the Author, Stereo camera + on host HITNET

Mid-range cameras with proprietary processing. The most famous are definitely the MYNT EYE and ZED cameras. They are quite cheap and give a decent picture.
Cameras with a short baseline. If you need one, you should look at the RealSense D405 (I recently reviewed it) and the G53.
Long-range cameras. Again, there is the ZED camera, there is Nerian, and there is the self-assembled OAK-FFC-3P.

The OAK-D and OAK-D Lite series stand apart (I did a detailed review of them). Their advantage is the presence of an Intel Myriad X processor for data processing right in the camera. But the underlying algorithm used for out-of-the-box depth comes from OpenCV, and its quality is far from ideal.
The OAK platform also allows you to run a modern stereo depth estimation neural network, for example CREStereo. The result is very cool: sometimes even glass is detected!

Labforge also stands apart, but it is completely new and I haven’t tested it yet.

Image by the Author, OAK-D Lite depth (CREStereo + SGM depth)

But the speed (2 FPS) and resolution (160×240) are far from ideal.

I have compiled a small list of cameras that are more or less relevant today:

  • ZED cameras
  • MYNT EYE cameras
  • OAK cameras
  • RealSense D405
  • Nerian cameras
  • e-Con cameras
  • ELP cameras
  • roboception
  • DUO 3D
  • Dynim Spark
  • Ensenso
  • Arcure Omega
  • Carnegie Robotics
  • eCapture
  • Labforge

This is not all the cameras. And even we have tested only some of them.

Active Stereo / Pattern

Image by the Author, different active cameras

Let’s do a classification first. Active illumination can mean several camera-and-light configurations:

  • A standard stereo pair + an illuminator that projects an IR texture — this adds information in areas without texture, which compensates for the main disadvantage of stereo cameras.
Image by the Author
  • A standard stereo pair + an illuminator without texture projection. This allows you to work in dark rooms. But in bright rooms it does not improve the quality of the cameras.
  • A camera (or stereo pair) + an encoded pattern. In this case, the camera evaluates, for each point, which part of the pattern landed there. The main difficulty with such cameras is creating a sufficiently bright and sharp pattern.
  • Cameras + an encoded pattern that changes over time. This gives submillimeter quality. However, it significantly increases the exposure time; often a frame can be received no more than once per second.

The main disadvantages of active light systems:

  • Doesn’t work at long range (not enough light to illuminate far objects)
  • Active light does not work in direct sunlight (or does not improve stereo quality)
  • Can be unpredictable in reflective environments
  • Requires more energy than stereo
  • Fewer vendors
  • More expensive than stereo

If you choose such a camera, keep in mind that they are usually used in the near field: active illumination will almost never give data at distances of more than 5 meters. Let’s look at the market situation. There are already far fewer manufacturers than for stereo cameras (again, not all of them are listed here). The more or less cheap ones are:

  • Orbbec — one of the cheapest cameras; there are a few different models
  • Structure Core — I haven’t tried it myself, but many people recommend it
  • OAK-D Pro — a camera with a computing module directly on board (my review)
  • RealSense — the most famous ones. There are many models
  • Asus Xtion — not sure if it’s still on sale

Expensive ones:

  • Zivid
  • Photoneo
  • MechMind

Few of them include a dynamic pattern. A dynamic pattern gives amazing quality, but with very low FPS and for a lot of money. Here is one example of how we use it in our practice:

Video by the Author Company

Let’s sum up. What you need to look at first of all when choosing such a camera:

  • Camera type
  • Operating range
  • FPS — many of these cameras are slow

Time of flight camera

Image by the Author

TOF cameras are built on the principle of lidars: we evaluate, in one way or another, the return time of reflected light. Globally, there are three approaches to this:

  • Pulse. We send a flash, and in each pixel we measure when the return begins.
  • Phase. We send a modulated signal and evaluate its correlation in each pixel.
  • Cumulative. It seems unused now, but once upon a time I saw it. It estimates how much signal has accumulated in each pixel, with the accumulation proportional to the distance. I’m not sure it is really TOF, but the vendor calls it so.
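For the phase approach, distance follows from the measured phase shift of the modulated signal: d = c·Δφ/(4π·f_mod). A small sketch (the 20 MHz modulation frequency is an assumed, typical value, not the spec of any particular camera):

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def phase_to_distance(phase_shift_rad, f_mod_hz):
    """Distance from the phase shift of a continuous-wave TOF signal."""
    return C * phase_shift_rad / (4 * np.pi * f_mod_hz)

f_mod = 20e6  # assumed modulation frequency: 20 MHz

# A full 2*pi wrap corresponds to the unambiguous range c / (2 * f_mod):
print(f"{phase_to_distance(2 * np.pi, f_mod):.2f} m")  # 7.49 m
```

Past that range the phase wraps around, so a point at 8.5 m reads like one at about 1 m; real cameras combine several modulation frequencies to resolve the ambiguity.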

At the same time, TOF cameras also have disadvantages:

  • Accuracy is worse than expensive active stereo cameras
  • Doesn’t work well on dark surfaces (they reflect light badly)
  • Doesn’t work well in bright sunlight (where the stereo can save the situation)
  • Often low resolution
  • Ready-made modules are usually more expensive than cameras with active stereo
  • The range of work is usually also not very long (but there are exceptions)

At the same time, there are quite a few manufacturers of such cameras:

  • IFM
  • Microsoft Azure Kinect — we tested this one, and we really didn’t like its software depth processing
  • SICK
  • RealSense — the L515 camera; it seems to be discontinued now
  • Sense Photonics
  • Basler
  • Helios
  • LIPSedge
  • PmdTech

Lidars

I would not want to go very deep into lidars. I have worked with them only a little, and I do not know all the models on the market. But it is impossible not to mention them. Lidars are usually more expensive than the cameras above. Their physical principle is similar to TOF cameras, only the measurement goes point by point. This allows more energy to be concentrated in the beam, but the scanning becomes slower and mechanical parts appear.
Lidars are more commonly used outdoors than indoors. And most lidars are focused on long distances.

Laser profilers

The profilers where the laser beam moves are almost lidars :)
Other profilers, where the beam is stationary, do not estimate depth over the entire field. Their use is rather conveyor belts or specific handheld scanners.
In my opinion, these are very specific tools, and I decided not to cover them in more detail.

Working distance

Having made a general overview of the working principles, let’s talk about which cameras should be chosen for which tasks. I drew a small overview table of the ranges that different cameras are intended for:

Image by the Author
  • For the closest possible distances, it is easiest to take TOF cameras or stereo cameras with a small baseline
  • If you want to work from 0.3 to 5 meters, most existing cameras are at your service: cameras with active illumination, stereo cameras, TOF cameras
  • From 5–6 meters to 15–20 meters, most stereo cameras will work
  • From 15–20 to 50 meters, stereo cameras with a large baseline will work (cameras with a small baseline can work too, but they will give you bad quality)
  • Beyond 50 meters, the only hope is good lidars

But besides the formal “range”, there is also “accuracy” at a distance. And this is not the strong side of stereo-based cameras: the greater the range, the greater the error. Here is a sample graph for two different stereo cameras we had:

Image by the Author

It can be seen that the error is significant even at 10 meters, and rather allows you only a rough estimate of the distance. You can take a closer look at the characteristics of these cameras, for example, here (the ZED documentation) or here (a testing document posted by Luxonis).
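This quadratic error growth follows directly from the stereo relation Z = f·B/d: a fixed matching error Δd in disparity translates into a depth error ΔZ ≈ Z²·Δd/(f·B). A quick sketch with assumed, illustrative parameters (700 px focal length, 0.1 px matching error; not the specs of the cameras above):

```python
def depth_error(z, focal_px, baseline_m, disp_err_px=0.1):
    """Depth uncertainty at range z (m): dZ = Z^2 * dd / (f * B)."""
    return z ** 2 * disp_err_px / (focal_px * baseline_m)

# Assumed parameters: 700 px focal length, 0.1 px subpixel matching error
for baseline in (0.06, 0.12):          # short vs long baseline, meters
    for z in (1.0, 5.0, 10.0):
        err = depth_error(z, focal_px=700, baseline_m=baseline)
        print(f"B={baseline} m, Z={z} m -> ±{err * 100:.1f} cm")
```

Doubling the baseline halves the error at every range, which is why long-range stereo cameras need a large base.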

Speaking about the working distance, do not forget about a few more sources of error:

  • Focus (if the focus is fixed at 30 cm, the camera will work badly at 20 m)
  • Brightness (at high ISO the quality drops, and a high dynamic range in the image is also bad)
  • Temperature errors (a stereo camera with a big baseline can lose quality when the temperature drops from +40°C to -20°C)

Engineering design

Another very important and critical feature of 3D cameras is where they can be used. Let’s look at USB connectors.

Image by the Author

Can they be used outdoors? Hardly.
Can they be used if the camera is moving (mounted on a robot, etc.)? Also no.
Is USB always stable in software? Judging by our experience with mass production, not really.
So what is used instead? Here are a few examples:

Image by the Author — industrial connectors

It is clear that in most cases these are concerns of industrial installations; for laboratory and home applications, this is not required.

I will not go deeper, but check out Ethernet-connected cameras and cameras with MIPI CSI connectors.

What else is important? I would advise you to pay attention to the quality of the camera enclosure itself and its protection. There are enclosures with IP67/IP68 protection. There are bare boards.

Also, pay attention to the temperature of the case. For example, OAK cameras use the Intel Myriad X processor, which gets very hot. In room conditions, the camera heats up to 55 degrees. What will happen in outdoor conditions with some housing on it? I’m not sure the camera will keep all its parameters.

Other options

When we choose a camera, we also look at the following characteristics (I mentioned them above, but will summarize them here).

Camera size. The smallest cameras are, of course, TOF cameras. But the more industrial the camera, the bigger it will be.

FPS. Several things affect it:

  • Sensor characteristics. Some sensors are faster, some slower. Some are bigger with lower noise, some are smaller.
  • Camera connection interface. USB, LAN, MIPI CSI, and PCIe have different transfer rates. A depth stream is difficult to compress, so FPS very often runs into the transmission channel, or into readout on the host machine.
  • Depth estimation algorithm. Some methods require a lot of calculation and run slowly. This may limit FPS.
  • Hardware implementation. As I said above, there are methods that take seconds to evaluate a frame, and others that take milliseconds.
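A back-of-the-envelope calculation shows how quickly a raw depth stream hits the channel limit (the 16-bit 1280×720 format here is an assumption for illustration, not a specific camera's output):

```python
# Raw bandwidth of an uncompressed depth stream
width, height = 1280, 720
bytes_per_px = 2          # assumed 16-bit depth values
fps = 30

bitrate_bps = width * height * bytes_per_px * fps * 8
print(f"{bitrate_bps / 1e6:.0f} Mbit/s")  # 442 Mbit/s
```

That is close to the 480 Mbit/s theoretical ceiling of USB 2.0 (and well above its practical throughput), so even one such stream already needs USB 3.x, GigE, or MIPI CSI.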

Compute. Where is the depth calculated: on a special chip inside the camera, on a separate processor, or on the host machine?

Price. As you understand, camera prices range from 30 USD for a stereo pair to tens of thousands of dollars. This is an important characteristic for many projects. And the more accurate the specifications you need, the more expensive the camera will be.

That’s all!

If the article seemed interesting to you, subscribe to my YouTube or LinkedIn. There I sometimes review 3D cameras and boards for computer vision.
