8MPX automotive camera system, Auto.AI 2022

Timofey Uvarov
Fourier Image Lab
Jun 30, 2022

The presentation I gave at Auto.AI 2022 in Detroit.

8 Mpx camera system presentation.

If you missed my presentation, the PDF can be found at the following link.

The presentation was about an 8 Mpx automotive camera system for autonomous driving, with a focus on how to translate human-vision-chart results into image quality metrics, and image quality metrics into trends of detection probabilities and costs for various objects at different distances with cameras equipped with 8 Mpx and 2 Mpx image sensors. Some architectural aspects of back-end/infrastructure design were also addressed, in correlation with the design of the image pre-processing pipeline.

Image quality and detection trends

The essence of the first part could be summarized in the slide below:

Translation of vision chart results into detection probabilities and cost of detection of a human at distance

Human vision chart

On the left side of the slide above you can see the human vision chart captured at a distance of 10 feet with 2 Mpx and 8 Mpx sensors, using the same high-resolution lens and processed with the same software ISP from onsemi. On the 8 Mpx chart we can confidently read three more lines than on the 2 Mpx chart.

The chart can be purchased online:

Detection probabilities

detection probabilities and cost for an adult pedestrian at distance

In the graph above you can see two solid lines representing the detection probability of a human mannequin at a given distance. The orange line represents the vision trend for the 2 Mpx sensor and the blue line for the 8 Mpx sensor.

For instance, if we target a detection probability of 0.5, we find that with the 8 Mpx sensor (blue line) the ADV will be able to detect the human at 390 m and with the 2 Mpx sensor at 320 m; these are the hypothetical maximum detection distances of the tested camera prototypes. With the 8 Mpx camera we gain 70 m of distance advantage for the ADV to detect, predict, and react. Slicing the trend at the 0.75 confidence level, we observe a detection-distance difference of around 60 m, and at 1.0 the two trends converge at around 200 m. Thus below 200 m there is not much benefit in using the computationally expensive 8 Mpx sensor (unless we are targeting something like gesture or face detection).
You might be surprised to compare the numbers for human detection vs. detection of a deer.
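
To make the slicing concrete, here is a minimal sketch of how such a trend can be interpolated at a target confidence level. The sample points below are illustrative values eyeballed from the chart above, not the measured data:

```python
import numpy as np

# Illustrative (distance, probability) samples, eyeballed from the chart
# above; these are assumptions, not the measured data.
dist_m = np.array([200, 250, 300, 350, 400, 450])
p_2mpx = np.array([1.00, 0.90, 0.62, 0.35, 0.12, 0.02])  # orange line
p_8mpx = np.array([1.00, 0.97, 0.85, 0.65, 0.45, 0.20])  # blue line

def distance_at_confidence(target_p, dist, prob):
    """Distance at which the detection trend crosses target_p.
    Probability falls with distance, so reverse both arrays to give
    np.interp the increasing x-axis it expects."""
    return float(np.interp(target_p, prob[::-1], dist[::-1]))

for p in (0.5, 0.75):
    d2 = distance_at_confidence(p, dist_m, p_2mpx)
    d8 = distance_at_confidence(p, dist_m, p_8mpx)
    print(f"p={p:.2f}: 2 Mpx ~{d2:.0f} m, 8 Mpx ~{d8:.0f} m, gain ~{d8-d2:.0f} m")
```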

The above results were obtained on a clear sunny day, Aug. 25, 2021, at Crow's Landing airport: 84°F (29°C), 8 MPH wind, 30% humidity.

Cost of detection

The dotted lines represent the cost of detection of the pedestrian at a given distance, in pixels, for each of the sensors.

For example, at 100 m we would need at least 20 pixels horizontally to detect a pedestrian with the 2 Mpx sensor, and 40 pixels with the 8 Mpx sensor. The cost of compute (number of transistors) grows as the square of the ratio of pixels on target, so it is 4x more computationally expensive to detect a pedestrian at 100 m with the 8 Mpx sensor.
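
The arithmetic behind that 4x figure is simply the square of the linear pixel ratio; a one-line sketch:

```python
# Pixels on target double linearly from 2 Mpx to 8 Mpx (20 -> 40 at 100 m),
# so the pixel area, and roughly the compute spent on it, grows as the square.
px_on_target_2mpx = 20  # horizontal pixels on a pedestrian at 100 m
px_on_target_8mpx = 40

cost_ratio = (px_on_target_8mpx / px_on_target_2mpx) ** 2
print(f"Relative compute cost at 100 m: {cost_ratio:.0f}x")  # -> 4x
```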

Conclusions

Above, we conducted a field experiment engaging a target perception group drawn from the image-labeling team at Pony.AI; we learned the vision trends of the 2 Mpx and 8 Mpx sensors for detecting different objects, and related the two trends via the cost and probability of detection of numerous objects over a wide range of distances.

Using the charts below, one can determine the rationale for choosing a sensor resolution and pixel size for a given application.

detection probabilities of different objects

Notice how the results differ across objects due to differences in size, shape, morphological structure, and material.

detection probabilities of different objects

How to use the charts in your own computer vision project

From the charts above we can find that a deer is detected with 0.5 confidence at 350 m with the 2 Mpx camera and at around 400 m with the 8 Mpx camera. At the same time, we know from the human vision chart metrics that with the 8 Mpx camera we can read three more lines of text at 10 ft.


My other article, a deep dive into detail reproduction, explains why human vision charts are effective for computer vision.

Thus, three extra lines of text visibility on the human vision chart translate into roughly a 15% increase in detection distance (350 m to 400 m) when detecting a deer with the 8 Mpx sensor compared to the 2 Mpx sensor, accounting for pixel count and size.

Practice

To estimate a rough vision trend for a given camera without needing to rent a whole runway, one can purchase the same vision chart on Amazon, take a photo with the camera placed 10 ft away in bright conditions, count how many lines of text are readable without a doubt, and interpolate the trend using the data provided in the charts above.

For example, if we could only read the second line from the bottom with a given camera, that would mean we could detect a deer with 0.5 confidence somewhere in the 350–400 m range, somewhat closer to 400 m.
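
As a minimal sketch of that interpolation, using the two anchor points quoted in this article (350 m at the 2 Mpx chart reading, ~400 m at the 8 Mpx reading, three lines apart); the absolute line counts are assumed for illustration:

```python
import numpy as np

# Anchor points from this article: the 2 Mpx camera reads some baseline
# number of chart lines and detects a deer with 0.5 confidence at ~350 m;
# the 8 Mpx camera reads 3 more lines and detects at ~400 m.
# The absolute line counts below are assumptions for illustration.
lines_read = np.array([7, 10])        # readable chart lines at 10 ft
deer_dist_m = np.array([350, 400])    # 0.5-confidence detection distance

def estimate_deer_distance(lines):
    """Linearly interpolate detection distance from readable chart lines."""
    return float(np.interp(lines, lines_read, deer_dist_m))

# A camera reading one line fewer than the 8 Mpx reference lands in the
# 350-400 m band, somewhat closer to 400 m, as in the example above.
print(f"~{estimate_deer_distance(9):.0f} m")  # ~383 m
```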

The survey we used to conduct the experiment is available online; if you take it, you can compare yourself with our labeling group's consolidated trend.

VISION SURVEY from Pony.AI — measure the consistency of your labeling group.

We can also detect vision or attention issues if a participant's survey data deviates too much from the trend (i.e., higher confidence is too often assigned to an object captured from a longer distance than to an object closer to the viewer).
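
A minimal sketch of such a consistency check, assuming each survey answer is reduced to a (distance, confidence) pair (the data layout is an assumption, not the actual survey format):

```python
# Confidence should fall as distance grows; the rate of pairwise
# inversions flags possible vision or attention issues.
def inversion_rate(answers):
    """Fraction of answer pairs where the farther object received
    strictly higher confidence than the nearer one."""
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    bad = sum(1 for (d1, c1), (d2, c2) in pairs
              if (d2 - d1) * (c2 - c1) > 0)
    return bad / len(pairs) if pairs else 0.0

# One participant's (distance_m, confidence) answers; the third entry
# is inverted relative to the second.
answers = [(100, 0.95), (200, 0.80), (250, 0.90), (350, 0.40)]
print(f"inversion rate: {inversion_rate(answers):.2f}")  # 0.17
```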

Infra and ISP

In the second part of my presentation I shared some pros and cons of collecting raw image data for training and of physically positioning the ISP inside the camera module vs. near the central processor, and gave a brief introduction to the ISP core developed with our partners.

The ISP was introduced as a genetic replicant of the bipolar and ganglion retinal cells in the human eye. To understand how ganglion cells perform such high-pass and band-pass decomposition of visual information, this video is very informative:

Here <<will be a link to an article>> describing how such frequency-based decomposition was applied to a high-bit-depth visual signal: each frequency band was compressed using nature- and physics-guided mathematical functions and linked to a convolutional layer of the same size in the network, shining some light on our Tesla patent.
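
As a generic sketch of the idea (a Laplacian-pyramid-style decomposition, not the actual ISP core or the patented functions), each band-pass level can be range-compressed and handed to a convolutional layer of matching spatial size:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def bandpass_decompose(img, levels=3):
    """Split an image into band-pass residuals plus a low-pass base,
    loosely mimicking the ganglion-cell decomposition described above."""
    bands, current = [], img.astype(np.float32)
    for _ in range(levels):
        low = gaussian_filter(current, sigma=2.0)
        bands.append(current - low)   # band-pass residual at this scale
        current = low[::2, ::2]       # downsampled low-pass base
    bands.append(current)             # final low-pass base
    return bands

def compress_band(band, strength=0.5):
    """Toy range compressor: a signed power law tames a high-bit-depth
    band (an illustrative stand-in for the physics-guided functions)."""
    return np.sign(band) * np.abs(band) ** strength

# Synthetic 16-bit-depth input; each compressed band could feed a conv
# layer of the same spatial size in the network.
img = np.random.default_rng(0).integers(0, 2**16, (256, 256))
for i, band in enumerate(bandpass_decompose(img)):
    print(i, band.shape, f"std={compress_band(band).std():.2f}")
```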

Currently we have a shareable demo library able to read RAW images from Sony/onsemi sensors, with support for Omnivision HDR and linear RAW formats coming soon. Please contact me if you are interested in evaluating it.

A high-level view of the conceptual imaging pipeline:

GPU model ISP results:

some examples of ISP processing
At magnification

The non-technical part of my Detroit visit and conference experience: <<click>>
