Detail reproduction as a key to object detection

Timofey Uvarov · Fourier Image Lab · Feb 18, 2022

A follow-up to the Tech.AD talk on an 8mpx automotive camera system

Tech.AD presentation

If you missed the event, you can watch the video recording below.

The presentation covered 4 main topics:
1. Key image quality indexes we measure for an automotive camera system.

2. Perceptual analysis of 2mpx and 8mpx sensors to determine the detectability of objects at different distances.

3. Infrastructure to support an 8mpx camera, with the ability to work with RAW images.

4. Introduction to the ISP pipeline we are working on with our partners.

After the presentation I was asked several questions on image quality metrics, so I wrote this article as a follow-up.

Object detection capability primarily depends on detail reproduction and dynamic range reproduction. In this article I'd like to do a deep dive on detail reproduction; later I will devote a separate article to dynamic range and explain how to understand it.

Detail reproduction.

Detail reproduction is the ability of the camera system to reproduce and render the finest small details so that the “detector” can reliably identify and classify them. The detector could be you, an AI algorithm, or a target group, such as QA and labeling engineers following a defined process to form a consolidated opinion.

Detail reproduction is rooted in information theory and is a measure of how effectively the camera uses its optics, photodetectors, and the rest of the ISP pipeline. In other words, detail reproduction tells us the smallest detail (object) we can capture and render properly at a given distance.

Such ability depends on every aspect of the image acquisition process: lighting conditions, camera lens properties, sensor color filter and microlens, pixel architecture, and all ISP blocks such as bad pixel replacement, demosaicing, sharpening, noise reduction, gamma correction, tone mapping, contrast enhancement and others.

To illustrate how important detail reproduction is, we can imagine a 2-megapixel and an 8-megapixel sensor of the same physical size, so that each pixel of the 2mpx sensor is replaced by a 2x2 block of pixels on the 8mpx sensor.

The 8mpx sensor captures 4x more pixels during each readout, but how much more information, and how much more useful information, does it really carry? For example, if the lens on the 8mpx sensor is slightly out of focus, or if the cross-talk between two neighboring pixels on the 8mpx sensor is high, a significant amount of blurring can occur.

The right image has 4x more pixels but the same amount of information as the center image

The top convolutional layers of a neural network consume a lot of resources, so it is in our best interest to design the camera system so that it uses its pixel elements effectively; in other words, it should be designed so that if we capture an image and later down-sample it (reduce the number of pixels), some useful information will be lost.
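
One rough way to check this is a down-sample/up-sample round trip: if almost nothing is lost, the extra pixels carried little unique detail. The sketch below is only an illustration of that test; the file name and the use of OpenCV are assumptions, not part of our pipeline.

```python
# Rough sketch: estimate how much unique detail a full-resolution capture
# actually carries, by down-sampling 2x, up-sampling back, and measuring the
# residual. A residual close to zero suggests the extra pixels add little
# useful information. The file name and threshold are illustrative only.
import cv2
import numpy as np

img = cv2.imread("capture_8mpx.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

h, w = img.shape
small = cv2.resize(img, (w // 2, h // 2), interpolation=cv2.INTER_AREA)  # 8mpx -> 2mpx
back = cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)         # 2mpx -> 8mpx

residual = img - back
rms = np.sqrt(np.mean(residual ** 2))          # detail lost in the round trip
psnr = 20 * np.log10(255.0 / max(rms, 1e-6))   # higher PSNR = less unique detail

print(f"residual RMS: {rms:.2f}, round-trip PSNR: {psnr:.1f} dB")
```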

ISO charts used in conventional imaging

In conventional imaging, detail reproduction is measured with charts such as the ISO 12233 chart.

ISO 12233 chart for measuring detail reproduction (Imatest)

The so-called wedges are made of thin lines that get closer to each other as they get thinner; the further into the wedge we can still distinguish the separate lines, the higher the detail reproduction of the camera system.

Example of the same RAW image processed with different demosaicing algorithms

In the illustration above, the same RAW image was processed using two different demosaicing algorithms. In the top result we can distinguish clear lines up to a mark of about 16.5, while in the bottom image the mark is at about 18. The bottom result therefore has higher detail reproduction than the top one.

ISP demosaicing process

But for modern cameras used for object detection, the ISO 12233 chart and its wedges are not the best way to determine detail reproduction. To understand why wedges are not a reliable metric for object detection, we need to look into the image sensor pixel architecture and the demosaicing process in the ISP.

demosaicing process in ISP

In a modern RGB camera the image sensor has a mosaiced pattern, such as a Bayer pixel array or similar, so at each physical pixel location only one color component is detected. To find the other two color components, a color interpolation, or demosaicing, process in the ISP is used, where the missing color components are reconstructed from neighboring pixel values.

Demosaicing is used to reconstruct the tiniest details in the image and is usually optimized by the ISP vendor to show the highest score on wedges like the ones described above. But since the wedges in the chart are just thin vertical or horizontal lines, wedge scores do not always translate into real-life scenarios with objects like cars and pedestrians, which have all types of edges, corners, and patterns.

Over its evolution, the ISP demosaicing process has become sophisticated enough to detect patterns such as wedges with very high certainty using edge-directed interpolation, where the missing color values are interpolated either horizontally or vertically.

For simplicity, let's look at demosaicing and the reconstruction of the green color channel below:

Left: Green color plane in Bayer RAW image, Right: Green color plane in RGB image

In the image above we can see how the white squares in the left image, where the green value was originally unknown, are replaced (populated) with green values in the right image after color interpolation.

Let's illustrate that process by looking at any one of the white squares:

Green value at the Gx location has to be found

In a linear camera system we would calculate the missing value as Gx = (G1 + G2 + G3 + G4) / 4.

In the edge-directed interpolation mentioned above, the Gx value is determined by the relation between the changes in the horizontal and vertical directions. First, horizontal and vertical classifiers are calculated:

GradH = abs(G4 - G2), GradV = abs(G1 - G3)

Then the horizontal and vertical interpolation values are calculated: Gh = (G4 + G2) / 2, Gv = (G1 + G3) / 2.

The final value is interpolated as Gx = Gv*Wv + Gh*Wh, where Wv and Wh are the vertical and horizontal weights, calculated as functions of GradH and GradV so that the larger weight is assigned to the direction in which less change occurs.

The example below is a vertical dark line on a white background.

Gh = (200 + 220) / 2 = 210, Gv = (10 + 10) / 2 = 10, GradH = 20, GradV = 0.

Since GradV = 0 and GradH > 0, our weight function forces Wh = 0 and Wv = 1, so Gx = Gv*1 + Gh*0, or Gx = Gv = (G1 + G3) / 2 = (10 + 10) / 2 = 10, which restores the thin vertical line properly.
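
Below is a minimal sketch of this green-channel interpolation at a single missing location, contrasted with plain averaging. The exact weight function differs between ISPs; the ratio-based weights used here are just one simple choice that reproduces the behavior of the worked example (Wh = 0, Wv = 1).

```python
# Minimal sketch of edge-directed green interpolation at one Bayer location
# where green is missing, versus naive linear interpolation.
def bilinear_green(g1, g2, g3, g4):
    # Naive linear interpolation: average of the four green neighbors.
    return (g1 + g2 + g3 + g4) / 4.0

def edge_directed_green(g1, g2, g3, g4):
    # g1/g3 are the vertical neighbors, g2/g4 the horizontal neighbors.
    grad_h = abs(g4 - g2)
    grad_v = abs(g1 - g3)
    gh = (g4 + g2) / 2.0                 # horizontal estimate
    gv = (g1 + g3) / 2.0                 # vertical estimate
    if grad_h + grad_v == 0:             # flat area: both directions agree
        return (gh + gv) / 2.0
    wv = grad_h / (grad_h + grad_v)      # more horizontal change -> trust vertical
    wh = grad_v / (grad_h + grad_v)      # more vertical change  -> trust horizontal
    return wv * gv + wh * gh

# Dark vertical line on a white background (values from the example above):
g1, g3 = 10, 10        # vertical neighbors, on the line
g2, g4 = 220, 200      # horizontal neighbors, on the background
print(bilinear_green(g1, g2, g3, g4))       # 110.0 -> the thin line is washed out
print(edge_directed_green(g1, g2, g3, g4))  # 10.0  -> the thin line is preserved
```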

If linear interpolation is used, or if the direction is detected incorrectly due to noise or aliasing, a solid edge or line in the real world will be reproduced after demosaicing as a zig-zagged, saw-like edge, as depicted below.

An improperly reconstructed (demosaiced) edge leads to incorrect pattern and object detection

Over the last 20 years, ISP engineers have been challenged with detail reconstruction and have learned how to score well on the wedges and solid edges used for MTF, such as those in the ISO 12233 chart, using advanced versions of edge-directed interpolation as well as methods based on multi-scale self-similarity, which are too complex to describe in this article.

It is important to understand that different demosaicing algorithms produce drastically different results from the point of view of detail reproduction, and a low-quality demosaicing algorithm can often become the bottleneck of a computer vision pipeline, requiring an oversized network to compensate for all the artifacts of incorrect pattern reconstruction.

Human Vision charts

As modern ISPs have learned to outsmart charts like the ISO 12233, we suggest that human vision charts are the most efficient way to measure the detail reproduction of a camera. Such charts consist of numbered lines of text, each line smaller than the previous one. We place the camera 10 ft or 20 ft away from the chart and determine which line can be reliably read in the captured image. Vision chart characters have a much more complex skeleton than horizontal, vertical, or slanted edges, so none of the existing ISP processors has the intelligence to reconstruct them based only on the presumption of edge continuity. An AI algorithm could potentially memorize the whole chart and render it from memory, but no existing ISP has such an ability.

After capturing the vision chart from 10 ft (20 ft), we determine the bottom line that we can read without confusion and look up the score to the right (left) of that line.

Sharpness, contrast and MTF are not equal to detail reproduction!

It is important to mention here that metrics such as contrast, sharpness, or MTF do not directly translate into detail reproduction.

In the example above, Image A has higher detail reproduction than Image B and Image C, but both Image B and Image C are sharper (have higher local contrast) than Image A, and Image C has higher global contrast than both Images A and B.

Detail reproduction does depend on the MTF of the lens and sensor combination, so when we explore detail reproduction we place the chart in different ROIs and also examine the MTF map for minima and consistency.

Detail reproduction of automotive cameras

Below is the vision chart captured with a 2mpx sensor using lenses with 30, 60, and 120 degrees of horizontal FOV. The green underline is our vision score and marks the bottom line we can read without confusion, the yellow line marks where we have some confusion, and the red line marks where we cannot read anything at all.

Left: 30° FOV, Center: 60° FOV, Right: 120° FOV

In the example below we kept the same lens, replaced the 2mpx sensor with an 8mpx one, and compared the vision scores:

As we can observe, the 8mpx camera scores 1.25 versus a 0.67 vision score for the 2mpx camera. The difference in detail reproduction amounts to 3 lines on the chart:

left: 2mpx, right: 8mpx

Later we learned how these vision scores extrapolate to object detection capability when we ran a pedestrian detection study, placing the two cameras side by side on a runway and using a human mannequin as the target.

Rendering of a pedestrian at different distances with 2mpx and 8mpx sensors and the same 30° FOV lens

After collecting consolidated human confidence from a target group of our labeling team, we formed the distance/vision trend for each camera and found that at a range of 300–500 meters the 8mpx camera image produces, on average, 2x the consolidated human confidence of the 2mpx camera, which is very close to the ratio of our vision scores of 1.25 and 0.67.

From the visualization of the vision trend displayed below, one can also see that with the 8mpx camera we consistently reach the same solid confidence (higher than 0.5) with around a 70 m range advantage over the 2mpx camera.

2mpx and 8mpx cameras and vision/distance trends for human detection

iPhone 12 Pro detail reproduction

As the iPhone camera is available to everyone for reference, we also provide vision scores for each of the 3 rear cameras of the iPhone 12 Pro, measured from 10 ft:

The 3 cameras of the iPhone 12 Pro.

Let's see how the vision scores of each iPhone camera translate into the representation of a monument located near the Pony.AI parking lot, 225 m away from the camera.

The 2.0x image uses its original resolution; the 1.0x and 0.5x images are upscaled and aligned
Images are compared at full size
Zoomed 1:1, it is still impossible to see the difference
At 3.0x zoom the 0.5x image starts to look unnatural, with excessive step overshoot
At 7.0x zoom a similar problem appears in the 1.0x image
Digital zoom 10x: sharpening in the 1.0x image increases aliasing of a smooth curve
Digital zoom 20x
Car at 7x zoom
Stop sign
Letter S, iPhone 12 Pro
Street sign with the 3 cameras of the iPhone 12 Pro
Small details of tree branches with the iPhone 12 Pro
Car wheel at 7x
Single wheel spoke at 20x

Since a car wheel spoke can be considered a thin line, I loaded the image into the ImageJ software and displayed a 1-D horizontal cross-section, where the height represents the luminance value at each horizontal location:

Horizontal cross section of a single spoke on each camera

Looking at the pixel-wise horizontal cross-section of the same feature through each lens, we can see that in the 2.0x image the spoke occupies 3 pixels, so a convolutional kernel at least 5 pixels wide would be required to detect it, while in the 0.5x image a kernel at least 7 pixels wide would be required.
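
For reference, here is a small sketch of how such a luminance profile can be extracted and measured outside of ImageJ. The file name and row index are placeholders for whatever crop and feature you are inspecting.

```python
# Sketch of the cross-section analysis: pull one image row through the wheel
# spoke and plot luminance vs. horizontal position, roughly what ImageJ's
# "Plot Profile" shows. File name and row index are placeholders.
import cv2
import matplotlib.pyplot as plt

img = cv2.imread("wheel_crop_2x.png", cv2.IMREAD_GRAYSCALE)

row = 48                         # row passing through the spoke (placeholder)
profile = img[row, :]

plt.plot(profile)
plt.xlabel("horizontal position (px)")
plt.ylabel("luminance")
plt.title("1-D cross-section through the wheel spoke")
plt.show()

# Crude width estimate: count pixels darker than the background by more than
# half of the full dip depth.
background = int(profile.max())
depth = background - int(profile.min())
width_px = int((profile < background - depth / 2).sum())
print(f"approximate spoke width: {width_px} px")
```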

Let's look at another of the finest details and build one more cross-section:

Tree branch at 20x with different lenses
Horizontal cross-section of a tree branch.

As we can observe from the cross-section above, the excessive sharpening applied to the 2.0x image produces a double spike in the peak, while the same detail is represented by a smooth, Gaussian-shaped peak in the 0.5x version, and the two representations would require completely different convolutional kernels to detect. Ideally we want to minimize such variation in the representation of the same details, to reduce the number of heavy convolutional layers at the top of our network. For that, we apply sharpening only to frequencies that are larger than the size of the convolutional kernel of the first layer in the network. How to do frequency-based spectral decomposition of the signal will be described in another publication.
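
As a rough illustration of that idea (not our production decomposition), the sketch below boosts only structures coarser than an assumed first-layer kernel size and passes the fine detail through untouched. The sigma choices, gain, and file name are all illustrative assumptions.

```python
# Rough sketch of scale-restricted sharpening: boost only structures coarser
# than the first-layer kernel and leave the fine details, which the network's
# kernels must match, untouched. Tuning values are illustrative only.
import cv2
import numpy as np

def sharpen_coarse_only(img, kernel_px=5, gain=0.7):
    img = img.astype(np.float32)
    # Split into a coarse band (structures larger than the kernel) and a
    # fine residual (structures at or below the kernel scale).
    coarse = cv2.GaussianBlur(img, (0, 0), sigmaX=kernel_px)
    fine = img - coarse
    # Unsharp-mask the coarse band only, using an even larger blur radius.
    very_coarse = cv2.GaussianBlur(coarse, (0, 0), sigmaX=3 * kernel_px)
    coarse_sharpened = coarse + gain * (coarse - very_coarse)
    # Recombine; the fine detail passes through unchanged.
    return np.clip(coarse_sharpened + fine, 0, 255).astype(np.uint8)

img = cv2.imread("branch_crop.png", cv2.IMREAD_GRAYSCALE)  # placeholder file
out = sharpen_coarse_only(img, kernel_px=5)
cv2.imwrite("branch_crop_sharpened.png", out)
```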

Finally, I'd like to show how much data is used to represent a small piece of a tree branch, which is less than 0.01 of a single image:

3d representation of small tree branch captured with different lenses
