A Deep Learning Model Can See Far Better Than You

Andrés Camilo Rodríguez
Published in EcoVisionETH · Jul 8, 2020
Predicted Coconut Trees in Palawan, Philippines

Can we use a Deep Learning model to count objects at a sub-pixel scale? Deep Learning has been used successfully to automate many tasks that used to be done by hand; we usually reach for it precisely because we want to stop performing manual work. But what about tasks that are difficult even for humans?

This is a Sentinel-2 image (from the free-access satellite constellation of the European Space Agency) of a parking lot in Victorville, CA. Can you tell that there are cars in the image? Could the naked eye count them?

We developed a model that does exactly this: counting objects from space that are usually invisible to the naked eye.

Why not just use high-resolution images?

With high-resolution images this task would be quite straightforward, but the cost of acquiring such images makes them infeasible for large-scale analysis. Moreover, with Sentinel-2 we obtain a new image of every place in the world every 5 days, opening opportunities such as mapping large areas or following trends over time.

Left: Model density prediction (total: 4100 cars), Right: High-res satellite imagery (total: 4300 cars)

Taking the same low-resolution satellite image of the parking lot, we can feed it to our model and predict the number of cars per pixel, which sums to 4100 cars for the whole image. Compared against the high-resolution reference data (4300 cars), the model's estimate is within 5% error. Not bad for a blurry input image!

How can we count sub-pixel scale objects?

With sub-pixel scale objects it is difficult to estimate the exact position of each object, which is why we cast counting as a regression task: the output is a continuous estimate of the density of objects in each pixel.
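To make this concrete, here is a minimal sketch with made-up numbers: each pixel of the output stores a fractional number of objects, and the total count is simply the sum over the density map, so no object ever needs to be localized individually.

```python
import numpy as np

# Hypothetical predicted density map: each entry is the estimated
# (fractional) number of objects inside that pixel.
density_map = np.array([
    [0.0, 0.3, 0.7],
    [0.1, 1.2, 0.9],
    [0.0, 0.4, 0.2],
])

# The total count is just the sum of the map.
total_count = density_map.sum()
print(f"Estimated object count: {total_count:.1f}")  # -> 3.8
```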

This idea is commonly used in crowd counting, where some people in the image appear at very small scales and cannot be located individually [1,2]. The challenge, as in many machine learning problems, is to obtain high-quality reference data to train the model. Since our objects are at a sub-pixel scale, we resorted to high-resolution satellite imagery, where the objects are clearly visible. For each area of interest, we first labeled a small area and then ran an automatic object detector on the high-resolution imagery, obtaining the locations of 1.6 million objects in different regions around the world.

In the figure below, the automatic detections are shown as red bounding boxes on the left. For scale comparison, the yellow boxes represent the 10x10 m Sentinel-2 pixels. The next step is to convert this information into a count of objects of interest inside each yellow box, which yields the count map on the right.

Sentinel-2 pixel scale and density count per pixel

We are now working at a much coarser scale than the original high-resolution image from which the data was obtained, and this change of scale causes problems at training time.

To obtain the reference data at the 10x10 m pixel scale, we therefore blur the count map with a Gaussian kernel whose standard deviation is σ = K/π, where K is the scale ratio between the high-resolution and the low-resolution image.
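As a rough illustration, the sketch below builds such a reference map: it bins the high-resolution detections into a 10x10 m Sentinel-2 grid and then applies the Gaussian blur with σ = K/π. The function name, the coordinate format, and the 0.5 m high-resolution pixel size are illustrative assumptions, not details from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reference_density_map(xs, ys, extent_m=1000.0,
                          gsd_lowres=10.0, gsd_highres=0.5):
    # xs, ys: object-center coordinates in metres, as detected on the
    # high-resolution imagery (hypothetical input format).
    n_cells = int(extent_m / gsd_lowres)  # number of 10x10 m Sentinel-2 cells
    counts, _, _ = np.histogram2d(
        ys, xs, bins=n_cells, range=[[0, extent_m], [0, extent_m]]
    )  # raw count of objects per low-resolution pixel

    K = gsd_lowres / gsd_highres  # scale ratio between the two resolutions
    sigma = K / np.pi             # blur strength from the formula above
    return gaussian_filter(counts, sigma=sigma)
```

The Gaussian blur preserves the total count up to boundary effects, so summing the smoothed map still yields the number of labeled objects.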

We are also interested in obtaining a clean count in areas without the object of interest, such as empty land or natural forests. Besides the Density task, we therefore train a Semantic task that classifies each pixel as either containing an object of interest or background; for this we threshold the reference count map to obtain a binary map. We observed empirically that this helps reduce noise in non-dense areas.
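Here is a hedged PyTorch sketch of what the resulting two-task training signal could look like; the threshold value and the loss weighting are illustrative assumptions, and the exact losses used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_density, pred_logits, ref_density,
               threshold=0.1, semantic_weight=1.0):
    # Threshold the reference count map into a binary object/background map.
    semantic_target = (ref_density > threshold).float()

    # Density stream: per-pixel regression against the reference counts.
    density_loss = F.mse_loss(pred_density, ref_density)

    # Semantic stream: binary classification of each pixel.
    semantic_loss = F.binary_cross_entropy_with_logits(
        pred_logits, semantic_target)

    return density_loss + semantic_weight * semantic_loss
```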

Now we have the data ready to train a model. The architecture consists of 6 ResNet blocks followed by an independent stream for each task, Semantic and Density.

Model Architecture
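As a minimal PyTorch sketch of such an architecture (anticipating the design choice discussed below), all convolutions use stride 1 and 'same' padding, so the feature maps are never downsampled. The channel width, the absence of normalization layers, and the use of all 13 Sentinel-2 bands are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Residual block with stride 1 and 'same' padding: the spatial
    # resolution of the feature maps is preserved.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class SubPixelCounter(nn.Module):
    # Shared trunk of 6 ResNet blocks, followed by independent streams
    # (1x1 convolutions) for the Density and Semantic tasks.
    def __init__(self, in_bands=13, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(in_bands, channels, 3, padding=1)
        self.trunk = nn.Sequential(*[ResBlock(channels) for _ in range(6)])
        self.density_head = nn.Conv2d(channels, 1, 1)   # counts per pixel
        self.semantic_head = nn.Conv2d(channels, 1, 1)  # object vs. background

    def forward(self, x):
        features = self.trunk(self.stem(x))
        return self.density_head(features), self.semantic_head(features)
```

Because nothing in the trunk strides or pools, an input patch of any size maps to density and semantic outputs of the same height and width.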

Why not use a larger network pretrained on ImageNet?

For a large set of Computer Vision tasks, Transfer Learning is highly beneficial. The practice consists of re-using large models that were trained for image classification on a large-scale dataset, usually ImageNet. This often yields superior performance because the filters learned on ImageNet help identify other kinds of objects in the image, making the new task easier to solve.

However, objects in ImageNet usually span several thousand pixels in the image. This large difference in scale reduces the benefit of Transfer Learning for our sub-pixel task. Furthermore, such models reduce the spatial dimension of the feature maps, both to (1) generate higher-level features over large areas that capture context and to (2) reduce the overall size of the model.

We go in the opposite direction: we keep the same spatial dimension of the feature maps throughout the network, so that the full detail of the input image can flow through to the predicted output. This benefits the level of detail and the overall performance of the network. See the example comparison below with DeepLab-V2, a popular semantic segmentation architecture [3].

Predicted coconut counts: comparison of DeepLab-V2 and our method

Which objects can be counted?

To count sub-pixel objects we rely on two main aspects:

  1. Objects should appear in a pattern similar to the training dataset: a parking lot for cars, or a more or less regular plantation pattern for trees.
  2. For trees, the spectral signature of a species helps the model tell one species apart from another.

Taking these aspects into account, we tested our method on objects of different sizes. The smallest objects were cars, which cover roughly one-tenth of a pixel. We also tested three different types of trees: Palm Oil trees, Coconut trees, and Olive trees. They all have distinct plantation patterns and specific spectral signatures.

Object Types Evaluated

Example Results

Below are some example predicted density maps for Coconut and Palm Oil trees. Note that although some of the high-density areas are not predicted correctly, the overall count is still within a small error margin.

Coconut trees: ground truth 88.5K (top), prediction 84.6K (bottom, -4.3%)
Palm Oil trees: ground truth 143.1K (top), prediction 137.1K (bottom, -4.2%)

Conclusion

We showed how a model can be developed to count objects at the sub-pixel scale. The method relies on the spectral signature of the objects and on their spatial patterns, such as plantation layouts. Because it relies only on Sentinel-2 imagery, it can be applied to large-scale analysis and even to monitoring the evolution of crops over time.

If you want more details, check out our paper:

Rodriguez, Andres C., and Jan D. Wegner. “Counting the uncountable: deep semantic density estimation from Space.” German Conference on Pattern Recognition. Springer, Cham, 2018. https://arxiv.org/abs/1809.07091

References

  1. Meynberg, Oliver, Shiyong Cui, and Peter Reinartz. “Detection of high-density crowds in aerial images using texture classification.” Remote Sensing 8.6 (2016): 470.
  2. Shang, Chong, Haizhou Ai, and Bo Bai. “End-to-end crowd counting via joint learning local and global count.” 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016.
  3. Chen, Liang-Chieh, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 834–848.

Andrés Camilo Rodríguez · EcoVisionETH
PhD at ETH Zurich, saving our forests with Machine Learning and Remote Sensing