Can we use a Deep Learning model to count objects at a sub-pixel scale? Deep Learning has been used successfully to automate many tasks that used to be done by hand, and that is usually the goal: to stop performing manual work. But what about tasks that are difficult even for humans?
We developed a model that does just this: counting objects from space that are usually not visible to the naked eye.
Why not just use high-resolution images?
With high-resolution images this task would be quite straightforward, but the cost of acquiring such images would make it infeasible for large-scale analysis. Besides, with Sentinel-2 imagery we obtain a new image everywhere in the world every 5 days, which opens new opportunities such as mapping large areas or following trends over time.
Taking the same low-res satellite image of the parking lot, we can feed it to our model and predict the number of cars per pixel, which adds up to 4100 cars in the image. Compared against the high-resolution reference data, the model’s estimate is within 5% error. Not bad for a blurry input image!
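As a rough illustration of how a per-pixel prediction turns into a total count, the sketch below sums a small density map and compares the result with a reference count. The array values, the reference number and the function names are made up for this example and are not taken from our experiments.

```python
import numpy as np

def total_count(density_map):
    """Sum the per-pixel densities to get the estimated number of objects."""
    return float(np.sum(density_map))

def relative_error(predicted, reference):
    """Absolute relative error of a predicted count against a reference count."""
    return abs(predicted - reference) / reference

# Hypothetical 3x3 density map: each value is the estimated number of cars in that pixel.
density_map = np.array([
    [0.0, 0.5, 1.0],
    [0.5, 2.0, 1.5],
    [0.0, 1.0, 0.5],
])

predicted = total_count(density_map)  # sum over all pixels
reference = 7.5                       # made-up reference count
print(predicted, relative_error(predicted, reference))
```

Note that individual pixels can be off while the summed count is still close, since per-pixel errors partly cancel out.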
How can we count sub-pixel scale objects?
With sub-pixel scale objects it is difficult to estimate the exact position of each object, which is why we cast the problem as a regression task: the output is a continuous value, the estimated density of objects in each pixel.
This idea is commonly used in crowd counting tasks, where some people in the image appear at very small scales and can’t be located properly [1,2]. The challenge, as in many machine learning problems, is to obtain high-quality reference data to train the model. Since our objects are at a sub-pixel scale, we first resorted to using high-resolution satellite imagery where we can clearly see the objects. Over each area of interest, we first labeled a small area and then used an automatic object detector on the high-resolution imagery to obtain the locations of 1.6 million objects in different regions around the world.
In the figure below, the automatic detections are represented by the red bounding boxes on the left. For a scale comparison, the yellow bounding boxes represent the 10x10m Sentinel-2 pixels. The next step is to convert this information into a count of objects of interest inside each yellow bounding box. This procedure results in the count map on the right.
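The conversion from detections to a per-pixel count can be sketched as a 2-D histogram over the detection centers. This is a minimal illustration, assuming detections are reduced to their center coordinates in meters relative to the top-left corner of the area; the function name and the coordinates are hypothetical.

```python
import numpy as np

def count_map_from_detections(xs, ys, width_m, height_m, pixel_size_m=10.0):
    """Count detection centers falling inside each low-resolution pixel.

    xs, ys: detection center coordinates in meters from the top-left corner.
    Returns an integer array of shape (rows, cols).
    """
    cols = int(np.ceil(width_m / pixel_size_m))
    rows = int(np.ceil(height_m / pixel_size_m))
    # histogram2d bins its first argument along the first axis, so ys gives rows.
    counts, _, _ = np.histogram2d(
        ys, xs,
        bins=[rows, cols],
        range=[[0, rows * pixel_size_m], [0, cols * pixel_size_m]],
    )
    return counts.astype(int)

# Hypothetical detections over a 30 m x 20 m area (3 x 2 Sentinel-2 pixels).
xs = np.array([1.0, 4.0, 12.0, 25.0, 28.0])
ys = np.array([2.0, 8.0, 15.0, 3.0, 18.0])
count_map = count_map_from_detections(xs, ys, width_m=30.0, height_m=20.0)
print(count_map)
```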
We are working at a much coarser resolution than the original high-resolution image the data was obtained from, and this change of scale causes problems at training time.
To obtain the reference data at the 10x10m pixel scale, we blur the count map with a Gaussian kernel with standard deviation σ = K/π, where K is the scale ratio between the high-resolution image and the low-resolution image.
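One possible way to implement this step is sketched below. It assumes that the blur is applied on the high-resolution grid with σ expressed in high-resolution pixels, and that the blurred map is then summed over K x K blocks to reach the low-resolution grid; these details, the value K = 10 and the helper names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1-D Gaussian kernel, truncated at `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def blur(image, sigma):
    """Separable Gaussian blur with zero padding (output keeps the input shape)."""
    kernel = gaussian_kernel_1d(sigma)
    rows_blurred = np.apply_along_axis(np.convolve, 1, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows_blurred, kernel, mode="same")

def downsample_sum(image, k):
    """Sum non-overlapping k x k blocks (image sides must be divisible by k)."""
    h, w = image.shape
    return image.reshape(h // k, k, w // k, k).sum(axis=(1, 3))

K = 10              # hypothetical scale ratio, e.g. 1 m vs 10 m pixels
sigma = K / np.pi   # sigma in high-resolution pixels

# High-resolution map with one unit of mass at each (made-up) object location.
points = np.zeros((60, 60))
points[25, 25] = 1.0
points[29, 31] = 1.0
points[40, 22] = 1.0

density_hr = blur(points, sigma)
density_lr = downsample_sum(density_hr, K)  # 6 x 6 reference density map
print(density_lr.sum())
```

Because the kernel is normalized and the downsampling sums rather than averages, the total count is preserved as long as objects are not too close to the image border, where zero padding lets some of the mass fall outside.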
We are also interested in obtaining a clean count in areas that do not contain the object of interest, such as empty land or natural forest. So besides the Density task, we additionally train a Semantic task that classifies each pixel as either containing any object of interest or background. For this task, we threshold the reference count map to obtain a binary map. We observed empirically that this helped reduce the noise in non-dense areas.
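Deriving the binary reference for the Semantic task can be sketched in a few lines. The threshold value used here is a placeholder chosen for the example, as are the function name and the count values.

```python
import numpy as np

def semantic_labels(count_map, threshold):
    """Binary map: 1 where the reference count exceeds the threshold, 0 for background."""
    return (count_map > threshold).astype(np.uint8)

# Hypothetical reference count map (objects per pixel).
reference_counts = np.array([
    [0.0, 0.2, 1.5],
    [0.0, 0.0, 3.0],
])

labels = semantic_labels(reference_counts, threshold=0.1)  # placeholder threshold
print(labels)
```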
Now we have the data ready to train a model. The model’s architecture consists of 6 ResNet blocks followed by independent streams for each task, the Semantic and the Density task.
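To make the layout concrete, here is a minimal PyTorch sketch of a shared trunk of 6 residual blocks followed by one stream per task. Everything beyond that outline is an assumption for illustration: the class names, the number of input bands (13, the full set of Sentinel-2 bands), the channel width, the kernel sizes and the design of the two streams.

```python
import torch
from torch import nn

class ResBlock(nn.Module):
    """Basic residual block with stride 1, so height and width are unchanged."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class CountingNet(nn.Module):
    """Shared trunk plus independent Semantic and Density streams (hypothetical layout)."""

    def __init__(self, in_bands=13, channels=64, num_blocks=6):
        super().__init__()
        self.stem = nn.Conv2d(in_bands, channels, kernel_size=3, padding=1)
        self.trunk = nn.Sequential(*[ResBlock(channels) for _ in range(num_blocks)])
        # One stream per task; each ends in a 1x1 convolution.
        self.semantic_head = nn.Sequential(ResBlock(channels), nn.Conv2d(channels, 2, kernel_size=1))
        self.density_head = nn.Sequential(ResBlock(channels), nn.Conv2d(channels, 1, kernel_size=1))

    def forward(self, x):
        features = self.trunk(self.stem(x))
        return self.semantic_head(features), self.density_head(features)

model = CountingNet().eval()
patch = torch.randn(1, 13, 32, 32)  # random stand-in for a Sentinel-2 patch
with torch.no_grad():
    semantic_logits, density = model(patch)
print(semantic_logits.shape, density.shape)
```

No layer in this sketch uses striding or pooling, so both outputs have the same height and width as the input patch.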
Why not use a larger network pretrained on ImageNet?
For a large set of Computer Vision tasks, Transfer Learning is highly beneficial. This practice consists of re-using large models that were trained to solve image classification on a large-scale dataset, usually ImageNet. One can achieve superior performance because the filters learned on ImageNet can help identify other kinds of objects in the image, making the new task easier to solve.
However, objects in ImageNet usually cover several thousand pixels in the image. This large scale difference reduces the benefits of Transfer Learning for our sub-pixel task. Furthermore, such models reduce the spatial dimension of the feature maps, both to (1) generate higher-level features over large areas to learn context and to (2) reduce the overall size of the model.
We go in another direction: we keep the same spatial dimension of the feature maps throughout the network so that the full detail of the input image can reach the predicted output. This benefits the level of detail and the overall performance of the network. See an example comparison with DeepLab-V2 [3], a popular semantic segmentation architecture.
Which objects can be counted?
To count sub-pixel objects we rely on two main aspects:
- Objects should follow a pattern similar to the one in the training dataset, for example a parking lot layout for cars or a more or less regular plantation pattern for trees.
- For trees, the spectral signature of a species helps the model tell one species from another.
Taking these aspects into account, we tested our method with objects of different sizes. The smallest objects we tested were cars, each of which covers about 1/10th of a pixel. We also tested three different types of trees: Palm Oil trees, Coconut trees and Olive trees. They all have different plantation patterns and specific spectral signatures.
See below some example predicted density maps for Coconut and Palm Oil trees. Note that although some of the high densities are not correctly predicted, the overall count is still within a small error margin.
We showed how a model can be developed to solve object counting at the sub-pixel scale. The method relies on the spatial patterns of the objects and, for trees, on their spectral signature. Because it relies only on Sentinel-2 imagery, it could be applied to large-scale analysis and even to monitoring how crops evolve over time.
If you want more details, check out our paper:
Rodriguez, Andres C., and Jan D. Wegner. “Counting the uncountable: deep semantic density estimation from Space.” German Conference on Pattern Recognition. Springer, Cham, 2018. https://arxiv.org/abs/1809.07091
[1] Meynberg, Oliver, Shiyong Cui, and Peter Reinartz. “Detection of high-density crowds in aerial images using texture classification.” Remote Sensing 8.6 (2016): 470.
[2] Shang, Chong, Haizhou Ai, and Bo Bai. “End-to-end crowd counting via joint learning local and global count.” 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016.
[3] Chen, Liang-Chieh, et al. “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 834–848.