Car Localization and Counting with Overhead Imagery, an Interactive Exploration

Adam Van Etten
Published in The DownLinQ
Mar 15, 2017


Vehicle localization from satellite imagery has myriad use cases in the commercial, national security, and humanitarian realms. On the commercial front, a number of companies have attempted to infer retail traffic from parking lot density levels, and tracking delivery trucks in near real-time is one of the far-field goals of satellite imagery analytics. In the realm of national security, detecting the buildup of war materiel in unstable regions would provide obvious value, as would locating convoys of vehicles vectored towards unmanned border crossings, or identifying a large number of vehicles staging just outside the range of terrestrial border monitoring equipment. On the humanitarian front, one might attempt to infer the scope of natural disasters from clusters (or absences) of vehicles, or determine optimal travel routes for disaster relief in unknown areas based on observations of local vehicle movements.

In this post we apply the YOLT2 convolutional neural network object detection pipeline (1, 2, 3, 4) to the task of car detection in overhead imagery. For a very large and diverse set of test imagery at 30 cm resolution we correctly localize cars 90% of the time, and achieve an error rate on car counts of 5%. For readers looking for greater detail we include interactive plots and links to high resolution imagery.

1. Dataset Overview

The Cars Overhead with Context (COWC) dataset is a large, high quality set of annotated cars from overhead imagery. The data consists of ~33,000 unique cars from six different image locales: Toronto, Canada; Selwyn, New Zealand; Potsdam and Vaihingen, Germany; and Columbus and Utah, United States. The dataset is described in detail by Mundhenk et al, 2016, along with an interesting approach to car counting using this dataset. The Columbus and Vaihingen images are grayscale, and we discard them for now. The remaining datasets are 3-band RGB images. Data is collected via aerial platforms, but at a nadir view angle such that it resembles satellite imagery. Labels are composed of a single pixel marker at the centroid of each car. It is worth noting that only personal vehicles are labelled; commercial delivery trucks and tractor-trailers are not. Therefore any algorithm for vehicle detection on the COWC dataset must be able to differentiate cars from other vehicle types.

The imagery has a ground sample distance (GSD) of 15 cm, approximately twice as fine as the current best resolution of commercial satellite imagery (30 cm GSD for DigitalGlobe). Accordingly, we convolve the raw imagery with a Gaussian kernel and reduce the image dimensions by half to create the equivalent of 30 cm GSD images. Subsequent posts will explore the impact of various resolutions on car localization, but in this post we utilize the down-sampled 30 cm GSD imagery.
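The blur-then-halve step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the kernel width (sigma = 1.0, radius = 2) and the edge handling are our own choices, since the post does not specify them, and the function names are hypothetical. In practice an image library would do this far faster.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2*radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(row, kernel):
    """Convolve one row with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for x in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(x + k - radius, 0), n - 1)
            acc += w * row[idx]
        out.append(acc)
    return out

def blur_and_halve(image, sigma=1.0, radius=2):
    """Gaussian-blur a 2-D list of pixel values, then keep every second pixel.

    sigma and radius are assumed values, not taken from the post.
    """
    kernel = gaussian_kernel(sigma, radius)
    rows = [blur_1d(r, kernel) for r in image]              # horizontal pass
    cols = [blur_1d(list(c), kernel) for c in zip(*rows)]   # vertical pass
    blurred = [list(r) for r in zip(*cols)]                 # transpose back
    return [row[::2] for row in blurred[::2]]               # 15 cm -> 30 cm GSD

# Toy single-band 8 x 8 "image" standing in for a 15 cm GSD band.
image_15cm = [[float((x + y) % 7) for x in range(8)] for y in range(8)]
image_30cm = blur_and_halve(image_15cm)
print(len(image_30cm), len(image_30cm[0]))
```

Blurring before decimation suppresses detail that the coarser pixel grid cannot represent, which is why the imagery is smoothed rather than simply subsampled.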

Previous studies with YOLT2 demonstrated reasonable results with small training sets. Accordingly, we reserve the largest geographic region (Utah, with 19,807 cars) for testing. This leaves 13,303 training cars among Potsdam, Selwyn, and Toronto.

2. YOLT2 Training Data

Data labels consist of an image mask where non-zero pixels denote car centroids. Recall that YOLT2 requires rectangular bounding boxes as labels (see Figure 4 of the first YOLT blog). Bounding box labels are created by assuming a mean car size of 3.0 meters and transforming the image mask into bounding boxes 20 pixels on a side (3.0 meters at the native 15 cm GSD) centered on each label point. The original labels and inferred YOLT2 bounding boxes are displayed in Figure 1 below.
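As a rough illustration of the centroid-to-box conversion described above, the sketch below reads non-zero pixels from a mask and builds a square box around each. The function names and the choice to clip boxes at the image border are assumptions for illustration, not details from the original pipeline.

```python
def mask_to_centroids(mask):
    """Return (x, y) for every non-zero pixel in a 2-D label mask."""
    return [(x, y)
            for y, row in enumerate(mask)
            for x, value in enumerate(row)
            if value]

def centroids_to_boxes(centroids, width, height, car_size_m=3.0, gsd_m=0.15):
    """Build (xmin, ymin, xmax, ymax) boxes of car_size_m around each centroid.

    With a 3.0 m car and 0.15 m pixels the box is 20 pixels on a side.
    Clipping to the image bounds is an assumption of this sketch.
    """
    half = round(car_size_m / gsd_m) / 2.0
    boxes = []
    for cx, cy in centroids:
        boxes.append((max(cx - half, 0), max(cy - half, 0),
                      min(cx + half, width), min(cy + half, height)))
    return boxes

# Toy 40 x 40 mask with two labelled cars.
mask = [[0] * 40 for _ in range(40)]
mask[20][15] = 1   # car centroid at x=15, y=20
mask[3][30] = 1    # car centroid near the top edge
boxes = centroids_to_boxes(mask_to_centroids(mask), width=40, height=40)
for box in boxes:
    print(box)
```

Passing gsd_m=0.30 yields 10-pixel boxes, matching the same 3 meter footprint on the down-sampled imagery.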

Figure 1. Partial sample COWC image over Potsdam at native 15 cm GSD with labels overlaid. Original labels are shown by a red dot located at each car centroid, while inferred 3 meter YOLT2 bounding box labels are shown in blue. Note that large trucks and other vehicles are not labelled, only cars. Imagery courtesy of ISPRS and Mundhenk et al, 2016.

Training images are down-sampled to 30 cm GSD and sliced into 416 x 416 pixel cutouts for input to YOLT2. Over the three training regions (Canada, New Zealand, and Germany) and 13,303 car labels we aggregate 2,418 image cutouts. We train the YOLT2 model for 1,600 epochs (one epoch is a complete pass through all the training data), which takes 4 days on a single NVIDIA Titan X GPU.
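Slicing a large image into fixed-size cutouts amounts to computing a grid of window origins. The sketch below is one way to do it; the function names are hypothetical, and shifting the last window back so it stays inside the image is our assumption, since the post does not say how edges are handled.

```python
def window_origins(length, window=416, overlap=0):
    """Start offsets of windows covering [0, length).

    If the regular grid leaves a remainder, a final window is shifted back
    so that it ends exactly at the image edge (an assumption of this sketch).
    """
    if length <= window:
        return [0]
    stride = window - overlap
    origins = list(range(0, length - window + 1, stride))
    if origins[-1] + window < length:
        origins.append(length - window)
    return origins

def slice_windows(width, height, window=416, overlap=0):
    """Return (x, y, w, h) for every cutout of a width x height image."""
    return [(x, y, window, window)
            for y in window_origins(height, window, overlap)
            for x in window_origins(width, window, overlap)]

cutouts = slice_windows(1000, 900)
print(len(cutouts))
print(window_origins(1000))
```

Each (x, y) origin also serves as the offset needed later to map detections from a cutout back into full-image coordinates.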

Figure 2. Potsdam 30 cm GSD YOLT2 training images 416 pixels on a side, with blue 3 meter bounding box labels overlaid. Imagery courtesy of ISPRS and Mundhenk et al, 2016.

3. Test Data

We reserve the Utah region for testing purposes, which contains ~50% more cars than the total of our training regions. The Utah region differs significantly in car density, building architecture, and vegetation patterns from our training regions in Germany, New Zealand, and Canada; Utah therefore provides a rigorous test case for our detection pipeline. One of the nine Utah test images is over central Salt Lake City (12TVL240120.png) and 13,213 cars exist in this single image. We don’t want a single high density image to completely dominate scene statistics, so in an effort to build meaningful statistics over our test set we split the central Salt Lake City image into sixteen 4000 x 4000 pixel cutouts with slight overlap at the edges. This overlap raises the car count since the overlap regions are counted twice. We remove one scene (12TVL120100-CROP.png) that contains only 61 cars from the test set because of its low statistics. This leaves 23 test images over Utah at 30 cm GSD, and a total of 25,980 cars.

4. Test Procedure

One of the strengths of the YOLT2 detection pipeline is speed; image inference proceeds at 44 frames per second. This speed is important given the very large sizes of many satellite imagery corpora. We iterate over the large test images (some of which are over 13,000 pixels on a side) and dissect test images into smaller cutouts. For the largest test images this yields over 2,400 individual cutouts, which still takes under one minute to process. The final output is created by stitching the many cutouts back together, resolving overlapping detections that arise from cutout overlap, applying non-max suppression, and computing performance metrics from proposals and ground truth data.
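The stitching step can be sketched as two operations: shift each cutout's detections by the cutout origin, then run non-max suppression over the merged list. The code below is a minimal greedy version; the 0.5 overlap threshold, the (xmin, ymin, xmax, ymax, score) tuple layout, and the function names are assumptions for illustration rather than the settings used in YOLT2.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax, ...) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def to_global(detections, x_off, y_off):
    """Shift cutout-local detections into full-image pixel coordinates."""
    return [(x0 + x_off, y0 + y_off, x1 + x_off, y1 + y_off, score)
            for x0, y0, x1, y1, score in detections]

def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

# Two overlapping cutouts both detect the same car near their shared edge.
left = to_global([(400, 100, 410, 110, 0.9)], x_off=0, y_off=0)
right = to_global([(35, 101, 45, 111, 0.8), (200, 50, 210, 60, 0.7)],
                  x_off=366, y_off=0)
merged = non_max_suppression(left + right)
print(merged)
```

In this toy case the duplicate detection from the right-hand cutout is suppressed, leaving one box per car.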

We adopt the performance metric first described in the HOG Boat Detection post, Section 2: a true positive is defined as having a Jaccard index (also known as intersection over union) of greater than 0.25. A Jaccard index of 0.5 is often used as the threshold for a correct detection, though as in Equation 5 of ImageNet we select a lower threshold since we are dealing with very small objects. We adopt a color scheme of: blue = ground truth, green = true positive, red = false positive, yellow = false negative.
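To make the threshold concrete, here is a minimal sketch of scoring proposals against ground truth with a Jaccard cutoff of 0.25. The greedy one-to-one matching strategy and the function names are our assumptions; the post defines only the threshold, not the matching procedure.

```python
def jaccard(a, b):
    """Jaccard index (intersection over union) of two boxes
    given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_detections(truth, proposals, threshold=0.25):
    """Greedily match each proposal to its best unmatched ground truth box.

    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(truth)
    tp = 0
    for prop in proposals:
        best = max(unmatched, key=lambda t: jaccard(prop, t), default=None)
        if best is not None and jaccard(prop, best) > threshold:
            unmatched.remove(best)
            tp += 1
    fp = len(proposals) - tp
    fn = len(unmatched)
    return tp, fp, fn

truth = [(0, 0, 10, 10), (50, 50, 60, 60)]
proposals = [(4, 0, 14, 10), (100, 100, 110, 110)]
print(match_detections(truth, proposals))         # threshold 0.25
print(match_detections(truth, proposals, 0.5))    # stricter threshold
```

The first proposal is offset by 4 pixels from a 10 pixel car, giving a Jaccard index of about 0.43: a hit at the 0.25 threshold but a miss at 0.5, which illustrates why a lower threshold suits very small objects.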

The true and false positives and negatives are aggregated into a single value known as the F1 score, which varies from 0 to 1 and is the harmonic mean of precision and recall. We also compute the predicted number of cars in the scene as a fraction of the number of ground truth cars.
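For reference, the F1 computation from raw counts is short enough to write out. The function name and the example counts below are purely illustrative and are not results from this post.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 90 hits, 5 spurious detections, 10 missed cars.
print(round(f1_score(90, 5, 10), 3))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a detector cannot earn a high F1 by trading away either precision or recall.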

5. Car Localization Results

Over the entire corpus of 23 test images we achieve an F1 of 0.90 +/- 0.09 (see Figure 5), and the fraction of predicted number to ground truth number is 0.95 +/- 0.05 (see Figure 6). More interesting than the aggregate results are the results scene-by-scene, which we explore below.

Figure 3. Central Salt Lake City test scene. Ground truth boxes are blue, true positive proposals are green, false positives are red, and false negatives are yellow. Overall, we achieve an F1 of 0.95 for this urban scene. The high resolution image is hosted here. Imagery courtesy of AGRC and Mundhenk et al, 2016.

The overall performance for each scene is computed and plotted in Figures 4–6 below. These interactive plots were created using Bokeh; hovering over each data point brings up relevant information, and tapping on an individual data point links to the high resolution evaluation image. Each unique Utah test scene is assigned its own color, with the multiple red dots denoting the central Salt Lake City image that was dissected into multiple cutouts.

Figure 4. Number of cars in each test image. Clicking on the link brings up the interactive Bokeh plot with hover (displayed for Image 9) and tapping enabled.

In Figures 4–6 we compute the mean and error (assigned as one standard deviation) of each performance measure, weighted by the number of cars in each scene. The dotted line denotes the weighted mean, with the yellow band displaying the weighted standard deviation.
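A count-weighted mean and standard deviation can be computed as in the sketch below. We assume the simple (population) form of the weighted variance, since the post does not state which estimator was used; the function name and the toy scores are illustrative, not the values plotted in the figures.

```python
import math

def weighted_mean_std(values, weights):
    """Weighted mean and (population-form) weighted standard deviation."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2
                   for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)

# Toy per-scene scores, weighted by the number of cars in each scene.
f1_scores = [0.95, 0.90, 0.60]
car_counts = [1000, 500, 100]
mean, std = weighted_mean_std(f1_scores, car_counts)
print(round(mean, 4), round(std, 4))
```

Weighting by car count means a sparse scene with a poor score shifts the mean far less than a dense scene would.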

Figure 5. F1 score for each test image. Clicking on the image brings up the interactive Bokeh plot that links to raw test images. Imagery courtesy of AGRC and Mundhenk et al, 2016.

The weighted mean F1 score over all test images is 0.90, though the weighted mean F1 is 0.94 for the red dots denoting the center of Salt Lake City; this area more closely resembles the training data than some of the surrounding test sites. For example, image 2 has a much lower score given that this is a junkyard scene and differentiating overlapping and decaying cars is quite difficult.

Figure 6. Fraction of number of predicted cars to ground truth cars; a value of 1.0 denotes a perfect prediction.

Total car count in a specified region may be a more valuable metric in the commercial realm than F1 score. As with the F1 score, a value of 1.0 on the fractional car count metric plotted in Figure 6 denotes a perfect prediction. The weighted mean of 0.95 is closer to 1.0 than the F1 score because false positives and false negatives partially cancel out in a count: each spurious detection offsets one missed car.
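The cancellation is easy to see with a small worked example. The counts below are illustrative only, and the function name is hypothetical; the predicted count is true positives plus false positives, while the ground truth count is true positives plus false negatives.

```python
def count_fraction(tp, fp, fn):
    """Predicted car count as a fraction of the ground truth car count."""
    predicted = tp + fp   # everything the detector reported
    truth = tp + fn       # everything that is actually there
    return predicted / truth

# Illustrative counts only: 90 hits, 5 spurious detections, 10 missed cars.
tp, fp, fn = 90, 5, 10
fraction = count_fraction(tp, fp, fn)
f1 = 2 * tp / (2 * tp + fp + fn)   # equivalent closed form of the F1 score
print(round(fraction, 3), round(f1, 3))
```

Here the 5 false positives offset half of the 10 missed cars in the count, whereas both kinds of error lower the F1 score.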

6. Conclusions

In this post we applied the YOLT2 detection pipeline to the problem of localizing cars in large overhead images from the COWC dataset. We also implemented interactive plots and hosted high resolution evaluation images for the interested reader to explore. Over a large test corpus of 20,000+ cars we achieve an F1 score of 0.90 +/- 0.09 for 30 cm imagery. Aggregated car counts are somewhat better, with an accuracy of 95 +/- 5%. Accuracies are highest (as high as F1 = 0.97) in urban scenes with dense car counts that resemble the training scenes, and we observe similar localization accuracies for both static and mobile vehicles.

In future posts we will explore localization performance as a function of both image resolution and training time, aiming to shed light on the hardware and software requirements for accurate car localization. In the meantime we encourage the interested reader to explore the linked high resolution images to further ascertain the scene types and conditions that remain a challenge for computer vision algorithms.

Thanks to David Lindenbaum and Vishal Sandesara for help with plot hosting. Thanks to lporter and Lee Cohn for useful comments.

May 29, 2018 Addendum: See this post for paper and code details.