Setting a Foundation for Machine Learning: Datasets and Labeling

Adam Van Etten
The DownLinQ

--

In the coming weeks we will post a number of quantitative studies on image classification, resolution dependence, object detection, image enhancement, and algorithm tradeoffs. In this post, however, we describe some important details about the datasets that underpin the analyses that will be presented in forthcoming posts.

We initially focus on the maritime domain, namely ship detection and heading classification. We view this problem as a compelling use case that also provides insights into terrestrial problems of interest in the satellite imagery domain, such as object localization in both rural areas and dense urban regions.

1. Labeled Training Data

For boat heading classification purposes we use two DigitalGlobe (DG) images. DigitalGlobe satellite images have very high resolution, as measured by the ground sample distance (GSD), or physical pixel size; a 1 meter GSD means that each pixel in that image has a spatial extent of 1 meter. Our training data consists of one 16K × 16K pixel GeoEye-1 image with 0.5m GSD and one 16K × 16K pixel WorldView3 image with 0.34m GSD, both taken near the Pacific entrance to the Panama Canal.
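As a quick illustration of what GSD implies, the sketch below converts an image's pixel dimensions into ground extent. It reads "16K" as 16,000 pixels, which is an assumption (16,384 is another plausible reading), and the function name is hypothetical.

```python
def ground_extent_m(pixels: int, gsd_m: float) -> float:
    """Ground distance in meters spanned by `pixels` pixels at a given GSD."""
    return pixels * gsd_m

# Assumption: "16K" is taken to mean 16,000 pixels per side.
SIDE_PIXELS = 16_000

for name, gsd in [("GeoEye-1", 0.5), ("WorldView3", 0.34)]:
    side_km = ground_extent_m(SIDE_PIXELS, gsd) / 1000
    print(f"{name}: roughly {side_km:.2f} km per side")
```

Under that assumption the GeoEye-1 image spans about 8 km per side, and the finer GSD of the WorldView3 image means the same pixel count covers a smaller area on the ground.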

Figure 1: Snapshot of the region used to extract training data, sourced from GeoEye-1 (0.5m GSD) and WorldView3 (0.34m GSD) (Imagery Courtesy of DigitalGlobe).

2. Boat Labeling

Each boat is labeled with a line segment stretching from stern to bow. These line segments are drawn in the geographic coordinate system (GCS) so they have latitude and longitude points. This line segment gives us the length of the boat and heading direction. Figures 2 and 3 detail dataset characteristics.
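A minimal sketch of how a stern-to-bow segment in latitude and longitude could be turned into a length and a heading. It uses the haversine distance and the initial great-circle bearing on a spherical Earth. The function name, the Earth radius constant, and the convention that heading is measured clockwise from north are assumptions for illustration, not details given in this post.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation (assumption)

def boat_length_and_heading(stern, bow):
    """stern, bow: (lat, lon) in degrees. Returns (length_m, heading_deg)."""
    lat1, lon1 = map(math.radians, stern)
    lat2, lon2 = map(math.radians, bow)
    dlat, dlon = lat2 - lat1, lon2 - lon1

    # Haversine great-circle distance between the two endpoints
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    length_m = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

    # Initial bearing from stern to bow, clockwise from north, in [0, 360)
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    heading_deg = math.degrees(math.atan2(y, x)) % 360.0
    return length_m, heading_deg

# Hypothetical label: a boat pointing due north
length_m, heading_deg = boat_length_and_heading((8.9000, -79.5000), (8.9002, -79.5000))
print(f"length = {length_m:.1f} m, heading = {heading_deg:.1f} deg")
```

At the scale of a single vessel a flat-Earth approximation would give nearly the same answer; the spherical formulas are used here only because they are compact and well known.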

Figure 2. Summary of training dataset.
Figure 3. Distribution of boat sizes for training data. The majority of boats are around 20m in length.

3. Boat Cutouts

To create a training set of objects of interest, we extract 812 cutouts of the boats in open water, harbors, and anchorages. These cutouts are rotated in 5-degree increments through a full 360 degrees, forming 72 different sets of positive training images. The cutouts range in size from [8 x 8] to [376 x 376] pixels, so the smallest boats are represented by very few pixels (see Figure 4).
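The rotation scheme can be sketched as follows. The helper name is hypothetical, and the sketch assumes that each cutout is rotated from its labeled heading to each target heading, so that every one of the 812 cutouts appears once in each of the 72 sets.

```python
STEP_DEG = 5
TARGET_HEADINGS = list(range(0, 360, STEP_DEG))  # 0, 5, ..., 355

def rotation_to_target(labeled_heading_deg: float, target_heading_deg: float) -> float:
    """Rotation in degrees, in [0, 360), that takes a cutout from its
    labeled heading to the target heading."""
    return (target_heading_deg - labeled_heading_deg) % 360.0

NUM_CUTOUTS = 812  # number of boat cutouts in the training set

print(len(TARGET_HEADINGS))                # number of heading sets
print(NUM_CUTOUTS * len(TARGET_HEADINGS))  # total rotated cutouts under the assumption above
```

A 5-degree step divides 360 degrees into 72 headings, which matches the 72 sets described above.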

Figure 4: Example satellite cutouts oriented at 0 degrees (top) and 200 degrees (bottom) at native resolution (Imagery Courtesy of DigitalGlobe).

Negative training data is created through random selection of regions of the image that contain open water, land, or clouds. The images are then checked to ensure they lack any objects of interest. A total of 6580 negative images of varying pixel size were created.
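One way such a selection could be automated is rejection sampling: draw random boxes and discard any that overlap a labeled boat. The sketch below is illustrative only. The function names, the square box shape, the size range (borrowed from the positive cutouts), and the automated overlap test are all assumptions; the check described above may well have been done by eye.

```python
import random

def boxes_overlap(a, b):
    """Boxes are (x_min, y_min, x_max, y_max) in pixels; touching edges do not count."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def sample_negative_boxes(image_size, boat_boxes, count,
                          min_side=8, max_side=376, seed=0):
    """Draw up to `count` random square boxes that overlap none of `boat_boxes`."""
    rng = random.Random(seed)
    width, height = image_size
    negatives = []
    max_attempts = count * 1000  # guard against an image with little free space
    for _ in range(max_attempts):
        if len(negatives) == count:
            break
        side = rng.randint(min_side, max_side)
        x = rng.randint(0, width - side)
        y = rng.randint(0, height - side)
        box = (x, y, x + side, y + side)
        # Keep the box only if it is clear of every labeled boat
        if not any(boxes_overlap(box, boat) for boat in boat_boxes):
            negatives.append(box)
    return negatives

# Hypothetical usage with one labeled boat bounding box
boats = [(100, 100, 200, 200)]
negatives = sample_negative_boxes((1000, 1000), boats, count=50)
print(len(negatives))
```

A filter like this only removes overlaps with boats that were already labeled, so a manual review of the resulting cutouts would still be needed to catch unlabeled vessels.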

Figure 5. Negative training data comprising land, clouds and sea (Imagery Courtesy of DigitalGlobe).

4. Validation Data

To ensure robustness, classification algorithms (a form of supervised machine learning) are trained on one set of data, then tested for accuracy on a different set of data (the validation dataset). For our purposes, this independent validation set stems from a third image from a DigitalGlobe WorldView2 satellite with 0.5m GSD. The independent validation dataset was created by extracting 516 boat cutouts from the WorldView2 image, in the same way as the 812 positive training images; we also include 4356 negative cutouts from the same validation image in order to test the robustness to false positives. Figure 6 below provides a summary of the validation data used.

Figure 6. Summary of independent validation dataset.

5. Potential Labeling Errors

One potential source of error for boat heading classification is the 180-degree ambiguity caused by apparent visual symmetry: when bow and stern look alike, it is difficult to tell which way the vessel is pointing. For vessels whose bow and stern are hard to tell apart, both machine learning classifiers and human labelers often misclassify heading by 180 degrees.
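To make the ambiguity concrete, the sketch below compares two ways of scoring a heading prediction: the ordinary angular error, and an "axis" error that ignores which end is the bow. The function names and the 5-degree tolerance are assumptions for illustration and do not describe any metric used in our analyses.

```python
def heading_error_deg(predicted: float, truth: float) -> float:
    """Smallest angular difference between two headings, in [0, 180]."""
    diff = abs(predicted - truth) % 360.0
    return min(diff, 360.0 - diff)

def axis_error_deg(predicted: float, truth: float) -> float:
    """Angular difference ignoring bow/stern direction, in [0, 90]."""
    diff = abs(predicted - truth) % 180.0
    return min(diff, 180.0 - diff)

def is_flipped(predicted: float, truth: float, tolerance_deg: float = 5.0) -> bool:
    """True when the prediction is within `tolerance_deg` of a 180-degree flip."""
    return heading_error_deg(predicted, truth) >= 180.0 - tolerance_deg

# A prediction that has the right axis but the wrong end as the bow
print(heading_error_deg(10, 190))  # large heading error
print(axis_error_deg(10, 190))     # no axis error
print(is_flipped(10, 190))
```

A prediction with a large heading error but a small axis error has the orientation of the hull right and only the bow and stern swapped, which is exactly the failure mode described above.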

Figure 7: A boat with a potentially misclassified heading due to its bilateral symmetry (Imagery Courtesy of DigitalGlobe).

6. Conclusion

In this post we described our research emphasis and the corresponding satellite imagery datasets. We initially focus on a maritime domain awareness challenge, and extract a corpus of boat cutouts at various headings for training our algorithms. A similar, though entirely separate, dataset is extracted for algorithm validation. Aside from potential labeling errors caused by vessel symmetries, we now possess a high-quality data corpus that we have used in our research over the past few months, the results of which will be summarized in upcoming posts.

--