Setting a Foundation for Machine Learning: Datasets and Labeling
In the coming weeks we will post a number of quantitative studies on image classification, resolution dependence, object detection, image enhancement, and algorithm tradeoffs. In this post, however, we describe the datasets that underpin those analyses.
We initially focus on the maritime domain, namely ship detection and heading classification. We view this problem as a compelling use case that also provides insights into terrestrial problems of interest in the satellite imagery domain, such as object localization in both rural areas and dense urban regions.
1. Labeled Training Data
For boat heading classification purposes we use two DigitalGlobe (DG) images. DigitalGlobe satellite images have very high resolution, as measured by the ground sample distance (GSD), or physical pixel size; a 1 meter GSD means that each pixel in that image has a spatial extent of 1 meter. Our training data consists of one 16K × 16K pixel GeoEye-1 image with 0.5m GSD and one 16K × 16K pixel WorldView3 image with 0.34m GSD, both taken near the Pacific entrance to the Panama Canal.
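To make the GSD figures concrete, the short sketch below converts image dimensions and GSD into a ground footprint. It assumes "16K" means 16,000 pixels (the true dimension may instead be 16,384), and the function name is ours, chosen for illustration.

```python
def footprint_km(pixels: int, gsd_m: float) -> float:
    """Ground distance in km covered by `pixels` pixels at `gsd_m` meters per pixel."""
    return pixels * gsd_m / 1000.0


# Assumption: "16K" is read as 16,000 pixels per side.
PIXELS_PER_SIDE = 16_000

for sensor, gsd_m in [("GeoEye-1", 0.5), ("WorldView3", 0.34)]:
    side_km = footprint_km(PIXELS_PER_SIDE, gsd_m)
    print(f"{sensor}: {side_km:.2f} km per side, {side_km * side_km:.1f} km^2")
```

Under that assumption, the GeoEye-1 image covers roughly 8 km on a side and the WorldView3 image roughly 5.4 km on a side.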
2. Boat Labeling
Each boat is labeled with a line segment stretching from stern to bow. These line segments are drawn in the geographic coordinate system (GCS), so each endpoint has a latitude and longitude. The line segment gives us both the length of the boat and its heading. Figures 2 and 3 detail dataset characteristics.
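As an illustration of how a stern-to-bow segment yields length and heading, here is a minimal sketch using the haversine distance and the initial great-circle bearing. It assumes a spherical Earth and (latitude, longitude) pairs in degrees; the function name and the exact method are our assumptions, not necessarily the procedure used to build the dataset.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def length_and_heading(stern, bow):
    """Return (length in meters, heading in degrees clockwise from north).

    `stern` and `bow` are (latitude, longitude) pairs in degrees.
    """
    lat1, lon1 = (math.radians(v) for v in stern)
    lat2, lon2 = (math.radians(v) for v in bow)
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine great-circle distance between the two endpoints.
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    length_m = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

    # Initial bearing from stern to bow, wrapped into [0, 360).
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    heading_deg = math.degrees(math.atan2(y, x)) % 360.0
    return length_m, heading_deg


# Hypothetical label: a vessel pointing due north, about 111 m long.
print(length_and_heading((8.900, -79.5), (8.901, -79.5)))
```

At boat scale the spherical approximation introduces only a small error in length; a projected coordinate system would be an equally reasonable choice.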
3. Boat Cutouts
To create a training set of objects of interest, we extract 812 cutouts of the boats in open water, harbors, and anchorages. These cutouts are rotated in 5-degree increments about the unit circle, forming 72 different sets of positive training images. The cutouts range in size from 8 × 8 to 376 × 376 pixels, so the smallest boats are captured at very low resolution (see Figure 4).
Negative training data is created through random selection of regions of the image that contain open water, land, or clouds. The images are then checked to ensure they lack any objects of interest. A total of 6580 negative images of varying pixel size were created.
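The bookkeeping behind the 72 rotation sets can be sketched as follows. We assume here that each set corresponds to one target heading, and that the labeled heading of a cutout is used to work out how far to rotate it; the function names are hypothetical, and the pixel resampling itself (done with an image library) is left out.

```python
import math

STEP_DEG = 5
N_BINS = 360 // STEP_DEG  # 72 heading bins


def heading_bin(heading_deg: float) -> int:
    """Index (0..71) of the 5-degree bin nearest to `heading_deg`."""
    return math.floor(heading_deg / STEP_DEG + 0.5) % N_BINS


def rotation_plan(labeled_heading_deg: float):
    """Yield (clockwise_rotation_deg, bin_index) for each of the 72 target headings.

    The rotation is the clockwise angle that turns a boat from its labeled
    heading to the target heading. Many image libraries treat positive angles
    as counterclockwise, so the sign may need flipping when resampling.
    """
    for bin_index in range(N_BINS):
        target_deg = bin_index * STEP_DEG
        yield (target_deg - labeled_heading_deg) % 360.0, bin_index


# Hypothetical cutout labeled with a heading of 37 degrees.
plan = list(rotation_plan(37.0))
print(len(plan), plan[0], plan[1])
```

If every one of the 812 cutouts appears in every set, this yields 812 × 72 = 58,464 positive images.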
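A minimal sketch of random negative sampling is shown below. It assumes that boat locations are available as pixel bounding boxes and that negative sizes span the same 8 to 376 pixel range as the positive cutouts; both are our assumptions for illustration. The automatic overlap test only pre-filters candidates and does not replace the manual check described above.

```python
import random


def boxes_overlap(a, b):
    """True if two half-open boxes (x0, y0, x1, y1) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_negative_boxes(image_size, boat_boxes, n,
                          min_side=8, max_side=376, seed=0, max_tries=100_000):
    """Draw up to `n` random square boxes that overlap no labeled boat box."""
    rng = random.Random(seed)  # seeded for reproducibility
    width, height = image_size
    negatives = []
    tries = 0
    while len(negatives) < n and tries < max_tries:
        tries += 1
        side = rng.randint(min_side, max_side)
        x0 = rng.randint(0, width - side)
        y0 = rng.randint(0, height - side)
        candidate = (x0, y0, x0 + side, y0 + side)
        if not any(boxes_overlap(candidate, boat) for boat in boat_boxes):
            negatives.append(candidate)
    return negatives


# Hypothetical 1000 x 1000 image with a single labeled boat.
boats = [(100, 100, 200, 200)]
print(sample_negative_boxes((1000, 1000), boats, n=3))
```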
4. Validation Data
To ensure robustness, classification algorithms (a form of supervised machine learning) are trained on one set of data, then tested for accuracy on a different set of data (the validation dataset). For our purposes, this independent validation set comes from a third image, taken by the DigitalGlobe WorldView2 satellite with 0.5m GSD. The validation set was created by extracting 516 boat cutouts from the WorldView2 image in the same way as the 812 positive training images; we also include 4356 negative cutouts from the same validation image in order to test robustness to false positives. Figure 6 provides a summary of the validation data used.
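One way to quantify robustness to false positives on a set with both boat and non-boat cutouts is to report recall alongside the false positive rate. The sketch below is a generic illustration with made-up labels and predictions; it does not reflect any results from our classifiers.

```python
def detection_rates(labels, predictions):
    """Return (recall, false_positive_rate) for boolean labels and predictions.

    True means "boat". Rates are NaN when the corresponding class is absent.
    """
    tp = fp = fn = tn = 0
    for truth, predicted in zip(labels, predictions):
        if truth and predicted:
            tp += 1
        elif truth and not predicted:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    false_positive_rate = fp / (fp + tn) if (fp + tn) else float("nan")
    return recall, false_positive_rate


# Made-up example: 4 boat cutouts followed by 6 negative cutouts.
labels = [True] * 4 + [False] * 6
predictions = [True, True, True, False, True, False, False, False, False, False]
print(detection_rates(labels, predictions))
```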
5. Potential Labeling Errors
One potential source of error for boat heading classification is a 180-degree ambiguity caused by apparent visual symmetry. For vessels whose bow and stern look alike from overhead, both machine learning classifiers and human labelers often misclassify heading by 180 degrees.
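To separate these bow/stern flips from ordinary angular error, one can compare the full heading error with an error measured modulo 180 degrees. The sketch below is one possible way to do so; the function names and the 10-degree tolerance are illustrative assumptions.

```python
def heading_error(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute angular difference between two headings, in [0, 180]."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def axis_error(pred_deg: float, true_deg: float) -> float:
    """Angular difference ignoring bow/stern direction, in [0, 90]."""
    diff = abs(pred_deg - true_deg) % 180.0
    return min(diff, 180.0 - diff)


def is_probable_flip(pred_deg: float, true_deg: float, tol_deg: float = 10.0) -> bool:
    """True when the prediction is within `tol_deg` of being exactly reversed."""
    return heading_error(pred_deg, true_deg) >= 180.0 - tol_deg


# A prediction of 185 degrees against a label of 5 degrees: the axis is
# correct, but bow and stern are swapped.
print(heading_error(185.0, 5.0), axis_error(185.0, 5.0), is_probable_flip(185.0, 5.0))
```

A large heading error paired with a small axis error suggests that the vessel's orientation was found correctly and only the bow/stern assignment is wrong.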
6. Conclusion
In this post we described our research emphasis and the corresponding satellite imagery datasets. We initially focus on a maritime domain awareness challenge, and extract a corpus of boat cutouts at various headings for training our algorithms. A similar, though entirely separate, dataset is extracted for algorithm validation. Modulo potential labeling errors caused by vessel symmetries, we now have a high-quality data corpus that we have used in our research over the past few months, the results of which will be summarized in upcoming posts.