Establishing a Machine Learning Workflow

Patrick Hagerty
Published in The DownLinQ
Sep 2, 2016 · 5 min read

While the rapid advancements in neural networks offer enormous promise for the field of imagery analytics, application of these technologies against an ever-increasing volume of commercially available satellite imagery remains in its early stages. CosmiQ Works has developed and tested a feasible technical workflow to quantitatively assess the performance of open-source machine learning (ML) algorithms against satellite images.

The workflow, illustrated below, includes all of the necessary steps to test the performance of ML algorithms against satellite imagery. At the highest level, the workflow divides into two parts: human-intensive activities and computationally intensive activities. It is important to determine the necessary human and computational resources before beginning an ML experiment.

We will explore each part of the workflow in greater detail in upcoming blog posts. We plan to focus primarily on the most resource-intensive activities: data labeling, training the imagery classifier, and testing the imagery classifier.

Several logos appear in the workflow to represent products that CosmiQ Works used during each of the steps of the workflow. The list of logos is not meant to be exhaustive but rather suggestive of products that facilitate the ML process, some with special support for satellite imagery.

Labeling:

The human-intensive aspect of ML is the labeling of the data. Proper labels often require subject matter expertise, both in the objects of interest and in the labels to be associated with those objects. Satellite imagery has attributes that complicate the labeling process: the images are large, multi-channeled, and geo-tagged, and each pixel may store more than eight bits per channel. Common image libraries typically do not handle these attributes well. CosmiQ Works used the open-source geographic information system QGIS as a framework for managing, displaying, and labeling the satellite imagery. Labels are stored as a vector layer with geo-references and are typically output as either a GeoJSON file or an Esri Shapefile. For feature extraction, OpenCV supports standard 3-band imagery, while the Orfeo Toolbox is specifically designed for remote sensing applications. GDAL is a useful library for processing the geospatial data from the command line or Python scripts.
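As a concrete illustration, the sketch below maps the vertices of a GeoJSON label layer (as exported from QGIS) into pixel coordinates of the source image using GDAL. The file names are hypothetical, and the sketch assumes a north-up image with no rotation terms in the geotransform.

```python
# Hedged sketch: convert GeoJSON label polygons (lon/lat) to pixel coordinates.
# "scene.tif" and "buildings.geojson" are hypothetical file names.
import json
from osgeo import gdal

raster = gdal.Open("scene.tif")
x0, dx, _, y0, _, dy = raster.GetGeoTransform()   # origin and pixel size (no rotation assumed)

with open("buildings.geojson") as f:
    labels = json.load(f)

for feature in labels["features"]:
    ring = feature["geometry"]["coordinates"][0]  # outer ring of a polygon
    pixels = [((lon - x0) / dx, (lat - y0) / dy) for lon, lat in ring]
    print(feature["properties"], pixels)
```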

Pre-process Data & Feature Extraction:

Automated analysis is usually optimized for a specific type of input, e.g., the size of the input image, the contrast within the image, and the physical dimensions of the image. To improve accuracy, imagery is pre-processed by a variety of techniques. CosmiQ Works performed only minimal processing (pixel normalization and image scaling) using the standard computer vision library OpenCV.
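The snippet below is a minimal sketch of that kind of pre-processing: it rescales an image chip to a fixed size and normalizes pixel values to [0, 1]. The file name and target size are hypothetical, not taken from our pipeline.

```python
# Hedged sketch of minimal pre-processing: resize and normalize one chip.
import cv2
import numpy as np

img = cv2.imread("chip.tif", cv2.IMREAD_UNCHANGED)            # preserves 16-bit depth if present
img = cv2.resize(img, (256, 256), interpolation=cv2.INTER_AREA)
img = img.astype(np.float32)
img = (img - img.min()) / max(img.max() - img.min(), 1e-6)    # per-chip normalization to [0, 1]
```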

Feature extraction differentiates deep learning from classical machine learning. In classical machine learning, features are determined prior to training. Sometimes these features are hand-crafted; they can be extremely effective for identifying regions and objects of interest but often require expertise to design. More frequently, however, features are extracted automatically by a pre-determined suite of algorithms. Deep learning algorithms, on the other hand, learn a hierarchy of features during the training process, which greatly increases the flexibility of the models; some learned features are optimized for the intended task, and some are general enough to be used across multiple tasks.

While OpenCV is good for manipulating 3-band imagery, it does not support the geographic data embedded in the imagery. GDAL is a useful software library for working with satellite imagery, supporting multi-band, 16-bit color depth, geotagged imagery; many GIS frameworks rely on GDAL for basic functionality. Classical feature extraction (such as SIFT, SURF, and Canny edge detection) is supported by OpenCV as well as by the Orfeo Toolbox, which is specifically designed for remote sensing.
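A small sketch of combining the two libraries: GDAL reads a multi-band, 16-bit geotagged image, and OpenCV runs a classical feature extractor (Canny edges) on one band. The file name and thresholds are hypothetical, and the band is rescaled to 8-bit because Canny expects 8-bit input.

```python
# Hedged sketch: GDAL for geospatial I/O, OpenCV for classical feature extraction.
import cv2
import numpy as np
from osgeo import gdal

ds = gdal.Open("scene.tif")                                       # hypothetical multi-band GeoTIFF
band = ds.GetRasterBand(1).ReadAsArray()                          # first band as a NumPy array
band8 = cv2.normalize(band, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
edges = cv2.Canny(band8, 50, 150)                                 # illustrative thresholds
print(ds.RasterCount, "bands, projection:", ds.GetProjection()[:40], "...")
```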

Dividing Data into Test Data and Training Data:

The reason for dividing data into training and testing sets is to provide independent validation and prevent overfitting of model parameters during training. While this may seem straightforward for computer vision, geographic and temporal correlations are unavoidable in satellite imagery. For our experiments, we wrote Python scripts to randomly divide labeled sub-regions of satellite imagery into either a testing set or a training set. The ML algorithm optimizes only on the training set; the testing set provides a measure of overfitting to the training data. For additional validation, we reserved separate imagery to validate the trained classifiers.
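A minimal sketch of that kind of splitting script is shown below: each labeled sub-region (chip) is randomly assigned to the training or test set. The 80/20 split ratio and directory layout are hypothetical choices for illustration.

```python
# Hedged sketch: randomly split labeled chips into training and test sets.
import os
import random
import shutil

random.seed(0)                          # reproducible split
os.makedirs("train", exist_ok=True)
os.makedirs("test", exist_ok=True)

chips = sorted(os.listdir("chips"))     # hypothetical directory of labeled chips
random.shuffle(chips)
cut = int(0.8 * len(chips))             # illustrative 80/20 split

for name in chips[:cut]:
    shutil.copy(os.path.join("chips", name), os.path.join("train", name))
for name in chips[cut:]:
    shutil.copy(os.path.join("chips", name), os.path.join("test", name))
```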

Synthesize New Labeled Imagery:

When there is a significant difference between a trained classifier's accuracy on the test set and its accuracy on the training set, overfitting is the likely culprit. Increasing the amount of labeled data helps mitigate overfitting, but for satellite imagery the objects of interest may be limited in number or expensive to find and label. Using symmetries in the imagery, one can simulate new images from the initial training set; common symmetries include image rotations, added noise, changes in lighting, and rescaling. Generating new images with these symmetries introduces correlations into the training data, but the benefits usually outweigh the newly introduced correlations.
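The sketch below shows one way such symmetries might be applied to a single chip; it assumes an 8-bit, already pre-processed chip, and the parameter values are illustrative only. Note that the corresponding label geometries would need the same geometric transforms applied.

```python
# Hedged sketch: synthesize new chips via rotation, flip, noise, lighting, and scale.
import cv2
import numpy as np

def augment(chip):
    """Yield transformed copies of a single 8-bit image chip."""
    yield cv2.rotate(chip, cv2.ROTATE_90_CLOCKWISE)                 # rotation
    yield cv2.flip(chip, 1)                                         # horizontal flip
    noisy = chip.astype(np.float32) + np.random.normal(0, 5, chip.shape)
    yield np.clip(noisy, 0, 255).astype(chip.dtype)                 # added noise
    yield cv2.convertScaleAbs(chip, alpha=1.2, beta=10)             # lighting change
    yield cv2.resize(chip, None, fx=0.8, fy=0.8)                    # rescaling
```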

Import Parameters:

In deep learning, one can reduce training time and data requirements by importing parameters from previously trained neural networks. Even if the applications are different, the features represented by the lower layers of a trained neural network are often sufficient for training a new classifier. AlexNet, GoogLeNet, and ResNet are past ImageNet winners whose network architectures and trained weights are publicly available. While these networks have not been optimized for satellite imagery, we have used their pre-trained weights as starting points for the training process.
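In Caffe's Python interface, importing parameters can look roughly like the sketch below; the prototxt and caffemodel file names are hypothetical (e.g., AlexNet weights downloaded from the Caffe model zoo), and weights are copied only for layers whose names match the new network definition.

```python
# Hedged sketch: fine-tune a new network from pre-trained weights with pycaffe.
import caffe

caffe.set_mode_gpu()
solver = caffe.SGDSolver("solver.prototxt")        # new network architecture and training schedule
solver.net.copy_from("bvlc_alexnet.caffemodel")    # import weights for layers with matching names
solver.solve()                                     # fine-tune on the new labeled data
```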

Training and Testing:

The computationally intensive part of ML is the optimization of the parameters of the classifier. Neural networks commonly have millions of parameters and, in practice, can only be optimized with specialized hardware, appropriate software libraries, and good starting values for the parameters.

Hardware options for deep learning include:

· Specially designed, energy-efficient ASICs;

· Reprogrammable FPGAs;

· Highly parallelized GPUs; and

· CPUs with large memory support.

We used high-end consumer hardware to train algorithms for image classification and object detection; our computational server is the NVIDIA DevBox with four Maxwell-generation GeForce Titan X GPUs. The advantages of different hardware choices depend on the application, the available budget, and the desired training time.

There are several software frameworks for deep learning; most support parallelization on NVIDIA GPUs. Initially, we chose to work with the deep learning framework Caffe because of its Python support and access to pre-trained networks. For programmers comfortable with Python or C++, TensorFlow is a well-documented framework with a growing developer base. We generally design new network architectures in TensorFlow but use Caffe to fine-tune pre-trained networks. NVIDIA DIGITS is a polished front end for labeled-data management and Caffe-based model training.
