Solaris Model Deployment: From Start to Finish

Demystifying geospatial deep learning with In-Q-Tel CosmiQ Works’ Solaris

Roshan Ram
The DownLinQ
10 min read · Aug 14, 2020

--

Preface: SpaceNet LLC is a nonprofit organization dedicated to accelerating open source, artificial intelligence applied research for geospatial applications, specifically foundational mapping (i.e., building footprint & road network detection). SpaceNet is run in collaboration by co-founder and managing partner CosmiQ Works, co-founder and co-chair Maxar Technologies, and our partners Amazon Web Services (AWS), Capella Space, Topcoder, IEEE GRSS, the National Geospatial-Intelligence Agency and Planet.

Solaris Logo and Description

What’s that? You want to learn how to efficiently pre-process your imagery, create your geospatial computer vision models, run those models seamlessly, and score them? You’ve come to the right place — meet Solaris, new and improved.

“Performing machine learning (ML) and analyzing geospatial data are both hard problems requiring a lot of domain expertise. These limitations have historically meant that one needs to be an expert in both to perform even the most basic analyses, making advances in AI for overhead imagery difficult to achieve. We at CosmiQ Works have asked ourselves: is there anything we can do to reduce this barrier to entry, making it easier to apply machine learning methods to overhead imagery data?” ~Nick Weir

See Nick Weir’s article on what Solaris is for more details.

In this article, we’ll take a dip into the major Solaris functions and their uses, and break down how all of these functions work together to create a deep-learning model deployment pipeline, from start to finish. We’ll use the SpaceNet 4 dataset to provide some context for the functions we dive into.

For more details and insight into the code and functions, be sure to check out my accompanying video, Solaris Model Deployment, on In-Q-Tel’s YouTube channel. The video tutorial presents the SpaceNet 2 challenge as a use case, featuring the code of top-ranking competitor XD_XD, to show how far we’ve come in computer vision and with Solaris. At each stage of this article, you can find the corresponding timestamps in bold where the content is covered in the YouTube video.

Introduction

YouTube Timestamp for Introduction: 0:36

We first note that there are two different ways to approach Solaris functions and utilities:

  1. Using the Command Line Interface (CLI) via the Command Prompt/Shell on a PC, or the Terminal on a Mac. This approach is most useful when dealing with large amounts of data and trades flexibility for ease of use. It is recommended if you are looking for a cookie-cutter approach, or plan to stick primarily with Solaris operations.
  2. Using the Solaris Python API in a Jupyter Notebook, an interactive Python environment. This approach is recommended if you have experience with scripting and coding in Python, and trades ease of use for flexibility.
Solaris Overview from Documentation

This article will detail how to, from start to finish, deploy a top-scoring geospatial deep-learning model within minutes utilizing the Solaris pipeline.

YouTube Timestamp for Solaris Overview: 0:52

Taken from Roshan Ram’s YouTube video on Solaris Model Deployment

We will go through each step, in detail:

  • Data Collection and Organization [obtaining and organizing data for the task at hand]
  • Mask/Label Creation [this step is necessary only if masks are not already present in the collected data. Here, we’ll create masks to overlay on our images, to help us with our segmentation task at hand]
  • Training Data Prep [we’ll create a CSV file with all the necessary details about the training data to feed into Solaris]
  • Testing Data Prep [we’ll create a CSV file with all the necessary details about the testing data and where we want our output inference to end up, to feed into Solaris]
  • Model Creation [we’ll create/configure our Deep Learning model in a simple configuration format known as a YAML file]
  • Model Execution [finally, the fun stuff. As mentioned above, we can choose to use either the Solaris Python API or the native Command Line Interface (CLI) to execute our training/testing process. Sit back, relax, and let Solaris do all the work as we kick off cutting-edge algorithms in fewer lines of code than there are in this paragraph.]

Note: for more granular details on any of the below functions, parameters, or code: see the Solaris documentation and the Solaris GitHub repository.

Data Collection and Organization

YouTube Timestamp: 7:11

  • First, open your Terminal and ensure that the AWS command line interface is installed, or install it. Then configure your AWS account credentials. The details for these steps can be found in the AWS documentation, here.
  • Create a folder, then note down the path. This could be something like /Users/johndoe/Desktop/sn4.
  • Obtain the training data by running the following command:
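As a sketch, assuming the standard SpaceNet S3 bucket layout, the download looks like the following. The exact S3 key is illustrative, so verify the current tarball paths on the SpaceNet website before running it:

```shell
# Move into the folder you created above.
cd /Users/johndoe/Desktop/sn4

# Pull a SpaceNet 4 training tarball from the public SpaceNet bucket.
# The key below is a placeholder; check the SpaceNet site for current paths.
aws s3 cp s3://spacenet-dataset/spacenet/SN4_buildings/tarballs/SN4_buildings_train_sample.tar.gz .

# Unpack it.
tar -xzf SN4_buildings_train_sample.tar.gz
```

This requires the AWS CLI to be installed and configured with valid credentials, as described in the previous step.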

Solaris Installation

  • Installation with Docker using the Dockerfile in solaris/solaris/docker is recommended, but Conda will work.
  • Set up as instructed here: https://github.com/CosmiQ/solaris
  • Once set up, make sure you run the following command to activate your environment using Conda:
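For the Conda route, the setup and activation steps look roughly like this. The environment name `solaris` comes from the repo’s environment.yml; confirm it there:

```shell
# One-time setup, following the README at https://github.com/CosmiQ/solaris
git clone https://github.com/CosmiQ/solaris.git
cd solaris
conda env create -f environment.yml
conda activate solaris
pip install .
```

In every new terminal session afterwards, just reactivate with `conda activate solaris`.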

Create masks for your training data

YouTube Timestamp for Mask Creation: 7:46

  • Create a CSV file with the following columns, either by using a self-created script, or manually:

[contains the paths to vector-formatted label files that you wish to transform to masks (the geojson files)]

[contains the paths to training images that correspond to the same geographies as the vector labels you’ve been using (the tif files)]

[the path to save the output masks to. The label and image entries in each row must cover matching geographies, or you’ll get empty masks]

Masks CSV Example Excerpt:
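As an illustration, this CSV can be built with a few lines of Python. The column headers used here (`source_file`, `reference_image`, `output_path`) mirror the Solaris CLI argument names, but verify the exact required headers against the Solaris CLI documentation; all file paths are placeholders:

```python
import csv

# Hypothetical paths for illustration; substitute your own files.
# Each row pairs a geojson label file with the image covering the
# same geography, plus the destination path for the generated mask.
rows = [
    {
        "source_file": "/Users/johndoe/Desktop/sn4/geojson/building_img482.geojson",
        "reference_image": "/Users/johndoe/Desktop/sn4/images/RGB_img482.tif",
        "output_path": "/Users/johndoe/Desktop/sn4/masks/mask_img482.tif",
    },
]

with open("mask_args.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["source_file", "reference_image", "output_path"])
    writer.writeheader()
    writer.writerows(rows)
```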

  • Now, run the following command:

  • Make sure to replace argument_csv with the Mask CSV you just created
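Assuming the command-line tool names from the Solaris repository, that invocation looks something like this, where `mask_args.csv` stands in for the Mask CSV you just created (run `make_masks --help` to confirm the flags):

```shell
# make_masks is one of the CLI tools installed with Solaris; it reads each
# row of the argument CSV and writes one training mask per image/label pair.
make_masks --argument_csv mask_args.csv
```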
  • Next, create a reference file for all your training and testing data

Training Data

YouTube Timestamp for Training/Testing Data CSVs: 11:24

  • Create a CSV with the following columns:

[Define the paths to each image file to be used during training, one path per row]

[Define the paths to each mask corresponding to the image file to be used in training]

  • Note that the image and label in each row must correspond to the same geography

You can find more details here.

The training data CSV should look something like this:

Left Column:

Left Column of Sample Training Data CSV

Right Column:

Right Column of Sample Training Data CSV

Testing Data

YouTube Timestamp for Training/Testing Data CSVs: 11:24

  • Create a CSV with the following columns:

[Define the paths to each image file to be tested on during inference, one path per row]

The testing data CSV will look something like this:

Sample of Testing Data CSV

Creating the CSVs

YouTube Timestamp for CSV Creation: 11:45

First, ensure that your Solaris environment is activated. Then, open up a Jupyter Notebook.

Now, we can use the sol.utils.data.make_dataset_csv function to effortlessly create the CSV file with the data we need.

Let’s take a look at the make_dataset_csv function:

In essence, we have to pass in what’s called a regular expression pattern, or RegEx for short. This will allow us to group file names by a common substring. For example, if an image file has the following path name:

and the corresponding label/mask file has the following path name:

Then we would need to extract something that these image and label paths have in common: the image number. In this case, we can create a regular expression to select the number 482, which comes after “img” and before “.tif”. See here for more information on how to use RegEx.
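A minimal sketch of that pattern, using hypothetical file names, shows how the captured number pairs an image with its label; this is the kind of pattern you would hand to `make_dataset_csv`:

```python
import re

# Hypothetical paths for illustration only.
image_path = "/Users/johndoe/Desktop/sn4/images/RGB_img482.tif"
label_path = "/Users/johndoe/Desktop/sn4/masks/mask_img482.tif"

# Capture the digits that come after "img" and before ".tif".
pattern = re.compile(r"img(\d+)\.tif")

image_id = pattern.search(image_path).group(1)
label_id = pattern.search(label_path).group(1)

print(image_id, label_id)  # both "482", so these two files pair up
```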

Model Creation/YAML File Configuration:

Model Architecture YouTube Timestamp: 4:43

YAML File Configuration YouTube Timestamp: 14:40

  • Configure the model using a pre-trained model:

Take a look at xdxd_spacenet4.yml from here for an example of what a complete model configuration YAML file should look like. There are a multitude of parameters that can be tweaked, and detailed explanations of each can be found here, on the Solaris GitHub repository. However, the main ones that must be changed are the following:

Set this to the path to your training data CSV from before.

Set this to the path to your testing data CSV from before.

This is where you want model checkpoints to save. Model checkpoints are basically snapshots of your model at a particular epoch, or iteration of the training process.

This is where you want your final model to save. This final model will, by default, be the model with the lowest validation loss throughout your training process.

Where you want the inferred final output images to save. This is the output of your model’s hard work — the result of running the trained model on the test images that you specified in the Testing Data CSV.

Tweak the batch size and number of epochs, respectively, as desired for your machine under the following parameters.

  • More details on YAML File Configuration from the docs.
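Pulling the pieces above together, the relevant entries of the config look roughly like the fragment below. The key names and nesting follow the xdxd_spacenet4.yml example as best I can reconstruct them, so cross-check against the actual file; all paths are placeholders:

```yaml
# Fragment only: start from xdxd_spacenet4.yml and edit these keys.
batch_size: 12

training_data_csv: '/Users/johndoe/Desktop/sn4/train_data.csv'
inference_data_csv: '/Users/johndoe/Desktop/sn4/test_data.csv'

training:
  epochs: 60
  # Final model: by default, the checkpoint with the lowest validation loss.
  model_dest_path: '/Users/johndoe/Desktop/sn4/xdxd_final.pth'
  callbacks:
    model_checkpoint:
      # Where per-epoch snapshots of the model are saved.
      filepath: '/Users/johndoe/Desktop/sn4/checkpoints/xdxd_best.pth'
      monitor: 'val_loss'

inference:
  # Where the inferred output images are written.
  output_dir: '/Users/johndoe/Desktop/sn4/inference_out/'
```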

Model Training and Inference

YouTube Timestamp for Model Training and Inference: 18:53

Once you have created your model and configured your YAML file, run the following line in your solaris environment, replacing path/to/xdxd_spacenet4.yml with your true path to the YAML file.
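If you take the Python API route, the whole train-then-infer sequence is only a few lines; this mirrors the pattern in the Solaris tutorials, with the config path as the placeholder from above:

```python
import solaris as sol

# Parse the YAML config into the structure Solaris expects.
config = sol.utils.config.parse('path/to/xdxd_spacenet4.yml')

# Train: reads the training data CSV and saves checkpoints and the
# final model to the paths specified in the config.
trainer = sol.nets.train.Trainer(config)
trainer.train()

# Infer: runs the trained model over the images in the testing data CSV
# and writes outputs to the configured inference output directory.
inf_df = sol.nets.infer.get_infer_df(config)
inferer = sol.nets.infer.Inferer(config)
inferer(inf_df)
```

If you prefer the terminal, Solaris also installs a `solaris_run_ml` entry point that takes the same YAML config; run `solaris_run_ml --help` to confirm its usage.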

Model Scoring

YouTube Timestamp for Analyzing model performance: 20:54

  • Finally, the model can be scored using the ground truth CSV (which, for SpaceNet 4 data, is not available publicly)
  • First, you will need to create a proposal CSV from your output data
  • This can be done using the Solaris Python API and pandas
  • Run the following line, replacing the paths for the proposal_csv, truth_csv, and output_file with your own
  • Read more details about the proposal_csv and scoring at the respective Solaris docs.
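A scoring sketch using the Solaris Python API and pandas follows. `off_nadir_buildings` is the SpaceNet 4 challenge scorer in `solaris.eval.challenges`; check the eval docs for its exact signature and return values, and note that every path here is a placeholder:

```python
import solaris as sol

# Placeholder paths; substitute your own files.
proposal_csv = 'path/to/proposal.csv'   # built from your model's output polygons
truth_csv = 'path/to/ground_truth.csv'  # SpaceNet 4 ground truth (not public)
output_file = 'path/to/scores.csv'

# Score the proposals against ground truth with the SpaceNet 4 scorer.
results_df, results_df_full = sol.eval.challenges.off_nadir_buildings(
    proposal_csv, truth_csv)

# Save the detailed per-image breakdown for later inspection, and print
# the summary precision/recall/F1 numbers.
results_df_full.to_csv(output_file, index=False)
print(results_df)
```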

Want to try it yourself? Grab some data from the SpaceNet website and follow along!

Follow The DownLinQ on Medium for updates on blogs like this one, and connect with me on LinkedIn if you’d like to talk about machine learning.

Acknowledgements

Thanks to Daniel Hogan and Ryan Lewis for their mentorship, and the rest of the CosmiQ Works team for their feedback and guidance!
