Solaris Model Deployment: From Start to Finish
Demystifying geospatial deep learning with In-Q-Tel CosmiQ Works’ Solaris
Preface: SpaceNet LLC is a nonprofit organization dedicated to accelerating open source, artificial intelligence applied research for geospatial applications, specifically foundational mapping (i.e., building footprint & road network detection). SpaceNet is run in collaboration by co-founder and managing partner CosmiQ Works, co-founder and co-chair Maxar Technologies, and our partners Amazon Web Services (AWS), Capella Space, Topcoder, IEEE GRSS, the National Geospatial-Intelligence Agency and Planet.
What’s that? You want to learn how to efficiently pre-process your imagery, create your geospatial computer vision models, run those models seamlessly, and score them? You’ve come to the right place — meet Solaris, new and improved.
“Performing machine learning (ML) and analyzing geospatial data are both hard problems requiring a lot of domain expertise. These limitations have historically meant that one needs to be an expert in both to perform even the most basic analyses, making advances in AI for overhead imagery difficult to achieve. We at CosmiQ Works have asked ourselves: is there anything we can do to reduce this barrier to entry, making it easier to apply machine learning methods to overhead imagery data?” ~Nick Weir
In this article, we’ll take a dip into the major Solaris functions and their uses, and break down how all of these functions work together to create a deep-learning model deployment pipeline, from start to finish. We’ll use the SpaceNet 4 dataset to provide some context for the functions we dive into.
For more detail on the code and functions, be sure to check out my accompanying video, Solaris Model Deployment, on In-Q-Tel’s YouTube channel. The video uses the SpaceNet 2 challenge as its use case, working through the code of top-ranking competitor XD_XD to show how far computer vision and Solaris have come. At each stage of this article, you can find the timestamp where the same content is covered in the video.
We first note that there are 2 different ways to approach Solaris functions and utilities:
- Using the Command Line Interface (CLI), from the Command Prompt/Shell if on a PC, or Terminal if on a Mac. This approach trades flexibility for ease of use and is most useful when dealing with large amounts of data. It is recommended if you want a cookie-cutter workflow or plan to stick primarily with Solaris operations.
- Using the Solaris Python API, in a Jupyter Notebook — an interactive Python environment. This approach is recommended if you have experience with scripting and coding in Python, and trades ease of use for flexibility.
This article details how to deploy a top-scoring geospatial deep-learning model, from start to finish, within minutes using the Solaris pipeline.
We will go through each step, in detail:
- Data Collection and Organization [obtaining and organizing data for the task at hand]
- Mask/Label Creation [this step is necessary only if masks are not already present in the collected data. Here, we’ll create masks to overlay on our images, to help us with the segmentation task at hand]
- Training Data Prep [we’ll create a CSV file with all the necessary details about the training data to feed into Solaris]
- Testing Data Prep [we’ll create a CSV file with all the necessary details about the testing data and where we want our output inference to end up, to feed into Solaris]
- Model Creation [we’ll create/configure our Deep Learning model in a very simply formatted file — what’s known as a YAML file]
- Model Execution [finally, the fun stuff. As mentioned above, we can choose to use either the Solaris Python API or the native Command Line Interface (CLI) to execute our training/testing process. Sit back, relax, and let Solaris do all the work as we kick off cutting-edge algorithms in fewer lines of code than there are in this paragraph.]
Data Collection and Organization
- First, open your Terminal and ensure that the AWS command line interface is installed, or install it. Then configure your AWS account credentials. The details for these steps can be found on the AWS documentation, here.
- Create a folder, then note down the path. This could be something like /Users/johndoe/Desktop/sn4.
- Obtain the training data by running the following command:
aws s3 cp s3://spacenet-dataset/Spacenet_Off-Nadir_Dataset/ /Users/johndoe/Desktop/sn4 --recursive
- Install Solaris. Installation with Docker, using the Dockerfile in solaris/solaris/docker, is recommended, but Conda will also work.
- Set up as instructed here: https://github.com/CosmiQ/solaris
- Once set up, activate your environment using Conda by running the following command:
source activate solaris
Create masks for your training data
YouTube Timestamp for Mask Creation: 7:46
- Create a CSV file with the following columns, either by using a self-created script, or manually:
[contains the paths to vector-formatted label files that you wish to transform to masks (the geojson files)]
[contains the paths to training images that correspond to the same geographies as the vector labels you’ve been using (the tif files)]
[contains the paths to save the output masks to. The label file and image in each row must cover the same geography, or you’ll get empty masks]
Masks CSV Example Excerpt:
- Now, run the following command:
make_masks -t --batch --argument_csv mask_reference.csv --footprint
- Make sure to replace mask_reference.csv with the path to the mask CSV you just created.
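If you prefer a script to building the mask CSV by hand, the pairing step could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the Solaris tooling itself: the column names (`source_file`, `reference_image`, `output_path`) are assumed from the arguments of the `make_masks` CLI and should be checked against the Solaris documentation, and the function name, directory layout, and "same file stem" matching rule are hypothetical.

```python
import csv
from pathlib import Path

# Assumed column names; verify them against the Solaris CLI documentation.
FIELDNAMES = ["source_file", "reference_image", "output_path"]


def write_mask_reference_csv(label_dir, image_dir, mask_dir, csv_path):
    """Write one row per GeoJSON label that has an image with the same file stem.

    Hypothetical helper: real SpaceNet file names may need a different
    matching rule than "identical stem".
    """
    label_dir, image_dir, mask_dir = Path(label_dir), Path(image_dir), Path(mask_dir)
    rows = []
    for label_path in sorted(label_dir.glob("*.geojson")):
        image_path = image_dir / f"{label_path.stem}.tif"
        if not image_path.exists():
            # Skip labels without a matching image; mismatched rows give empty masks.
            continue
        rows.append({
            "source_file": str(label_path),
            "reference_image": str(image_path),
            "output_path": str(mask_dir / f"{label_path.stem}_mask.tif"),
        })
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling `write_mask_reference_csv("labels", "images", "masks", "mask_reference.csv")` would produce a file that can then be passed to the `--argument_csv` option shown above.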
Create a reference file for all your training data and testing data
YouTube Timestamp for Training/Testing Data CSVs: 11:24
- Create a training data CSV with the following columns:
image [the path to each image file to be used during training, one path per row]
label [the path to the mask corresponding to the image file in the same row]
- NOTE THAT EACH IMAGE AND LABEL IN EACH ROW MUST MATCH
You can find more details here.
The training data CSV should look something like this:
- Create a testing data CSV with the following column:
image [the path to each image file to be tested on during inference, one path per row]
This will look something like this:
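Because a mismatched or missing path only shows up later as a failed or meaningless training run, it can be worth sanity-checking these CSVs first. The following is a minimal sketch using only the Python standard library. The column names `image` and `label` follow the `make_dataset_csv` docstring quoted in this article; the function name `check_dataset_csv` is hypothetical and not part of Solaris.

```python
import csv
from pathlib import Path


def check_dataset_csv(csv_path, stage="train"):
    """Return a list of human-readable problems found in a dataset CSV.

    For stage "infer" only the "image" column is required; otherwise both
    "image" and "label" are required. An empty list means no problems found.
    """
    required = ["image"] if stage == "infer" else ["image", "label"]
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [col for col in required if col not in (reader.fieldnames or [])]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        # Line 1 is the header, so data rows start at line 2.
        for line_no, row in enumerate(reader, start=2):
            for col in required:
                value = row[col] or ""
                if not Path(value).is_file():
                    problems.append(f"line {line_no}: {col} file not found: {value}")
    return problems
```

This only checks that the files exist. It cannot tell whether an image and its mask cover the same geography, so the rule that each row must match still has to be enforced when the CSV is built.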
Creating the CSVs
YouTube Timestamp for CSV Creation: 11:45
First, ensure that your Solaris environment is activated. Then, open up a Jupyter Notebook.
import solaris as sol
Now, we can use the sol.utils.data.make_dataset_csv function to effortlessly create the CSV file with the data we need.
Let’s take a look at the make_dataset_csv function:
def make_dataset_csv(im_dir, im_ext='tif', label_dir=None, label_ext='json',
                     output_path='dataset.csv', stage='train', match_re=None,
                     recursive=False, ignore_mismatch=None, verbose=0):
    """Automatically generate dataset CSVs for training.

    This function creates basic CSVs for training and inference automatically.
    See `the documentation tutorials <https://solaris.readthedocs.io/en/latest/tutorials/notebooks/creating_im_reference_csvs.html>`_
    for details on the specification. A regular expression string can be
    provided to extract substrings for matching images to labels; if not
    provided, it's assumed that the filename for the image and label files is
    identical once extensions are stripped. By default, this function will
    raise an exception if there are multiple label files that match to a given
    image file, or if no label file matches an image file; see the
    `ignore_mismatch` argument for alternatives.

    Arguments
    ---------
    im_dir : str
        The path to the directory containing images to be used by your model.
        Images in sub-directories can be included by setting
        ``recursive=True``.
    im_ext : str, optional
        The file extension used by your images. Defaults to ``"tif"``. Not case
        sensitive.
    label_dir : str, optional
        The path to the directory containing images to be used by your model.
        Images in sub-directories can be included by setting
        ``recursive=True``. This argument is required if `stage` is
        ``"train"`` (default) or ``"val"``, but has no effect if `stage` is
        ``"infer"``.
    output_path : str, optional
        The path to save the generated CSV to. Defaults to ``"dataset.csv"``.
    stage : str, optional
        The stage that the csv is generated for. Can be ``"train"`` (default),
        ``"val"``, or ``"infer"``. If set to ``"train"`` or ``"val"``,
        `label_dir` must be provided or an error will occur.
    match_re : str, optional
        A regular expression pattern to extract substrings from image and
        label filenames for matching. If not provided and labels must be
        matched to images, it's assumed that image and label filenames are
        identical after stripping directory and extension. Has no effect if
        ``stage="infer"``. The pattern must contain at least one capture group
        for compatibility with :func:`pandas.Series.str.extract`.
    recursive : bool, optional
        Should sub-directories in `im_dir` and `label_dir` be traversed to
        find images and label files? Defaults to no (``False``).
    ignore_mismatch : str, optional
        Dictates how mismatches between image files and label files should be
        handled. By default, having != 1 label file per image file will raise
        a ``ValueError``. If ``ignore_mismatch="skip"``, any image files with
        != 1 matching label will be skipped.
    verbose : int, optional
        Verbose text output. By default, none is provided; if ``True`` or
        ``1``, information-level outputs are provided; if ``2``, extremely
        verbose text is output.

    Returns
    -------
    output_df : :class:`pandas.DataFrame`
        A :class:`pandas.DataFrame` with one column titled ``"image"`` and a
        second titled ``"label"`` (if ``stage != "infer"``). The function also
        saves a CSV at `output_path`.
    """
In essence, when the image and label filenames are not identical, we have to pass in what’s called a regular expression pattern, or RegEx for short, through the match_re argument. This lets us match files by a common substring. For example, if an image file has the following path name:
and the corresponding label/mask file has the following path name:
Then we would need to extract something that these image and label paths have in common: the image number. In this case, we see that it is 482, so we can create a regular expression that selects the number that comes after “img” and before “.tif”. See here for more information on how to use RegEx.
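To make the matching idea concrete, here is a small standalone sketch using Python's built-in `re` module. The file names are hypothetical (only the `img482` fragment comes from the example above), and the helper functions are illustrative rather than part of Solaris. The pattern contains one capture group, which the `make_dataset_csv` docstring says `match_re` requires.

```python
import re

# Hypothetical file names; only the "img482" fragment comes from the article.
image_name = "RGB_tile_img482.tif"
label_name = "mask_tile_img482.tif"

# One capture group: the digits after "img" and before ".tif".
MATCH_RE = r"img(\d+)\.tif$"


def match_key(filename, pattern=MATCH_RE):
    """Return the captured substring used to pair files, or None if no match."""
    found = re.search(pattern, filename)
    return found.group(1) if found else None


def pair_by_key(image_names, label_names, pattern=MATCH_RE):
    """Pair each image with the label whose captured key is identical."""
    labels_by_key = {}
    for name in label_names:
        key = match_key(name, pattern)
        if key is not None:
            labels_by_key[key] = name
    pairs = []
    for name in image_names:
        key = match_key(name, pattern)
        if key in labels_by_key:
            pairs.append((name, labels_by_key[key]))
    return pairs


print(match_key(image_name))                   # 482
print(pair_by_key([image_name], [label_name]))
```

Once a pattern behaves as expected on a few file names, the same string could be passed as the `match_re` argument of `sol.utils.data.make_dataset_csv`.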
Model Creation/YAML File Configuration:
- Configure the model using a pre-trained model:
Take a look at xdxd_spacenet4.yml from here for an example of what a complete model configuration YAML file should look like. Many parameters can be tweaked, and detailed explanations of each can be found here, on the Solaris GitHub repository. However, the main ones that must be changed are the following:
Set this to the path to your training data csv from before.
Set this to the path to your testing data csv from before.
This is where you want model checkpoints to save. Model checkpoints are basically snapshots of your model at a particular epoch, or iteration of the training process.
This is where you want your final model to save. This final model will, by default, be the model with the lowest validation loss throughout your training process.
This is where you want the inferred final output images to save. This is the output of your model’s hard work: the result of running the trained model on the test images that you specified in the Testing Data CSV.
Tweak the batch size and number of epochs, respectively, as desired for your machine under the following parameters
- More details on YAML file configuration can be found in the docs.
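The settings described above can be pictured as a nested mapping. The sketch below uses a plain Python dictionary to show which values you would typically override. The key names and nesting are assumptions recalled from the example xdxd_spacenet4.yml and must be checked against the actual file; the function name and all paths are hypothetical placeholders.

```python
import copy

# Skeleton of the settings discussed above. Key names and nesting are ASSUMED;
# verify them against xdxd_spacenet4.yml before relying on them.
BASE_CONFIG = {
    "batch_size": None,
    "training_data_csv": None,
    "inference_data_csv": None,
    "training": {
        "epochs": None,
        "callbacks": {"model_checkpoint": {"filepath": None}},
        "model_dest_path": None,
    },
    "inference": {"output_dir": None},
}


def customize_config(base, train_csv, infer_csv, checkpoint_path,
                     model_path, output_dir, batch_size, epochs):
    """Return a copy of `base` with the user-specific settings filled in."""
    cfg = copy.deepcopy(base)  # leave the template untouched
    cfg["training_data_csv"] = train_csv
    cfg["inference_data_csv"] = infer_csv
    cfg["training"]["callbacks"]["model_checkpoint"]["filepath"] = checkpoint_path
    cfg["training"]["model_dest_path"] = model_path
    cfg["inference"]["output_dir"] = output_dir
    cfg["batch_size"] = batch_size
    cfg["training"]["epochs"] = epochs
    return cfg


config = customize_config(
    BASE_CONFIG,
    train_csv="/path/to/train.csv",          # placeholder paths
    infer_csv="/path/to/test.csv",
    checkpoint_path="/path/to/checkpoint.pth",
    model_path="/path/to/final_model.pth",
    output_dir="/path/to/inference_out/",
    batch_size=4,
    epochs=10,
)
```

In practice you would edit the YAML file directly or load and save it with a YAML library such as PyYAML, which is not part of the standard library. The dictionary form is only meant to show which entries change.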
Model Training and Inference
YouTube Timestamp for Model Training and Inference: 18:53
Once you have created your model and configured your YAML file, run the following line in your solaris environment, replacing path/to/xdxd_spacenet4.yml with your true path to the YAML file.
solaris_run_ml -c path/to/xdxd_spacenet4.yml
- After inference, the output files will need to be converted to vector labels, which can be done with the Solaris Python API.
- YouTube Timestamp for Visualizing Inference Results: 20:07
- Finally, the model can be scored using the ground truth CSV (which, for SpaceNet 4 data, is not available publicly)
- First, you will need to create a proposal CSV from your output data
- This can be done using the Solaris Python API and pandas
- Run the following line, replacing the paths for the proposal_csv, truth_csv, and output_file with your own
spacenet_eval --proposal_csv /path/to/sample_preds_competition.csv --truth_csv /path/to/sample_truth_competition.csv --challenge 'off-nadir' --output_file /path/to/outputs.csv
- Read more details about the proposal_csv and scoring at the respective Solaris docs.
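To give a feel for what a proposal CSV contains, here is a rough sketch that writes one from polygons already expressed in pixel coordinates. It deliberately swaps in the standard-library `csv` module for the Solaris Python API and pandas mentioned above. The column names (`ImageId`, `BuildingId`, `PolygonWKT_Pix`, `Confidence`) are an assumption based on the SpaceNet building challenge format, and the helper names and input structure are hypothetical; check the Solaris scoring docs for the exact specification.

```python
import csv

# Assumed SpaceNet-style proposal columns; confirm against the Solaris docs.
FIELDS = ["ImageId", "BuildingId", "PolygonWKT_Pix", "Confidence"]


def ring_to_wkt(ring):
    """Format a list of (x, y) pixel coordinates as a closed WKT polygon."""
    ring = [tuple(point) for point in ring]
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # a WKT ring must end on its first vertex
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({coords}))"


def write_proposal_csv(predictions, csv_path):
    """Write proposals; `predictions` maps image ID -> list of (ring, confidence)."""
    n_rows = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for image_id, buildings in predictions.items():
            for building_id, (ring, confidence) in enumerate(buildings):
                writer.writerow({
                    "ImageId": image_id,
                    "BuildingId": building_id,
                    "PolygonWKT_Pix": ring_to_wkt(ring),
                    "Confidence": confidence,
                })
                n_rows += 1
    return n_rows
```

The polygons themselves would come from the vector-conversion step described earlier. This sketch only covers the final formatting step before running `spacenet_eval`.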
- Download the open-source Solaris package and accelerate your deep learning workflow:
- A Solaris Tutorial Notebook
- Solaris Tutorials from Documentation
- Our new Solaris preprocessing library blog and video tutorial
- SpaceNet 2 Challenge Results Blog
- SpaceNet Challenges
- SpaceNet 4 Winners’ Code
- SpaceNet 2 Winners’ Code