Raster Vision: A Geospatial Deep Learning Framework

Raster Vision Team · Published in PyTorch · Dec 7, 2021

Authors: Adeel Hassan and Lewis Fishgold

The problem

A satellite image is more than its pixels — it is also its location. Typically encoded as a GeoTIFF, such an image will also have georeferencing metadata — such as coordinates, a coordinate system, and a projection transform — that defines a mapping from pixel-based coordinates (i.e., row and column indices) to positions on the Earth’s surface (i.e., latitude and longitude). The same holds true for any annotations we might create for such an image — these might take the form of GeoJSON files with vector annotations (e.g., polygons), or be GeoTIFFs themselves. With the right tools, we can extract from these files a correctly transformed raster image and a corresponding label that we can happily feed into our computer vision models; but ultimately, to be useful in the real world, any insights gained from these models must also be mapped back to geographical locations. What use is detecting a wildfire if we don’t know where it is?

The differences between standard computer vision datasets and remote sensing datasets do not end here. Another complication is that these images tend to be too large to feed directly into a neural network and must first be broken up into smaller “chips”. You can see even more differences in the table below.

Handling these differences is not a trivial matter and often acts as a barrier to entry for both computer vision researchers wishing to make an impact in the field of remote sensing, and conversely, domain experts who are new to deep learning. But what if all of that was taken care of behind the scenes and everything just worked?

Enter Raster Vision — an open-source computer vision framework developed by Azavea.

Azavea is a geospatial software design and development company based in Philadelphia. As a certified B Corporation, our mission is to apply geospatial technology for positive civic, social, and environmental impact and to advance the state-of-the-art through research.

Raster Vision to the rescue

Raster Vision knows how to handle geospatial data and will do it for you. It will rasterize and vectorize, download and upload, analyze and normalize, chip and clip, concat and extract, and, generally speaking, do whatever it takes to ensure that the data arrives in the right shape at the right spot. Under the hood, Raster Vision makes extensive use of GDAL, Rasterio, Shapely, and, of course, NumPy to accomplish this. The figures below show some of Raster Vision’s extraordinary data processing powers.

Raster Vision’s label inference capability allows you to convert vector annotations (left) to chip classification annotations (middle), while its rasterization capability allows you to convert them to semantic segmentation masks (right).
Using the extent cropping functionality, you can split a single scene into spatially disjoint training, validation and test sets.

Raster Vision can also train deep learning models. It offers fully implemented training pipelines for the computer vision tasks of chip classification, object detection, and semantic segmentation right out of the box. The models, loss functions, and optimizers are based on PyTorch and TorchVision and are highly configurable. Originally, Raster Vision used TensorFlow, but we switched to PyTorch because it made it easier to implement and debug custom models and loss functions. It also simplified the codebase by providing a single standard library covering the three different computer vision tasks.

Zooming out, we can summarize Raster Vision as a framework that enables developers to quickly and repeatably configure pipelines that go through the core components of a machine learning workflow: analyzing and pre-processing training data, training models, creating predictions, evaluating models, and bundling the model files and configuration for easy deployment. The entire Raster Vision pipeline looks like so:
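In plain text, these stages correspond roughly to the commands the pipeline runs in sequence:

analyze -> chip -> train -> predict -> eval -> bundle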

(More details, including installation instructions, can be found in the official documentation.)

So how do we harness all this power?

Raster Vision in action

Getting started with a basic semantic segmentation example

Let’s apply Raster Vision to a semantic segmentation problem. We will start with a minimal example and then explore some more advanced features. We will use the ISPRS Potsdam Semantic Segmentation dataset, which contains six classes: car, building, low vegetation, tree, impervious, and clutter. The labels are distributed as RGB GeoTIFF files with a different color for each class, which can be seen below. The full dataset comprises 38 scenes, but here, for simplicity, we will only use two.

A sample training image (left) which is 6000x6000 pixels and annotations (right) for the six classes: car (yellow), building (blue), low vegetation (cyan), tree (green), impervious (white), and clutter (red). Scene image and annotations source: ISPRS Potsdam Semantic Segmentation dataset.

Before running a pipeline, we need to configure it by writing a Python file that has a get_config function that returns a PipelineConfig object. Below is a bare-bones config that uses a single scene for training and another one for validation. For a fuller example, see the isprs_potsdam.py example in the repo.
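Roughly, such a config file might look like the sketch below. The class and field names reflect the 0.13-era API as we recall it, and the scene IDs and URIs are placeholders, so treat this as illustrative rather than copy-paste ready; the isprs_potsdam.py example is the authoritative reference.

from rastervision.core.rv_pipeline import SemanticSegmentationConfig
from rastervision.core.data import (
    ClassConfig, DatasetConfig, RasterioSourceConfig, SceneConfig,
    SemanticSegmentationLabelSourceConfig)
from rastervision.pytorch_backend import PyTorchSemanticSegmentationConfig
from rastervision.pytorch_learner import (
    Backbone, SemanticSegmentationModelConfig, SolverConfig)


def get_config(runner, root_uri: str) -> SemanticSegmentationConfig:
    # The six Potsdam classes and the colors used in the RGB label GeoTIFFs.
    class_config = ClassConfig(
        names=['Car', 'Building', 'Low Vegetation', 'Tree', 'Impervious', 'Clutter'],
        colors=['yellow', 'blue', 'cyan', 'green', 'white', 'red'])

    def make_scene(scene_id: str, image_uri: str, label_uri: str) -> SceneConfig:
        # Read the RGB bands of the image.
        raster_source = RasterioSourceConfig(
            uris=[image_uri], channel_order=[0, 1, 2])
        # The labels are RGB rasters, so the label source needs the class
        # config in order to map colors back to class IDs.
        label_source = SemanticSegmentationLabelSourceConfig(
            rgb_class_config=class_config,
            raster_source=RasterioSourceConfig(uris=[label_uri]))
        return SceneConfig(
            id=scene_id, raster_source=raster_source, label_source=label_source)

    # One scene for training, one for validation (placeholder URIs).
    dataset = DatasetConfig(
        class_config=class_config,
        train_scenes=[make_scene('2_10', 'img_2_10.tif', 'labels_2_10.tif')],
        validation_scenes=[make_scene('6_12', 'img_6_12.tif', 'labels_6_12.tif')])

    # A PyTorch backend with a ResNet-backed semantic segmentation model.
    backend = PyTorchSemanticSegmentationConfig(
        model=SemanticSegmentationModelConfig(backbone=Backbone.resnet50),
        solver=SolverConfig(lr=1e-4, num_epochs=10, batch_sz=8))

    return SemanticSegmentationConfig(
        root_uri=root_uri,
        dataset=dataset,
        backend=backend,
        train_chip_sz=300,
        predict_chip_sz=300)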

After this, we can use the rastervision run CLI command to run the pipeline.

rastervision run local "./example.py" -a root_uri "./output/"

The output will be written to the ./output directory, and will include training logs, debug visualizations, model weights, predictions for each validation scene, evaluation metrics, and a model bundle for future deployment. After Raster Vision is done running, the full directory tree will look like so:
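Assuming default naming, the tree is organized with roughly one subdirectory per pipeline stage, something like:

output/
├── analyze/   (dataset statistics)
├── chip/      (training chips, if a chipping stage is run)
├── train/     (training logs, checkpoints, debug visualizations)
├── predict/   (predictions per validation scene, e.g. predict/6_12/labels.tif)
├── eval/      (evaluation metrics as JSON)
└── bundle/    (the model bundle for deployment)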

We can now examine this basic model’s predictions for the validation scene (predict/6_12/labels.tif) to see how well it does. If we load the predictions in GIS software like QGIS, we will see that they are geographically “located” at the same place as the input.

A scene and predictions made by the model in RGB raster format overlaid on a basemap in QGIS. Note that this is the output of a model trained on one scene only; a model trained on the full dataset would produce much crisper results. Basemap source: OpenStreetMap. Scene image source: ISPRS Potsdam Semantic Segmentation dataset.

Predictions as vectors and probability maps

In addition to the RGB output, we can obtain vector output (polygons) as well as a full probability map for each of the classes by passing some additional arguments to the label_store. The smooth_as_uint8 option quantizes the floating-point probability values to 256 levels and saves them as bytes to save space.
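Sketching this out (the field names here follow our recollection of the 0.13-era API, and the class IDs simply follow the order of the class config above, so double-check both against the docs):

from rastervision.core.data import (
    PolygonVectorOutputConfig, SemanticSegmentationLabelStoreConfig)

label_store = SemanticSegmentationLabelStoreConfig(
    # Also write per-class probability maps, not just the argmax raster...
    smooth_output=True,
    # ...quantized to 256 levels (uint8) to keep the files small.
    smooth_as_uint8=True,
    # Polygonize predictions for selected classes into vector (GeoJSON) output.
    vector_output=[
        PolygonVectorOutputConfig(class_id=0),  # cars
        PolygonVectorOutputConfig(class_id=1),  # buildings
        PolygonVectorOutputConfig(class_id=3),  # trees
    ])

# Attach the label store to the validation scene from the earlier sketch.
scene = SceneConfig(
    id='6_12',
    raster_source=raster_source,
    label_source=label_source,
    label_store=label_store)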

Predictions for buildings, cars, and trees after conversion to vector format. Note that this is the output of a model trained on one scene only; a model trained on the full dataset would produce much crisper results. Basemap source: OpenStreetMap.
The predicted probabilities for buildings, cars, and trees. Note that this is the output of a model trained on one scene only; a model trained on the full dataset would produce much crisper results. Scene image source: ISPRS Potsdam Semantic Segmentation dataset.

Working with multispectral images

Satellites often have advanced sensors that pick up a wide range of the electromagnetic spectrum, resulting in images with more bands than the usual red, green, and blue. Raster Vision makes it trivial to use as many of these bands as you like. What’s more, if you’re using a model pre-trained on RGB images, Raster Vision can modify the first convolutional layer to accept additional (or fewer) channels while retaining the existing pre-trained weights.

An RGBIR image. Note how the trees are more distinguishable in the IR band — this suggests the usefulness of including the IR band. Scene image source: ISPRS Potsdam Semantic Segmentation dataset.
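For the Potsdam RGBIR imagery, using all four bands amounts to listing them in the raster source’s channel_order. A minimal sketch, assuming the bands are stored in R, G, B, IR order:

# Read all four bands instead of the usual three. When a pre-trained RGB
# backbone is used, Raster Vision adapts the first convolutional layer to
# the extra input channel while keeping the existing pre-trained weights.
raster_source = RasterioSourceConfig(
    uris=[image_uri],
    channel_order=[0, 1, 2, 3])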

Two ways of reading chips (sliding window and random sampling)

We have already seen one way of sampling chips from a large raster in the example above — using a sliding window with a stride equal to the window size. This gets us something like what is shown in the image below.

100 sliding windows of size 600 pixels with a stride of 600 pixels. Scene image source: ISPRS Potsdam Semantic Segmentation dataset.

But these 100 chips are only a small fraction of all possible chips that can be extracted from this scene. To get more chips, we can allow overlaps in neighboring chips by reducing the stride of the sliding window. The following image shows how we can quadruple the number of chips by halving the stride.

400 sliding windows of size 600 pixels with a stride of 300 pixels. The fainter lines show the boundaries of the additional windows obtained by reducing the stride. Scene image source: ISPRS Potsdam Semantic Segmentation dataset.

An alternative option is to sample chips randomly from anywhere within the raster — as many as we want. This feature also allows us to sample windows of different sizes — this can potentially help the model develop a level of robustness to scale. The snippet below shows how we can tell Raster Vision to sample 200 square windows with sizes ranging from 200 to 400 pixels.
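Approximately, and using the window-option names from the pytorch_learner API as we recall them (treat the exact class and field names as assumptions to verify against the docs), the configuration looks like:

from rastervision.pytorch_learner import (
    GeoDataWindowConfig, GeoDataWindowMethod, SemanticSegmentationGeoDataConfig)

window_opts = GeoDataWindowConfig(
    # Sample windows at random locations rather than sliding over a grid.
    method=GeoDataWindowMethod.random,
    # Draw 200 windows per scene...
    max_windows=200,
    # ...with side lengths anywhere between 200 and 400 pixels.
    size_lims=(200, 400))

data = SemanticSegmentationGeoDataConfig(
    scene_dataset=dataset,
    window_opts=window_opts,
    # Windows are resized to this size before being batched.
    img_sz=300)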

200 randomly sampled windows of sizes ranging from 200 to 400 pixels. Scene image source: ISPRS Potsdam Semantic Segmentation dataset.

Handling areas of interest

Fully annotating a scene that is several thousand pixels on a side is costly and time-consuming. What if we wanted to learn from a partially labeled scene? Or perhaps we have divided a single scene into a training region and a validation region. How do we restrict sampled chips to one region?

Raster Vision allows us to enforce this constraint by specifying an Area of Interest (AOI) in the form of one or more polygons. These can be provided as GeoJSON files to aoi_uris as shown below.
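In the scene config, this looks roughly like the following (the GeoJSON path is a placeholder):

# Restrict chip sampling to the given area(s) of interest.
scene = SceneConfig(
    id='2_10',
    raster_source=raster_source,
    label_source=label_source,
    aoi_uris=['path/to/train_aoi.geojson'])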

Sampling windows from within an AOI. Left: Sliding windows of size 300 pixels with a stride of 300 pixels. Right: An equal number of randomly sampled windows of sizes ranging from 200 to 400 pixels. Scene image source: ISPRS Potsdam Semantic Segmentation dataset.

Adding data augmentation

Data augmentation is an essential element of neural network training and Albumentations is one of the most popular data augmentation libraries around. Raster Vision allows you to specify arbitrarily complex Albumentations transforms (as long as they are serializable) and use them for data augmentation:
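A sketch of what this can look like; the aug_transform field and the expectation that the transform is passed in serialized (dict) form reflect the 0.13-era data config as we recall it, so verify against the docs for your version:

import albumentations as A

# An arbitrary (but serializable) augmentation pipeline.
aug_transform = A.Compose([
    A.ToGray(p=0.1),
    A.ImageCompression(p=0.3),
    A.CoarseDropout(p=0.3),  # a Cutout-style transform
    A.GridDistortion(p=0.3),
])

# Pass it to the data config in serialized form so that it can be stored
# alongside the rest of the pipeline configuration.
data = SemanticSegmentationGeoDataConfig(
    scene_dataset=dataset,
    window_opts=window_opts,
    img_sz=300,
    aug_transform=A.to_dict(aug_transform))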

Example data augmentations using ToGray, CutOut, ImageCompression, and GridDistortion Albumentations transforms.

Using custom models and loss functions

By default, Raster Vision provides support for some basic TorchVision models such as ResNets (-18/-50/-101) for chip classification and DeepLabV3 for semantic segmentation. But this can be unnecessarily restrictive if you want to make architecture customizations specific to your task or just want to try out the flavor-of-the-month Transformer. That is why Raster Vision also provides the freedom to import and use whatever model you want, as long as it interfaces correctly with the training and inference code. It allows importing arbitrary loss functions as well.

This functionality is made possible by the excellent Torch Hub module. In fact, as part of the work on it, we ended up contributing to the Torch Hub source code! It is now capable of loading model definitions from local directories instead of just GitHub repositories.

The following snippet shows how we can modify our semantic segmentation example to use a Panoptic FPN as our model and Focal Loss as our loss function.
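Sketched out, this looks roughly like the code below. The ExternalModuleConfig class and the external_def / external_loss_def hooks are the mechanism as we recall it from the 0.13-era API, and the repository and entrypoint names are placeholders rather than real identifiers:

from rastervision.pytorch_learner import (
    ExternalModuleConfig, SemanticSegmentationModelConfig, SolverConfig)

# Load a Panoptic FPN model definition from an external repo via Torch Hub.
model = SemanticSegmentationModelConfig(
    external_def=ExternalModuleConfig(
        github_repo='<user>/<panoptic-fpn-repo>',   # placeholder
        name='panoptic_fpn',
        entrypoint='make_panoptic_fpn',             # placeholder
        entrypoint_kwargs=dict(
            num_classes=len(class_config.names),
            in_channels=3)))

# Similarly, load a focal loss implementation to use as the training loss.
solver = SolverConfig(
    lr=1e-4,
    num_epochs=10,
    batch_sz=8,
    external_loss_def=ExternalModuleConfig(
        github_repo='<user>/<focal-loss-repo>',     # placeholder
        name='focal_loss',
        entrypoint='focal_loss',                    # placeholder
        entrypoint_kwargs=dict(alpha=0.25, gamma=2.0)))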

Running on AWS Batch

Efficiently training models using Raster Vision requires the use of a high-end GPU and multiple CPU cores. Since many users do not have this kind of hardware in-house, Raster Vision comes with support for running pipelines in the cloud using AWS Batch.

Specifying batch in the run command causes Raster Vision to submit a DAG (directed acyclic graph) representation of the pipeline to Batch, which will then run the pipeline using EC2 instances. This DAG contains a node for each Docker command to run, and an edge for each command that consumes the output of another command. In addition, the DAG specifies whether each command should run on a CPU or GPU instance, and how commands should be parallelized across instances. For example, the predict command can split its work across several nodes that run in parallel. AWS will then automatically start the instances that are needed, execute the commands, retry in case of failure, and then shut down instances that are no longer needed. When running on Batch, all output is stored on S3, and input is retrieved from S3 or over HTTP. Manually setting up Batch to use with Raster Vision can be a bit complicated, so we provide a CloudFormation template to automate the process.

rastervision run batch "example.py" --splits 2 \
-a root_uri "s3://my-bucket/output"

Going beyond geospatial data

Although Raster Vision was built as a tool to learn from geospatial data, its potential applications extend much farther.

Take the field of digital histopathology. A field that deals with very large raster images, it has received much attention from deep learning and computer vision researchers in recent years. In this context, the rasters are megapixel- or gigapixel-resolution scans of microscope slides known as whole slide images (WSIs), and the annotations are usually polygons, which can be specified in GeoJSON files.

Not only can Raster Vision be used to train deep learning models for histopathology, it has already been used for this successfully.

A WSI (source: CAMELYON16 dataset) overlaid with sliding windows produced by Raster Vision.

Conclusion and future plans

Raster Vision is an open source framework that bridges the divide between the world of GIS and deep learning-based computer vision. It provides a configurable computer vision pipeline that works on chip classification, semantic segmentation, and object detection, and seamlessly handles the idiosyncrasies of working with big geospatial datasets. The project began over four years ago when we competed in the ISPRS Potsdam Semantic Segmentation challenge, and has evolved to accommodate many of our client projects. Beyond Azavea, it has been used by graduate students, GIS consultants, non-profits, and governments around the world. In the past year, we have made the framework more flexible by adding support for multiband imagery, and custom models, loss functions, and data augmentations.

The immediate focus for Raster Vision is to make it more compatible with typical machine learning workflows, so that users can more easily make use of its unique capabilities. As part of this, we want to refactor Raster Vision into separate libraries, so that users are able to make use of individual parts (such as GeoDatasets). We would also like to have better support for SpatioTemporal Asset Catalog (STAC) datasets; STAC is an increasingly popular specification for cataloging geospatial data. As for Raster Vision’s computer vision capabilities, we want to add support for instance segmentation and multi-GPU training. In the long term, we would also like to establish a formal governance structure for the library.

Contributing to Raster Vision

There are many ways to contribute to Raster Vision. Users can ask (and answer!) questions using GitHub issues, or by posting in our Gitter channel. Issues for bug reports and feature requests, as well as small pull requests for bug fixes and documentation improvements, are always welcome. For larger pull requests, we encourage users to discuss the idea in an issue before getting too deep into writing code. We are happy to give advice on how to implement things. We are also interested in developing longer-term relationships with other organizations that use Raster Vision and can help us develop a roadmap and maintain the project; email us if you are interested.

We hope that you’ll give Raster Vision a try! Further material can be found in the official documentation, the GitHub repository, and at https://rastervision.io/.

Acknowledgement: Thanks to Rob Emanuele, James McClain, and all of the other past and present contributors to Raster Vision.
