Broad Area Satellite Imagery Semantic Segmentation (BASISS)

Extracting Road Masks from Massive SpaceNet Images

Adam Van Etten
Published in The DownLinQ
Jan 24, 2018 · 7 min read


Road network detection from overhead imagery is of great interest to a diverse set of fields, from humanitarian crisis response to autonomous vehicles. Road network mapping remains especially challenging in broad expanses of underdeveloped regions. To help address this challenge, the latest SpaceNet dataset includes over 8,000 km of road centerline labels with co-registered imagery. In a previous post we used these centerlines to create rasterized road masks. In this post we explore methods to derive road segmentation masks from SpaceNet satellite imagery, and demonstrate techniques to mitigate deep learning hardware limitations in order to infer maps over large areas. Attendant code is provided for the interested reader.

1. Existing Approaches and Data

Extracting road pixels from small chips of aerial imagery has a rich history (e.g., Mnih and Hinton 2010, Wang et al. 2016, Zhang et al. 2017). These algorithms generally rely on lower-resolution imagery (GSD >= 1 meter) and OpenStreetMap labels. The large dataset size, higher resolution (0.3 meter GSD), and hand-labeled, quality-controlled road labels of SpaceNet provide a significant enhancement over existing datasets and an opportunity for algorithm improvement. SpaceNet image chips are 1300 x 1300 pixels in size (~400 m x ~400 m).

2. Mask Extraction

In a previous post we demonstrated how to extract road training masks from SpaceNet imagery and labels (see Figure 1). The mask-creation code is slightly tweaked from the previous post: it now accepts either 3-band RGB or 8-band multispectral imagery, and collects lists of files to be used for training or testing. Masks can be created by running the following bash command:

basiss_path=/raid/local/src/basiss
cd $basiss_path
conda env create -f src/apls_environment.yml
source activate apls_environment
python $basiss_path/create_spacenet_masks.py \
--path_data=/path_to_spacenet_data/AOI_2_Vegas_Train \
--output_df_path=$basiss_path/packaged_data/AOI_2_Train_2m_file_locs.csv \
--buffer_meters=2 \
--n_bands=3 \
--make_plots=1 \
--overwrite_ims=1
Figure 1. Example output of the create_spacenet_masks.py script. In the upper-left we show the raw GeoJSON label, and the upper-right displays the corresponding 8-bit RGB image. The lower-left illustrates the output of the script: the pixel mask inferred from the GeoJSON label. The lower-right shows the road mask overlaid on the RGB image.

3. Segmentation Model

We apply a deep learning segmentation algorithm to determine which SpaceNet pixels belong to roads, and which to background. We experiment with multiple network architectures, though note little difference in performance among the various models. Accordingly, in this blog we focus on one of the simplest recent architectures: U-Net. We cast the training masks created in Section 2 into a 2-layer stack consisting of source (road) and background.
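
As a simple illustration of this casting step, a binary road mask can be stacked with its complement to form the two output channels (a minimal sketch with hypothetical variable names; the actual data preparation lives in the BASISS code):

import numpy as np

# Minimal sketch: cast a binary road mask of shape (H, W) into a
# 2-layer stack of [road, background] channels for the segmentation model
def mask_to_two_layer(mask):
    road = (mask > 0).astype(np.float32)
    background = 1.0 - road
    return np.stack([road, background], axis=-1)  # shape (H, W, 2)

The U-Net itself is defined below.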

from keras.layers import Input, Conv2D, MaxPooling2D

def unet(input_shape, n_classes=2, kernel=3,
         loss='binary_crossentropy', data_format='channels_last'):
    '''U-Net (https://arxiv.org/abs/1505.04597), adapted from
    https://github.com/jocicmarko/ultrasound-nerve-segmentation/blob/master/train.py'''
    print("UNET input shape:", input_shape)
    inputs = Input(input_shape)
    # encoder: pairs of 3x3 convolutions followed by 2x2 max pooling
    conv1 = Conv2D(32, (kernel, kernel), activation='relu',
                   padding='same')(inputs)
    conv1 = Conv2D(32, (kernel, kernel), activation='relu',
                   padding='same')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)
    conv2 = Conv2D(64, (kernel, kernel), activation='relu',
                   padding='same')(pool1)
    ...
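
The layers elided above continue the standard U-Net encoder-decoder pattern of the reference implementation linked in the docstring. Assuming the completed function compiles and returns a Keras model (as its loss argument suggests), instantiating it for 400 x 400 pixel, 3-band cutouts might look like the following hypothetical sketch:

# Hypothetical usage: build the network for 400 x 400 RGB cutouts
# with a 2-layer (road / background) output, assuming unet() returns
# a compiled Keras model
model = unet(input_shape=(400, 400, 3), n_classes=2,
             loss='binary_crossentropy')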

4. Image Slicing

Even for a relatively simple architecture such as U-Net, typical GPU hardware (an NVIDIA Titan X with 12 GB of memory) will saturate for images greater than ~1000 pixels in extent at reasonable batch sizes (> 4). Therefore, in order to evaluate even the SpaceNet image chips (let alone the larger full DigitalGlobe images), we must subdivide the imagery and training masks into smaller cutouts, as detailed in the code below. For example, slicing a 1300 x 1300 pixel SpaceNet chip into 400 pixel cutouts with a stride of 300 pixels yields a 4 x 4 grid of overlapping cutouts.

import time
import numpy as np
import pandas as pd

def slice_ims(im_arr, mask_arr, names_arr, slice_x, slice_y,
              stride_x, stride_y,
              pos_columns=['idx', 'name', 'xmin', 'ymin',
                           'slice_x', 'slice_y', 'im_x', 'im_y'],
              verbose=True):
    '''Slice images into patches, assume ground truth masks
    are present'''

    if verbose:
        print("Slicing images and masks...")
    t0 = time.time()
    mask_buffer = 0
    count = 0
    im_list, mask_list, name_list, pos_list = [], [], [], []
    nims, h, w, nbands = im_arr.shape

    for i, (im, mask, name) in enumerate(zip(im_arr, mask_arr,
                                             names_arr)):
        seen_coords = set()
        if verbose and (i % 100) == 0:
            print(i, "im_shape:", im.shape,
                  "mask_shape:", mask.shape)

        # dice it up: iterate through the image and bin it
        # into overlapping cutouts
        for x in range(0, w - 1, stride_x):
            for y in range(0, h - 1, stride_y):

                xmin = min(x, w - slice_x)
                ymin = min(y, h - slice_y)
                coords = (xmin, ymin)

                # check if we've already seen these coords
                if coords in seen_coords:
                    continue
                else:
                    seen_coords.add(coords)

                # check if we screwed up binning
                if (xmin + slice_x > w) or (ymin + slice_y > h):
                    print("Improperly binned image")
                    return

                # get satellite image cutout
                im_cutout = im[ymin:ymin + slice_y,
                               xmin:xmin + slice_x]

                # skip if the whole cutout is black
                if np.max(im_cutout) < 1.:
                    continue
                else:
                    count += 1

                # get mask cutout
                x1, y1 = xmin + mask_buffer, ymin + mask_buffer
                mask_cutout = mask[y1:y1 + slice_y, x1:x1 + slice_x]

                # set slice name
                name_full = str(i) + '_' + name + '_' \
                    + str(xmin) + '_' + str(ymin) + '_' \
                    + str(slice_x) + '_' + str(slice_y) \
                    + '_' + str(w) + '_' + str(h)
                pos = [i, name, xmin, ymin, slice_x, slice_y, w, h]

                # add to output lists
                name_list.append(name_full)
                im_list.append(im_cutout)
                mask_list.append(mask_cutout)
                pos_list.append(pos)

    # convert to numpy arrays
    del im_arr
    del mask_arr
    name_out_arr = np.array(name_list)
    im_out_arr = np.array(im_list)
    mask_out_arr = np.array(mask_list)

    # create position dataframe
    df_pos = pd.DataFrame(pos_list, columns=pos_columns)
    df_pos.index = np.arange(len(df_pos))

    if verbose:
        print("  im_out_arr.shape:", im_out_arr.shape)
        print("  mask_out_arr.shape:", mask_out_arr.shape)
        print("  mask_out_arr[0] == mask_out_arr[1]?:",
              np.array_equal(mask_out_arr[0], mask_out_arr[1]))
        print("  Time to slice arrays:", time.time() - t0,
              "seconds")

    return df_pos, name_out_arr, im_out_arr, mask_out_arr
Figure 2. Process of slicing a large satellite image (top) and ground truth road mask (bottom) into smaller cutouts for algorithm training or inference.
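
For reference, a usage sketch of slice_ims matching the slicing parameters used in the training command of Section 5 (400 pixel cutouts with a 300 pixel stride); the input arrays here are hypothetical names:

# Hypothetical inputs: im_arr of shape (n_images, 1300, 1300, 3),
# mask_arr of shape (n_images, 1300, 1300), names_arr of image IDs
df_pos, name_out_arr, im_out_arr, mask_out_arr = slice_ims(
    im_arr, mask_arr, names_arr,
    slice_x=400, slice_y=400,
    stride_x=300, stride_y=300)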

5. Training Procedure

Sliced image cutouts (400 pixels in extent by default) and their corresponding mask cutouts form the training data for our segmentation algorithm. We train a separate classifier with Keras for each SpaceNet city (Las Vegas, Paris, Shanghai, Khartoum). We hold back 10% of the training set for validation, use binary cross entropy as our loss function, and employ early stopping (training halts when the validation loss has not improved for four consecutive epochs). For the Las Vegas dataset, for example, training converges after 17 epochs, which takes approximately 15 hours. Training can be accomplished via the command:

# train Las Vegas SpaceNet 3-band data with unet, 
# and sliced into 400 pixel cutouts
basiss_path=/raid/local/src/basiss
outname=AOI_2_Vegas_unet_2m_train
cd $basiss_path
nohup python -u src/basiss.py \
--path $basiss_path \
--model unet \
--mode train \
--file_list AOI_2_Train_2m_file_locs.csv \
--slice_x 400 --slice_y 400 \
--stride_x 300 --stride_y 300 \
--n_bands 3 \
--n_classes 2 \
--batchsize 32 \
--validation_split 0.1 \
--early_stopping_patience 4 \
--epochs 128 \
--gpu 0 \
--prefix $outname > \
results/$outname.log \
& tail -f results/$outname.log
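
For reference, a minimal Keras sketch of the training configuration invoked above (a simplified, hypothetical rendering; the actual training loop lives in src/basiss.py):

from keras.callbacks import EarlyStopping, ModelCheckpoint

# Simplified, hypothetical training call: 10% validation split,
# batch size 32, and early stopping after four epochs without
# improvement in validation loss; im_out_arr and mask_out_arr are
# the sliced cutouts and 2-layer masks from Section 4
callbacks = [
    EarlyStopping(monitor='val_loss', patience=4),
    ModelCheckpoint('AOI_2_Vegas_unet_2m_train_model_best.hdf5',
                    monitor='val_loss', save_best_only=True),
]
model.fit(im_out_arr, mask_out_arr,
          batch_size=32,
          epochs=128,
          validation_split=0.1,
          callbacks=callbacks)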

6. Testing Procedure

Test images are sliced as in Section 4 above, and run through the trained model. The multiple cutouts for each image are then aggregated into a final image, as detailed in Figure 3.

Figure 3. Creating a road prediction map for an image of arbitrary size. 1. Run an overlapping sliding window over the large input image. 2. Run each image chip through the segmentation algorithm. 3. Stitch the image chips into a total map. 4. Divide this map by the coverage map (the number of times each pixel appeared in a sliding window). 5. The final output is the normalized road mask.
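
A minimal sketch of the stitch-and-normalize steps (3-5) described above, assuming per-cutout road predictions and the position dataframe returned by slice_ims; the function and variable names are illustrative rather than the exact basiss.py implementation:

import numpy as np

def stitch_predictions(pred_cutouts, df_pos, h, w):
    '''Illustrative sketch: accumulate overlapping cutout predictions
    into a full-size map, then normalize by the coverage map'''
    total = np.zeros((h, w), dtype=np.float32)
    coverage = np.zeros((h, w), dtype=np.float32)
    for pred, (_, row) in zip(pred_cutouts, df_pos.iterrows()):
        x0, y0 = int(row['xmin']), int(row['ymin'])
        sx, sy = int(row['slice_x']), int(row['slice_y'])
        total[y0:y0 + sy, x0:x0 + sx] += pred
        coverage[y0:y0 + sy, x0:x0 + sx] += 1.0
    # avoid division by zero for pixels never covered by a window
    return total / np.maximum(coverage, 1.0)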

Testing is invoked with the following command:

# test Las Vegas SpaceNet 3-band data with unet, 
# and sliced into 400 pixel cutouts
basiss_path=/raid/local/src/basiss
outname=AOI_2_Vegas_unet_2m_test
cd $basiss_path
nohup python -u src/basiss.py \
--path $basiss_path \
--model unet \
--mode test \
--file_list massive_file_list.csv \
--model_weights AOI_2_Vegas_unet_2m_train_model_best.hdf5 \
--slice_x 400 --slice_y 400 \
--stride_x 300 --stride_y 300 \
--n_bands 3 \
--n_classes 2 \
--batchsize 16 \
--gpu 3 \
--prefix $outname > \
results/$outname.log & \
tail -f results/$outname.log

7. Results

Figure 4. 400m chips over Vegas. Ground truth labels are shown at left, with the predicted mask on the right. Predictions for both of these image chips are quite good (pixel-based F1 > 0.8).
Figure 5. More challenging 400m chips in Khartoum (Top: F1 = 0.59. Bottom: F1 = 0.44). Dirt roads are particularly difficult, and dirt paths are often confused with roads.

Using the procedure detailed in Figure 3, we can also evaluate much larger images. In Figure 6 we illustrate road predictions on large images over 8000 pixels in extent. GPU inference time is ~15 seconds for images 2.5 km in extent.

Figure 6. 2.5 x 2.5 km image chips (left column) and road prediction mask (right column) over Las Vegas. F1 = 0.75 for both images.
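
The F1 scores quoted in Figures 4-6 are pixel-based: the predicted mask is binarized and compared to the ground truth mask pixel by pixel. A simple sketch of such a score (our evaluation code may differ in detail):

import numpy as np

def pixel_f1(mask_pred, mask_true, thresh=0.5):
    '''Illustrative pixel-wise F1 score: binarize the prediction,
    then compute 2 * precision * recall / (precision + recall)'''
    pred = mask_pred >= thresh
    true = mask_true > 0
    tp = float(np.sum(pred & true))
    fp = float(np.sum(pred & ~true))
    fn = float(np.sum(~pred & true))
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    return 2 * precision * recall / (precision + recall + 1e-9)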

8. Conclusions

The SpaceNet Road Detection and Routing Challenge aims to extract road network graphs directly from satellite imagery. In this post we explore one of the steps in the algorithmic chain, namely how to infer road masks from SpaceNet imagery. GPU memory limitations constrain segmentation algorithms to images of ~1000 pixels (roughly 300 m) in extent, yet any eventual application of road inference must be able to process far larger images. Accordingly, we demonstrate methods to infer road network masks for images of arbitrary size, and illustrate examples for images 2.5 km (>8000 pixels) in extent. Subsequent posts will explore how to turn these masks into graph structures. We encourage the interested reader to explore our GitHub repository, as well as the SpaceNet challenge hosted on TopCoder.

* Thanks to lporter for helpful comments and to David Lindenbaum for assistance with ground truth evaluation.
