Broad Area Satellite Imagery Semantic Segmentation (BASISS)
Extracting Road Masks from Massive SpaceNet Images
Road network detection from overhead imagery is of great interest to a diverse set of applications, from humanitarian crisis response to autonomous vehicles. Road network mapping remains especially challenging in broad expanses of underdeveloped regions. To help address this challenge, the latest SpaceNet dataset includes over 8000 km of road centerline labels with co-registered imagery. In a previous post we used these centerlines to create rasterized road masks. In this post we explore methods to derive road segmentation masks from SpaceNet satellite imagery, and demonstrate techniques to mitigate deep learning hardware limitations in order to infer maps over large areas. Attendant code is provided for the interested reader.
1. Existing Approaches and Data
Extracting road pixels in small image chips from aerial imagery has a rich history (e.g. Mnih and Hinton 2010, Wang et al. 2016, Zhang et al. 2017). These algorithms generally use lower resolution imagery (GSD >= 1 meter) and OpenStreetMap labels. The large dataset size, higher resolution (0.3 meter GSD), and hand-labeled, quality-controlled labels of SpaceNet offer a significant enhancement over existing datasets and an opportunity for algorithm improvement. SpaceNet image chips are 1300 x 1300 pixels in size (~400m x ~400m).
2. Mask Extraction
In a previous post we demonstrated how to extract road training masks from SpaceNet imagery and labels (see Figure 1). The mask-creation code is slightly tweaked from the previous post: it now accepts either 3-band RGB or 8-band multispectral imagery, and it collects lists of files to be used for training or testing. Masks can be created by running the following bash commands:
basiss_path=/raid/local/src/basiss
cd $basiss_path
conda env create -f src/apls_environment.yml
source activate apls_environment
python $basiss_path/create_spacenet_masks.py \
    --path_data=/path_to_spacenet_data/AOI_2_Vegas_Train \
    --output_df_path=$basiss_path/packaged_data/AOI_2_Train_2m_file_locs.csv \
    --buffer_meters=2 \
    --n_bands=3 \
    --make_plots=1 \
    --overwrite_ims=1
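To make the meaning of the buffer concrete: at 0.3 meter GSD, a 2 meter buffer corresponds to roughly 6.7 pixels on either side of the centerline. The snippet below is a minimal sketch of that idea for a single straight segment, using only NumPy. It is not the logic of create_spacenet_masks.py; the function name segment_mask and its arguments are hypothetical, and real centerlines are multi-segment geometries in geographic coordinates.

```python
import numpy as np

GSD_METERS = 0.3  # SpaceNet ground sample distance quoted in Section 1


def segment_mask(shape, p0, p1, buffer_meters=2.0, gsd=GSD_METERS):
    """Return a uint8 mask (1 = road) of all pixels that lie within
    buffer_meters of the segment p0 -> p1.

    shape is (height, width); p0 and p1 are (x, y) pixel coordinates.
    Hypothetical helper for illustration only.
    """
    h, w = shape
    buffer_px = buffer_meters / gsd  # 2 m / 0.3 m is about 6.7 pixels
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    (x0, y0), (x1, y1) = p0, p1
    dx, dy = x1 - x0, y1 - y0
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # degenerate segment: distance to a single point
        t = np.zeros_like(xs)
    else:
        # fractional position of the closest point along the segment
        t = np.clip(((xs - x0) * dx + (ys - y0) * dy) / seg_len_sq, 0, 1)
    dist = np.hypot(xs - (x0 + t * dx), ys - (y0 + t * dy))
    return (dist <= buffer_px).astype(np.uint8)


# horizontal road through the middle of a 100 x 100 pixel tile
mask = segment_mask((100, 100), (10, 50), (90, 50))
print(mask.shape, int(mask.sum()))
```

A full mask would take the union of such buffers over every segment of every labeled centerline in the chip.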
3. Segmentation Model
We apply a deep learning segmentation algorithm to determine which SpaceNet pixels belong to roads, and which to background. We experiment with multiple network architectures, though note little difference in performance among the various models. Accordingly, in this blog we focus on one of the simplest recent architectures: U-Net. We cast the training masks created in Section 2 into a 2-layer stack consisting of source (road) and background.
from keras.layers import Input, Conv2D, MaxPooling2D

def unet(input_shape, n_classes=2, kernel=3,
         loss='binary_crossentropy', data_format='channels_last'):
    '''https://arxiv.org/abs/1505.04597
    https://github.com/jocicmarko/ultrasound-nerve-segmentation/blob/master/train.py'''
    print("UNET input shape:", input_shape)
    inputs = Input(input_shape)
    # first encoder block: two 3x3 convolutions, then 2x2 max pooling
    conv1 = Conv2D(32, (kernel, kernel), activation='relu',
                   padding='same')(inputs)
    conv1 = Conv2D(32, (kernel, kernel), activation='relu',
                   padding='same')(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)
    conv2 = Conv2D(64, (kernel, kernel), activation='relu',
                   padding='same')(pool1)
    ...
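The 2-layer mask stack mentioned above can be illustrated with a short NumPy sketch. The helper name to_two_layer_stack is hypothetical, and the channel order (road first, background second) and the channels_last layout are assumptions based on the description and the default data_format in the function signature; the actual code may differ.

```python
import numpy as np


def to_two_layer_stack(road_mask):
    """Convert an (H, W) road mask (nonzero = road) into an (H, W, 2)
    float array: channel 0 is road, channel 1 is background.
    Hypothetical helper; the channel order is an assumption."""
    road = (np.asarray(road_mask) > 0).astype(np.float32)
    background = 1.0 - road
    return np.stack([road, background], axis=-1)


# tiny 2 x 2 example mask with road pixels stored as 255
road_mask = np.array([[0, 255],
                      [255, 0]])
stack = to_two_layer_stack(road_mask)
print(stack.shape)  # (2, 2, 2)
```

Each pixel then carries a one-hot label across the two channels, which matches n_classes=2 in the model signature.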
4. Image Slicing
Even for a relatively simple architecture such as U-Net, typical GPU hardware (NVIDIA Titan X GPU with 12 GB memory) will saturate for images > 1000 pixels in extent and reasonable batch sizes > 4. Therefore, in order to evaluate even the SpaceNet image chips (let alone the larger full DigitalGlobe images), we must subdivide the imagery and training masks into smaller chunks, as detailed in the code below.
import time

import numpy as np
import pandas as pd


def slice_ims(im_arr, mask_arr, names_arr, slice_x, slice_y,
              stride_x, stride_y,
              pos_columns=['idx', 'name', 'xmin',
                           'ymin', 'slice_x',
                           'slice_y', 'im_x', 'im_y'],
              verbose=True):
    '''Slice images into patches, assume ground truth masks
    are present'''
    if verbose:
        print("Slicing images and masks...")
    t0 = time.time()
    mask_buffer = 0
    count = 0
    im_list, mask_list, name_list, pos_list = [], [], [], []
    nims, h, w, nbands = im_arr.shape
    for i, (im, mask, name) in enumerate(zip(im_arr, mask_arr,
                                             names_arr)):
        seen_coords = set()
        if verbose and (i % 100) == 0:
            print(i, "im_shape:", im.shape,
                  "mask_shape:", mask.shape)
        # dice it up
        # after resize, iterate through image
        # and bin it up appropriately
        for x in range(0, w - 1, stride_x):
            for y in range(0, h - 1, stride_y):
                # shift the last window back so it ends at the image edge
                xmin = min(x, w - slice_x)
                ymin = min(y, h - slice_y)
                coords = (xmin, ymin)
                # check if we've already seen these coords
                if coords in seen_coords:
                    continue
                else:
                    seen_coords.add(coords)
                # check if we screwed up binning
                if (xmin + slice_x > w) or (ymin + slice_y > h):
                    print("Improperly binned image,")
                    return
                # get satellite image cutout
                im_cutout = im[ymin:ymin + slice_y,
                               xmin:xmin + slice_x]
                ##############
                # skip if the whole thing is black
                if np.max(im_cutout) < 1.:
                    continue
                else:
                    count += 1
                ###############
                # get mask cutout
                x1, y1 = xmin + mask_buffer, ymin + mask_buffer
                mask_cutout = mask[y1:y1 + slice_y, x1:x1 + slice_x]
                # set slice name
                name_full = str(i) + '_' + name + '_' \
                    + str(xmin) + '_' + str(ymin) + '_' \
                    + str(slice_x) + '_' + str(slice_y) \
                    + '_' + str(w) + '_' + str(h)
                pos = [i, name, xmin, ymin, slice_x, slice_y, w, h]
                # add to arrays
                name_list.append(name_full)
                im_list.append(im_cutout)
                mask_list.append(mask_cutout)
                pos_list.append(pos)
    # convert to np arrays
    del im_arr
    del mask_arr
    name_out_arr = np.array(name_list)
    im_out_arr = np.array(im_list)
    mask_out_arr = np.array(mask_list)
    # create position dataframe
    df_pos = pd.DataFrame(pos_list, columns=pos_columns)
    df_pos.index = np.arange(len(df_pos))
    if verbose:
        print(" im_out_arr.shape:", im_out_arr.shape)
        print(" mask_out_arr.shape:", mask_out_arr.shape)
        print(" mask_out_arr[0] == mask_out_arr[1]?:",
              np.array_equal(mask_out_arr[0], mask_out_arr[1]))
        print(" Time to slice arrays:", time.time() - t0,
              "seconds")
    return df_pos, name_out_arr, im_out_arr, mask_out_arr
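To see how the stride and the edge clamping interact, it can help to compute the window origins along a single axis. The sketch below mirrors the min(x, w - slice_x) logic of slice_ims for one dimension; the helper name window_origins is hypothetical and is not part of the BASISS code.

```python
def window_origins(length, slice_size, stride):
    """Unique window start offsets along one axis. Starts advance by
    stride, and the final window is shifted back so that it ends
    exactly at the image edge (same clamping as in slice_ims)."""
    origins = []
    for start in range(0, length - 1, stride):
        clamped = min(start, length - slice_size)
        if clamped not in origins:
            origins.append(clamped)
    return origins


# a 1300-pixel SpaceNet chip with 400-pixel slices and a 300-pixel stride
xs = window_origins(1300, 400, 300)
print(xs)            # [0, 300, 600, 900]
print(len(xs) ** 2)  # 16
```

With these settings each 1300 x 1300 chip yields at most 16 cutouts (fewer if any cutout is entirely black and therefore skipped), and adjacent cutouts overlap by 100 pixels.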
5. Training Procedure
Sliced image cutouts and their corresponding masks (400 pixels in extent by default) form the training data for our segmentation algorithm. We train a separate classifier with Keras for each SpaceNet city (Las Vegas, Paris, Shanghai, Khartoum). We hold back 10% of the training set for validation. We use binary cross entropy for our loss function, and utilize early stopping (we stop training when the validation loss has not improved for four consecutive epochs). For the Las Vegas dataset, for example, training converges after 17 epochs, which takes 15 hours. Training can be accomplished via the command:
# train Las Vegas SpaceNet 3-band data with unet,
# and sliced into 400 pixel cutouts
basiss_path=/raid/local/src/basiss
outname=AOI_2_Vegas_unet_2m_train
cd $basiss_path
nohup python -u src/basiss.py \
--path $basiss_path \
--model unet \
--mode train \
--file_list AOI_2_Train_2m_file_locs.csv \
--slice_x 400 --slice_y 400 \
--stride_x 300 --stride_y 300 \
--n_bands 3 \
--n_classes 2 \
--batchsize 32 \
--validation_split 0.1 \
--early_stopping_patience 4 \
--epochs 128 \
--gpu 0 \
--prefix $outname > \
results/$outname.log \
& tail -f results/$outname.log
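The early stopping rule described in this section can be sketched in a few lines of plain Python. This is an illustration of the rule only, not the Keras callback or the BASISS training loop; the function name early_stopping_epoch is hypothetical, and the loss values in the example are made up for illustration rather than taken from the actual training runs.

```python
def early_stopping_epoch(val_losses, patience=4):
    """Return the 1-based epoch at which training would stop, i.e. the
    first epoch where the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. Returns None if that
    never happens within the given history."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# illustrative validation losses (not results from this post)
history = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63]
stop_epoch = early_stopping_epoch(history, patience=4)
print(stop_epoch)  # 7
```

In this example the best loss occurs at epoch 3, so with a patience of 4 training halts after epoch 7, and the weights saved at the best epoch are the ones worth keeping for inference.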
6. Testing Procedure
Test images are sliced as in Section 4 above, and run through the trained model. The multiple cutouts for each image are then aggregated into a final image, as detailed in Figure 3.
Testing is invoked with the following command:
# test Las Vegas SpaceNet 3-band data with unet,
# and sliced into 400 pixel cutouts
basiss_path=/raid/local/src/basiss
outname=AOI_2_Vegas_unet_2m_test
cd $basiss_path
nohup python -u src/basiss.py \
--path $basiss_path \
--model unet \
--mode test \
--file_list massive_file_list.csv \
--model_weights AOI_2_Vegas_unet_2m_train_model_best.hdf5 \
--slice_x 400 --slice_y 400 \
--stride_x 300 --stride_y 300 \
--n_bands 3 \
--n_classes 2 \
--batchsize 16 \
--gpu 3 \
--prefix $outname > \
results/$outname.log & \
tail -f results/$outname.log
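Because the 400-pixel cutouts are taken with a 300-pixel stride, neighboring predictions overlap and must be reconciled when they are stitched back together. The sketch below assumes the overlapping predictions are simply averaged per pixel; that choice, the function name stitch_predictions, and its arguments are assumptions for illustration, and the actual aggregation in BASISS may differ.

```python
import numpy as np


def stitch_predictions(preds, positions, im_h, im_w):
    """Average overlapping window predictions into one (im_h, im_w) map.

    preds:     iterable of 2-D arrays of per-pixel road scores
    positions: iterable of (xmin, ymin) upper-left window corners,
               e.g. taken from the position dataframe built when slicing
    Hypothetical helper; averaging is an assumed aggregation rule.
    """
    total = np.zeros((im_h, im_w), dtype=np.float64)
    counts = np.zeros((im_h, im_w), dtype=np.float64)
    for pred, (xmin, ymin) in zip(preds, positions):
        sy, sx = pred.shape
        total[ymin:ymin + sy, xmin:xmin + sx] += pred
        counts[ymin:ymin + sy, xmin:xmin + sx] += 1
    # pixels never covered by any window are left at 0
    return np.divide(total, counts, out=np.zeros_like(total),
                     where=counts > 0)


# two 4 x 4 windows on a 4 x 6 image, overlapping in columns 2 and 3
preds = [np.ones((4, 4)), np.zeros((4, 4))]
positions = [(0, 0), (2, 0)]
stitched = stitch_predictions(preds, positions, im_h=4, im_w=6)
print(stitched[0])
```

Keeping a separate count array makes the result independent of how many windows cover each pixel, which matters near image edges where the last window is shifted back and the overlap is larger.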
7. Results
Using the procedure detailed in Figure 3, we can also evaluate much larger images. In Figure 6 we illustrate road predictions on large images over 8000 pixels in extent. GPU inference time is ~15 seconds for images 2.5 km in extent.
8. Conclusions
The SpaceNet Road Detection and Routing Challenge aims to extract road network graphs directly from satellite imagery. In this post we explore one of the steps in the algorithmic chain, namely how to infer road masks from SpaceNet imagery. GPU memory limitations constrain segmentation algorithms to images ~1000 pixels in extent (~300m at 0.3 meter GSD), yet any eventual application of road inference must be able to process images larger than ~300m in extent. Accordingly, we demonstrate methods to infer road network masks for images of arbitrary size, and illustrate examples for images 2.5 km (>8000 pixels) in extent. Subsequent posts will explore how to turn these masks into graph structures. We encourage the interested reader to explore our GitHub repository, as well as the SpaceNet challenge hosted on TopCoder.
* Thanks to lporter for helpful comments and David Lindenbaum for assistance with ground truth evaluation