Image segmentation with Neural Net

First version 14th of March 2017

Neural networks with convolution filters are very accurate at identifying an object, or a person, in a photo. But what about decomposing a scene that contains not just one object but several? This is the image segmentation challenge.

This post discusses:

  • Segmentation neural networks
  • Up-sampling feature maps with “DeConvolution” layers
  • A Keras implementation of a non-sequential neural network
  • The impact of the training method on segmentation accuracy
  • The impact of image resolution on the segmentation task

Neural-network architecture: FCN-8s

Not surprisingly, re-using a single-object classifier model can help a lot with the multi-object problem, and that is the approach we present here. Specifically, we see how the VGG “1 photo => 1 class” architecture can be unrolled back to a pixel-wise segmentation. This is the FCN-Xs model.

FCN-8s architecture

The main issue when turning VGG into a pixel-wise segmenter is that localization information is lost after each convolution+pooling block. On the model diagram this is represented by the downward trend: each pooling block “P” shrinks the resolution by 2. Specifically, the pooling function groups matrix cells into non-overlapping 2x2 tiles and replaces each tile with a single cell containing the maximum of the 4 original values.
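To make the operation concrete, here is a minimal NumPy sketch of 2x2 max-pooling (the input matrix is illustrative, not actual VGG activations):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling: group cells into non-overlapping 2x2 tiles and
    keep only the maximum of each tile, halving the resolution."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [5, 6, 4, 0]], dtype=float)
print(max_pool_2x2(x))
# [[4. 8.]
#  [9. 4.]]
```

The 4x4 input becomes a 2x2 output: the maxima survive, but *where* inside each tile they came from is forgotten, which is exactly the localization loss described above.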

The FCN-Xs model’s “U” blocks Up-sample to a higher resolution using DeConvolution, alias Convolution Transpose, alias Fractional Convolution operations. You might have expected an “un-pooling” operation to play the converse role of the “pooling” one. However, “un-pooling” is parameter-free, and with deep learning we hope to train very expressive functions from large datasets. The authors of the FCN-Xs model (see paper) therefore opted for DeConvolution layers, which have a trainable kernel.

Remark: the final up-sampling block needs to increase the resolution by 2^3 = 8, hence the model name FCN-8s.
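A bare-bones NumPy sketch of the transposed convolution itself (the 2x2 all-ones kernel here is purely illustrative; in FCN-Xs the kernel is trainable and typically initialized from bilinear interpolation weights):

```python
import numpy as np

def conv_transpose_2d(x, kernel, stride=2):
    """Minimal transposed convolution ("DeConvolution"): each input cell
    scatters a copy of the kernel, scaled by the cell's value, into the
    output at stride intervals. With stride 2 the resolution doubles,
    playing the converse role of 2x2 pooling."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * kernel
    return out

x = np.array([[1., 2.],
              [3., 4.]])
kernel = np.ones((2, 2))  # illustrative; the real kernel is learned
up = conv_transpose_2d(x, kernel)
print(up.shape)  # (4, 4)
```

Because the kernel enters the training loop like any convolution weight, the network can learn how to up-sample rather than being stuck with a fixed, parameter-free rule.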

Implementation with Keras

Source code and test notebook available here on GitHub.

To discuss the implementation, I use a simpler version of the model, the FCN-16s.

FCN-16s architecture

Note that the model is sequential until the end of the first Up-sampling block. This sequence of operations is named FCN-32 in the following code snippets. The FCN-32 implementation is mostly the same as the VGG16 model discussed here.

In the diagram, the novelty lies in:

  • The red arrow out of the CB4>P node: it turns a stack of ‘N’ convolution filters into 21 categorical filters (“score_pool4” in the code).
  • The SUM block, implemented with the “merge” function.
  • The 2^4=16 up-sampling de-convolution (“upsample_new” in the code).
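The wiring of those three pieces can be sketched framework-free in NumPy (shapes and variable names are illustrative stand-ins, and nearest-neighbour repetition replaces the trainable de-convolutions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 21

# Stand-ins for the two tensors that meet at the SUM block (channels-last):
pool4 = rng.standard_normal((32, 32, 512))               # output of CB4>P
fcn32_scores = rng.standard_normal((16, 16, n_classes))  # coarse FCN-32 scores

# "score_pool4": a 1x1 convolution is just a per-pixel linear map over
# channels, i.e. a matrix product on the channel axis.
w_score = rng.standard_normal((512, n_classes)) * 0.01
score_pool4 = pool4 @ w_score                            # (32, 32, 21)

# 2x up-sampling of the coarse scores (nearest-neighbour stand-in for the
# trainable 2x de-convolution), then the element-wise SUM merge.
up2 = fcn32_scores.repeat(2, axis=0).repeat(2, axis=1)   # (32, 32, 21)
fused = score_pool4 + up2

# "upsample_new": the final 2^4 = 16x up-sampling back to input resolution.
segmentation = fused.repeat(16, axis=0).repeat(16, axis=1)
print(segmentation.shape)  # (512, 512, 21)
```

The key point is the shape bookkeeping: the skip branch and the up-sampled coarse branch must land on identical (height, width, 21) shapes before the SUM, after which a single 16x up-sampling restores full resolution.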

Model weights

Commonly used training datasets for image segmentation tasks:

  • PASCAL Visual Object Classes: VOC
  • Microsoft Common Objects in Context: COCO

Fortunately, we do not need to train FCN-8s ourselves, as best-in-class trained weights are available here on the MatConvNet site. The file needs some wrangling to be converted from MatConvNet to Keras, as explained in this previous post.

Interestingly, the MatConvNet site provides 2 different sets of trained weights for the FCN-8s architecture:

  • One comes from the team that authored the original model. This version was trained solely on the PASCAL VOC 2011 dataset.
  • The other comes from a refined model that was trained with a post-processing step. This version was trained on both the PASCAL VOC 2011 and the Microsoft COCO datasets.

Using two state-of-the-art calibrations of the same architecture lets us get a sense of how much the data and the training method matter to the overall model accuracy.

Tests / Results

In all the model configurations we tested, the neural net identified the following classes in the image:

  • background: correct
  • person: correct
  • bicycle: correct
  • motorbike: incorrect, actually the back wheel of the bicycle on the right

Coarsest configuration

  • FCN-16s trained on VOC.
  • Applied to a 512x512 image.
FCN16s — 512x512

Intermediate configuration

  • FCN-8s trained on VOC.
  • Applied to a 512x512 image.
FCN8s — 512x512

Observation: with the same training method, the FCN-16s and FCN-8s configurations perform very similarly on this test image.

Finest configuration

  • FCN-8s trained on VOC+COCO using CRF.
  • Applied to a 512x512 image.
FCN8s — 512x512 — trained with CRF post-processing

Observation: improving the training method of the FCN-8s very significantly improves its accuracy on this test image.

Impact of input image size

  • FCN-8s trained on VOC+COCO using CRF.
  • Applied to images of width 256, 384 and 512.
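For reference, resizing the test image to those widths can be sketched with a minimal nearest-neighbour resize (standing in for the image library one would normally use to prepare the inputs):

```python
import numpy as np

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize: pick, for each output pixel, the
    closest source row and column. A real pipeline would use a proper
    image library, but the shape bookkeeping is the same."""
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows][:, cols]

img = np.zeros((512, 512, 3), dtype=np.uint8)  # placeholder test image
for width in (256, 384, 512):
    print(resize_nearest(img, width, width).shape)
```

Since the FCN-Xs stack is fully convolutional, the same weights can be applied at each of these resolutions without any architectural change.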

References and source code

  • Paper 1: “Fully Convolutional Networks for Semantic Segmentation”, Jonathan Long, Evan Shelhamer and Trevor Darrell, CVPR, 2015.
  • Paper 2: “Conditional Random Fields as Recurrent Neural Networks”, Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr, ICCV, 2015.
  • Source code available in this GitHub project.

One last word

If you enjoyed reading this article, you can improve its visibility with the little green heart button below. Thanks!
