Test-drive your neural network, or how to annotate and train a neural network with your own data in twelve minutes.

Artwork by John Kenn Mortensen, which I think encapsulates the essence of training neural networks.

Roughly two years ago I started digging into neural networks (supervised learning), and I must admit I was surprised by the steep learning curve. Coming from a background in game-related software engineering, I was used to being able to whip up a decent prototype over a game-jam weekend. After I started using neural networks my “productivity” dropped drastically, as the workflows and challenges are very different and extremely time-consuming.

From the online material I have found and followed, my understanding is that working with neural networks is usually presented as:

  1. Find a dataset
  2. Choose layers, loss function and optimizer.
  3. Tweak hyperparameters (layer dimensions, learning rate, regularization, initialization etc.) while training. Endless fun here.
  4. When running out of time, celebrate the best convergence graph and move on.

It might be because I have a background in games and interaction, where I am used to exploring prototypes (think dynamical modelling of something like a boat), that I find this a bit like building cars you don’t test-drive. I believe this disconnect from having the neural network embedded in an interactive application is blocking us from gaining valuable intuition about its behavior and its improvements, an insight different from only watching a convergence graph.

I also believe that application implementation is underrated in the neural network material I have encountered, as most of my “breakthroughs” have happened after studying others’ work, which revealed new tools, libraries and workflows and allowed implementation details of my own to fall into place.

This article is an attempt to share a project and hopefully spark some discussion about how one might use interactive approaches with neural networks in general.


First a demo video.

Interactive Segmentation Demo

In this project I create annotated data for the network on the fly. I have a couple of reasons for wanting to try this out.

  • I find it difficult to estimate how much data is needed to make a neural network solve a certain task. Are 100, 1000 or 10000 annotated samples needed? No one can say, as the complexity of the task and many other factors have an effect. Building the dataset on the fly is a practical way of exploring this.
  • I would like to be able to interactively explore the network’s behavior and find weak spots.
  • I would like to be able to focus annotations where the network needs correction.

I chose webcam input as the data source, as it is my best available source of rich data. Using the webcam it is easy to dig into weaknesses and try things out interactively. I use a modest input resolution (160x120) to keep the application somewhat responsive. I use Pygame’s webcam implementation because it is easy and smooth.
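In sketch form, grabbing frames looks roughly like this (the device selection, normalization and exact shapes here are illustrative, not copied from the project source):

import numpy as np
import pygame
import pygame.camera
import pygame.surfarray

pygame.init()
pygame.camera.init()

# Open the first available camera at the modest 160x120 resolution.
cam = pygame.camera.Camera(pygame.camera.list_cameras()[0], (160, 120))
cam.start()

def grab_frame():
    # Return the current webcam frame as a (120, 160, 3) float32 RGB array in [0, 1].
    surface = cam.get_image()                  # pygame.Surface, 160x120
    frame = pygame.surfarray.array3d(surface)  # (width, height, 3) uint8
    frame = np.transpose(frame, (1, 0, 2))     # -> (height, width, 3)
    return frame.astype(np.float32) / 255.0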

The output of the network is a single-channel image segmenting the scene into foreground and background.

In my experience, deciding on a network architecture is close to black magic, so I tend to set up something relatively simple that just lets me prove a point.

In this case I set up a sandwich of seven convolutional layers that takes the RGB image as input and ends in a one-channel image of the same size as the input image. Each convolution uses a 3x3 kernel, a dilation rate of two and, for all but the last layer, a ReLU activation function. The output pixel values are floats determining whether a pixel is part of the foreground or the background; values above zero are considered foreground.
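A rough sketch of such a stack in Tensorflow 1.3 style could look like this (the filter counts and names are illustrative; only the depth, kernel size, dilation rate and activations are fixed above):

import tensorflow as tf

def build_network(x):
    # x: (batch, 120, 160, 3) float input, returns a (batch, 120, 160, 1) response.
    net = x
    for i in range(6):
        net = tf.layers.conv2d(net, filters=16, kernel_size=3, padding='same',
                               dilation_rate=2, activation=tf.nn.relu, name='conv%d' % i)
    # Last layer: one channel, no activation, so the response can go below zero (background).
    return tf.layers.conv2d(net, filters=1, kernel_size=3, padding='same',
                            dilation_rate=2, name='conv_out')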

For display purposes I create a synthetic image in Tensorflow where the grayscaled input image is put in the green channel, while the red and blue channels describe the foreground and background segmentation prediction returned from the network. It is essential to have an intuitive way of interpreting the network response.
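As a sketch (the exact scaling and clipping are illustrative, and x is assumed to be in [0, 1]), the compositing can be expressed directly in the graph:

def build_display_image(x, response):
    # Grayscale input in the green channel, predicted foreground in red,
    # predicted background in blue.
    gray = tf.image.rgb_to_grayscale(x)         # (batch, H, W, 1)
    fg = tf.clip_by_value(response, 0.0, 1.0)   # positive response -> foreground
    bg = tf.clip_by_value(-response, 0.0, 1.0)  # negative response -> background
    return tf.concat([fg, gray, bg], axis=3)    # channel order R, G, B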


When training the network, a loss function is needed for the back-propagation algorithm to run.

One problem that stalled me for a while was that I was required to supply fully segmented images before I could start training my network. This is expensive, as it requires deciding on every pixel in the image before training can begin. I was therefore searching for ways to cut down this cost and thinking, “What if I could focus my annotations on only part of the image?”

Conceptual annotation

Something along the lines of this, where the red overlay indicates foreground, the blue overlay indicates background and the rest is undecided.

This had me thinking for a bit, and then I came up with a simple solution: use +1.0 as the foreground target value and -1.0 as the background target value, while leaving all undecided areas at zero. This way the loss function (I use mean squared error, as it is easy for me to understand) can be adapted to express loss only where annotation is available. This is achieved by multiplying the per-pixel loss with the absolute value of the target, which is 1 where annotations are given and zero elsewhere.

# |y| is 1 on annotated pixels and 0 on undecided pixels, so it masks the squared error
# and the division averages only over the annotated pixels.
absy = tf.abs(y)
loss = tf.reduce_sum(tf.square(response - y) * absy) / tf.reduce_sum(absy)

And voilà the network can train against partly annotated data. :)


This idea of saving annotation effort evolved further into the question, “What if I could focus my annotations only on the parts of the image that my network classifies wrong?” This could be accomplished by letting the network classify every image and showing the result to the user. It would then be easy for an annotator to only make annotations in areas where the network is wrong, focusing annotations on the network’s weak spots.


I create annotations by letting the user point with the mouse and indicate, with the left or right mouse button, whether an area is part of the foreground or the background. By displaying the network’s current segmentation in the image, it is easy for the operator to direct attention to the areas where they disagree with the network’s response. Each click results in an annotation consisting of an input image and a corresponding target image where only a part of the pixels are decided to be foreground or background (1 or -1). The size of the annotated area can be adjusted using the scroll wheel.
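A simplified sketch of the annotation painting (the names, the square brush and the 1:1 mapping from mouse position to image coordinates are illustrative assumptions):

import numpy as np
import pygame

H, W = 120, 160
brush = 8  # half-size of the annotated square, adjusted with the scroll wheel

def annotate(target, event):
    # Paint a square of +1 (left button, foreground) or -1 (right button, background)
    # into the sparse target image; everything else stays 0 (undecided).
    x, y = event.pos
    value = 1.0 if event.button == 1 else -1.0
    target[max(0, y - brush):min(H, y + brush), max(0, x - brush):min(W, x + brush)] = value
    return target

# In the event loop, each click yields one (input image, sparse target) training pair:
# for event in pygame.event.get():
#     if event.type == pygame.MOUSEBUTTONDOWN and event.button in (1, 3):
#         annotations.append((current_frame, annotate(np.zeros((H, W), np.float32), event)))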

Having operated the system for a while, I quickly found that I made mistakes when annotating images, so I implemented an edit mode where one can cycle through past annotated images and make corrections. Note that the network response is displayed briefly while cycling through the images, or constantly if pressing x. I also found it a bit wasteful to lose the annotations on restart, so they are now written to disk for reuse.

To get a feel for the effect of the learning rate, it is made adjustable using the up/down arrow keys. A large learning rate makes the network update its behavior faster, often overshooting and making the network response oscillate.

To get a feel for how regularization affects training, it is also exposed using the left/right arrow keys. The regularization factor adds a penalty for the size of the weights in the network. The idea is that penalizing large weights should make the network generalize better. I have had a hard time seeing this effect in practice.
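One simple way to wire this up (a sketch; the optimizer choice and placeholder names are my own, not taken from the project) is to feed both values through placeholders, so the arrow keys can change them between training steps:

lr = tf.placeholder(tf.float32, [])   # adjusted with up/down arrow keys
reg = tf.placeholder(tf.float32, [])  # adjusted with left/right arrow keys

# L2 penalty on all trainable weights, added to the masked data loss from above.
l2 = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
total_loss = loss + reg * l2

train_op = tf.train.AdamOptimizer(lr).minimize(total_loss)
# sess.run(train_op, feed_dict={x: images, y: targets, lr: current_lr, reg: current_reg})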

The project was developed in an Arch Linux environment with an Nvidia 1060 GPU. I use Python 3.6.3, Tensorflow 1.3.0, pygame and friends through the Anaconda virtual environment.


Nose segmentation in action

Running the project has given me some insights into the network and its training.

  • It seems important for the data points to be somewhat balanced between the foreground and background classes. Too many background samples drag the network in an overly pessimistic or optimistic direction.
  • It seems to work well if annotations are alternately collected from foreground and background while choosing the areas where the network is most wrong, roughly nudging the network 5–10 samples at a time. Initially I trained on a single sample at a time, and then this alternation was very important. Once I expanded training to use mini-batching (see the sketch after this list), the alternation became less important.
  • It seems that the process of only annotating areas where the network is wrong is a very network-state-specific operation. If the network is retrained on the same annotations in a different order, it seems to give very different results. It feels like the interactive training is steering the network into place, sort of like steering a car, where the order of steering-wheel corrections matters.
  • Looking at the network response it is possible to get a feeling for the size of the receptive field of the network. It can sometimes be seen as zebra-like stripes in the beginning of the annotation process, especially when moving your hands in front of the camera. The receptive field is displayed as a rectangle in the GUI.
  • A hotkey ‘b’ is set up to collect an entire image and classify it as background. The intuition behind this was that I could physically remove the foreground feature from the scene and quickly generate a lot of background pixels. This seems to drag the whole network too violently, especially when not using mini-batching. My guess is that there needs to be a balance between the number of foreground and background pixels the network is trained with. Or perhaps I could choose a better loss function?
  • Training mode can be toggled. If training is enabled, the network trains on past collected annotations, allowing it to drift into a compromise between past collected samples. If disabled, it trains only on the new interactive annotations being made. This is useful when trimming the behavior of the network.
  • I have been surprised that one can achieve somewhat usable behavior with as few as 1100 very sparsely annotated images (roughly 25000 annotated pixels). That is much lower than I initially expected. But I must note that, had the network been trained against fully annotated images, I think much more “generality” would probably have been picked up by the network.
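The mini-batching mentioned above can be as simple as sampling randomly from the stored annotations (a sketch with assumed shapes and names):

import numpy as np

def sample_minibatch(inputs, targets, batch_size=8):
    # Draw a random mini-batch from the stored annotations.
    # inputs: (N, 120, 160, 3), targets: (N, 120, 160, 1)
    idx = np.random.randint(0, len(inputs), size=batch_size)
    return inputs[idx], targets[idx]

# With training mode enabled, each step also nudges the network with past annotations:
# batch_x, batch_y = sample_minibatch(stored_inputs, stored_targets)
# sess.run(train_op, feed_dict={x: batch_x, y: batch_y, lr: current_lr, reg: current_reg})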

This is a toy example of using an extremely small dataset to train a neural network. It will most probably misbehave if exposed to different conditions (lighting, weather, people etc.), so don’t put your next minimally interactive neural network prototype into the driver’s seat just yet.

The full source is available here.

Happy hacking and please comment.

Sincerely

Jesper Taxbøl