Isolating Food in Images

by way of semantic image segmentation, using convolutional neural networks

Siemen Derycke
jstack.eu
11 min read · Jan 20, 2020


Image generated using TensorFlow DeepLab

Recently, I developed a proof-of-concept application that lets users snap a picture of their meal and see what it would look like if it were displayed on a different plate.

Imagine you made the perfect hotdog and want to know what the ultimate plate would be to serve the hotdog on. Sure, you could buy all the candidate plates, make a dozen hotdogs, display them on the plates and make a comparison, but this seems like too much of a hassle. You could snap a picture of your hotdog and photoshop it into pictures of candidate plates, but who has time for that? Now imagine you could just snap a picture of your hotdog with your smartphone and find out what it would look like when displayed on a whole catalog of plates that are available for purchase right from the app. Wouldn’t that make things easier?

In the following, I will describe the main challenge of creating this application and how I tried to tackle it, first with a 'classical' computer vision solution and then with a deep learning solution.

Challenge

The main challenge of this project was to create a piece of software that could remove the ‘background’ from images of food with as little required human interaction as possible. Ideally, you just give it a picture of your meal and it returns an image containing nothing but the food from the original picture.

To perform this task, the program should be able to perform 2 steps in sequence:

  • Find the food in the original image
  • Remove everything but the food from the original image

Performing the second step is trivial if the result of the first step is a bitmap that contains, for every pixel of the original image, a set bit (1) if food is present in that pixel and an unset bit (0) if it is not. Finding the food in the image will therefore be the focus of this post.
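To make that second step concrete, here is a minimal sketch of how such a mask could be applied to the original image, assuming OpenCV and using hypothetical file names:

```python
import cv2

# Hypothetical file names: any photo of a meal and a matching binary mask will do.
image = cv2.imread("meal.jpg")                             # original photo (H x W x 3)
mask = cv2.imread("food_mask.png", cv2.IMREAD_GRAYSCALE)   # non-zero = food, 0 = background

# Keep only the pixels where the mask is set and blank out everything else.
food_only = cv2.bitwise_and(image, image, mask=mask)
cv2.imwrite("food_only.jpg", food_only)
```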

I made 2 attempts to perform this first step: one by using classical computer vision techniques and one by using deep learning. These attempts are described below.

‘Classical’ Computer Vision

Instead of immediately jumping to sophisticated deep neural network architectures, as seems to be the trend in AI these days, I decided to try some classical Computer Vision techniques first. I am, however, not very familiar with these techniques so feel free to make suggestions on how I could have done this better in the comments.

I was able to achieve an acceptable result by composing a pipeline that performs the following steps (a code sketch of the whole pipeline follows the figure below):

  1. Resize the image to 500 by 500 pixels
  2. Perform a median blur with a window of 5 by 5 pixels
  3. Perform k-means clustering on the colors of the image (with k=7)
  4. Convert the image to gray-scale
  5. Perform a median blur with a window of 9 by 9 pixels
  6. Perform k-means clustering on the colors of the image (with k=4)
  7. Perform a binary threshold filter on the image with a threshold of 200
  8. Find the first blob of 0's by searching breadth-first from the center of the image, flood-fill it with 1's onto the mask bitmap and set the rest of the bitmap to 0
The output of the 8 steps described above, from top-left to bottom-right
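Below is a minimal OpenCV/NumPy reconstruction of this pipeline based on the steps above. It is a sketch rather than the original implementation: the file name is hypothetical, the clustering steps are implemented as k-means color quantization, and step 8 is simplified by assuming the center pixel itself belongs to the food blob.

```python
import cv2
import numpy as np


def quantize_colors(img, k):
    """Cluster the pixel values into k groups (k-means color quantization)."""
    channels = img.shape[2] if img.ndim == 3 else 1
    data = img.reshape((-1, channels)).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(data, k, None, criteria, 3, cv2.KMEANS_RANDOM_CENTERS)
    return centers[labels.flatten()].astype(np.uint8).reshape(img.shape)


img = cv2.imread("meal.jpg")                       # hypothetical input image
img = cv2.resize(img, (500, 500))                  # 1. resize to 500 by 500 pixels
img = cv2.medianBlur(img, 5)                       # 2. median blur, 5x5 window
img = quantize_colors(img, 7)                      # 3. color clustering, k=7
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # 4. convert to gray-scale
gray = cv2.medianBlur(gray, 9)                     # 5. median blur, 9x9 window
gray = quantize_colors(gray, 4)                    # 6. clustering again, k=4
_, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)  # 7. threshold at 200

# 8. Flood-fill the blob of 0's at the center of the image onto a fresh mask
#    (simplified: assumes the center pixel is part of the food blob).
mask = np.zeros((502, 502), np.uint8)              # floodFill wants a 2-pixel-larger mask
cv2.floodFill(binary, mask, (250, 250), 255)
food_mask = mask[1:-1, 1:-1]                       # 1 where the center blob was, 0 elsewhere
```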

I added the blurring steps in an attempt to smear out highlights and shadows in the images, so that the k-means clustering algorithm would have a lower probability of clustering the highlights together with light-colored background objects, like plates. The downside to this step is that it also blurs out the edges of the objects that I want to segment.

I also experimented with developing a custom similarity metric that takes the proximity of two pixels into account on top of their colors, and tried using this new metric in the clustering algorithm. My idea was that this metric would help cluster areas within an object together even when they have slightly different colors, due to lighting for example.

I ended up dropping this idea, since the new metric created a lot of unnecessary clusters within background objects. This is because background objects, like tables or plates, usually take up a large area of the image. This leads to situations where, for example, the pixels in the top right and bottom left of the image show different parts of the same table with the same color, but the clustering, using this new metric, assigns them to different clusters simply because they are too far apart.
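For illustration, a metric of the kind described above could look like the following sketch, where the weighting between color and spatial distance is an arbitrary knob:

```python
import numpy as np

def color_and_position_distance(pixel_a, pixel_b, spatial_weight=0.5):
    """Distance between two pixels that mixes their color difference with
    how far apart they are in the image. Each pixel is (row, col, b, g, r).
    spatial_weight is arbitrary: 0 gives a pure color distance, larger
    values make far-apart pixels look increasingly dissimilar."""
    pos_a, color_a = np.asarray(pixel_a[:2], float), np.asarray(pixel_a[2:], float)
    pos_b, color_b = np.asarray(pixel_b[:2], float), np.asarray(pixel_b[2:], float)
    return np.linalg.norm(color_a - color_b) + spatial_weight * np.linalg.norm(pos_a - pos_b)
```

The spatial term grows with the size of the image, which is exactly what causes the table-corner problem described above.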

Overall, this Computer Vision pipeline approach seems to work, but it is very restrictive. The implementation makes a couple of assumptions:

  • Food is always at the center of the image
  • The background surrounding the food is always significantly lighter in color than the food (no black plates allowed)
  • The food itself can never be very light in color (due to the binary threshold filter)
  • Food can never spill over its container (due to the flood-fill step)
Example results using the described pipeline

The advantages of this approach are that we know exactly what actions the software performs, we can easily tweak it further if necessary, and we need very little, if any, data to create it. The disadvantages, however, namely that it is simply not very accurate and much too restrictive, outweigh the advantages for this proof-of-concept application, and I could not immediately find a way to fix these issues. This is why I ended up switching to a deep learning approach.

Deep Learning Background

In the following, I will give a very high-level explanation of some background that is needed to understand the rest of this post. You can skip to the 'Semantic Image Segmentation using keras_segmentation' section if you are very familiar with deep learning and transfer learning.

Artificial Neural Networks

Artificial neural networks are universal function approximators and should, in theory, be able to learn any function, given that the network is large enough and enough training data is provided. It should therefore be possible to train a network to recognize food in a picture. These networks are divided into layers that extract and combine features from their input values by multiplying them by learned weights and passing the resulting output forward through the layers.

The process of adjusting these weights in order to reduce the error of the network, and thus better approximate the target function, is what’s called ‘training’ the network.

The function that needs to be approximated for our proof-of-concept is one that takes an image as input and returns a mask containing the regions in the image that show food.

Convolutional Neural Networks

Convolutional neural networks (CNNs), a type of artificial neural network, are often used as image-processing deep learning models. This is because convolutional layers contain 'filters', as opposed to the neurons contained by regular feed-forward layers. These filters are sets of weights that are shared across the entire input of the layer. CNNs can achieve results similar to feed-forward neural networks (FFNNs) with far fewer weights, since every neuron in an FFNN assigns a unique weight to every one of its input values. The filters effectively allow the convolutional layers of a CNN to learn how to recognize patterns in their inputs regardless of where within the input the pattern occurs. This is something an FFNN cannot do as efficiently.

A graphical representation of a typical FFNN
A graphical representation of a typical CNN
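To make the weight-sharing argument concrete, here is a small Keras sketch (layer sizes chosen arbitrarily) that compares the parameter counts of a single dense layer and a single convolutional layer applied to the same input:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A 128x128 RGB input processed by a single dense layer with 64 neurons:
# every neuron gets its own weight for every input value.
dense_model = keras.Sequential([
    layers.Flatten(input_shape=(128, 128, 3)),
    layers.Dense(64),
])

# The same input processed by a single convolutional layer with 64 filters:
# each 3x3 filter is shared across the whole image.
conv_model = keras.Sequential([
    layers.Conv2D(64, kernel_size=3, input_shape=(128, 128, 3)),
])

print(dense_model.count_params())  # 128*128*3*64 + 64 = about 3.1 million weights
print(conv_model.count_params())   # 3*3*3*64 + 64     = 1,792 weights
```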

Semantic Image Segmentation

The specific task that needs to be performed is called semantic image segmentation, which means that the network should split the input image into regions that contain semantically similar data. Another way to look at semantic image segmentation is to interpret it as a pixel-by-pixel classification task. In this case, the network needs to distinguish two regions of the image, which is the same as assigning one of two labels to every pixel: food on the one hand, and background or non-food on the other.

Semantic image segmentation using CNNs is usually done by convolving the image down to a lower resolution that contains only the necessary semantic information, and then 'deconvolving' the image back up to its original resolution. There are many different network architectures that perform some variation of this process. Further down in this post I mention which specific network I ended up using and provide a link to its specification.

A graphical representation of a fully convolutional segmentation network
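As an illustration of the convolve-down and deconvolve-up idea, and not of the specific architecture used later in this post, a minimal fully convolutional segmentation model in Keras could look like this:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 2  # food vs. non-food

inputs = keras.Input(shape=(128, 128, 3))

# Encoder: convolve the image down to a lower-resolution feature map.
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)                        # 64 x 64
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)                        # 32 x 32

# Decoder: 'deconvolve' back up to the original resolution.
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)  # 64 x 64
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)  # 128 x 128

# One class probability per pixel: the output has the input's resolution.
outputs = layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```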

Labeled Data

Finding a readily available dataset of food images with pixel-by-pixel class labeling proved to be unsurprisingly difficult. This is an issue, because semantic image segmentation models need a large amount of training data in order to perform well. Manually creating a dataset, by collecting images of food from the web and drawing a black-and-white mask layer over them, is doable with the right tools, but is very tedious and should be avoided if you value your time.

Transfer Learning

You can get away with only labeling a relatively small dataset if you can find a model that is already trained to solve a similar problem, instead of training a model from scratch. You could take this model, keep the weights of its earlier layers (which are performing general feature extraction) and retrain the later layers (which are more involved in high-level processes that are specific to the problem that the network was trained for). This process is called transfer learning.

You could greatly reduce the amount of necessary training data if you could, for example, do the following (sketched in code after the list):

  • Take a network that was trained to recognize people in pictures
  • Keep all the earlier layers that are presumably involved in recognizing shapes and how those shapes can combine into more complex structures
  • Discard the later layers that are most likely responsible for recognizing structures that are specific to human bodies
  • Retrain these last layers into layers that recognize structures that are specific to food
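In generic Keras terms, and assuming a hypothetical pre-trained person-segmentation model saved as person_segmentation_model.h5, the idea could be sketched like this (how many layers to freeze is an arbitrary choice that depends on the model):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical pre-trained network; in practice you would load whatever model you found.
base_model = keras.models.load_model("person_segmentation_model.h5")

# Keep the earlier, general feature-extraction layers and freeze their weights.
for layer in base_model.layers[:-3]:
    layer.trainable = False

# Replace the last, task-specific layer with a fresh head for food vs. non-food.
x = base_model.layers[-2].output
new_output = layers.Conv2D(2, 1, activation="softmax", name="food_head")(x)
food_model = keras.Model(base_model.input, new_output)

# Only the unfrozen layers and the new head are updated during training.
food_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```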

Semantic Image Segmentation using keras_segmentation

When looking for pre-trained image segmentation networks to use for transfer learning, I came across the keras_segmentation Python package on GitHub. The package provides some of the most popular segmentation network architectures along with methods to easily train and evaluate them. It is a very useful package and I highly recommend it to anyone who wants to try out image segmentation without having to write a lot of the boilerplate code involved in such a process. Most importantly for the purposes of a proof-of-concept application, the package comes with a couple of pre-trained networks along with methods to transfer weights and perform transfer learning. I tested a couple of the pre-trained networks and settled on the pspnet implementation that was trained on the ADE20k dataset. The paper that describes pspnet is very technical, but is a good read if you're interested.

Segmentation examples of the pre-trained networks, provided by keras_segmentation
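If memory serves, trying out one of these pre-trained networks takes only a few lines; the input file name below is hypothetical and the package's API may have changed since this was written:

```python
# pip install keras-segmentation
from keras_segmentation.pretrained import pspnet_50_ADE_20K

# Download and build the pspnet model pre-trained on the ADE20k dataset.
model = pspnet_50_ADE_20K()

# Segment an arbitrary photo and write the colored label map to disk.
model.predict_segmentation(
    inp="meal.jpg",
    out_fname="meal_segmented.png",
)
```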

Process and Results

I created a dataset of 58 food images and manually created the desired masks for them. 52 of these images served as training data, while the rest served as validation data.

Keras_segmentation comes with a built-in data augmenter that performs a large number of different, randomized operations on the training and validation data. These operations include, but are not limited to, shearing, rotating, flipping, scaling, blurring and adding noise. Performing these operations makes each image that is fed to the network unique, albeit very similar to every other training example, and helps combat the network's tendency to over-fit.

I instantiated a new pspnet, transferred the appropriate weights from the pre-trained network and set up the training using the methods provided by keras_segmentation.
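In code, this roughly follows the pattern from the package's documentation; the directory names are hypothetical and the exact keyword arguments may differ between package versions:

```python
from keras_segmentation.pretrained import pspnet_50_ADE_20K
from keras_segmentation.models.pspnet import pspnet_50
from keras_segmentation.models.model_utils import transfer_weights

# The pre-trained network and a fresh pspnet with 2 output classes (food / non-food).
pretrained_model = pspnet_50_ADE_20K()
new_model = pspnet_50(n_classes=2)

# Copy the weights of all compatible layers from the pre-trained network.
transfer_weights(pretrained_model, new_model)

# Train on the hand-labeled dataset with the built-in augmenter enabled.
new_model.train(
    train_images="dataset/images_train/",
    train_annotations="dataset/annotations_train/",
    checkpoints_path="checkpoints/pspnet_food",
    epochs=15,
    batch_size=16,
    steps_per_epoch=64,
    do_augment=True,
)
```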

After training the network for 15 epochs with 64 batches of 16 augmented training examples per epoch, the network was able to achieve a segmentation accuracy of 0.97 on the validation data, meaning that 97% of the pixels of the validation images were classified correctly.

Segmentation validation examples
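Concretely, that accuracy figure is simply the fraction of pixels whose predicted label matches the hand-drawn mask, which is easy to compute with NumPy (the label arrays below are hypothetical):

```python
import numpy as np

# Predicted and ground-truth label maps of the same shape,
# e.g. 0 = background and 1 = food for every pixel (hypothetical files).
predicted = np.load("predicted_labels.npy")
ground_truth = np.load("ground_truth_labels.npy")

pixel_accuracy = np.mean(predicted == ground_truth)
print(f"pixel accuracy: {pixel_accuracy:.2f}")  # e.g. 0.97
```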

While these results are by no means perfect, they give a good indication of what the network could be capable of doing if it were further fine-tuned and more training data were available.

There are a couple of disadvantages to using a deep learning approach over a classical computer vision pipeline. A neural network cannot be tweaked in the same way that the earlier proposed pipeline can: you simply can't hope to make sense of the 46,765,250 parameters of this network and manually tweak them to improve its output. It is for this reason that artificial neural networks are considered black boxes. If you want to make changes to the model, you'll have to retrain it. Another apparent disadvantage is that the edges of the masks this model produces are noticeably less accurate than those of the computer vision pipeline we defined earlier, although this could probably be improved further in the future.

A major advantage of this approach, however, is how few restrictions there are on the input images. Because of the data augmentation that was performed on the training data, and because of the way convolutional filters can recognize patterns anywhere in the data, this network can recognize food anywhere in a picture.

This is the network that ended up being implemented in the proof-of-concept project.

Now that the app can successfully remove the background from images of food, we can finally find that perfect hotdog plate that we’ve been working towards this whole time.

Conclusion

From what I have learned, I conclude that it definitely seems possible to perform semantic segmentation on images of food. To achieve better results in the future however, I suggest looking further into different existing models to perform transfer learning, creating a larger training dataset and exploring the training hyper-parameters more rigorously. I also suggest experimenting with adding extra classes to the segmentation labels. Such an extra class could, for example, be used to label all the pixels that show the ‘container’ of the food, like plates and bowls.
