U-Net: Convolutional Networks for Biomedical Image Segmentation

Khushbu Shah
Published in ProjectPro
Aug 25, 2021 · 14 min read

Have you ever wondered how your phone unlocks with your face in less than a few seconds?

How do autonomous cars navigate their way through the road without hitting other objects?

How does the traffic control system monitor any traffic violations?

How are unattended bags monitored at airports for security concerns?

How do doctors locate tumors and other anomalies?

How do defense systems use satellite imagery to target the perpetrators?

Image segmentation makes all these systems work; it has many applications in our daily lives. Image segmentation is a powerful artificial intelligence technique that helps in object detection, background and foreground segregation, object tracking, object analysis, anomaly detection, medical image processing, and much more. It divides an image into small fragments based on specific parameters that characterize the chosen area of the image, e.g., threshold, clusters, motion, contour, or edges. A model is then trained on these segmented images to pick up cues from the data for more robust and accurate segmentation.

What is Image Segmentation?

Image segmentation extracts valuable information from visual data, using highly specialized machine learning algorithms and architectures to make sense of the data and to identify, distinguish, sort, make predictions about, or replicate it. You might have seen those famous deepfake videos; they use a GAN architecture, but image segmentation is at their core.

Image segmentation helps identify, distinguish, sort, and derive meaningful information from any visual data, e.g., images and videos. It involves fragmenting visual data into demarcated fragments representing suitable classes or objects in order to infer meaningful insights. People often confuse image segmentation with image classification, but the latter merely identifies and classifies whole images, e.g., recognizing tables, a specific car, or a person. In contrast, image segmentation involves detecting, classifying, sorting, and refining visual data and drawing insights from it, with applications such as tumor and cancer detection, object detection, traffic control systems, video surveillance, and biomedical imaging. The majority of computer vision projects involve image segmentation as a first step.

The most straightforward approach to image segmentation first describes each pixel via three dimensions: its Red, Green, and Blue (RGB) color values, which represent the intensity of each color at that specific pixel. Each pixel is labeled individually, and based on the specified classes we can segment the image by clustering similar pixel values that represent a specific class. In this case the number of dimensions is only three, but we can also cluster and segment images based on other features like texture, depth, and intensity to get the desired results.
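To make this concrete, here is a minimal sketch of clustering pixels by their RGB values with a tiny k-means loop. The function name, the deterministic center initialization, and the two-cluster choice are all illustrative assumptions, not a production implementation:

```python
import numpy as np

def kmeans_segment(pixels, k=2, iters=10):
    """Cluster pixels (N x 3 rows of RGB values) into k groups of similar color."""
    # deterministic initialization: spread the starting centers
    # evenly across the pixel list (an illustrative choice)
    centers = pixels[np.linspace(0, len(pixels) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        # assign every pixel to its nearest center in RGB space
        dists = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of the pixels assigned to it
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pixels[labels == c].mean(axis=0)
    return labels

# four pixels: two dark, two bright -> two clear color clusters
pixels = np.array([[10, 10, 10], [12, 11, 9],
                   [240, 250, 245], [255, 255, 250]], dtype=float)
labels = kmeans_segment(pixels, k=2)
```

Clustering on texture, depth, or intensity works the same way: those features simply become extra columns alongside the three color dimensions.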

Figure 1 Image Segmentation of a highway. Source: Jeong, Jongmin & Yoon, Tae & Park, Jin. (2018). Towards a Meaningful 3D Map Using a 3D Lidar and a Camera. Sensors. 18. 2571. 10.3390/s18082571.

The Need for Image Segmentation

Computer vision is a rapidly developing field of artificial intelligence, and at the heart of the applications it offers sits a crucial component: image segmentation. Modern state-of-the-art computer vision architectures like GAN and CycleGAN build on the basic principles of image segmentation. Image segmentation offers applications in sectors such as:

  1. Defense: Satellite image analysis for identifying enemy posts and possible threats.
  2. Healthcare: Image segmentation finds excellent use in detecting and segregating malignant and benign tumors and in tracking how diseases progress, helping medical practitioners provide better treatment and prevent loss of life.
  3. Traffic control systems: Identifying traffic violators and preventing major accidents.
  4. Marketing: Understanding customer shopping patterns and behaviors.
  5. Autonomous Vehicles and Machines: Autonomous cars, drones, VTOLs, etc.
  6. Security Surveillance Systems: Identifying and preventing attacks and threats, e.g., monitoring for speeding vehicles and addressing them at the right moment to prevent road accidents.
  7. Geo-spatial analysis: Analysing satellite imagery for geographical and environmental analysis.
  8. Climate Change: Vast amounts of satellite images are analyzed to understand environmental degradation due to climate change, aiding problem identification, solutions, and prevention.

Image segmentation is a powerful object detection and segregation technique, and all of the above applications employ it at multiple levels. Apart from recreational uses like Snapchat filters, it can save lives and help humanity progress toward a safer, more sustainable future.

Understanding Image Segmentation

We can understand image segmentation through a relatable and intuitive example. Close your eyes for two seconds, then look at the image shown below.

Figure 2 A woman and several dogs playing. Source: Photo by Anna Dudkova on Unsplash

What did you infer at first when you saw this image?

First of all, you must have identified that there are four living beings in the picture. Then you identified that a woman and three dogs are in the image.

Then you identified that the woman is caressing the three dogs.

They are outdoors, in a park.

It is most probably the autumn season because the leaves have withered and fallen off.

All of this analysis by your brain happened in steps, and it happened so fast that you are not consciously aware of it; it didn't feel like much of a task because the brain is a very fast machine with advanced computational abilities. You observed the image, segmented the classes, classified specific objects, made predictions, and inferred useful information from the image. This is exactly what we train computers to do in image segmentation: to break down the problem, and the image, not just into pixels but into segments of objects, classes, or instances.


Image Segmentation Methods

To prepare a machine to identify or classify an image, we simplify the task by breaking the process into steps based on classes, instances, or pixels, depending on the type of application. For example, if someone asks what you see in the image, you will reply: three dogs and a woman. You have segmented the dogs into one category of living beings even though they might be different breeds. This is an example of semantic segmentation.

In another case, if I asked which dog is "Bruno," and you already know who Bruno is, you would specifically identify and point at him out of the three dogs. This specific identification and recognition is a type of instance segmentation.

The most basic way of segmenting an image is to first fragment it into individual pixels, each with its associated RGB or grayscale values. Based on the application, threshold values on the RGB or grayscale version of the image are selected; e.g., if you want to identify the sky and the grass in an image, you select the most appropriate values representing each class. There can be two, three, or more thresholds; pixel values near a given threshold are ascribed to that class and shown in the output image with a similar hue, thus classifying a particular area of the image. Other segmentation methods, such as region-based segmentation and edge detection segmentation, separate images based on edges, pixel properties, and clusters of similar pixels.
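As a quick sketch of the threshold idea on a grayscale image (the threshold value of 128 and the function name are arbitrary illustrations):

```python
import numpy as np

def threshold_segment(gray, thresh=128):
    """Label every pixel above the threshold as class 1, the rest as 0."""
    return (gray > thresh).astype(np.uint8)

# bright pixels (e.g., sky) -> 1, dark pixels (e.g., ground) -> 0
gray = np.array([[200, 30],
                 [220, 10]])
mask = threshold_segment(gray)  # -> [[1, 0], [1, 0]]
```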

The approach taken for image segmentation has piqued the interest of many researchers and is an active research field. Different architectures and approaches can be used to segment an image, e.g., the threshold method, edge segmentation, region segmentation, and deep neural network-based segmentation.

Figure 3 Types of ways to segment an image. Source: Introduction to the Artificial Intelligence and Computer Vision revolution, Scientific seminar at Politecnico di Milano, Italy 2017 (https://www.slideshare.net/darian_f/introduction-to-the-artificial-intelligence-and-computer-vision-revolution)

U-NET for Image Segmentation

U-Net is one of the most famous image segmentation architectures, proposed in 2015 by Olaf Ronneberger, Philipp Fischer, and Thomas Brox (University of Freiburg, Germany). It is an end-to-end segmentation technique: it takes a raw image as input and outputs a segmentation map of the image. It is a deep convolutional network architecture specially designed for the segmentation of biomedical imaging applications.

Challenges of Biomedical Imaging Applications

Before diving deeper into the U-Net architecture, let's look briefly at the main issues with biomedical imaging to understand the motivation behind the development of this architecture. Challenges:

1. Lack of High-quality data.

2. Lack of large data sets.

3. Separating touching objects of the same class with little to no visible edge between them.

4. Complex Image textures.

U-Net provides advantages in tackling some of these challenges that biomedical image processing presents. To understand how U-Net is advantageous, let’s dive deeper into its architecture.

U-Net Architecture

Figure 4 U-Net Architecture. Source: Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image Computing and Computer-Assisted Intervention — MI

The U-Net architecture is a U-shaped, symmetric convolutional network with a down-sampling contraction path and an up-sampling expansion path. The resulting segmented output image is much smaller in size than the raw input image.

Before we delve deeper into the architecture, let’s refresh some of the important terms and concepts of convolutional networks which will be touched upon shortly.

1. Kernel Size: The field of view over which the image is convolved; essentially the size of the filter mask.

2. Max-pooling: Reduces the dimensions of the feature maps in order to retain the most heavily weighted features, which are easier to process and require less computation power for parameter learning.

3. Stride: The step size the kernel takes when moving over the image.

4. Padding: Governs how the area around the image to be convolved is handled, extending the image so the resolution of the output is not lost. With 'same' padding, the output image has the same size as the input.
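The max-pooling operation defined above can be sketched in a few lines of NumPy (the 4x4 example array is purely illustrative):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2: keep the largest value per window."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(x)  # -> [[5, 7], [13, 15]]
```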

Methodology

The architecture consists of a large number of different operations, illustrated by the arrows in the architecture diagram. The input image is fed into the network, the data is propagated along all possible paths, and at the end the final segmentation map comes out. Each blue box corresponds to a multi-channel feature map; the number on top of a box denotes the number of feature channels, and the size is denoted at the bottom. Most of the operations are convolutions followed by a non-linear activation function.

The convolution operations performed are:

• Convolution 3x3, with ReLU activation function.

• Max Pooling 2x2 with a stride of 2

• 2x2 Upsampling Convolutional Operation

• 1x1 Convolutional Operation

Steps:

1. A standard 3x3 convolution followed by a non-linear activation function. Only the valid part of the convolution is used, which is why each 3x3 convolution loses a one-pixel border. This allows processing large images in individual tiles. This layer extracts characteristics, i.e., meaningful information, from the data.

2. Max-pooling operation. Reduces the x-y size of the feature map by propagating the maximum activation from each 2x2 window to the next feature map. The resulting map has a factor-2 lower spatial resolution. Max-pooling retains the most heavily weighted features, which are easier to compute and require less computation power for parameter learning.

3. After each max-pooling operation, the number of feature channels is doubled.

4. Together, steps 2 and 3 form a sequence of convolutions and max-pooling operations that spatially contracts the image. This contraction portion helps us understand the "what" of the image, i.e., what the useful information in the image actually is, while spatial information is gradually lost.

5. The expansion path has a series of up-convolutions and concatenation with high-resolution features from the contraction path. The expansion path creates a high-resolution segmentation map.

6. 2x2 up-convolution: The up-convolution uses a learned kernel to map each feature vector to a 2x2 output window, followed by a non-linear activation function. The resulting maps have a factor-2 higher resolution, and the up-sampling allows more information to propagate to the high-resolution layers.

7. The series of convolutional operations is followed by concatenations via skip connections. The skip connections are essential: they help figure out the "where" of the image, providing valuable spatial and graphical information.

8. Finally, a 1x1 convolution operation yields the final segmented image. The output segmentation map has two channels, one for the foreground and one for the background class. Due to the unpadded convolutions, the output is smaller than the input image. The series of contraction and expansion paths in the U-Net architecture thus leverages the useful information of feature mapping and pooling while retaining spatial and graphical resolution, yielding a more robust image segmentation approach.
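The size bookkeeping in the steps above can be checked with a little arithmetic: each unpadded 3x3 convolution trims one pixel per border, each 2x2 max-pool halves the size, and each 2x2 up-convolution doubles it. A short sketch (assuming the paper's depth of four levels with two convolutions per level) reproduces the paper's 572x572 input and 388x388 output:

```python
def unet_output_size(size):
    """Trace the spatial size through a 4-level U-Net with valid convolutions."""
    for _ in range(4):       # contraction path
        size = size - 4      # two unpadded 3x3 convolutions: -2 pixels each
        size = size // 2     # 2x2 max-pooling with stride 2
    size = size - 4          # two bottleneck convolutions
    for _ in range(4):       # expansion path
        size = size * 2      # 2x2 up-convolution
        size = size - 4      # two unpadded 3x3 convolutions
    return size

print(unet_output_size(572))  # prints 388, matching the paper's figure
```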

Problems U-Net solves for and provides an advantage in:

1. Less training data: Low availability of biomedical imaging training data can hamper the robustness of a segmentation system. To tackle this and augment the training data, random elastic deformations are applied to the training images. The resulting images look almost exactly like the originals and are correctly classified even with the deformations applied. This is an important advantage of the U-Net architecture.

Figure 5 Results for Augmented training data using deformations. Source: Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image Computing and Computer-Assisted Intervention — MI

2. Touching objects: Touching objects of the same class have to be correctly separated. Background pixels are inserted between all touching objects, and an individual loss weight is assigned to every pixel. The weighted loss strongly penalizes failing to separate the background labels between touching cells.

Figure 6 Segmentation of images with fuzzy borders and low-contrast edges. Source: Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image Computing and
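The pixel weighting described above can be sketched from the paper's weight-map formula, w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2)), where d1 and d2 are the distances to the nearest and second-nearest cell border; w0 = 10 and sigma of about 5 pixels are the paper's values, while the function name is mine:

```python
import numpy as np

def unet_pixel_weight(d1, d2, w_class, w0=10.0, sigma=5.0):
    """Loss weight for one pixel: class-balancing term plus a term that
    grows large in the narrow background gaps between touching cells."""
    return w_class + w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))

# a pixel sitting right between two touching cells (d1 = d2 = 0)
# gets the maximum extra penalty: w_class + w0
```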

This U-Net architecture performed well on use cases like the segmentation of neuronal structures in electron microscopy, which is riddled with challenges like fuzzy membranes and structures with low contrast. The results were far better than those of the sliding-window convolutional network for image segmentation. The architecture also did well at segmenting cells with strong shape variations, weak outer borders, and similar structures, achieving an Intersection over Union (IoU), the evaluation metric used to measure the accuracy of an object detector on a particular dataset, of 92%, much higher than the second-best method's IoU of 83%.

So, by and large, this architecture is advantageous for image segmentation and can tackle challenges like low training data availability, touching and overlapping objects, partially invisible or fuzzy borders between objects, low-contrast edges, and objects with strong shape variations. Tackling these issues makes U-Net extremely advantageous for biomedical imaging applications, which are especially plagued by these problems. It performed exceedingly well with as few as around 30 annotated images by tapping into the robustness achieved from data augmentation with elastic deformations.

Implementing U-Net Architecture using TensorFlow:

Import:

First of all, we import the required packages and libraries, including TensorFlow's Keras API, a neural network library.

Contraction path:

Then we move on to coding the down-sampling path. We define the input layer and describe the image channels. We then start creating the convolutional network: we initialize the weights of the network and set the transformation we want to apply. Here, we use the ReLU activation function with a 'he' initializer, and the padding is kept 'same' so that the output of each layer has the same size as its input. The contraction path of U-Net is built from a series of such convolutional layers and max-pooling operations.

Expansive path:

This is the up-sampling path. We set our initializer and activation function, and then perform upscaling along with the 2x2 up-convolution operation. In up-sampling, the additional feature channels allow more information to propagate to the high-resolution layers, enabled by the skip connections. We concatenate the outputs of the down-sampling path into the up-sampling path via these skip connections and express this in code.

In the last convolutional layer, we apply a 1x1 convolution with a sigmoid activation function to yield the segmentation map.
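Putting the contraction and expansive paths together, here is a minimal two-level sketch in the Keras functional API. The article's original code listing is not reproduced here; the 128x128 input size and all layer widths are illustrative assumptions, and the paper's four levels are reduced to two for brevity:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input((128, 128, 3))

# contraction path: 3x3 convolutions (ReLU, 'he' initializer,
# 'same' padding) followed by 2x2 max-pooling
c1 = layers.Conv2D(16, 3, activation='relu', padding='same',
                   kernel_initializer='he_normal')(inputs)
p1 = layers.MaxPooling2D(2)(c1)
c2 = layers.Conv2D(32, 3, activation='relu', padding='same',
                   kernel_initializer='he_normal')(p1)
p2 = layers.MaxPooling2D(2)(c2)

# bottleneck
b = layers.Conv2D(64, 3, activation='relu', padding='same',
                  kernel_initializer='he_normal')(p2)

# expansive path: 2x2 up-convolutions, each concatenated with the
# matching contraction-path feature map via a skip connection
u2 = layers.Conv2DTranspose(32, 2, strides=2, padding='same')(b)
u2 = layers.concatenate([u2, c2])
c3 = layers.Conv2D(32, 3, activation='relu', padding='same',
                   kernel_initializer='he_normal')(u2)
u1 = layers.Conv2DTranspose(16, 2, strides=2, padding='same')(c3)
u1 = layers.concatenate([u1, c1])
c4 = layers.Conv2D(16, 3, activation='relu', padding='same',
                   kernel_initializer='he_normal')(u1)

# final 1x1 convolution with sigmoid yields the segmentation map
outputs = layers.Conv2D(1, 1, activation='sigmoid')(c4)
model = tf.keras.Model(inputs, outputs)
```

Note that with 'same' padding the output map keeps the 128x128 input resolution, unlike the unpadded convolutions of the original paper.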

Biomedical Image Segmentation Model

Now we choose the optimizer and the loss function for our neural network. Since we are measuring the model's accuracy, we define this metric as well. Here, I have used the Adam (adaptive moment estimation) optimizer, but we could use stochastic gradient descent (SGD), RMSProp, AdaGrad, etc. However, Adam has been shown to give exceedingly good results for image segmentation applications and to provide the best model performance for biomedical imaging use cases like brain tumor segmentation in magnetic resonance images. It is well suited for classification and segmentation use cases, outperforming optimizers such as Adaptive Gradient (AdaGrad), Adaptive Delta (AdaDelta), Stochastic Gradient Descent (SGD), Cyclic Learning Rate (CLR), Adamax, and Root Mean Square Propagation (RMSProp).[1]

We have also used the Dice loss function. There are several loss functions to choose from, such as the commonly used binary cross-entropy loss, mean squared error, and IoU loss. The Dice loss accounts for pixel-wise loss by measuring the overlap between two probability distributions, and research has shown it is well suited to multiclass medical imaging problems, outperforming loss functions like cross-entropy and IoU.[2]
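A minimal Dice-loss sketch (the article's actual DiceLoss code is not shown, so this particular implementation, including the smoothing constant, is my assumption):

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    """1 - Dice coefficient: measures pixel-wise overlap between the
    predicted and ground-truth masks."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return 1.0 - dice

# compiling the model with Adam and this loss would then look like:
# model.compile(optimizer='adam', loss=dice_loss, metrics=['accuracy'])
```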

Since 2015, many improved architectures have been proposed and built on top of U-Net to strengthen its weak areas.

There is some incompatibility associated with the propagation of features from the contraction phase of the network to the expansion phase, and U-Net also has shortcomings in handling multi-modal classes and multi-resolution analysis. Architectures like MultiResUNet[3] improve on U-Net by tackling the incompatibility of the feature maps and aiding multi-resolution analysis. The skip connections of the original U-Net have also been re-invented in an improved architecture, UNet++[4], which greatly reduces the semantic gap, resulting in faster and more efficient learning.

As computer vision and image segmentation grow very fast, many recent technologies are becoming obsolete just as quickly, and biomedical imaging is the hottest field for image segmentation applications. As much as this technology can help in the early detection and mitigation of diseases, the stakes are high as well: biomedical applications are sensitive fields that demand rigorously tested, efficient, safe, and accurate models. When these factors are considered, such models can immensely improve the quality of human life.

Reference:

[1] Brain Sci. 2020, 10, 427; doi:10.3390/brainsci10070427

[2] J. Imaging 2021, 7, 16. https://doi.org/10.3390/jimaging7020016

[3] N. Ibtehaz and M. S. Rahman, Neural Networks (2020). https://doi.org/10.1016/j.neunet.2019.08.025

[4] arXiv:1807.10165 [cs.CV]
