U-Net Unleashed: A step-by-step guide on implementing and training your own segmentation model in TensorFlow: Part 1

Vipul Sarode
Jan 8, 2024


In this series, we will implement image segmentation using a U-Net model built from scratch. This first part covers the basics of segmentation and walks through coding the U-Net itself. In the second part, we will use this U-Net model to perform segmentation on real-world data.

Part 1 — Introduction to Segmentation and Coding a U-Net (this article)
Part 2 — Performing Segmentations using U-Net

Segmentation

Segmentation is a computer vision task that divides an image into meaningful regions by assigning each pixel a specific label, highlighting the different objects or patterns in the image. This makes segmentation pivotal in medical image analysis, self-driving cars, robotics, and more.
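Concretely, a segmentation model's output is a mask with one label per pixel. Here is a toy sketch, using NumPy thresholding purely as a stand-in for a real model, just to show what such a per-pixel label map looks like:

#A toy 4x4 "mask": every pixel gets a label, 1 for object and 0 for background
import numpy as np

image = np.random.rand(4, 4)            #stand-in for a grayscale image
mask = (image > 0.5).astype(np.uint8)   #thresholding stands in for a trained model
print(mask)                             #e.g. [[1 0 1 0] ...] - one label per pixel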

U-Net

U-Net is a convolutional neural network architecture introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015. As we can see in the figure, it forms a distinctive U shape, hence the name "U-Net". The architecture's robust ability to contract the image to extract context and then expand it again to localize objects has proven effective in various segmentation tasks, especially in medical image analysis.

U-Net Architecture (Source: https://arxiv.org/abs/1505.04597)

Encoder

As we can see in the architecture, the encoder consists of an input layer, a series of two convolutional layers, and a MaxPooling layer, repeated block by block. This repeated convolution and downsampling helps the encoder create a hierarchy of features ranging from low-level to high-level. The lower number of filters in the early blocks helps the encoder capture low-level features such as edges and textures, while the higher filter counts in the deeper blocks capture high-level features. Hence, in the overall U-Net architecture, the encoder does the job of figuring out the context of objects within the image.

After an image is received as input, the two convolutional layers extract features from it and pass them to the symmetrically opposite decoder layers. Then, the MaxPooling layer downsamples the image and passes it to the next series of convolutional layers. In the original paper, the authors downsample four times, but you can downsample fewer or more times depending on the complexity of the task, provided the number of encoder steps equals the number of decoder steps. Now, let's see how to code this encoder using TensorFlow and Python.

#First, let's import the layers we will need throughout this article
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Conv2DTranspose, Input, concatenate
from tensorflow.keras.models import Model

#Let's create a function for one step of the encoder block, so as to increase the reusability when making custom unets
def encoder_block(filters, inputs):
    x = Conv2D(filters, kernel_size = (3,3), padding = 'same', strides = 1, activation = 'relu')(inputs)
    s = Conv2D(filters, kernel_size = (3,3), padding = 'same', strides = 1, activation = 'relu')(x)
    p = MaxPooling2D(pool_size = (2,2), padding = 'same')(s)
    return s, p #p provides the input to the next encoder block and s provides the context/features to the symmetrically opposite decoder block
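As a quick sanity check (with a hypothetical 224 x 224 single-channel input, reusing the imports above), s keeps the spatial size for the skip connection while p is halved for the next block:

#Hypothetical sanity check for one encoder step
test_inputs = Input(shape = (224, 224, 1))
s, p = encoder_block(64, inputs = test_inputs)
print(s.shape)   #(None, 224, 224, 64) - sent to the opposite decoder block
print(p.shape)   #(None, 112, 112, 64) - downsampled input for the next block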

Baseline

In the original paper, the authors used two Convolutional layers as their baseline. Again, you can use any number of Convolutional layers in the baseline.

The baseline acts as a connection between the encoder and the decoder. It takes the most downsampled feature map as input, which enables it to extract high-level semantic features, and then sends these features to the decoder for upsampling. Let's see how to code the baseline in TensorFlow and Python.

#Baseline layer is just a bunch of Convolutional layers to extract high level features from the downsampled image
def baseline_layer(filters, inputs):
    x = Conv2D(filters, kernel_size = (3,3), padding = 'same', strides = 1, activation = 'relu')(inputs)
    x = Conv2D(filters, kernel_size = (3,3), padding = 'same', strides = 1, activation = 'relu')(x)
    return x
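As a quick check, the baseline widens the channels but keeps the spatial size. The 14 x 14 shape below is hypothetical, matching what four 2x downsamplings of a 224 x 224 image would produce:

#Hypothetical check: the baseline keeps the spatial size and widens the channels
deep = Input(shape = (14, 14, 512))   #what p4 would look like for a 224x224 input
b = baseline_layer(1024, deep)
print(b.shape)   #(None, 14, 14, 1024)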

Decoder

The decoder consists of an upsampler (usually Conv2DTranspose or UpSampling2D) and two convolutional layers. The decoder's job in the U-Net is to figure out where objects are located in the image. It upsamples the features sent to it by the baseline, combines them with the hierarchical features received from the encoder via skip connections, and produces a higher-resolution segmentation map. This repeated upsampling and concatenation helps the decoder localize objects robustly within the image. Let's see how to code one step of the decoder using TensorFlow and Python.

#Decoder Block
def decoder_block(filters, connections, inputs):
    #Upsample with a 2x2 transposed convolution, doubling the spatial size
    x = Conv2DTranspose(filters, kernel_size = (2,2), padding = 'same', activation = 'relu', strides = 2)(inputs)
    #Merge the upsampled features with the encoder features from the skip connection
    skip_connections = concatenate([x, connections], axis = -1)
    #Refine the merged features with 3x3 convolutions, matching the encoder and the original paper
    x = Conv2D(filters, kernel_size = (3,3), padding = 'same', activation = 'relu')(skip_connections)
    x = Conv2D(filters, kernel_size = (3,3), padding = 'same', activation = 'relu')(x)
    return x
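To see the upsampling and the skip concatenation in action, here is a hypothetical check with shapes matching the encoder example above:

#Hypothetical check for one decoder step
features = Input(shape = (112, 112, 128))   #e.g. the output of a deeper block
skip = Input(shape = (224, 224, 64))        #the matching encoder skip connection
d = decoder_block(64, connections = skip, inputs = features)
print(d.shape)   #(None, 224, 224, 64) - spatial size doubled, features refined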

Now that we have all the necessary components ready, let’s put them together and finally build a U-Net.

def unet():
    #Defining the input layer and specifying the shape of the images
    inputs = Input(shape = (224,224,1))

    #Defining the encoder
    s1, p1 = encoder_block(64, inputs = inputs)
    s2, p2 = encoder_block(128, inputs = p1)
    s3, p3 = encoder_block(256, inputs = p2)
    s4, p4 = encoder_block(512, inputs = p3)

    #Setting up the baseline
    baseline = baseline_layer(1024, p4)

    #Defining the entire decoder
    d1 = decoder_block(512, s4, baseline)
    d2 = decoder_block(256, s3, d1)
    d3 = decoder_block(128, s2, d2)
    d4 = decoder_block(64, s1, d3)

    #Setting up the output layer for binary classification of pixels
    outputs = Conv2D(1, 1, activation = 'sigmoid')(d4)

    #Finalizing the model
    model = Model(inputs = inputs, outputs = outputs, name = 'Unet')

    return model
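With everything defined, we can instantiate and inspect the model. A minimal sketch follows; adam and binary crossentropy are common defaults for a sigmoid output, not necessarily the choices we will make in Part 2:

model = unet()
model.compile(optimizer = 'adam',
              loss = 'binary_crossentropy',   #pairs with the sigmoid output layer
              metrics = ['accuracy'])
model.summary()   #prints the layer-by-layer architecture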

We have successfully coded the U-Net model using TensorFlow. In the next part of the series, we will perform segmentation on real-world skin cancer images using this same model. See you then. Take care!

References

  1. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/abs/1505.04597
