UNet 3+ Fully Explained — Next UNet Generation

Leo Wang
5 min read · Jul 28, 2022


UNet 3+ redesigns the skip connections to take in full-scale information. This gives it far fewer parameters than its two predecessors, UNet and UNet++, while delivering markedly better performance than both, and even than some other popular models.

Table of Contents

· ⭐️ Intuition
· ⭐️ UNet 3+ Architecture
In simple words…
Classification-guided Module (CGM)
Hybrid Loss Function
· ⭐️ Results
· ⭐️ Summary
· 🔥 Implementations of U-Net Family
TensorFlow
PyTorch
· Citations

Note: this story assumes that you have a basic understanding of UNet and UNet++. If not, you could check out this article.

⭐️ Intuition

Introduced in 2015, U-Net has been one of the most popular encoder-decoder architectures in medical image segmentation.

However, there was still plenty of room for improvement. To better exploit the semantic information of the input and improve the gradient flow, UNet++ was developed in 2018; it introduced dense convolutional blocks between the encoder path and the decoder path.

❗️However, despite its performance improvement over its predecessor U-Net, UNet++ is not perfect either. Specifically, UNet++ does not explore full-scale information sufficiently, and the model is too “bulky” (as shown in Fig. 1b).

⭐️ UNet 3+ Architecture

In simple words…

So, to address those shortcomings of UNet++, in 2020 H. Huang et al. proposed the next-generation architecture of the U-Net family, UNet 3+ (Fig. 1c), which improved on the UNet++ model by:

1) adopting its deep supervision technique and modifying it to take in full-scale semantic information (see the sketch after this list), and
2) redesigning the dense connections so that both the low-level details and the high-level semantics in the feature maps are used more effectively for segmentation.
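To make point 1) concrete, full-scale deep supervision means that every decoder stage gets its own side segmentation head, whose output is upsampled to the input resolution and supervised by the loss. Below is a minimal PyTorch sketch of one such head, assuming a binary segmentation setting; the class name SideOutput and its arguments are illustrative choices of mine, not names from the authors’ code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutput(nn.Module):
    """One full-scale deep-supervision head: a 3x3 conv maps a decoder
    stage's features to class logits, which are bilinearly upsampled
    back to the full input resolution and squashed to [0, 1]."""
    def __init__(self, in_channels: int, n_classes: int = 1, scale: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, n_classes, kernel_size=3, padding=1)
        self.scale = scale  # how far below input resolution this stage sits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.conv(x)
        if self.scale > 1:
            logits = F.interpolate(logits, scale_factor=self.scale,
                                   mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)
```

During training, the hybrid loss described later is applied to every side output, so each decoder stage learns directly from the full-scale supervision signal.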

Fig. 1: U-Net, UNet++, and UNet 3+ architectural comparison. UNet 3+ redesigns the skip connections and uses full-scale deep supervision to combine multi-scale features.² NOTE: there is a typo in the UNet 3+ graph; the encoder layers should be denoted X_En, NOT X_Ee.

Previous studies have shown that feature maps at different scales (i.e., different levels) capture different types of information. For example, feature maps at lower levels capture spatial detail, such as the boundaries of organs, while feature maps at higher levels capture positional information, such as the relative positions of the organs.

Therefore, the redesigned skip connections in UNet 3+ combine same- and smaller-scale feature maps from the encoder with larger-scale feature maps from the decoder, capturing both “fine-grained” details and “coarse-grained” semantics in full scale.

To illustrate the concept of the redesigned skip connections intuitively, the authors include a diagram, shown in Fig. 2:

Fig. 2: the full-scale aggregated feature map of the third decoder layer, X_De³ (there are five decoder layers in total, as shown in Fig. 1; the third is in the middle).²

To construct the input to the 3rd-level decoder, the feature maps from the first three encoder layers are first brought to the 3rd-level resolution (the larger maps by max pooling) and concatenated. The 4th-level encoder’s feature map, however, is NOT fed to the 3rd-level decoder directly; instead, it passes through the 4th-level decoder first, whose upsampled output is used. The same holds for the 5th level, which is the bottleneck (turning point) of the network.

This is the special part of UNet 3+’s redesigned skip connections. Because the dense, nested decoders of UNet++ are gone, the model uses substantially fewer parameters; yet since full-scale information (feature maps from every level of the encoder and decoder) is explored, it performs even better than UNet++, which has far more parameters but does not explore full-scale information sufficiently.
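A minimal PyTorch sketch of this aggregation for the 3rd decoder stage is below. The channel widths (64/128/256 at encoder levels 1–3, 320 = 5 × 64 per decoder stage, 1024 at the bottleneck) follow the paper’s default filter setting, but the class and function names are mine; this is an illustration, not the authors’ implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def branch(in_ch: int, out_ch: int = 64) -> nn.Sequential:
    # every incoming feature map is projected to the same channel width
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FullScaleDecoder3(nn.Module):
    """Aggregation for the 3rd decoder stage: encoder maps 1-3 are
    max-pooled down (or passed through) to the 3rd-level resolution,
    while decoder maps 4-5 are bilinearly upsampled to it. All five
    are projected to 64 channels, concatenated, and fused."""
    def __init__(self, enc_chs=(64, 128, 256), dec_chs=(320, 1024)):
        super().__init__()
        self.b_e1 = branch(enc_chs[0])   # encoder 1: pooled by 4
        self.b_e2 = branch(enc_chs[1])   # encoder 2: pooled by 2
        self.b_e3 = branch(enc_chs[2])   # encoder 3: same scale
        self.b_d4 = branch(dec_chs[0])   # decoder 4: upsampled x2
        self.b_d5 = branch(dec_chs[1])   # bottleneck: upsampled x4
        self.fuse = branch(5 * 64, 5 * 64)

    def forward(self, e1, e2, e3, d4, d5):
        x = torch.cat([
            self.b_e1(F.max_pool2d(e1, 4)),
            self.b_e2(F.max_pool2d(e2, 2)),
            self.b_e3(e3),
            self.b_d4(F.interpolate(d4, scale_factor=2, mode="bilinear",
                                    align_corners=False)),
            self.b_d5(F.interpolate(d5, scale_factor=4, mode="bilinear",
                                    align_corners=False)),
        ], dim=1)
        return self.fuse(x)
```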

Classification-guided Module (CGM)

In their work, the authors further propose a module called the Classification-Guided Module (CGM). The goal is to reduce the false-positive rate, i.e., predicting an organ in an image where none is present.

The module, shown in Fig. 3, first detects whether the organ is present at all, before the network tries to segment it.

Fig. 3: Classification-guided Module (CGM)

Therefore, this newly proposed module successfully reduces the false-positive rate, and because it draws on the deepest feature map, which carries the richest semantic information, it can further benefit the main segmentation task.
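Here is a minimal sketch of the idea in PyTorch, following the paper’s described sequence of dropout, 1×1 convolution, adaptive max pooling, and sigmoid; the gating below is simplified, and in the paper the classification branch is trained with its own binary cross-entropy loss:

```python
import torch
import torch.nn as nn

class CGM(nn.Module):
    """Classification-guided module: predicts organ-present vs. absent
    from the deepest feature map and gates the segmentation output."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Conv2d(in_channels, 2, kernel_size=1),  # 2 classes: absent / present
            nn.AdaptiveMaxPool2d(1),
        )

    def forward(self, deepest: torch.Tensor, seg: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(self.head(deepest).flatten(1))  # (B, 2)
        present = probs.argmax(dim=1).float()                 # 0 or 1 per image
        # argmax is not differentiable; the head is trained with its own
        # classification loss, and here it only gates the prediction
        return seg * present.view(-1, 1, 1, 1)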

Hybrid Loss Function

The authors also propose a new composite (hybrid) loss function in order to further exploit the full-scale information.

Formula 1: the proposed composite loss, ℓ_seg = ℓ_fl + ℓ_ms-ssim + ℓ_iou

The new loss function is defined as the sum of the focal loss (fl), the multi-scale structural similarity index loss (ms-ssim), and the intersection-over-union (IoU) loss. (The ms-ssim loss penalizes fuzzy organ-boundary predictions more heavily, and therefore enhances the organ-boundary segmentation.)

Therefore, the newly proposed loss function enhances segmentation at the pixel, patch, and map levels, capturing full-scale semantic information with clear boundaries.
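A hedged PyTorch sketch of such a hybrid loss is below. The focal and soft-IoU terms are standard formulations (the focal loss omits the optional class-weighting factor), and the MS-SSIM term uses the third-party pytorch-msssim package; the paper’s exact implementation may differ in details:

```python
import torch
from pytorch_msssim import ms_ssim  # third-party: pip install pytorch-msssim

def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    # pixel-level term: down-weights easy pixels, focuses training on hard ones
    pred = pred.clamp(eps, 1 - eps)
    pt = torch.where(target > 0.5, pred, 1 - pred)
    return (-((1 - pt) ** gamma) * pt.log()).mean()

def iou_loss(pred, target, eps=1e-7):
    # map-level term: 1 minus the soft intersection-over-union
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def hybrid_loss(pred, target):
    """pred and target: (B, 1, H, W) probabilities / binary masks in [0, 1];
    H and W should exceed ~160 px for the default 5-level MS-SSIM."""
    # patch-level term: MS-SSIM penalizes fuzzy boundary predictions heavily
    ms_ssim_term = 1 - ms_ssim(pred, target, data_range=1.0)
    return focal_loss(pred, target) + ms_ssim_term + iou_loss(pred, target)
```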

⭐️ Results

Table 1: “Comparison of UNet, UNet++, the proposed UNet 3+ without deep supervision (DS) and UNet 3+ on liver and spleen datasets in terms of Dice metrics. The best results are highlighted in bold. The loss function used in each method is focal loss.”²

The authors of the paper then conducted a quantitative comparison of UNet, UNet++, and UNet 3+, using VGG-16 and ResNet-101 networks as the UNet backbones (shown in Table 1).

UNet 3+, with or without the deep supervision technique, clearly reaches state-of-the-art (SOTA) performance over its two predecessors on both the liver and spleen datasets, and with a reduced number of parameters. UNet 3+ is indeed an “upgraded version” of UNet++!

Table 2: “Comparison of UNet 3+ and other 5 state-of-the-art methods. The best results are highlighted in bold.”²

The authors also ran a quantitative comparison with other popular architectures and showed UNet 3+’s superior performance over even the best of them on the liver and spleen datasets (shown in Table 2).

⭐️ Summary

UNet 3+ redesigns the skip connections to take in full-scale semantic information from the input images, and it proves not only more accurate but also smaller, and therefore more efficient, than many popular image-segmentation networks.

Citations

[1] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, UNet++: A Nested U-Net Architecture for Medical Image Segmentation (2018), arXiv: Computer Vision and Pattern Recognition
[2] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y. Chen, and J. Wu, UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation (2020), arXiv: Computer Vision and Pattern Recognition
