Are Traditional Skip Connections Adequate? Exploring U-Net v2 for Medical Image Precision

Maru Tech 🐞 · Published in Data And Beyond · 9 min read · Dec 11, 2023

Hiiiiii, fellow enthusiasts! 🌟 Today, we’re stepping into the exciting realm of U-Net variations, and guess what?

A new player called U-Net v2 has just entered the scene, promising to revolutionize medical image segmentation.

Exciting, right?

Photo by Nimi Diffa on Unsplash

Are you ready to uncover the magic woven into this innovative technology?

.

..

……

I can hear you saying “yess!!!” 🤩🤩🤩🤩🤩🤩🤩

Alright, let’s dive in! 🎇

Photo by Gabriel Valdez on Unsplash

So, today’s approach aims to improve the feature maps by combining the semantic richness of higher-level features with the finer details of lower-level features, using the Hadamard product as the integration mechanism.

(The Hadamard product is an element-wise multiplication of corresponding entries in two matrices of the same dimensions.)
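To make this concrete, here’s a minimal PyTorch sketch of the Hadamard product (the shapes are just illustrative):

```python
import torch

# Two feature maps of the same shape: (batch, channels, height, width)
a = torch.randn(1, 32, 64, 64)
b = torch.randn(1, 32, 64, 64)

# The Hadamard product is simply element-wise multiplication
h = a * b       # equivalent to torch.mul(a, b)

print(h.shape)  # torch.Size([1, 32, 64, 64]), same shape as the inputs
```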

1. State of the art:

1.1. U-Net:

For additional insights, check out my previous blog post on U-Net:

https://medium.com/data-and-beyond/understanding-u-net-convolutional-networks-for-biomedical-image-segmentation-paper-92e8baab778c

fig 1.

1.2. UNet++:

UNet++ consists of a series of nested U-Net structures at different scales. Each block is connected to the others through skip connections, allowing the model to capture both fine and coarse features. It also incorporates residual connections or blocks to facilitate the training of deeper networks and mitigate the vanishing gradient problem.
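As a rough illustration, each nested decoder node concatenates everything computed so far at its level with an upsampled map from the level below. Here’s a minimal sketch (node names and channel counts are made up, not from the official UNet++ code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def nested_node(prev_feats, below_feat, conv_block):
    """One UNet++ node, schematically: concatenate every earlier feature map
    at this level with the upsampled map from the level below, then convolve."""
    up = F.interpolate(below_feat, scale_factor=2, mode="bilinear",
                       align_corners=False)
    return conv_block(torch.cat(prev_feats + [up], dim=1))

# Tiny usage example
conv = nn.Conv2d(32 + 64, 32, kernel_size=3, padding=1)
x00 = torch.randn(1, 32, 64, 64)   # earlier node at this level
x10 = torch.randn(1, 64, 32, 32)   # node one level below
x01 = nested_node([x00], x10, conv)
print(x01.shape)  # torch.Size([1, 32, 64, 64])
```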

fig 2.

1.3. MDU-Net (Multi-scale Densely Connected U-Net):

MDU-Net is an extension of the U-Net architecture that adds three kinds of multi-scale dense connections: within the encoder, within the decoder, and across the two. The architecture directly fuses neighbouring feature maps of different scales, which improves the information flow between the encoder and decoder; the multi-scale dense connections also allow a much deeper U-Net while reducing overfitting, for better accuracy.
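The core idea of fusing neighbouring scales can be sketched like this (a schematic in the spirit of the description above; names and details are illustrative, not MDU-Net’s actual code):

```python
import torch
import torch.nn.functional as F

def fuse_neighbouring_scales(f_above, f_curr, f_below):
    """Resize the neighbouring-scale feature maps to the current resolution,
    then concatenate them along the channel dimension."""
    size = f_curr.shape[2:]
    f_above = F.interpolate(f_above, size=size, mode="bilinear", align_corners=False)
    f_below = F.interpolate(f_below, size=size, mode="bilinear", align_corners=False)
    return torch.cat([f_above, f_curr, f_below], dim=1)
```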

The architecture was evaluated on the MICCAI 2015 Gland Segmentation dataset and showed improved performance compared to the original U-Net.

fig 3.

1.4. Reverse Attention Network (RAN):

fig 4.

The motivation behind this work is that when there are areas in a picture where two or more things look similar or overlap, a model can make mistakes in determining which class such an area belongs to; an example is depicted in fig 5.

fig 5.

You can observe in the heatmaps that the filters, particularly in the lower part of the two animals, produce a high response across the entire area without distinguishing between the two classes. This indicates that the conventional way of training models to recognize things doesn’t work well in these confusing areas, hence the authors proposed reverse attention learning (RAN) to address the issue.

With RAN, we not only teach the model what belongs to a certain class (like saying “this is a cat”) but also specify what doesn’t belong to that class, essentially stating “hey, this confusing part definitely isn’t a cat!”

What RAN does is incorporate two new branches into the usual segmentation model. The first branch is the Reverse Branch, which helps the model learn the reverse class (e.g. the background), as shown below in fig 6.

fig 6.

The second branch is the Attention Branch. It uses a normal convolution to learn the class features and then applies negation to obtain the reverse of these feature maps; subsequently, it employs a sigmoid function to highlight higher values and generate an attention map used as a mask. The resulting mask is multiplied with the output of the Reverse Branch, the product is subtracted from the original branch’s feature map, and finally the result is upscaled to yield the final segmentation map.
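Here’s how that combination might look in code (a sketch of my reading of the mechanism; the names are mine, not the paper’s):

```python
import torch
import torch.nn.functional as F

def reverse_attention(original_feat, reverse_feat, attn_conv):
    """Sketch of the RAN combination step.
    original_feat : feature map from the original branch
    reverse_feat  : feature map from the Reverse Branch
    attn_conv     : the Attention Branch's convolution"""
    attn_map = torch.sigmoid(-attn_conv(original_feat))  # negate, then sigmoid
    masked = attn_map * reverse_feat                     # mask the Reverse Branch
    out = original_feat - masked                         # subtract from the original
    return F.interpolate(out, scale_factor=2,            # upscale to the output
                         mode="bilinear", align_corners=False)
```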

fig 7.

Note: why did they use an Attention Branch and not just the Reverse Branch? The reason for incorporating both is that at times the Reverse Branch may not accurately predict the opposite class (e.g. the background); in such cases, the Attention Branch comes into play by helping the model pay attention to areas it might have missed in the Reverse Branch.

2. The U-Net v2 solution:

2.1 Overall architecture

U-Net v2 consists of three main parts: the encoder, the SDI (Semantics and Detail Infusion) module, and the decoder.

The encoder takes an input image and produces feature maps at multiple levels; these features are then refined in the SDI module and passed to the decoder to finally produce the segmented image.
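At a glance, the data flow is simple; here’s a structural sketch (the three submodules are placeholders, not the authors’ implementation):

```python
import torch.nn as nn

class UNetV2Sketch(nn.Module):
    """High-level data flow only: encoder -> SDI -> decoder."""
    def __init__(self, encoder, sdi, decoder):
        super().__init__()
        self.encoder, self.sdi, self.decoder = encoder, sdi, decoder

    def forward(self, x):
        feats = self.encoder(x)       # multi-level feature maps
        refined = self.sdi(feats)     # one refined map per level
        return self.decoder(refined)  # final segmentation map
```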

fig 8.

2.2 Semantics and Detail Infusion (SDI) Module

fig 9.

In the SDI module, the hierarchical feature maps from the encoder go through attention mechanisms. This means that each level of features gets enhanced with both local and global information; the process then involves a few mathematical operations, resulting in a refined feature map at each level.

Note: in the paper, only the third-level (l = 3) feature map is highlighted, for simplicity of illustration, so we will focus solely on explaining this particular case (see fig 10.).

fig 10.
fig 11.

2.2.1. Spatial and Channel Attention:

fig 12.

Features at each level undergo spatial attention (e.g. incorporating max pooling) and channel attention (e.g. using global average pooling), as depicted in fig 12.

The processed feature map is denoted f¹ᵢ, where i is the level of the feature map and 1 ≤ i ≤ M (M being the number of feature maps from the encoder).
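A minimal CBAM-style sketch of this step (the paper’s exact attention layers may differ; this just captures the spirit of channel attention via global average pooling and spatial attention via channel-wise max pooling):

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Channel attention: global average pooling + a small bottleneck
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: channel-wise max pooling -> conv -> sigmoid
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f):
        f = f * self.channel_gate(f)            # re-weight channels
        s = f.max(dim=1, keepdim=True).values   # max over channels
        return f * self.spatial_gate(s)         # re-weight locations
```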

2.2.2. Dimension Reduction:

A 1×1 convolution is then applied to reduce the channels of f¹ᵢ to a specified value c, and the resulting feature map is denoted f²ᵢ.

2.2.3. Preparing for Decoder:

fig 14.

Feature maps at all levels are then adjusted to match the resolution of the target level’s f²ᵢ (f²₃ in our case), where 1 ≤ i ≤ M. Let me explain this a bit more: since f²ᵢ’s dimensions are used as the reference target at each decoder level, for the second decoder level (l = 2) all the feature maps produced in the SDI module must be resized to the same size as f²₂; for l = 3, to the same size as f²₃, and so on.

fig 15.

2.2.4. Smoothing:

θ is a 3×3 convolution applied to smooth the resized feature maps.

2.2.5. Enhancement with Hadamard Product:

fig 16.

All resized feature maps are then combined using the Hadamard product; the resulting feature map therefore contains enhanced information with both semantics and details.

2.2.6. Sending to Decoder:

The resulting map f⁵ᵢ is sent to the i-th level of the decoder for further resolution reconstruction and segmentation.
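Putting steps 2.2.2 through 2.2.6 together, here’s a compact sketch of the SDI computation for one target level (layer choices and names follow my reading of the paper, not the official code; for instance, I use bilinear interpolation for all resizing, while the paper treats downsampling and upsampling differently):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDISketch(nn.Module):
    def __init__(self, in_channels_per_level, c=32):
        super().__init__()
        # 2.2.2 Dimension reduction: a 1x1 conv per level, down to c channels
        self.reduce = nn.ModuleList(
            nn.Conv2d(ch, c, kernel_size=1) for ch in in_channels_per_level
        )
        # 2.2.4 Smoothing: a 3x3 conv (theta) applied after resizing
        self.smooth = nn.Conv2d(c, c, kernel_size=3, padding=1)

    def forward(self, f1_maps, target_level):
        # f1_maps: attention-processed maps, one per encoder level (step 2.2.1)
        target_size = f1_maps[target_level].shape[2:]
        fused = None
        for level, f in enumerate(f1_maps):
            f2 = self.reduce[level](f)                    # 2.2.2
            f3 = F.interpolate(f2, size=target_size,      # 2.2.3 resize to the
                               mode="bilinear",           # target level's size
                               align_corners=False)
            f4 = self.smooth(f3)                          # 2.2.4 smoothing
            fused = f4 if fused is None else fused * f4   # 2.2.5 Hadamard product
        return fused  # f5, sent to the target decoder level (2.2.6)
```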

3. Experiments

3.1. Datasets

They tested U-Net v2 on two datasets for skin lesion segmentation (ISIC 2017 and ISIC 2018) and five datasets for polyp segmentation (Kvasir-SEG, ClinicDB, ColonDB, Endoscene, and ETIS), following the established train/test split strategies for fair comparison.

3.2 Experimental Setup

  • Hardware/framework: NVIDIA P100 GPU, PyTorch
  • Backbone: Pyramid Vision Transformer (PVT)
  • Optimizer: Adam with lr = 0.001, β1 = 0.9, and β2 = 0.999
  • Polynomial learning rate decay with a power of 0.9
  • Max epochs: 300
  • Hyper-parameter c = 32
  • Evaluation metrics: Dice Similarity Coefficient, Intersection over Union, Mean Absolute Error

Each experiment was run 5 times, and the averaged results were reported.
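For concreteness, here’s a minimal PyTorch sketch of the reported optimizer and learning-rate schedule (the model is a placeholder and the epoch body is omitted):

```python
import torch
import torch.nn as nn

# Placeholder model; in the paper this is U-Net v2 with a PVT backbone
model = nn.Conv2d(3, 1, kernel_size=1)

# Adam with lr = 0.001, beta1 = 0.9, beta2 = 0.999
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Polynomial learning-rate decay with a power of 0.9 over 300 epochs
max_epochs = 300
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1 - epoch / max_epochs) ** 0.9
)

for epoch in range(max_epochs):
    # ... one training epoch over the segmentation data would go here ...
    scheduler.step()
```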

3.3 Results and Analysis

3.4 Ablation Study

An ablation study was conducted on the ISIC 2017 and ColonDB datasets to assess the effectiveness of U-Net v2’s components.

Results showed that U-Net v2 with the SDI module significantly outperformed UNet++ (using PVT as the encoder), highlighting the crucial contribution of the SDI module to overall performance.

Note: the drop in UNet++’s results may be attributed to the simple concatenation of the multi-level features generated by its dense connections, which could confuse the model and introduce noise.

3.5 Qualitative Results

fig 17.

These qualitative results demonstrate U-Net v2’s ability to infuse semantic information and finer details into the feature maps at each level, allowing the model to capture object boundaries more precisely.

3.6 Computation, GPU Memory, and Inference Time

fig 18.

The experiments use float32 as the data type, resulting in 4 bytes of memory per value. The reported GPU memory usage reflects the size of the parameters and of the intermediate variables stored during the forward/backward pass, measured on an NVIDIA P100 GPU with an input size of (1, 3, 256, 256).
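As a quick sanity check of that 4-byte figure (a back-of-the-envelope calculation, not from the paper):

```python
# float32 uses 4 bytes per value, so the (1, 3, 256, 256) input alone takes:
n_values = 1 * 3 * 256 * 256
print(f"{n_values * 4 / 1024**2:.2f} MiB")  # 0.75 MiB
```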

The results in fig 18. show that UNet++ introduces more parameters, leading to larger GPU memory usage due to the intermediate variables (such as feature maps) stored during its dense forward process. In addition, U-Net v2 exhibits higher FPS than UNet++, and despite the reduction in FPS compared to U-Net (PVT), its overall computational efficiency remains notably high.

Conclusion

Finally, we can say that U-Net v2 marks a significant advancement in medical image segmentation. Its innovative design, characterized by strategically crafted skip connections, facilitates the seamless integration of high-level semantic information and intricate details at every level of the feature maps generated by the encoder.

Rigorous experiments on several datasets have substantiated U-Net v2’s effectiveness in real-world applications. Moreover, a careful analysis of its computational complexity underscores its efficiency in terms of FLOPs and GPU memory usage. Hence, U-Net v2 stands not just as an evolution but as a sophisticated solution, embodying the next frontier in precise medical image segmentation.

Photo by Vanessa Serpas on Unsplash

References

U-Net v2: Rethinking the Skip Connections of U-Net for Medical Image Segmentation (https://arxiv.org/pdf/2311.17791.pdf)


Maru Tech 🐞

Deep learning & computer vision engineer | Algeria | Data And Beyond Author