The road to U-Net reproduction

Django Brunink
10 min read · Apr 20, 2020


This project was part of multiple reproduction attempts of many different articles. For reproduction attempts of other articles, please visit https://reproducedpapers.org/.

In biomedical image processing, the development of the U-Net allowed varied tasks to be performed without large datasets. Published in 2015, the paper reported state-of-the-art results, ranking among the best entries in the EM segmentation challenge started at ISBI in 2012 and in the segmentation task of the ISBI cell tracking challenge 2014 and 2015. The paper has been cited more than 13k times, and follow-up works (extending it to 3D images and to more use cases) have continued pushing scores higher while providing a more generalized tool for biomedical tasks.

Here, we present our reproduction of the network and some of its results. We attempted full independent reproduction, based only on the paper’s content. We did this to evaluate whether articles published by the scientific community can be reproduced properly with the information provided by the authors. We allowed ourselves to examine the main articles, any supplementary material, and some follow-up papers.

Although supplementary material is mentioned in the article, it does not appear to be publicly available. We also tried to extract at least the hyper-parameters the authors used for their results from their published Caffe implementation, but to no avail. We contacted the authors as well, but did not receive a reply.

What is U-Net’s goal?

Its main purpose is to segment biomedical images into two or more classes. In the original paper, the net is applied to three segmentation tasks:

  • Neuronal structures in electron microscopy recordings (Experiment 1)
  • Light microscopy images of Glioblastoma-astrocytoma U373 cells (Experiment 2)
  • Light microscopy images of HeLa cells (Experiment 3)

What did we do?

We replicated Experiment 1, in which each pixel of the incoming image is classified into one of two classes, cell (white) or membrane (black), using only publicly available information.

The training data with labels. From: ISBI

We did our work in Python 3 using the PyTorch library. For our GPU power needs, we used Google Colaboratory.

In this article, we explain in detail what U-Net is, what our approach was for reproduction and the results we got.

Preparation

Dataset

First, we need to load the data provided by the challenge. They provide 30 training images with their corresponding targets (seen in the GIF above), and a set of 30 test images (with labels only known by the creators of the challenge). The images are provided as three multipage .tiff files in grey-scale with dimensions 512×512.

Because the targets for the test data are not provided, we split the training dataset into training and evaluation sets (60/40 split) to be able to evaluate how well our training is doing on an unseen test set.
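To sketch this step in code: the file names below are those of the challenge download (adjust them to your local paths), and the choice of the tifffile package to read the multipage .tiff files is ours rather than something prescribed by the challenge.

import numpy as np
import tifffile  # reads a multipage .tiff straight into a (pages, H, W) array

# File names as in the challenge download; adjust the paths if yours differ.
images = tifffile.imread('train-volume.tif')   # (30, 512, 512) grey-scale slices
labels = tifffile.imread('train-labels.tif')   # (30, 512, 512) targets
test   = tifffile.imread('test-volume.tif')    # (30, 512, 512), targets withheld

# 60/40 split of the 30 labelled slices into a training and an evaluation set.
rng = np.random.default_rng(0)
order = rng.permutation(len(images))
train_idx, eval_idx = order[:18], order[18:]
train_images, train_labels = images[train_idx], labels[train_idx]
eval_images, eval_labels = images[eval_idx], labels[eval_idx]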

Data Augmentation

A big part of what makes U-Net successful is the use of data augmentation. The authors unfortunately do not specify how much data augmentation they use to achieve their results. We opted to try to reproduce their efforts by augmenting the dataset with 5 extra images per original image, resulting in a dataset of 180 images and targets. So, how did we augment our data?

For reasons that will become clear later, we first added reflections of the image on each side. We then applied a random rotation to the image (between 0 and 360 degrees), followed by a random elastic deformation. Finally, we cropped the image to a final size of 700×700, removing as much reflected information as possible. We made sure to apply the same transformations to the target as to the image.

Example of data augmentation for an image.
Example for data augmentation for a target corresponding to the image above.

But what is elastic deformation, you may ask? It means overlaying a coarse grid on the image, randomly displacing the grid intersection points, and interpolating the image to the deformed grid. Practically speaking, we used the elasticdeform library available on PyPI.
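To make the augmentation concrete, here is a minimal sketch of how one augmented pair could be produced. The padding width, the rotation via scipy, and the deformation parameters (sigma, points) are our own illustrative choices; only the overall recipe (reflect, rotate, elastically deform, crop) follows the description above.

import numpy as np
import elasticdeform
from scipy import ndimage

def augment(image, target, rng, out_size=700):
    # Produce one augmented (image, target) pair from a 512x512 input pair.
    # 1. Mirror the image on every side so rotation and cropping never hit an
    #    empty border (the pad width of 256 is our own choice).
    image = np.pad(image, 256, mode='reflect')
    target = np.pad(target, 256, mode='reflect')

    # 2. Random rotation between 0 and 360 degrees; order=0 keeps the target binary.
    angle = rng.uniform(0, 360)
    image = ndimage.rotate(image, angle, reshape=False, mode='reflect')
    target = ndimage.rotate(target, angle, reshape=False, mode='reflect', order=0)

    # 3. Random elastic deformation, applied identically to image and target.
    image, target = elasticdeform.deform_random_grid(
        [image, target], sigma=25, points=3, order=[3, 0])

    # 4. Centre crop to the final input size of 700x700.
    top = (image.shape[0] - out_size) // 2
    left = (image.shape[1] - out_size) // 2
    image = image[top:top + out_size, left:left + out_size]
    target = target[top:top + out_size, left:left + out_size]
    return image, target

rng = np.random.default_rng(0)
aug_image, aug_target = augment(images[0], labels[0], rng)  # images/labels as loaded earlier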

U-Net Architecture

Concept

U-Net, as the authors say, is a “fully convolutional network”. In the first stage, it aims to extract many features in a traditional unpadded convolutional path, which consequently reduces the image size (‘contracting path’). During a second stage, the network upsamples this information and combines channels to increase the resolution again (‘expanding path’).

Example of what a U-Net can look like. The ‘U’ shape the figure makes is what gives the model its name. From: source.

Because the convolutions are unpadded, the images that come out of the model are smaller than the images that went in. In the example architecture provided by the authors of the original article, the input is 572×572 and the output is 388×388. We chose to keep the output information content as close to the target as possible, so we used 700×700 as input size (which is why we needed the reflections earlier) and got 516×516 as output size, which we then needed to downsample to 512×512 afterwards.
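The arithmetic behind these numbers is easy to trace: every unpadded 3x3 convolution shrinks each spatial dimension by 2 pixels, every 2x2 max pooling halves it, and every 2x2 up-convolution doubles it. The small helper below is purely our own illustration of that bookkeeping.

def unet_output_size(size, depth=4):
    # Trace the spatial size of an input through an unpadded U-Net.
    for _ in range(depth):   # contracting path
        size = size - 4      # two unpadded 3x3 convolutions
        size = size // 2     # 2x2 max pooling
    size = size - 4          # double convolution at the bottom
    for _ in range(depth):   # expanding path
        size = size * 2      # 2x2 up-convolution
        size = size - 4      # two unpadded 3x3 convolutions
    return size

print(unet_output_size(572))  # 388, the example architecture from the paper
print(unet_output_size(700))  # 516, our choice of input size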

The U-Net is mainly composed of:

  • Contracting path: 2 unpadded convolutions (3x3) are applied, followed by ReLU and max pooling (2x2), halving the image size and doubling the number of channels. This repeats until the end of the path.
  • Expanding path: 2 unpadded convolutions (3x3) are applied, followed by ReLU and upsampling (2x2), doubling the image size and halving the number of channels. This repeats until the end of the path.
  • Concatenation with cropped contracting path results

A final convolution layer (1x1) translates the 64 channels into the 2 class channels. The weights are initialized using Kaiming initialization. A minimal sketch of one contracting-path building block is shown below.
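As a sketch, one contracting-path step can be written as the block below; the double_conv helper name is ours, not the authors'.

import torch.nn as nn

def double_conv(in_channels, out_channels):
    # Two unpadded 3x3 convolutions, each followed by a ReLU; each convolution
    # shaves 2 pixels off every spatial dimension.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=0), nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, 3, padding=0), nn.ReLU(inplace=True),
    )

# One contracting-path step: extract features, then halve the resolution;
# the next double_conv in the path doubles the number of channels again.
block1 = double_conv(1, 64)
pool = nn.MaxPool2d(2, stride=2)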

Our PyTorch implementation

The network implementation is quite straightforward. We used the following operators from the PyTorch library:

For all max pooling operations we used nn.MaxPool2d(2, stride=2) .

For convolutions we used nn.Conv2d() . The first convolution, e.g.
self.conv1 = nn.Conv2d(1, 64, 3, padding=0) .

For the upsampling we used nn.ConvTranspose2d(), e.g. the first one:
self.upconv1 = nn.ConvTranspose2d(1024, 512, 2, stride=2, padding=0) .

For initializing weights we used self.apply(self._weights_init) .
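The snippet above only shows the call, not the body of _weights_init. A minimal version of such a function, using PyTorch's built-in Kaiming initializer, could look like the sketch below (zero-initializing the biases is an extra choice of ours).

import torch.nn as nn

def weights_init(m):
    # Kaiming (He) initialization for every (transposed) convolution,
    # matching the ReLU non-linearities used throughout the network.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Inside the model's __init__, this is applied recursively to every submodule:
# self.apply(weights_init)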

To combine tensors from the contracting path with the expanding path, we first cropped the image, e.g. x_from_side = self.crop(x4, 4) , followed by concatenation: x = torch.cat((x_from_side, x_from_down), 1) .
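The crop itself is just index slicing. A minimal version of a crop helper that trims n pixels from every spatial border of an (N, C, H, W) tensor could look like this; for a 700x700 input, the deepest skip connection is 80x80 while the upsampled tensor is 72x72, hence the crop of 4.

import torch

def crop(x, n):
    # Remove n pixels from every spatial border of an (N, C, H, W) tensor so the
    # contracting-path tensor matches the upsampled tensor before concatenation.
    return x[:, :, n:x.shape[2] - n, n:x.shape[3] - n]

x_from_side = crop(torch.randn(1, 512, 80, 80), 4)   # -> (1, 512, 72, 72)
x_from_down = torch.randn(1, 512, 72, 72)
x = torch.cat((x_from_side, x_from_down), dim=1)     # -> (1, 1024, 72, 72)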

So what does the network look like? We used the torchsummary library to tell us and to check that everything is connected properly. We got over 31 million parameters to train!

A summary of our pytorch implementation of U-Net.

Training, Evaluation and Testing

The training was performed as in a traditional convolutional network (a condensed sketch follows the list):

  • Forward pass: output = self.forward(input) .
  • Loss computation: loss = criterion(output, target) .
  • Backward pass, using SGD with high momentum (0.99): loss.backward() .
  • Weight update: optimizer.step() .
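Putting these steps together, a condensed training loop could look like the sketch below. The pixel-wise cross-entropy loss and the 0.99 momentum come from the article; the default learning rate and the bilinear downsampling of the 516x516 output to the 512x512 target size are our own choices (the article only says the output was downsampled).

import torch
import torch.nn as nn
import torch.nn.functional as F

def train(net, train_loader, n_epochs, lr=1e-3):
    # Pixel-wise cross-entropy; PyTorch applies the softmax internally.
    criterion = nn.CrossEntropyLoss()
    # SGD with the high momentum (0.99) mentioned in the original article.
    optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.99)

    for epoch in range(n_epochs):
        for image, target in train_loader:   # image: (N, 1, 700, 700); target: (N, 512, 512) class indices
            optimizer.zero_grad()
            output = net(image)              # (N, 2, 516, 516) logits
            # Bring the logits down to the target size (one plausible way).
            output = F.interpolate(output, size=target.shape[-2:],
                                   mode='bilinear', align_corners=False)
            loss = criterion(output, target)
            loss.backward()                  # backward pass
            optimizer.step()                 # weight update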

With training done, it is time to predict some images!

Loss is computed with pixel-wise cross-entropy (preceded by a softmax function). During training and evaluation, we also computed the pixel error, a metric of binary classification accuracy that is also used in the original article:

precision = tp / (tp + fp)

recall = tp / (tp + fn)

Fscore = 2 ⋅ precision ⋅ recall / (precision + recall)

PixelError = 1 - Fscore

where tp indicates true positive pixels, fp false positive pixels, and fn false negative pixels.
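In code, for binary prediction and target maps, the pixel error boils down to the short function below (treating the cell class as the positive class is our own convention).

def pixel_error(pred, target):
    # pred and target are binary maps of the same shape (1 = cell, 0 = membrane).
    tp = ((pred == 1) & (target == 1)).sum().item()
    fp = ((pred == 1) & (target == 0)).sum().item()
    fn = ((pred == 0) & (target == 1)).sum().item()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return 1 - fscore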

To obtain the final segmented map for pixel error calculation and visualization, we apply softmax (output = nn.Softmax(dim=1)(output)) and PyTorch's argmax (index = torch.argmax(output, dim=1)) to the output, obtaining the index of the channel where the output value is maximal.
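For a trained model net and an input batch image (both names are ours), the prediction step then looks roughly like this; downsampling the class probabilities to 512x512 before the argmax is one way to match the target size for the pixel error.

import torch
import torch.nn as nn
import torch.nn.functional as F

with torch.no_grad():
    output = net(image)                      # (N, 2, 516, 516) logits
    output = nn.Softmax(dim=1)(output)       # per-pixel class probabilities
    output = F.interpolate(output, size=(512, 512),
                           mode='bilinear', align_corners=False)
    index = torch.argmax(output, dim=1)      # (N, 512, 512) predicted class per pixel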

So far, everything seems to have gone smoothly, but in practice the lack of knowledge about the proper hyper-parameters really hurt the performance of this model. We had no information about the number of epochs used to train the model, learning rate scheduling was not mentioned, and, as noted before, we had no information about how much augmentation was applied to the images. The learning rate appeared to be an exception: 0.1 was mentioned in the article. We experienced, however, that this learning rate performed terribly: we were getting either maps of all ones or maps of all zeros. It could be that scheduling would have alleviated some of these problems, or that the combination of hyper-parameters we set did not work well with a learning rate of 0.1. Lowering the learning rate to 0.01 yielded the graph below, which barely improves the results.

We started tuning the learning rate further. We realized we needed quite a few epochs (>200) to start getting better results. We concluded that learning rates between 0.001 and 0.0001 worked best for our set of hyper-parameters.

In terms of performance, the article says it takes about 10 hours on a 6 GB GPU to train the model. Attempting to run this model on our poor CPUs was quickly dismissed, and a 2 GB GPU could not fit the model in memory. Therefore, we could only run the model on Google Colab, where training took around the described time. Unfortunately, we lost some of the runs due to disconnections.

Results

Below are the results for a learning rate of 0.01 and 500 epochs. With pixel errors of around 10% on the training set, while the paper achieved 6% on the test set, it is clear that the results are poor, which the final test-set predictions also show.

Attempting to predict the training set works quite nicely…
…But predicting on the test set could be better.

The best model we could fit was obtained with a learning rate of 0.0001, trained on the full training set. As can be seen in the graphs below, at epoch 900 the pixel error on the training set was below 3%. Based on visual inspection and to avoid overfitting, we submitted the model trained for 500 epochs to the challenge.

While training, the curves showing the training loss and the pixel error on the training set looked much better:

The predictions we generated look like this:

However, in the time between the publication of the original article and our reproduction attempt, the metrics used by the challenge have changed. They used to return the Warping Error, Rand Error and Pixel Error but they changed over to Rand Score Thin and Information Score Thin. Thus, we are unable to compare our performance to the article’s based on the original metrics. The challenge helpfully displays the new metrics also for older submissions, including the original article’s.

Below, we compare the errors found by the original article and ours.

The errors presented by the original article:

And using the new metrics:

From this, we can see that with both metrics we are unfortunately not close at all to the results from the original article (although we are not the worst submission of the challenge :) ).

This can be explained both by our relative inexperience in the field and by the minimal details provided by the original article. Starting with the first point, our inexperience showed in some of the problems we faced. With only a few days before the deadline, we ran into an apparent bug in the predict function that kept us from training models with different hyper-parameters. It turned out that we were just impatient: the model was learning, only slowly, and we simply needed to give it more epochs. We learned from this experience to be more patient with our models, and it also gave us a feeling for the influence some hyper-parameters have on learning. We also saw that the models did very well on the training set and much poorer on the test set, and thus lacked generalization. This could be the result of insufficient (too few) or poor (missing transformations) augmentation, for example. We take all this knowledge into our next projects.

Secondly, without the resources to perform an extensive hyper-parameter search, we were largely dependent on the information provided by the authors of the original article. Unfortunately, this was sparse in some areas. Beyond the lack of information on the hyper-parameters, how to properly handle the smaller images that come out of the model was also not expanded upon. This left us with no option but to try some things and see what worked best. As the results we were able to generate show, this is not the best approach.

We hope that this project shows the importance of proper documentation for the scientific deep learning community. It is certainly a lesson we take away from this project.

We want to thank our supervisor for his helpful advice, especially when we needed him the most.
