Transpose Convolutions vs Resizing Images in Segmentation of Mammograms
I have written previously about adapting my convolutional classifier for mammograms to a network that will do dense prediction for each pixel, resulting in the segmentation of the images into normal and abnormal.
I recently made two major changes, both of which drastically improved the results. The changes were using resizing rather than transpose convolutions to upsample and changing the prediction from a hardmax to a softmax.
When I started adapting the classifier to do dense prediction, I followed the blueprint from “Fully Convolutional Networks for Semantic Segmentation” by Long, et al. Very simply, this method removes the last fully connected layer from the classifier and replaces it with a succession of transpose convolutions to upsample the image back to full resolution. Skip connections are added from the later downsizing layers to the corresponding upsampling layers in order to allow the output to more closely resemble the input.
This worked well enough for images with abnormalities, but images without abnormalities resulted in output that resembled the input, as seen in figures 1 and 2. Note that at this point the prediction is a hard max on the logits.
In Figure 1 the prediction mostly matches the label, but the problem is in Figure 2, where the prediction seems to just highlight the parts of the input with higher pixel values. I assumed that this was caused by the skip connections introducing irrelevant features of the input into the upsampling section, and I tried to address this by adding 1x1 convolutions before the addition in order to extract only the relevant features from the earlier layer. Unfortunately, this did not help much, as seen in Figure 3.
Extracting layers from the model before the prediction showed very noisy images with a lot of artifacts, as seen in Figure 10.
My first concern was the artifacts, and that led me to find “Deconvolution and Checkerboard Artifacts” by Odena, et al., which suggests replacing transpose convolutions with nearest-neighbor resizes to reduce the artifacts produced by transpose convolutions. Not willing to fully commit, I replaced some of the intermediate transpose convolutions with resizes, but left the first two and the last two. This did improve the results, as well as significantly improving training speed, but did not help with the incorrect predictions for negative images.
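To make the source of the checkerboard pattern concrete, here is a minimal one-dimensional sketch (pure Python, not the code used for this model). It assumes an all-ones kernel of size 3 and a stride of 2, which are illustrative values rather than the settings of the actual network. Because the kernel size is not divisible by the stride, neighbouring output positions receive a different number of contributions, even when the input is perfectly flat.

```python
def transpose_conv1d(x, kernel, stride):
    """1-D transpose convolution: every input value stamps a scaled
    copy of the kernel into the output, `stride` positions apart."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


flat = [1.0] * 6           # a perfectly uniform feature map
ones = [1.0, 1.0, 1.0]     # illustrative kernel: size 3, all weights 1

upsampled = transpose_conv1d(flat, ones, stride=2)

# Away from the borders the output alternates between two levels,
# which is the 1-D analogue of a checkerboard.
print(upsampled[2:11])
```

The network can learn weights that compensate for this uneven overlap, but it has to work against the structure of the layer to do so, which is the argument Odena, et al. make for resizing first.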
While investigating the problems, I found it much more useful to view the softmax probabilities than the hard predictions. I also realized that this would be true for a radiologist using such a system, so I changed the output from the hard prediction to the softmax probabilities. These changes definitely improved the model, but the problem of normal images producing false positives remained, as seen in figures 4 and 5.
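The difference between the two outputs can be sketched for a handful of pixels. The logits below are made-up values for the two classes [normal, abnormal], and the function names are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hardmax(logits):
    """Index of the largest logit, i.e. the hard class prediction."""
    return max(range(len(logits)), key=lambda c: logits[c])


# Hypothetical [normal, abnormal] logits for three pixels.
pixel_logits = [[2.0, -1.0], [0.1, 0.0], [-0.5, 3.0]]

hard = [hardmax(z) for z in pixel_logits]
prob_abnormal = [softmax(z)[1] for z in pixel_logits]

print(hard)                                   # class index per pixel
print([round(p, 2) for p in prob_abnormal])   # roughly 0.05, 0.48, 0.97
```

The second pixel is labeled normal by the hard prediction even though the model is close to undecided about it. The probability map keeps that information, which is what makes it more useful for inspection.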
At this point I took a few days off from working on this project to think about these issues and read some papers on dense prediction. This led me to wonder what the point was of downsampling the input from 640x640x1 to 5x5x1024 only to have to reverse the process and upsample that back to the original resolution. This level of downsampling makes sense for a classifier, where we want to end up with a single prediction. In the absence of fully connected layers at the end of the network, though, the final layers of downsampling seemed counterproductive; their only advantage was that they allowed me to speed up training by reusing my pre-trained classifier.
Researching this line of inquiry led to papers on dilated convolutions, which provide a larger receptive field while still maintaining resolution. As I suspected that the skip connections were causing most of my problems, anything that allowed me to eliminate them seemed worth trying. I spent a few days tinkering with some different changes, and then decided to scrap the entire upsampling section and rebuild it from scratch. The resulting model is seen in Figure 11.
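As a rough illustration of that trade-off, the sketch below implements a one-dimensional dilated convolution with zero padding and computes the receptive field of a stack of stride-1 layers. The kernel weights and the length-20 signal are arbitrary choices for the example, not values from the model.

```python
def dilated_conv1d(x, kernel, dilation):
    """Zero-padded, stride-1 convolution (cross-correlation, as in most
    deep learning frameworks) whose taps sit `dilation` samples apart.
    The output has the same length as the input (odd kernel sizes)."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field (along one axis) of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        rf += d * (kernel_size - 1)
    return rf


signal = [0.0] * 20
signal[10] = 1.0  # unit impulse
out = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)

print(len(out))                                  # resolution is unchanged
print([i for i, v in enumerate(out) if v != 0])  # taps land 2 samples apart
print(receptive_field(3, [1, 1]), receptive_field(3, [2, 2]))
```

Two stacked 3-tap layers see 5 positions without dilation and 9 with a dilation of 2, while the output stays the same length as the input.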
In the previous models, conv 5.2 had been followed by a max pool, a 5x5 convolution with a stride of 5 to reduce the dimensions to 2x2x512, and then two fully connected layers, which in this case were 1x1 convolutions. That was all removed and replaced with two additional 3x3 convolutions dilated by 2. This changed the input for the upsampling section from 2x2x2048 to 20x20x512, which maintained enough resolution to remove both skip connections and the first transpose convolution.
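The spatial sizes can be checked with the standard output-size formula for convolution and pooling layers. This is only a worked sketch under stated assumptions: a 20x20 output from conv 5.2, a 2x2 max pool with stride 2, no padding on the 5x5 convolution, and a padding of 2 on the dilated convolutions. None of these is stated explicitly in the text; they are chosen because they reproduce the 2x2 and 20x20 sizes quoted above.

```python
def conv_out_size(n, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size along one axis of a convolution or pooling layer."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


size = 20  # assumed spatial size of the conv 5.2 output

# Old head: max pool (assumed 2x2, stride 2), then 5x5 convolution, stride 5.
pooled = conv_out_size(size, kernel=2, stride=2)
old_head = conv_out_size(pooled, kernel=5, stride=5)

# New head: two 3x3 convolutions dilated by 2 (padding of 2 assumed).
new_head = size
for _ in range(2):
    new_head = conv_out_size(new_head, kernel=3, padding=2, dilation=2)

print(pooled, old_head, new_head)  # 10 2 20
```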
Then, as per Odena, et al., I replaced the transpose convolutions with nearest-neighbor resizes followed by normal convolutions, and kept only two transpose convolutional layers: the second to last layer upsamples from 320x320 to 640x640 and is followed by a transpose convolution with a stride of 1 to smooth out artifacts, which is followed by the logits.
Figures 6 and 7 show the outputs for images when the first upsampling layer is a transpose convolution with stride 2. Note the checkerboard artifacts in Figure 6 and the large amount of noise in Figure 7.
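The resize-then-convolve building block can be sketched in a few lines of pure Python. This is an illustration of the general technique, not the layers of the actual model: the scale factor of 2 and the all-ones 3x3 kernel are assumptions made for the example.

```python
def resize_nearest(image, scale):
    """Upsample a 2-D list of rows by an integer factor, repeating each pixel."""
    return [
        [pixel for pixel in row for _ in range(scale)]
        for row in image
        for _ in range(scale)
    ]


def conv2d_same(image, kernel):
    """Stride-1, zero-padded convolution (cross-correlation) that keeps
    the spatial size of the input. Expects an odd-sized kernel."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * image[rr][cc]
            row.append(acc)
        out.append(row)
    return out


print(resize_nearest([[1, 2], [3, 4]], 2))

flat = [[1.0] * 3 for _ in range(3)]   # uniform 3x3 feature map
box = [[1.0] * 3 for _ in range(3)]    # illustrative 3x3 kernel
out = conv2d_same(resize_nearest(flat, 2), box)
interior = [row[1:-1] for row in out[1:-1]]
print(interior)  # every interior value is identical
```

Every output pixel away from the zero-padded border is computed from the same number of inputs, so a flat input stays flat. The upsampling itself no longer imposes a periodic pattern that the weights have to undo.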
Figures 8 and 9 show the outputs with all upsampling done with resizes and the only transpose convolutions as the last two pre-logit layers. These images were generated after initializing the weights for the downsampling section from a previous model and training the upsampling layers for only five epochs.
Despite the very small amount of training, we can see that the amount of noise has been drastically reduced in both the positive and negative images, the checkerboard artifacts have completely disappeared, and the predictions are much closer to the labels. Most importantly, the output for the negative image in Figure 9 highlights the areas in the image that appear more likely to contain abnormalities rather than resembling a blurred version of the input image.
I would like to reiterate that the results in figures 8 and 9 were generated after training the upsampling section of the network for five epochs, while the previous figures were generated by networks that had been trained for 20+ epochs. While the preliminary results are very promising, this is no guarantee that the model will perform as well after additional training.
 D. Levy, A. Jain, Breast Mass Classification from Mammograms using Deep Convolutional Neural Networks, arXiv:1612.00542v1, 2016
 N. Dhungel, G. Carneiro, and A. P. Bradley. Automated mass detection in mammograms using cascaded deep learning and random forests. In Digital Image Computing: Techniques and Applications (DICTA), 2015 International Conference on, pages 1–8. IEEE, 2015.
 N. Dhungel, G. Carneiro, and A. P. Bradley. Deep learning and structured prediction for the segmentation of mass in mammograms. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 605–612. Springer International Publishing, 2015.
 J. Arevalo, F. A. González, R. Ramos-Pollán, J. L. Oliveira, and M. A. G. Lopez. Representation learning for mammography mass lesion classification with convolutional neural networks. Computer methods and programs in biomedicine, 127:248–257, 2016.
 Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
 The Digital Database for Screening Mammography, Michael Heath, Kevin Bowyer, Daniel Kopans, Richard Moore and W. Philip Kegelmeyer, in Proceedings of the Fifth International Workshop on Digital Mammography, M.J. Yaffe, ed., 212–218, Medical Physics Publishing, 2001. ISBN 1–930524–00–5.
 Current status of the Digital Database for Screening Mammography, Michael Heath, Kevin Bowyer, Daniel Kopans, W. Philip Kegelmeyer, Richard Moore, Kyong Chang, and S. Munish Kumaran, in Digital Mammography, 457–460, Kluwer Academic Publishers, 1998; Proceedings of the Fourth International Workshop on Digital Mammography.
 Rebecca Sawyer Lee, Francisco Gimenez, Assaf Hoogi , Daniel Rubin (2016). Curated Breast Imaging Subset of DDSM. The Cancer Imaging Archive.
 Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045–1057.
 O. L. Mangasarian and W. H. Wolberg: “Cancer diagnosis via linear programming”, SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
 William H. Wolberg and O.L. Mangasarian: “Multisurface method of pattern separation for medical diagnosis applied to breast cytology”, Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193–9196.
 O. L. Mangasarian, R. Setiono, and W.H. Wolberg: “Pattern recognition via linear programming: Theory and application to medical diagnosis”, in: “Large-scale numerical optimization”, Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22–30.
 K. P. Bennett & O. L. Mangasarian: “Robust linear programming discrimination of two linearly inseparable sets”, Optimization Methods and Software 1, 1992, 23–34 (Gordon & Breach Science Publishers).
 K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556, 2014
 S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015
 C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
 C. Szegedy, S. Ioffe, V. Vanhoucke, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv:1602.07261v2, 2016
 K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, arXiv:1512.03385, 2015
 J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, arXiv:1506.02640, 2015
 R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, arXiv:1311.2524, 2013
 Odena, et al., “Deconvolution and Checkerboard Artifacts”, Distill, 2016. http://doi.org/10.23915/distill.00003
 J. Long, E. Shelhamer, T. Darrell, “Fully Convolutional Networks for Semantic Segmentation”, 2014, www.arxiv.org/abs/1411.4038
 F. Yu, V. Koltun, “Multi-Scale Context Aggregation by Dilated Convolutions”, 2015, https://arxiv.org/abs/1511.07122