ConvNets for Detecting Abnormalities in DDSM Mammograms
Breast cancer is the second most common cancer in women worldwide. About 1 in 8 U.S. women (about 12.4%) will develop invasive breast cancer over the course of their lifetimes. The five-year survival rates for stage 0 or stage I breast cancers are close to 100%, but the rates fall sharply for later stages: 93% for stage II, 72% for stage III and 22% for stage IV. Human recall for identifying lesions is estimated to be between 0.75 and 0.92, which means that as many as 25% of abnormalities may initially go undetected.
The DDSM is a well-known dataset of normal and abnormal scans, and one of the few publicly available datasets of mammography imaging. Unfortunately, the dataset is relatively small. To increase the amount of training data we extract the Regions of Interest (ROIs) from each image, perform data augmentation and then train ConvNets on the augmented data. The ConvNets were trained to predict whether a scan was normal or abnormal.
There exists a great deal of research into applying deep learning to medical diagnosis, but the lack of available training data is a limiting factor. [1, 4] use ConvNets to classify pre-detected breast masses by pathology and type, but do not attempt to detect masses from scans. [2,3] detect abnormalities using combinations of region-based CNNs and random forests.
The MIAS dataset is a very small set of mammography images, consisting of 330 scans of all classes. The scans are standardized to a size of 1024x1024 pixels. The size of the dataset made this unusable for training, but it was used for exploratory data analysis and as a supplementary test data set.
The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The DDSM images are saved as lossless JPEGs, an archaic format which has not been maintained for several decades.
The CBIS-DDSM collection includes a subset of the DDSM data selected and curated by a trained mammographer. The CBIS-DDSM images have been pre-processed and saved as DICOM images, and thus are of better quality than the DDSM images, but this dataset only contains scans with abnormalities. In order to create a dataset which can be used to predict the presence of abnormalities, the ROIs were extracted from the CBIS-DDSM dataset and combined with normal images taken from the DDSM dataset.
In order to create a training dataset of adequate size which included both normal and abnormal scans, images from the CBIS-DDSM dataset were combined with images from the DDSM dataset. While the CBIS-DDSM dataset included cropped and zoomed images of the Regions of Interest (ROIs), in order to have greater control over the data, we extracted the ROIs ourselves using the masks provided with the dataset.
For the CBIS-DDSM images the masks were used to isolate and extract the ROI from each image. For the DDSM images we divided the images into slightly overlapping tiles, excluding tiles which contained unusable data.
Both offline and online data augmentation were used to increase the size of the datasets.
Multiple datasets were created using different ROI extraction techniques and amounts of data augmentation. The datasets ranged in size from 27,000 training images to 62,000 training images.
Datasets 1 through 5 did not properly separate the training and test data and thus are not referenced in this work.
- Dataset 6 consisted of 62,764 images. This dataset was created to be as large as possible, and each ROI is extracted multiple times in multiple ways using both ROI extraction methods described below. Each ROI was extracted with fixed context, with padding, at its original size, and if the ROI was larger than our target image it was also extracted as overlapping tiles.
- Dataset 8 consisted of 40,559 images. This dataset used the extraction method 1 described below to provide greater context for each ROI. This dataset was created for the purpose of classifying the ROIs by their type and pathology.
- Dataset 9 consisted of 43,739 images. The previous datasets had used zoomed images of the ROIs, which was problematic as it required the ROI to be pre-identified and isolated. This dataset was created using extraction method 2 described below.
As Dataset 9 was the only dataset that did not resize the images based on the size of the ROI, we felt that it introduced the least artificial manipulation into the data, which led us to focus on training with this dataset.
ROI Extraction Methods for CBIS-DDSM Images
The CBIS-DDSM scans were of relatively large size, with a mean height of 5295 pixels and a mean width of 3131 pixels. Masks highlighting the ROIs were provided. The masks were used to define a square which completely enclosed the ROI. Some padding was added to the bounding box to provide context and then the ROIs were extracted at 598x598 and then resized down to 299x299 so they could be input into the ConvNet.
The ROIs had a mean size of 450 pixels and a standard deviation of 396. We designed our ConvNets to accept 299x299 images as input. To simplify the creation of the images, we extracted each ROI to a 598x598 tile, which was then sized down by half on each dimension to 299x299. 598x598 was just large enough that the majority of the ROIs could fit into it.
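The extraction described above can be sketched roughly as follows. This is a minimal illustration rather than the code used for the datasets: the function names, the `pad_frac` parameter, zero-padding at the image edges, and the 2x2 block averaging used for resizing are all assumptions made for the example.

```python
import numpy as np

TILE = 598   # extraction size stated in the text
INPUT = 299  # ConvNet input size stated in the text


def bounding_square(mask, pad_frac=0.1):
    """Return (top, left, side) of a padded square enclosing the nonzero mask pixels.

    pad_frac is a hypothetical padding amount, expressed as a fraction of the side.
    """
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    top, bottom = int(rows[0]), int(rows[-1]) + 1
    left, right = int(cols[0]), int(cols[-1]) + 1
    side = max(bottom - top, right - left)
    side = int(round(side * (1 + 2 * pad_frac)))
    center_y, center_x = (top + bottom) // 2, (left + right) // 2
    return center_y - side // 2, center_x - side // 2, side


def extract_tile(image, center_y, center_x, size=TILE):
    """Cut a size x size tile centred on a point, zero-padding past the image edges."""
    half = size // 2
    padded = np.pad(image, half, mode="constant")
    # In the padded array the original pixel (y, x) sits at (y + half, x + half),
    # so the tile centred on (center_y, center_x) starts at (center_y, center_x).
    return padded[center_y:center_y + size, center_x:center_x + size]


def downsample_by_two(tile):
    """Halve each dimension by averaging 2x2 blocks (one simple resizing choice)."""
    h, w = tile.shape
    return tile.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```

A 598x598 tile passed through `downsample_by_two` comes out at 299x299, matching the input size of the ConvNets.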
To increase the size of the training data, each ROI was extracted multiple times using the methodologies described below. The size and variety of the data were also increased by randomly horizontally flipping each tile, randomly vertically flipping each tile, randomly rotating each tile, and by randomly positioning each ROI within the tile.
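A small sketch of this kind of augmentation is shown below. The text does not state the rotation angles or how the random position was drawn, so the example assumes rotations in multiples of 90 degrees and a uniform shift that keeps the ROI inside the tile; both function names are hypothetical.

```python
import numpy as np


def augment(tile, rng):
    """Randomly flip and rotate a square tile (rotation limited to 90-degree steps here)."""
    if rng.random() < 0.5:
        tile = np.fliplr(tile)   # horizontal flip
    if rng.random() < 0.5:
        tile = np.flipud(tile)   # vertical flip
    quarter_turns = int(rng.integers(0, 4))
    return np.rot90(tile, quarter_turns)


def random_shift(roi_side, tile_side, rng):
    """Draw an offset for the tile centre that keeps the whole ROI inside the tile."""
    max_shift = max(0, (tile_side - roi_side) // 2)
    return int(rng.integers(-max_shift, max_shift + 1))
```

Calling `random_shift` once per axis and adding the results to the ROI centre before cutting the tile places the ROI at a random position within it.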
ROI Extraction Method 1
The analysis of the UCI data indicated that the edges of an abnormality were important in determining its pathology and type, and this was confirmed by a radiologist. Levy et al. [1] also report that the inclusion of context was an important factor for multi-class accuracy.
To provide maximum context, each ROI was extracted in multiple ways:
- The ROI was extracted at 598x598 at its original size.
- The entire ROI was resized to 598x598, with padding to provide context.
- If one dimension of the ROI was more than 1.5 times the other, the ROI was extracted as two tiles, each centered on one half of the ROI along its longer dimension.
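The rule for elongated ROIs can be illustrated with a short sketch. The function name and the (top, left, height, width) box convention are assumptions for the example; only the 1.5 ratio comes from the text.

```python
def tile_centers(top, left, height, width, ratio=1.5):
    """Return the centre point(s) at which tiles would be cut for one ROI box.

    An ROI whose longer side exceeds `ratio` times the shorter side yields two
    centres, one in the middle of each half along the longer dimension.
    """
    center_y = top + height / 2
    center_x = left + width / 2
    if height > ratio * width:
        return [(top + height / 4, center_x), (top + 3 * height / 4, center_x)]
    if width > ratio * height:
        return [(center_y, left + width / 4), (center_y, left + 3 * width / 4)]
    return [(center_y, center_x)]
```

For a box 400 pixels tall and 100 pixels wide this gives two centres, a quarter and three quarters of the way down the box.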
ROI Extraction Method 2
Method 1 relied on the size of the ROI to determine how to extract it, which requires having the ROI pre-identified. While this provided very clear images of each abnormality, the use of the size of the ROI to extract it introduced an element of artificiality into the data which made it not generalize well to classifying raw scans. This method was designed to eliminate that artificiality by never resizing the images, and just extracting the ROI using its center.
The size of the ROI was only used to determine how much padding to add to the bounding box before extraction. If the ROI was smaller than the 598x598 target we added more padding to provide greater variety when taking the random crops. If the ROI was larger than 598x598 this was not necessary.
- If the ROI was smaller than a 598x598 tile it was extracted with 20% padding on either side.
- If the ROI was larger than a 598x598 tile it was extracted with 5% padding.
- Each ROI was then randomly cropped three times using random flipping and rotation.
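The padding rule in the list above might be expressed as in the following sketch. It assumes that "smaller than a 598x598 tile" means both dimensions fit inside the tile, and that the padding is a percentage of each ROI dimension applied to both sides; the function names are hypothetical, and the random cropping step is not shown.

```python
TILE = 598  # target tile size stated in the text


def padding_percent(roi_height, roi_width, tile=TILE):
    """20% padding for ROIs that fit inside the tile, 5% for larger ROIs."""
    return 20 if max(roi_height, roi_width) < tile else 5


def padded_box(top, left, height, width, tile=TILE):
    """Grow an ROI box by the padding percentage on every side.

    Returns (top, left, height, width) of the padded box. Coordinates may fall
    outside the scan and would need clipping in real use.
    """
    percent = padding_percent(height, width, tile)
    pad_y = height * percent // 100
    pad_x = width * percent // 100
    return top - pad_y, left - pad_x, height + 2 * pad_y, width + 2 * pad_x
```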
Segmentation of Normal Images
The normal scans from the DDSM dataset did not have ROIs so were processed differently. As these images had not been pre-processed as had the CBIS-DDSM images they contained artifacts such as white borders, overlay text, and white patches of pixels used to cover up identifying personal information. Each image was trimmed by 7% on each side to remove the white borders.
To keep the normal images as similar as possible to the CBIS-DDSM images, different pre-processing was done for each dataset created. As datasets 6 and 8 resized the images based on the ROI size, to create the DDSM images for these datasets, each image was sized down by a random factor between 1.8 and 3.2, then segmented into 299x299 tiles with a variable stride between 150 and 200. Each tile was then randomly rotated and flipped.
For dataset 9, each DDSM image was cut into 598x598 tiles without being resized. The tiles were then each resized down to 299x299.
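As a rough sketch of the trimming and tiling steps for the normal scans, assuming a single-channel image held in a NumPy array: the function names are hypothetical, and the default stride equal to the tile size (non-overlapping tiles) is an assumption, since the stride used for dataset 9 is not stated.

```python
import numpy as np


def trim_borders(image, percent=7):
    """Remove `percent` of the image from each of the four sides."""
    h, w = image.shape[:2]
    dy, dx = h * percent // 100, w * percent // 100
    return image[dy:h - dy, dx:w - dx]


def tile_origins(height, width, tile, stride):
    """Top-left corners of every full tile that fits inside the image."""
    return [(y, x)
            for y in range(0, height - tile + 1, stride)
            for x in range(0, width - tile + 1, stride)]


def cut_tiles(image, tile=598, stride=598):
    """Cut the image into tile x tile pieces; a stride below `tile` gives overlap."""
    h, w = image.shape[:2]
    return [image[y:y + tile, x:x + tile]
            for y, x in tile_origins(h, w, tile, stride)]
```

For datasets 6 and 8 the same helper could be called with `tile=299` and a stride between 150 and 200 after the image has been sized down.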
To avoid the inclusion of images which contained the aforementioned artifacts or which consisted largely of black background, each tile was then added to the dataset only if it met upper and lower thresholds on mean and variance. The thresholds were selected by randomly sampling tiles and adjusted until most of the useless tiles were not included.
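A filter of this kind could look like the sketch below. The threshold values are purely illustrative placeholders for 8-bit pixel data, not the values used for the datasets, which were tuned by inspecting sampled tiles.

```python
import numpy as np


def keep_tile(tile, mean_range=(20.0, 200.0), var_range=(50.0, 5000.0)):
    """Accept a tile only if its pixel mean and variance fall inside both ranges.

    Mostly-black tiles fail the lower bounds; white patches and borders fail
    the upper bound on the mean or the lower bound on the variance.
    """
    mean = float(tile.mean())
    var = float(tile.var())
    return (mean_range[0] <= mean <= mean_range[1]
            and var_range[0] <= var <= var_range[1])
```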
In reality, only about 10% of mammograms are abnormal. In order to maximize recall, we weighted our dataset more heavily towards abnormal scans, with the balance at 83% normal and 17% abnormal.
The CBIS-DDSM dataset was already divided into training and test data, at 80% training and 20% test. As each ROI was extracted to multiple images, we kept this division in order to prevent different images of the same ROI from appearing in both the training and holdout datasets. The CBIS-DDSM test data was then divided evenly, in order, between validation and test data, which ensures that at most one ROI could have images in both of those datasets.
The normal images had no overlap, so were shuffled and divided among the training, test and validation data. The final divisions were 80% training, 10% test and 10% validation. It would have been preferable to have larger validation and test datasets, but we felt that it was easier to use the existing divisions and be sure that there was no overlap.
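The effect of dividing the data in order can be seen in a small sketch. It assumes that the images extracted from one ROI are stored next to each other, so that a single cut through the ordered list can separate the images of at most one ROI; the function names and the (roi_id, image) pair layout are hypothetical.

```python
def split_in_order(items):
    """Split an ordered list into two halves without shuffling."""
    middle = len(items) // 2
    return items[:middle], items[middle:]


def shared_rois(first, second):
    """ROI identifiers that have images in both halves."""
    return {roi for roi, _ in first} & {roi for roi, _ in second}


# Each entry is (roi_id, image_index); images of one ROI are adjacent.
images = [("roi1", 0), ("roi1", 1),
          ("roi2", 0), ("roi2", 1),
          ("roi3", 0), ("roi3", 1)]
first_half, second_half = split_in_order(images)
overlap = shared_rois(first_half, second_half)
```

In this example only `roi2` straddles the cut; a shuffled split could instead scatter images of every ROI across both halves.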
All images were labeled as 0 for negative/normal and 1 for positive/abnormal.
Our first thought was to train existing ConvNets, such as VGG or Inception, on our datasets. These networks were designed for and trained on ImageNet data, which contains images which are completely different from medical imaging. The ImageNet dataset contains 1,000 classes of images which have a far greater amount of detail than our scans do, and we felt that the large number of parameters in these models might cause them to quickly overfit our data and not generalize well. A lack of computational resources also made training these networks on our data impractical. For these reasons we designed our own architectures specifically for this task.
We started with a simple model based on VGG, consisting of stacked 3x3 convolutional layers alternating with max pools followed by three fully connected layers. Our model had fewer convolutional layers with fewer filters than VGG, and smaller fully connected layers. We also added batch normalization [15] after every layer. This architecture was then evaluated and adjusted iteratively, with each iteration making one and only one change and then being evaluated. We also evaluated techniques including Inception-style branches [16, 17, 18] and residual connections [19].
To compensate for the unbalanced nature of the dataset a weighted cross-entropy function was used, weighting positive examples higher than negative ones. The weight was considered a hyperparameter for which values ranging from 1 to 7 were evaluated.
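One common form of weighted cross-entropy for binary labels multiplies the loss term of the positive class by a constant. The sketch below uses that form as an assumption about what was meant; the function name, the `pos_weight` parameter name and the clipping constant are illustrative.

```python
import math


def weighted_cross_entropy(y_true, p_positive, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight.

    y_true holds 0/1 labels and p_positive the predicted probability of class 1.
    """
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)
```

With `pos_weight=6`, a missed positive example costs six times as much as an equally confident mistake on a negative example, which pushes the model towards higher recall.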
The best performing architecture will be detailed below.
The best performing model was model 220.127.116.11, consisting of nine convolutional layers and three fully connected layers. The convolutional layers used the philosophy of VGG, with 3x3 convolutions stacked and alternated with max pools.
The graphs also included online data augmentation and contrast adjustment, which were both evaluated.
Models 18.104.22.168 and 22.214.171.124 were the same architecture as 126.96.36.199, but with different scaling of the input data. Model 188.8.131.52 took the raw pixel values as input, 184.108.40.206 centered the inputs without scaling them, and 220.127.116.11 centered and scaled the input.
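The three input treatments can be sketched as follows. How the centering and scaling constants were chosen is not stated, so the example simply takes a mean and a standard deviation as arguments; the function names are hypothetical.

```python
import numpy as np


def raw_input(pixels):
    """Pass pixel values through unchanged, as floats."""
    return pixels.astype(np.float32)


def centered_input(pixels, mean):
    """Subtract a mean so the values are centred around zero, without scaling."""
    return pixels.astype(np.float32) - mean


def centered_scaled_input(pixels, mean, std):
    """Subtract a mean and divide by a standard deviation."""
    return (pixels.astype(np.float32) - mean) / std
```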
Reduced versions of VGG-16 and Inception v4 were also trained on the datasets. Training the full models required more time and computation than we had available, so we adjusted the architectures by reducing the numbers of filters in each layer, as well as adjusting the models to take 299x299 images as inputs.
Table 1 shows the accuracy and recall on the test dataset for selected models trained for binary classification. The most-frequent baseline accuracy for the datasets was 0.83. Note that a recall of 1.0 with accuracy around 0.17 indicates that the model is predicting everything as positive, while an accuracy near 0.83 with a very low recall indicates that the model is predicting everything as negative.
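The two degenerate cases can be checked with a short worked example on a synthetic label set with the same 83% to 17% balance. The helper name and the 100-example label list are made up for the illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary 0/1 labels and predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    recall = true_positives / positives if positives else 0.0
    return correct / len(y_true), recall


labels = [0] * 83 + [1] * 17            # 83% normal, 17% abnormal
all_positive = accuracy_and_recall(labels, [1] * 100)
all_negative = accuracy_and_recall(labels, [0] * 100)
```

Here `all_positive` has accuracy 0.17 with recall 1.0, and `all_negative` has accuracy 0.83 with recall 0.0, so neither metric is informative on its own.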
Figure 3 shows the training metrics for model 18.104.22.168 trained on dataset 9 for binary classification. This model was trained with a cross entropy weight of 6, which compensates for the unbalanced nature of the dataset and encourages the model to focus on positive examples.
Table 2 shows the accuracy and recall of selected models on the MIAS dataset. Because the MIAS dataset was completely separate from, and unrelated to, the DDSM datasets, these results should indicate how well the models will perform on completely unrelated images.
Effect of Cross Entropy Weight
A weighted cross entropy was used to improve recall and counter the unbalanced nature of our dataset. Increasing the weight improved recall at the expense of precision. With a cross entropy weight of 1 to 3, our models tended to initially learn to classify positive examples, but after 15–20 epochs started to predict everything as negative. A cross entropy weight of 4 to 7 allowed the model to continue to predict positive examples and greatly reduced the volatility of the validation results. Cross entropy weights above 7 resulted in improved recall at the expense of precision.
Effect of Decision Threshold
A binary softmax classifier has a default threshold of 0.50. We used precision-recall (PR) curves during training to evaluate the effects of adjusting the threshold. We found that we could easily trade off precision and recall by adjusting the threshold, allowing us to achieve precision or recall close to 1.0. We can also see the effects of using different thresholds on recall in figure 8.
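The trade-off can be illustrated on a handful of made-up scores. The function name and the example data are hypothetical; a prediction is counted as positive when the predicted probability is at or above the threshold.

```python
def precision_recall_at(y_true, scores, threshold=0.5):
    """Return (precision, recall) when scores >= threshold are called positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.45, 0.7, 0.9]
low = precision_recall_at(labels, scores, threshold=0.3)
default = precision_recall_at(labels, scores, threshold=0.5)
high = precision_recall_at(labels, scores, threshold=0.65)
```

On this toy data, lowering the threshold to 0.3 raises recall to 1.0 at a precision of 0.6, while raising it to 0.65 gives a precision of 1.0 with recall of two thirds.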
Figure 4 is the curve for model 22.214.171.124b.98 after 40 epochs of training. The points on the lines indicate the threshold of 0.50. Precision is on the y-axis and recall on the x-axis.
While we were able to achieve better than expected results on datasets 6 and 8, the artificial nature of these datasets caused the models to not generalize to the MIAS data. Models trained on dataset 9, which was constructed specifically to avoid these problems, did not achieve accuracy or recall as high as models trained on other datasets, but generalized to the MIAS data better.
While we were able to achieve recall above human performance on the DDSM data, the recall on the MIAS data was significantly lower. However, as a proof of concept, we feel that we have demonstrated that ConvNets can successfully be trained to predict whether mammograms are normal or abnormal.
We should note that we cannot eliminate the possibility that the network was using information from each image unrelated to the presence of abnormalities. The fact that the positive and negative images came from different datasets makes it possible that features like the contrast of the images or the highest pixel values played an important role. We are currently attempting to address this issue.
The life and death nature of diagnosing cancer creates many obstacles to putting a system like this into practice. We feel that using a system to output the probabilities rather than the predictions would allow such a system to provide additional information to radiologists rather than replacing them. In addition the ability to adjust the decision threshold would allow radiologists to focus on more ambiguous scans while devoting less time to scans which have very low probabilities.
Future work would include creating a system which would take an entire, unaltered scan as input and analyse it for abnormalities. We are currently working on applying semantic segmentation to the scans, using the masks as labels. Other options include sliding windows, FCNs, YOLO, etc.
The source code for exploratory data analysis and creation of the datasets is available in this GitHub repository: https://github.com/escuccim/mias-mammography
The source code used to create and train the models is available here: https://github.com/escuccim/mammography-models
A training dataset not referenced in this work, but created using the methods described, is available on Kaggle. This dataset is similar to dataset 9, but with the criteria used to exclude tiles relaxed, resulting in the inclusion of tiles which do contain background. https://www.kaggle.com/skooch/ddsm-mammography
[1] D. Levy, A. Jain, Breast Mass Classification from Mammograms using Deep Convolutional Neural Networks, arXiv:1612.00542v1, 2016
[2] N. Dhungel, G. Carneiro, and A. P. Bradley. Automated mass detection in mammograms using cascaded deep learning and random forests. In Digital Image Computing: Techniques and Applications (DICTA), 2015 International Conference on, pages 1–8. IEEE, 2015.
[3] N. Dhungel, G. Carneiro, and A. P. Bradley. Deep learning and structured prediction for the segmentation of mass in mammograms. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 605–612. Springer International Publishing, 2015.
[4] J. Arevalo, F. A. González, R. Ramos-Pollán, J. L. Oliveira, and M. A. G. Lopez. Representation learning for mammography mass lesion classification with convolutional neural networks. Computer methods and programs in biomedicine, 127:248–257, 2016.
[5] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[6] The Digital Database for Screening Mammography, Michael Heath, Kevin Bowyer, Daniel Kopans, Richard Moore and W. Philip Kegelmeyer, in Proceedings of the Fifth International Workshop on Digital Mammography, M.J. Yaffe, ed., 212–218, Medical Physics Publishing, 2001. ISBN 1–930524–00–5.
[7] Current status of the Digital Database for Screening Mammography, Michael Heath, Kevin Bowyer, Daniel Kopans, W. Philip Kegelmeyer, Richard Moore, Kyong Chang, and S. Munish Kumaran, in Digital Mammography, 457–460, Kluwer Academic Publishers, 1998; Proceedings of the Fourth International Workshop on Digital Mammography.
[8] Rebecca Sawyer Lee, Francisco Gimenez, Assaf Hoogi, Daniel Rubin (2016). Curated Breast Imaging Subset of DDSM. The Cancer Imaging Archive.
[9] Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository, Journal of Digital Imaging, Volume 26, Number 6, December, 2013, pp 1045–1057.
[10] O. L. Mangasarian and W. H. Wolberg: “Cancer diagnosis via linear programming”, SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
[11] William H. Wolberg and O. L. Mangasarian: “Multisurface method of pattern separation for medical diagnosis applied to breast cytology”, Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193–9196.
[12] O. L. Mangasarian, R. Setiono, and W. H. Wolberg: “Pattern recognition via linear programming: Theory and application to medical diagnosis”, in: “Large-scale numerical optimization”, Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22–30.
[13] K. P. Bennett & O. L. Mangasarian: “Robust linear programming discrimination of two linearly inseparable sets”, Optimization Methods and Software 1, 1992, 23–34 (Gordon & Breach Science Publishers).
[14] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556, 2014
[15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[18] C. Szegedy, S. Ioffe, V. Vanhoucke, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv:1602.07261v2, 2016
[19] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, arXiv:1512.03385, 2015
[20] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, arXiv:1506.02640, 2015
[21] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, arXiv:1311.2524, 2013