Uncovering hidden clues in noncancerous mammogram images with deep learning

An overview of mammogram research with deep learning for pushing boundaries on earlier cancer risk management and tumor detection

Anne Marie Ou
Slalom Technology
15 min read · Dec 21, 2020


Using only noncancerous images to detect the cancer status of a patient

Background

What if it were possible to know if you have a higher risk for cancer based on mammograms rather than lifestyle factors alone? Imagine identifying cancer risk in images without a visible tumor, and being able to proactively monitor an at-risk patient. These are the questions that our principal researcher, a leader in epidemiology research focused on breast cancer, wanted to explore in mammogram data. She suspects there are potential cancer risk indicators in mammograms before a tumor is visible.

There are now enough data and computing resources to leverage machine learning to answer this question, and to demonstrate what is possible when computer vision is applied to mammograms to identify patterns not visible to the human eye. Practitioners are trained to analyze one image at a time and identify a set of visual cues of tumor presence. We want to leverage mass datasets, in this case thousands of mammogram images, to define patterns that indicate cancer risk; this is only possible with deep learning. This article will provide details on how Slalom and researchers partnered to:

  • Identify the appropriate dataset
  • Create a technical approach
  • Apply machine learning
  • Identify findings and future work

With the findings, we hope to move toward improving patient outcomes and augmenting trained practitioners in new ways. If there are truly clues in these images, our research could support earlier detection to improve outcomes for the millions of men and women diagnosed with breast cancer.

The Data

We started our project with the public Digital Database for Screening Mammography (DDSM) dataset for testing and development of our deep learning pipeline. We selected DDSM because it is publicly available and labeled. This dataset has 2,620 scanned film mammograms. The images have basic metadata, with each image labeled as normal, benign, or malignant, and the patient data is limited to age, gender, and tissue density. The files are in DICOM format, and we converted them to PNG to read them in raster form in Python. Given that this dataset was released in 1997 and was produced by a few different mammogram machines, there was variation in the resolution and quality of the images. We reviewed the general dataset and selected 4,000 x 6,000 pixels as the normalized size for all the images.
We plan to share our code in a proper open source repo in the future. The code snippets we feature show some of the parameters we used, but all image processing involves tuning and analysis to find the parameters that best fit your dataset. We dedicated time to experiment and tune parameters that worked best for preprocessing. The code shows parameters that work for DDSM, but we encourage you to explore and change them to be optimal for your set of images.
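As a rough illustration of the conversion step, the sketch below reads a DICOM file, rescales it to 8-bit, resizes it to our normalized raster size, and writes a PNG. The pydicom dependency, function name, and rescaling choices are assumptions for illustration, not our exact pipeline code.

```python
import cv2
import numpy as np
import pydicom  # assumed dependency for reading DICOM files

# Size we normalized DDSM images to, given as (width, height) for cv2.resize.
TARGET_SIZE = (4000, 6000)

def dicom_to_png(dicom_path: str, png_path: str) -> None:
    """Read a DICOM mammogram, rescale to 8-bit, resize, and save as PNG."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)

    # Rescale intensities to 0-255 so the raster can be stored as an 8-bit PNG.
    pixels -= pixels.min()
    pixels /= max(pixels.max(), 1e-6)
    pixels = (pixels * 255).astype(np.uint8)

    # Normalize every image to the same raster size before preprocessing.
    resized = cv2.resize(pixels, TARGET_SIZE, interpolation=cv2.INTER_AREA)
    cv2.imwrite(png_path, resized)
```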

Only images without a cancerous tumor were introduced to the model for training and testing. Our labels were defined as noncancerous patient and cancerous patient. For example, if a patient had cancer, only the mammogram without a visible tumor was seen by the model, and that image carried the label cancerous patient.

The Approach

Image Preprocessing

In the medical imaging community, labels and stickers are quite often applied to the mammogram. We want to remove these distractions because they tend to be bright intrusions that can confuse our deep learning model, since tumors and denser tissue are indicated by increased brightness. We also want to understand the differences between images, which requires registration: systematically aligning two images based on shared features. We chose a model with 1x1 convolutional layers and prioritized preprocessing heavily because of the importance of catching very slight patterns.

Working with mammograms means working with a wide variety of specimens. We had to adjust a few parameters to make our processing pipeline work well for most of our images. On this dataset, several of these steps had a failure rate of about 2–5% due to severe irregularities in the mammograms that could not be reconciled in our pipeline and had to be handled with manual tuning.

Normalizing Background

Image 1: Background noise present before normalization and image after normalization

DDSM is composed of film mammograms, where differences in machine settings and processing techniques introduced a wide variety of patterns in the background of an image. In the first row of Image 1, the mammogram looks completely unchanged, but if you look at the binary level, you can see a grid pattern that embeds a lot of noise. Given that the background comprises a significant portion of the image, it's important to remove the considerable brightness and pattern contribution of the unprocessed background.

This method for normalizing the background involves selecting random parts of the image and assigning each sampled pixel to a cluster based on its value using K-means. The background separates from the breast tissue because the background pixels are mostly dark and form a population that belongs to the background cluster, while the breast tissue pixel values are an order of magnitude higher and generate a separate cluster.
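A minimal sketch of this idea, using scikit-learn's KMeans on randomly sampled pixel values. The cluster count, sample size, and the decision to zero out pixels at or below the darkest cluster center are illustrative assumptions, not our exact pipeline code.

```python
import numpy as np
from sklearn.cluster import KMeans

def normalize_background(image: np.ndarray, n_samples: int = 50_000,
                         n_clusters: int = 2, seed: int = 0) -> np.ndarray:
    """Cluster randomly sampled pixel values; treat the darkest cluster center
    as the background level and zero out everything at or below it."""
    rng = np.random.default_rng(seed)
    flat = image.reshape(-1, 1).astype(np.float32)
    idx = rng.choice(flat.shape[0], size=min(n_samples, flat.shape[0]), replace=False)

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(flat[idx])
    background_level = kmeans.cluster_centers_.min()

    cleaned = image.copy()
    cleaned[cleaned <= background_level] = 0  # flatten the noisy background pattern
    return cleaned
```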

Artifact Removal

Image 2: Visual example of sticker artifact removal

Quite often, technicians apply labels that are useful for documentation but are quite large and can be a distraction for the deep learning model (see above: Image 2). Bounding boxes allowed us to remove these artifacts in a faster, more automated way: in the past, our researchers have had to manually tape over these artifacts and re-scan the images.

We used bounding boxes from the Python image processing library OpenCV (cv2) to detect and remove the labels. This method involves finding contours. The labels were generally bright, but in our dataset they sometimes had faded letters and borders, so we binarized the images to help bring out the full location of these artifacts. After the locations were determined, we set the pixels in these areas to black. Parameters to tune in this step include the threshold for binarizing the image and the minimum area threshold, which sets the minimum size of a bounding box so that an overly sensitive box doesn't highlight every irregularity in the image.
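A hedged sketch of this step with OpenCV, assuming an 8-bit grayscale image; the binarization threshold and minimum-area value are placeholders to tune for your own images.

```python
import cv2
import numpy as np

def remove_label_artifacts(image: np.ndarray, binarize_thresh: int = 30,
                           min_area: int = 5_000) -> np.ndarray:
    """Binarize, find contours, and black out bright label/sticker regions.
    Uses the OpenCV 4.x findContours signature; thresholds are placeholders."""
    _, binary = cv2.threshold(image, binarize_thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # The breast is usually the largest contour; keep it and examine the rest.
    contours = sorted(contours, key=cv2.contourArea, reverse=True)
    cleaned = image.copy()
    for contour in contours[1:]:
        if cv2.contourArea(contour) >= min_area:  # ignore tiny irregularities
            x, y, w, h = cv2.boundingRect(contour)
            cleaned[y:y + h, x:x + w] = 0  # blank out the label's bounding box
    return cleaned
```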

Pectoral Muscle Filtering

Image 3: Visual example of muscle filtering

The pectoral muscle varies widely in brightness and in its proportion of the mammogram. We used Otsu thresholding to bring out the brightest section. The difference in brightness separates the pectoral muscle because that tissue is denser and therefore brighter in a mammogram. This method was quite successful for most mammograms and let us isolate the pectoral muscle section. Removing the pectoral muscle from the images was critical to increasing our model accuracy, since the muscle is irrelevant to the primary question we are trying to answer.

We flipped all the mammograms so that the pectoral muscle sits on the left, then applied the Otsu method to a subsection of the image where we expected the muscle to be, to gauge the distribution of pixel values on each side of the threshold. The proposed threshold is then used to generate a mask over the whole image, marking which pixels fall above and below it. After the mask identifies the proposed muscle region, we use the scikit-image Python library to generate connected labels and fully section off the muscle portion.
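A simplified sketch of this step with scikit-image. The corner fraction used as the "suspected muscle" subsection, and the assumption that the muscle touches the top-left corner after flipping, are illustrative choices rather than our tuned settings.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label

def remove_pectoral_muscle(image: np.ndarray, corner_frac: float = 0.3) -> np.ndarray:
    """Assumes a grayscale mammogram already flipped so the muscle is at top-left.
    Thresholds a corner subsection with Otsu, then removes the connected bright
    region touching that corner."""
    h, w = image.shape
    corner = image[: int(h * corner_frac), : int(w * corner_frac)]
    thresh = threshold_otsu(corner)

    # Mask the whole image with the corner-derived threshold, then label
    # connected components of the bright pixels.
    labels = label(image > thresh)

    muscle_label = labels[0, 0]  # the component containing the top-left corner
    cleaned = image.copy()
    if muscle_label != 0:
        cleaned[labels == muscle_label] = 0
    return cleaned
```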

Since a significant portion of mammograms did have areas of high density near the pectoral muscle, thresholding could leave the pectoral muscle too connected to the breast tissue, and the method described above would not work as intended. The figure below (Image 4) shows a very dense breast where the pectoral muscle cannot be separated through threshold transformations.
In these cases, we separated the muscle with a straight line drawn across a subsection of the image. For our final pipeline, we applied both methods to every image. In the figure labeled 'Blob - pectoral muscle', you can see the dense tissue causing the threshold technique to label too much of the image as pectoral muscle (white). When the thresholding technique places more than 10% of its pixels outside the region given by the line-drawn method, we suspect the tissue is too dense and is causing an excessive partition of pectoral muscle. In those cases, we select the line-drawn output, shown as 'linecropped image' in the figure below, for preprocessing; a small sketch of this decision follows Image 4.

Image 4: Example of high density, connected pectoral muscle
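The fallback decision could look something like the sketch below, assuming both methods return boolean masks of the proposed muscle region. The function and argument names are hypothetical; the 10% tolerance mirrors the heuristic described above.

```python
import numpy as np

def choose_pectoral_mask(blob_mask: np.ndarray, line_mask: np.ndarray,
                         tolerance: float = 0.10) -> np.ndarray:
    """Pick between the threshold ('blob') mask and the line-drawn mask.
    Both arguments are boolean arrays marking the proposed muscle region."""
    blob_pixels = blob_mask.sum()
    outside_line = np.logical_and(blob_mask, ~line_mask).sum()

    # If more than ~10% of the blob spills past the drawn line, the tissue is
    # likely too dense for thresholding; trust the line-drawn separation instead.
    if blob_pixels == 0 or outside_line / blob_pixels > tolerance:
        return line_mask
    return blob_mask
```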

Image Registration

Image 5: Example of transformations supporting alignment and study of asymmetries

One of our goals was to lay the foundation for a way to eventually compare sets of mammograms and capture temporal changes in tissue. We planned on computationally differencing mammogram images to accentuate the areas of greatest tissue change over time. With the DDSM dataset, we had the left and right breasts for comparison and alignment testing. For the desired longitudinal application, image registration is applied to understand the asymmetries between the left and right breast tissue. Asymmetry between the breasts is a documented risk indicator of potentially cancerous abnormalities.

During the mammogram capture process, compression and rotation cause variation in the angle of image capture. Simply overlaying mammograms without alignment would oversimplify the comparative properties of the images and produce misleading representations of asymmetry in the breast tissue.

Published research has leveraged the nipple and breast outline as reference points to align images. We opted to use the entire image to generate alignment points, since it was very important to compare tissue across the whole breast. Image 5 shows the moving image (right breast) being transformed to align with the fixed image (left breast). We used SimpleITK (SITK), an efficient Python package that offers a variety of segmentation and registration algorithms. We selected Mattes Mutual Information as the metric and reduced the difference between the two images via rigid transformations. Tuning the sampling rate and smoothing parameters helped ensure quality transformations.
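A condensed sketch of a rigid, mutual-information-based registration with SimpleITK. The learning rate, iteration count, histogram bins, sampling percentage, and pyramid levels shown here are illustrative starting points rather than our tuned values.

```python
import SimpleITK as sitk

def register_rigid(fixed_path: str, moving_path: str) -> sitk.Image:
    """Rigidly align the moving image (e.g., right breast) to the fixed image
    (left breast) by maximizing Mattes mutual information."""
    fixed = sitk.ReadImage(fixed_path, sitk.sitkFloat32)
    moving = sitk.ReadImage(moving_path, sitk.sitkFloat32)

    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetMetricSamplingStrategy(reg.RANDOM)
    reg.SetMetricSamplingPercentage(0.01)  # sample ~1% of pixels per iteration
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetOptimizerAsGradientDescent(learningRate=1.0, numberOfIterations=200)
    reg.SetOptimizerScalesFromPhysicalShift()

    # Start from a center-aligned rigid (rotation + translation) transform.
    initial = sitk.CenteredTransformInitializer(
        fixed, moving, sitk.Euler2DTransform(),
        sitk.CenteredTransformInitializerFilter.GEOMETRY)
    reg.SetInitialTransform(initial, inPlace=False)

    # Coarse-to-fine pyramid: shrink factors and smoothing sigmas per level.
    reg.SetShrinkFactorsPerLevel([4, 2, 1])
    reg.SetSmoothingSigmasPerLevel([2, 1, 0])
    reg.SmoothingSigmasAreSpecifiedInPhysicalUnitsOn()

    transform = reg.Execute(fixed, moving)
    return sitk.Resample(moving, fixed, transform, sitk.sitkLinear, 0.0)
```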

Shannon entropy is defined in equation (1) as H = -Σ p log p, calculated over a distribution of sampled points; for the joint entropy H(A, B), p is the probability of co-occurrence of the pixel values i and j across the two images. The Bayesian expressions in equations (2), (3), and (4), which relate the joint distribution to each image's own distribution, quantify how well pixels in one image predict pixels in the other, and therefore the similarity between the two images.
Let us call the left breast mammogram image A and the right breast mammogram image B. The mutual information term I is derived from the distributions of both images and reflects the entropy of each: I(A; B) = H(A) + H(B) - H(A, B) = H(A) - H(A|B) = H(B) - H(B|A), where H(A|B) and H(B|A) are the conditional entropies. These terms express how informative the pixel distribution of one image is about the pixel distribution of the other. This is where sampling matters: you do not want to oversample and cause a lack of convergence, yet you want to sample enough that the major features line up (e.g., the nipple and major areas of breast tissue). The algorithm solves for the transformation that maximizes I.

There are quite a few transformations we can use to align the two images and maximize the mutual information term. Under the hood, the solver uses gradient descent to iteratively propose transformation parameters that maximize mutual information (equivalently, minimize its negative). We can shift the image in the x and y directions as well as rotate it to get the two images to align as closely as possible. Because of the rotation, an interpolation algorithm is needed to maintain the same resolution in the output image. It is best to start with a rigid registration step and see how those results play out.

In our case, we elected to use a 2D transformation that proposes an x and y shift as well as a rotation. The solver also iterates over scale factors and a smoothing kernel radius so that it can try to match the images without considering the full resolution at every iteration. You can tune these for performance and clarity. The number of histogram bins can also be tuned for more precise alignment of the pixel-value probability distributions that define H. There are many parameters to tune depending on how entropic your images are and the range of pixel values you are working with.
If there is a significant need for deformable transformations to improve registration, you can add a separate step for contraction and expansion of certain sections of the image. More complex versions of this method involve a z-axis, which would leverage a multivariate mutual information definition. This methodology corrects for variations caused by changes in the angle of capture or by breast compression that is very uneven between the images being registered. We opted to stay with 2D transformations for this project, but plan to explore these techniques in the next evolution of this effort.

Model Selection

Our dataset was quite limited in size for deep learning applications. To work within the constraint of only around 3,000 images, we opted for a simpler CNN model. To choose which model to proceed with, we considered and tested ResNet50, VGG19, and Xception, which have precedent in medical research. However, after testing these models, many of them resulted in overfitting.

Therefore, the transfer learning model we chose for this project was ResNet18, which addressed the overfitting challenges that came with the limited size of the dataset. We were also particularly interested in testing ResNet18 because of its reputable research results in a groundbreaking study by Connie Lehman.
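For reference, a minimal ResNet18 transfer learning setup in PyTorch might look like the sketch below. The single-logit head and BCE loss reflect our binary cancerous vs. noncancerous patient labels, but the framework choice and head design are shown here only as one reasonable option, not as our exact code.

```python
import torch.nn as nn
from torchvision import models

def build_resnet18(pretrained: bool = True) -> nn.Module:
    """ResNet18 backbone with a single-logit head for the binary
    cancerous vs. noncancerous patient label."""
    model = models.resnet18(pretrained=pretrained)  # ImageNet weights as a starting point
    model.fc = nn.Linear(model.fc.in_features, 1)   # one logit; pair with BCEWithLogitsLoss
    return model

model = build_resnet18()
criterion = nn.BCEWithLogitsLoss()
```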

Training

Due to the size of the images, the constraint we were managing during training was memory. The size-normalized images are 6,000 x 4,000 x 3 pixels, each encoded with granular and detailed breast tissue information that would be lost with excessive downsampling. Many modern datasets are scaled down to 255x255 pixels to avoid the large memory growth associated with convolution layers on larger and larger images. We also included rotations and translations to augment our dataset. These smaller images are especially advantageous because they allow for GPU and TPU hardware acceleration, which are typically only able to store models up to 13MB in size. However, with the DDSM mammogram dataset, where every pixel matters, these image sizes are too small to achieve a significant result. Over the course of our training experiments, we found that both small CNNs and large transfer learning models either overfit on the training data or did not converge on the 3,000 255x255 mammogram images.
To accommodate a larger image size, we leveraged Horovod distributed training with the smallest of the transfer learning models, ResNet18. Because virtual machines are less constrained by memory than GPUs, we were able to prototype more image sizes and found that the transfer learning models performed best when we scaled the images down by a factor of 5, to 1,200 x 800 x 3 pixels. Using a batch size of 32 on 4 AWS ml.m5.16xlarge instances, we were able to train a model over 100 epochs that achieved a significant .59 AUC result.
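As a rough sketch, the distributed loop could be wired up with Horovod's PyTorch API as shown below. Here `build_resnet18` refers to the sketch in the Model Selection section, `train_loader` stands in for a DataLoader over the preprocessed mammograms, and the learning-rate scaling and optimizer choice are illustrative assumptions.

```python
import horovod.torch as hvd
import torch

hvd.init()  # one process per worker across the training instances
device = (torch.device('cuda', hvd.local_rank())
          if torch.cuda.is_available() else torch.device('cpu'))

model = build_resnet18().to(device)  # from the sketch in Model Selection
criterion = torch.nn.BCEWithLogitsLoss()

# Scale the learning rate by the number of workers, as is common with Horovod.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(100):
    for images, labels in train_loader:  # placeholder DataLoader over 1,200 x 800 images
        images, labels = images.to(device), labels.to(device).float()
        optimizer.zero_grad()
        loss = criterion(model(images).squeeze(1), labels)
        loss.backward()
        optimizer.step()
```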

Model Results

Our model showed there is significant information in the complementary breast that signals cancer presence. These findings have potentially profound implications: they suggest there is enough information in noncancerous tissue that predictive performance is comparable to the state of the art, providing significant motivation to rigorously study this often overlooked region of data.

We achieved .59 AUC with a small fraction of the dataset size used in other research models, and with only noncancerous tissue used for prediction. These results show there is a clear signal in images with no tumor present that informs the cancer status of a person. Most research relies on tumor identification, determining cancer status only after a tumor manifests. In comparison, notable published deep learning research that focuses on cancer prediction without tumors reports AUCs ranging from .62 to .67, with over ten times more training images than in our case.

Given the promising results with the DDSM dataset, we have demonstrated that significant information exists in noncancerous breast tissue and that this type of data should be examined further. We reviewed our findings with our collaborating epidemiology researchers, who were intrigued by them. We plan to continue the collaboration and the research detailed in the next section.

Image 6: Final model results

Future Work

In the next phase of our project, we intend to leverage the Harvard Nurses Study dataset, which features a group of women who had their mammograms taken periodically over several years. This dataset allows us to trace tissue developments over 20 years and potentially detect cancer risk through sequential images before a tumor is confirmed. With this goal in mind, we structured the work to target changes in images and to use solely noncancerous images to determine the cancer status of the patient. Furthermore, the Harvard Nurses Study dataset has much more metadata for each patient, covering 90+ lifestyle and risk factors that we could use to correlate with cancer risk.

The pipeline designed in this phase of the project will be leveraged for ease of preprocessing and data management when we get access to the Harvard Nurses Study dataset. We will be able to efficiently adapt our code to process a different dataset or different variations of training sets, and we plan to share our learnings with the community by hosting an open source repo for full transparency. We have essentially formed the foundation of a toolkit to use on subsequent mammogram analysis projects, most notably this project's Phase II: identifying asymmetries in mammograms taken over time to predict breast cancer risk, an analysis that has not been performed before and with which we expect to set new benchmarks in breast cancer risk prediction.

Findings from this project have shown significant promise for further exploring computer vision applications that find cancer risk in noncancerous mammograms. We have identified the following opportunities to continue this collaboration and our learning about cancer detection.
One possibility for further work is building the application to include a review step where partner teams (consisting of radiologists and epidemiologists) can see the risk score of a mammogram, compare the different images, and document their expertise with annotations.

Learning from the model and helping the model learn are two separate goals that involve strategic data storage and separate analyses. Our model is hosted on the cloud, leveraging AWS tools to construct a feedback loop that documents useful information for data scientists to improve the model in the future. We want to see clinicians hypothesizing and making connections between lifestyle factor data and the scores from the model. These crucial conversations help us understand the risk factors behind the mammogram patterns that lead the deep learning model to associate cancer presence. On the other hand, we also want to study the images the model does very well or poorly on, to inform future training and advancements in image-based cancer detection.

Deep learning is a natural fit for medical imaging applications because biological variation and minute tissue patterns hold so much information. As researchers explore deep learning applications for targeted questions, there is enormous potential for uncovering signals in images, not visible to the human eye, that teach us about cancer presence and development. Deep learning is particularly useful for leveraging mass datasets to pick up on these biological cues. If these nuances can be identified in mammograms, we have a real opportunity to detect cancer risk much earlier than previously thought possible.

Usually, when a clinician reviews a noncancerous mammogram, the patient is sent home without instructions for risk-based monitoring. The patient may show no real signs of a cancerous tumor but could still be at a level of risk that warrants more frequent mammogram visits. Currently, a clinician can't gauge that risk effectively and cannot prescribe a more conservative monitoring plan. Research and improvement in deep learning applications for analyzing cancer likelihood are pushing the boundaries of earlier detection. Our second phase of this project involves looking at mammograms of the same person over time and leveraging changes in breast tissue to determine cancer risk. We look forward to this second phase, since medical imaging data contains immense opportunity for understanding cancer and should be studied with the computational resources and algorithmic tools we have today.

Learn more about us

Learn more about AI at Slalom and projects supported via Innovation for Good and the AI Center of Purpose.


Anne Marie Ou
Slalom Technology

Chemical Engineer turned Data Scientist based out of Seattle, WA. Always curious about sustainability, math, and culture.