How Different Are Cats and Cells Anyway?

Closing the Gap for Deep Learning in Histopathology

Michael Bereket
Stanford AI for Healthcare
17 min read · Feb 7, 2018


Written by Michael Bereket and Thao Nguyen.

Deep learning has revolutionized the field of computer vision. So why are pathologists still spending their time looking at cells through microscopes?

Examples of cat detection and nucleus detection (bounding boxes in green, yellow, red) (Source: cats, cells)

In recent years, the field of computer vision has undergone a revolution. Deep learning techniques, powered by increases in data and improvements in computational power, have enabled breakthroughs in tasks like image classification (Krizhevsky et al., 2012) and facial recognition (Taigman et al., 2014), which now permeate our everyday lives.

These increasingly ubiquitous breakthroughs are just beginning to reach pathologists’ microscopes. Histopathology, the microscopic study of diseased cells and tissues, presents a variety of unique opportunities to meaningfully apply deep learning. Every day, countless tissue samples must be visually inspected to diagnose and characterize a variety of illnesses, including nearly all types of cancer (Gurcan et al., 2009). Additionally, developments in whole-slide imaging technology have fueled the growth of digital pathology in recent years. While whole-slide images have primarily been used for research, education, and remote consultation, following the FDA’s 2017 approval of Philips’ IntelliSite imaging system, they may now also be used for primary clinical diagnosis. In this context, automated image analysis of scanned microscopic slides could drastically increase diagnostic efficiency and reduce inter-observer variability and errors, allowing fewer pathologists to serve more patients while maintaining diagnostic accuracy and precision.

Along with unique opportunities, histopathology presents unique challenges for deep learning models. Acquiring sizable datasets for histopathology is much more difficult than for typical computer vision tasks (cat pictures are far more plentiful and easier to label than cell pictures). Additionally, whole-slide images have very high resolution and are susceptible to many sources of variation. Even with the necessary data, models created for diagnostic tasks must also achieve greater interpretability and performance than models used in many other contexts.

Mitosis under a fluorescence microscope (source)

In this post, we will focus on mitosis detection, which shares many of the challenges facing other nuclear detection tasks, as a motivating example. Mitosis is the process by which non-reproductive cells divide into two genetically identical daughter cells. Mitosis detection in microscopy slides is important for the analysis of many diseases; for example, pathologists count mitotic events to determine how quickly a cancer is proliferating. In practice, mitotic events are usually detected on static Hematoxylin and Eosin (H&E) stained slides (first image in the next section), making the task much more difficult than recognizing mitosis in the above GIF.

In the subsequent sections, we aim to address the following questions:

  • Why automate nuclear detection tasks? The potential impact of slide analysis automation on healthcare
  • Why deep learning, and how is it used? Why this is a good problem for deep learning and how it has been used to achieve remarkable results
  • Why are nuclear detection tasks hard? The various challenges and existing partial solutions to address them
  • What’s next? Thoughts on bringing deep learning solutions into regular clinical practice

The need for automated nuclear detection

H&E Stained Breast Cancer Samples from ICPR Dataset (source)

Imagine that you are a pathologist, and a doctor has sent you a sample of an abnormal mass observed during a routine breast cancer screening. Your job is to determine the nature of the cells in the sample.

Let’s say you determine the mass to be a malignant cancer. Now you, the doctor, and the rest of the healthcare team must work together to determine treatment options. One important factor in selecting treatments is the “grade” of the cancer, which represents the cancer’s biologic aggressiveness. Applying the widely-used Nottingham Grading System, you analyze three morphologic factors: tubule formation, the degree to which the cancer cells have developed normal breast cell structure; nuclear grade, which is based on the morphology of tumor nuclei; and mitotic index, a measurement of the rate at which cells are dividing (National Cancer Institute).

Let’s take a closer look at how you would determine the mitotic index. The Mitotic Activity Index is calculated as the number of mitotic events in a 2 mm² area of the tissue sample, which corresponds to the count of nuclei undergoing mitosis in 8–10 “high power fields” (areas visible under high magnification, typically 40X), depending on the microscope used. Identifying mitotic cells is challenging: you must distinguish mitotic events from similar-looking apoptotic (dying) cells and image artifacts, select appropriate areas for analysis, and account for variations in the appearance of mitotic figures (Veta et al., 2015). This difficulty can lead to variability in measurements and high inter-observer error. Take a look at the image below: can you tell the difference between the mitotic and non-mitotic examples?

Answer: (a)-(c) true mitoses, (d)-(f) confounding examples (source)

Additionally, this process is highly time- and labor-intensive. A typical case will take a pathologist 5–10 minutes to analyze, with repeated examinations required in certain scenarios (Veta et al., 2015).
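As a quick sanity check on the field counts above, here is a back-of-the-envelope sketch in Python. The 0.5 mm field diameter is an assumed, illustrative value; actual field sizes vary by microscope, which is precisely why the required count ranges from 8 to 10 fields.

```python
import math

# Back-of-the-envelope: how many high power fields (HPFs) cover 2 mm^2?
# Assumes a 40X field diameter of ~0.5 mm (illustrative; varies by microscope).
field_diameter_mm = 0.5
hpf_area_mm2 = math.pi * (field_diameter_mm / 2) ** 2  # ~0.196 mm^2 per field

fields_needed = 2.0 / hpf_area_mm2
print(f"HPF area: {hpf_area_mm2:.3f} mm^2")            # 0.196 mm^2
print(f"Fields to cover 2 mm^2: {fields_needed:.1f}")  # ~10.2 fields
```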

In 2014, 236,968 women and 2,141 men were diagnosed with breast cancer in the US (CDC, 2017) — that’s a lot of medical experts spending a lot of time counting mitotic cells. Creating fast, consistent automated solutions for mitosis detection would have a significant impact on healthcare costs and quality.

Of course, breast cancer treatment is not the only field that would benefit from automated nuclear detection solutions. Robust and accurate nuclear detection is essential for digital pathology in general: nuclear morphology and arrangement provide important clues for many diagnostic tasks, ranging from cell counting and tracking to grading applications for a variety of diseases (Xie et al., 2015).

The role of deep learning

Example convolutional neural network architecture (source)

Machine learning techniques, which use data to learn relationships between inputs and outputs, have found wide application in nuclear detection tasks. A subset of these techniques, known as deep learning, has become the leading approach for state-of-the-art solutions. While traditional machine learning often relies heavily on feature selection and engineering for good performance, deep learning models learn useful representations from raw data through the training process. Convolutional neural networks (CNNs), characterized by convolutional layers that exploit locality, have proven particularly useful in computer vision tasks, including nuclei detection. For more information on CNNs, check out Stanford’s CS231N course notes.
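To make this concrete, here is a minimal PyTorch sketch of a CNN that classifies a small image patch as mitotic or non-mitotic. The architecture and layer sizes are purely illustrative, not taken from any published model.

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    """Minimal CNN that classifies a 64x64 patch as mitosis / non-mitosis.
    Layer sizes are illustrative, not from any published model."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local (convolutional) filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32 -> 16
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# A batch of four 64x64 RGB patches
logits = PatchCNN()(torch.randn(4, 3, 64, 64))
print(logits.shape)  # torch.Size([4, 2])
```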

While supervised training of CNNs has been the backbone of deep learning successes in image tasks, unsupervised approaches, in which models learn representations from data without explicit human annotation, have also proven useful in cases with limited data. For example, autoencoders (AEs) learn to encode inputs into a compressed latent representation that can then serve as input for further supervised training. More on autoencoders here.
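As a toy illustration of the idea, the following PyTorch sketch defines an autoencoder whose encoder compresses a flattened patch into a small latent code; the dimensions are arbitrary choices.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Toy autoencoder: compress a flattened patch to a small latent code,
    then reconstruct it. The code can later feed a supervised classifier."""
    def __init__(self, input_dim: int = 64 * 64, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, input_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = Autoencoder()
x = torch.rand(8, 64 * 64)                 # eight flattened grayscale patches
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)    # train by minimizing reconstruction error
print(code.shape, loss.item())             # torch.Size([8, 128]) ...
```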

Progress Made

Before diving into what’s left for deep learning in nuclear and mitosis detection tasks, we will provide some examples of recent successes and the approaches used to achieve them.

Recent mitosis detection competitions (ICPR 2012, AMIDA 2013, TUPAC 2016) have provided researchers with labeled datasets and pushed the state-of-the-art for automated mitosis detection. Initial deep learning success was achieved by Ciresan et al. (2013), who won the ICPR 2012 competition with multiple CNNs making pixel-wise mitosis classification predictions. The 2013 AMIDA competition was won with a similar approach.

Fully convolutional Deep Regression Network: the input is downsampled, then upsampled to output a regression map with the same dimensions as the input (Chen et al., 2016).

Research using these datasets continued long after the competitions concluded. Wang et al. (2014) combined hand-crafted features and smaller CNN models to reduce the computational burden of their mitosis detector. Chen et al. (2016) achieved a new best F1 score of 0.790 on the 2012 ICPR dataset by employing a fully convolutional Deep Regression Network (DRN) that made pixel-wise proximity predictions for mitotic nuclei (diagram above). The fully convolutional architecture allows for inference with a single forward pass, regardless of input image size, improving efficiency.
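The following toy PyTorch sketch illustrates why fully convolutional architectures can handle arbitrary input sizes in a single forward pass; it is not Chen et al.’s actual DRN, just the downsample-then-upsample pattern in miniature.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Fully convolutional sketch (not Chen et al.'s actual DRN): every layer
    is convolutional, so any input size yields a correspondingly sized map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # downsample
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2),    # upsample back to input size
            nn.Sigmoid(),                              # per-pixel "proximity" score
        )

    def forward(self, x):
        return self.net(x)

fcn = TinyFCN()
for size in (64, 128):                     # one forward pass per image, any size
    out = fcn(torch.randn(1, 3, size, size))
    print(out.shape)                       # (1, 1, 64, 64) then (1, 1, 128, 128)
```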

Beyond mitosis detection, similar approaches have achieved success in other nuclear detection tasks. Kashif et al. (2016) combined neural networks with handcrafted features to detect tumor cell nuclei. Regression approaches that exploit topological information were utilized by Xie et al. (2015) and Sirinukunwattana et al. (2016) for tumor nuclei tasks. Xie et al. (2015) also successfully utilized spatial information with a “deep voting” scheme.

Nuclei detection using Stacked Sparse Autoencoders. The green, yellow, and red dots represent true positives, false positives, and false negatives, respectively. (Source)

Unsupervised learning techniques have also proven useful. Xu et al. (2015) stacked two sparse autoencoders to learn useful representations of nuclei arrangements in input patches, improving the performance of their classifier.

Deep learning techniques have produced state-of-the-art results, even surpassing the performance of pathologists in some cases (Bejnordi et al., 2017). However, it is important to note that the datasets used for research are often small and acquired from one or a few prominent institutions, so variations in the data may not be fully represented. As a result, the generalizability of these deep learning models to new cases remains in question.

Let’s take a closer look at some of the challenges facing deep learning solutions for nuclear detection tasks, including mitosis detection, and the attempts being made to address them.

Current Barriers to Progress

Microscopic image analysis presents a series of unique challenges to deep learning models that are not encountered in typical vision tasks. Furthermore, existing healthcare resources need to evolve to support and utilize new technologies as they are developed.

Machine learning challenges

This section draws heavily from this review by Xing et al., 2017.

1. Needs More (Representative, Well-Labeled) Data

The performance of a deep learning model relies on access to a sufficient amount of representative, well-labeled data. However, there are significant barriers to the acquisition of good datasets for nuclear detection:

  • lack of medical expertise: medical expertise is often required for accurate annotations of microscopy images
  • time-intensive annotations: patch-wise annotations of extremely high-resolution images are required for detection tasks
  • privacy concerns: biomedical data must be treated carefully to maintain privacy
  • expensive/rare scanners: whole slide image scanners are not currently widely available
  • imbalanced classes: for many tasks, such as mitosis detection, one label (“no mitosis”) is much more common than the other (“mitosis”)

To overcome these challenges, it is important to develop techniques to extract more information from available data or standardize data aggregated from different sources.

Data Augmentation

One standard approach to limited data is data augmentation: applying label-preserving transformations, such as rotations, reflections, random crops, and (for microscopy images) color alterations, to expand the dataset. While this technique is simple, efficient, and beneficial to performance, augmented images may be highly correlated with the originals and are often insufficient to train a generalizable model.
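As an illustration, a possible augmentation pipeline using torchvision might look like the following; the specific transforms and parameter values are arbitrary choices, not a recommended recipe.

```python
from torchvision import transforms

# One possible augmentation pipeline for stained microscopy patches; the
# specific transforms and parameters here are illustrative choices.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),      # reflections: orientation carries
    transforms.RandomVerticalFlip(),        # no meaning in tissue slides
    transforms.RandomRotation(degrees=90),  # rotations are label-preserving
    transforms.ColorJitter(brightness=0.1,  # mild color perturbation mimics
                           saturation=0.1,  # stain variation
                           hue=0.05),
    transforms.RandomCrop(64),              # random crops from a larger patch
    transforms.ToTensor(),
])
# Usage: pass `augment` as the `transform` argument of a PIL-image Dataset.
```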

Crowdsourcing

Crowdsourcing, the practice of having a large number of individuals with varying levels of expertise label data inexpensively, is also being explored as an approach to building large datasets. Because crowdsourcing yields noisy and often conflicting labels, especially in difficult tasks like nuclear detection, researchers must develop models that are robust to noisy annotations.

Aggnet Approach Overview (source)

Albarqouni et al. (2016) have made early progress on this problem, working on the challenge of crowdsourcing labels for mitosis detection in breast cancer biopsy images. They introduce an additional aggregation layer to their CNNs to process non-expert annotations and estimate the validity of these labels while simultaneously tuning their classifiers. To do so, they train multiple CNNs with expert-labeled images at various scales. These networks are then used to identify candidate mitotic events from unlabeled data, which are sent to non-experts for annotation. The sensitivity and specificity of each annotator are estimated in an unsupervised manner using the EM algorithm, and these weighted annotations are used to train the classifier (while also incorporating a trust metric from the crowdsourcing platform). The team achieved promising results, setting the stage for further exploration of crowdsourcing techniques in automated microscopic image analysis.
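To give a flavor of the label-aggregation idea (and only the flavor: this is a toy Dawid-Skene-style EM for binary labels, not Albarqouni et al.’s AggNet), the sketch below alternates between estimating each annotator’s sensitivity and specificity and re-estimating the true labels.

```python
import numpy as np

def em_label_aggregation(votes, n_iter=50):
    """Toy Dawid-Skene-style EM for binary crowd labels (not the AggNet model).
    votes: (n_items, n_annotators) array of 0/1 labels.
    Returns the posterior probability that each item is positive, plus each
    annotator's estimated sensitivity and specificity."""
    p = votes.mean(axis=1)                   # init: majority-vote soft labels
    for _ in range(n_iter):
        # M-step: estimate each annotator's sensitivity / specificity
        sens = (p @ votes) / p.sum()                     # P(vote=1 | true=1)
        spec = ((1 - p) @ (1 - votes)) / (1 - p).sum()   # P(vote=0 | true=0)
        # E-step: re-estimate item posteriors, assuming independent annotators
        prior = p.mean()
        log_pos = np.log(prior) + (votes * np.log(sens + 1e-9)
                   + (1 - votes) * np.log(1 - sens + 1e-9)).sum(axis=1)
        log_neg = np.log(1 - prior) + ((1 - votes) * np.log(spec + 1e-9)
                   + votes * np.log(1 - spec + 1e-9)).sum(axis=1)
        p = 1 / (1 + np.exp(log_neg - log_pos))
    return p, sens, spec

# Three annotators vote on five candidate mitoses (made-up data)
votes = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 0]])
posterior, sens, spec = em_label_aggregation(votes)
print(np.round(posterior, 2))
```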

Transfer Learning and Unsupervised Pre-Training

Transfer learning is a popular practice in computer vision to build generalizable models when there is not a lot of data available. The process involves using a model trained on a different, large dataset (such as ImageNet) for initialization or feature extraction. Additional layers may then be added and the entire network may be fine-tuned in a supervised manner. Alternatively, a scheme of unsupervised pre-training and supervised fine-tuning can be used to take advantage of unlabeled data. In this approach, an unsupervised model is used to determine a latent representation of the input, which is then extended and trained in a supervised manner.
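A common version of this recipe, sketched below in PyTorch, starts from an ImageNet-pretrained ResNet-18, freezes its features, and trains a new classification head. Which layers to freeze or fine-tune is a judgment call that depends on the dataset; the backbone and hyperparameters here are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained ResNet-18 (one possible backbone choice)
model = models.resnet18(pretrained=True)

for param in model.parameters():      # freeze the pretrained features...
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 2)  # ...and add a fresh 2-class head

# Train only the new head first; optionally unfreeze everything later and
# fine-tune end-to-end with a small learning rate.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```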

These techniques have been applied to nuclear detection tasks. Because transfer learning is more effective when the pre-training dataset resembles the dataset for the target task, a generic model pre-trained on microscopy images from various microscopes and staining preparations could perform better than one pre-trained on standard natural images. Unfortunately, no such general pre-trained microscopy model is currently available.

2. High-Dimensional Data

For many nuclear detection tasks, it is important to have solutions that work with an entire whole slide image, rather than just manually cropped selections. However, these images are often enormous (1–40 GB!), with dimensions over 50,000 x 50,000 pixels. Additionally, these images may contain tens of thousands to millions of objects of interest for some nuclear detection tasks (Xing et al., 2017).

With such high-dimensional data, extensive computation is required to process even a single image, especially with the pixel-wise CNN classification approach. Fully convolutional networks, while more efficient, still face memory issues at large image sizes. Because whole-slide images cannot be resized significantly without losing valuable information, images are often split into patches, processed separately, and then stitched back together. Designing an efficient and robust algorithm for this process remains an open problem: very few patches may contain the relevant objects, and splitting can discard contextual information (Xing et al., 2017).
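A minimal sketch of the patch-and-stitch idea appears below. It assumes the whole image fits in memory, which real whole-slide pipelines avoid by reading regions lazily (e.g., with a library like OpenSlide), and it simply averages predictions in overlapping regions.

```python
import numpy as np

def tile_and_stitch(image, model_fn, patch=512, overlap=64):
    """Split a large image into overlapping patches, run `model_fn` on each,
    and stitch the per-pixel outputs back together (averaging overlaps).
    A minimal sketch: assumes `image` fits in memory and is larger than
    `patch` in both dimensions."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    step = patch - overlap
    for y in range(0, h, step):
        for x in range(0, w, step):
            # Clamp so the patch never runs off the image edge
            y0, x0 = min(y, h - patch), min(x, w - patch)
            pred = model_fn(image[y0:y0 + patch, x0:x0 + patch])
            out[y0:y0 + patch, x0:x0 + patch] += pred
            counts[y0:y0 + patch, x0:x0 + patch] += 1
    return out / counts

# Demo with a dummy "model" that scores each pixel by its mean intensity
img = np.random.rand(2048, 2048, 3)
heatmap = tile_and_stitch(img, lambda p: p.mean(axis=-1))
print(heatmap.shape)  # (2048, 2048)
```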

3. Variations in Data Preparation

The image quality of a whole slide scan is highly dependent on an extensive preparation process. Tissues are sectioned onto glass slides and stained with different chemical substances before undergoing scanning and imaging. Image artifacts, such as tissue folds and blurred images, can present challenges to deep learning algorithms.

As we strive to improve the diversity and quantity of our data, we will need to handle images from multiple institutions utilizing a variety of microscopes and preparation protocols. These differences lead to color and scale variations, known as batch effects, which can bias the performance of predictive models (Kothari et al. 2013). Scale variations can be particularly difficult to detect due to the wide variety of cell and nuclear morphologies in healthy and diseased tissues.

Fortunately, researchers have made progress in normalizing color and scale batch effects (Bejnordi et al. 2016, Kothari et al. 2014). Deep learning methods have also been used for stain normalization without hand-crafted features (Janowczyk et al., 2017), achieving promising results. Developing techniques to effectively handle batch effects will be essential to creating widely-applicable nuclear detection models.
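As a stripped-down illustration of color normalization, the sketch below matches per-channel means and standard deviations to a reference image. Reinhard-style normalization does this in the LAB color space; working directly in RGB here is a simplification for brevity, not a faithful reproduction of any cited method.

```python
import numpy as np

def match_color_stats(image, target, eps=1e-6):
    """Simplified Reinhard-style normalization: shift and scale each channel
    of `image` so its mean/std match `target`'s. The original method operates
    in the LAB color space; this RGB version is a stripped-down illustration."""
    img = image.astype(np.float32)
    tgt = target.astype(np.float32)
    mu_i, sd_i = img.mean(axis=(0, 1)), img.std(axis=(0, 1))
    mu_t, sd_t = tgt.mean(axis=(0, 1)), tgt.std(axis=(0, 1))
    normalized = (img - mu_i) * (sd_t / (sd_i + eps)) + mu_t
    return np.clip(normalized, 0, 255).astype(np.uint8)

# Align a slide patch's color statistics with a reference patch
patch = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
reference = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(match_color_stats(patch, reference).shape)  # (256, 256, 3)
```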

4. Overlapping and Cluttering of Objects

A pathology image can contain tens of thousands of nuclei, often partially overlapping and clustered into clumps. The ambiguous and potentially misleading boundaries make it very difficult to detect, and subsequently segment, individual nuclei for further analysis. Incorporating shape-prior modeling into deep neural networks has the potential to improve the delineation of object boundaries, though this remains an open research question.

Healthcare challenges

Success for automated nuclear detection should be measured not by model performance alone, but by actual impact on patients. For any of the proposed deep learning techniques to make a difference, we must also consider the healthcare system within which we operate.

1. Opening the “black box”

In addition to achieving good performance, it is also important for doctors to be able to understand why predictions are made. For medical applications, we must be able to verify that a model’s predictive success is derived from the proper problem representation, rather than artifacts, to confidently apply these predictions in practice (Montavon et al., 2018). Neural network interpretability is an open question and an active area of research in machine learning (for a discussion of the topic, see Lipton 2017). One solution for histopathology may be to have the neural network itself explain its predictions — for example, researchers have taken early steps towards generating descriptions to accompany predictions in the radiology domain (Shin et al. 2015). Possible joint workflows, such as a CNN that identifies candidates that are then confirmed by a pathologist, may also help address these issues.
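One simple inspection technique (by no means a complete answer to interpretability) is a gradient saliency map, sketched below in PyTorch: it highlights which input pixels most influence the predicted class score, helping a reviewer spot predictions driven by artifacts.

```python
import torch
import torch.nn as nn

def saliency_map(model, patch):
    """Gradient saliency: per-pixel magnitude of d(top class score)/d(input)."""
    patch = patch.clone().requires_grad_(True)
    score = model(patch.unsqueeze(0)).max()    # top class logit
    score.backward()
    return patch.grad.abs().max(dim=0).values  # reduce over color channels

# Demo with a stand-in classifier (any patch model works, e.g. the
# PatchCNN sketch from earlier in this post)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
sal = saliency_map(model, torch.rand(3, 64, 64))
print(sal.shape)  # torch.Size([64, 64])
```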

2. High costs of digitization

Today, the cost-benefit ratio of digital pathology remains too high for many pathologists. Unlike radiology, where end-to-end digital workflows already exist, digital pathology still requires the creation of a tissue block and glass slides (and all the steps those processes involve) before images can enter the digital workflow. Once a glass slide is produced, it is currently cheaper, faster, and easier for a pathologist to view it under a microscope than to scan it and view the digitized image on a computer. Nor can the creation of tissue blocks be done away with: even when digital pathology is widely adopted, blocks will still need to be made and stored in case the need for further ancillary studies (such as molecular testing or special staining) arises.

Additionally, some pathologists may be resistant to changes in their current workflow. Effective automated analysis will be instrumental in providing enough benefit for pathologists to adopt digital workflows; yet, as we have seen, improved automated analysis requires more data. For this reason, it is important that machine learning researchers and pathologists form partnerships to bring automated microscopic image analysis (and the benefits it offers) to fruition.

Conclusion: Cats and Cells

So are cats and cells really that different, from a deep learning point of view?

Yes and no.

As we’ve seen, the same deep learning techniques (e.g. CNNs and their variants) that have driven much of the recent success in general computer vision tasks have also produced state-of-the-art results in nuclear detection tasks.

However, mitosis detection, and nuclear detection tasks in general, face a number of challenges not usually encountered when detecting our feline friends. Unlike cats, microscopy images are not plastered across the internet, and it is much more difficult to accurately annotate a mitotic or cancerous nucleus than a tabby. Additionally, whole-slide images are very large and can contain thousands of cluttered and overlapping nuclei, whereas a typical cat photo contains only a handful of subjects. Significant variations between our limited data sources also pose a challenge to building models that generalize well. Finally, our nuclear detection models must not only detect nuclei successfully but also be interpretable.

We believe that it is essential for pathologists and computer scientists to come together to understand and tackle the technical and healthcare challenges facing deep learning in nuclear detection tasks. Through such partnerships, we can target the most impactful problems and build the representative datasets and robust models necessary to bring the breakthroughs of deep learning to histopathology. We look forward to a future where deep learning models can effectively automate many microscopy tasks in regular clinical practice, enabling pathologists to spend less time counting cells and more time on high-level diagnostic and other analytic tasks.

Acknowledgements

We would like to thank Dr. Jeanne Shen, Assistant Professor of Pathology at the Stanford University School of Medicine, for her guidance and valuable feedback throughout the writing process. We would also like to thank Pranav Rajpurkar and Jeremy Irvin of the Stanford Machine Learning Group and Ruth Starkman of the Stanford University PWR department for their comments.

References

  1. Gurcan, Metin N., et al. “Histopathological image analysis: A review.” IEEE reviews in biomedical engineering 2 (2009): 147–171.
  2. Taigman, Yaniv, et al. “Deepface: Closing the gap to human-level performance in face verification.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.
  3. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.
  4. Veta, Mitko, et al. “Assessment of algorithms for mitosis detection in breast cancer histopathology images.” Medical image analysis 20.1 (2015): 237–248.
  5. “Tumor Grade.” National Cancer Institute, www.cancer.gov/about-cancer/diagnosis-staging/prognosis/tumor-grade-fact-sheet.
  6. “FDA Allows Marketing of First Whole Slide Imaging System for Digital Pathology.” U.S. Food and Drug Administration, Office of the Commissioner, www.fda.gov/NewsEvents/Newsroom/PressAnnouncements/ucm552742.htm
  7. Xie, Yuanpu, et al. “Deep voting: A robust approach toward nucleus localization in microscopy images.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015.
  8. Cireşan, Dan C., et al. “Mitosis detection in breast cancer histology images with deep neural networks.” International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, Berlin, Heidelberg, 2013.
  9. Unsupervised Feature Learning and Deep Learning Tutorial, ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/.
  10. “CS231n Convolutional Neural Networks for Visual Recognition.” CS231n Convolutional Neural Networks for Visual Recognition, cs231n.github.io/.
  11. “MICCAI Grand Challenge: Tumor Proliferation Assessment Challenge (TUPAC16).” MICCAI Grand Challenge: Tumor Proliferation Assessment Challenge (TUPAC16), tupac.tue-image.nl/.
  12. “ICPR 2012 — Mitosis detection contest.” ICPR 2012 — Mitosis detection contest | Image & Pervasive Access Lab, www.ipal.cnrs.fr/event/icpr-2012.
  13. “MICCAI Grand Challenge: Assessment of Mitosis Detection Algorithms 2013 (AMIDA13).” MICCAI Grand Challenge: Assessment of Mitosis Detection Algorithms 2013 (AMIDA13), amida13.isi.uu.nl/.
  14. “Breast Cancer Statistics.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 7 June 2017, www.cdc.gov/cancer/breast/statistics/index.htm.
  15. Wang, Haibo, et al. “Mitosis detection in breast cancer pathology images by combining handcrafted and convolutional neural network features.” Journal of Medical Imaging 1.3 (2014): 034003.
  16. Chen, Hao, Xi Wang, and Pheng Ann Heng. “Automated mitosis detection with deep regression networks.” Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on. IEEE, 2016.
  17. Kashif, Muhammad Nasim, et al. “Handcrafted features with convolutional neural networks for detection of tumor cells in histology images.” Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on. IEEE, 2016.
  18. Sirinukunwattana, Korsuk, et al. “Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images.” IEEE transactions on medical imaging 35.5 (2016): 1196–1206.
  19. Xie, Yuanpu, et al. “Beyond classification: structured regression for robust cell detection using convolutional neural network.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015.
  20. Xu, Jun, et al. “Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images.” IEEE transactions on medical imaging 35.1 (2016): 119–130.
  21. Bejnordi, Babak Ehteshami, et al. “Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer.” Jama 318.22 (2017): 2199–2210.
  22. Xing, Fuyong, et al. “Deep Learning in Microscopy Image Analysis: A Survey.” IEEE Transactions on Neural Networks and Learning Systems (2017).
  23. Albarqouni, Shadi, et al. “Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images.” IEEE transactions on medical imaging 35.5 (2016): 1313–1321.
  24. Bejnordi, Babak Ehteshami, et al. “Stain specific standardization of whole-slide histopathological images.” IEEE transactions on medical imaging 35.2 (2016): 404–415.
  25. Lipton, Zachary C. “The mythos of model interpretability.” arXiv preprint arXiv:1606.03490 (2016).
  26. Shin, Hoo-Chang, et al. “Interleaved text/image deep mining on a large-scale radiology database for automated image interpretation.” Journal of Machine Learning Research 17.1–31 (2016): 2.
  27. Kothari, Sonal, John H. Phan, and May D. Wang. “Scale normalization of histopathological images for batch invariant cancer diagnostic models.” Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE. IEEE, 2012.
  28. Kothari, Sonal, et al. “Pathology imaging informatics for quantitative analysis of whole-slide images.” Journal of the American Medical Informatics Association 20.6 (2013): 1099–1108.
  29. Montavon, Grégoire, et al. “Methods for interpreting and understanding deep neural networks.” Digital Signal Processing 73 (2018): 1–15.
