Deep learning: From natural to medical images

How to adjust deep neural networks to medical image analysis problems

Deep neural networks have been taking the world by storm in recent years, and successful applications are now abundant for problems in image analysis. Neural networks are often developed on ‘natural’ images: everyday images taken with regular RGB cameras. Recently, deep neural networks have also been adapted and deployed to detect abnormalities in medical images such as X-ray, CT and MRI scans with great success. However, there are some key differences between the two domains: medical images often contain quantitative information, and objects in them have no canonical orientation. Taking these differences into account and adjusting algorithms accordingly can greatly boost performance.

Deep neural networks have been taking the world by storm in recent years, and successful applications are now abundant for problems in image analysis. Tools for security systems and for pedestrian detection in self-driving cars could make the world a safer place and reduce accidents. Another area where AI can have an immense positive impact is the medical world.

Even though the problems are superficially similar, research on image analysis for natural and medical images has traditionally been separated. Natural image analysis often refers to problems such as object detection, face recognition and 3D reconstruction, using images from normal RGB cameras. Medical image analysis entails tasks like detecting diseases in X-ray images, quantifying anomalies in MRI, segmenting organs in CT scans, etc.

Deep learning methods for classification are infamous for being data hungry. For problems in natural image analysis, this is often solved by mining massive numbers of photos from social media and having non-experts perform annotations through crowdsourcing [1]. For medical problems, this data is often harder to acquire and labeling requires expensive experts, meaning it takes longer for deep learning methods to find their way into medical image analysis. As a result, many algorithms for classification are prototyped on natural data and subsequently adjusted to medical problems.

There are several key differences between natural and medical data. The most obvious is the data format: natural images are typically 2D (spatial) RGB images, whereas medical data can take many forms, such as 2D grayscale, 2D with four channels, 3D volumes and even 4D (volumes changing over time). However, there are also more subtle differences. Taking these into account can push the performance towards clinically usable products.

Variance and invariance

Deep neural networks generate a hierarchical representation of the data. If this is done well, the final representation will be a subspace that retains all relevant sources of variation for the classification (or regression) task and ignores everything else, i.e., it becomes invariant to irrelevant sources. If we want to discriminate cats from dogs, for instance, we should look at sources of variation like the shape of the ears, the fur, the size of the tail etc., but generate invariance with respect to factors such as the exact intensity of the image, the orientation and the size of the animal.

In traditional computer vision systems, people thought long and hard to come up with complicated image processing algorithms that would generate a mapping like this. Deep neural networks can learn to generate such a representation from training data, saving tremendous amounts of engineering work. However, to learn a suitable mapping, we still have to make sure the architecture and the way we present our data actually makes it possible for the model to ignore irrelevant factors and focus on the relevant parts.

Variation in intensity

For most classification problems in natural images, the exact intensity of the image is not a relevant feature: a cat is a cat in an over- or underexposed image. In traditional solutions to such problems, intensity and saturation are therefore often ignored and features are based on edges or gradients: differences between pixel values.

Left and right image vary greatly in intensity. However, if we want to classify or detect cats in images, the exact intensity is not important.

Intensity variation played a role in one of the first face recognition systems. The method proposed by Turk and Pentland [2] uses PCA to reduce the dimensionality of image data and project it onto a lower dimensional subspace. These ‘Eigenfaces’ are subsequently used for classification by a simple ‘shallow’ classifier.

‘Eigenfaces’ used in an early face recognition algorithm. Images are projected down to a lower dimensional subspace using PCA, which is subsequently used for classification. To make the method work well, it is common to ignore the first one or two dimensions, as they correspond to variation in intensity that is not relevant for the classification problem.

Because intensity is often one of the biggest sources of variation in a dataset of natural images, and because the dimensions of the subspace generated by PCA are typically ordered by their variance, the first couple of dimensions will often represent intensity. Ignoring these dimensions makes the whole pipeline work significantly better.
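As an illustration (a minimal sketch, not the original Eigenfaces pipeline), the synthetic data below is dominated by a global intensity offset; PCA recovers it as the leading component, which can then be dropped before classification:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a face dataset: 100 flattened 8x8 'images'
# whose dominant source of variation is a global intensity offset.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 64))                 # face-like content
intensity = rng.normal(scale=10.0, size=(100, 1)) # large global offset
images = base + intensity                         # intensity dominates variance

pca = PCA(n_components=10)
coeffs = pca.fit_transform(images)

# The leading component captures mostly intensity; dropping it keeps
# the content-specific variation for a downstream classifier.
features = coeffs[:, 1:]
```

Checking the correlation between the first PCA coefficient and the injected intensity offset confirms that the leading dimension is essentially an intensity axis.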

Since deep neural networks learn all these transformations from data, they can be expected to generate a representation invariant to intensity changes (if the domain requires it). When visualizing what is learned in deep convolutional neural networks, we often see that the first few convolutional layers consist of edge filters, similar to the feature extractors used in traditional computer vision systems, which do not respond to absolute intensity changes. Additionally, techniques like batch normalization contribute to intensity-invariant representations by removing some local intensity changes.
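A quick sanity check of why edge filters ignore absolute intensity: the Sobel-like kernel below sums to zero, so a global brightness shift leaves the response unchanged (a toy example, not taken from any particular trained network):

```python
import numpy as np
from scipy.ndimage import convolve

# A horizontal edge filter (Sobel-like). Its coefficients sum to zero,
# so it responds to differences between pixels, not absolute intensity.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

rng = np.random.default_rng(0)
image = rng.random((32, 32))
brighter = image + 0.5  # global intensity shift ('overexposure')

response = convolve(image, kernel)
response_bright = convolve(brighter, kernel)
# The two responses are identical: the constant offset cancels out.
```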

Intensity in medical images

X-ray is the oldest and most common medical imaging technique. It works by sending ionizing radiation through a body part and counting the number of photons not absorbed by the tissue. A low number of photons means dense tissue, whereas a large number of photons indicates a more porous area. Using some assumptions, mammograms (and potentially other 2D X-ray images) can be transformed into a representation where pixels convey approximate density information, and be used for quantitative estimation of volumetric breast density, a risk factor for cancer [6].

Apart from generating 2D images, X-ray can also be used to generate 3D volumes. CT scanners generate such a volume by rotating the X-ray source and detector around the body part of interest and applying an intelligent reconstruction algorithm. Because X-rays are sent from every angle, this generates a much more detailed representation and has the additional advantage that creating a quantitative representation is more straightforward and accurate.

The commonly used scale to represent this information is the Hounsfield scale [7]. In this scale, air has a value of -1000, water is defined as 0, fat lies roughly between -120 and -90, different types of bone have different values, different types of blood have different values, etc.
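In practice, raw CT voxel values are mapped to Hounsfield units using a linear rescale stored in the DICOM header (the RescaleSlope and RescaleIntercept attributes); the slope and intercept below are assumed example values:

```python
import numpy as np

def to_hounsfield(raw, slope, intercept):
    """Map raw scanner values to Hounsfield units using the linear
    rescale stored in the DICOM header (RescaleSlope / RescaleIntercept)."""
    return raw.astype(np.float32) * slope + intercept

# Typical example values: slope 1, intercept -1024.
raw = np.array([[0, 1024],
                [934, 1074]], dtype=np.int16)
hu = to_hounsfield(raw, slope=1.0, intercept=-1024.0)
# hu is now on the Hounsfield scale: air near -1000, water near 0.

# Because the values are quantitative, a simple threshold already gives
# a crude tissue mask, e.g. fat roughly between -120 and -90 HU:
fat_mask = (hu >= -120) & (hu <= -90)
```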

Contrary to natural images, where the nature of objects is typically determined by relative pixel intensities, the exact value of a pixel in a medical image can convey information relevant for the problem. Removing this information by means of scaling, or by commonly used methods such as batch normalization, can decrease a deep neural network’s performance.

As mentioned above, for natural images the exact pixel value typically does not convey any information. For medical images this can be the opposite: knowing the exact pixel value in a CT image gives us information about the tissue it represents. Removing this information will effectively remove parts of the data that may be relevant to classify certain diseases or segment parts of the image.

It is important to consider this when training or fine-tuning deep neural networks on medical data. Preprocessing techniques such as standardization should not be a problem, as long as all samples in the training and test data are subject to the same normalization constants. Batch normalization layers, however, where scaling factors depend on batches and not on the whole training set, may remove some of this information and are known to decrease performance in medical image analysis areas such as CT.
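A minimal sketch of the safe kind of preprocessing: standardize with constants computed once on the training set and reuse them for every sample, rather than relying on per-batch statistics:

```python
import numpy as np

# Standardize with constants computed once on the *training set* and
# reused for every sample, so the quantitative meaning of pixel values
# stays consistent across training and test data.
rng = np.random.default_rng(0)
train = rng.normal(loc=100.0, scale=20.0, size=(50, 16, 16))
test = rng.normal(loc=100.0, scale=20.0, size=(10, 16, 16))

mean = train.mean()  # fixed normalization constants
std = train.std()

train_norm = (train - mean) / std
test_norm = (test - mean) / std  # same constants, not per-batch statistics
```

Because the same affine map is applied everywhere, two pixels with the same raw value always map to the same normalized value, preserving the quantitative information.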

Variation in location

When detecting dogs or cats in images, their location is generally not important: a cat is a cat no matter if it is in the top left or bottom right of the image. That said, relative location can give a clue: when a feline is detected next to a television set, it is more likely to be a domestic cat than a cheetah.

A distinction is often made between classification and detection. In the first case the system has an image as input and a label as output. In the second case the model takes the same input but additionally outputs the location of an object. In natural images, the location of the object within the image is typically not important for its label, i.e., if the cat in the image were moved slightly to the left, it would still be a cat. For medical images this can be the opposite: certain abnormalities are more likely to occur in certain parts of the scan.

Most deep neural networks for classification have some translation invariance built in, induced by the max-pooling layers. However, the final feature map still contains relative location information: the top left pixel in the final feature map corresponds to the top left part of the original image, the top right pixel to the top right part, etc.

Neither traditional systems for object detection, such as the Viola-Jones face detector, nor more recent object detection architectures, such as R-CNN [3] and most of its variants, take into account the location of the area of interest within the full image. This means that when these architectures are used, explicit location information is lost.

Location in medical images

Just like intensity and scale, location can be a cue for the nature of an area in a medical image. Tumors are more likely to occur in certain parts of the breast; multiple sclerosis (MS) lesions appear most often in certain parts of the brain. Taking this information into account can therefore improve performance. Detection architectures working on medical data typically add location features manually somewhere in the model [8], by constructing a coordinate system and computing the location of detections relative to landmarks in the image.
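A hypothetical helper along these lines (the landmark coordinates and feature choices are illustrative, not the actual features used in [8]):

```python
import numpy as np

def location_features(detection_xy, landmarks):
    """Hypothetical sketch: express a detection's position relative to a
    set of anatomical landmarks (e.g. points on the nipple and chest wall
    in a mammogram), yielding location features a classifier can use."""
    det = np.asarray(detection_xy, dtype=float)
    lms = np.asarray(landmarks, dtype=float)
    offsets = det - lms                       # vector to each landmark
    distances = np.linalg.norm(offsets, axis=1)
    return np.concatenate([offsets.ravel(), distances])

# Example: a detection at (120, 80) with two landmarks.
feats = location_features((120, 80), [(100, 80), (120, 40)])
# feats holds the per-landmark offsets followed by the distances,
# ready to be concatenated to the CNN's learned features.
```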

Variation in scale

If we want to discriminate dogs from cats, it doesn’t really matter how large they are in the image: a cat is a cat, both in close-up and in the distance. For most computer vision applications this will hold, because the distance of an object to the camera and things such as camera parameters are typically not standardized. There are some nuances of course — cats look a lot like cheetahs, if the texture of the fur is not very clear, size may be a cue, but only when surrounding elements can be used to estimate size.

Image convolved with Gaussian kernels of different scale. Top left to bottom right: sigma = 12, sigma = 6, sigma = 3, original image. Every scale responds to features of different size. Taking the maximum response, we can generate representations invariant to scaling.

Scale space is a framework for multi-scale analysis of images (or signals in general). To generate a scale space, the image is convolved with a kernel function controlled by a scaling parameter. By varying this scaling parameter, different elements in the image are accentuated. In a seminal paper, Koenderink [4] showed that under some reasonable assumptions the only sensible kernel to blur the image with is the Gaussian

G(x, y; σ) = 1/(2πσ²) · exp(−(x² + y²)/(2σ²))

where the variance σ² controls the scale.

Popular classical methods such as SIFT (Scale invariant feature transform) [5] make use of the concept of scale space to perform detection and description of interest points in an image, invariant to scale. In the detection step, convolutions are used to identify salient points in the image. In the description step, a small window around these points is taken and transformed to a feature vector. Similar to max pooling operations that generate some spatial invariance, we can use the max operation on the scale space to generate scale invariance.
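The max-over-scales idea can be sketched with a small Laplacian-of-Gaussian scale space (using SciPy; the sigma values and the synthetic blob are arbitrary example choices):

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

# Build a Laplacian-of-Gaussian scale space and take the maximum
# response over scales at every pixel; the max over sigma yields a
# response that is (approximately) invariant to the size of the blob.
image = np.zeros((64, 64))
image[28:36, 28:36] = 1.0  # a bright square 'blob'

sigmas = [1.0, 2.0, 4.0, 8.0]
# Multiply by sigma^2 to get the standard scale-normalized response.
stack = np.stack([s**2 * np.abs(gaussian_laplace(image, s)) for s in sigmas])

max_response = stack.max(axis=0)   # scale-invariant response map
best_scale = stack.argmax(axis=0)  # which scale fired at each pixel
```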

Just like intensity, given a diverse dataset and the right learning conditions, we can expect a deep neural network to generate an internal representation invariant to scale, if the problem requires it.

Scale in medical images

Scale is another type of quantitative information often present in medical image analysis. For example, the size of a pixel in X-ray images is typically provided in the DICOM header. Knowing the number of pixels an abnormality occupies and the pixel size from the header, we can compute the physical size of the lesion.

This exact size can be an important feature. In mammography (X-ray imaging of the breast), for example, cancers are not scale invariant: increasing the size of a small cancer does not result in a realistic large tumor.
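Given the pixel spacing from the header, computing a lesion's physical size is straightforward; the spacing below is an assumed example value, and in real code it would come from the DICOM PixelSpacing attribute:

```python
import numpy as np

# Pixel spacing (mm per pixel) comes from the DICOM header
# (the PixelSpacing attribute); these are assumed example values.
pixel_spacing = (0.07, 0.07)  # row spacing, column spacing in mm

# Binary mask of a detected abnormality (here: a 10x10 pixel square).
mask = np.zeros((100, 100), dtype=bool)
mask[40:50, 40:50] = True

# Physical area of the lesion: pixel count times mm^2 per pixel.
area_mm2 = mask.sum() * pixel_spacing[0] * pixel_spacing[1]
# 100 pixels * 0.0049 mm^2 per pixel = 0.49 mm^2
```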

Contrary to objects in natural images, medical abnormalities are in general not invariant to scaling. By taking a small early stage tumor and simply zooming in, we will not get the cancer displayed in the image above. Removing this information by means of scaling can degrade a deep neural network’s performance.

In object detection architectures such as R-CNN [3], regions of interest are extracted from an image using a candidate detector such as selective search, and bounding boxes are generated around these areas of interest. To feed these to a deep neural network, the boxes are warped to a fixed size, effectively removing shape and scale information.

Popular object detection architectures such as R-CNN use anisotropic resizing to feed a region to a classification CNN. This effectively removes any quantitative information about the scale and aspect ratio of the object, which can be an important feature for problems in medical image analysis.

In later meta-architectures for object detection such as Fast and Faster R-CNN, this operation was replaced with region of interest (ROI) pooling, which is done in feature space and may suffer less from the same issue. However, this type of cropping and resizing is a common operation that can remove any type of quantitative information about the size of an abnormality, which can be important features for medical problems.
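One simple way to avoid losing this information is to keep the physical size as side features next to the warped crop; the helper below is a hypothetical sketch (including the `pixel_spacing` value), not part of any R-CNN variant:

```python
import numpy as np

def crop_with_size_features(image, box, out_size=32, pixel_spacing=0.1):
    """Sketch: warp a region to a fixed size for the CNN, but keep the
    physical width/height (in mm) and the aspect ratio as extra features,
    so the quantitative scale information is not thrown away.
    pixel_spacing (mm/pixel) is an assumed example value."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    # Nearest-neighbour resize to a fixed grid (placeholder for real warping).
    rows = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    warped = crop[np.ix_(rows, cols)]
    width_mm = (x1 - x0) * pixel_spacing
    height_mm = (y1 - y0) * pixel_spacing
    size_feats = np.array([width_mm, height_mm, width_mm / height_mm])
    return warped, size_feats

image = np.random.default_rng(0).random((200, 200))
warped, size_feats = crop_with_size_features(image, (50, 60, 90, 140))
# warped goes to the CNN; size_feats can be concatenated to its features.
```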

Variation in orientation

Contrary to the above mentioned sources of variation, orientation may actually be relevant for some computer vision problems. Most objects in the real world have some canonical orientation: pedestrians typically have their head on top and feet below; cars have their wheels close to the ground. This is even more important in character recognition problems, such as the famous MNIST dataset: rotating the six by 180 degrees will result in a nine, a different class.

The classes in the famous MNIST dataset are not invariant to rotation: a 180-degree rotation will change a 6 into a 9 and vice versa. Although the class label may not change in all cases, most objects in natural images do have a canonical orientation: pedestrians have their head above their feet, cars have wheels below the windows. Therefore, models do not need to be invariant to orientation.

Deep neural networks and object detection architectures typically do not learn representations invariant to orientation unless the training data requires it.

Orientation in medical images

Contrary to natural images, orientation is generally not an important feature in medical data and objects in medical images have no canonical orientation. For instance, the tissue slices in digital pathology images are placed on the glass without predefined alignment. A similar thing holds for tumors in the chest, breast or brain and many other diseases. Even though background tissue can have a certain structure which is orientation dependent, the abnormalities themselves do not.

Contrary to objects in natural images, medical images often have no canonical orientation. By changing the architecture of CNNs to ignore orientation, the performance can improve.

Instead of changing our architecture to take this source of variation into account, we can change it to be less susceptible to it. This is done in recent work by Cohen and Welling [9], who introduce G-convolutions, a generalization of the convolution operator that is equivariant (meaning the output of an operation behaves in a predictable way under transformations of the input) not only with respect to translation, but also to discrete rotations and flipping operations. This model has shown great performance on various medical image analysis problems.
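G-convolutions change the network itself; a much cruder way to obtain invariance to the same discrete rotations and flips, shown here only to illustrate the idea, is to average a model's predictions over the eight transformed copies of the input:

```python
import numpy as np

def rotation_averaged_prediction(image, predict):
    """Average a model's predictions over the four 90-degree rotations
    and their flips (the same discrete group used by G-convolutions);
    the averaged output is invariant to those transformations."""
    variants = [np.rot90(image, k) for k in range(4)]
    variants += [np.fliplr(v) for v in variants]
    return np.mean([predict(v) for v in variants])

# Toy 'model': mean of the top-left quadrant. On its own it is clearly
# not rotation invariant.
def toy_predict(img):
    h, w = img.shape
    return img[: h // 2, : w // 2].mean()

img = np.random.default_rng(0).random((8, 8))
score = rotation_averaged_prediction(img, toy_predict)
# The averaged score is identical for the image and any rotated or
# flipped copy of it.
```

Unlike G-convolutions, this test-time averaging does not share weights across orientations during training, so it is less data efficient, but it makes the invariance argument concrete.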


Deep learning algorithms are powerful tools that, because of the greater availability of training data, are typically developed for natural images (images of everyday objects recorded with RGB cameras). Recently, these models have also been adapted and applied to problems in medical image analysis, like detecting cancer in X-ray images and segmenting tissue in MRI scans.

Contrary to natural images, medical data often contains quantitative information that can be used to make neural networks perform better. The exact intensity of a pixel, the scale of abnormalities and their location in a scan can all be important cues. Conversely, and again contrary to natural image analysis, orientation is often not relevant for medical image analysis problems. Adapting the architectures that were developed for natural images to take these differences into account can greatly improve the performance of the algorithms and push towards clinically viable products.


Many thanks to Robert, Rasmus, John, Marc, Markus and Jack for their proofreading and suggestions.


[1] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012).

[2] Turk, M.A. and Pentland, A.P., 1991. Face recognition using eigenfaces. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '91).

[3] Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580–587).

[4] Koenderink, J.J., 1984. The structure of images. Biological Cybernetics, 50, pp.363–370.

[5] Lowe, D.G., 1999. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on (Vol. 2, pp. 1150–1157). Ieee.

[6] van Engeland, S., Snoeren, P.R., Huisman, H., Boetes, C. and Karssemeijer, N., 2006. Volumetric breast density estimation from full-field digital mammograms. IEEE transactions on medical imaging, 25(3), pp.273–282.

[7] Prince, Jerry L., and Jonathan M. Links. Medical imaging signals and systems. Upper Saddle River, NJ: Pearson Prentice Hall, 2006.

[8] Kooi, T., Litjens, G., van Ginneken, B., Gubern-Mérida, A., Sánchez, C.I., Mann, R., den Heeten, A. and Karssemeijer, N., 2017. Large scale deep learning for computer aided detection of mammographic lesions. Medical image analysis, 35, pp.303–312.

[9] Cohen, T., Welling, M.: Group equivariant convolutional networks. In: Int. Conf. on Machine Learning. (2016) 2990–2999