Photometric data augmentation in projection radiography

Thijs Kooi
Published in Lunit Team Blog · Apr 16, 2021 · 17 min read

For artificial intelligence to be widely applied to a variety of problems, building specific systems for every possible niche does not scale: we should learn from data as much as possible. With the deep learning paradigm shift in 2012, this ideal has come closer within reach. However, systems are still rife with ad-hoc engineering tricks. Data augmentation is one such trick, commonly used to add prior knowledge to a learning problem. It can be seen as a label-preserving transform applied to the data, typically designed to make the network invariant to that transformation [1].

Data augmentation techniques also play a key role in self-supervised learning methods that have recently shown impressive results, such as SimCLR and SEER [2, 3, 4, 5, 6]. The model learns to map different views of a sample to a similar representation, while pushing views of different samples further apart in the embedding space. To generate views, data augmentation techniques such as cropping, flipping and color distortion are applied. An example of the set of augmentations used in SimCLR [2] is shown in figure 1.

Figure 1. Data augmentation techniques used in SimCLR [2], a popular self-supervised learning method. Different ‘views’ of a sample are created by augmenting it with these techniques. The model then learns to map these ‘views’ to a similar representation.

Data augmentation is also highly important in the current state of AI development in the medical imaging domain. Annotated data is still scarce: obtaining annotations is expensive and/or data is simply in short supply. Augmenting data is therefore quite important, as it artificially increases the amount of labeled data. On the other hand, abundant non-annotated data is (potentially) available, though not always easy to access, and could be used for self-supervised learning, at least for common examinations like chest X-rays or mammograms. Since most self-supervised learning methods rely heavily on some form of augmentation to generate different views, thinking about data augmentation is highly relevant.

Data augmentation techniques (for medical data) can roughly be divided into three types:

  1. Geometric augmentation: These include transformations such as cropping, rotation, scaling and translation.
  2. Photometric augmentation: These include changes to the pixel values, such as contrast, sharpness, blurring, brightness and color changes.
  3. Structural augmentation: In the medical domain, these include adding tissue [7] or artifacts that can occur during recording (such as tubes, bias fields, white lines, etc.). This type is less common in general computer vision, but an approach like cutout [8] can be seen as structural augmentation.

In this post, we will go into some detail on the physics of projection radiography. We will subsequently give a physical interpretation of commonly used photometric augmentation techniques for natural images and discuss if and why they make sense for X-ray images.

Note that the descriptions of the imaging process in this blog post are simplifications, with the goal of getting a high-level understanding of a physical interpretation of data augmentation. For detailed descriptions, please refer to the many excellent books and websites on medical imaging and medical physics [9, 10, 11].

X-ray

X-rays were first discovered by Wilhelm Conrad Röntgen, a German physicist, while experimenting with vacuum tubes. Like natural light, X-rays are a type of electromagnetic radiation, but they have a higher energy than visible light, which allows them to pass through objects more easily. Unlike light captured with RGB cameras, X-ray images are not reflections of an object but projections, or shadows, of an object. This gives the images different properties, which will be explained below.

X-ray generation

X-rays are generated by X-ray tubes, which convert electricity into radiation that can be used for imaging. An X-ray tube consists of a cathode and an anode. To generate X-rays, a difference in electric potential is created between the two, such that electrons start to flow from the cathode and are captured by the anode. An illustration of an X-ray tube is shown in figure 2.

When the electrons hit the anode, which is made of a heavy metal (such as tungsten or molybdenum), they are decelerated. During this process, about 99% of their energy is converted into heat and about 1% into two kinds of X-rays: characteristic X-rays (sharp peaks in the spectrum associated with specific atoms in the anode) and Bremsstrahlung (German for ‘braking radiation’, the rest of the spectrum). The latter makes up the bulk of the radiation used for imaging.

Figure 2. Simplified illustration of an X-ray tube. To generate X-rays, a potential difference is created between the cathode and the anode, so that electrons flow between them. The anode decelerates the electrons, generating the X-rays used for imaging. [Source: https://www.orau.org/ptp/collection/xraytubescoolidge/coolidgeinformation.htm].

Some important parameters that can vary in this process are (1) the ‘electrical pressure’: the potential difference between the cathode and the anode, typically measured in kilovolts (kV); (2) the current flowing between them, measured in milliamperes (mA); and (3) the time this current flows. The last two are often combined into milliampere-seconds (mAs), the product of tube current and exposure time [12].

These parameters are related to two important image characteristics: noise and contrast. Increasing the mAs increases the number of photons, but not their energy. Increasing the kV increases both the number of photons and their energy. If the X-rays have higher energy, more photons pass through the body, which decreases contrast because the difference in attenuation between soft and hard tissue becomes smaller. It also increases the brightness or intensity of the image, as simply more photons land on the detector.

Both kV and mAs affect the noise in the final image. The most common source of noise in X-ray imaging is referred to as quantum noise or quantum mottle [10, 11, 12]. The number of X-rays landing on the detector can be compared to a tile getting wet in the rain: the uneven distribution of photons results in a grainy, noisy appearance of the image [11]. The effect of changing the mAs and kV is shown in figure 3.

Figure 3. The effect of acquisition parameters on the image. The two parameters kilovoltage (the electric ‘pressure’) and milliampere-seconds (current times exposure time) affect the noise and contrast of the image. (Top) When decreasing the kV, the contrast increases. (Bottom) When decreasing the mAs, the noise increases. (Images taken from Huda, W. and Abrahams, R.B., 2015. Radiographic techniques, contrast, and noise in x-ray imaging. American Journal of Roentgenology, 204(2), pp.W126-W131.)

Attenuation

After the X-rays are emitted from the tube, they pass through the tissue, where they are attenuated: the intensity of the beam decreases. The degree of attenuation depends on the type of tissue. ‘Hard’ structures like bone or metal attenuate the X-rays more strongly (or, in the case of some metals, even block them), while soft tissue like adipose or glandular tissue attenuates them less strongly.

Figure 4. When an X-ray beam passes through tissue, it is attenuated. The degree of attenuation depends on the type of tissue and is captured by the attenuation coefficient (mu) and the height of the tissue. The pixel values (represented by the black blob at the bottom) are exponentially related to the attenuation coefficient and the height. Here, I and I_0 represent the transmitted intensity (the image value) and the incident beam intensity, respectively. Note that this is a simplification.

The relation between the beam intensity entering and leaving the tissue is captured by the Beer–Lambert law, I = I_0 exp(-mu * h), displayed in figure 4. Here, mu denotes the attenuation coefficient of the tissue and h its height. The projection of the X-ray beam can be seen as a path integral through some three-dimensional object function. This representation will become useful when discussing computed tomography below.
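To make the Beer–Lambert relation concrete, here is a minimal numerical sketch; the attenuation coefficients and tissue thicknesses are illustrative assumptions, not tabulated values.

```python
import numpy as np

# Illustrative linear attenuation coefficients (per cm); not tabulated values.
mu_soft, mu_bone = 0.2, 0.5
i0 = 1.0  # incident beam intensity I_0

# Beer-Lambert law: I = I_0 * exp(-sum of mu_i * h_i along the beam path).
i_soft_only = i0 * np.exp(-mu_soft * 4.0)                    # 4 cm of soft tissue
i_with_bone = i0 * np.exp(-(mu_soft * 3.0 + mu_bone * 1.0))  # 3 cm soft + 1 cm bone

print(i_soft_only, i_with_bone)  # the path through bone transmits fewer photons
```

The exponent sums mu times thickness over everything the beam traverses, which is exactly the path integral mentioned above.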

Detection

X-rays can be captured directly on photographic film, but this is inefficient, as only 1–2 percent of the X-rays are actually recorded [10]. In practice, specialized devices are used that convert X-rays into visible light. Currently, roughly three types of detection systems are in use [13, 14]:

  1. Screen-film X-ray: In this case, the images are recorded directly on film. This technique is falling out of fashion because digital techniques have proven superior.
  2. Indirect capture or computed radiography (CR): Here the image is first recorded on a cassette, which is then scanned by specialized equipment and used for viewing. Like screen-film, this is becoming less popular; it is often a first step when institutions move from analog to digital equipment.
  3. Direct capture or digital radiography (DR): Here, the X-rays are converted to visible light and captured by a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) detector, similar to the detectors used in consumer digital cameras.

Figure 5. Processing pipeline of digital radiographs. The raw image, containing the photon count, is converted into a ‘processed’ or ‘for presentation’ image using image processing techniques designed by the device manufacturer. The medical image viewer (or the pre-processing step of the AI algorithm) subsequently performs a second set of transformations, such as window-leveling and inversion.

Post-processing & DICOMs

Medical image data are typically stored in the picture archiving and communication system (PACS) on the hospital’s servers. The recordings are kept in a format defined by the Digital Imaging and Communications in Medicine (DICOM) standard, which grew out of a standard introduced by the American College of Radiology and the National Electrical Manufacturers Association in 1985 [15, 16, 17]. A DICOM file comprises a header containing meta-information about the recording, ranging from acquisition parameters such as the anode material to patient information such as age and gender.

Screen-film, CR and DR images are all stored as DICOMs, but digital radiographs can still come in two formats:

  1. For processing or raw: In this case, the pixel values are (almost) exactly what comes out of the detector.
  2. For presentation or processed: In this case, the device manufacturer has applied several processing steps to the image to make it easy to interpret for the human visual system. These include histogram equalization methods to improve contrast, sharpening operations to enhance fine structures, and methods that compensate for unequal exposure. Unfortunately, the exact operations are typically not published; they are a major cause of domain shift and a pain point for the generalization of AI methods.

After the manufacturer-defined post-processing, the user (or the pre-processing step of the AI system) still performs some image transformations. One such operation is selecting the optimal ‘window-level’. The ‘window’ defines the width of the pixel range that will be displayed; the ‘level’ defines the center of that range. Together, these parameters change the contrast and brightness of the displayed image. The full pipeline is shown in figure 5.
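As a minimal sketch of the window-level operation (assuming a floating-point image and hypothetical window and level values):

```python
import numpy as np

def window_level(image, window, level):
    """Map [level - window/2, level + window/2] to [0, 1], clipping outside.

    This mimics what a DICOM viewer does with the WindowWidth and
    WindowCenter attributes; the parameter values below are hypothetical.
    """
    low = level - window / 2.0
    high = level + window / 2.0
    return np.clip((image - low) / (high - low), 0.0, 1.0)

# Example: a wide window gives low contrast, a narrow window high contrast.
image = np.random.uniform(0, 4095, size=(256, 256))  # stand-in for a 12-bit image
wide = window_level(image, window=4000, level=2048)
narrow = window_level(image, window=500, level=2048)
```

A narrower window stretches a smaller pixel range over the display range, increasing contrast; shifting the level brightens or darkens the result.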

Compression and quantization

The DICOM standard supports JPEG 2000 compression [18, 19], which compresses more efficiently than traditional JPEG and is more suitable for large images like medical scans. JPEG 2000 works with wavelet transforms to selectively remove high-frequency components from images that humans do not pay attention to.

After the optional compression, the images are stored in the pixel data element of the DICOM file. Digital radiographs typically reserve 16 bits per pixel, of which only 12 or 14 are actually used. Scanned screen-film images are usually stored in 8 bits.
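As a small illustration, the sketch below uses pydicom to inspect some of the attributes discussed in this section; the file name is a placeholder and not every file contains every attribute.

```python
import pydicom

ds = pydicom.dcmread("example.dcm")  # placeholder path

# Acquisition parameters from the header (present for most radiographs).
print(ds.get("KVP"))        # peak kilovoltage of the X-ray tube
print(ds.get("Exposure"))   # exposure, related to the mAs

# Storage parameters relevant to quantization.
print(ds.BitsAllocated)     # e.g. 16 bits reserved per pixel
print(ds.BitsStored)        # e.g. 12 or 14 bits actually used

# Viewing parameters suggested by the manufacturer.
print(ds.get("WindowCenter"), ds.get("WindowWidth"))

pixels = ds.pixel_array     # the pixel data as a numpy array
```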

Side note: The merit of raw data

The pipeline described above consists of steps designed to optimize the viewing experience for humans, such as sharpening operations and JPEG compression, which removes components not visible to humans. This does not mean these steps are also optimal for statistical models. Deep neural networks can work very well with limited pre-processing of the data.

The data processing inequality [20] states that the information coming out of a noisy channel is at most equal to the information going in. Theoretically, this means any post-processing can only limit performance, unless the operations performed exclusively remove information irrelevant to the class labels in the learning problem. Ideally, the deep neural net would have access to the raw (‘for processing’) data and learn all relevant transformations from data. This would also mitigate domain shift issues, as raw data is expected to be much more similar between manufacturers. Unfortunately, in practice it is difficult to get the raw data, since it is not always stored.

Computed tomography

X-ray images allow us to see through an object. However, the image is a projection, mathematically a path integral, which is a many-to-one (surjective) mapping: the exact 3D shape cannot be recovered. If a nodule, tumor or some other finding is hiding behind a thick, dense structure, it will go unnoticed. By taking pictures from different angles we can prevent this, essentially looking around a corner. Moreover, if we take views from many different angles, we can make an accurate 3D reconstruction of the object. This is the concept behind computed tomography or CT [21, 22, 23, 24, 25, 26], illustrated in figure 6.

Figure 6. X-rays allow us to see through some object, but do not capture the 3D shape of the object. It is essentially a many-to-one mapping. When imaging the two black objects (cyan arrows represent X-ray beams) in the center of the image from the top, the resulting projection will look like the one on the bottom. When imaging the objects from the right, the resulting projection will look like the one on the left.

During a CT scan, the patient or object is placed inside the scanner and projections are acquired from angles spanning (at least) 180 degrees around it. The individual projections are reconstructed into the 3D object function. Using this function, we can scroll through slices in the x, y or z direction and/or generate slices through the object.

Reconstruction

One of the most popular reconstruction algorithms is filtered backprojection, which we explain below. Since its introduction, many variations and alternatives have been developed. Nowadays, algorithms making use of deep neural networks are also being explored [27].

The projection of the X-ray through an object can be formulated mathematically by the Radon transform:

p(theta, t) = ∫_L(theta, t) f(x, y) ds

where L(theta, t) is the line at angle theta intersecting the detector plane at point t, and the integral runs along this line. Here, we assume a two-dimensional object function f(x, y) for the sake of simplicity. Reconstructing the 2D shape can be seen as inverting many Radon transforms, for many different angles simultaneously, and merging the output into a single function.

A computationally efficient way to accomplish this is through Fourier transforms and the Fourier slice theorem. This theorem states that the 1D Fourier transform of a projection of the object function f(x, y) at an angle theta is the same as a slice through the 2D Fourier domain of f at that same angle theta. An illustration is provided in figure 7.

Figure 7. Illustration of the Fourier slice theorem. The Fourier transform of a projection at an angle theta of some two-dimensional object function f(x,y) is the same as a slice through the Fourier spectrum at the same angle
(Image courtesy of W. Van Aarle, imec-Visionlab, Univ. Antwerp, Belgium, [25]).

Using this theorem, we could fill up the frequency domain of the object function and then apply an inverse transform to obtain the function in the spatial domain. However, the Fourier domain would be sampled very unevenly: most samples lie in the low-frequency range. Filtered backprojection makes use of the Fourier slice theorem but adds one additional step, a filter, to mitigate this.

Filtered backprojection

Instead of using the Fourier transform, we can take each projection and ‘smear’ it back out over the spatial domain of the function. This results in a function looking like the top row of figure 8. This is in line with what we expect from the discussion above: low frequencies are sampled more densely, and therefore the image looks blurry. To mitigate this, a high-pass filter is applied that dampens low frequencies and enhances high frequencies. This filter is also known as a reconstruction kernel.

Figure 8. Top: The image function (left in figure 8) backprojected without a filter. Bottom: the same image function backprojected with a filter.
(Image courtesy of W. Van Aarle, imec-Visionlab, Univ. Antwerp, Belgium, [25])

If we apply a simple high-pass filter in the frequency domain and then perform the reconstruction, the object will look like the bottom row in figure 8. Because these filters sharpen the image, they also enhance the noise, which is not always desirable. In practice, there is a trade-off between sharpness and noise, and different reconstruction kernels are used for different body parts: sharp kernels where fine detail matters (e.g., bone), softer kernels where noise suppression matters (e.g., soft tissue). Examples of different kernels in real images are shown in figure 9. Here too, deep neural networks are becoming popular, for instance to convert images between different kernels in clinical studies. A code sketch of filtered backprojection follows figure 9.

Figure 9. Illustration of different reconstruction kernels. Sharper kernels enhance edges in the image (image taken from Choe, J., Lee, S.M., Do, K.H., Lee, G., Lee, J.G., Lee, S.M. and Seo, J.B., 2019. Deep learning–based image conversion of CT reconstruction kernels improves radiomics reproducibility for pulmonary nodules or masses. Radiology, 292(2), pp.365–373.)
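The following sketch reproduces the effect in figure 8 with scikit-image: backprojecting without a filter gives a blurry reconstruction, while the ramp filter restores the high frequencies. The phantom and angles are illustrative.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon

phantom = shepp_logan_phantom()                  # 2D object function f(x, y)
angles = np.linspace(0.0, 180.0, 180, endpoint=False)

# Forward projections for all angles: the sinogram (stacked Radon transforms).
sinogram = radon(phantom, theta=angles)

# Plain backprojection vs. filtered backprojection.
blurry = iradon(sinogram, theta=angles, filter_name=None)
sharp = iradon(sinogram, theta=angles, filter_name="ramp")
```

Swapping "ramp" for a smoother filter such as "hann" trades sharpness for noise suppression, much like the soft clinical kernels discussed above.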

Slice thickness

In early CT scanners, the patient was moved through the scanner in steps: the machine would pause at intervals while a focused X-ray beam rotated around the patient, producing an image slice at each position. The area on which the beam is focused determines the slice thickness; how far apart the slices are determines the slice spacing. In modern scanners (such as helical CT), the X-ray beam rotates continuously around the patient, and the slice thickness corresponds to the width of the collimated beam [29].

Similar to 2D projection radiography, important image characteristics in CT are noise and contrast. Besides the kV and mAs, the slice thickness and the reconstruction kernel also affect noise and contrast [24]. The slice thickness changes the number of X-rays contributing to each slice. The reconstruction kernel modulates the sharpness of the image, but also the noise, since noise consists of high-frequency components that a sharp kernel enhances.

Data augmentation techniques and their physical meaning

Let’s go back to neural network training, look at commonly used data augmentation techniques for vision, and try to relate them to the image formation process for projection radiography described above. Below is a list of transformations commonly used in research papers [30, 31] and augmentation libraries like Albumentations [32]. For each augmentation, we give a simple physical interpretation if there is one, or argue why the augmentation may not make physical sense otherwise. A sketch combining several of these augmentations into a pipeline follows the lists below.

  • Gaussian noise — X-ray tube parameters & slice thickness
    Adding noise simulates a combination of factors. As mentioned above, the noise depends on the kV and mAs of the X-ray tube. The noise characteristics are slightly different, however: the majority of the noise is ‘shot noise’, which is modeled by a Poisson distribution (a minimal simulation sketch follows this list). At typical photon counts, it is in practice close to Gaussian noise.
  • Gaussian blur / sharpness — Post-processing and compression
    A Gaussian blur removes high-frequency components from images and can therefore be seen as simulating part of the JPEG compression operation, which does something similar (though of course not the same). The ‘inverse’ of a Gaussian blur, a sharpening operation, can simulate post-processing steps applied by the manufacturer. For CT images, blurring and sharpening can simulate different reconstruction kernels.
  • Posterize — Bit depth
    In a posterize operation, the bit depth of the channels is reduced, resulting in an image that looks like a ‘low quality’ version of the original. Changing an image from 14 to 12 bits, for example, would simulate the different bit depths used to store DICOM files.
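As referenced in the Gaussian noise item above, here is a minimal sketch of simulating quantum noise with a Poisson distribution; the photon count is a hypothetical knob standing in for the mAs setting.

```python
import numpy as np

def quantum_noise(image, max_photons=10_000, rng=None):
    """Resample a [0, 1] image as Poisson photon counts (quantum mottle).

    `max_photons` is a hypothetical expected count at the brightest pixel;
    lowering it mimics a lower mAs setting and increases the noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = rng.poisson(image * max_photons)          # photon count per pixel
    return counts.astype(np.float32) / max_photons     # back to [0, 1]
```

At high photon counts, the Poisson distribution is close to a Gaussian, which is why plain Gaussian noise is often a reasonable stand-in.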

As mentioned above, there are two ‘blocks’ of operations performed on an image before it is displayed. The first is a set of operations applied by the manufacturer, the second by the viewer (or AI system). The following augmentations can be seen as simulating the ‘image viewing’ block.

  • Contrast — Windowing
    When reading images, doctors typically change the contrast of an image to their liking, which can improve the visibility of certain abnormalities. For deep neural networks, images are usually window-leveled as a pre-processing step before training and inference. Contrast augmentation can simulate variation in this process and relates mostly to the ‘window’ parameter.
  • Brightness — Leveling
    Similar to contrast, brightness changes can simulate different window-level settings, but relate mostly to the ‘level’ parameter.

Lastly, there are some operations which do not have a clear physical interpretation.

  • Color distortion — No interpretation
    This term is used in recent papers such as MoCo and SimCLR for a collection of augmentations of the color of an image, such as changing the brightness, hue, contrast and saturation of separate channels. Since X-ray images are grayscale, the hue and saturation components have no direct physical meaning (the brightness and contrast components were discussed above).
  • Sobel filter — No interpretation
    This filter extracts an edge map from the image and is one of the augmentations studied in SimCLR. Although manufacturers often apply some type of post-processing to enhance edges, it is unlikely that a raw edge map simulates any of these post-processing algorithms.
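To tie the items above together, here is a sketch of how they could be combined into an Albumentations (v1.x) pipeline; the probabilities and parameter ranges are assumptions, not tuned values.

```python
import albumentations as A
import numpy as np

transform = A.Compose([
    A.GaussNoise(p=0.5),                        # ~ quantum noise (kV / mAs)
    A.GaussianBlur(blur_limit=(3, 5), p=0.3),   # ~ soft kernel / compression
    A.Sharpen(p=0.3),                           # ~ manufacturer post-processing
    A.Posterize(num_bits=6, p=0.2),             # ~ reduced bit depth
    A.RandomBrightnessContrast(p=0.5),          # ~ window-level variation
])

# Stand-in for an 8-bit grayscale radiograph.
image = (np.random.rand(256, 256) * 255).astype(np.uint8)
augmented = transform(image=image)["image"]
```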

To summarize, many commonly used augmentation techniques have some physical interpretation for projection radiographs. If we think of data augmentation as class-preserving transformations and a means to induce invariance in the model [1], then modeling all physical and image-processing parameters that can vary during image acquisition (such as kV, mAs, compression, windowing, etc.) by means of data augmentation could improve generalization.

Figure 10. Optimal augmentation policies found by AutoAugment [33]. The optimal augmentations do not always look like realistic transformations.

By looking at the image formation process, we can reason about what constitutes realistic augmentations. Realistic does not necessarily mean optimal, however. Many augmentations used in computer vision pipelines do not generate realistic images; for example, some of the optimal policies found by AutoAugment [33] do not look like scenes you will easily encounter in the real world (see figure 10).

On top of that, the data augmentation process is likely not a realistic model of how humans learn: doctors are not trained by seeing the same mammograms or chest X-rays hundreds of times in slightly different views. The necessary invariance properties are likely already present. Learning these invariance properties in a massive backbone, as done in SEER, seems promising, and more research is needed to harness its potential for medical imaging.

Acknowledgement

Many thanks to Minchul, Gunhee and Sergio for fruitful discussions, suggestions and proofreads.

References

[1] Chen, S., Dobriban, E. and Lee, J.H., 2020. A group-theoretic framework for data augmentation. Journal of Machine Learning Research, 21(245), pp.1–71.

[2] Chen, T., Kornblith, S., Norouzi, M. and Hinton, G., 2020, November. A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597–1607). PMLR.

[3] Chen, X., Fan, H., Girshick, R. and He, K., 2020. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.

[4] He, K., Fan, H., Wu, Y., Xie, S. and Girshick, R., 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9729–9738).

[5] Goyal, P., Caron, M., Lefaudeux, B., Xu, M., Wang, P., Pai, V., Singh, M., Liptchinsky, V., Misra, I., Joulin, A. and Bojanowski, P., 2021. Self-supervised Pretraining of Visual Features in the Wild. arXiv preprint arXiv:2103.01988.

[6] LeCun, Y., 2021. Self-supervised learning: The dark matter of intelligence. https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/

[7] Kooi, T., van Ginneken, B., Karssemeijer, N. and den Heeten, A., 2017. Discriminating solitary cysts from soft tissue lesions in mammography using a pretrained deep convolutional neural network. Medical physics, 44(3), pp.1017–1027.

[8] DeVries, T. and Taylor, G.W., 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

[9] https://howradiologyworks.com/

[10] Prince, J.L. and Links, J.M., 2006. Medical imaging signals and systems (pp. 328–332). Upper Saddle River: Pearson Prentice Hall.

[11] Sprawls, P. The physical principles of medical imaging. http://www.sprawls.org/ppmi2/NOISE/

[12] Huda, W. and Abrahams, R.B., 2015. Radiographic techniques, contrast, and noise in x-ray imaging. American Journal of Roentgenology, 204(2), pp.W126–W131.

[13] Spahn, M., 2013. X-ray detectors in medical imaging. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 731, pp.57–63.

[14] Schaefer-Prokop, C., Neitzel, U., Venema, H.W., Uffmann, M. and Prokop, M., 2008. Digital chest radiography: an update on modern technology, dose containment and control of image quality. European Radiology, 18(9), pp.1818–1830.

[15] Haak, D., Page, C.E., Reinartz, S., Krüger, T. and Deserno, T.M., 2015. DICOM for clinical research: PACS-integrated electronic data capture in multi-center trials. Journal of Digital Imaging, 28(5), pp.558–566.

[16] van Ooijen, P.M., Aryanto, K.Y., Broekema, A. and Horii, S., 2015. DICOM data migration for PACS transition: procedure and pitfalls. International Journal of Computer Assisted Radiology and Surgery, 10(7), pp.1055–1064.

[17] Mildenberger, P., Eichelberg, M. and Martin, E., 2002. Introduction to the DICOM standard. European Radiology, 12(4), pp.920–927.

[18] Koff, D.A. and Shulman, H., 2006. An overview of digital compression of medical images: can we use lossy image compression in radiology?. Canadian Association of Radiologists Journal, 57(4), p.211.

[19] Foos, D.H., Muka, E., Slone, R.M., Erickson, B.J., Flynn, M.J., Clunie, D.A., Hildebrand, L., Kohm, K.S. and Young, S.S., 2000. JPEG 2000 compression of medical imagery. In Medical Imaging 2000: PACS Design and Evaluation: Engineering and Clinical Issues (Vol. 3980, pp. 85–96). International Society for Optics and Photonics.

[20] Beaudry, N.J. and Renner, R., 2011. An intuitive proof of the data processing inequality. arXiv preprint arXiv:1107.0740.

[21] Kak, A.C., Slaney, M. and Wang, G., 2002. Principles of computerized tomographic imaging.

[22] Schofield, R., King, L., Tayal, U., Castellano, I., Stirrup, J., Pontana, F., Earls, J. and Nicol, E., 2020. Image reconstruction: Part 1, understanding filtered back projection, noise and image acquisition. Journal of Cardiovascular Computed Tomography, 14(3), pp.219–225.

[23] Goldman, L.W., 2007. Principles of CT and CT technology. Journal of Nuclear Medicine Technology, 35(3), pp.115–128.

[24] Goldman, L.W., 2007. Principles of CT: radiation dose and image quality. Journal of Nuclear Medicine Technology, 35(4), pp.213–225.

[25] ASTRA Toolbox: https://www.youtube.com/watch?v=YIvTpW3IevI & https://www.youtube.com/watch?v=pZ7JlXagT0w

[26] http://xrayphysics.com/ctsim.html

[27] Würfl, T., Ghesu, F.C., Christlein, V. and Maier, A., 2016. Deep learning computed tomography. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 432–440). Springer, Cham.

[28] Choe, J., Lee, S.M., Do, K.H., Lee, G., Lee, J.G., Lee, S.M. and Seo, J.B., 2019. Deep learning-based image conversion of CT reconstruction kernels improves radiomics reproducibility for pulmonary nodules or masses. Radiology, 292(2), pp.365–373.

[29] Hu, H., 1999. Multi-slice helical CT: scan and reconstruction. Medical Physics, 26(1), pp.5–18.

[30] Shorten, C. and Khoshgoftaar, T.M., 2019. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), pp.1–48.

[31] Taylor, L. and Nitschke, G., 2017. Improving deep learning using generic data augmentation. arXiv preprint arXiv:1708.06020.

[32] Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M. and Kalinin, A.A., 2020. Albumentations: fast and flexible image augmentations. Information, 11(2), p.125.

[33] Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V. and Le, Q.V., 2019. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 113–123).
