AI in Radiology: Is It Being Adopted?
I remember my first encounter with radiology. We drove out in the middle of the night because I seemed to have fractured a bone. I recall a large plastic film with an X-ray image, placed against a brightly lit light box, which the doctor examined with a knowledgeable look.
These days, photographic film is long obsolete. And those who have been following developments in AI for a while may remember ominous warnings that the days of radiologists themselves are numbered.
Those were the heady days of 2016, when one advance in convolutional neural networks after another, within a span of months, let computer vision crack increasingly difficult problems. There was also optimism about self-driving cars. Fast forward to 2024, and one could be forgiven for being disappointed at the state of technological (or perhaps societal?) progress. But while driving instructors might be gloating, radiologists are not so happy. In fact, many are burning out from work overload.
Aging populations and the obesity epidemic in developed countries, along with air pollution, continued population growth, and of course widespread tobacco and alcohol consumption, contribute to ever-growing rates of cancer around the world, even though medicine is steadily getting better at treating the disease. Screening programs are increasingly being rolled out to enable early detection of tumours, and radiologists play a key role, often being the first medical professional to process every case, whether it is a true positive or a negative.
So, how close is AI to being able to help us? If we make an effort to cut through the hype, we may find studies like these:
- 2021: AI in radiology: 100 commercially available products and their scientific evidence — “we conclude that the sector is still in its infancy”
- 2023: Clinical applications of artificial intelligence in radiology — “we determined that currently AI has a modest to moderate penetration in the clinical practice”
Let’s try to understand what radiology is, what a radiologist does in their work, and what the actual issues are that get in the way of AI adoption.
What I will describe is my personal experience gained as a computer vision engineer working for two years at an AI medical device company specializing in CT scans. Naturally, some topics will be emphasized more, but that doesn’t mean any others are less important.
What a radiologist does and how AI can help
The field of radiology started over a century ago with the advent of X-ray radiation — hence the name. Today, it encompasses all available radiological imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), ultrasonography, and fluoroscopy.
CT is a common and effective imaging method because, even though it involves ionizing radiation (as does PET), it can be done quickly (unlike MRI), with high resolution (higher than ultrasound) and in 3D (unlike X-rays). It is fair to say that CT and MRI are so far the imaging modalities of choice, and the ones most loved by the AI community, because of their popularity in clinical use and the potential for AI to optimize workflows that are currently human-driven.
To understand where AI could be useful, we should first walk through the workflow a patient's case goes through.
- A radiologist gets an incoming patient’s scan.
- They fetch older scans, if any exist, as well as the patient's history — and try to match older findings with the new scan (often a drawn-out manual process).
- They go over the new scan from start to finish — identifying new findings if present.
- Then, based on these findings and according to a set of pre-defined rules, they decide what to do with the patient.
Because radiology is an indirect method of observing what's going on in the patient's body, the diagnosis is often probabilistic. The one certain way to confirm whether a cancer is truly present is to perform a biopsy — an organ tissue sample taken with a needle for pathological assessment. This is an invasive and traumatic procedure that can cause harm on its own, especially in elderly and infirm patients, but in some cases it is an inevitable part of the diagnostic process. For less obvious cases, a repeat scan after a certain period of time is the best course of action, though it is not without drawbacks: it requires another dose of radiation, any tumours present have a chance to grow in the meantime without intervention, and the patient is left without a conclusive diagnosis for even longer, which is a distressing situation to be in.
So the goal then becomes increasing the likelihood of survival based on imperfect information, with a range of factors affecting the outcome:
- The quality of findings from a scan
- The quality of diagnosis based on these findings
- Direct harm from more aggressive interventions
- Harm and reduced chances of survival from treatment coming too late
We can infer that AI can be applied in the following scenarios to improve upon this:
- Matching known findings from old scans to locations in the new scan (“registration”).
- Automatic detection, segmentation and classification of everything worthy of attention in the new scan.
- AI-assisted interactive correction of inaccuracies and omissions in the automatic results.
- Assigning a malignancy score to a patient, given all the findings, the patient's history, and metadata.
Radiologists usually specialize in certain areas of the body, such as the abdomen or thorax. Cancers differ from organ to organ and tissue to tissue — the only thing they have in common is that they multiply uncontrollably. A tumour can be identified by a specific difference from its surroundings. In certain cases, this difference can be seen in the image as it is (in a so-called native scan); in others, it is deliberately amplified by injecting a contrast agent into the patient's bloodstream. Additionally, different modalities are good at picking up different anomalies: CT (which is essentially a 3D X-ray) is good at discerning density, MRI looks at water content, and PET is helpful for seeing differences in the speed of metabolism.
So, as you can see, AI models for the use cases described above would also have to be modality-based and organ-specific. Some scenarios could be unified — such as if we have a model for manual one-click correction of an imperfectly segmented gray blob — because it would work on the level of blobs, without having to provide any sort of diagnosis as to what the blob itself actually is. However, most use cases indeed require separate, domain-specific solutions. That is quite a lot of models, not to mention deep domain knowledge imbued into them — no wonder human radiologists study for many years and specialize! This is certainly one of the reasons why the tasks of radiology have not been “solved” by AI as of yet — there is simply not enough time for the vast amount of work that goes not only into creating all these models, but also properly productionizing, validating and certifying them — more on that later.
Arguably, the thing that could save the most resources and bring the most impact to radiology is the automatic detection/segmentation of everything that a human doctor would have marked on a scan. This is a very popular subject of Kaggle-like challenges for those who want to try medical AI out for themselves. One well-known challenge that nicely showcases the diversity of models needed for different cancers was the Medical Segmentation Decathlon (2018) — as the name suggests, it challenged participants to come up with a generalized solution able to segment ten selected targets from accurately labeled data. There are plenty of more recent challenges too. There are also publicly available datasets for many organs, with the Cancer Imaging Archive being one such distributor. Some of them are well-known within the community, such as the legendary LIDC dataset of lung nodules; some are more obscure. Many of these datasets are used as benchmarks for research into better methods and are cited in many papers. Any time an incremental advancement in general deep learning-based computer vision happens, a paper applying the method to medical datasets is soon to follow.
Technologies for AI in radiology
The biggest difference between “computer vision” and “computer vision in radiology” is that the most popular modalities of the latter are in 3D. Most deep learning methods were originally developed for cat-and-dog, ImageNet-like 2D images, but over time, many architectures worthy of attention (no pun intended) have been extended and shown to work on 3D data as well. There are still inconveniences inherent to 3D: one is the much longer training and inference times, and another is the scarcity of powerful open-source pre-trained encoders, like the 2D ImageNet-based ones, which could provide a boost to any downstream task.
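To make the 2D-versus-3D difference concrete, here is a toy comparison (with arbitrary, assumed tensor sizes) of treating a scan as a stack of independent 2D slices versus a single 3D volume. Each 3D filter carries roughly three times as many weights and is applied to a much larger tensor at once, which is where the longer training and inference times come from.

```python
# A toy comparison of why 3D is heavier than 2D: the same data treated as a
# stack of 2D slices versus a single 3D volume. Sizes are arbitrary assumptions.
import torch
import torch.nn as nn

slices_2d = torch.randn(64, 1, 256, 256)      # 64 independent 2D slices
volume_3d = torch.randn(1, 1, 64, 256, 256)   # the same data as one 3D volume

conv2d = nn.Conv2d(1, 32, kernel_size=3, padding=1)
conv3d = nn.Conv3d(1, 32, kernel_size=3, padding=1)

print(sum(p.numel() for p in conv2d.parameters()))  # 320 weights (3*3*1*32 + 32)
print(sum(p.numel() for p in conv3d.parameters()))  # 896 weights (3*3*3*1*32 + 32)
print(conv2d(slices_2d).shape, conv3d(volume_3d).shape)
```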
While neural networks had been slowly developed for decades, the ongoing revolution can be traced to a specific point in time — September 2012 — when a three-person university team submitted a solution to the ImageNet Large-Scale Visual Recognition Challenge. It was based on convolutional neural networks and won the competition by a large margin, leaving everyone else in the dust. The network architecture came to be known as AlexNet.
Image classification, while impressive, is still a very coarse task, as it operates on the whole image. For many practical applications, in medicine and elsewhere, you need to understand the details in the image and where exactly they are located. One such task is segmentation — finding regions of interest, such as lesions, in the image and accurately outlining their boundaries. In 2015, the medical imaging community had its AlexNet moment with the advent of U-Net. This was a fully convolutional architecture — meaning it used only convolutional layers within the network. Its purpose was semantic segmentation — ‘semantic’ meaning that all pixels of a type, say, “tumor,” are separated from “background” without differentiating them into “tumor 1” or “tumor 2.” This is essentially a per-pixel classification task. Before U-Net, there were attempts to solve such localization problems by putting together a Frankenstein’s monster out of deep learning and the classic algorithms available at the time. One such example is R-CNN (2013), which uses classic pre-deep-learning algorithms for region extraction, CNNs for feature extraction from those regions, and finally a set of linear SVMs for class assignment. Another, from 2012, involved pixel-by-pixel classification of the image, providing the pixel itself and a surrounding patch of the image for context. U-Net, by contrast, was simple, computationally efficient, and able to overcome the limited amounts of training data that plague the medical field by efficiently using training-time augmentations — when the input image is rotated, flipped, or blurred — forcing the network to generalize better.
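To make the encoder-decoder-with-skip-connections idea concrete, here is a minimal, simplified U-Net-style network in PyTorch. The depth, channel counts, and 2D setting are illustrative assumptions rather than the original paper's configuration.

```python
# A minimal 2D U-Net-style encoder-decoder in PyTorch, written for illustration only.
# The channel counts, depth, and layer choices are simplified assumptions.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU: the basic building block at every resolution level.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_channels, 32)
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 upsampled channels + 64 from the skip
        self.up1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)    # 32 upsampled channels + 32 from the skip
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)  # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)                     # full resolution features
        e2 = self.enc2(self.pool(e1))         # 1/2 resolution
        b = self.bottleneck(self.pool(e2))    # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection from e1
        return self.head(d1)                  # logits per pixel

# A 256x256 single-channel slice in, a per-pixel class map out.
logits = TinyUNet()(torch.randn(1, 1, 256, 256))
print(logits.shape)  # torch.Size([1, 2, 256, 256])
```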
Since then, a myriad of modifications have appeared, such as extensions of the original U-Net to 3D (2016), explorations of alternative wiring for gradient propagation (U-Net++, 2018), and even loosely U-Net-inspired transformer-based architectures (Swin-Unet, 2021) and hybrids (Swin-UNETR, 2022). A strong argument in favor of U-Nets is the richness of the ecosystem that has emerged over the years — it’s likely that, at least on a technical level, your problem has already been discussed and solved by others. Of course, U-Net has also found extensive usage outside the medical field, and even outside segmentation — a prominent example being Stable Diffusion, which has a U-Net as its integral component.
The ultimate aim of many computer vision pipelines in medical imaging is to segment something; however, object detection is quite common as well. It can be the best choice for the first stage of a pipeline that ultimately also segments and classifies findings. Such a choice might be made if the data contains many objects of different sizes and catching as many as possible is crucial. Segmentation would be harder to get right, as underperformance on small objects can be obscured by good performance on larger ones. That can be mitigated to an extent by designing an elaborate loss function, but opting for object detection in the first place is a valid alternative as well.
What object detection architectures are most suitable for working with CT and MRI? The requirements are top accuracy, support for 3D images, and little need for real-time performance. Historically, the RetinaNet (2018) architecture satisfied all these requirements while still being quite fast as a single-stage object detector — and it became popular for those reasons. Thanks to its focal loss, which down-weights easy examples and thereby focuses training on the hard ones, it is particularly effective in situations where an image is sparse or contains many small objects. As with every popular CNN, there are numerous variants of it, including transformer-based ones that claim incremental improvements on standard benchmarks. However, these variants remain relatively obscure and are unlikely to be widely adopted in real-world applications yet.
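Below is a minimal sketch of the focal loss for binary detection targets, assuming sigmoid outputs; the alpha and gamma values are the commonly used defaults, not tuned ones.

```python
# A minimal sketch of the focal loss for binary targets with sigmoid outputs.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Standard binary cross-entropy per element.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the model's probability for the true class of each element.
    p_t = p * targets + (1 - p) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples, so the loss
    # concentrates on the hard ones, which helps when most anchors are background.
    modulating = (1 - p_t) ** gamma
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * modulating * bce).mean()

# Toy usage: 8 anchor predictions, mostly background, one positive.
logits = torch.randn(8)
targets = torch.zeros(8)
targets[3] = 1.0
print(focal_loss(logits, targets))
```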
The development lifecycle typically starts with a well-established vanilla architecture proven within the community. Later, data and product challenges become the main hurdles, so few in commercial settings ever reach the point of experimenting with cutting-edge architectures. That being said, there is ongoing experimentation with YOLO architectures in medical imaging — which are modern, constantly developing and widely popular outside the medical domain. While they don’t natively operate on 3D images, they can work with a fusion of 2D images and 3D point clouds. An example of where such fused data can come from is another AI darling already mentioned: self-driving technology.
Not everyone wants — or is able — to start all the way from PyTorch or TensorFlow, but they still want to experiment. Often, these users are domain scientists who can write Python but are not experts in machine learning. Even ML experts, however, often prefer to establish a quick baseline before committing to a long project. Neat frameworks exist that heuristically determine the best training configuration for the task and dataset at hand, freeing users from the need to do it themselves. The most popular of these are nnU-Net, nnDetection and Auto3DSeg. The last of these will lead you to the MONAI website — a sort of mini-Hugging Face for medical imaging, backed by Nvidia. There, you can find both pre-trained models and a powerful framework with many pre-implemented components. These can be used to develop AI pipelines manually, but still at a higher level of abstraction than PyTorch.
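As an illustration of that higher level of abstraction, here is a minimal sketch of building a 3D segmentation model from MONAI's pre-implemented components. The channel counts, patch size, and loss settings are illustrative assumptions, not a recommended configuration.

```python
# A minimal sketch of a 3D segmentation setup with MONAI, which wraps PyTorch
# at a higher level of abstraction. Values are illustrative, not recommendations.
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss

model = UNet(
    spatial_dims=3,              # volumetric CT/MRI data
    in_channels=1,
    out_channels=2,              # background vs. lesion
    channels=(16, 32, 64, 128),  # feature channels per resolution level
    strides=(2, 2, 2),           # downsampling between levels
)
loss_fn = DiceLoss(to_onehot_y=True, softmax=True)

# One toy training step on a random 64^3 patch with a random binary label map.
volume = torch.randn(1, 1, 64, 64, 64)
label = torch.randint(0, 2, (1, 1, 64, 64, 64)).float()
pred = model(volume)
loss = loss_fn(pred, label)
loss.backward()
print(loss.item())
```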
A discussion of deep learning approaches would be incomplete without mentioning large foundation models, which are all the rage these days. Many attempts to use popular models, such as DINOv2 or CLIP, for various downstream tasks can be found; however, most are still at an early proof-of-concept stage. Solutions for X-ray are generally one step ahead of CT, both in terms of technology and commercialization, due to the relative abundance of data and somewhat lower complexity.
Stable Diffusion is seen as a potential solution to the problem of data scarcity, especially in 3D. If plausible synthetic X-ray, CT, or MRI data can be generated, more generalizable models could be trained.
Even though LLMs are not directly related to computer vision, they have the potential to be very helpful in radiology. Every patient is accompanied by a series of long, detailed, and dry texts written into the hospital information system for each procedure or visit, including radiological studies. The accuracy requirements will undoubtedly be high, but if LLMs could help reduce the radiologist's workload, that would already be a significant improvement. I was once impressed that even LLaMa-2-7b was able to generally understand real radiological reports in the Estonian language; however, it hallucinated quite heavily at unpredictable moments.
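As a rough illustration of this kind of experiment, here is a minimal sketch of prompting an open LLM to restructure a free-text report via the Hugging Face transformers library. The report text and prompt are invented, the model name is the gated chat variant of the LLaMa-2-7b family mentioned above, and any real use would need rigorous validation against hallucinations.

```python
# A minimal sketch, not a validated clinical tool: prompt an open LLM to
# summarize a made-up radiology report. Requires access to the gated
# meta-llama weights on the Hugging Face Hub and a GPU with enough memory.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

report = (
    "CT of the chest: a 7 mm solid nodule in the right upper lobe, unchanged "
    "compared to the prior study. No pleural effusion."
)
prompt = (
    "Summarize the following radiology report as a short bullet list of findings, "
    "without adding anything that is not stated.\n\n" + report
)

# The generated text still needs to be checked against the source report,
# since models of this size readily hallucinate details.
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])
```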
There are also many specialized solutions, such as the powerful medical LLM from Google (MedPaLM) or the X-ray report generator CXRReportGen from Microsoft. Wherever LLMs exist, joint text-image embedding models like BiomedCLIP also emerge, which could potentially be used for a variety of purposes when connecting text to a specific region in an image is needed.
One model worthy of particular attention is SegmentAnything (SAM) by Meta. The first version, which operates on 2D images, was released in April 2023, and a video-based second version followed in August 2024. Both releases sparked the imagination of the AI community as a potential solution for medical image segmentation, even though the model was primarily developed for general-world objects. However, demos where it is applied to tomography scans often overlook important medical considerations:
- First, simple tasks are chosen — such as whole-organ segmentation — partly because more complex tasks would not be as visually appealing and understandable in a demo.
- Second, even when the model makes significant errors, such as severe undersegmentation on par with a basic thresholding approach, it is still touted as a success (and a non-medical viewer is unlikely to recognize what is wrong).
- Third, even when serious mistakes are acknowledged, it is often stated that “with fine-tuning on a medical dataset, it will improve.” But if someone has a high-quality medical dataset and the expertise to fine-tune SAM, they may be better positioned to train a custom model instead — an approach that would likely be more maintainable, more lightweight, and potentially more accurate for a specific narrow task.
- Fourth, SAM has good potential for interactive segmentation tasks, where a human points at a single object. But the greatest promise of AI — the one that could be considered revolutionary — is the automatic bulk detection and segmentation of all areas of interest within a scan, a task that SAM is unlikely to excel at in any foreseeable future.

Why? Because of the complexities surrounding data.
Dataset problem
SAM, and indeed any machine learning model, will be susceptible to the ultimate challenge: the diversity of datasets and the mutual incompatibility between those it was trained on and those provided for inference.
A person undergoing a lung CT scan will receive different treatment depending on the reason for the scan. There is no single standard for a “lung CT.” If a former smoker is enrolled in routine government-sponsored screening, they will be given a low-radiation-dosage scan, resulting in a blurrier image. However, if they come with a strong suspicion of a malignant tumor, they will be scanned with a higher dosage to achieve better resolution.
Scanner technology develops rapidly, but a CT or MRI machine is a significant, long-term investment, so it is unlikely that every hospital will have the newest model — creating another source of data diversity.
Medicine is partly structured science and partly the art of the practitioner, so what is considered important may differ slightly from radiologist to radiologist and from country to country, depending on established clinical practices and education.
What a radiologist marks on a scan during inspection will also differ depending on the circumstances. Taking lungs as an example, there are often many small abnormalities, and in some cases, they will be marked graphically on the scan, in other cases described individually in the accompanying report, but sometimes just mentioned in bulk. This difference arises from practical considerations about whether more detailed annotation will add value relative to the time invested in doing it. For an ML model developer, however, a complete, exhaustive annotation is essential because all those abnormalities must be enumerated by an ML model and presented to a radiologist for assessment. But how can a model be trained to recognize them if such data is difficult to obtain and not produced by humans in the first place? This requires costly in-house annotation.
Every ML practitioner is familiar with the precision-recall tradeoff, which depends on the problem domain and the severity of the consequences of a decision. At a surface level, it might seem that a model should err on the side of caution, producing more false positives (FPs) for the radiologist rather than false negatives (FNs), which could mean missing a potentially terminal illness. However, FPs must be balanced against the harm caused by unnecessary interventions, such as radiation or biopsies. Additionally, even a physically harmless FP that results in a patient being sent for a repeat examination in a few months can cause significant distress and reduce so-called quality-adjusted life years. Moreover, if a radiologist encounters too many irrelevant findings, they may let their guard down and miss something important that they would not have overlooked had they worked without AI. This risk can also occur if the model is highly accurate but occasionally makes significant errors; however, in such cases, the benefit may be balanced by the increased detection of relevant findings aided by AI. This all means that both precision and recall need to be high and tailored to the specific use case, making high-quality training data absolutely crucial.
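As a toy illustration of picking such an operating point, the sketch below chooses the threshold with the best precision among those that keep recall above an assumed clinical floor; the scores and labels are synthetic stand-ins for a properly annotated validation set.

```python
# A toy illustration of choosing an operating point on the precision-recall curve.
# The labels and scores are synthetic; in practice they would come from a
# validation set of scans with exhaustive annotations.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# 1000 candidate findings, roughly 10% of which are true lesions.
y_true = rng.random(1000) < 0.10
# Model scores: true lesions tend to score higher, but the distributions overlap.
y_score = np.where(y_true, rng.normal(0.7, 0.2, 1000), rng.normal(0.4, 0.2, 1000))

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Keep recall above an assumed clinical floor (0.95 here, an illustrative value):
# missing lesions is costly, but every extra false positive adds reading time
# and potential over-treatment.
min_recall = 0.95
ok = recall[:-1] >= min_recall        # recall has one more element than thresholds
best = np.argmax(precision[:-1] * ok)
print(f"threshold={thresholds[best]:.2f}, "
      f"precision={precision[best]:.2f}, recall={recall[best]:.2f}")
```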
For all the reasons mentioned above, publicly available datasets are insufficient for use in real clinical practice but are adequate for ML methods research, product prototyping, and quick demos. Few Kaggle hackers and computer science graduates working with these datasets are connected to hospitals or the actual practice of medicine. This contributes to the confusion, if not the illusion of a conspiracy: if all the technology is seemingly available and breaking benchmarks monthly, why has medicine not been automated yet?
Post-development
There are a few interesting steps between getting your model done and having it serve doctors and patients in the real world.
First, a clinical study should be conducted to validate whether the product achieves its intended goals. For this, one or more endpoints (outcomes) should be defined — such as speed or certain aspects of the quality of reviewing a CT scan — that are claimed to improve. Then, a study should be designed to robustly prove the desired changes in these endpoints. A study that doesn't involve real patients and their outcomes, but instead focuses on the performance of radiologists, is called a reader study.
Speaking of the scale of clinical studies in radiology, let’s revisit the use case of malignancy scoring for lung cancer. This involves estimating the probability of cancer based on a CT scan, which informs the decision for more invasive procedures. Since there can be significant delays — sometimes up to a year — between initial suspicion and a definitive diagnosis, the only way to determine if an initial assessment was accurate is to track the patient over time and learn their eventual outcome. This makes developing a new and improved approach both challenging and costly. There are established methods accepted in the medical community, such as the rule-based Lung-RADS and regression-based ones like the Mayo Clinic and Brock models. These approaches use lung nodule metadata (e.g., size, shape) for probability estimation but do not use the CT images directly. However, there are ongoing efforts to utilize deep learning to directly output malignancy scores. According to publicly available information, an exhaustive clinical study to validate such an approach can take nearly two years and involve 2,000 patients.
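To show the general shape of such regression-based models (not the real thing), here is a sketch of a logistic model over nodule and patient metadata. The coefficients are made-up placeholders; the validated ones must come from the published Mayo and Brock papers.

```python
# A sketch of the general shape of regression-based malignancy models such as the
# Mayo or Brock models: a logistic function over nodule and patient metadata.
# The coefficients below are made-up placeholders for illustration only; they are
# NOT the published, clinically validated values.
import math

def malignancy_probability(age, nodule_diameter_mm, spiculated, upper_lobe, smoker):
    # Linear combination of risk factors (placeholder weights, not clinically valid).
    x = (
        -6.5
        + 0.04 * age
        + 0.13 * nodule_diameter_mm
        + 1.0 * spiculated
        + 0.8 * upper_lobe
        + 0.8 * smoker
    )
    # Logistic link maps the score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))

# A 65-year-old former smoker with a 9 mm spiculated nodule in an upper lobe.
print(f"{malignancy_probability(65, 9, spiculated=1, upper_lobe=1, smoker=1):.2f}")
```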
Another step is applying for certification for your medical device. For a device to be allowed on the market, both the device itself and the company developing it must be reviewed by regulatory authorities to ensure safety. In the European Union, this certification is known as the CE marking. If you plan to sell your product in other regions, separate certification is also required, such as FDA approval in the United States.
There are several classes of medical devices, ranging from those with minimal impact to those critical for life support or essential for making critical decisions. Software that assists radiologists falls somewhere in the middle, as the decisions it supports are mostly not time-critical, and the ultimate decision-making authority remains with the human.
An interesting concept in this context is that of significant versus non-significant changes to the medical device after it has been certified. A significant change, in the context of an ML model, refers to instances where, for example, a model is redesigned to output additional lesion classes that were not present before and have not been validated through clinical studies. In such cases, full recertification may be required. A non-significant change, on the other hand, involves retraining a model with new data, provided it can be demonstrated that the main performance metrics have not degraded. In this scenario, only a notification to the regulatory authority is needed.
All in all, it is now clear that obtaining high-quality data, developing the model, conducting clinical testing, and navigating regulatory processes require significant time. This field is still relatively young, and the market is fragmented by regions and regulatory bodies, meaning no single entity has decisively claimed dominance. The “move fast, break things” mentality is not suitable for medicine, so time-to-market is extended, and many deep tech companies in this space benefit from public funding, which often has longer planning horizons and greater tolerance for losses compared to venture capital.
Conclusion
It has been less than a decade since the application of AI in radiology became not just a pipe dream but a real possibility. Things always move slowly in medicine, but once this train gets moving, it will be as hard to stop as today's established methods are to replace. It may still take another decade before AI becomes truly and routinely embedded, and while innovation in DL methods by computer science experts is important, it will take a larger concerted effort across disciplines to get there.
Perhaps a generation of experts needs to pass through a full cycle of working on these problems commercially and full-stack, then leave their first companies and disseminate the know-how in the wider world. And from another angle, some of the start-ups that are currently laser-focused on a few narrow use cases will expand and tackle a wider array of problems.
The field is still green and most of its heroes are yet to be made, but the future of AI in radiology certainly looks bright.