Evaluating chest x-rays using AI in your browser? — testing Chester
Just imagine: what if you could upload a bitmap image of a chest x-ray into a web-based tool that would spit out relevant diagnoses in seconds? Chester, the AI radiology assistant, aims to do something like that. You grab a bitmap image of a frontal chest radiograph, upload it, and the system quickly assesses the likelihood of 14 distinct categories of pathology, including masses, pneumonia, and heart enlargement, among others. A video overview of the system has been provided here, while the algorithm itself is described in an arXiv paper. According to some related news articles, "this free AI reads X-rays as well as doctors," though to the best of my knowledge this claim has not been independently tested by anyone. Chester was trained on the ChestX-ray8 dataset released by the NIH in 2017, which has been extensively criticized for, among many other things, inaccurate image labels. As we will see, this likely had profound consequences for the way Chester interprets images.
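The way Chester reports its findings, with several diagnoses ranking high on the same image, is characteristic of multi-label classification: each of the 14 categories gets its own independent sigmoid probability rather than competing in a single softmax. A minimal sketch of that output stage, with entirely made-up logits standing in for what a DenseNet-style classifier might produce:

```python
import math

# The 14 pathology labels of the ChestX-ray8 dataset, which Chester reports on.
CATEGORIES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]

def sigmoid(x: float) -> float:
    """Squash a raw score into a 0..1 likelihood."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(logits: list[float]) -> dict[str, float]:
    """Multi-label prediction: every category gets an independent
    probability, so several diagnoses can score high at once."""
    return {name: sigmoid(z) for name, z in zip(CATEGORIES, logits)}

# Hypothetical logits for a "cardiomegaly-looking" radiograph.
scores = predict([-2.0, 1.5, -0.5, -1.0, -3.0, -3.0, -2.5,
                  -4.0, -1.5, -2.0, -3.5, -3.0, -2.5, -5.0])
```

Because the probabilities are independent, nothing forces them to sum to one, which is also why a confused model can "shotgun" many labels on a single diffuse abnormality, as we will see below.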
In this post I tested the performance of the system against the most typical Radiopaedia examples of the fourteen distinct abnormalities Chester is claimed to detect. The good thing about Radiopaedia is that, unlike most radiology training datasets used in deep learning research, its database comes from a myriad of different institutions around the globe, while the diagnoses have been curated and verified by professional editors. Let's see how Chester fares against these cases:
Atelectasis is a rather broad category, but it can hardly be more obvious than when a whole lobe (in this case the right upper one) is collapsed. As we can see, the algorithm assigns a minimal likelihood to the actual condition, while potential mimickers (mass, consolidation) are given significantly greater consideration. (To be fair, this is one condition where lateral views can be of great use, but Chester cannot yet handle those.)
I decided to give it a second chance using an x-ray of the more commonly encountered subsegmental atelectasis. Interestingly, Chester picked this one up, though it falsely assumed a high risk of pneumothorax as well.
For this one I selected a classic example of significant heart enlargement, with no other major pathology. The outcome was much more favorable: Chester confidently identified the anomaly.
A textbook case of significant left-sided pleural fluid collection was used for demonstration, and Chester once again made a correct evaluation.
Infiltration, consolidation, pneumonia
Treating infiltration, consolidation, and pneumonia as distinct categories feels a bit awkward, as the first two are nonspecific (and largely synonymous) descriptors, while the latter is an actual disease. This categorization was unfortunately inherited from the NLP-processed training dataset. First I wanted to make this reasonably difficult and selected one of my own cases. This time Chester gave an unconvincing result, highlighting an area as suspicious which in my opinion contains no abnormality.
After this fiasco I decided to give Chester a second chance, this time using a "barn door obvious" case. All three relevant categories ranked rather high, showing that at least massive pneumonia can be picked up by the algorithm.
The first case is a rather straightforward one, showing numerous metastatic nodules; the evaluation was confident and correct.
I decided to raise the bar a bit, providing a second, more subtle but still rather obvious example. Much to my surprise, Chester made a correct diagnosis only serendipitously: the region it flagged as suspicious contains no abnormality, while the actual lesion went undetected.
First I selected a rather simple but not blatantly easy example and ended up with unimpressive results: although the heatmap suggests that the algorithm found the right region suspicious, it did not actually detect the condition.
I decided to give Chester a second chance with a case less likely to be affected by the downgraded image quality, which yielded rather concerning results. The algorithm not only failed to recognize the condition, but mistook the collapsed left lung for a mass. This is not something an algorithm like this should miss, even if it serves only as an early-warning "red dot" system or for worklist prioritization.
I have to mention that pulmonary edema is not necessarily an easy diagnosis, and certainly one where I heavily rely on clinical information as well. Thus, a rather striking example of pulmonary edema was used. Chester correctly recognized that this is not exactly a negative exam, but facing the diffuse abnormality it started "shotgunning" diagnoses, with some labels correct, others understandable (e.g. effusion, consolidation), and some entirely erroneous (mass, nodule).
Emphysema is one of those categories where we see a great deal of difference between reporting physicians, so to make it fair a severe case of panlobular emphysema was used, which Chester correctly and confidently categorized.
As with emphysema, the chest x-ray is not a sensitive tool for the early diagnosis of fibrosis, and this is yet another area where interobserver variability can be higher. Thus, I again selected a straightforward example of advanced fibrosis, which Chester correctly labelled. Once again, however, the confusing number of other labels given high likelihood somewhat diminishes the value of this result, though I have to mention that this is again not an easy judgment without appropriate clinical information.
This is a bit of a tricky category, as admittedly pleural thickening cannot necessarily be differentiated from other conditions, including circumscribed effusion and masses, solely on the basis of a chest x-ray. Therefore I selected an example of isolated apical pleural capping, which is rather straightforward and encountered on a daily basis. Though the algorithm did not spot the condition, it has to be credited that it did not assume a high likelihood of any other condition either. From the perspective of clinical management this is an agreeable evaluation.
Hernias are not really the strong suit of this exam, but no hernia is a more obvious and common finding on the chest x-ray than the hiatal one, which Chester correctly identified.
I have to admit, running an algorithm in my web browser that readily (although, as we saw, often erroneously) evaluates radiographs is an enthralling and somewhat worrisome experience. The errors Chester made have their reasons: good performance can hardly be expected when the training dataset itself used vague, opaque, and often overlapping image labels. The downgraded resolution of the images also certainly affects the chance of detecting subtle abnormalities. And regardless of who makes the judgment, the chest x-ray remains a shadowgram plagued by manifold limitations. Indeed, the algorithm made its worst evaluations on disease categories that have diverse imaging appearances and suffer from higher interobserver variability: you are only as good as your teachers.
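To put the resolution point in perspective, some rough arithmetic: the exact sizes are assumptions on my part, but a typical digital radiograph is on the order of 2048 by 2048 pixels, while image classifiers in this family commonly downscale their input to something like 224 by 224 before analysis.

```python
# Assumed sizes: ~2048x2048 for a native digital radiograph,
# 224x224 for a typical CNN classifier input.
native = 2048 * 2048        # 4,194,304 pixels
downscaled = 224 * 224      # 50,176 pixels
retained = downscaled / native  # fraction of pixels surviving the downscale
# Only about 1.2% of the original pixels remain.
```

Under these assumptions, roughly 99% of the pixel information is discarded before the model ever sees the image, which goes a long way toward explaining the misses on subtle findings.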
As of today, Chester cannot yet replace even the fourth-year medical student peering over my shoulder, and I definitely would not use it for teaching, nor will I make a habit of asking it for a second opinion just yet, but its possibilities for improvement (unlike mine) are seemingly endless.