A Few Thoughts About CheXNet — And The Way Human Performance Should (And Should Not) Be Measured

The recent release of the CheXNet study on arXiv caused a significant uproar in both the medical and the AI community, perhaps in no small part because of the way it was presented. In brief, using a convolutional neural network-based approach, the authors claim supra-human accuracy in detecting pneumonia on frontal chest x-rays when compared to four practicing radiologists, including a thoracic specialist. The machine learning approach, its strengths, and its potential pitfalls have been extensively discussed by those with deep domain-specific knowledge, so, being by and large a layman in that regard, I will skip that part. Others have also raised concerns about some aspects of the original NIH chest x-ray dataset, which includes highly ambiguous, if not completely overlapping, terminology such as infiltration, consolidation, and of course pneumonia.

Below I have collected some of my thoughts and concerns regarding this study and the way some medical image analysis researchers measure human performance, and I call for a standardized framework within which human diagnostic sensitivity and specificity should be tested.

Is “soft” ground truth really the ground truth?

In the ChestX-ray8 dataset, which was used for training the algorithm, pathology labels were extracted automatically from the radiology reports by text mining. This means that the ground truth is, in essence, an individual radiologist’s judgment, in some cases further degraded by the inherent inaccuracies of automated data mining. Since x-ray signs of substantially different pathologies can be ambiguous, if not outright similar, we can safely assume that in many cases the subjective ground truth of the original dataset was arguable at best. Hence I would call this soft ground truth, in contrast with bona fide ground truth established by, for example, verifying the pathology with superior imaging methods.
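To make the problem tangible, here is a minimal, purely illustrative Python sketch of how naive keyword matching can turn a nuanced report into a noisy label. This is not the actual NIH labelling pipeline; the report text and the label list are made up for the example.

LABELS = ["pneumonia", "infiltration", "consolidation"]

def extract_labels(report):
    # Return every pathology term that appears anywhere in the report,
    # ignoring negation and uncertainty entirely.
    text = report.lower()
    return {label for label in LABELS if label in text}

report = ("Patchy opacity in the right lower lobe. "
          "No convincing evidence of pneumonia; consolidation cannot be excluded.")

print(extract_labels(report))
# Returns both 'pneumonia' and 'consolidation': the explicitly negated
# pneumonia still ends up as a positive label.

A real labelling pipeline is of course more sophisticated than this, but negation, uncertainty, and overlapping terminology remain the hard part, and residual errors of this kind are what make the resulting labels “soft”.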

The devil is in the details (of the medical record)

The radiologists involved in the study had to do their best without any access to patient information. This is a highly unusual, in fact entirely non-existent, scenario in clinical practice. Since pneumonia itself is a great mimicker of numerous pathologies, in the daily routine we usually receive targeted questions about the presence of certain pathologies on the request form and have access to all clinical information, including lab results. This provides crucial help in narrowing down the diagnostic possibilities and producing an actually useful report (instead of vague hedging).

The power of priors

Prior examinations are invaluable in the daily practice of radiology, or, as the common proverb goes, comparison studies are the radiologist’s best friend. Especially with something as simple, cheap, and commonplace as chest x-rays, old exams are immediately available in a very large percentage of cases to aid interpretation and increase diagnostic confidence. However, the participants of the CheXNet study had to interpret all images with no priors provided. This is in sharp contrast with clinical practice, and therefore the study rather measures the low end of human performance.

Each armed with their own tools

This part is a bit speculative, as the article does not disclose how human diagnostic performance was measured, which is quite a common problem with similar studies. A radiology reading room is a unique environment designed to maximize our sensitivity, with appropriate lighting, dedicated diagnostic monitors, and other task-specific tools, not to mention the PACS system, where many invaluable tools such as windowing, greyscale invert, and zooming are readily available. My point is that if you compare radiologists with an algorithmic solution, you have to test them in their daily working environment. There are already published examples (see below) of the opposite, where human readers had to interpret images that had been distorted, downsized, and cropped to meet the needs of the algorithm.

A published example where human performance was measured in an unorthodox setup
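For readers unfamiliar with the viewer tools mentioned above, here is a small, illustrative Python/NumPy sketch of two of them, window/level adjustment and greyscale inversion. The window settings and the random 8x8 array are arbitrary stand-ins, not clinical presets or a real x-ray image.

import numpy as np

def window_level(image, center, width):
    # Map raw pixel values to display greyscale using a window/level setting.
    low, high = center - width / 2, center + width / 2
    clipped = np.clip(image, low, high)
    return (clipped - low) / (high - low)  # scaled to [0, 1] for display

def invert(display):
    # Greyscale invert: bright structures become dark and vice versa.
    return 1.0 - display

raw = np.random.randint(0, 4096, size=(8, 8))  # stand-in for a 12-bit image
shown = invert(window_level(raw, center=2048, width=1024))

Tools like these let a radiologist pull subtle findings out of the raw pixel data; taking them away while leaving the algorithm’s preprocessing intact is part of what makes such comparisons lopsided.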

A frontal chest x-ray is only the beginning

Already around the time the NIH dataset was released, I expressed concerns about the fact that only frontal views were included. Of course, in many cases we use these as standalone tools, but very often we rely on a very simple yet immensely useful ancillary technique: the lateral view. The lateral chest x-ray alone is rarely diagnostic, but in conjunction with the frontal view its effect is supra-additive, helping clarify not only the localization but also the etiology of abnormalities first identified on the frontal radiograph.

Summary

It has to be noted that the CheXNet paper acknowledges the lack of access to patient records, prior examinations, and lateral views as a clear limitation of the study. However, this important detail is buried deep in the article and will never receive as much attention as the “AI beats doctors” headlines it generated.

In my opinion, this and many other papers highlight a general problem: performing human vs. machine comparisons in an unrealistic, artificial environment, with little if any effort put into mimicking the actual workflow. The real risk is that while sensational news is easily disseminated through the mainstream media, limitations and shortcomings that are painfully apparent to professionals will never receive any of the limelight. CheXNet holds great promise, but it is nowhere near ready to be deployed “in the wild”. Was that, though, the impression of laypeople reading those headlines? I will leave that to the reader to decide.

PS: In this earlier post (motivated by the infamous coyote comparison by Geoffrey Hinton) I explored some aspects of the impact of AI on radiology.