Dear Mythical Editor: Radiologist-level pneumonia detection in CheXNet? (SEE UPDATE at top)


  • UPDATE (1/24/18): The paper has been revised (version 3), and the revision addresses the potential issue with majority voting described below. The team now uses an F1 score (a balance between precision and recall) and a bootstrap technique to establish confidence intervals. This is a nice way to establish reasonable confidence intervals with a smaller sample size, and they were able to show that the machine had a higher F1 score than the average radiologist.
  • The team has also indicated that there were at least 50 cases of pneumonia in the test dataset, which gets around an issue I noticed in the original blog below. We don’t know the exact number, but greater than 50 makes me feel more comfortable that there are enough cases to get a sense of precision (same as positive predictive value) and recall (same as sensitivity). I actually like that there were more “no pneumonia” cases than “pneumonia” cases in the test dataset because in the real world, having a low false positive rate is important.
  • While there are still issues with the underlying training data, and likely issues with radiologists labeling infiltrate, consolidation, and pneumonia, the CNN was asked to do the same task and still outperformed the radiologists.
  • Luke and I both believe the results of the team, and Luke has spent time discussing directly with the team about those results. I would highly recommend reading his new blog if you have not done so.
  • I thought Dr. Raym Geis (former chair of Society of Imaging Informatics in Medicine, and now part of the American College of Radiology Data Science Institute) brought up a really good point: part of our decision-making process is to determine whether an action should be taken. From my own experience, I am on the phone for a large part of the day, discussing findings with referring clinicians, and there is much that goes into that conversation that never makes the radiology report.
  • For those who attended the discussion, I thought it was really cool that the majority of the Stanford team was there, and they did a great job presenting their results in a clear, succinct fashion. It’s great to see a team consisting of machine learning experts, radiologists and informaticists all striving toward the same goal of improving health care using AI; this is a great model for how research should be done.
  • I also liked how the Stanford team discussed the issue with lack of access to radiologists in the developing world, as this is a nice use case for AI, and one of the impetuses behind their work.
  • Jeremy Howard made some really awesome points about how to take deep learning to the next level, and strategies to further improve these results. He is a really engaging speaker, and for us deep learning nerds, we love it! I really need to take his course now; I’ve heard amazing things about it from so many people.
  • Lastly, we all really need to thank Dr. Judy Gichoya for her tremendous effort in organizing and moderating this discussion. She is a rising superstar in imaging informatics and someone to keep an eye on as we move forward.
  • Keep in mind the original blog below is designed so one could review papers in a critical fashion, but it is also important to acknowledge the authors’ hard work and excellent results. One thing that I realize with this open concept of peer review is that it may dissuade people from doing this type of research, and we want to do the opposite. Despite what I wrote below, I was quite impressed with the results of the team (even more so knowing that the data labels are problematic) and really happy that they are working on problems such as this. I think Luke did a great job of balancing both in his review of the paper.
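
For readers who want to see what an F1 score with a bootstrap confidence interval looks like in practice, here is a minimal Python sketch. It is not the Stanford team’s code. The percentile method, the function names `f1_score` and `bootstrap_f1_ci`, the 2,000 resamples, and the synthetic 420-case dataset with 60 positives are all my own assumptions, chosen only for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample cases with replacement, recompute F1."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    low = scores[int((alpha / 2) * n_resamples)]
    high = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic example (NOT the CheXNet test set): 420 cases, 60 with pneumonia.
y_true = [1] * 60 + [0] * 360
# Hypothetical reader: finds 40 of the 60 positives and raises 30 false alarms.
y_pred = [1] * 40 + [0] * 20 + [1] * 30 + [0] * 330

point = f1_score(y_true, y_pred)
low, high = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {point:.3f}, 95% bootstrap CI = ({low:.3f}, {high:.3f})")
```

The width of the interval depends heavily on how many positive cases are in the sample, which is why knowing that there were at least 50 pneumonia cases matters.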


Original blog (based on Version 1 of the CheXNet paper)

I find it fascinating that AI has catapulted research into the mainstream news. The CheXNet deep learning algorithm recently published by the Stanford machine learning group propelled its way through multiple media outlets, including coverage in the Wall Street Journal. The following is a screenshot from another news source.

One could dissect a number of publications in a similar manner; in fact, that is why journal review is part of every residency program and is a hallmark of medical education and residency training. Keep in mind, the fact that machine learning (ML) has the potential to do this is pretty amazing, and the best of ML has yet to come. That being said, the following is what I would write to an editor, if I were asked to do peer-review on this paper:

Dear Mythical Editor of arXiv:

Drs. Rajpurkar and Irvin et al. use a state-of-the-art deep convolutional neural network (CNN), a 121-layer DenseNet, to classify 14 different diseases on a public chest X-ray dataset of over 100,000 frontal radiographs.

The areas under the curve (AUCs) of their general chest X-ray deep learning algorithm (CheXNet) outperform those from two other machine learning groups. Moreover, the separately trained pneumonia deep learning algorithm achieves similar performance to four practicing radiologists.

The following are weaknesses and strengths of the study:

Weaknesses:


  • What’s the radiologic difference between consolidation, infiltration, and pneumonia? There is overlap among all three. In your test dataset, radiologists were asked to label these categories, when consolidation and infiltration are the most common manifestations of pneumonia. Some radiologists even think of them entirely synonymously.
  • It looks like the NIH readers used them interchangeably as well:
Same patient with multiple images and different interpretations over time
Same patient imaged at different times, with corresponding NIH labels. I personally think it is cardiomegaly and pulmonary edema, but what do I know?
  • A “majority voting” scheme is used for the ground truth on the pneumonia test dataset. In this scheme, “We evaluate the performance of an individual radiologist by using the majority vote of the other three radiologists as ground truth.” Let’s imagine these hypothetical labels for the four radiologists for one case:
  • In this case, Rad 1 is wrong, because the majority is “consolidation” among 2/3/4. Rad 2 is wrong, because “infiltration” is the majority among 1/3/4. Rad 3 is wrong, because “consolidation” is the majority among 1/2/4. Rad 4 is wrong because pneumonia or infiltration are the majority among 1/2/3. Yet, consolidation, infiltration, and pneumonia are in the same league — essentially they point to pneumonia. With that in mind, Rad 1–4 are really in agreement in the case above, but for scoring purposes in this study, every single radiologist is incorrect.
  • Radiologists were asked to label all “14 pathologies” in the test dataset, but for the pneumonia detection task, the CNN was asked to make a binary decision: pneumonia present or absent. This is not a fair comparison, and if radiologists were asked to do a binary label, their sensitivity and specificity would be higher. The reason is that, for the pneumonia task, the ML group trained a separate binary classifier whose output is simply a two-dimensional vector (pneumonia present or absent) rather than a 14-dimensional vector.
  • Discrepancies in the pneumonia data. Per the original ChestX-ray8 paper by Wang et al., the figure denotes 2,042 labels with pneumonia; per the table in the same paper, there are 1,353 cases of pneumonia. However, I downloaded the NIH dataset, and tabulated 1,431 pneumonia cases from the CSV file. Which one is it?
Discrepancies in the number of pneumonia cases. Which one is it?
  • Let’s assume the CSV file is correct and there are 1,431 X-rays with the pneumonia label out of 112,120 images, which is only about 1.28% of the entire dataset! The test set contained 420 images, and we do not know the exact number of test cases that had pneumonia. However, if the authors’ statement that the training, validation, and test data were split randomly holds, then 1.28% * 420 test cases ≈ 5 pneumonia cases (+/- 2) in the test set! I doubt they did this (5 positive and 415 negative pneumonia cases is nonsense for a test set), so my deduction is that the ML group did not randomly split the cases into training, validation, and test (for the pneumonia classifier), but rather did some manual splitting of cases to ensure there were enough pneumonia cases in the test set. This seems like a red flag. It is also corroborated by the numbers in the figure below, where the ratio of images to patients is roughly 3.5:1 for the training and validation data, but close to 1:1 for the test data.
Taken from CheXNet. 420 images were “randomly” obtained but it seems like they were not. Note the ratio of images to patients is roughly 3.5:1 for training and validation, but closer to 1:1 for the test data.
  • To properly assess sensitivity and specificity, one should have a similar number of cases with and without pneumonia (close to 50:50 split) in the test dataset. Even 35:65 may be okay. However, we do not know this number, and what if it is 25:75? A classifier that flags everything as normal would be correct 75% of the time.
  • I’m curious why the radiologists’ performance was not compared on other pathologies, such as pneumothorax or effusion, since you had them label images that way. Do we have a problem with multiple testing and picking the comparisons that show a certain result?
  • Down-sampled, 8-bit grayscale, 1-megapixel radiographs were shown to the radiologists, whereas typical viewing conditions use higher bit-depth 16-bit grayscale images (> 4 megapixels), which can be window-leveled appropriately on a DICOM-calibrated medical workstation.
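
To make the majority-voting concern above concrete, here is a small Python sketch of leave-one-out scoring for a single case. The four reads are hypothetical labels I made up for illustration (they are not the table from this post or data from the paper), and the `PNEUMONIA_LIKE` grouping is my own assumption about which findings overlap.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def leave_one_out_scores(reads):
    """Score each reader against the majority vote of the other readers."""
    results = {}
    for reader, label in reads.items():
        others = [l for r, l in reads.items() if r != reader]
        results[reader] = (label == majority_label(others))
    return results


# Hypothetical reads for ONE case (illustrative only).
reads = {
    "Rad 1": "consolidation",
    "Rad 2": "infiltration",
    "Rad 3": "consolidation",
    "Rad 4": "infiltration",
}

# Assumption: treat the three overlapping findings as a single group.
PNEUMONIA_LIKE = {"consolidation", "infiltration", "pneumonia"}
collapsed = {
    reader: ("pneumonia-like" if label in PNEUMONIA_LIKE else label)
    for reader, label in reads.items()
}

print(leave_one_out_scores(reads))      # every reader is scored False
print(leave_one_out_scores(collapsed))  # every reader is scored True
```

With the fine-grained labels, every reader disagrees with the majority of the other three, so all four are marked wrong. Once the overlapping labels are collapsed, the same reads are in complete agreement.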
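
The class-balance arithmetic in the bullets above can be checked in a few lines of Python. The counts come from this post; treating each test image as an independent draw (so the number of positives is roughly binomial) is my simplifying assumption, and `always_negative_metrics` is a hypothetical helper used only to illustrate the 25:75 scenario.

```python
import math

n_pneumonia = 1431   # pneumonia labels tabulated from the NIH CSV (from the post)
n_images = 112_120   # total images in the dataset (from the post)
n_test = 420         # size of the test set (from the post)

prevalence = n_pneumonia / n_images
expected = n_test * prevalence
# Assumption: independent draws, so the positive count is ~binomial.
std_dev = math.sqrt(n_test * prevalence * (1 - prevalence))

print(f"prevalence = {prevalence:.2%}")
print(f"expected positives in a random test split = {expected:.1f} +/- {std_dev:.1f}")


def always_negative_metrics(n_pos, n_neg):
    """Accuracy and sensitivity of a classifier that calls every case normal."""
    accuracy = n_neg / (n_pos + n_neg)
    sensitivity = 0.0  # it never detects a positive case
    return accuracy, sensitivity


# Hypothetical 25:75 split of the 420 test cases.
accuracy, sensitivity = always_negative_metrics(105, 315)
print(f"always-normal classifier: accuracy = {accuracy:.0%}, sensitivity = {sensitivity:.0%}")
```

A truly random split would be expected to yield only about five pneumonia cases, and a test set that is heavily skewed toward normals lets a useless classifier look accurate, which is why accuracy alone is not enough here.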


Strengths:

  • Use of a DenseNet
  • Very well written article, nice figures, nice discussion
  • The AUCs were better than work by Wang et al. and Yao et al., experienced research groups, indicating strong practical knowledge and solid execution of ML by this team.
  • Chest radiography interpretation is known to be muddled, which speaks well to the AUCs they were able to achieve from this dataset.
  • Simple clean solution that could further improve with other annotated datasets.


As with any publication, there are always strengths and weaknesses. The quality of the deep learning techniques appears solid. However, there are questions about the study, and it is hard to draw decisive conclusions regarding radiologist-level accuracy of pneumonia detection with CheXNet.
