Illustration by Chenyu Wang

On the Automatic Generation of Medical Imaging Reports

By Baoyu Jing, Pengtao Xie, and Eric P. Xing

If you haven’t already read our first two AI-for-healthcare posts on Predicting Discharge Medications and Automating ICD Coding, go check them out! In this post, we will discuss how our healthcare-specific machine learning (ML) platform generates reports from medical images using deep learning.

Medical images, such as radiology and pathology images, are widely used in hospitals and clinics for the diagnosis and treatment of many diseases, such as pneumonia, pneumothorax, and heart failure. The reading and interpretation of medical images is usually conducted by specialized medical professionals — for example, radiology images are read by radiologists and pathology images are read by pathologists. These specialists write reports like the one shown in Figure 1 to narrate the findings regarding each area of the body that was examined in the imaging study; specifically, whether each area was found to be normal, abnormal, or potentially abnormal.

Figure 1. In the Impression section, the radiologist provides a diagnosis. The Findings section lists the radiology observations regarding each area of the body examined in the imaging study. The MTI (Medical Text Indexer) Tags section lists the report’s keywords.

For less-experienced radiologists and pathologists, especially those working in rural areas where healthcare quality is relatively low, writing medical imaging reports is challenging and requires an array of skills they might not yet possess. For experienced radiologists and pathologists, especially those working in busy, crowded areas, reading and writing hundreds of imaging reports a day is tedious and time-consuming.

This is what motivated us to investigate whether medical imaging reports can be generated automatically. Our first challenge was that a complete diagnostic report comprises multiple heterogeneous forms of information that are technically difficult to unify into a single framework. As shown in Figure 1, the report for a chest x-ray contains three sections with distinct types of text: a single sentence (Impression), a paragraph (Findings), and a list of keywords (MTI Tags). To address this, we built a multi-task framework (shown in Figure 2), which treats the prediction of lists of words (Tags) as a multi-label classification (MLC) task, and treats the generation of long descriptions (Impression and Findings) as a text generation task.
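To make the multi-label classification half of the framework concrete, here is a minimal sketch in plain Python. It assumes one independent sigmoid score per tag and a binary cross-entropy loss, which is a common MLC formulation and not necessarily the exact loss used in our model. The tag vocabulary, the logits, and the 0.5 threshold are hypothetical values chosen for illustration.

```python
import math


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_tags(logits, vocabulary, threshold=0.5):
    """Return every tag whose independent probability clears the threshold."""
    return [tag for tag, z in zip(vocabulary, logits) if sigmoid(z) >= threshold]


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy across tags (each target is 0 or 1)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)


# Hypothetical tag vocabulary and classifier scores, for illustration only.
TAGS = ["normal", "cardiomegaly", "degenerative change", "opacity"]
logits = [-2.0, 1.5, 0.8, -0.3]

predicted = predict_tags(logits, TAGS)
loss = multilabel_bce(logits, [0, 1, 1, 0])
print(predicted, round(loss, 4))
```

Unlike single-label classification, several tags can be switched on at once here, which matches how one report can carry any number of MTI Tags.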

Figure 2. The proposed framework.

In the framework, we first adopted a Convolutional Neural Network (CNN) to extract the visual features of an x-ray image. These features are then used to generate keywords (MTI Tags) through multi-label classification. Next, we adopted a hierarchical Long Short-Term Memory network (LSTM) to generate the longer-form parts of the medical report (Findings and Impression). Within the hierarchical LSTM, we used a co-attention module to localize the abnormal areas and focus on specific keywords, which guides the sentence-LSTM and word-LSTM to generate a more precise diagnostic report.
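The co-attention step can be sketched as two attention passes that share the same decoder state: one over image-region features and one over tag embeddings. The sketch below is a simplification under stated assumptions. It scores each feature with a plain dot product against the hidden state and joins the two context vectors by concatenation, whereas the actual model may use learned scoring and fusion layers. All names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(features, hidden):
    """Weight each feature vector by its similarity to the hidden state."""
    weights = softmax([dot(f, hidden) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, context


def co_attention(visual_features, tag_embeddings, hidden):
    """Attend over image regions and tag embeddings, then join both contexts."""
    v_weights, v_context = attend(visual_features, hidden)
    s_weights, s_context = attend(tag_embeddings, hidden)
    return v_weights, s_weights, v_context + s_context  # list concatenation


# Toy inputs: three image regions, two tag embeddings, one decoder state.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tags = [[0.9, 0.1], [0.1, 0.9]]
hidden = [2.0, 0.0]

v_w, s_w, ctx = co_attention(visual, tags, hidden)
print(v_w, s_w, ctx)
```

In a full decoder, the joint context would be fed to the sentence-LSTM at every step, so each generated sentence can look at a different region and a different tag.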

We tested this model on a public x-ray dataset from the Indiana University Chest X-Ray Collection (IU X-Ray) that contains 7,470 pairs of images and reports. We first compared the full model (Ours-CoAttention) with several state-of-the-art image captioning models (CNN-RNN, LRCN, Soft ATT, and ATT-RK) using standard image captioning evaluation metrics: BLEU, METEOR, ROUGE, and CIDEr. The results in Table 1 show that our proposed model significantly outperformed the state-of-the-art models. Table 1 also shows that our full model outperformed Ours-no-Attention (our full model without the co-attention module), which indicates the effectiveness of the co-attention module.

Table 1. Main results for report generation. BLEU-n denotes the BLEU score that uses up to n-grams.
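For readers unfamiliar with BLEU-n, the following is a minimal sketch of the textbook single-reference, sentence-level definition: clipped n-gram precisions for n = 1 up to the chosen maximum, combined by a geometric mean and scaled by a brevity penalty. It applies no smoothing and is not the evaluation script behind Table 1. The example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it appears in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu_n(candidate, reference, max_n):
    """BLEU using 1-grams up to max_n-grams with uniform weights."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing: any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity * math.exp(log_mean)


# Made-up sentences, for illustration only.
reference = "the heart size is normal".split()
candidate = "the heart is normal".split()

bleu1 = bleu_n(candidate, reference, 1)
bleu2 = bleu_n(candidate, reference, 2)
print(bleu1, bleu2)
```

Because higher-order n-grams are harder to match, BLEU-n can only stay level or fall as n grows for a given candidate, which is why reported scores typically shrink from BLEU-1 to BLEU-4.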

As shown in Figure 3, our full Ours-CoAttention model correctly described many of the real abnormalities in the top three images, while the Soft Attention and Ours-no-Attention models detected only a few abnormalities, and the ones they did detect were incorrect.

Figure 3. Illustration of paragraphs generated by the Ours-CoAttention, Ours-no-Attention, and Soft Attention models. The underlined sentences are the descriptions of detected abnormalities. The second image is a lateral x-ray image. The top two images are positive results, the third is a partial failure case, and the bottom one is a failure case.

For the third image, the Ours-CoAttention model successfully detected the abnormal area (“right lower lobe”); however, it failed to precisely describe this abnormality as “eventration”. The model also reported “interstitial opacities” and “atherosclerotic calcification”, which human experts do not actually consider abnormalities in this image. One possible reason for these misdescriptions is that this x-ray image is darker than the ones above it, and our model may be very sensitive to this change.

The image at the bottom shows a failure case of the Ours-CoAttention model. However, even though the model made the wrong judgment about the major abnormalities in the image, it did find some unusual regions: “lateral lucency” and “left lower lobe”. Additionally, it is surprising to find that the model tried to reason about the findings by using “this may indicate”.

We can also observe that in both the generated paragraphs and the ground truth paragraphs, there are more sentences describing normal areas than abnormal areas. This could account for why the Ours-no-Attention model achieved relatively high scores even though it didn’t detect the correct abnormalities — it can simply generate paragraphs made up of descriptions of normal areas to obtain a higher score in the evaluation systems.

Figure 4. Visualization of co-attention for three examples. Each example consists of four lines: (1) image and visual attentions; (2) ground truth tags and semantic attention on predicted tags; (3) the generated descriptions; (4) ground truth descriptions. For the semantic attention, the three tags with the highest attention scores are highlighted. The underlined tags are the ground-truth tags.

Figure 4 presents a visualization of how our co-attention model works for predicting the correct Tags for a given image. Sentence-LSTM can generate different topics at different time steps since the model focuses on different regions of images and different tags for different sentences. Visual attention can guide our model to concentrate on relevant regions of the image. For example, the third sentence of the first example is about “cardio”, and the visual attention concentrates on regions near the heart. Similar behavior can also be found for semantic attention; for the last sentence in the first example, our model correctly concentrates on “degenerative change,” which is the topic of the sentence. Finally, the first sentence of the last example presents a misdescription caused by incorrect semantic attention over tags. We believe incorrect attention like this can be reduced by building a better tag prediction module.

Our work demonstrates that applying deep learning methods to automatically generate diagnostic medical reports is very promising, though there is definitely room for improvement. Our team will continue to design more sophisticated deep learning models to generate more precise diagnostic reports, and we’re excited to share updates on this work as we make progress.

For the details, here’s our paper: