Hybrid Retrieval-Generation Reinforced Agent for Medical Image Report Generation, NIPS 2018

By Christy Y. Li, Xiaodan Liang, Zhiting Hu, and Eric P. Xing

The goal of most traditional visual captioning tasks is to produce single sentences that accurately describe visual content such as images or videos. But recent research has moved beyond sentences and attempted the difficult task of generating long and topic-coherent reports to describe visual content. This is much more difficult for several reasons:

  1. The generated report is a long narrative consisting of multiple sentences or paragraphs, all of which must be logical and on-topic;
  2. There is presumed content substance and specific terminology depending on the task at hand. For example, a sports game report should describe the competing teams, the allocation of points, and outstanding players;
  3. The order of the content is important — a sports game report should talk about the results of the game before describing the teams and players in detail.

Medical image report generation is one of the most representative and practical report generation tasks, but it must satisfy additional critical protocols. As shown in Figure 1, a medical report consists of a “Findings” section describing medical observations and detailing both normal and abnormal features, an “Impression” sentence indicating the most prominent medical observation or conclusion, and “Comparison” and “Indication” sections that list the patient’s information.

Among these sections, “Findings” is likely the most important component and should cover various aspects such as heart size, lung opacity, and bone structure; any abnormalities that appear in the lungs, aorta, and hila; and potential diseases such as effusion, pneumothorax, and consolidation. The “Findings” section also usually follows a presumptive order, e.g., first heart size, then mediastinum contour followed by lung opacity, then remarkable abnormalities followed by mild or potential abnormalities.

Figure 1. An example of medical image report generation. The middle column is a report written by radiologists for the chest x-ray image in the left column. The right column contains three reports generated by a retrieval-based system (R), a generation-based model (G), and our proposed model (HRGR-Agent), respectively. The retrieval-based model correctly detects effusion while the generative model fails to do so. Our HRGR-Agent detects effusion and also describes supporting evidence.

Because medical reports are usually dominated by normal findings and, therefore, are usually described with the same set of predictable sentences, a retrieval-based system (e.g., performing classification directly among a list of template sentences given image features) can perform surprisingly well due to the low variance in language. For instance, in Figure 1, a retrieval-based system correctly detects effusion from a chest x-ray image, while a generative model that generates word-by-word given image features fails to detect effusion.

However, it is much more important for the model to accurately describe abnormal findings, which are relatively rare and remarkably diverse. Current text generation approaches often fail to capture the diversity of this small portion of descriptions, and pure generation pipelines are biased towards sentences that look natural according to the language model but lack visual grounding. A desirable medical report must not only describe normal and abnormal findings but also support them with visual evidence, such as the location and attributes of the detected findings in the image.

For additional context, check out this blog post from February about our team’s first attempt at automatically generating medical imaging reports.

Hybrid Retrieval-Generation Reinforced Agent

Inspired by the fact that, when writing these reports, radiologists often follow certain patterns and reuse templates with modifications for each individual case, we propose a Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent) — the first attempt to incorporate human prior knowledge with learning-based generation for medical reports. HRGR-Agent employs a retrieval policy module to decide between automatically generating sentences with a generation module and retrieving specific sentences from the template database. It then sequentially generates multiple sentences via hierarchical decision-making.

The template database is built from human prior knowledge collected from available medical reports. To enable effective and robust report generation, we jointly train the retrieval policy module and generation module via reinforcement learning guided by sentence-level and word-level rewards, respectively. Figure 1 shows an example report generated by HRGR-Agent that correctly describes “a small effusion” from the chest x-ray image and successfully supports its finding by providing the appearance (“blunting”) and location (“costophrenic sulcus”) of the evidence.

By bridging rule-based (retrieval) and learning-based generation via reinforcement learning, we are able to achieve plausible, correct, and diverse medical report generation.

Our Approach

The goal of medical image report generation is to generate a report consisting of a sequence of sentences given a set of medical images relating to a patient case. Each sentence is a sequence of words drawn from the vocabulary of all output tokens. To generate long and topic-coherent reports, we formulate the decoding process in a hierarchical framework that first produces a sequence of hidden sentence topics and then predicts the words of each sentence conditioned on its topic.

We first compile an off-the-shelf template database T that consists of a set of sentences that occur frequently in the training corpus. Such sentences typically describe general observations and are often inserted into medical reports, e.g., “the heart size is normal” and “there is no pleural effusion or pneumothorax” (see Table 1 for more examples).

Table 1. An example of a template database from the IU X-Ray dataset. Each template is constructed by a group of sentences of the same meaning but with slight linguistic variations. The top three most frequently used template sentences are displayed in the first column and the second column shows the document frequency (as a percentage of the training corpus) of each template.

Image Encoder. Given a set of images, we first extract their features with a pre-trained CNN and then take the average features of all images to obtain v. The image encoder converts v into a context vector that is used as the visual input for all subsequent modules. Specifically, the image encoder is parameterized as a fully-connected layer and the visual features are extracted from the last convolution layer of a DenseNet or VGG-19.
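
The averaging and projection steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the feature values, the 4-dimensional feature size, and the weights `W` and `b` are hypothetical, and the pre-trained CNN is assumed to have already produced one feature vector per image.

```python
def average_features(per_image_features):
    """Element-wise mean of the CNN feature vectors of all images in a case."""
    n = len(per_image_features)
    dim = len(per_image_features[0])
    return [sum(f[d] for f in per_image_features) / n for d in range(dim)]


def fully_connected(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


# Hypothetical CNN features for a frontal and a lateral view (4-dim for brevity).
frontal = [0.2, 0.4, 0.0, 1.0]
lateral = [0.6, 0.0, 0.2, 1.0]
v = average_features([frontal, lateral])

# Hypothetical encoder parameters mapping 4-dim features to a 2-dim context vector.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0]]
b = [0.0, 0.5]
context = fully_connected(v, W, b)
print(v, context)
```

In a real system the feature vectors would be far larger and the layer would be trained jointly with the rest of the model; the sketch only shows the data flow from per-image features to a single context vector.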

Sentence Decoder. Next, a sentence decoder recurrently generates a sequence of hidden states that represent sentence topics. It comprises stacked RNN layers equipped with an attention mechanism to enhance text generation. At each step, the decoder first computes an attentive context vector from the image context vector and the previous hidden state, and then generates a new hidden state. This hidden state is projected through two separate non-linear functions into a topic state qi and a stop control probability.

Retrieval Policy Module. Given each topic state qi, the retrieval policy module first predicts a probability distribution over two kinds of action: generating a new sentence and retrieving a template sentence from T. We reserve index 0 for the probability of selecting automatic generation and the positive integers {1, ..., |T|} for the probabilities of selecting each template in T. This first step is parameterized as a fully-connected layer with Softmax activation.

If a template in T obtains the highest probability, it is retrieved from the off-the-shelf template database and serves as the generation result of the current sentence topic (the first row on the right side of Figure 2).

If automatic generation obtains the highest probability, the generation module is activated to generate a sequence of words conditioned on the current topic state (the second row on the right side of Figure 2).
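
The decision rule described above (index 0 for generation, indices 1 to |T| for templates) can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `retrieval_policy` and the logit values are hypothetical, and the logits stand in for the output of the fully-connected layer applied to a topic state.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_policy(logits, templates):
    """Pick the most probable action from |T| + 1 logits.

    Index 0 means "generate a new sentence"; index k >= 1 means
    "retrieve templates[k - 1]".
    """
    if len(logits) != len(templates) + 1:
        raise ValueError("expected one logit per template plus one for generation")
    probs = softmax(logits)
    action = max(range(len(probs)), key=probs.__getitem__)
    if action == 0:
        return action, None  # the caller would invoke the generation module
    return action, templates[action - 1]


templates = [
    "the heart size is normal",
    "there is no pleural effusion or pneumothorax",
]
# Hypothetical logits for two different topic states.
print(retrieval_policy([0.1, 2.3, 0.4], templates))
print(retrieval_policy([1.9, 0.2, 0.4], templates))
```

The sketch takes the highest-probability action, as in the description above. During reinforcement learning, a policy of this kind is usually sampled from rather than maximized, so that the agent can explore alternative actions.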

Generation Module. The generation module generates a sequence of words conditioned on the current topic state qi and image context vector for each sentence. It comprises RNNs that take environment parameters and previous hidden states as input, and generate a new hidden state that is further transformed into a probability distribution over all words. We define environment parameters as a concatenation of the current topic state qi, the context vector encoded by following the same attention paradigm in the sentence decoder, and the embedding of the previous word.

Figure 2. The Hybrid Retrieval-Generation Reinforced Agent. Visual features are encoded by a CNN and image encoder and then fed to a sentence decoder to recurrently generate hidden topic states. For each topic state, a retrieval policy module decides to either automatically generate a sentence or retrieve a template from the database. Dashed black lines indicate hierarchical policy learning.

Hierarchical Reinforcement Learning

Our objective is to maximize the reward of the generated report compared to the ground truth report. Omitting the condition on image features for simplicity, the loss function can be written as:
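
In the standard policy-gradient setting, an objective of this kind takes the form below. This is a sketch in assumed notation, not necessarily the paper's exact equation: θ denotes all trainable parameters, z the sequence of retrieval decisions, y the generated report, Y* the ground truth report, and R the report-level reward.

```latex
\begin{align}
L(\theta) &= -\,\mathbb{E}_{z,\,y \sim p_\theta}\big[ R(y, Y^{*}) \big] \\
\nabla_\theta L(\theta) &= -\,\mathbb{E}_{z,\,y \sim p_\theta}\big[ R(y, Y^{*})\, \nabla_\theta \log p_\theta(z, y) \big]
\end{align}
```

Minimizing L is equivalent to maximizing the expected reward, and the second line is the usual REINFORCE gradient estimate.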

The loss of HRGR-Agent comes from two parts: the retrieval policy module and the generation module.

Policy Update for the Retrieval Policy Module. We define the reward Rr for the retrieval policy module at the sentence level. The sentence that is produced, whether generated or retrieved from the template database, is used for computing the reward. The discounted sentence-level reward and its corresponding policy update according to the REINFORCE algorithm can be written as:
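
A generic discounted-return form that matches this description is sketched below. The notation is an assumption made for illustration: z_i is the action taken for the i-th sentence, r(z_i) is a per-sentence score against the ground truth report, γ is the discount factor, N is the number of sentences, and π is the retrieval policy with parameters θ_r conditioned on the topic state q_i. The paper's exact definition of the per-sentence reward may differ.

```latex
\begin{align}
R_r(z_i) &= \sum_{j=0}^{N-i} \gamma^{\,j}\, r(z_{i+j}) \\
\nabla_{\theta_r} L_r &= -\,\mathbb{E}\Big[ \sum_{i=1}^{N} R_r(z_i)\, \nabla_{\theta_r} \log \pi_{\theta_r}(z_i \mid q_i) \Big]
\end{align}
```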

Policy Update for the Generation Module. We define the word-level reward for each word generated by the generation module as a discounted reward of all generated words after the considered word. The discounted reward function and its policy update for the generation module can be written as:
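
The discounted reward described above can be computed with a single backward pass over the sequence. The following is a minimal sketch, assuming hypothetical per-word rewards and a hypothetical discount factor; the function names are introduced here for illustration and are not taken from the paper.

```python
def discounted_rewards(step_rewards, gamma):
    """Return R_t = sum over j >= 0 of gamma**j * step_rewards[t + j], for every t."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss(log_probs, step_rewards, gamma):
    """Value of the REINFORCE surrogate loss: -sum_t log p(y_t) * R_t.

    With an autodiff framework, differentiating this quantity with respect
    to the policy parameters gives the policy-gradient estimate. Here it is
    only evaluated numerically.
    """
    returns = discounted_rewards(step_rewards, gamma)
    return -sum(lp * r for lp, r in zip(log_probs, returns))


# Hypothetical rewards and log-probabilities for a three-word sentence.
rewards = [0.0, 1.0, 0.5]
log_probs = [-1.0, -2.0, -4.0]
print(discounted_rewards(rewards, gamma=0.5))
print(reinforce_loss(log_probs, rewards, gamma=0.5))
```

The same helper applies at the sentence level if `step_rewards` holds one reward per sentence instead of one per word.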

Our Results

We conduct experiments on two medical image report datasets: the Indiana University Chest X-Ray Collection (IU X-Ray), a public dataset consisting of 7,470 frontal and lateral-view chest x-ray images paired with their corresponding diagnostic reports; and CX-CHR, a private dataset of chest x-ray images from 35,500 patients with corresponding Chinese reports collected from a professional medical institution. For the template database, we select candidate sentences with high document frequencies (the number of times a sentence occurs in the training documents) in the training set. Table 1 shows examples of templates for the IU X-Ray dataset.
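
Selecting templates by document frequency can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function name, the threshold `min_doc_freq`, the naive split on periods, and the toy reports are all assumptions, and the sketch matches sentences exactly rather than grouping sentences with the same meaning as Table 1 describes.

```python
from collections import Counter


def build_template_database(reports, min_doc_freq):
    """Keep sentences that appear in at least `min_doc_freq` reports."""
    doc_freq = Counter()
    for report in reports:
        # A set counts each sentence at most once per report.
        sentences = {s.strip().lower() for s in report.split(".") if s.strip()}
        doc_freq.update(sentences)
    frequent = [(s, n) for s, n in doc_freq.items() if n >= min_doc_freq]
    # Most frequent first; ties broken alphabetically for a stable order.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return frequent


# Toy training corpus (hypothetical reports).
reports = [
    "The heart size is normal. There is no pleural effusion or pneumothorax.",
    "The heart size is normal. Small right pleural effusion.",
    "There is no pleural effusion or pneumothorax. The heart size is normal.",
]
for sentence, count in build_template_database(reports, min_doc_freq=2):
    print(count, sentence)
```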

Table 2 shows an automatic evaluation comparison of state-of-the-art methods and our model variants. Most importantly, HRGR-Agent outperforms all baseline models (state-of-the-art methods that have no retrieval mechanism or hierarchical reinforcement learning) on both datasets by great margins, demonstrating its effectiveness and robustness. In particular, on CX-CHR, HRGR-Agent achieves a CIDEr score that is 0.73 greater than that of HRG, our variant without reinforcement fine-tuning, demonstrating that reinforcement fine-tuning is crucial to performance.

Table 2. Automatic evaluation results on CX-CHR and IU X-Ray datasets. BLEU-n denotes BLEU scores up to n-grams.
Table 3. Average accuracy (Acc.) and average false positive (AFP) of medical abnormality terminology detection and human evaluation (Hit). The higher the Acc. and the lower the AFP, the better.

The last row of Table 3 shows the average human preference percentage of HRGR-Agent compared to Generation and CoAtt [13] on CX-CHR and IU X-Ray, evaluated in terms of content coverage, accuracy of specific terminology, and language fluency. HRGR-Agent achieves much higher human preference than baseline models, showing that it is able to generate natural and plausible reports. HRGR-Agent also achieves the highest accuracy and lowest AFP out of all of the models, in large part thanks to its unique ability to detect rare and abnormal findings.

Figures 3 and 4 show qualitative results of HRGR-Agent and baseline models on both datasets. The reports generated by HRGR-Agent are generally longer than those of the baseline models and contain a balance of template and generated sentences. Among the generated sentences, HRGR-Agent also has a higher rate of detecting abnormal findings.

Figure 3. Examples of ground truth reports and generated reports by CoAtt [13] and HRGR-Agent. Bolded phrases are terms for medical abnormalities; italicized text is from the template database.
Figure 4. Examples of ground truth reports and generated reports by Retrieval, Generation, and HRGR-Agent. Bolded phrases are terms for medical abnormalities.

Our HRGR-Agent is the first attempt at bridging human prior knowledge and generative neural networks via reinforcement learning. The results are promising and we will continue to improve our model so that, in the future, healthcare professionals can confidently rely on the medical image reports it generates in clinical settings.

For more details on our work, you can read our paper here: https://arxiv.org/abs/1805.08298