Automatic Medical Report Generation From X-Ray Images through AI

Dhilip Maharish · Published in Analytics Vidhya · Oct 20, 2020

Can we make a machine act as a radiologist and generate medical reports for patients?

Source: Google Images

Table of Contents:

  1. Business Problem
  2. Source of Data
  3. Evaluation Metric
  4. Data Preprocessing
  5. Exploratory Data Analysis and its Observations
  6. Existing Research Papers/Blogs
  7. My First Cut Approach and Architecture
  8. Conclusion
  9. Future Work
  10. References

1. Business Problem:

In the medical field, radiologists need to describe medical X-ray images. Summarizing an X-ray in the form of a radiology report is a complex task, and great care must be taken in generating these reports. A radiology report is a complete study of the X-ray images describing normal and abnormal conditions, and it drives the decision about the proper medication. Hence, a radiologist is expected to summarize each report diligently. A problem with today's diagnostics is medical report errors, which increase day by day due to the lack of experienced radiologists. These errors are dangerous and can put lives at risk through incorrect diagnoses and medications.

1.1 Project Objective:

Taking the above diagnostic errors into account, and also to reduce the workload of radiologists, this process can be implemented using artificial intelligence, which can automatically generate reports by processing the X-ray images.

  1. Input the medical X-ray images to a deep learning model and generate the radiology report as the output.
  2. The radiology report generated by the model must be approximately similar to the actual report.

2. Source of Data:

  1. https://academictorrents.com/details/5a3a439df24931f410fac269b87b050203d9467d
  2. https://academictorrents.com/details/66450ba52ba3f83fbf82ef9c91f2bde0e845aba9
Sample data from Indiana University

The above links point to the Open-i chest X-ray collection dataset, for which the problem is proposed by the Indiana University hospital network.

2.1 Data Overview:

The dataset contains chest X-ray images and radiology text reports.

  1. Chest X-rays: There are 7,471 images in .png format (a lateral view and a frontal view for each patient). Below are some random patients' images displaying the lateral and frontal X-ray views.
Random Patient images

2. Radiology reports: There are 3,955 patient text reports available in .xml format. Below is a sample containing the raw data of a single patient report in .xml format.

Raw data

3. Evaluation Metric:

For this problem, the BLEU score is set as the performance metric to measure model performance.

The Bilingual Evaluation Understudy (BLEU) score is a metric for evaluating a generated sentence against a reference sentence. A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.

This metric is most commonly used for sequence models such as machine translation, image captioning, and speech-to-text.

4. Data Preprocessing:

Each of the 3,955 patients' raw records contains X-ray images and a radiology report with Indication, Finding, and Impression sections. These details are extracted from the .xml files: the X-ray image names from the <parentImage> tag and the text report from the <Abstract> tag.

Here is a code snippet to extract the raw data:

Extracting a Raw Data

output:

4.1 Checking whether null values are present in the dataset:

  1. For image features:
Finding how many reports contain no images

output:

observation:

There are 104 medical records without images.

Each report must contain at least one image to feed into the model.

So, based on the above result, we remove the 104 patient records without images.

2. For Text Reports:

In the text reports, if null values are present, it is not necessary to remove the records; instead, we fill in text of the corresponding meaning.

a. Indication feature: null values are filled with the text "No Indication"

b. Finding feature: null values are filled with the text "Result Not Found"

c. Impression feature: null values are filled with the text "No Impression"

4.2 Text Preprocessing:

Cleaning the text is very important before feeding it into the model. Here are some of the text preprocessing steps I performed:

  1. Removing HTML tags with the Beautiful Soup library
  2. Removing special characters and numerical digits
  3. Expanding contractions

For example, "won't" is changed to "will not" and "can't" to "can not".

4. Removing irrelevant tokens found in the text, such as "XXXX" (the de-identification placeholder) and "year old".

5. Removing stop words such as "a", "the", "is", and "and", which carry little meaning. However, negation words like "not" and "no" must not be removed from the sentence.

For example, in "no chest pain in cardiac muscles", removing the word "no" changes the entire meaning.

6. Converting all words in the sentences to lower case.

Here is the code snippet for the text preprocessing:

Cleaning the text data

output:

Cleaned Dataset

5. Exploratory Data Analysis and its Observations:

Before proceeding with the model architecture, let's first understand the insights in the data.

Here we have image features and text report features. Let's look at the text features first.

5.1 Analyzing and visualizing the text features:

  1. “Indication” feature

a. Finding and plotting unique indication reports vs. repeated indication reports.

Indication Analysis

output:

Observations:

From the above plot, we can observe that most indication reports are unique suggestions given to patients.

A smaller number of repeated reports can be observed; presumably these are common terms in the medical field.

b. Plotting the sentence length of each indication report:

Observation:

Most indication reports have a word count in the range of 0 to 5.

Few reports contain more than 10 words.

c. Plotting the top 100 frequent words present in the indication reports.

Top 100 Word in Indication Features

Let's plot the word cloud for the indication reports:

Word Cloud For indication Features

Observations:

The above plot shows the important words present in the indication reports.

From the word analysis, "chest", "pain", "breath", "dyspnea", and "surgery" are the most repeated words in the indication reports. These words relate to the respiratory system, which is expected since these are chest X-ray images.

2. “Finding” Feature

a. Plotting unique finding reports vs. repeated finding reports.

Finding Analysis

output:

observations:

From the above, we can see that most finding reports are unique suggestions given to patients.

b. Plotting the sentence length of each finding report:

Observation:

Most finding reports have a word count in the range of 0 to 30.

Few reports contain more than 40 words.

c. Plotting the top 100 frequent words present in the finding reports.

Top 100 Words in Finding Features

Plotting word cloud visualization,

Word cloud of Finding Features

Observations:

The above plot gives us the most frequent words in the finding reports; we can see that "pleural", "heart", "focal lungs", and "effusion" are most commonly present.

3. “Impression” Features

a. Plotting unique impression reports vs. repeated impression reports.

output:

observations:

From the above, we can see that most impression reports are unique suggestions given to patients.

A smaller number of repeated reports can be seen; presumably these are common terms in the medical field.

b. Plotting the sentence length of each impression report:

Observation:

Most impression reports have a word count in the range of 0 to 20.

Few reports contain more than 40 words.

c. Plotting the top 100 frequent words present in the impression reports.

Top 100 words in Impression Features

Plotting word cloud visualization,

Word cloud of impression features

Observations:

The above plot gives us the most frequent words in the impression reports; we can see that "acute", "cardiopulmonary", "abnormality", and "disease" are most commonly present.

5.2 Analysis and Visualization of image features:

  1. Finding the number of images present in each patient report

observation:

From the plot, we can conclude that most reports contain two images.

Some reports contain a maximum of five X-ray images.

Some reports contain only one X-ray image.

2. Visualizing 10 random patients' X-ray images

Random X-ray images

observation:

From the above random patient images, we can observe that the X-ray images differ in size.

So we should resize all the images to the same size before feeding them to the model.

5.3 Conclusion on Exploratory Data Analysis:

  1. From the above analysis, the provided images differ in size; we need to resize all images to the same size before feeding them to the model.
  2. From the image analysis, some reports contain one image and some contain more than two. We need to make sure every report contains exactly two images: a frontal and a lateral projection of the X-ray.

6. Existing Research Papers/Blogs:

solution 1: https://www.researchgate.net/publication/301837242_Learning_to_Read_Chest_X-Rays_Recurrent_Neural_Cascade_Model_for_Automated_Image_Annotation

The above paper is one of the existing solutions. The authors designed a deep learning model that efficiently detects a disease from an image and annotates its context (e.g., location, severity, and affected organs). They used the publicly available Open-i radiology dataset of chest X-rays from Indiana University.

Short summary of paper:

A common challenge in medical image analysis is data bias: considering the whole population, diseased cases are much rarer than healthy cases. In order to circumvent this normal-vs-diseased case bias, they adopted various regularization techniques in CNN training.

To train the CNN with chest X-ray images, they sampled frequent annotation patterns with little overlap in order to assign a label to each chest X-ray image, and trained with a cross-entropy criterion. They used 17 unique disease annotation patterns to label the images and train the CNN. For the CNN they chose the simple but effective Network-in-Network (NIN) model, because it is small, fast to train, and achieves similar or better performance than the more commonly used AlexNet, adopting various regularization techniques along the way. They rescaled the original images to 256×256 as the CNN input in both training and testing. In the context of report generation, they used both LSTM and GRU networks from natural language processing. The MeSH terms describing a disease range from 1 to 8 words (except "normal", which is a single word). By their observation, the majority of descriptions contain up to five words, and only 9 cases have images with descriptions longer than 6 words; ignoring these, they constrain the RNN to unroll up to 5 time steps. Annotations with fewer than five words are zero-padded with the end-of-sentence token to fill the five-word space. They set the initial state of the RNN to the CNN image embedding (CNN(I)), with the first annotation word as the initial input. The outputs of the RNN are the following annotation word sequences, and they train the RNN by minimizing the negative log likelihood between the output sequences and the true sequences:

L(I, S) = −∑_{t=1}^{N} log P_RNN(y_t = s_t | CNN(I))

In the above equation, y_t is the output word of the RNN at time step t, s_t the correct word, CNN(I) the CNN embedding of input image I, and N the number of words in the annotation (N = 5 with end-of-sequence zero-padding). They use the output of the last spatial average-pooling layer as the image embedding to initialize the RNN state vectors. The RNN state vectors have size ℝ^(1×1024), identical to the output size of the average-pooling layers of NIN and GoogLeNet. They use the BLEU score as the evaluation metric over all images and annotations in training, validation, and testing. They noticed that the LSTM is easier to train, while the GRU model yields better results with more carefully selected hyper-parameters; although they find it difficult to conclude which model is better, the GRU model seems to achieve higher scores on average.

solution 2: https://medium.com/@Petuum/on-the-automatic-generation-of-medical-imaging-reports-7d0a7748fe3d

The above blog is quite exciting: it explains the conversion of chest X-ray images to reports using the Open-i dataset from Indiana University. Their challenge was that a complete diagnostic report is composed of multiple heterogeneous forms of information that are technically difficult to unify in a single framework. The report for a chest X-ray contains three sections with distinct types of text: a single sentence (Impression), a paragraph (Findings), and a list of keywords (MTI tags). To address this, they built a multi-task framework that treats the prediction of lists of words (tags) as a multi-label classification (MLC) task, and treats the generation of long descriptions (Impression and Findings) as a text generation task.

Main model Frame work

In the above architecture, they first adopted a Convolutional Neural Network (CNN) to extract the visual features of an X-ray image. These features are then used to generate keywords (MTI tags) through multi-label classification. Next, they adopted a hierarchical Long Short-Term Memory network (LSTM) to generate the longer-form parts of the medical report (Findings and Impression). Within the hierarchical LSTM, they used a co-attention module to localize the abnormal areas and focus on specific keywords, which guides the sentence-LSTM and word-LSTM to generate a more precise diagnostic report.

7. My First Cut Approach and Architecture:

Let me explain the workflow of what I have done.

In my work, the class label is the "Impression" feature: two images are sent to the model, and it should generate the impression text report.

  1. Converting all medical reports so that each contains exactly two X-ray images (a frontal view and a lateral view).

From the data analysis above, we observed that some reports contain one X-ray image and some contain more than two.

a. Reports containing a single image have that image replicated twice.

b. For reports containing more than two images, one frontal image and one lateral image are selected. To pick one frontal view and one lateral view correctly, the image file names are mapped against the projection.csv data frame.

Here is the projection.csv data frame, which contains all the image file names with their projections for the entire 3,955 medical reports.

Here is the code snippet in which the above two steps (a and b) are performed:

Final Data

output:

Here is the final data frame, with "Frontal_image" and "lateral_image" as input features and "impression_report" as the target feature.

2. Splitting the above data frame into train, validation, and test data.

Here we assign 80% of the data as train data, 10% as validation data, and 10% as test data (a split sketch follows the shapes below).

a. Train input data shape → (3118,2)

b. Validation input data shape →(386,2)

c. Test input data shape →(347,2)

3. Applying word tokenization and padding to the text class label.

a. Tokenization

Before feeding text data into a deep learning model, we need to create a text encoding that maps each word to a unique integer value; this method is called tokenization. It is a way of separating a piece of text into smaller units called tokens, where each word in the corpus is assigned a unique integer token.

Word_tokenize

After fitting the tokenizer on the train data, we find the vocabulary size, i.e., the count of unique word tokens in the entire corpus. Applying the tokenizer gives us 1,352 unique word tokens.

b. Padding

Each report sentence in the target "Impression_report" variable varies in length: some reports have long sentences and some have short ones. We need to make all report sentences the same length before feeding them to the deep learning model, and this technique is known as padding. From the data analysis of the "Impression" feature, the maximum sentence length is about 85 words. Considering this value, all shorter sentences are brought to length 85 by adding zero tokens at the end.

Padding

4. Extraction of Input image features

The X-ray images contained in the reports are the input to the model. We need to convert every image into a fixed-size vector, and these vectors are fed as input to the model.

Here I tried a transfer learning technique to extract every image's features into a vector. Transfer learning is a technique where a deep learning model trained on a large dataset is used to perform a similar task on another dataset; we call such a model a pre-trained model.

Here I use the DenseNet121 pre-trained model, which contains 121 convolutional layers. The pre-trained CheXNet weights are given in the file "brucechou1983_CheXNet_Keras_0.3.0_weights.h5"; these weights were trained on the largest publicly available chest X-ray dataset, containing over 100,000 frontal-view X-ray images with 14 diseases. Such weights are well suited to our task.

workflow of image Extraction

Here is the pre-trained model as implemented, with the last layer replaced by global average pooling. This produces outputs with shape (1, 1024).

Pre-trained Model
Pre-trained model Summary

5. Basic Model Architecture (Encoder-Decoder Model)

Here is the basic model architecture, which consists of a simple encoder and decoder. The model is built using the tf.keras library.

Model Architecture
Basic Model Architecture

Encoder:

This is the initial stage, where the two image vectors (frontal view and lateral view) from the pre-trained model are fed as input to the encoder. Next, both image vectors are concatenated into a single vector and passed into a dense layer. Finally, the encoder output is flattened to a vector of size embedding_dim.

Encoder State

Decoder:

In the decoder, the text reports are fed as input and passed through an embedding layer.

An embedding layer maps words to real-valued vectors in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in the same way as the weights of a neural network; this technique is often used in deep learning.

Next, the embedding layer output is passed into an LSTM layer.

Long Short-Term Memory (LSTM) is the most popular RNN sequence model and is able to learn long-term dependencies.

Finally, the output of the LSTM layer and the output of the encoder are concatenated, and this vector is passed to a feed-forward neural network with a dense layer; the output layer is a softmax layer, which generates a probability for each vocabulary word in the corpus.

Decoder State

Defining the Loss Function and Optimizer:

Here the loss function is defined as masked sparse categorical cross-entropy. The reason for this loss is as follows.

For example, take the sequence of tokens [2], [4], [5], [7], [9], [3], [0], [0], [0], [0], [0].

This sequence has 6 real words; the zeros correspond to padding, which is not actually part of the report. Without masking, the model assumes the zeros are also part of the sequence and tries to learn them: when the model predicts a zero correctly, the loss decreases as if it had learned something useful. But the loss should only decrease when the model predicts the actual non-zero tokens correctly.

Therefore, we mask the zeros in the sequence so that the model doesn't give its attention to them and only learns the needed words in the report.

Masked Loss Function

The model is trained with the Adam optimizer with a learning rate of 0.001.

Model Training:

For training the model, we use a concept called teacher forcing. It is a strategy for training recurrent neural networks that uses the ground-truth output from a prior time step as an input.

At the first time step t of the decoder, the "<START>" token is fed as input to begin the training process. At each subsequent time step t+1, the ground-truth word from the previous step (rather than the word the model generated) is fed as input, together with the image vector. This process of providing the ground-truth source text as input is called teacher forcing.

Model Training

Performance characteristic graphs of the basic model through TensorBoard:

Model performance

observation of plot:

From the above plot, the train loss (blue curve) and validation loss (orange curve) both decrease smoothly as the epochs increase.

From the graph, the train loss is 0.61 and the validation loss is 0.65.

Model Inference through Argmax Search:

After training the model, let's begin the testing phase using the test dataset. Here argmax search (a common inference method for sequence models) is performed: at each step, it takes the word with the highest probability, similar to greedy search.

Model Inference

Result and Observation:

Let us look at some results showing how the basic model predicts on the test data.

Some random Test data point:

1.

From the above result, the model predicted most of the words of the actual "impression" report.

2.

From the above result, the model made a completely wrong prediction; this is a worst case.

3.

From the above result, the model predicted exactly the same sentence as the actual impression report.

4.

From the above result, we can observe that the model's prediction does not convey the same meaning as the actual sentence.

Final Conclusion on Basic Model:

  1. Observing the test results, the model does not perform well on long sentences.
  2. On some short sentences, it performs well and predicts words similar to the actual sentence.
  3. On other short sentences, it predicts similar words, but the overall meaning changes compared to the actual sentence.

6. Main Model (Attention Mechanism):

The basic model (simple encoder-decoder) above did not give proper results on the test data, so some improvement in the model architecture is needed; in particular, long sentences were the worst predictions of the basic model.

So we extend the basic model architecture by adding a bidirectional GRU with an attention mechanism (additive attention), which helps the model perform well on long sentence reports.

Let's quickly look at how the attention mechanism works.

Additive Attention Architecture | source

Input: the model is fed both the image vector and the report text with embedding dimension; the two inputs are added and sent as the context to the decoder.

Decoder stage: a bidirectional GRU is used to extract high-level features from the input, for a deeper understanding of the input features.

Additive attention: it assigns a weight (alpha) to every word in the sequence and sums the word-level features from each time step into a sentence-level feature vector. This is a simplified form of Bahdanau's attention.

Let's see how the weights are calculated.

As illustrated above, the weights are calculated by a feed-forward neural network, and the softmax output updates the context weights through back-propagation.

The context vector c_i for the output word y_i is generated using the weighted sum of the annotations: c_i = ∑_{j=1}^{T_x} α_ij · h_j

The weights α_ij are computed by a softmax function: α_ij = exp(e_ij) / ∑_{k=1}^{T_x} exp(e_ik)

Here e_ij = a(s_{i−1}, h_j) is the output score of a feed-forward neural network, described by the function a, that attempts to capture the alignment between the input at j and the output at i.

Basically, if the encoder produces T_x "annotations" (the hidden state vectors), each having dimension d, then the input dimension of the feed-forward network is (T_x, 2d) (assuming the previous state of the decoder also has d dimensions, and the two vectors are concatenated). This input is multiplied by a matrix W_a of dimensions (2d, 1) (followed, of course, by the addition of a bias term) to get scores e_ij of dimension (T_x, 1).

On top of these e_ij scores, a tan hyperbolic function is applied, followed by a softmax to get the normalized alignment scores for output i:

So, α is a (T_x, 1)-dimensional vector, and its elements are the weights corresponding to each word in the input sentence.

If you still need more detail about the attention mechanism, refer to this blog; the above explanation of the attention mechanism is taken from that source.

Output: finally, the context vector from the attention layer is sent to a feed-forward neural network, and the final output word is generated by a softmax layer.

Here is the attention mechanism architecture implemented using the TensorFlow Functional API:

Attention Architecture design

Main Model summary:

Attention Model summary

Main Model Architecture:

Attention model Architecture

Encoder:

In the encoder stage, the same architecture as the basic model is applied. Finally, the image vector is added to the text embedding vector through the Add layer in the TensorFlow API.

Encoder State

Decoder:

In the decoder architecture, the fused input features (image vector plus text report vector) are fed into a bidirectional GRU, which extracts the full input information and sends it to the additive attention layer, a simplified TensorFlow implementation of the Bahdanau attention mechanism. The context vector output by the attention layer is then fed into a feed-forward neural network, and finally the output word is generated through a softmax layer.

Decoder State

The same loss function, metric, training method, optimizer, and model inference as in the basic model are used here; they are explained in detail in the basic model section above.

Performance characteristic graphs of the main model through TensorBoard:

Model Performance

Plot Observation:

Here we used masked sparse categorical cross-entropy as the loss function. From the loss plot, we can clearly observe that the loss decreases rapidly as the epochs increase.

The train loss is 0.4 and the validation loss is 0.5 after training for 16 epochs.

Result and Observation:

Here the reports generated by the model are measured using the BLEU score, applying both individual n-gram and cumulative n-gram scores.

Individual N-gram Score:

An individual n-gram score evaluates the matching grams of a specific order only, such as single words (1-gram) or word pairs (2-gram or bigram).

Cumulative N-gram Score:

Cumulative scores calculate the individual n-gram scores at all orders from 1 to n and combine them by taking a weighted geometric mean.

To learn more about the implementation of the BLEU score, refer to this.

Here is the code implementation to visualize the results:

Results and metric observation

Good Results from the model:

1.

2.

3.

From the above test data points, we can conclude that the model performs well on long sentence reports, and some generated reports match the actual reports exactly.

Bad Results from the model:

1.

2.

We cannot conclude that the model always performs well; some of the above results show the model generating poor reports.

Reasons for the errors:

We can observe that some images are unclear, with dull patches and noise.

We can also observe finger shadows in some X-ray images, which badly affect the model.

Beam Search Algorithm:

The beam search algorithm keeps multiple alternative output sequences at each time step, ranked by conditional probability. The number of alternatives depends on a parameter called the beam width B: at each time step, beam search keeps the B best alternatives with the highest probability as the most likely choices for that step. If you want to understand beam search in more detail, go through this and this.

Beam Search

Results and Observation:

Here I applied a beam width of 3 for all the results.

Good Results from the model:

1.

2.

Bad Results from the model:

1.

2.

Conclusion of Beam Search Algorithm:

From the above results, using a beam width of 3, we obtain a good BLEU score for short sentences, and the model generates the actual reports exactly. For long sentences, the prediction is worse and does not completely match the actual report. Instead of a beam width of 3, we could try a larger width to get better results for long sentences.

8. Conclusion:

1. From the above experiments, we can observe and conclude that the additive attention model gave better results than the basic model and beam search decoding.

2. We can clearly observe a good BLEU score with the additive attention model, and it performed well on long sentences compared to the basic model.

3. We can observe that beam search performed well in generating short sentences, but it failed in generating long sentences.

9. Future Work:

  1. In the future, if we train with more medical records, we can expect better results with higher scores in generating reports.
  2. We could apply Transformer or BERT layers in the decoder stage to improve text report generation compared to the attention layer.
  3. The best model could be converted into an Android or desktop application that radiologists can use as a tool to generate reports from X-ray images.

10. References:

1.https://www.researchgate.net/publication/301837242_Learning_to_Read_Chest_X-Rays_Recurrent_Neural_Cascade_Model_for_Automated_Image_Annotation

2.https://medium.com/@Petuum/on-the-automatic-generation-of-medical-imaging-reports-7d0a7748fe3d

3.https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

4.https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

5.https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2

6.https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/

7.https://medium.com/analytics-vidhya/automatic-impression-generation-from-medical-imaging-report-3077d1d77d20

8.https://towardsdatascience.com/an-intuitive-explanation-of-beam-search-9b1d744e7a0f

9.https://www.youtube.com/watch?v=RLWuzLLSIgw

10.https://www.appliedaicourse.com/

Here is the entire code notebook for the above project:

https://github.com/Dhilip1997/Automatic-Medical-Report-Generation-From-X-ray-images-through-AI

If you want to connect with me, here is my LinkedIn.
