Automatic Impression Generation from Medical Imaging Report

Process of generating textual description from medical report — end-to-end Deep learning model

Anand Pandiyan · Published in Analytics Vidhya · Jul 12, 2020 · 22 min read


Table of Contents

  1. Business Problem
  2. Introduction about the Dataset
  3. Prerequisite
  4. Existing Research-Papers/Solutions
  5. My Approach — Solution
  6. XML Parsing Creating Data Points
  7. Data Preprocessing
  8. Exploratory Data Analysis
  9. Data point construction
  10. Train Test and Validation split
  11. Tokenization and Dataset preparation
  12. Basic Model [CNN]
  13. Main Model [CNN-BiLSTM]
  14. Conclusion
  15. Error Analysis
  16. Future work
  17. References

1. Business Problem

The problem statement here is to generate the impression from given chest X-ray images. The images come in two views, frontal and lateral, and with these two views as input we need to generate the impression for a given X-ray.

To solve this problem we will build a deep learning model that involves both image and text processing. Automatically describing the content of a given image is a recent artificial intelligence task that connects computer vision and natural language processing.

2. Introduction about the Dataset

Open-i chest X-ray collection from Indiana University

This dataset contains 7,470 chest X-rays with 3,955 radiology reports from the Indiana University hospital network. Images are downloaded in PNG format and reports in XML format.

Each XML file is the report for the corresponding patient. To identify the images associated with a report, we check the id attribute of the <parentImages id="image-id"> tag; the id holds the file name of the corresponding PNG image. More than one image can be associated with one report (XML file).

Original data source: https://openi.nlm.nih.gov/

Other Resources: https://www.kaggle.com/raddar/chest-xrays-indiana-university

Sample Data point:

sample openi nlm datapoint

3. Prerequisite

Before we go deeper into this work, I assume that you are familiar with the following deep learning concepts and Python libraries.

Convolutional Neural Networks, Recurrent Neural Networks, LSTMs, transfer learning, activation functions, optimization techniques like SGD and Adam, loss functions like categorical cross entropy and sparse categorical cross entropy, and finally TensorBoard for performance visualization and debugging.

Python, TensorFlow, Keras, the Keras tokenizer, Pandas, NumPy, Matplotlib, and an understanding of the Sequential API, Functional API and model-subclassing styles of Keras model implementation. The reason I have chosen the subclassed model is that it is fully customizable and enables you to implement your own custom forward pass of the model. We also get control over every nuance of the network and the training process.

source

Below I have mentioned important blogs and tutorials to begin with.

1. https://www.tensorflow.org/tutorials/text/nmt_with_attention — TensorFlow Tutorial

2. https://www.tensorflow.org/tutorials/text/image_captioning — TensorFlow Tutorial

3. https://becominghuman.ai/transfer-learning-retraining-inception-v3-for-custom-image-classification-2820f653c557 — Transfer Learning tutorial

4. https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202 — InceptionV3 model tutorial

5. https://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras/ — why ImageNet — why InceptionV3

6. https://www.pyimagesearch.com/2019/10/28/3-ways-to-create-a-keras-model-with-tensorflow-2-0-sequential-functional-and-model-subclassing/ — 3 ways Keras model implementation

7. https://www.tensorflow.org/tensorboard/get_started — TensorBoard Tutorial

4. Existing Research-Papers/Solutions

This work is inspired by the research paper and blogs below:

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

In the mentioned paper the authors use an encoder-decoder model with an attention mechanism. In the encoder they use a CNN to extract features from the images. In the decoder they use a long short-term memory (LSTM) network that produces a caption by generating one word at every time step, conditioned on a context vector, the previous hidden state and the previously generated words. They use the BLEU score to measure the performance of the model.

A few other blogs I have referenced:

1. https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2

2. https://www.analyticsvidhya.com/blog/2018/04/solving-an-image-captioning-task-using-deep-learning/

5. My Approach — Solution

Initially I will do the Exploratory Data Analysis on both the image input and the text output. With EDA I can find the data imbalance, the image availability per patient and the type of images associated with each patient. After the EDA I will implement a deep learning model with two different approaches to see the improvement of one over the other.

1. The basic model:

A simple encoder-decoder architecture. The encoder has a single fully connected layer on top of the CNN: it takes the image features from a pretrained InceptionV3 model and produces the image feature vector. The decoder has an LSTM layer that takes two inputs: the image feature vector and the text sequence, one word at each time step.

2. Main Model:

I will use an encoder-decoder architecture to generate the impression from the chest X-ray. The encoder outputs the image feature vectors. The feature vectors are then passed to a decoder with an attention mechanism, which generates the next word describing the content of the image. Starting from the same approach as the basic model, I will create a new architecture based on the research paper Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification.

As an initial step I will do an image classification task with the InceptionV3 model on this dataset: https://www.kaggle.com/yash612/covidnet-mini-and-gan-enerated-chest-xray. I will save the weights from this training and use them in the encoder's feature extraction by loading the saved weights into InceptionV3.

Encoder:

The encoder is a single fully connected linear layer. Each input image is given to InceptionV3 to extract its features. The extracted features of the two images are added and passed to the FC layer to get the output vector. This last hidden state of the encoder is connected to the decoder.

Decoder:

The decoder has a bidirectional LSTM layer which does language modelling up to the word level. The first time step receives the encoded output from the encoder and the <start> vector. This input is passed to a two-stage bidirectional LSTM layer with an attention mechanism. The output consists of two vectors: the predicted label and the decoder hidden state, which is fed back into the decoder at each time step. The detailed architecture is shown below.

High level model Architecture

6. XML Parsing Creating Data Points

In this section we will see how the raw XML data is parsed and structured as data points. The data points are then stored in CSV files for future model requirements.

Raw XML Tree View:

From the XML file we will extract the Abstract and parentImage nodes. These contain the impression and the image file names, as shown below.

Impression level:

We will retrieve the AbstractText values.

Image File name:

The image file name is available in the id attribute. We can ignore the other details because they are not relevant for our report. As we can see there are two parentImage nodes, so we have two images for this report.

XML parser code to retrieve the details mentioned above.

xml parsing
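A minimal parsing sketch, assuming the XML reports sit in a local reports/ folder (a hypothetical path) and follow the Open-i layout described above:

import glob
import xml.etree.ElementTree as ET

import pandas as pd

def parse_report(xml_path):
    """Extract the abstract fields and image ids from one Open-i report."""
    root = ET.parse(xml_path).getroot()
    record = {"xml_file": xml_path}
    # AbstractText nodes carry a Label attribute such as COMPARISON,
    # INDICATION, FINDINGS or IMPRESSION.
    for node in root.iter("AbstractText"):
        record[node.attrib.get("Label", "").lower()] = node.text
    # Each parentImage id holds the PNG file name for this report.
    record["image_ids"] = [img.attrib["id"] for img in root.iter("parentImage")]
    return record

records = [parse_report(path) for path in glob.glob("reports/*.xml")]
df = pd.DataFrame(records)
df.to_csv("parsed_reports.csv", index=False)
print(df.shape)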

After extraction we have 3,955 rows; the data in dataframe view:

parsed data

7. Data Preprocessing

In this phase the text data is preprocessed to remove unwanted tags, text, punctuation and numbers. We will also check for empty cells and NaN values (a minimal cleaning sketch is shown after the list below).

  • If there are any empty cells in the image name column, we drop those rows.
  • If there is any empty or NaN value in a text column, we replace it with "No <Column Name>" (e.g. No Impression).
  • Each text column's word count is calculated and added as a dataframe column.
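A minimal cleaning sketch, assuming the dataframe df from the parsing step above and lowercase column names (an assumption on my part):

import re

import numpy as np

TEXT_COLS = ["comparison", "indication", "findings", "impression"]

def clean_text(text):
    """Lowercase, drop de-identified XXXX tokens, punctuation and numbers."""
    text = str(text).lower()
    text = re.sub(r"\bx{2,}\b", " ", text)   # placeholders like XXXX / XXXXX
    text = re.sub(r"[^a-z\s]", " ", text)    # punctuation and digits
    return re.sub(r"\s+", " ", text).strip()

# Drop rows that have no associated image.
df = df[df["image_ids"].map(len) > 0].copy()

for col in TEXT_COLS:
    # Replace empty or NaN text cells with "No <Column Name>".
    df[col] = df[col].replace("", np.nan).fillna(f"No {col.capitalize()}")
    df[col] = df[col].apply(clean_text)
    # Word count per column, e.g. impression_count.
    df[col + "_count"] = df[col].str.split().str.len()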

After data preprocessing and missing value handling, below is the dataframe view; we have a total of 3,851 rows in the final data points.

  • Total number of unique Images 3851
  • Total number of unique Caption 402
  • Total number of unique Comparison 281
  • Total number of unique Indication 2098
  • Total number of unique Findings 2545
  • Total number of unique Impression 1692

8. Exploratory Data Analysis

In this section we will see different approaches to analyze the dataset by summarizing and visualizing its main characteristics.

8.1 EDA on Text data

In the text analysis we take the impression column as the target variable. With the visualization below we can see the top 100 most frequently occurring sentences.

Sentence occurrences for Impression

sentence occurrence for impression
  • From the above visualization we can see that "No acute cardiopulmonary abnormality" occurs almost 600 times.
  • Longer sentences mostly occur 10 times or fewer.

Word occurrences for Impression

We will see the word-wise occurrences using a word cloud for the impression column.

word cloud
  • The above word cloud is generated from the top 1,000 most frequent words.
  • Acute, cardiopulmonary, abnormality, disease, pleural, effusion and active are the highlighted words in the above visualization.

Word count distribution

Let's see the word count distribution of the impression column. As we have already calculated the word count in the impression_count column, we see the distribution below.

Minimum word count is 1 — Maximum word count is 122 — Median word count is 5.0

  • We can see the maximum and minimum word counts from this distribution.
  • The most frequent word count is around 5.
  • Most impressions are between 5 and 10 words long (a small plotting sketch follows).
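A small plotting sketch for the distribution above, assuming the impression_count column created during preprocessing:

import matplotlib.pyplot as plt

counts = df["impression_count"]
print("min:", counts.min(), "max:", counts.max(), "median:", counts.median())

plt.figure(figsize=(8, 4))
plt.hist(counts, bins=50)
plt.xlabel("Words per impression")
plt.ylabel("Number of reports")
plt.title("Impression word count distribution")
plt.show()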

8.2 EDA on Image data

Let's analyze the total number of images present per data point (report).

Minimum Image count is 1 — Maximum Image count is 5 — Median Image count is 2.0

  • The most frequent image count per record is 2.
  • The second most frequent is a single image.
  • We also have records with 5 images.

Displaying 25 random patient X-rays

As we have seen, the images are in both frontal and lateral views, and each patient has one or more images associated with them. Let's see some random data points with their images.

Sample data point

8.3 EDA Findings

  • All the raw text from the XML files is parsed to create the dataset.
  • Each patient has multiple X-rays associated with them.
  • A major finding is how the images are sequenced, i.e. the number of images associated with each record.
  • We mostly have 2 images per record (frontal and lateral), but we also have 1, 3, 4 or 5 images associated with some records.
  • There are no missing files. We have a total of 3,955 records, 3 additional features (Comparison, Indication and Findings) which we will not use in this model, and 1 target variable, Impression.
  • Most occurring words in Impression: acute, cardiopulmonary.
  • Images come in different shapes.
  • All the X-ray images are of the human upper body, particularly the chest.
  • In the text features there are some unknown values like XXXX XXXXX; these are replaced with an empty string.

8.4 Data Conflicts

There are only two image views, frontal and lateral, but we have 1, 3, 4 or 5 images associated with some data points. The conflict we face is how to provide these data points to the model we build. Because of this conflict we need an approach to structure the input data. Before building our model, we will look at a data point structuring method that can handle this case.

9. Data point construction

Some data points have more than 2 images and some have fewer. If a data point has no images, we drop it.

Let's handle the data points having 1, 3, 4 or 5 images. Below are the data point counts by number of images.

Data points having 2 images: 3,208

Data points having 1 image: 446

Data points having 3 images: 181

Data points having 4 images: 15

Data points having 5 images: 1

Total: 3,851 data points

Approach,

Limit each data point to 2 images. If a report has 5 images, pair each of the remaining images with the last image (4 + 1, all images + last image), so it becomes 4 data points as below.

Here the last image should be lateral if the remaining images are frontal.

If I have 5 images (where the 5th image is lateral and the others are frontal),

1st image + 5th image => Frontal + Lateral

2nd image + 5th image => Frontal + Lateral

3rd image + 5th image => Frontal + Lateral

4th image + 5th image => Frontal + Lateral

Increased to 4 data points from this single data point.

Likewise, for the other data points,

if I have 4 images then,

1st (Frontal) + 4th (Lateral)

2nd (Frontal) + 4th (Lateral)

3rd (Frontal) + 4th (Lateral)

Increased to 3 data points from 1 data point.

if I have 3 images then,

1st (Frontal) + 3rd (Lateral)

2nd (Frontal) + 3rd (Lateral)

Increased to 2 data points from this single data point.

If we have only one image then,

1st image (either frontal or lateral) + duplicate of the 1st image

The data point count stays the same. We need to make sure these duplicated data points are split equally among the train, test and validation sets. If we don't have a lateral image, then keep the frontal as the last image of the data point.

So with this data construction method we can also increase the number of data points and come up with clean input data points. The code for the data structuring explained above follows.

After constructing the data points we will add the <start> and <end> tokens to the text data.
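A sketch of the pairing logic described above, assuming each row's image_ids list is ordered so that the lateral view (when present) comes last:

import pandas as pd

def build_pairs(image_ids):
    """Expand one report's image list into (image_1, image_2) pairs."""
    if len(image_ids) == 0:
        return []                              # dropped earlier
    if len(image_ids) == 1:
        return [(image_ids[0], image_ids[0])]  # duplicate the single view
    last = image_ids[-1]                       # lateral view if available
    return [(img, last) for img in image_ids[:-1]]

rows = []
for _, row in df.iterrows():
    for img1, img2 in build_pairs(row["image_ids"]):
        rows.append({"image_1": img1,
                     "image_2": img2,
                     "impression": "<start> " + row["impression"] + " <end>"})

pairs_df = pd.DataFrame(rows)
print(len(df), "reports ->", len(pairs_df), "data points")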

Final datapoints,

10. Train Test and Validation split

We have two separate datasets: one without duplicated data points and one with duplicated data points. We need to split them so that the duplicated data points are distributed equally across all three splits.

After splitting the two datasets, we concatenate the corresponding splits.
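One way to keep the duplicated (single-image) data points evenly spread across the splits, sketched with scikit-learn's train_test_split (my choice of utility, not necessarily the original one):

import pandas as pd
from sklearn.model_selection import train_test_split

# Split duplicated and non-duplicated data points separately, then
# concatenate the corresponding splits so both kinds appear in every set.
dup = pairs_df[pairs_df["image_1"] == pairs_df["image_2"]]
non_dup = pairs_df[pairs_df["image_1"] != pairs_df["image_2"]]

def three_way_split(frame, seed=42):
    train, rest = train_test_split(frame, test_size=0.2, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, val, test

splits = [three_way_split(part) for part in (dup, non_dup)]
train_df, val_df, test_df = (
    pd.concat([s[i] for s in splits]).sample(frac=1, random_state=42)
    for i in range(3)
)
print(train_df.shape, val_df.shape, test_df.shape)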

We get the final data point shapes as above.

11. Tokenization and Dataset preparation

11.1. Tokenization

We cannot feed raw text to our deep learning model. Text data needs to be encoded as numbers before it can be used in machine learning and deep learning models. The Keras deep learning library provides some basic tools to perform this operation.

The total vocabulary size (vocab_size) is 1,339 and the maximum length of the output sentence is taken as 60.
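A minimal tokenization sketch with the Keras Tokenizer, keeping the < and > characters so that the <start> and <end> tokens survive:

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_LEN = 60  # maximum output sentence length used here

# Custom filters: the default would strip < and > from <start>/<end>.
tokenizer = Tokenizer(filters='!"#$%&()*+.,-/:;=?@[]^_`{|}~', oov_token="<unk>")
tokenizer.fit_on_texts(train_df["impression"])
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index

train_seqs = tokenizer.texts_to_sequences(train_df["impression"])
train_seqs = pad_sequences(train_seqs, maxlen=MAX_LEN, padding="post")
print(vocab_size, train_seqs.shape)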

11.2. Dataset Preparation

For dataset preparation we will use transfer learning for the image-to-feature-vector conversion, along with the text tokenization above.

Please refer this blog on why I have chosen the inception model over others https://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras/

I will use the InceptionV3 model trained on the ImageNet dataset. Initially I will do an X-ray classification task using the dataset mentioned below: https://www.kaggle.com/yash612/covidnet-mini-and-gan-enerated-chest-xray. It is a three-class classification task where we need to classify whether the patient's X-ray belongs to one of these 3 classes: Corona, Normal or Pneumonia.

Once the classification training is done, I will save the weights of the trained model, remove the top layer, and use the resulting (1, 2048) output as the feature vector for our model when preparing the dataset.

Below is the model architecture for this classification task.

Transfer Learning
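A sketch of this transfer learning classifier, assuming the Kaggle images are arranged in class sub-folders under xray_cls/train and xray_cls/val (a hypothetical layout):

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

IMG_SIZE = (299, 299)

base = InceptionV3(include_top=False, weights="imagenet",
                   input_shape=IMG_SIZE + (3,), pooling="avg")

model = models.Sequential([
    layers.Rescaling(1.0 / 127.5, offset=-1, input_shape=IMG_SIZE + (3,)),
    base,
    layers.Dropout(0.3),
    layers.Dense(3, activation="softmax"),  # Corona / Normal / Pneumonia
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

train_ds = tf.keras.utils.image_dataset_from_directory(
    "xray_cls/train", image_size=IMG_SIZE, label_mode="categorical")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "xray_cls/val", image_size=IMG_SIZE, label_mode="categorical")

model.fit(train_ds, validation_data=val_ds, epochs=10)
model.save_weights("xray_inceptionv3.h5")  # HDF5 weights for reuse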

Accuracy plot on the classification task.

Accuracy plot

Model weights are saved for future use as an HDF5 file.

I have trained this model both with and without ImageNet weights; the ImageNet-initialized version performed better in this classification.

With the trained weights, I will use the model as below for feature extraction on our image data.

Loading weights
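A sketch of rebuilding the same backbone and loading the saved weights, then keeping only the part that produces the pooled 2048-dimensional feature (file name as assumed above):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

IMG_SIZE = (299, 299)

# Rebuild the classifier with the same architecture, then load the weights.
base = InceptionV3(include_top=False, weights=None,
                   input_shape=IMG_SIZE + (3,), pooling="avg")
clf = models.Sequential([
    layers.Rescaling(1.0 / 127.5, offset=-1, input_shape=IMG_SIZE + (3,)),
    base,
    layers.Dropout(0.3),
    layers.Dense(3, activation="softmax"),
])
clf.load_weights("xray_inceptionv3.h5")

# Keep only rescaling + backbone: output shape is (batch_size, 2048).
feature_extractor = models.Sequential(clf.layers[:2])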

I will create an image tensor for all available images using the Inception feature vectorization as below.

Creating image tensor

These image tensors are used in the TensorFlow dataset preparation; basically I am caching them here for future use.
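A sketch of turning every PNG into a cached (1, 2048) feature tensor, assuming the images live in an images/ folder and that the file name is the image id plus a .png extension (both assumptions):

import numpy as np
import tensorflow as tf

def load_image(image_id):
    """Read one PNG and resize it to the InceptionV3 input size."""
    img = tf.io.read_file("images/" + image_id + ".png")
    img = tf.image.decode_png(img, channels=3)
    return tf.image.resize(img, IMG_SIZE)

image_features = {}
all_ids = set(pairs_df["image_1"]) | set(pairs_df["image_2"])
for image_id in all_ids:
    img = tf.expand_dims(load_image(image_id), 0)   # (1, 299, 299, 3)
    feat = feature_extractor(img, training=False)   # (1, 2048)
    image_features[image_id] = feat.numpy()

np.save("image_features.npy", image_features)       # simple on-disk cache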

Create a TensorFlow dataset using tf.data

Refer the link for further reading on tf.data: https://www.tensorflow.org/guide/data

Tutorials to read: https://adventuresinmachinelearning.com/tensorflow-dataset-tutorial/

Now that we have our image tensors and text vectors, we can build the tf.data dataset.

tf.data creation

The multi_image() function converts the two input tensors of shape (1, 2048) and (1, 2048) into a (2, 1, 2048) tensor. Batch size, embedding dimension and unit size are hyperparameters that we can tune for our model.
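A sketch of the pipeline, assuming the cached features, the split dataframes and the padded sequences from the earlier steps; here multi_image simply stacks the two cached (1, 2048) features into a (2, 1, 2048) tensor:

import numpy as np
import tensorflow as tf

BATCH_SIZE = 64
BUFFER_SIZE = 1000

def multi_image(img1_id, img2_id, caption):
    """Look up the cached features and stack them to shape (2, 1, 2048)."""
    feats = np.stack([image_features[img1_id.numpy().decode()],
                      image_features[img2_id.numpy().decode()]])
    return feats.astype(np.float32), caption

def tf_multi_image(img1_id, img2_id, caption):
    feats, cap = tf.py_function(multi_image,
                                [img1_id, img2_id, caption],
                                [tf.float32, tf.int32])
    feats.set_shape((2, 1, 2048))
    cap.set_shape((MAX_LEN,))
    return feats, cap

dataset = tf.data.Dataset.from_tensor_slices(
    (train_df["image_1"].values, train_df["image_2"].values, train_seqs))
dataset = (dataset
           .map(tf_multi_image, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(BUFFER_SIZE)
           .batch(BATCH_SIZE)
           .prefetch(tf.data.AUTOTUNE))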

So we have done the feature extraction and tokenization needed for our model, and we have the tf.data dataset. Now let's build the required model.

12. Basic Model

12.1. Model Architecture

As I have already explained the subclass model, I will jump directly into the model architecture.

I have built a Functional API version of the model for checking the model architecture.

Basic model architecture

12.1.1 Encoder architecture:

The encoder has a single fully connected layer with a linear output. Before passing to the FC layer, we add the two image tensors. This layer outputs a shape of (batch_size, 1, embedding_dimension).

Encoder
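A minimal subclassed encoder sketch matching the description above: add the two image feature tensors, then project them with one dense (linear) layer:

import tensorflow as tf

class Encoder(tf.keras.Model):
    """Adds the two image feature vectors and projects them."""

    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)  # linear activation

    def call(self, images):
        # images: (batch_size, 2, 1, 2048) -> sum the two views
        x = images[:, 0] + images[:, 1]   # (batch_size, 1, 2048)
        return self.fc(x)                 # (batch_size, 1, embedding_dim)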

12.1.2 Decoder Architecture:

In this part we have an embedding layer, an LSTM layer and a dense layer which outputs a shape of (batch_size, vocab_size).

Long Short-Term Memory networks, usually just called LSTMs, are a special kind of RNN capable of learning long-term dependencies.

To know more about LSTM refer this link: Understanding LSTM Networks

Decoder
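A minimal subclassed decoder sketch for the basic model; the exact way the image vector is fused with the word embedding is my own simplification:

class Decoder(tf.keras.Model):
    """Embedding -> LSTM -> Dense over the vocabulary."""

    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(units,
                                         return_sequences=True,
                                         return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word, enc_output, state=None):
        # word: (batch_size, 1) token ids; enc_output: (batch_size, 1, embedding_dim)
        x = self.embedding(word)                 # (batch_size, 1, embedding_dim)
        x = tf.concat([enc_output, x], axis=-1)  # fuse image and word features
        output, h, c = self.lstm(x, initial_state=state)
        logits = self.fc(output[:, -1])          # (batch_size, vocab_size)
        return logits, [h, c]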

12.2 Model metric and optimizer Initialization
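A typical initialization for this setup, assuming sparse categorical cross entropy with masking of the padded positions:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

def loss_function(real, pred):
    """Cross entropy that ignores the padded (id 0) positions."""
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_mean(loss_)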

12.3 Model Training

For the training phase we use teacher forcing. Teacher forcing is a strategy for training recurrent neural networks that uses the ground truth from a prior time step as input, instead of the model's own output.

When generating a sequence, a "start-of-sequence" token is used to start the process, and the word for the current position is used as input on the subsequent time step, perhaps along with other input like the image feature vector. With teacher forcing, during training that input is the ground-truth word rather than the model's prediction.

This same recursive output-as-input process is repeated until the model converges to a better result. The source is mentioned below.

Further readings about teacher forcing: link to Teacher forcing

Training step
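A sketch of one teacher-forced training step, assuming the encoder, decoder, loss_function, optimizer and train_accuracy objects from the sketches above:

@tf.function
def train_step(img_tensor, target):
    """One teacher-forced step; target has shape (batch_size, max_len)."""
    loss = 0.0
    dec_input = tf.expand_dims(target[:, 0], 1)  # the <start> token
    state = None

    with tf.GradientTape() as tape:
        enc_output = encoder(img_tensor)
        for t in range(1, target.shape[1]):
            logits, state = decoder(dec_input, enc_output, state)
            loss += loss_function(target[:, t], logits)
            train_accuracy.update_state(target[:, t], logits)
            # Teacher forcing: feed the ground-truth word, not the prediction.
            dec_input = tf.expand_dims(target[:, t], 1)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / int(target.shape[1])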

12.4. Model Performance visualized in TensorBoard

We have logged the loss and accuracy using tf.summary

Accuracy
Loss

12.5. Model Evaluation

In the evaluation (testing) stage I have used argmax (greedy) search to find the output sentence. At time step t we generate a word starting from the <start> token, and the predicted word is fed back so that it becomes the decoder input at time step t+1. Code for the argmax search is sketched below.

Evaluation stage
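A sketch of the argmax (greedy) search, assuming the basic encoder/decoder and the tokenizer from the sketches above:

def greedy_decode(img_tensor, max_len=60):
    """Generate an impression one word at a time using argmax search."""
    enc_output = encoder(tf.expand_dims(img_tensor, 0))
    dec_input = tf.expand_dims([tokenizer.word_index["<start>"]], 0)
    state, result = None, []

    for _ in range(max_len):
        logits, state = decoder(dec_input, enc_output, state)
        predicted_id = int(tf.argmax(logits[0]).numpy())
        word = tokenizer.index_word.get(predicted_id, "<unk>")
        if word == "<end>":
            break
        result.append(word)
        # Feed the prediction back as the next decoder input.
        dec_input = tf.expand_dims([predicted_id], 0)

    return " ".join(result)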

Sample outputs are shown below

Let's try a longer sentence first.

The prediction is not perfect on the longer sentence; let's look at a shorter one.

Even on the short sentence the model is not performing well.

12.6. Basic Model Conclusion

  • This model is built as a simple encoder-decoder with an LSTM.
  • The predictions are neither perfect nor the worst.
  • Validation accuracy is not improving much, but the loss is converging.
  • We could fine-tune this model further to make it perform better.

Next we will see a better performing, modified architecture with a bidirectional LSTM layer and an additive attention mechanism.

13. Main Model

13.1 Model Architecture

The model architecture is reimplemented based on a research paper I came across: Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification.

Please refer to the paper before going through the model architecture.

In this paper the authors propose the Attention-BLSTM model in detail.

As shown in the figure below, the model proposed in this paper contains five components:

  • Input layer: the input sentence to this model, summed with the image feature vectors
  • Embedding layer: maps each word into a low-dimensional vector
  • LSTM layer: a BLSTM to get high-level features from step (2); this BLSTM layer is repeated twice for deeper feature understanding
  • Attention layer: produces a weight vector and merges the word-level features from each time step into a sentence-level feature vector by multiplying with the weight vector
  • Output layer: the sentence-level feature vector is finally used for relation classification. These components are presented in a detailed functional view in the coming sections.
source

Let's see the model's functional layers using the Functional API. In this model the hyperparameters are the same as in the basic model; the only change is that the maximum sentence length is taken as 80.

The same functional model below is implemented using model subclassing; as I have already mentioned, the subclassed model is easier to debug and gives us control over each layer.

Now we can see how the above architecture is implemented using model subclassing, with separate encoder and decoder parts.

13.1.1 Encoder Architecture:

The encoder part is the same as in the basic model architecture: the summed image vector passed through a single fully connected layer.

Encoder Final

13.1.2 Decoder Architecture:

The architecture is similar to the paper mentioned above, and I have modified it with one additional BiLSTM layer for better feature representation. An attention mechanism is used; take a look at the quick overview of the attention mechanism in this link. Further reading on the attention mechanism is mentioned in the reference section (Attention Is All You Need).

In our model I have used TensorFlow's AdditiveAttention layer, which is Bahdanau-style attention. Please refer to the implementation details in the reference section.

Decoder final model
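A sketch of the main decoder, with two stacked bidirectional LSTM layers and Keras AdditiveAttention (Bahdanau-style); the exact wiring of attention over the sequence is my own simplification of the paper's architecture:

import tensorflow as tf

class AttentionDecoder(tf.keras.Model):
    """Embedding -> 2x BiLSTM -> additive attention -> Dense(vocab)."""

    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.bilstm1 = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True))
        self.bilstm2 = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(units, return_sequences=True))
        self.attention = tf.keras.layers.AdditiveAttention()  # Bahdanau-style
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, words, enc_output):
        # words: (batch, T) token ids; enc_output: (batch, 1, embedding_dim)
        x = self.embedding(words)                 # (batch, T, embedding_dim)
        x = tf.concat([enc_output, x], axis=1)    # prepend the image context
        x = self.bilstm1(x)
        x = self.bilstm2(x)                       # (batch, T + 1, 2 * units)
        context = self.attention([x, x])          # attention over the time steps
        context = tf.reduce_sum(context, axis=1)  # sentence-level vector
        return self.fc(context)                   # (batch, vocab_size)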

A brief explanation and implementation of the metric and optimizer initialization and of the model training are given in the basic model section; the same is used here in the main model.

13.2 Model Performance visualized in TensorBoard

We have logged the loss and accuracy using tf.summary

Accuracy
Loss

13.3 Model Evaluation

In the evaluation (testing) stage I have used beam search to find the output sentence. As we have already seen teacher forcing in brief, let's move to the implementation part.

Bleu score metric:

I have used the BLEU (bilingual evaluation understudy) score as the metric to measure the quality of the machine-generated sentence against the actual sentence. Take a quick look at the BLEU wiki here.
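A quick sketch of computing BLEU with NLTK's sentence_bleu, showing both a cumulative score and an individual 1-gram score (the example sentences are made up):

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = ["no acute cardiopulmonary abnormality".split()]
candidate = "no acute cardiopulmonary disease".split()

smooth = SmoothingFunction().method1
print("cumulative 4-gram:",
      sentence_bleu(reference, candidate, smoothing_function=smooth))
print("individual 1-gram:",
      sentence_bleu(reference, candidate, weights=(1, 0, 0, 0),
                    smoothing_function=smooth))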

Beam Search:

Instead of greedily choosing the single most likely next word as the sequence is constructed, beam search expands all possible next steps and keeps the k most likely, where k (the beam index) is a user-specified parameter that controls the number of beams, or parallel searches, through the sequence of probabilities. Take a quick look at this source link.

Beam search
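A compact beam search sketch, assuming the main model's encoder, an AttentionDecoder instance named decoder (from the sketch above) and the tokenizer; it keeps the k highest-scoring partial sentences at every step:

import numpy as np
import tensorflow as tf

def beam_search_decode(img_tensor, k=3, max_len=80):
    """Keep the k most likely partial sentences at every time step."""
    enc_output = encoder(tf.expand_dims(img_tensor, 0))
    start_id = tokenizer.word_index["<start>"]
    end_id = tokenizer.word_index["<end>"]
    beams = [([start_id], 0.0)]  # (token id list, cumulative log probability)

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:
                candidates.append((seq, score))
                continue
            logits = decoder(tf.constant([seq]), enc_output)
            log_probs = tf.nn.log_softmax(logits[0]).numpy()
            for token_id in np.argsort(log_probs)[-k:]:
                candidates.append((seq + [int(token_id)],
                                   score + log_probs[token_id]))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]

    best = beams[0][0]
    words = [tokenizer.index_word.get(i, "") for i in best[1:] if i != end_id]
    return " ".join(words)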

Sample outputs are shown below with BLEU scores, both cumulative and n-gram.

Short sentences

Short sentences

As we can see, the predicted output is good. We can also see this from the BLEU score.

Longer sentence

Not a good prediction according to the BLEU score, but we can see there is some similarity in the meaning of the two sentences; both say there is no disease for this record. This is one of the major problems with the BLEU score: it doesn't consider the meaning of the words.

Let's try another long sentence.

This one is like the previous prediction. The first 4 words exactly match the predicted words, but we still don't get a good BLEU score because the word count is high.

Above are a few random predictions.

13.4 Model Conclusion

  • The model built on a bidirectional LSTM with attention seems to perform better than the basic model.
  • It follows the model architecture from the source Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, which worked well in a classification task.
  • The loss converged to 0.3 with 89% train accuracy and 92% validation accuracy; from the results we can see there is similarity between each predicted and actual output.

14. Conclusion

  • Compared with the plain ImageNet-trained InceptionV3 model, we were able to improve the model performance by training the same InceptionV3 on X-ray classification and using those weights to extract the image features.
  • The BiLSTM architecture gives good results; even for very low BLEU score values we can see the meaning of the true sentence in the predicted sentence.
  • We can perform an error analysis on the predictions to identify whether the issue is with the model or with the data points.

15. Error Analysis

In this section we will do the error analysis: an analysis of what is causing the errors, using those findings to improve the model. We will look into the low BLEU score data points (the errors in this case) and how to make sense of them. After identifying the errors we will see how to use them to improve the model.

Error in a model can be reducible or irreducible; we will work on the reducible errors. After training the model we check the validation set to find the errors and do the analysis. Once we find an error, if it is reducible we fix it in future training of the model; in this way the model improves over the previous one.

Let's see how I have worked on this error analysis.

Validation set with BLEU scores:

Dataframe with Bleu score on Dev set

As an initial step we take the data points with a BLEU score lower than 0.08 and check them.

We can also ignore the duplicated data points. We used duplicated data points in our model (the same image in both inputs), which we consider noise; check section 9, Data point construction, for the details.

with duplicate image

From a total of 139 poor-BLEU-score data points, 22 are duplicated data points.

Final data point,

With word count

Let's take each data point and do the analysis:

  • Word length is 26 and there is a word overlap, "activeacute", in the actual value; could not find any image issue.
  • The predicted sentence gives partial meaning of the actual one; not a poor prediction.
  • We get the poor value because the BLEU score does not account for meaning.
  • Word count is 24, no error in the actual sentence; still not able to find any image-wise pattern issue.
  • The prediction is poor and does not give any similar meaning.
  • Word count is 11, no error in the actual sentence; the images are not captured clearly compared with the others.
  • The prediction gives the same meaning; the issue is with the BLEU score and the images.

As we can see, data points with a BLEU score greater than 0 give at least the partial meaning of the actual sentence, which we consider a good prediction. Let's take the data points whose BLEU score is 0.

Another finding is that sentences with more than 20 words give 0 scores, which shows that our model did not perform well on longer sentences. Let's consider sentences with fewer than 20 words.

The final data has 62 data points (a filtering sketch follows).
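A small pandas sketch of this filtering, assuming a dev-set dataframe dev_df with bleu, image_1, image_2 and impression_count columns (hypothetical names):

poor = dev_df[dev_df["bleu"] < 0.08]
duplicated = poor["image_1"] == poor["image_2"]
print(len(poor), "poor data points,", int(duplicated.sum()), "of them duplicated")

# Keep only the genuinely poor points: non-duplicated, BLEU of 0,
# and actual impressions shorter than 20 words.
final = poor[~duplicated]
final = final[(final["bleu"] == 0) & (final["impression_count"] < 20)]
print(len(final), "data points for visual inspection")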

As we have already separated the best- and worst-case data points, let's visualize them and look for patterns.

Below are the best-result data points.

Good Bleu score data random 25

Points to note in the best-result images:

  • Proper alignment of the images
  • Brighter view of the chest bones
  • No additional dark lines
  • Even in the dull images we can clearly see the chest bones

Below are the poor-result data points.

0 Bleu score data points

Points to note in the poor results:

  • Images are shadowed in some cases (row, column): (3,2), (3,4), (4,2), (4,3), (4,4)
  • Images are too bright in some cases: (1,2), (3,4), (5,3)
  • Let's look at both images in a data point to check whether at least one image has the above issues.
  • In this data point we see the second image is not properly taken; there are fingerprints visible at the bottom of the picture, a major error.
  • A clear case of poor X-ray capturing, which also covers the hands.
  • The right-side image has additional dark stripes in the lower left edge.
  • A clear case of poor image quality in both images.
  • A clear case of poor image quality: X-rays with jewelry in both images; this is not found in any other X-rays, even in the X-ray classification task dataset.

15.1 Conclusion

  • From the above analysis we have found that the quality of the images plays a major role. Most of the error data points have poor image quality and a poor chest bone view; this is the primary takeaway.
  • We have also seen fingerprints and patients' jewelry clearly visible in some images.
  • The model works well on images with clearly visible chest bones; we have already seen and compared this in the best- and worst-case images.
  • The model fails on images that are too bright (brighter means a higher proportion of white pixels); we did not see such bright images among the best results.
  • Another finding is that our model did not perform well on sentences with more than 20 words. We could improve this by changing the architecture. Still, the errors are 62 out of 399, which is almost 15% of the data, so overall the model is not a poor predictor.
  • In some cases there are incorrect words in the true sentence.
  • We can address these errors in our future work to get better performance; these are the reducible errors in the error analysis.

Source Code for this blog GitHub

16. Future work

  • We can also replace the whole architecture with a state-of-the-art BERT/Transformer instead of the att-BiLSTM. This can be achieved by sending the image features and text input as a single vector at each time step to predict the next sentence. This is one method using transformers; a few other references are below.

1. Unified Vision-Language Pre-Training for Image Captioning and VQA

2. https://papers.nips.cc/paper/9293-image-captioning-transforming-objects-into-words.pdf

3. https://arxiv.org/pdf/2004.08070v2.pdf — Entity-Aware News Image Captioning using Transformer

4. http://papers.nips.cc/paper/8297-vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks.pdf — Pretraining Task-Agnostic Visio linguistic Representations for Vision-and-Language Tasks

  • We can further deepen the encoder CNN layers for improvements.
  • Images can be trained on different ImageNet-based Keras models, as I have done by training InceptionV3 to classify X-rays and using those weights to extract the features.
  • From the error analysis we have found that some data points have poor quality and badly captured images, which led to poor performance. We can eliminate these issues in our future work.

17. References

1. https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/ — How to Prepare Text Data for Deep Learning

2. https://towardsdatascience.com/what-is-teacher-forcing-3da6217fed1c — Teacher forcing further readings

3. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification — BiLSTM Architecture

4. Attention is all you need

5. CNN+CNN: Convolutional Decoders for Image Captioning

6. Review Networks for Caption Generation

7. AdditiveAttention — Bahdanau style attention

8. https://yashk2810.github.io/Image-Captioning-using-InceptionV3-and-Beam-Search/ — Beam search tutorial

9. Evaluating text output in NLP using Bleu — Bleu Tutorial

10. Applied AI Course

Thank you for reading!

If you have any comments please let me know !!!

You can find me on LinkedIn and GitHub.
