Impression Generation From Medical Imaging Report

Abhishek Malik
Published in TheCyPhy
24 min read · Jan 23, 2021

Author: Abhishek Malik [LinkedIn]

A well-defined process by which we generate the textual impression from the medical imaging reports.


Table of contents

  1. Business problem
  2. Dataset introduction
  3. Prerequisite
  4. Literature survey
  5. Personal approach: my solution
  6. Parsing of XML files to create data points
  7. Data preprocessing
  8. Exploratory data analysis
  9. Conflicts in data point construction
  10. Splitting of train, test, and validation data
  11. Tokenization and dataset preparation
  12. Basic model [Encoder-Decoder]
  13. Main model [Attention mechanism]
  14. Conclusion
  15. Error analysis
  16. Future work
  17. References

1. Business Problem

The problem statement is about generating a textual impression from chest X-ray images. The chest X-rays are of two types: frontal and lateral. Using these two views, we predict the impression with deep learning techniques.

To solve this problem we have to develop a deep learning model that predicts the textual impression by combining image and text processing. Automatic generation of medical reports is a challenging task that can be addressed with artificial intelligence: computer vision for the image data and natural language processing for the text data. We use both techniques together to solve this automatic report generation problem.

2. Dataset Introduction

Chest X-ray collection from Indiana University

The dataset contains 7,470 chest X-ray images and 3,955 radiology reports from the Indiana University hospital network.

The images are available and downloaded in PNG format, whereas the reports are available in XML format.

Description of the XML files: each XML file is a report containing the impression, findings, indication, and comparison, along with references to the associated images, all tied to a parent id. To identify the images associated with a report, we explore the XML tag that carries the image id, such as <parentImage id="image_id">; the id attribute holds the image name corresponding to the PNG file. One important point to note is that more than one image can be associated with a single report (XML file).

Original data source: the Indiana University chest X-ray dataset.

Resource for getting frontal and lateral images: https://www.kaggle.com/raddar/chest-xrays-indiana-university

Source for getting the dataset: the Indiana University site.

3. Prerequisite

There are various deep learning and Python tools that we use in this work.

Deep learning tools: convolutional neural networks, recurrent neural networks, LSTM, transfer learning, activation functions, optimization techniques such as SGD and Adam, the Keras Tokenizer, TensorFlow, and Keras.

Loss functions: categorical cross-entropy and sparse categorical cross-entropy. Finally, TensorBoard is used to visualize accuracy, loss, perplexity, and the gradients of the encoder and decoder.

Python libraries: Pandas, NumPy, Matplotlib, Pickle.

It also helps to understand the Sequential API, the Functional API, and subclassed Keras models. The main reason for choosing the subclassing approach is that it lets us customize the model however we want and easily implement our own custom forward pass. Most importantly, we get control over each unit of the network and over the training process.
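As an illustration, here is a minimal sketch of a subclassed Keras model with a custom forward pass (the layer size and names are placeholders, not the ones used later in this project):

```python
import tensorflow as tf

class SimpleEncoder(tf.keras.Model):
    """A minimal subclassed model: one dense layer over an image feature vector."""
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim, activation='relu')

    def call(self, features):
        # The custom forward pass: project the raw feature vector.
        return self.fc(features)

encoder = SimpleEncoder(embedding_dim=256)
print(encoder(tf.random.normal((4, 2048))).shape)  # (4, 256)
```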

4. Literature survey

If you really want to learn something new, you should definitely read the literature related to your problem. These are some research papers and blogs that I referred to while understanding the problem.

First Source: https://link.springer.com/chapter/10.1007/978-981-15-4015-8_15

In this paper, the proposed architecture is called the Multi-Level Multi-Attention based encoder-decoder (MLMA). The MLMA method introduces a learning strategy that maps a report to all views of a chest X-ray image. The CNN-based image features and the word-embedded textual features are fused and learned sequentially through a long-range LSTM. Attention over the LSTM is used to focus on relevant local image regions with semantic knowledge. A bidirectional LSTM (BD-LSTM) with attention is separately applied to the embedded semantic features to learn syntactic knowledge for inter-sentence dependency. The combined attentive output is responsible for report generation by predicting the sequence of words. The paper first describes the encoder-decoder framework and then illustrates the MLMA approach, which consists of context-level visual attention (CLVA) and textual attention (TA) for medical report generation. The final performance of the proposed model is reported using the COCO-Caption evaluation API, and it shows a significant improvement on the medical report generation task compared to state-of-the-art methods.

Second source: https://arxiv.org/abs/1904.02633

In this paper, the authors try to account for the particular nuances of the radiology domain, in particular the critical importance of clinical accuracy in the generated reports. They present a domain-aware automatic chest X-ray radiology report generation system that first predicts which topics will be discussed in the report and then conditionally generates the sentences corresponding to these topics. An important contribution is that the authors fine-tune the resulting system with reinforcement learning, considering both readability and clinical accuracy as assessed by their proposed clinically coherent reward.

Proposed method: images are first encoded into image embedding maps, and a sentence decoder takes the pooled embedding to recurrently generate topics for sentences. The word decoder then generates the word sequence from each topic with attention over the original images. An NLG reward, the clinically coherent reward, or a combination of both can then be applied as the reward for reinforcement policy learning.

Third Source: https://sezazqureshi.medium.com/chest-x-ray-medical-report-generation-using-deep-learning-bf39cc487b88

Summary: this blog explains how to approach the same problem statement. The data contains chest X-ray images of different patients, and the information about each X-ray is found in the XML files, divided into four parts: comparison, indication, findings, and impression. The author extracted all four statements for each image, examined their meaning, and concluded that the impression is the most useful summary of a chest X-ray. Many chest X-ray image paths in the XML files share the same findings: in total there are 7,470 chest X-ray images corresponding to 3,955 XML files. The problem is essentially text generation conditioned on an input image, i.e. generating a medical report from a chest X-ray. The author uses the BLEU score to measure the performance of the deep learning models.

After gathering this information about the data and the performance metric, let's see what the exploratory data analysis in that blog says. The author uses xml.etree.ElementTree to access the four statements as well as the image paths found in each XML file, and then examines the columns. The mean and median number of images per report are computed, and the maximum and minimum are 5 and 0 respectively, so at most five images are associated with a single report and some reports have no images at all. Garbage values are found in the impression column, so it is better to drop them; the findings column has 514 NaN values, which are dropped as well. The author does something sensible with the impression column: instead of dropping NaN values, he replaces them with a "no impression" placeholder. The impression is chosen as the only target feature because it has fewer NaN values than the other statements. Most of the reported impressions are normal and only some are abnormal, which creates a class imbalance. Most reports have two images, so it is convenient to keep two images per person id; if only one image is available, the same image is duplicated. Preprocessing is also performed on the impression text, because it contains XXXX placeholders that carry no information. The strings are then tokenized into numbers so they can be fed into the models, and padding is applied to make all input sequences the same length. The data is split into train, test, and cross-validation sets: the author takes 90% as training data, 9% for validation, and 1% for testing. He then applies augmentation to generate different copies of the images; because the model sees a slightly different image every time, this reduces overfitting and improves generalization. Finally, he uses an encoder-decoder model with attention, a CNN, and the pre-trained CheXNet model, which is trained specifically on chest X-ray images.

Fourth source: https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/

Summary: this resource helped me understand how to develop a deep learning caption generation model; it is essentially a tutorial on handling image data and designing deep learning models around it. It uses the Flickr8k dataset as a sample to walk through the procedure: the data contains one set of images and one file of text, which is quite similar to our problem, where we have image files and corresponding XML files. The tutorial uses pre-trained models to interpret the content of the photos; Keras provides such models directly and can load their weights (about 500 megabytes) out of the box. The VGG model is used for feature extraction, and Keras provides tools to reshape a loaded photo into the size the model expects. The tutorial also explains the preprocessing of the textual data: lowercase all text, remove punctuation, remove words that are one character or shorter, and remove the integer values present in the text.

It then moves on to developing the deep learning model in three parts: loading the data, defining the model, and fitting the model. To begin, the prepared photo and text data are loaded so the model can be fit on all the photos and captions in the training dataset; during training, performance on the development dataset is monitored and used to decide when to save model checkpoints. The train and development sets are predefined in the Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files, which contain lists of photo file names; from these file names the photo identifiers are extracted and used to filter the photos and descriptions for each set. After the data is preprocessed, the tutorial adds the token 'startseq' at the beginning of every sentence and 'endseq' at the end. It is good to know where the text begins and where it ends; these tokens give the sequences a uniform structure and ensure the text is encoded correctly. The tutorial then defines a model composed of three parts.

Photo Feature Extractor: This is a 16-layer VGG model pre-trained on the ImageNet dataset. The photos are pre-processed with the VGG model (without the output layer), and the extracted features predicted by this model are used as input.

Sequence Processor: This is a word embedding layer for handling the text input, followed by a long short-term memory (LSTM) recurrent neural network layer.

Decoder: Both the feature extractor and sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer to make a final prediction.

5. Personal approach: my solution

To understand the data, we first have to explore it with exploratory data analysis. Since this data contains both images and text, we have to do the analysis on both, because images have very different characteristics from text data. I found that the data is imbalanced, and I examined the number and type of images available per patient. After the EDA I implement a deep learning model with two different approaches, to measure the improvement of one over the other.

First Approach: Basic model

I used a basic encoder-decoder architecture. The encoder is a single fully connected layer that produces a feature vector from the image features extracted by the pre-trained Xception model. The decoder has an LSTM layer that takes two inputs: the image feature vector and the sequence of text, one word at each time step.

Second Approach: Main model

Here I use an encoder-decoder architecture to generate the impression from the chest X-rays. The encoder takes the image features extracted by the pre-trained Xception model and, after adding a few layers on top of those features, produces the encoder output as our image feature vectors.

The image features we get from the encoder are then passed to the decoder with an attention mechanism. The attention mechanism generates the next word based on the content of the image.

Comparing against my basic model, I came to appreciate the power of the attention mechanism: the quality of the next-word generation is noticeably better.

Let's discuss this process step by step:

First step: I take the images and extract their features with the help of the CheXNet pre-trained model.

After this, I take my preprocessed text data, i.e. the impressions extracted from the XML files for all data points, and build the word embedding matrix using fastText. Padding is then applied to the text data to make it uniform.

After this, the multiple images are converted to tensors corresponding to their impression, and a mapping is made from the frontal image tensor and lateral image tensor to the impression.

Once we have the corresponding train, validation, and test data, we have tensors for all data points. Let's move forward with the encoder-decoder architecture, in which we use attention to improve our model.

Encoder

The encoder is a single fully connected model. The input images are given to the CheXNet pre-trained model to extract features. The extracted features of the two images are added, fed to a flatten layer, and then passed through a dense layer. This encoder output is given to the decoder.

Decoder

In the decoder, we first give the encoder output and the current hidden state to the attention layer. The decoder input (the word) is given to the embedding layer. We then concatenate the embedding output with the attention output, pass the concatenated output through the LSTM layer, and get the output along with prev_state_vector, which is our updated hidden state.

Architecture for the impression prediction

6. Parsing of XML files to create data points

Now it's time to understand how we can create data points from the raw XML files: how to extract or parse the data and get it into a structured format.

View of the XML file:

From this XML file, we will extract the Abstract and parentImage nodes. In these, we have the impression and the image file names, as shown below.

Impression level:

Then we get the Abstract text values.

XML parser code to retrieve the details mentioned above:
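Here is a small sketch of such a parser using xml.etree.ElementTree (the folder name is hypothetical, and the tag and attribute names follow the description above; this is not the exact original code):

```python
import glob
import os
import xml.etree.ElementTree as ET
import pandas as pd

rows = []
for xml_path in glob.glob(os.path.join('reports', '*.xml')):  # hypothetical folder of report XMLs
    root = ET.parse(xml_path).getroot()
    # The impression lives in an AbstractText node whose Label attribute is "IMPRESSION".
    impression = None
    for node in root.iter('AbstractText'):
        if node.get('Label') == 'IMPRESSION':
            impression = node.text
    # A report can reference more than one image through parentImage id attributes
    # (check whether your export uses parentImage or parentImages and adjust).
    image_ids = [img.get('id') for img in root.iter('parentImage')]
    rows.append({'xml_file': os.path.basename(xml_path),
                 'image_ids': image_ids,
                 'impression': impression})

df = pd.DataFrame(rows)
print(df.shape)
```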

After extraction, we have 3955 rows of data in the data frame view.

Data points after parsing the data.

7. Data Preprocessing

In this phase, the text data is preprocessed to remove unwanted tags, text, punctuation, and numbers. We also check for empty cells and NaN values.

  1. If there are any empty cells in the image name column, we drop those rows.

The total number of unique images is 3851.
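A minimal cleaning sketch along these lines (the regular expressions are illustrative rules, not the exact ones used in the project; df is assumed to be the parsed data frame from the previous step):

```python
import re

def clean_impression(text):
    """Lowercase, drop de-identification placeholders, punctuation, and numbers."""
    text = str(text).lower()
    text = re.sub(r'\bx{2,}\b', ' ', text)   # XXXX anonymisation tokens
    text = re.sub(r'[^a-z\s]', ' ', text)    # punctuation and digits
    return re.sub(r'\s+', ' ', text).strip()

df = df.dropna(subset=['impression'])                      # drop rows with empty/NaN impressions
df['impression'] = df['impression'].apply(clean_impression)
```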

8. Exploratory data analysis

In this section, we will look at different ways of analyzing the dataset by summarizing and visualizing its main characteristics.

8.1 Exploratory data analysis on text data.

For the analysis of the text data, we take the impression column as our target variable. Let's look at the 20 most frequently occurring sentences.

  • Among the top 20 impressions, "No acute cardiopulmonary abnormality" occurs more than 400 times.
  • Longer impression sentences occur much less often, mostly fewer than 10 times.

Occurrences of the word for impression

Let's see the word-wise occurrence using a word cloud of the impressions.

  • As we observe, acute, abnormality, cardiopulmonary, and pulmonary are the highlighted words in the visualization above.

8.2 EDA on Image data

Let's look at the total number of images present per data point (report).

  • From the plot above, we can observe that only a few IDs have four images associated with them.
  • Most IDs have two chest X-ray images per person ID.
  • The counts are: one image for 457 reports, two images for 3,227, three images for 173, and four images for 10.

9. Conflicts in data point construction.

As we know, there are two types of images, frontal and lateral, but a data point can also have 1, 3, or 4 images associated with it. If a data point has no images, we drop it.

Let's try to solve this problem by limiting each data point to 2 images. If a report has 5 images, we treat it as 4 + 1 (all the images plus the last image) and turn it into 4 data points, as below.

Let's understand it by the code.

From this code, we are able to separate the frontal and lateral images. But there are some XML files in which only one image is present; in that case we have to copy the same image, so these become duplicate images. We keep them separate from the true frontal and lateral pairs in order to maintain randomness in the data.

To make this possible I used the Kaggle data at https://www.kaggle.com/raddar/chest-xrays-indiana-university, which includes a CSV file indicating whether each image is a frontal or lateral view; I compared it with my data points to build the structured data.

In the end, I get two CSV files: one with all the data points that have both frontal and lateral images, and a second one with the duplicate images.

After constructing the data points, we add the <start> and <end> tokens to the text data.

Final datapoints

10. Splitting of the train, test, and validation data

Now we have two types of data: data points with a true frontal-lateral pair, and data points with duplicate images. We need to split the data so that the duplicate data points appear in equal proportion in all three splits.

First, we split the frontal and lateral images.

Next, we split the duplicate images.

Finally, we combine both data frames to obtain the train, validation, and test data.

To maintain randomness in the data, we use this code.
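One way to realize this split is sketched below (the data frame names and the 80/10/10 ratio are assumptions, not the exact values used in the project):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(frame, seed=42):
    """Split one data frame into train / validation / test pieces (80/10/10 here)."""
    train, rest = train_test_split(frame, test_size=0.2, random_state=seed)
    val, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, val, test

# Split the frontal-lateral pairs and the duplicate-image points separately,
# then merge the corresponding pieces so duplicates appear in every split.
fl_train, fl_val, fl_test = three_way_split(frontal_lateral_df)
dup_train, dup_val, dup_test = three_way_split(duplicate_df)

train_df = pd.concat([fl_train, dup_train]).sample(frac=1, random_state=42)  # shuffle for randomness
val_df = pd.concat([fl_val, dup_val]).sample(frac=1, random_state=42)
test_df = pd.concat([fl_test, dup_test]).sample(frac=1, random_state=42)
```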

11. Tokenization and dataset preparation

11.1 Tokenization

One very important point: we cannot feed raw text to our deep learning model. The text data must be encoded as numbers before it can be used. Fortunately, the Keras library provides convenient utilities to perform these operations easily. Let's understand this through code.
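A sketch of the tokenization with the Keras Tokenizer (train_df is the training split assumed above; filters='' keeps the <start> and <end> tokens intact):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer on the training impressions only.
tokenizer = Tokenizer(filters='', oov_token='<unk>')
tokenizer.fit_on_texts(train_df['impression'])

vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(train_df['impression'])
max_len = max(len(s) for s in sequences)

# Pad every sequence to the same length so they can be batched.
captions = pad_sequences(sequences, maxlen=max_len, padding='post')
print(vocab_size, max_len)
```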

The vocabulary size is 1,272 and the maximum length of the output sentence is 60.

11.2 Dataset Preparation

To solve this problem we use a very effective technique, transfer learning, both for converting the images to feature vectors and for embedding the text data.

For transfer learning, we use a pre-trained model for feature extraction, CheXNet, and for the word embeddings I use pre-trained fastText models. We then create the tensors for all available images using the CheXNet feature vectorization below.
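The feature extraction can be sketched as below. CheXNet is a DenseNet121 trained on chest X-rays and its weights are not bundled with Keras, so this runnable stand-in uses an ImageNet Xception backbone, which also produces the 2048-dimensional vectors mentioned below; the file names are hypothetical:

```python
import tensorflow as tf

# Stand-in backbone: Xception with global average pooling gives a (1, 2048) vector per image.
backbone = tf.keras.applications.Xception(include_top=False, weights='imagenet',
                                          input_shape=(299, 299, 3), pooling='avg')

def image_to_feature(path):
    """Read one PNG, resize it, and return a (1, 2048) feature vector."""
    img = tf.io.decode_png(tf.io.read_file(path), channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.xception.preprocess_input(tf.cast(img, tf.float32))
    return backbone(tf.expand_dims(img, 0))

frontal_feature = image_to_feature('frontal.png')   # hypothetical file names
lateral_feature = image_to_feature('lateral.png')
```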

Prepare the data using TensorFlow as tf.data

As we have our image tensors and text vectors, we can build the tf.data pipeline.

Here the multi_image() function converts the two input tensors of shape (1, 2048) and (1, 2048) into a single tensor of shape (2, 1, 2048). The batch size, embedding dimension, and unit size are hyperparameters that we can tune for our model.
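A sketch of that pipeline (frontal_feats and lateral_feats are arrays of (1, 2048) feature tensors, captions are the padded token ids from the tokenization step, and the batch and buffer sizes are arbitrary):

```python
import tensorflow as tf

BATCH_SIZE = 64        # hyperparameters, free to tune
BUFFER_SIZE = 500

def multi_image(frontal, lateral, caption):
    """Stack the two (1, 2048) image tensors into a single (2, 1, 2048) tensor."""
    return tf.stack([frontal, lateral], axis=0), caption

dataset = tf.data.Dataset.from_tensor_slices((frontal_feats, lateral_feats, captions))
dataset = dataset.map(multi_image, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
```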

With the feature extraction and tokenization done, and the tf.data dataset ready, let's build the required model.

12. Basic model [Encoder-Decoder]

In the basic model, we use a simple architecture: an encoder for the image features, and a decoder that handles both the image features and the text features. The LSTM is introduced in this basic model.

12.1 Encoder

The encoder here is a single fully connected linear layer. We add the two image tensors and pass the result through the dense layer. The output shape is (batch_size, 1, embedding_dimension).

code:
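A minimal sketch of such an encoder (following the shapes described above, not the author's exact code):

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Basic-model encoder: add the two image feature tensors and project them."""
    def __init__(self, embedding_dim):
        super().__init__()
        self.dense = tf.keras.layers.Dense(embedding_dim, activation='relu')

    def call(self, frontal, lateral):
        # frontal, lateral: (batch, 1, 2048) feature tensors from the backbone.
        return self.dense(frontal + lateral)      # (batch, 1, embedding_dim)
```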

12.2 Decoder

In the decoder, the text data is first passed to the embedding layer; the image features are then concatenated with the embedding output, and the concatenated output is passed through the LSTM layer. From the LSTM layer we get the output and an updated hidden state, which is recursively passed back to the decoder as the initial state.

code:
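A sketch of the decoder along these lines (again not the exact original; the state is the pair of LSTM hidden and cell states, initialized with zeros for the first step):

```python
import tensorflow as tf

class Decoder(tf.keras.Model):
    """Basic-model decoder: embed the word, concatenate the image features, run an LSTM."""
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, image_features, state):
        # word_ids: (batch, 1) token for the current time step.
        x = self.embedding(word_ids)                    # (batch, 1, embedding_dim)
        x = tf.concat([image_features, x], axis=-1)     # fuse image and word features
        output, h, c = self.lstm(x, initial_state=state)
        return self.fc(output), [h, c]                  # logits (batch, 1, vocab) and new state
```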

12.3 Performance of the model

Loss plot:

Accuracy plot:

12.4 Evaluation of model

To judge the predictions of our model, we have to compare the actual sentences with the predicted ones. In the evaluation stage, we use a greedy, argmax-based search to generate the output sentence: at the first time step we feed the <start> token, and the word predicted by the model is recursively fed back and becomes the decoder input at the next time step.

The code for the argmax (greedy) search is sketched below.
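A sketch of that greedy loop (encoder, decoder, tokenizer, and units are the objects defined above; frontal and lateral are single (1, 1, 2048) feature tensors):

```python
import tensorflow as tf

def greedy_generate(frontal, lateral, encoder, decoder, tokenizer, units, max_len=60):
    """Generate an impression word by word, always taking the argmax prediction."""
    features = encoder(frontal, lateral)
    state = [tf.zeros((1, units)), tf.zeros((1, units))]
    word = tf.expand_dims([tokenizer.word_index['<start>']], 0)   # shape (1, 1)

    result = []
    for _ in range(max_len):
        logits, state = decoder(word, features, state)
        predicted_id = int(tf.argmax(logits[0, 0]).numpy())
        token = tokenizer.index_word.get(predicted_id, '<unk>')
        if token == '<end>':
            break
        result.append(token)
        word = tf.expand_dims([predicted_id], 0)   # feed the prediction back in
    return ' '.join(result)
```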

output:

For shorter sentences

For longer sentences, one thing I noticed is that the model gives repetitive results, which is not good; it means something is lacking in this simple architecture.

12.5 Conclusion

  • The model is based on a simple encoder-decoder architecture.
  • Getting repetitive results is a problem here.
  • The validation accuracy is not that impressive, although the loss is converging.
  • In the future, we can enhance the performance of the model by fine-tuning its parameters.

Let's try adding an attention mechanism to this basic model, so that the model has a better idea of which word to predict next and can generate better predictions.

13. Main model [Attention mechanism]

13.1 Encoder architecture

Let's first understand the encoder block of our model. Inside this layer we call our image feature extraction model (CheXNet) and reshape its output so that it looks like a sequence output. The encoder itself is a single fully connected layer with a linear output: we pass the concatenated features through a flatten layer, and that output is then passed through a dense layer to obtain the image feature output.
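One plausible reading of this encoder is sketched below (not the exact original): the two projected feature tensors are kept as a short "sequence" of image locations so that the attention layer has something to attend over.

```python
import tensorflow as tf

class AttnEncoder(tf.keras.Model):
    """Main-model encoder: turn the two image feature vectors into a short
    sequence of projected features that the decoder can attend over."""
    def __init__(self, embedding_dim):
        super().__init__()
        self.dense = tf.keras.layers.Dense(embedding_dim, activation='relu')

    def call(self, frontal, lateral):
        # frontal, lateral: (batch, 1, 2048) -> concatenated to (batch, 2, 2048).
        features = tf.concat([frontal, lateral], axis=1)
        return self.dense(features)      # (batch, 2, embedding_dim)
```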

13.2 Decoder architecture

Then we pass the encoder output to the decoder. First, we pass the image features and the initial hidden state to the attention layer, and the output of the attention layer is used for the concatenation. The text is first passed to the embedding layer, and we concatenate the attention output and the embedding output with each other; that output is then passed through the LSTM layer, and the LSTM output is passed through flatten and dense layers to obtain the final concise shape.
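A sketch of the attention and decoder blocks, using additive (Bahdanau) attention as in reference 5 (the exact layer sizes are assumptions):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    """Additive attention over the encoder output."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, locations, dim); hidden: (batch, units).
        hidden = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)
        context = tf.reduce_sum(weights * features, axis=1)     # (batch, dim)
        return context, weights

class AttnDecoder(tf.keras.Model):
    """Main-model decoder: attention context + word embedding -> LSTM -> vocabulary logits."""
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.units = units
        self.attention = BahdanauAttention(units)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, features, hidden, cell):
        context, weights = self.attention(features, hidden)
        x = self.embedding(word_ids)                                 # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)      # fuse context and word
        output, h, c = self.lstm(x, initial_state=[hidden, cell])
        logits = self.fc(tf.reshape(output, (-1, output.shape[2])))  # (batch, vocab_size)
        return logits, h, c, weights
```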

13.3 Metrics for model architecture and optimizer Initialization

Here we use the loss function to monitor the loss during model training, and an accuracy function calculated at each epoch. Perplexity is used to check the quality of the predicted sentences.

13.4 Training of model

Teacher forcing is a very effective technique for training recurrent neural networks: at each time step, instead of feeding back the model's own prediction from the previous step, we feed in the ground-truth word from the previous step as input.

During training, a "start_of_sequence" token is used to start the process, and at each subsequent time step the actual next word of the target sequence is provided as input, along with the other inputs such as the image features or a source text.

We apply this process recursively, time step by time step, until convergence, to get better results.

Let’s get to the code of how we do the model training.
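A sketch of one training step with teacher forcing (AttnEncoder and AttnDecoder are the sketches above; captions are assumed to be padded to a fixed length and to start with the <start> token):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    # Mask padding tokens so they do not contribute to the loss.
    mask = tf.cast(tf.math.not_equal(real, 0), tf.float32)
    return tf.reduce_mean(loss_object(real, pred) * mask)

def train_step(images, target, encoder, decoder, tokenizer):
    """images: (batch, 2, 1, 2048) stacked frontal/lateral features; target: (batch, max_len) token ids."""
    batch = target.shape[0]
    hidden = tf.zeros((batch, decoder.units))
    cell = tf.zeros((batch, decoder.units))
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * batch, 1)
    loss = 0.0
    with tf.GradientTape() as tape:
        features = encoder(images[:, 0], images[:, 1])
        for t in range(1, target.shape[1]):
            logits, hidden, cell, _ = decoder(dec_input, features, hidden, cell)
            loss += loss_function(target[:, t], logits)
            # Teacher forcing: the ground-truth word, not the prediction, is the next input.
            dec_input = tf.expand_dims(target[:, t], 1)
    variables = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss / float(target.shape[1])
```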

13.5 Performance of model visualized in tensorboard

Let’s first see the loss plot after training

Loss plot

Accuracy plot:

accuracy plot

Perplexity plot:

perplexity plot

13.6 Evaluation of model

For the evaluation of the model, I used beam search to predict the output sentence.

BLEU score metric:

For comparing the actual and predicted sentences, the BLEU score is an efficient technique. Basically, this metric is used to measure the quality of machine-translated text against the actual reference text.
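For example, with NLTK the sentence-level BLEU score between an actual and a predicted impression can be computed like this (the sentences here are made up for illustration):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = 'no acute cardiopulmonary abnormality'.split()
candidate = 'no acute cardiopulmonary disease'.split()

# Smoothing avoids zero scores when a higher-order n-gram never matches.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 4))
```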

Beam search:

With a greedy method we would simply choose the most likely next step as the sequence is constructed, but beam search expands all possible next steps and keeps the m most likely, where m, the beam width, is a user-specified parameter that controls the number of beams, or parallel searches, through the sequence of probabilities.

Output:

First prediction.

For a longer sentence.

One thing to note here is that some of the words are the same in both sentences, which means the attention mechanism improves the prediction. The BLEU score, however, is still not that good.

13.7 Conclusion of the main model

  • This time the model is based on an attention mechanism, in which a word is predicted on the basis of the image features and the hidden state, and then passed through the LSTM together with the text embedding.
  • In this model the loss converged to 0.28 with an accuracy of 86 percent. From the results, we can get an idea of the similarity in the generated reports.
  • Also, the training perplexity is 19.8131 and the validation perplexity is 19.8330.

14. Conclusion

  • Compared with an Xception model trained on ImageNet, we try to improve the model by using the CheXNet pre-trained model for image feature extraction.
  • Using an attention mechanism in the encoder-decoder architecture seems to be useful when we compare the report generation of the basic model and the main model, especially for longer sentences.
  • Let's do an error analysis on the predictions made by the model to identify whether the issue lies with the model or with the data points.

15. Error analysis

In error analysis, we try to find out which data points cause errors and use that insight to improve the model. We filter out the data points with a poor BLEU score and try to understand what goes wrong on them. Finally, using what we learn from the erroneous data points, we try to enhance the performance of the model.

An error can be reducible or irreducible; we just have to work out how far we can reduce it. Once the model is trained, we check the validation set to find the errors and analyze them. When we have found the cause of an error, we try to fix it, train the model again, and then make new predictions.

Screenshot of validation score:

First, we take the data points whose BLEU score is below 0.08 and examine them.

We also try to ignore the duplicate data points. We used the duplicate data points in our model, but we consider them noise points.

There are 27 duplicate-image data points whose predicted score is poor; as we already know, we considered these data points noise and split them equally among all the data sets.

Let’s ignore those data points in the prediction and perform the analysis.

Let's do the analysis by taking some random points together with their chest X-rays.

Here the impression is 23 words long, and there is an overlap between the words of the actual impression and the predicted impression.

The actual impression is longer than the predicted impression. The BLEU score is not good because it does not take the meaning of the sentences into account.

Here the word count is 45 and there is no error in the actual words. I still cannot find any image-wise pattern issue. The predicted impression is poor and does not convey any meaning related to the actual impression.

As we can see, a BLEU score greater than 0 captures at least part of the meaning of the actual impression, which we consider a reasonable prediction. Let's now take the data points whose BLEU score is 0.

Another finding is that impressions with more than 29 words all get a score of 0, which shows that our model does not perform well on longer sentences. Let's consider impressions with fewer than 20 words.

The final data set has 62 data points.

As we have already separated the data points into best and worst cases, let's visualize them and look for patterns.

Important points in images with the best BLEU scores

  • Here I observe that the image alignments are proper.
  • We get a brighter view of the chest bones in the images.
  • There are no lines or other disturbances present in the images.
  • Even the dull images can be visualized properly.

Let's take a look at the data points with poor results.

Important points in images with bad BLEU scores

  • Some images are very vague and cannot be seen properly (row, column): images (2,3) and (3,2).
  • In many images there is too much brightness, which prevents us from seeing them properly.

Let's analyze the images with a BLEU score of zero.

  • Here we can observe that the second picture is totally vague; we are not able to see the picture properly.

Second picture:

  • The first image is almost completely lacking brightness, giving poor visibility.
  • The second image is too bright, which makes it hard to visualize parts of the chest.

15.1 Conclusion

  • From this analysis we learn that image quality plays an important role: most of the error data points have poor image quality.
  • The model works fine for clear, well-lit images in which every part of the chest is clearly visible.
  • For some overly bright images the model also fails in its decision making, so the score is not good on such images.
  • Most importantly, the model does not perform well when the actual impression has more than 20 words.
  • There are some error points that we can identify and then address for better results.

16. Future work

  • After building the basic and main model architectures, we can improve our model with a state-of-the-art Transformer such as BERT instead of just an attention mechanism. This is possible if we send the image features and text input as a single vector at each time step to predict the next sentence. We can also use the paper I read in the literature survey; it could further enhance the accuracy of the model's report generation.
  • Link for the paper: https://link.springer.com/chapter/10.1007/978-981-15-4015-8_15
  • We can try to make the encoder CNN deeper for further improvements to our model.
  • In the error analysis we found that some data points have poor image quality, and badly captured images lead to poor results. Eliminating these issues is necessary for improvement in future work.

17. References

  1. https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
  2. https://sezazqureshi.medium.com/chest-x-ray-medical-report-generation-using-deep-learning-bf39cc487b88
  3. https://arxiv.org/abs/1904.02633
  4. https://link.springer.com/chapter/10.1007/978-981-15-4015-8_15
  5. https://arxiv.org/pdf/1409.0473.pdf (the attention mechanism method can be found here)
  6. Applied AI Course

I am thankful to everyone who invests their time to read this blog.

For the code and a better understanding, here is my GitHub link.
