Developing a deep learning model to generate clinical reports for X-ray images

Great Learning Snippets
8 min read · Jul 22, 2020


Sindhuja Rajan, Vaishali B, Vigneshwar V, Dr. Narayana D

  1. Introduction and Problem Statement Workflow
  2. Evaluation Metrics
  3. Exploratory Data Analysis
  4. Model Architecture
  5. Predictions of the Model
  6. Evaluating the Model
  7. Future Work
  8. Conclusion

The medical field is a high-priority sector, and people expect the highest level of care and service regardless of cost. Most interpretation of medical data is done by medical experts. Image interpretation by human experts is limited by its subjectivity, the complexity of the images, the extensive variation that exists across interpreters, and fatigue. Following its success in other real-world applications, deep learning is now providing accurate solutions for medical imaging and is seen as a key method for future applications in the health sector. Image captioning is an interesting problem through which one can learn both computer vision and natural language processing techniques.

Clinical imaging captures enormous amounts of information, but most radiologic data are reported in qualitative and subjective terms. In this project, we tackle the image captioning problem for a dataset of chest X-ray images. Using a state-of-the-art deep learning architecture and optimizing its parameters, we make a novel attempt to generate clinical reports for X-ray images.

  1. Introduction and Problem Statement Workflow:

Image captioning is an interesting problem through which one can learn both computer vision and natural language processing techniques. In this project, we tackle the image captioning problem for a dataset of chest X-ray images. With the knowledge gained from this approach, we make a novel attempt to generate clinical reports for X-ray images.

We have come up with the following workflow to tackle this problem statement.

2. Evaluation Metrics:

To evaluate the image captioning model's performance, we use the bilingual evaluation understudy (BLEU) score.

BLEU is a well-established metric for measuring the similarity of a hypothesis sentence to one or more reference sentences. Given a single hypothesis sentence and multiple reference sentences, it returns a value between 0 and 1; a score close to 1 means the hypothesis and the references are very similar.
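As an illustration, a BLEU score for a single hypothesis against a reference can be computed with NLTK. This is a minimal sketch; the example sentences are hypothetical and not drawn from our dataset.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference caption(s) and a tokenized generated caption (illustrative only)
reference = [["no", "acute", "cardiopulmonary", "abnormality"]]
hypothesis = ["no", "acute", "cardiopulmonary", "disease"]

# Smoothing avoids zero scores when higher-order n-grams have no overlap
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```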

3. Exploratory Data Analysis

3.1 Dataset taken for understanding the Image Captioning pipeline:

The Indiana University chest X-ray collection contains 7,471 images, each paired with four caption fields, namely Impression, Findings, Comparison and Indication, that provide clear descriptions of the salient entities and events. The dataset contains both frontal and lateral images. We manually chose the frontal images, as those images seemed to express the features very well.

We then split the frontal images as follows:

  • Training Set — 1516 images
  • Test Set — 377 images

We then extracted only the impressions for all the images, along with their image names, and put them in a separate file called alldata.txt.

3.2 Cleaning the Data

We performed the basic text cleaning steps listed below; a minimal sketch of these steps follows the list.

  1. Lower-casing all the words
  2. Removing special characters such as %, $ and #
  3. Eliminating words that contain numbers, such as ‘hello88’
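The sketch below illustrates these three steps with a small cleaning function. It is an assumption about the exact implementation, not the code we ran verbatim.

```python
import re

def clean_caption(text):
    """Lower-case, drop words containing digits, and strip special characters."""
    text = text.lower()
    # Drop any word that contains a digit, e.g. 'hello88'
    words = [w for w in text.split() if not re.search(r"\d", w)]
    # Remove remaining special characters such as %, $, #
    words = [re.sub(r"[^a-z]", "", w) for w in words]
    return " ".join(w for w in words if w)

print(clean_caption("Heart size is NORMAL; no effusion, XXXX% stable."))
```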

We then created two new files called “train.txt” and “test.txt” to store these impressions along with their image names. Once done, we created a vocabulary of all the unique words present in the corpus. To further optimise this vocabulary, we retained only the words that occurred sufficiently often in the corpus.

3.3 Loading the training set

The names of the images in the training set are listed in the “train.txt” file, so we load these 1516 image names into a list called “train”. We then load the descriptions (captions) of these 1516 images from the “alldata.txt” file into a newly created Python dictionary called “train_descriptions”.

3.4 Add start and end sequence tokens

We add a start sequence token and an end sequence token to every caption from alldata.txt while loading it into the Python dictionary train_descriptions. These tokens mark the beginning and end of each caption, so that the decoder knows where a sentence starts and when its generation is complete.
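A minimal sketch of this step is shown below. The tab-separated layout of alldata.txt and the token strings “startseq” / “endseq” are assumptions for illustration.

```python
# Wrap each impression with start/end tokens while building train_descriptions
train_descriptions = {}
with open("alldata.txt") as f:
    for line in f:
        image_name, impression = line.strip().split("\t", 1)
        if image_name in train:  # keep only images from the training split
            wrapped = "startseq " + impression + " endseq"
            train_descriptions.setdefault(image_name, []).append(wrapped)
```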

3.5 Data Preprocessing — Images

In our project, images are the input (X) to our model. Since every input to a neural network has to be vectorized, we convert every image in our training set to a vector of fixed size. For this vectorization we opted for transfer learning using the InceptionV3 model created by Google Research, which has already been trained on the ImageNet dataset for image classification. Note that our objective here is not to classify each image but to vectorize it, an automatic feature engineering process. So we removed the last softmax layer from the model, and as a result obtained a 2048-length vector, also called the bottleneck features, for every image.

Finally, we save all these bottleneck features in a Python dictionary in which the image names are the keys and the corresponding bottleneck features are the values. We also save the encoded training images to disk in a Pickle file named “encoded_train_images.pkl” and the encoded test images in a Pickle file called “encoded_test_images.pkl”. We use Pickle because it provides a simple way of converting a Python object such as a dict or list into a byte stream that can be stored in a file or database.
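The sketch below shows this feature-extraction and pickling step. The directory layout and variable names are assumptions; only the 2048-length bottleneck vector and the pickle file names come from the description above.

```python
import pickle
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.models import Model

base = InceptionV3(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)  # drop the softmax layer

def encode(img_path):
    """Return the 2048-length bottleneck vector for one image."""
    img = keras_image.load_img(img_path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(keras_image.img_to_array(img), axis=0))
    return encoder.predict(x).reshape(2048)

# 'train' holds the training image names; 'images/' is an assumed folder
encoded_train = {name: encode("images/" + name) for name in train}

with open("encoded_train_images.pkl", "wb") as f:
    pickle.dump(encoded_train, f)
```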

3.6 Data Preprocessing — Captions

In our project, captions are the output, i.e. the values to be predicted by our model, so during the training phase we treat the captions as the target (Y) variable. Note that the entire caption is not predicted at once by our model; it happens word by word. Hence, each word needs to be encoded into a fixed-size vector, which is taken care of in the model design.

As a data preprocessing step, we created two Python dictionaries called “wordToIndex” and “indexToWord” to represent every unique word in the vocabulary by an index (i.e. an integer). Since we have a total of 1652 unique words in our vocabulary, each word is represented by an integer between 1 and 1652. A sketch of building these dictionaries follows the list below.

Here,

  1. wordToIndex[‘abc’] dictionary returns the index of the word ‘abc’ and
  2. indexToWord[k] dictionary returns the word whose index is ‘k’
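A minimal sketch of building the two lookup dictionaries, assuming “vocab” is the list of unique words retained earlier:

```python
wordToIndex = {}
indexToWord = {}
for i, word in enumerate(vocab, start=1):  # indices run from 1 to len(vocab)
    wordToIndex[word] = i
    indexToWord[i] = word

vocab_size = len(vocab) + 1  # +1 reserves index 0 for padding
```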

3.7 Data Preparation using Generator Function

In this step, we prepared the data in a form convenient to feed to the deep learning model. A data generator is a construct natively supported in Python; the Keras API's ImageDataGenerator class, for example, is built on the same idea. A generator function behaves like an iterator that resumes execution exactly from the point where it left off the last time it was called, thereby saving a huge amount of memory each time the iteration is run.
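Below is a minimal sketch of such a generator, consistent with the model inputs described later but an assumption about the exact implementation: for each image it yields the bottleneck vector, a padded partial caption, and the one-hot encoded next word.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(descriptions, photos, wordToIndex, max_length, vocab_size, batch_size):
    """Yield ([image_vectors, partial_captions], next_words) batches indefinitely."""
    X1, X2, y = [], [], []
    n = 0
    while True:  # Keras keeps requesting batches during training
        for name, caption_list in descriptions.items():
            photo = photos[name]
            for caption in caption_list:
                seq = [wordToIndex[w] for w in caption.split() if w in wordToIndex]
                # Each partial caption is paired with the next word to predict
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(photo); X2.append(in_seq); y.append(out_word)
            n += 1
            if n == batch_size:
                yield ([np.array(X1), np.array(X2)], np.array(y))
                X1, X2, y = [], [], []
                n = 0
```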

3.8 Word Embedding

Every word in our vocabulary was mapped to a 200-dimensional vector using a pre-trained GloVe word embedding model, i.e. for all 1652 unique words in our vocabulary we created an embedding matrix, which is loaded into the model before training. We carried out this word embedding step because most deep learning architectures cannot process strings or plain text in their raw form; they work only on numbers as inputs.
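A minimal sketch of building the embedding matrix from pre-trained 200-dimensional GloVe vectors is shown below; the file name “glove.6B.200d.txt” refers to the standard GloVe distribution and is an assumption here.

```python
import numpy as np

embedding_dim = 200
embeddings_index = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in wordToIndex.items():
    vector = embeddings_index.get(word)
    if vector is not None:  # words missing from GloVe stay all-zero
        embedding_matrix[i] = vector
```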

4. Model Architecture

Our input consists of two parts, an image vector and a partial caption, so the Sequential API provided by Keras cannot be used. We therefore use the functional API, which allows us to merge models.

Below is the high level architecture representation.

High Level Architecture

Our model is basically a combination of two input models and one output decoder. The image feature extractor model and the partial caption sequence model are the input models, and a feed-forward model is the output decoder.

The image feature extractor model has an input layer, a dropout layer and a dense layer. The partial caption sequence model has an input layer, the embedding layer (initialized with the embedding matrix), a dropout layer and a bidirectional LSTM layer.

Bidirectional Recurrent Neural Networks (BRNNs) connect two hidden layers running in opposite directions to the same output. With this architecture, the output layer can draw on information from past (backward) and future (forward) states simultaneously.

BRNNs are especially useful when the context of the input is needed.

We have used this approach in the caption-analysis part of our model.

We load the embedding matrix obtained from the pre-trained GloVe model into our model's embedding layer before we begin training. Notice that since we are using a pre-trained embedding layer, we need to freeze it (i.e. set trainable = False) before training the model, so that it does not get updated during backpropagation.

We then compile our model using the Adam optimizer. Finally, the weights of our model are updated through the backpropagation algorithm, and the model learns to output a sentence given an image feature vector and a partial caption.
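The sketch below assembles the two-branch functional model described above. Apart from the 2048-length image vector, the 200-dimensional embeddings and vocab_size, the layer sizes are illustrative assumptions rather than the exact configuration we trained.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, Bidirectional, add
from tensorflow.keras.models import Model

# Image feature extractor branch: input layer, dropout layer, dense layer
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)

# Partial caption sequence branch: input, embedding, dropout, bidirectional LSTM
inputs2 = Input(shape=(max_length,))
embedding_layer = Embedding(vocab_size, 200, mask_zero=True)
se1 = embedding_layer(inputs2)
se2 = Dropout(0.5)(se1)
se3 = Bidirectional(LSTM(128))(se2)  # 128 units per direction -> 256-d output

# Decoder: merge the two branches and predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)

# Load the pre-trained GloVe weights and freeze the embedding layer
embedding_layer.set_weights([embedding_matrix])
embedding_layer.trainable = False

model.compile(loss="categorical_crossentropy", optimizer="adam")
```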

5. Predictions of the Model

We ran our model for over 40 epochs and we have recorded and presented those results below.

For 10 epochs: (1893 Frontal Images with Impressions)

For 20 epochs: (1893 Frontal Images with Impressions)

For 30 epochs: (1893 Frontal Images with Impressions)

For 40 epochs: (1893 Frontal Images with Impressions)

The above results suggest that the model's learning improved with every additional 10 epochs.

6. Evaluating the Model

6.1 Using Beam Search and Greedy Search

We used both beam search and greedy search for inference and found that the two algorithms generated the same kind of impressions.
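A minimal sketch of the greedy-search inference loop is shown below: at every step the single most probable next word is appended until “endseq” is produced or max_length is reached. Function and variable names are illustrative.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedy_search(photo, model, wordToIndex, indexToWord, max_length):
    """Generate an impression for one 2048-d image vector, one word at a time."""
    caption = "startseq"
    for _ in range(max_length):
        seq = [wordToIndex[w] for w in caption.split() if w in wordToIndex]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = np.argmax(model.predict([photo.reshape(1, 2048), seq], verbose=0))
        word = indexToWord.get(yhat)
        if word is None or word == "endseq":
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```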

6.2 Loss over epochs

We trained the model for 50 epochs, and the loss reduced substantially over the course of training. The graph below depicts the reduction in loss against the number of epochs.

7. Future Work

We will try to improve the BLEU score with hyperparameter fine-tuning, and we will also modify the model architecture to include an attention module. Preprocessing the X-ray images using techniques like CLAHE, and employing several other preprocessing techniques to clean the X-ray report data, may help in better prediction of words, so we will include this in the next phase of our work as well.

8. Conclusion

We have designed a model to generate impression statements from X-ray images and have succeeded in obtaining a BLEU score of 0.1 after 40 epochs. The result is promising, and the generated impression statements are meaningful with respect to the X-ray image.

We have also explored the different architectures and techniques that will help to fine-tune the model's efficiency.
