Table of contents :
- Business Problem
- ML Formulation
- Data source and Data overview
- Data Preparation
- Exploratory Data Analysis
- Basic Model [Encoder-Decoder]
- Main Model [Encoder-Decoder with Attention]
- Future work
1. Introduction :
What do you see in the picture above? “A brown dog taking a selfie”: yes, you are right 😎. That was an easy task, but can you describe the image below? That one is hard 😖, yet deep learning can perform well in both cases.
AI can feel like magic today. It has many applications, but those in the medical field are of particular benefit to humankind. Research over the last five years shows a clear improvement in computer-aided detection (CAD), specifically in disease prediction from medical images.
However, writing high-quality medical reports from medical images, such as chest X-rays, computed tomography (CT) or magnetic resonance imaging (MRI) scans, is a task that requires a trained radiologist with years of experience. With deep learning techniques we can generate a medical report directly from the medical images. But how?
This task combines computer vision (image data) and natural language processing (text data).
2. Prerequisite :
To follow this blog, it helps to have some familiarity with neural networks, CNNs, RNNs, transfer learning, encoder-decoder models, the attention mechanism, Python programming and the Keras library.
3. Business Problem :
The shortage of specialist physicians is even more critical in resource-limited countries, so the expected impact of this technology is even greater there. Writing a conclusion from an X-ray, CT or MRI image is a task that requires a trained radiologist with years of experience. AI is set to have a significant impact on the medical imaging market and, hence, on how radiologists work, with the ultimate goal of better patient outcomes.
We will use a state-of-the-art deep learning method to solve this problem: we give chest X-ray images (frontal and lateral) to the model and get a medical report for those images.
4. ML Formulation :
We divide this problem into two parts. The first is the encoder, which extracts features from the image data; we use transfer learning to extract information from the X-ray images. The second is the decoder, which takes those image features and generates the medical report.
In other words, we convert image data into text data. To evaluate this task we use the BLEU score (Bilingual Evaluation Understudy). The BLEU score lies in the range 0 to 1, where 1 means the predicted report matches the actual report exactly.
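To make the metric concrete, here is a minimal sketch of a sentence-level BLEU score in plain Python. It is not the exact scoring code used in this project: it assumes whitespace tokenization, uniform weights over 1-grams and 2-grams (`max_n=2` is an illustrative choice) and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=2):
    """BLEU for one reference/candidate pair with uniform n-gram weights."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An identical prediction scores 1.0, a prediction with no shared words scores 0.0, and partial matches fall in between.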
5. Data source and Data overview :
The data for this problem is provided by the Indiana University hospital network. It comes in two parts.
The first part contains chest X-ray images of two types, frontal and lateral views, for a total of 7472 images.
Each patient has a set of chest X-ray images. For example,
The second part of the dataset contains patient reports in XML format. Each XML file is the report for the corresponding patient. To identify the images associated with a report we check the id attribute of the XML tag <parent Images id=”image-id”>; the id holds the name of the corresponding PNG image. More than one image can be associated with a single report (XML file).
The image below shows a sample XML report.
The XML contains four important features: indication, impression, findings and comparison.
Comparison = The comparison section can provide information about a serial follow-up procedure.
Indication = When radiologists write a report, they generally have patient-relevant clinical information, usually provided in the indication section.
Findings = The findings feature describes what is seen in the X-ray image, such as which part is affected and whether the right or left lung is affected.
Impression = The impression is derived from the indication and findings. It states the overall result of the report, for example that the lungs are clear or that there is no lung disease.
A patient's X-ray image and report look like the sample below, which shows the four main features and the chest X-ray image.
6. Data preparation :
6.1 Data collection :
The data consists of a set of X-ray images and XML files containing the medical reports. Each XML file holds a lot of information about the patient: image id, comparison, impression, findings, indication and so on. We extract the impression feature from these files and treat it as the report, because it is the most useful part of the medical report. We also extract the image ids from these files to get the X-rays corresponding to each report.
We collected the data from the XML files using the code below.
After data collection we have 3955 rows in the data frame.
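A minimal sketch of this extraction step with Python's standard `xml.etree.ElementTree` module is given here. The element names `AbstractText` (with a `Label` attribute) and `parentImage`, as well as the sample report and its image ids, are assumptions made for illustration; check them against the actual XML files before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical report, shaped the way the dataset's XML is assumed to look.
SAMPLE_REPORT = """<eCitation>
  <MedlineCitation><Article><Abstract>
    <AbstractText Label="COMPARISON">None.</AbstractText>
    <AbstractText Label="INDICATION">Chest pain.</AbstractText>
    <AbstractText Label="FINDINGS">The lungs are clear.</AbstractText>
    <AbstractText Label="IMPRESSION">No acute cardiopulmonary abnormality.</AbstractText>
  </Abstract></Article></MedlineCitation>
  <parentImage id="CXR1_1_IM-0001-3001"/>
  <parentImage id="CXR1_1_IM-0001-4001"/>
</eCitation>"""

FIELDS = ("comparison", "indication", "findings", "impression")


def parse_report(xml_text):
    """Return one data-frame row: the four text features plus the image ids."""
    root = ET.fromstring(xml_text)
    row = {field: None for field in FIELDS}  # missing sections stay None
    row["images"] = [img.get("id") for img in root.iter("parentImage")]
    for node in root.iter("AbstractText"):
        label = (node.get("Label") or "").lower()
        if label in FIELDS:
            row[label] = node.text
    return row
```

Calling `parse_report` on every XML file and collecting the rows in a list gives the input for a pandas data frame with one row per report.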
6.2 Data preprocess :
We found that 104 rows have NaN in the image column, so we drop those 104 rows from the dataset. The indication and impression features have 90 and 34 None values respectively; we replaced them with “no indication” and “no impression”.
- In this phase the text data is preprocessed to remove unwanted tags, text, punctuation and numbers.
- Contractions like isn’t, doesn’t, etc. are expanded and all words are converted to lower case.
- Removed stopwords and converted numerals and symbols into words (e.g. 8 to “eight”, @ to “at”, & to “and”).
- Removed special characters, symbols, parentheses and brackets.
- Removed placeholder words like “xx”, “xxx”, etc.
- Removed extra spaces.
- Removed “year old” and removed words of two characters or fewer (e.g. “dt”, “ab”, “rt”), except for meaningful ones (e.g. “ct”, “tb”, “no”).
- For example, in “no chest pain in cardiac muscles”, removing the word “no” would change the meaning of the whole sentence.
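The cleaning steps listed above can be sketched roughly as follows. This is an illustrative version, not the project's exact code: the contraction table is a small hypothetical subset, and stopword removal and number-to-word conversion are left out to keep it short (digits are simply dropped).

```python
import re

CONTRACTIONS = {"isn't": "is not", "doesn't": "does not", "can't": "cannot"}  # illustrative subset
SYMBOLS = {"@": " at ", "&": " and "}
KEEP_SHORT = {"ct", "tb", "no"}  # short words that carry meaning and must be kept


def clean_report(text):
    """Lower-case, expand contractions, strip noise and drop very short words."""
    text = text.lower().replace("’", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, word)
    text = text.replace("year old", " ")
    text = re.sub(r"\bx{2,}\b", " ", text)   # placeholders such as "xx", "xxxx"
    text = re.sub(r"[^a-z\s]", " ", text)    # punctuation, digits, brackets
    words = [w for w in text.split() if len(w) > 2 or w in KEEP_SHORT]
    return " ".join(words)                   # join also collapses extra spaces
```

For instance, `clean_report("No chest pain. XXXX CT isn't normal & rt (2 views)")` keeps “no” and “ct” while dropping “rt”, the placeholder and the digit.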
7. Exploratory Data Analysis :
In this section we use different EDA techniques to analyze and visualize the features.
7.1 EDA on text feature :
For the text analysis we focus on the impression feature, which is the final report for the images. The visualization below shows the 50 most frequent sentences.
> Sentence occurrences for Impression Feature
- The plot above shows the 50 most frequent sentences in the impression feature. The two most frequent are ‘no acute cardiopulmonary abnormality’ and ‘no acute cardiopulmonary findings’, each occurring more than 350 times.
- The count plot shows that most of the text is ‘no acute cardiopulmonary abnormality’ or ‘no acute cardiopulmonary findings’, which means a large number of patients have no abnormality or disease.
> Unique Words in Feature
We check how many words in the impression feature are unique and how many are repeated. This is an important aspect: more unique words means more rare words in the sentences, and a large number of unique words can mean that those sentences are less important.
- From the bar plot above we conclude that 3851 words are repeated in the impression feature and 1692 words are unique.
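One way such a split could be computed is sketched here. It assumes that “unique” means a word that occurs exactly once across the whole feature and “repeated” means a word that occurs more than once; the post does not spell out the definition, so treat this as one possible reading.

```python
from collections import Counter


def unique_and_repeated(sentences):
    """Split the vocabulary into words seen once and words seen more than once."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    unique = sorted(w for w, c in counts.items() if c == 1)
    repeated = sorted(w for w, c in counts.items() if c > 1)
    return unique, repeated
```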
> Word Count distribution in feature
Let’s look at the word count distribution of the impression feature.
- From the plot above we conclude that 50% of the sentences have a length between 2 and 15 words and 90% have a length between 2 and 20 words.
- The minimum length is 2 and the maximum length is 91.
> Word Cloud of Impression feature
A word cloud shows which words are more frequent and which are less frequent: larger words occur more often and smaller words occur less often.
- The word cloud above is generated from the 1000 most frequent words.
- Acute, disease, abnormality, cardiopulmonary, normal and upper appear larger than the others; these are the important words.
- Size, name, focal and clear are less important; they occur less often in the sentences.
7.2 EDA on image feature :
Let's analyze the number of images per patient.
- The count plot above shows that most patients have two chest X-ray images.
- The maximum image count is 5 and the minimum image count is 1.
- 3208 patients have two X-ray images and 446 patients have one X-ray image.
- 181 patients have three X-ray images, 15 patients have four X-ray images and 1 patient has five X-ray images.
> Image data point construction
Some data points have more than two images and some have fewer than two.
Let's handle the data points that have 1, 3, 4 or 5 images. Below are the data point counts for each number of images.
We will use exactly 2 images for every data point. If a data point has a single X-ray image, we duplicate it.
- If we have only 1 image =>
1st image (either Frontal or Lateral) + duplicate of the 1st image
- If we have 2 images =>
1st (Frontal) + 2nd (Lateral)
- If we have 3 images =>
1st (Frontal) + 3rd (Lateral)
2nd (Frontal) + 3rd (Lateral)
we create two data points from a single data point
- If we have 4 images =>
1st (Frontal) + 2nd (Lateral)
3rd (Frontal) + 4th (Lateral)
we create two data points from a single data point
- If we have 5 images =>
1st (Frontal) + 2nd (Lateral)
3rd (Frontal) + 4th (Lateral)
1st (Frontal) + 5th (Lateral)
we create three data points from a single data point
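The pairing rules above translate directly into a small helper. This sketch assumes the images of a report are already ordered the way the rules expect (so the 1st image is a frontal view whenever there is more than one); the function name is illustrative.

```python
def make_image_pairs(images):
    """Return the (frontal, lateral) pairs built from one report's image list."""
    n = len(images)
    if n == 1:
        return [(images[0], images[0])]  # duplicate the single image
    if n == 2:
        return [(images[0], images[1])]
    if n == 3:
        return [(images[0], images[2]), (images[1], images[2])]
    if n == 4:
        return [(images[0], images[1]), (images[2], images[3])]
    if n == 5:
        return [(images[0], images[1]), (images[2], images[3]), (images[0], images[4])]
    raise ValueError(f"expected 1 to 5 images, got {n}")
```

Applied to the image counts reported above, these rules would yield 446 + 3208 + 2 × 181 + 2 × 15 + 3 × 1 = 4049 data points.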
8. Basic Model [Encoder-Decoder] :
8.1. Add Token in text data :
After creating new data points from the existing ones, we add <start> and <end> tokens to the text data. We then split the data into train, validation and test sets.
8.2 Tokenization :
Machines only understand numerical values, so we cannot feed raw text into a machine learning or deep learning model. We convert the text into numerical data using a Tokenizer; the TensorFlow deep learning library provides basic tools for this operation.
The total vocabulary size (vocab_size) is 1278 and the maximum length of the output sentence is taken as 60.
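To show what this step does to a report, here is a plain-Python stand-in for the tokenizer. It is not the Keras Tokenizer API: the helper names are hypothetical, index 0 is assumed to be reserved for padding, and sequences are padded at the end up to `max_len`.

```python
def add_tokens(report):
    """Wrap a report in the start and end markers."""
    return f"<start> {report} <end>"


def build_vocab(reports):
    """Assign an integer id to every word; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for report in reports:
        for word in report.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(report, vocab, max_len):
    """Map words to ids, truncate to max_len and pad with zeros at the end."""
    ids = [vocab[w] for w in report.split() if w in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))
```

With `max_len=60` every report becomes a fixed-length integer vector that an embedding layer can consume.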
8.3 Embedding Matrix :
We use an embedding matrix for the embedding layer. We feed the tokenized text into the embedding layer, which converts each token into a 300-dimensional vector.
We found that the pre-trained GloVe 6B-token model covers only about 88.5% (1131 words) of our 1278-word vocabulary, so we would lose the information of 147 words. To solve this we use a FastText model instead, trained on our own text data. FastText uses character n-grams and converts each token into a 300-dimensional vector.
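The coverage figure can be checked with a few lines of Python. In this sketch `pretrained_words` stands for the set of words that have a pre-trained vector; how that set is loaded from the GloVe file is left out, and the function name is illustrative.

```python
def embedding_coverage(vocab, pretrained_words):
    """Return the covered fraction of the vocabulary and the list of missing words."""
    missing = [word for word in vocab if word not in pretrained_words]
    covered = (len(vocab) - len(missing)) / len(vocab)
    return covered, missing
```

With the counts reported above, 1131 / 1278 is roughly 0.885, which leaves 147 words without a pre-trained vector.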
8.4 Image feature :
We use transfer learning to convert each image into a feature vector, which is later combined with the tokenized text data. I use the InceptionV3 model, which is trained on the ImageNet dataset.
InceptionV3 is mainly used for image classification, but we modify it for our task: we remove the last layer of the model and use average pooling.
I create an image tensor for every available image using the Inception feature extractor, as shown below. Each image is converted into a tensor of shape (1, 2048); we concatenate the frontal and lateral image tensors to get a single data point tensor of shape (1, 4096).
8.5 Encoder :
We pass the concatenated image tensor to the encoder; the shape of the input tensor is (1, 4096). We feed the image features into a dense layer that changes the last dimension so that the image features can be concatenated with the text data.
8.6 Decoder :
This part has an embedding layer, an LSTM layer and a dense layer that outputs a tensor of shape (batch_size, vocab_size).
Long Short-Term Memory networks, usually just called “LSTMs”, are a special kind of RNN capable of learning long-term dependencies.
8.7 Model Training :
For the training phase we use the teacher forcing method. Teacher forcing is a strategy for training recurrent neural networks that uses the ground truth as input, instead of the model output from a prior time step.
During training, a “start-of-sequence” token is used to start the process, and the actual word from the target sequence (rather than the generated word) is used as input on the subsequent time step, along with other input such as the image features.
This same process is repeated until the “end-of-sequence” token is reached.
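In practice teacher forcing comes down to shifting the ground-truth sequence by one position: the decoder sees the true tokens up to step t and is trained to predict the true token at step t+1. A minimal sketch, using made-up token ids:

```python
def teacher_forcing_pairs(token_ids):
    """Split one tokenized report into decoder inputs and the targets to predict."""
    decoder_input = token_ids[:-1]   # starts with <start>, never contains <end>
    decoder_target = token_ids[1:]   # the same sequence shifted left by one
    return decoder_input, decoder_target


# Hypothetical ids: 1 = <start>, 2 = <end>, others are ordinary words.
inputs, targets = teacher_forcing_pairs([1, 7, 12, 5, 2])
```

Here `inputs` is `[1, 7, 12, 5]` and `targets` is `[7, 12, 5, 2]`, so the model's own predictions are never fed back during training.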
8.8 Model Performance :
We have the loss using
8.9 Model evaluation :
For evaluation I used greedy search to generate the output sentence. The decoder starts from the <start> token; the word predicted at time step t is fed back and becomes the decoder input at time step t+1.
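The decoding loop can be sketched independently of the network. In this example `next_token_probs` is a hypothetical callable standing in for the trained decoder: it takes the tokens generated so far and returns one probability per vocabulary id. The toy table below exists only to make the sketch runnable.

```python
def greedy_decode(next_token_probs, start_id, end_id, max_len):
    """Pick the most probable token at every step until <end> or max_len."""
    tokens = [start_id]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # arg-max
        if next_id == end_id:
            break
        tokens.append(next_id)  # the prediction becomes the next input
    return tokens[1:]           # drop the <start> token


# Toy "decoder": ids are 0 = <start>, 1 = <end>, 2 and 3 = words.
TOY_TABLE = {0: [0.0, 0.1, 0.7, 0.2], 2: [0.0, 0.2, 0.1, 0.7], 3: [0.0, 0.9, 0.05, 0.05]}


def toy_model(tokens):
    return TOY_TABLE[tokens[-1]]
```

With the toy table, decoding from `<start>` produces the ids `[2, 3]` and then stops at `<end>`.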
8.10 Sample Output :
The example above shows that the encoder-decoder model performs well on short sentences but poorly on long sentences.
8.11 Conclusion :
- In the basic model we used only an LSTM, an embedding layer and a dense layer. For image feature extraction we used the InceptionV3 model.
- We got a weighted BLEU score of 0.3521 on the train data and 0.3111 on the test data.
- However, the model performs well on short sentences and poorly on long sentences.
9. Main Model [Encoder-Decoder with Attention] :
Attention networks are widely used in deep learning, and with good reason. Attention is a way for a model to choose only those parts of the encoding that it thinks are relevant to the task at hand. For images, attention considers some pixels more important than others; in sequence-to-sequence tasks like machine translation, it considers some words more important than others. Here the attention mechanism allows the decoder to focus on the part of the image most relevant to the word it is going to produce next. You can find more about the attention mechanism in this research paper.
- Image extraction : For the attention model we use the same text data preparation as in the basic model; the only change is that the maximum output length is 91 instead of 60. Here I use the pre-trained DenseNet121 model, a 121-layer convolutional network. The pre-trained weights come from the CheXNet file; they were trained on the largest publicly available chest X-ray dataset, containing over 100,000 frontal-view X-ray images with 14 diseases. These weights are well suited to our task.
- Image Input layer : We concatenate the frontal and lateral images into a single image, pass it through the image model (CheXNet) and obtain a tensor of shape (None, 9, 9, 1024).
- Embedding layer : maps each word into a 300-dimensional vector.
- Bi-GRU layer : a bidirectional GRU extracts high-level information from the attention output; it processes the sequence both forward and backward.
- Attention layer : produces a weight vector and merges the features from each time step into a single feature vector by multiplying them by the weight vector.
- Output layer : the number of units is set to the vocabulary size; at every time step the model predicts a vector of size vocab_size.
9.1 Encoder Model :
The encoder class converts the image features of shape (None, 9, 9, 1024) from the CheXNet model into a lower-dimensional tensor of shape (None, 81, embedding_dim).
9.2 Attention Model :
We used the Bahdanau attention method. The attention class uses the previous hidden state of the decoder and the encoder output to calculate the attention weights and the context vector.
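The arithmetic behind additive (Bahdanau) attention is easy to follow in NumPy. The sketch below is not the Keras attention class used in this project: it handles a single example without a batch dimension, and the weight matrices `W1`, `W2` and the vector `v` are random stand-ins for parameters that would normally be learned.

```python
import numpy as np


def bahdanau_attention(encoder_output, decoder_hidden, W1, W2, v):
    """Additive attention for one example.

    encoder_output: (positions, enc_dim), e.g. the 81 image regions
    decoder_hidden: (dec_dim,), previous hidden state of the decoder
    W1: (enc_dim, units), W2: (dec_dim, units), v: (units,)
    """
    # score_i = v . tanh(W1^T h_i + W2^T s), one score per image region
    scores = np.tanh(encoder_output @ W1 + decoder_hidden @ W2) @ v
    scores = scores - scores.max()                   # for a numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    context = weights @ encoder_output               # weighted sum of the regions
    return context, weights


rng = np.random.default_rng(0)
enc = rng.normal(size=(81, 16))   # 81 regions with 16 features each (toy sizes)
hid = rng.normal(size=(32,))
context, weights = bahdanau_attention(
    enc, hid, rng.normal(size=(16, 8)), rng.normal(size=(32, 8)), rng.normal(size=(8,))
)
```

The `weights` vector has one entry per image region, which is what the attention plots visualize, and `context` is the vector passed on to the decoder.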
9.3 Decoder Model :
At every time step the decoder class receives four inputs: the decoder input (teacher forcing), the encoder output (encoded image tensor), and the forward and backward hidden states from the previous time step. The decoder returns a dense vector whose length is the vocabulary size, the attention weights, and the forward and backward hidden states from the GRU.
9.4 Model Performance :
The plot below shows the train and test loss of the model.
The below plot shows the train and test accuracy of the model.
9.5 Model Evaluation :
> Greedy Search :
We use greedy search for evaluation: at each time step we choose the word with the highest probability, which we obtain with the arg-max function.
Sample Output :
9.6 Attention plot :
We get the attention weights at every time step. They let the model choose only those parts of the encoding that it thinks are relevant, treating some pixels as more important than others. The plots below are attention plots.
9.7 Conclusion :
- The model built on a bidirectional GRU with attention performs better than the basic model.
- The main model predicts both long and short sentences better than the basic model.
- The loss converged to 0.28 with an accuracy of 95.7% on train and 93.8% on validation. From the results we can see that the predicted and actual outputs are similar.
- We got a BLEU score of 0.4177 on the train data and 0.39158 on the test data, an increase of about 0.07 on train and 0.08 on test over the basic model.
10. Error-Analysis :
Next comes error analysis: finding out what causes the errors and using those findings to improve the model. We look at data points with low BLEU scores and data points with high BLEU scores and check the reasons behind each. Identifying past mistakes lets us improve the performance of future models.
After training the model we examine the validation set to find the errors and analyze them. If an error is reducible, we fix it in future training so that the model improves on the previous one.
> Low bleu score analysis :
We take the data points whose BLEU score is lower than 0.05 and inspect them.
Let’s check the data points that have a low BLEU score.
- For the data points with a low BLEU score, the actual medical reports contain a higher number of unique words.
- There are 1197 unique words in the actual medical reports and 897 unique words in the predicted reports.
- Of these unique words, 174 are missing from the pre-trained GloVe model.
- From the images above we observe that some images have low resolution, some have different sizes and some are not properly visible.
- We also see fingerprints and patients' jewelry clearly visible in some images.
- A large number of images have brightness fluctuations and low resolution.
- More unique words means more rare words in the corpus; such words are less important, and repeated words are more important than unique words.
- To handle OOV (out-of-vocabulary) words we can use a FastText model, which uses the n-gram method to build vectors for OOV words.
> High bleu score analysis :
We take the data points whose BLEU score is greater than 0.7 and inspect them.
Let’s check the data points that have a high BLEU score.
- For the data points with a high BLEU score, the actual medical reports contain a lower number of unique words.
- There are 77 unique words in the actual medical reports and 112 unique words in the predicted reports.
- From the images above we observe that a large number of images are clearly visible.
- The corpus has fewer unique words than that of the low BLEU score data points.
- In some cases the true sentence contains incorrect words.
- We could ignore these errors in future work to get better performance; they are the reducible errors in this error analysis.
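The two selections used in this section can be sketched as a simple filter over scored validation examples. The record layout (dictionaries with `actual`, `predicted` and `bleu` keys) is a hypothetical choice for illustration, and “unique words” is taken here to mean the number of distinct words in a group of reports.

```python
def split_by_bleu(scored_reports, low=0.05, high=0.7):
    """Return the low-score and high-score groups used for error analysis."""
    low_group = [r for r in scored_reports if r["bleu"] < low]
    high_group = [r for r in scored_reports if r["bleu"] > high]
    return low_group, high_group


def distinct_word_count(reports, key):
    """Count the distinct words in one text field across a group of reports."""
    return len({word for r in reports for word in r[key].split()})
```

Comparing `distinct_word_count(group, "actual")` with `distinct_word_count(group, "predicted")` for each group gives counts of the kind listed above.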
11. Model Deployment :
12. Future Work :
- We could also replace the whole architecture with a state-of-the-art BERT Transformer instead of the Attention-BiGRU.
- We did not have a big dataset for this task; a larger dataset would likely produce better results.
13. References :