Part 5: Interpreting Machine-Predicted X-Ray Captions and Concluding Remarks
We have generated captions; now what can we say about our technique?
This is the fifth article in a five-part series on Using Computer Vision and NLP to Caption X-Rays.
This project aims to measure the similarity of machine-predicted captions to the actual captions provided by doctors. Our process has been broken down into the following topics:
- Part 1: Cleaning and Pre-processing X-Ray Data
- Part 2: Exploring and Engineering X-Ray Data Features
- Part 3: Creating a Caption Generating Model Using CNNs and RNNs
- Part 4: Deploying the Model to Serve X-Ray Diagnosis in Production
- Part 5: Interpreting Machine-Predicted X-Ray Captions and Concluding Remarks
The code is hosted and usable at this GitHub repository.
The Goals
Before we dive into the output of our caption generation and our final deployed website, we should remind ourselves of the goals we set at the start of this project.
We aimed to measure the similarity of machine-predicted captions to actual captions provided by doctors.
In order to accomplish this, we set out to guide our reader on a learning journey by exploring the following ideas:
- Image and caption data uploading, cleaning, and preprocessing.
- Exploring data features and engineering new features.
- Generating captions using machine learning methods.
- Deploying a machine learning model online and building a front end around it.
Results
There are a few ways we can examine our results.
Outputting Predictions
Firstly, we can simply display predicted and real captions alongside each other to see firsthand how similar they are. In addition, we can overlay the attention layer's weights on top of the image to see where the model was "looking" as it generated each word. Here are some of the best outputs we can visualise from our notebook!
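Our notebook produces these overlays with a plotting helper along the following lines. This is a sketch in the style of the standard TensorFlow image-captioning tutorial rather than our exact code, and the 8×8 grid is an assumption about the encoder's feature-map size:

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def plot_attention(image_path, words, attention_plot):
    """Overlay per-word attention weights on the source X-ray."""
    img = np.array(Image.open(image_path))
    fig = plt.figure(figsize=(10, 10))
    for i, word in enumerate(words):
        # Reshape the flat attention vector to the encoder's spatial grid
        weights = np.resize(attention_plot[i], (8, 8))
        ax = fig.add_subplot((len(words) + 1) // 2, 2, i + 1)
        ax.set_title(word)
        ax.imshow(img, cmap='gray')
        # Stretch the coarse attention map over the full image
        ax.imshow(weights, cmap='jet', alpha=0.4,
                  extent=[0, img.shape[1], img.shape[0], 0])
    plt.tight_layout()
    plt.show()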
Evaluating Your Model
If we want to evaluate how close a generated sentence is to a reference sentence, we can use the BLEU score [1].
BLEU stands for Bilingual Evaluation Understudy. It can be calculated with the NLTK Python library, which provides a sentence_bleu() function for evaluating generated text against one or more references. At its core, BLEU counts how many of the words and word sequences in the predicted caption also appear in the real caption, and expresses that overlap as a precision.
More precisely, BLEU uses modified n-gram precision. "An n-gram is a sequence of words occurring within a given window where n represents the window size" [3]. The "modified" part means each matching n-gram is clipped to the number of times it occurs in the reference, so a prediction cannot inflate its score by repeating a matching word. To calculate the BLEU score we need two pieces: the precision score and the Brevity Penalty (BP), which penalises predictions shorter than their reference. For the full derivation, see [3].
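To make clipping and the brevity penalty concrete, here is a toy unigram-only calculation (the sentences are illustrative, not from our dataset; full BLEU also averages 2- to 4-gram precisions):

import math
from collections import Counter

reference = "the heart size is normal".split()
candidate = "the the heart is normal".split()

# Modified (clipped) unigram precision: each candidate word counts at most
# as many times as it appears in the reference.
ref_counts = Counter(reference)
clipped = sum(min(count, ref_counts[word])
              for word, count in Counter(candidate).items())
precision = clipped / len(candidate)  # 4/5 = 0.8: the repeated "the" is clipped

# Brevity penalty: 1 for candidates at least as long as the reference,
# exp(1 - r/c) for shorter ones.
r, c = len(reference), len(candidate)
bp = 1.0 if c >= r else math.exp(1 - r / c)

print(bp * precision)  # 0.8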
The BLEU metric ranges from 0 to 1, where 1 is a perfect match. Before implementing it, we need to pair up our real and predicted captions, as follows:
store = tokenizer.index_word

def calculateOutputs(img_name_val, store, n):
    predicted_captions = []
    real_captions = []
    for i in range(n):
        # Rebuild the real caption from its token indices, skipping padding (0)
        real_caption = ' '.join([store[t] for t in cap_val[i] if t not in [0]])
        image = img_name_val[i]
        # evaluate() returns the predicted caption and its attention weights
        predicted_caption, _ = evaluate(image)
        predicted_captions.append(predicted_caption)
        real_captions.append(real_caption)
    return predicted_captions, real_captions
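With this helper we can, for example, generate paired captions for the first 50 validation images (the sample size here is arbitrary):

predicted_captions, real_captions = calculateOutputs(img_name_val, store, 50)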
We now implement the NLTK BLEU evaluator below.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu_evaluation(predicted_captions, real_captions):
    scores = []
    for i in range(len(predicted_captions)):
        # sentence_bleu expects a list of tokenised references and a
        # tokenised hypothesis; we assume both captions are plain strings
        score = sentence_bleu([real_captions[i].split()],
                              predicted_captions[i].split())
        scores.append(score)
    return scores, np.mean(scores)

calculate_bleu_evaluation(predicted_captions, real_captions)
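One practical caveat before trusting the numbers: with short clinical captions, the higher-order n-grams often have no overlap at all, which makes sentence_bleu return (and warn about) scores of zero. NLTK's SmoothingFunction is the usual remedy; a minimal sketch:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Smoothing avoids zero scores when a higher-order n-gram has no overlap,
# which is common with short captions.
smoother = SmoothingFunction().method1
score = sentence_bleu([real_captions[0].split()],
                      predicted_captions[0].split(),
                      smoothing_function=smoother)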
Together, these functions give us a quantitative measure of how closely our predicted captions track the captions doctors actually wrote.
Conclusion
Our results show that it is feasible to train a model to write captions for X-rays with a certain degree of accuracy. By using methods such as tokenisation, batch processing, recurrent neural networks, and TensorFlow's GradientTape, we converge on these captions. In this article we have stepped through the processes required to obtain these results, from data collection and exploratory analysis to NLP and computer vision techniques. To improve on the process within our computational constraints, we could increase the size of our dataset, tune our hyperparameters further, or find new ways of incorporating metafeatures into our captioning. Finally, we could consider ways to test the accuracy of our results beyond the BLEU score. For example, perplexity is another intrinsic evaluation method used in language modelling [2].
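As a pointer in that direction: perplexity is the exponential of the average negative log-probability a model assigns to each token of the real caption, so lower is better. A minimal sketch with hypothetical per-token probabilities (a real model would supply these):

import numpy as np

# Hypothetical probabilities our decoder might assign to each ground-truth token
token_probs = np.array([0.40, 0.25, 0.10, 0.30])

perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)  # ~4.3; a perfect model (all probabilities 1.0) scores 1.0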
References
[1] Brownlee, J. (2017). "A Gentle Introduction to Calculating the BLEU Score for Text in Python." Machine Learning Mastery. https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
[2] Campagnola, C. (2020). "Perplexity in Language Models." Towards Data Science. https://towardsdatascience.com/perplexity-in-language-models-87a196019a94
[3] Khandelwal, R. (2020). "BLEU: Bilingual Evaluation Understudy." Towards Data Science. https://towardsdatascience.com/bleu-bilingual-evaluation-understudy-2b4eab9bcfd1
Thanks for reading!
Also, a huge thank you to the DeepNote team for providing our team with some free pro compute power to run our models faster! We wouldn’t have been able to do this project in collaboration with each other so easily if it weren’t for DeepNote.
To keep updated with Alexander’s work, follow him on Twitter!