Part 5: Interpreting Machine-Predicted X-Ray Captions and Concluding Remarks

We have generated captions; now what can we say about our technique?

Alexander Bricken
5 min read · Nov 8, 2021

This is the fifth article in a five-part series on Using Computer Vision and NLP to Caption X-Rays.

This project aims to measure the similarity of machine-predicted captions to the actual captions provided by doctors. The process has been broken down across the articles in this series.

The code is hosted and usable at this GitHub repository.

Untangling our results like this pile of laundry. Photo by engin akyurt on Unsplash.

The Goals

Before we dive into understanding the output of our caption generation and our final deployed website, we should remind ourselves of the goals we set at the start of this project.

We aimed to measure the similarity of machine-predicted captions to actual captions provided by doctors.

In order to accomplish this, we set out to guide our reader on a learning journey by exploring the following ideas:

  • Image and caption data uploading, cleaning, and preprocessing.
  • Exploring data features and engineering new features.
  • Generating captions using machine learning methods.
  • Deploying a machine learning model online and building a front end around it.

Results

There are a few ways we can examine our results.

Outputting Predictions

First, we can simply display predicted and real captions alongside each other to see firsthand how similar they are. Along with this, we can visualise the attention weights laid over each image. Here are some of the best outputs from our notebook:

  • Here, we get it perfect! Image by Author.
  • This one is near perfect as well: findings and disease have the same connoted meaning in this context at the end of the sentence. Image by Author.
  • Inflammatory aspiration might be an indication of mild crowding? Image by Author.
  • The predicted and real captions are pretty similar here. No problems found. Image by Author.
  • Our final model is definitely better than one of our previous iterations! Bony… seems about right :) Image by Author.

Evaluating Your Model

If we want to evaluate how similar a generated sentence is to a reference sentence, we can use the BLEU score [1].

BLEU stands for Bilingual Evaluation Understudy. It can be calculated using the NLTK Python library, which provides a sentence_bleu() function for evaluating generated text against one or more references. Essentially, BLEU compares the n-grams of the predicted caption with those of the real caption: it counts how many n-grams in the prediction also appear in the reference, clipping each count so that a repeated word cannot be rewarded more times than it occurs in the reference.

BLEU scoring relies on this modified n-gram precision. “An n-gram is a sequence of words occurring within a given window where n represents the window size” [3]; for example, the bigrams of “no acute disease” are “no acute” and “acute disease”. To calculate the BLEU score we need two pieces: the modified precision scores and the Brevity Penalty (BP), which penalises predictions that are shorter than the reference. If you want to learn how these are calculated in detail, see [3].

These pieces combine into the overall score [3]:

BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )

where N is the number of n-gram sizes used, w_n is the weight given to each modified precision, and p_n is the modified n-gram precision.
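To make the formula concrete, here is a minimal sketch with two made-up captions (they are not taken from our dataset) showing how NLTK combines the modified precisions for us:

from nltk.translate.bleu_score import sentence_bleu

# Two toy captions, tokenised by splitting on whitespace
reference = "the heart size and pulmonary vascularity appear within normal limits".split()
candidate = "the heart size and mediastinal contour appear within normal limits".split()

# The default weights (0.25, 0.25, 0.25, 0.25) combine the modified 1- to 4-gram
# precisions exactly as in the formula above, i.e. standard BLEU-4
print(sentence_bleu([reference], candidate))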

The BLEU metric ranges from 0 to 1. Before implementing it, we have to pair up our real and predicted captions, as follows:

store = tokenizer.index_word

def calculateOutputs(img_name_val, store, n):
    predicted_captions = []
    real_captions = []
    random_ids = range(0, n)
    for i in random_ids:
        # Rebuild the ground-truth caption from token ids, skipping padding (0)
        real_caption = ' '.join([store[t] for t in cap_val[i] if t not in [0]])
        image = img_name_val[i]
        predicted_caption, _ = evaluate(image)
        predicted_captions.append(predicted_caption)
        real_captions.append(real_caption)
    return predicted_captions, real_captions
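For example, to build prediction/reference pairs for the first 50 validation images (the count of 50 is an arbitrary choice for illustration), we might call:

predicted_captions, real_captions = calculateOutputs(img_name_val, store, n=50)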

We now implement the NLTK BLEU evaluator below.

import numpy as np
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu_evaluation(predicted_captions, real_captions):
    scores = []
    for i in range(len(predicted_captions)):
        # sentence_bleu expects tokenised input: a list of reference token lists
        # and a hypothesis token list, so split the caption strings on whitespace
        reference = [real_captions[i].split()]
        hypothesis = predicted_captions[i].split()
        scores.append(sentence_bleu(reference, hypothesis))
    return scores, np.mean(scores)

calculate_bleu_evaluation(predicted_captions, real_captions)

Together, these functions give us a quantitative summary of how well the model performs on the validation set.
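One caveat worth noting: with its default settings, sentence_bleu scores on 1- to 4-grams, so short captions that share no 4-gram with the reference collapse towards zero and NLTK emits a warning. A hedged variant is sketched below; the helper name, the bigram-only weights, and the smoothing method are our own choices rather than anything from the original notebook.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_2_smoothed(real_caption, predicted_caption):
    # Score on unigrams and bigrams only, smoothing zero n-gram counts (method1)
    return sentence_bleu([real_caption.split()], predicted_caption.split(),
                         weights=(0.5, 0.5),
                         smoothing_function=SmoothingFunction().method1)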

Conclusion

We see from our results that it is feasible to train a model to write captions for X-rays with a certain degree of accuracy. By using a variety of methods such as tokenisation, batch processing, recurrent neural networks, and gradient tape training, we converge on these captions. In this article we have stepped through the processes required to obtain these results, from data collection and exploratory analysis to NLP and computer vision techniques. To improve the process within our computational constraints, we could increase the size of our dataset, tune our hyperparameters further, or find new ways of incorporating metafeatures into our captioning. Finally, we could consider ways to test the accuracy of our results beyond the BLEU score; perplexity, for example, is another intrinsic evaluation method used in language modelling [2].
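As a rough sketch of that alternative (this is not part of our pipeline), perplexity is simply the exponential of the average per-token cross-entropy, so it can be computed directly from the losses a decoder already produces during evaluation:

import numpy as np

def perplexity(per_token_cross_entropies):
    # per_token_cross_entropies: natural-log cross-entropy of each predicted token
    return float(np.exp(np.mean(per_token_cross_entropies)))

# hypothetical per-token losses collected during a validation pass
print(perplexity([2.1, 1.7, 2.4, 1.9]))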

References

[1] Brownlee, J. (2017). https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

[2] Campagnola, C. (2020). https://towardsdatascience.com/perplexity-in-language-models-87a196019a94

[3] Khandelwal, R. (2020). https://towardsdatascience.com/bleu-bilingual-evaluation-understudy-2b4eab9bcfd1

Thanks for reading!

Also, a huge thank you to the DeepNote team for providing our team with some free pro compute power to run our models faster! We wouldn’t have been able to do this project in collaboration with each other so easily if it weren’t for DeepNote.

To keep updated with Alexander’s work, follow him on Twitter!

