Part 3 — Creating a caption-generating model using a CNN-RNN framework

Idriss Bennis
6 min read · Dec 17, 2021


This is the third article in a five-part series on Using Computer Vision and NLP to Caption X-Rays.

This project aims to measure the similarity of machine-predicted captions to the actual captions provided by doctors. Our process has been broken down into the topics covered across this series.

The code is hosted and usable in this GitHub repository.

Figure 1. A visual representation of a neural network. Photo by Uriel SC on Unsplash.

1 — High-level Breakdown

Our model aims to create high-quality captions from medical images. While very intuitive to human beings, such a task cannot be done directly by a machine, as neither text nor images hold any inherent meaning for a computer program. As such, we introduce a framework built from three consecutive components: a model that extracts the features from the images, a CNN (convolutional neural network) that takes those features and encodes them, and an RNN (recurrent neural network) that decodes that output and generates the text.

These components are all needed and complement each other. We implemented them on Deepnote (thanks to Simon for allowing us to use higher-end machines). This article goes in-depth on the implementation of our model and incorporates selected snippets of code to guide applicability and make our code easier to browse for an audience that is not deeply familiar with these models.

2 — Inputting the data into the model

To implement this model, we used the TensorFlow Keras library and a Google Drive integration that allows Deepnote to retrieve the X-ray images without having to upload them to our limited-space Deepnote team server.

Our model requires two separate data inputs:

  • A pickle dataset that includes an ID for each patient alongside the real caption written by medical experts. As explained in Article 1, the caption is made up of both the ‘impression’ and ‘findings’ columns. This dataset is needed to build the tokenizer that populates the vocabulary, as well as for testing purposes, i.e., comparing the computer-generated captions to the real ones.
  • The actual image files, hosted in a separate Google Drive folder. The images are used for training the model (matched through the report ID included in the dataset) and will be the sole input to the final deployed model, since captioning is done from the image alone.
# This is code for representation only; some lines are missing for the sake of simplicity.
# Importing the libraries (we actually imported much more than just this)
import pandas as pd
import tensorflow as tf

# Defining the image path
image_folder = "/datasets/gdrive/XRay-AKAKI/images_normalized/"

# Importing the dataset containing the captions and IDs
frontal_train = pd.read_pickle("../data/train/frontal_train.pickle")

# Creating an array with the full path to every image
img_names = frontal_train['filename'].to_numpy()
all_img_names = [image_folder + name for name in img_names]

# Formatting the real captions to be readable by the model:
# each caption is wrapped with the '<start>' and '<end>' tokens
impressions = frontal_train['impression'].to_numpy()
all_impressions = ['<start> ' + imp + ' <end>' for imp in impressions]

For the model to recognize where each caption begins and ends (and thus handle captions of different lengths), the strings ‘<start>’ and ‘<end>’ are added to the beginning and end of every caption.

3 — Diving Deeper into the models

Once the data has been prepared to be fed into the model, we start by building a tokenizer, which holds the vocabulary of words the RNN decoder is allowed to use, and by coding the function that extracts the features from the images.

Figure 2. Visual representation of the model implemented in this project. [Image by Author]

Each element in Figure 2 will be explained here, although simplified for the sake of space. Complete code can be found in the repository.

Firstly, the Tokenizer is the vocabulary element of this model: it receives the entire stream of characters contained in the real captions written by medical experts and breaks it up into individual words, called tokens, splitting every time it encounters a white space.
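As an illustration, here is a minimal sketch of how such a tokenizer can be built with tf.keras; the ‘<unk>’ out-of-vocabulary token and the custom filter string below are our assumptions, not necessarily the exact settings used in the repository.

# Minimal sketch of the tokenizer step; the settings below are assumptions
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Keep '<' and '>' out of the filter list so '<start>' and '<end>' survive
tokenizer = Tokenizer(oov_token='<unk>',
                      filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~')

# Build the vocabulary from the real captions (already wrapped with <start>/<end>)
tokenizer.fit_on_texts(all_impressions)

# Turn each caption into a sequence of integer tokens and pad them to a common length
train_seqs = tokenizer.texts_to_sequences(all_impressions)
cap_vector = pad_sequences(train_seqs, padding='post')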

Before explaining deeper, here is a quick definition of both CNNs and RNNs:

CNNs (Convolutional Neural Networks) are Deep Learning algorithms that can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from the other.

RNNs (Recurrent Neural Networks) are Deep Learning algorithms in which connections between nodes form a directed or undirected graph along a temporal sequence, which allows them to exhibit temporal dynamic behavior.

There are several differences between them. A few of the ones that are most relevant to this project are:

  • CNNs are commonly used in solving problems related to spatial data, such as images. RNNs are better suited to analyzing temporal, sequential data, such as text or videos.
  • CNNs have a fixed input and output size, which makes them suitable for feature extraction here, as they feed features in a consistent format to the RNN. RNNs, by contrast, can produce outputs of varying length, which is essential for the generated captions to be realistic.

Below, you can find simplified code showing the creation of both of these elements. The CNN_Encoder is quite simple by design, as the transfer learning using ‘ImageNet’ has already done much of the work of extracting features from the images.
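The snippet that follows is a minimal, attention-free sketch of what these two elements can look like; the layer sizes are assumptions and the classes in our repository may be more elaborate.

# Minimal, attention-free sketch of the encoder and decoder; layer sizes are
# assumptions and the classes in the repository may be more elaborate.
import tensorflow as tf

class CNN_Encoder(tf.keras.Model):
    """Turns the InceptionV3 feature map into a single image embedding vector."""
    def __init__(self, embedding_dim):
        super().__init__()
        self.pool = tf.keras.layers.GlobalAveragePooling2D()
        self.fc = tf.keras.layers.Dense(embedding_dim, activation='relu')

    def call(self, x):
        # x: (batch, 8, 8, 2048) feature map extracted by InceptionV3
        return self.fc(self.pool(x))                       # (batch, embedding_dim)

class RNN_Decoder(tf.keras.Model):
    """Generates the caption one token at a time, conditioned on the image vector."""
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.init_state = tf.keras.layers.Dense(units)     # image vector -> first GRU state
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)         # scores over the vocabulary

    def call(self, x, hidden):
        x = self.embedding(x)                               # (batch, 1, embedding_dim)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output[:, -1, :]), state             # (batch, vocab_size), new state

    def reset_state(self, features):
        # Use the encoded image as the starting hidden state of the GRU
        return self.init_state(features)

# Hypothetical instantiation; the sizes are illustrative, not the repository's values
encoder = CNN_Encoder(embedding_dim=256)
decoder = RNN_Decoder(embedding_dim=256, units=512,
                      vocab_size=len(tokenizer.word_index) + 1)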

Now, we can join the tokenizer, the CNN_Encoder, and the RNN_Decoder to create a predict function that takes those different elements and returns a caption. It starts by implementing transfer learning, using the weights obtained from the InceptionV3 model trained on ‘ImageNet’, a very large visual database of millions of labelled images, which means the network already recognizes most common shapes.
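As a rough sketch of this transfer-learning step (the helper names below, such as load_image and image_features_extract_model, are ours for illustration), the feature extractor keeps InceptionV3's last convolutional output and drops its classification head:

# Sketch of the transfer-learning step; helper names are illustrative
import tensorflow as tf

# Load InceptionV3 pre-trained on 'ImageNet', without its classification head
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')

# Keep only the last convolutional feature map as the output
image_features_extract_model = tf.keras.Model(image_model.input,
                                              image_model.layers[-1].output)

# Load and preprocess an X-ray the way InceptionV3 expects (assuming PNG files)
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_png(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path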

By using its weights, we can extract features, input them into the encoder, and then create a loop that generates the caption one word at a time. The decoder can predict the end of the sentence at any point during the loop, effectively ending the function and outputting the predicted caption. Without an ‘end’ prediction, the model simply keeps adding words to the caption until reaching the maximum length indicated by the user.
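Putting these pieces together, a condensed sketch of such a predict function could look like the following; it assumes the tokenizer, encoder, decoder, and load_image objects sketched above and a user-chosen max_length, and the actual function in our repository may differ in its details.

# Condensed sketch of the predict function; details may differ from the repository
def predict_caption(image_path, max_length=50):
    # Extract the InceptionV3 feature map and encode it into a single vector
    img, _ = load_image(image_path)
    feature_map = image_features_extract_model(tf.expand_dims(img, 0))
    features = encoder(feature_map)                        # (1, embedding_dim)

    # Condition the decoder on the image by initialising its hidden state
    hidden = decoder.reset_state(features)

    # Start decoding from the '<start>' token
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for _ in range(max_length):
        predictions, hidden = decoder(dec_input, hidden)
        predicted_id = int(tf.argmax(predictions[0]).numpy())
        word = tokenizer.index_word.get(predicted_id, '<unk>')

        # The decoder can predict '<end>' at any point, which stops the loop early
        if word == '<end>':
            break
        result.append(word)

        # Feed the predicted word back in as the next input
        dec_input = tf.expand_dims([predicted_id], 0)

    # Without an '<end>' prediction, the loop stops at max_length words
    return ' '.join(result)

# Example usage (hypothetical): caption = predict_caption(all_img_names[0])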

The value returned by this function can be directly interpreted as the caption generated by the model.

4 — Conclusion

Depending on the number of layers, such an architecture can easily involve millions of parameters, making it quite cost-intensive to train. However, it is a reliable model that has performed very satisfactorily on this task and can be customized to approach human performance. The code and explanations in this article are aimed at an A.I. enthusiast who has some prior knowledge of the topic without being a state-of-the-art expert. All elements required for replication can be found in our team repository, where all the code is available.

The next articles, Part 4 and Part 5, will respectively cover how to deploy the model obtained throughout this process online and how to interpret the results predicted by this model.

