Part 2: Exploring and Engineering X-Ray Data Features
This is the second article in a five-part series on Using Computer Vision and NLP to Caption X-Rays.
The goal of this project is to measure the similarity of machine-predicted captions to the actual captions provided by doctors. Our process has been broken down into the following topics:
- Part 1: Cleaning and Pre-processing X-Ray Data
- Part 2: Exploring and Engineering X-Ray Data Features
- Part 3: Creating a Caption Generating Model Using CNNs and RNNs
- Part 4: Deploying the Model to Serve X-Ray Diagnosis in Production
- Part 5: Interpreting Machine-Predicted X-Ray Captions and Concluding Remarks
The code is hosted and usable at this GitHub repository.
Images
In the dataset, we have 7,426 X-ray images but only 3,826 reports. This means that some reports have more than one image associated with them; in other words, a patient may have had more than a single X-ray taken. After exploring, we found that most patients have two X-ray images (one frontal and one lateral):
Images per patient:
- 2 images: 3,197 patients
- 1 image: 435 patients
- 3 images: 180 patients
- 4 images: 13 patients
- 5 images: 1 patient
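This distribution can be reproduced with a quick pandas tally. The sketch below assumes an image-level metadata table with a `report_id` column linking each image to its report; the column names and sample rows are illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical metadata: one row per X-ray image, linked to its report.
metadata = pd.DataFrame({
    "image_file": ["p1_front.png", "p1_lat.png", "p2_front.png",
                   "p3_front.png", "p3_lat.png"],
    "report_id": [1, 1, 2, 3, 3],
})

# Number of images attached to each report, then how often each count occurs.
images_per_report = metadata.groupby("report_id").size()
distribution = images_per_report.value_counts().sort_index()
print(distribution)
```

On the real dataset, `distribution` would reproduce the table above (3,197 reports with 2 images, and so on).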
To better understand how the X-ray images look, we generated the average image for frontal and lateral views; the process of extracting average images was adapted from Byeon (2020).
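A minimal sketch of the averaging step, following the idea in Byeon (2020): resize every image to a common shape, stack them, and take the pixel-wise mean. The function name, file paths, and target size here are placeholders, not the project's exact implementation:

```python
import numpy as np
from PIL import Image

def average_image(paths, size=(256, 256)):
    """Pixel-wise mean of a set of grayscale X-rays resized to `size`."""
    stack = np.stack([
        np.asarray(Image.open(p).convert("L").resize(size), dtype=np.float64)
        for p in paths
    ])
    return stack.mean(axis=0).astype(np.uint8)

# avg_frontal = average_image(frontal_paths)  # frontal_paths is illustrative
```

Running this separately on the frontal and lateral subsets yields one average image per view.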
Captions
Next, let us dive deeper and try to understand the features of the written reports. As mentioned in the first part, each report has two diagnoses (impressions and findings). We checked the length of the impressions and findings for each report:
According to figure 3, we can say that:
- A majority of the reports for findings contain between 10 and 60 words, with a significant number containing fewer than 10.
- A majority of the reports for impression contain fewer than 30 words, with the bulk containing fewer than 10.
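The length check itself is a one-liner per column, assuming the reports live in a DataFrame with `findings` and `impression` text columns (the column names and sample strings below are illustrative):

```python
import pandas as pd

reports = pd.DataFrame({
    "findings": ["the lungs are clear", "no acute disease"],
    "impression": ["normal chest", "no acute cardiopulmonary abnormality"],
})

# Word count per report: split on whitespace and count tokens.
for col in ["findings", "impression"]:
    lengths = reports[col].str.split().str.len()
    print(col, "mean length:", lengths.mean())
```

Plotting these `lengths` series as histograms produces figures like figure 3.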
Later, the diagnoses (impressions and findings) were merged to provide a richer representation for each report; we call this new representation the caption.
According to figure 4, we can see that most of the captions contain between 10–60 words, and the average length of a caption is around 40 words.
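The merge is simple string concatenation followed by the same length check; the column names and sample strings are again illustrative:

```python
import pandas as pd

reports = pd.DataFrame({
    "findings": ["the lungs are clear", "no acute disease"],
    "impression": ["normal chest", "no acute cardiopulmonary abnormality"],
})

# Caption = findings + impression, as described above.
reports["caption"] = reports["findings"] + " " + reports["impression"]
avg_len = reports["caption"].str.split().str.len().mean()
print("average caption length:", avg_len)
```

On the real reports, this average comes out to roughly 40 words.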
So far, we have gotten a high-level understanding of what the reports look like; however, we don’t know the content of those reports. To understand that, we generated a word cloud, which gives a visual representation of the words that appear most frequently.
Based on the word cloud above, most of the prominent words (e.g., normal, thoracic, spine, effusion, disease, etc.) are meaningful in the context of the description of X-ray images, and we would expect these to be the main aspects of any medical report.
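Under the hood, a word cloud is driven by word frequencies: the counting step can be sketched with the standard library (the caption strings below are illustrative), after which a rendering library such as the third-party `wordcloud` package draws the more frequent words larger:

```python
from collections import Counter

captions = [
    "no acute cardiopulmonary abnormality",
    "the lungs are clear no acute disease",
]

# Frequency of every word across all captions; the most common words
# become the most prominent in the rendered cloud.
word_counts = Counter(word for cap in captions for word in cap.split())
print(word_counts.most_common(3))
```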
Images & Captions
Below are sample X-ray images and the report for a given patient:
Findings: there is scattered calcified granulomas. the lungs are otherwise grossly clear. cardiac and mediastinal silhouettes are normal. pulmonary vasculature is normal. no pneumothorax or pleural effusion. no acute bony abnormality
Impressions: no acute cardiopulmonary abnormality
Caption: Findings + Impressions
We expect that the final output of the model should look like the one above.
References
- Eunjoo Byeon. (2020, September 11). Exploratory Data Analysis Ideas for Image Classification. Medium; Towards Data Science. Retrieved from https://towardsdatascience.com/exploratory-data-analysis-ideas-for-image-classification-d3fc6bbfb2d2