Part 2: Exploring and Engineering X-Ray Data Features
This is the second article in a five-part series on Using Computer Vision and NLP to Caption X-Rays.
The goal of this project is to measure the similarity of machine-predicted captions to the actual captions provided by doctors. Our process has been broken down into the following topics:
- Part 1: Cleaning and Pre-processing X-Ray Data
- Part 2: Exploring and Engineering X-Ray Data Features
- Part 3: Creating a Caption Generating Model Using CNNs and RNNs
- Part 4: Deploying the Model to Serve X-Ray Diagnosis in Production
- Part 5: Interpreting Machine-Predicted X-Ray Captions and Concluding Remarks
The code is hosted and usable at this GitHub repository.
Images
In the dataset, we have 7,426 X-ray images but only 3,826 reports. This means that some reports have more than one image associated with them; in other words, a patient may have had more than a single X-ray taken. After exploring, we found that most patients have two X-ray images (one frontal and one lateral):
Images per patient:
- 2 images: 3,197 patients
- 1 image: 435 patients
- 3 images: 180 patients
- 4 images: 13 patients
- 5 images: 1 patient
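This distribution can be reproduced with a quick pandas tally. The sketch below assumes an image-level metadata table with a `report_id` column linking each image to its report; the column names and sample rows are illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical metadata: one row per X-ray image, linked to its report.
metadata = pd.DataFrame({
    "image_file": ["p1_front.png", "p1_lat.png", "p2_front.png",
                   "p3_front.png", "p3_lat.png"],
    "report_id": [1, 1, 2, 3, 3],
})

# Number of images attached to each report, then how often each count occurs.
images_per_report = metadata.groupby("report_id").size()
distribution = images_per_report.value_counts().sort_index()
print(distribution)
```

On the real dataset, `distribution` would reproduce the table above (3,197 reports with 2 images, and so on).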
To better understand how the X-ray images look, we generated the average image for frontal and lateral views; the process of extracting average images was adapted from Byeon (2020).
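A minimal sketch of the averaging step, following the idea in Byeon (2020): resize every image to a common shape, stack them, and take the pixel-wise mean. The function name, file paths, and target size here are placeholders, not the project's exact implementation:

```python
import numpy as np
from PIL import Image

def average_image(paths, size=(256, 256)):
    """Pixel-wise mean of a set of grayscale X-rays resized to `size`."""
    stack = np.stack([
        np.asarray(Image.open(p).convert("L").resize(size), dtype=np.float64)
        for p in paths
    ])
    return stack.mean(axis=0).astype(np.uint8)

# avg_frontal = average_image(frontal_paths)  # frontal_paths is illustrative
```

Running this separately on the frontal and lateral subsets yields one average image per view.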
Captions
Next, let us dive deeper and try to understand the features of the written reports. As mentioned in the first part, each report has two diagnoses (impressions and findings). We checked the length of the impressions and findings for each report:
According to figure 3, we can say that:
- A majority of the reports for findings contain between 10 and 60 words, with a significant number containing fewer than 10.
- A majority of the reports for impression contain fewer than 30 words, with the bulk containing fewer than 10.
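The length check itself is a one-liner per column, assuming the reports live in a DataFrame with `findings` and `impression` text columns (the column names and sample strings below are illustrative):

```python
import pandas as pd

reports = pd.DataFrame({
    "findings": ["the lungs are clear", "no acute disease"],
    "impression": ["normal chest", "no acute cardiopulmonary abnormality"],
})

# Word count per report: split on whitespace and count tokens.
for col in ["findings", "impression"]:
    lengths = reports[col].str.split().str.len()
    print(col, "mean length:", lengths.mean())
```

Plotting these `lengths` series as histograms produces figures like figure 3.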
Later, the diagnoses (impressions and findings) were merged to provide a richer representation for each report; we call this new representation the caption.
According to figure 4, we can see that most of the captions contain between 10–60 words, and the average length of a caption is around 40 words.
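The merge is simple string concatenation followed by the same length check; the column names and sample strings are again illustrative:

```python
import pandas as pd

reports = pd.DataFrame({
    "findings": ["the lungs are clear", "no acute disease"],
    "impression": ["normal chest", "no acute cardiopulmonary abnormality"],
})

# Caption = findings + impression, as described above.
reports["caption"] = reports["findings"] + " " + reports["impression"]
avg_len = reports["caption"].str.split().str.len().mean()
print("average caption length:", avg_len)
```

On the real reports, this average comes out to roughly 40 words.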
So far, we have gotten a high-level understanding of what the reports look like; however, we don’t know the content of those reports. To understand that, we generated a word cloud, which gives a visual representation of the words that appear most frequently.
Based on the word cloud above, most of the prominent words (e.g., normal, thoracic, spine, effusion, disease, etc.) are meaningful in the context of the description of X-ray images, and we would expect these to be the main aspects of any medical report.
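Under the hood, a word cloud is driven by word frequencies: the counting step can be sketched with the standard library (the caption strings below are illustrative), after which a rendering library such as the third-party `wordcloud` package draws the more frequent words larger:

```python
from collections import Counter

captions = [
    "no acute cardiopulmonary abnormality",
    "the lungs are clear no acute disease",
]

# Frequency of every word across all captions; the most common words
# become the most prominent in the rendered cloud.
word_counts = Counter(word for cap in captions for word in cap.split())
print(word_counts.most_common(3))
```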
Images & Captions
Below are sample X-ray images and the report for a given patient:
Findings: there is scattered calcified granulomas. the lungs are otherwise grossly clear. cardiac and mediastinal silhouettes are normal. pulmonary vasculature is normal. no pneumothorax or pleural effusion. no acute bony abnormality
Impressions: no acute cardiopulmonary abnormality
Caption: Findings + Impressions
We expect that the final output of the model should look like the one above.
References
- Eunjoo Byeon. (2020, September 11). Exploratory Data Analysis Ideas for Image Classification. Medium; Towards Data Science. Retrieved from https://towardsdatascience.com/exploratory-data-analysis-ideas-for-image-classification-d3fc6bbfb2d2