Part 1: Cleaning and Pre-processing Chest X-Ray Data

Nov 5, 2021

This is the first article in a five-part series on Using Computer Vision and NLP to Caption X-Rays.

The goal of this project is to measure the similarity of machine-predicted captions to actual captions provided by doctors. Our process has been broken down into the following topics:

Part 1: Cleaning and Pre-processing X-Ray Data
Part 2: Exploring and Engineering X-Ray Data Features
Part 3: Creating a Caption Generating Model Using CNNs and RNNs
Part 4: Deploying the Model to Serve X-Ray Diagnosis in Production
Part 5: Interpreting Machine-Predicted X-Ray Captions and Concluding Remarks

The code is hosted and useable at this GitHub repository.

Introduction

The performance of a model is only as good as the data it is fed. This post outlines our approach to extracting, preprocessing, and splitting the Chest X-Rays Indiana University dataset in preparation for future modeling.

Extracting and accessing the data
Importing and merging the dataset
Data cleaning and preprocessing
Splitting the data (frontal/lateral and training/test)

1. Extracting and accessing the data

The Chest X-Rays (Indiana University) dataset for this modeling problem was obtained from Kaggle. It consists of two components:

indiana_projections.csv: The dataset containing the X-ray images classified as Frontal or Lateral
indiana_reports.csv: The dataset for the imaging reports/doctor’s diagnoses

2. Importing & merging the dataset

Both datasets were read into pandas data frames and merged on the uid column. Specifically, we performed a left merge with the reports data frame as the left table, thus retaining only images that had reports tied to them:
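A minimal sketch of this merge, using two tiny in-memory tables as stand-ins for the CSV files (in practice each would be loaded with pd.read_csv). The uid, projection, findings, and impression columns come from the dataset as described here; the filename column and all row values are illustrative assumptions.

```python
import pandas as pd

# Stand-in for indiana_reports.csv (one row per report).
reports = pd.DataFrame({
    "uid": [1, 2],
    "findings": ["The lungs are clear.", None],
    "impression": ["Normal chest.", "No acute disease."],
})

# Stand-in for indiana_projections.csv (one row per image).
# The "filename" column and its values are hypothetical.
projections = pd.DataFrame({
    "uid": [1, 1, 2, 3],
    "filename": ["1_a.png", "1_b.png", "2_a.png", "3_a.png"],
    "projection": ["Frontal", "Lateral", "Frontal", "Frontal"],
})

# Left merge with reports on the left: every report is kept and is
# repeated once per matching image; images without a report (uid 3) drop out.
merged = reports.merge(projections, on="uid", how="left")
print(merged.shape)
```

Because uid 1 has both a frontal and a lateral image, its report appears twice in the merged table, which is the row duplication noted in the preview.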

A preview of the dataset after merging. An individual can have multiple images (e.g., frontal and lateral); hence, the duplicated rows.

The eventual dataset (a snippet of which is shown above) has 7466 rows and 10 columns.

3. Data cleaning & preprocessing

As our goal is to caption X-ray images with relevant findings, the following columns (which correspond to the doctor’s diagnoses) serve as our primary outcome variables:

findings: Doctor’s findings from x-ray.
impression: Final impression made by the doctor.

a. Dropping columns and rows

Our first step was to drop columns that were deemed not to add anything significant to future analysis (see code cell below):

image: Mostly contained what was stated more succinctly in projection
indication: Mostly repeated information already present in findings and impression
comparison: High missingness. It seemed to refer to prior studies that are external to, and unavailable from, the dataset.

Aside from columns, we also dropped rows that were missing both outcome variables (findings and impression), as these have no diagnoses to train on.
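The column and row drops described above can be sketched as follows. The toy data frame is an assumption made for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the merged data frame; the values are made up.
df = pd.DataFrame({
    "uid": [1, 2, 3],
    "image": ["Xray Chest PA and Lateral"] * 3,
    "indication": ["chest pain", None, "cough"],
    "comparison": [None, None, "None."],
    "findings": ["The lungs are clear.", None, None],
    "impression": ["Normal chest.", "No acute disease.", None],
})

# Drop the columns judged not useful for later analysis.
df = df.drop(columns=["image", "indication", "comparison"])

# how="all" removes a row only when BOTH outcome columns are missing,
# so a row with just one of the two diagnoses is kept.
df = df.dropna(subset=["findings", "impression"], how="all").reset_index(drop=True)
print(df)
```

Here uid 3 is removed because it has neither findings nor an impression, while uid 2 survives with an impression only.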

b. Merging findings and impression to create a richer caption

After dropping unimportant rows and columns, our next step was to merge the two outcome variables to provide a richer representation called caption.

This gives an exhaustive final report on the X-ray image by merging the two diagnoses as shown below:

Caption column generation. Image by author

Note: For rows that had a NaN for either outcome variable, this resulted in a ‘nan’ string in the generated caption. This issue is addressed in the next section.
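One plausible way the caption could be built, and how the ‘nan’ string creeps in, is sketched below. The exact concatenation used in the project is not shown in this post, so the separator and the string conversion are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "findings": ["The lungs are clear.", np.nan],
    "impression": ["Normal chest.", "No acute disease."],
})

# Formatting a missing value as text yields the literal string "nan",
# which is why the caption needs a clean-up pass afterwards.
df["caption"] = [
    f"{findings} {impression}"
    for findings, impression in zip(df["findings"], df["impression"])
]
print(df["caption"].tolist())
```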

c. Text preprocessing

After creating the caption column, we cleaned and prepared the data before feeding it into our model.

Several columns had various issues such as excess whitespace, special characters, and placeholders (as shown below):

Example 1: “Normal chest x-XXXX.”

Example 2: “Cardiomegaly/borderline; Pulmonary Artery/enlarged”

To resolve these, we performed the following steps:

Removed special characters and placeholders (e.g., XXXX-year-old and XXXX) using regular expressions
Removed the string ‘nan’ as mentioned above from the caption column
Removed numbering in impression and findings
Cleared excess whitespace between words and punctuation marks
Removed leading and trailing commas and full stops
Converted all letters into lower case
Replaced contractions, e.g., couldn’t was changed to could not
Finally, we removed all punctuation except full stops and commas. We retained these because they serve as separators for sentences, symptoms, etc.
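The steps above can be sketched as a single function built on Python's re module. This is an illustration rather than the project's actual code: the function name, the tiny contraction map, the regular expressions, and the order of operations are all assumptions.

```python
import re

# Hypothetical, deliberately small contraction map; a real one would be larger.
CONTRACTIONS = {
    "couldn't": "could not",
    "can't": "cannot",
    "doesn't": "does not",
}

def clean_caption(text: str) -> str:
    """Apply the cleaning steps listed above to one caption string."""
    text = text.lower().replace("’", "'")                  # lower case, plain apostrophes
    for short, full in CONTRACTIONS.items():               # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"\bnan\b", " ", text)                   # stray 'nan' strings
    text = re.sub(r"\b\d+\.\s", " ", text)                 # numbering such as "1. "
    text = re.sub(r"\bx{2,}\b(?:-year-old)?", " ", text)   # XXXX placeholders
    text = re.sub(r"[^a-z0-9.,\s]", " ", text)             # punctuation except . and ,
    text = re.sub(r"\s+([.,])", r"\1", text)               # no space before . or ,
    text = re.sub(r"\s+", " ", text)                       # collapse whitespace
    return text.strip(" .,")                               # trim edge commas and full stops

print(clean_caption("1. Cardiomegaly/borderline; Pulmonary Artery/enlarged"))
print(clean_caption("nan  The heart is normal , lungs couldn't be assessed.."))
```

Lower-casing first keeps every later pattern simple, since the placeholder and ‘nan’ rules then only need to match one case. The numbering rule is intentionally naive and would also strip a sentence-ending digit followed by a full stop.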

We also handled missing values in the findings and impression columns using the following process:

If MeSH and Problems were recorded as ‘normal,’ fill the findings NA with ‘no unusual findings.’
In all other cases, fill the remaining NA values in the findings and impression columns with ‘no findings’ and ‘no impression,’ respectively.
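A minimal sketch of these two rules in pandas, run on made-up rows. The case-insensitive match on ‘normal’ is an assumption about how the MeSH and Problems values are compared.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are illustrative only.
df = pd.DataFrame({
    "MeSH": ["normal", "Cardiomegaly/mild", "normal"],
    "Problems": ["normal", "Cardiomegaly", "normal"],
    "findings": [np.nan, np.nan, "the lungs are clear"],
    "impression": ["no acute disease", np.nan, np.nan],
})

# Rule 1: reports labelled normal in both MeSH and Problems.
is_normal = (
    df["MeSH"].str.lower().eq("normal")
    & df["Problems"].str.lower().eq("normal")
)
df.loc[is_normal & df["findings"].isna(), "findings"] = "no unusual findings"

# Rule 2: everything still missing gets a generic filler.
df["findings"] = df["findings"].fillna("no findings")
df["impression"] = df["impression"].fillna("no impression")
print(df[["findings", "impression"]])
```

The order matters: the more specific ‘no unusual findings’ filler has to be applied before the generic fillna, otherwise it would never have anything left to fill.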

Below are some examples of the text after cleaning:

A random sample of observations from different columns after cleaning

4. Splitting the dataset for future analysis

Finally, the cleaned dataset was split into frontal and lateral (image) data frames, and each of these was then split into training and test sets.

Note: Given that an individual can have multiple records, all contributing to the same doctor diagnosis, we split the dataset based on the unique person uid to avoid data leakage between the training and test sets!
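A uid-level split can be sketched like this. The 20% test share, the random seed, and the toy rows are assumptions; the post does not state the actual split ratio or the tooling used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned data frame.
df = pd.DataFrame({
    "uid": [1, 1, 2, 2, 3, 4, 4, 5],
    "projection": ["Frontal", "Lateral", "Frontal", "Lateral",
                   "Frontal", "Frontal", "Lateral", "Frontal"],
})

# Shuffle the unique uids once, then send whole uids to train or test,
# so no person contributes rows to both sides.
rng = np.random.default_rng(42)               # arbitrary seed
uids = rng.permutation(df["uid"].unique())
n_test = max(1, int(round(0.2 * len(uids))))  # assumed 20% test share
is_test = df["uid"].isin(uids[:n_test])

train, test = df[~is_test], df[is_test]

# Separate frontal and lateral views within each side of the split.
frontal_train = train[train["projection"] == "Frontal"]
lateral_train = train[train["projection"] == "Lateral"]
frontal_test = test[test["projection"] == "Frontal"]
lateral_test = test[test["projection"] == "Lateral"]
```

Splitting on uid first and filtering by projection afterwards means a person's frontal and lateral images always land on the same side, which a row-level random split would not guarantee.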

The data is now ready for future modeling and analyses! See Part 2: Exploring and Engineering X-Ray Data Features.

- Data cleaning and preprocessing code
- Full Project GitHub Repo

References

Raddar. (2020). Chest X-rays (Indiana University). Kaggle.com.