Part 1: Cleaning and Pre-processing Chest X-Ray Data
This is the first article in a five-part series on Using Computer Vision and NLP to Caption X-Rays
The goal of this project is to measure the similarity of machine-predicted captions to actual captions provided by doctors. Our process has been broken down into the following topics:
- Part 1: Cleaning and Pre-processing X-Ray Data
- Part 2: Exploring and Engineering X-Ray Data Features
- Part 3: Creating a Caption Generating Model Using CNNs and RNNs
- Part 4: Deploying the Model to Serve X-Ray Diagnosis in Production
- Part 5: Interpreting Machine-Predicted X-Ray Captions and Concluding Remarks
The code is hosted and useable at this GitHub repository.
Introduction
The performance of a model is only as good as the data it is fed. This post outlines our approach to extracting, preprocessing, and splitting the Chest X-Rays Indiana University dataset in preparation for future modeling.
Table of contents
- Extracting and accessing the data
- Importing and merging the dataset
- Data cleaning and preprocessing
- Splitting the data (frontal/lateral and training/test)
1. Extracting and accessing the data
The Chest X-Rays (Indiana University) dataset for this modeling problem was obtained from Kaggle. It consists of two components:
- indiana_projections.csv: The dataset containing the X-ray images classified as Frontal or Lateral
- indiana_reports.csv: The dataset for the imaging reports/doctor’s diagnoses
2. Importing & merging the dataset
Both datasets were read into pandas data frames and merged on the uid column. Specifically, we performed a left merge with the reports data frame as the left table, thus retaining only images that had reports tied to them:
The resulting dataset (a snippet of which is shown above) has 7466 rows and 10 columns.
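The merge can be sketched on a toy example. This is a minimal illustration, not the project's actual code: the two small data frames stand in for indiana_reports.csv and indiana_projections.csv, their values are invented, and the filename column is an assumption about the projections file.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files; all values are invented.
reports = pd.DataFrame({
    "uid": [1, 2, 3],
    "findings": ["Heart size normal.", None, "Lungs are clear."],
    "impression": ["No acute disease.", "Normal chest.", None],
})
projections = pd.DataFrame({
    "uid": [1, 1, 2, 4],
    "filename": ["1_a.png", "1_b.png", "2_a.png", "4_a.png"],  # assumed column name
    "projection": ["Frontal", "Lateral", "Frontal", "Frontal"],
})

# Left merge with reports on the left: every report is kept, and an image
# survives only if its uid also appears in the reports table.
merged = reports.merge(projections, on="uid", how="left")

print(merged.shape)  # (4, 5)
```

Here the image with uid 4 is dropped because it has no report, while the report with uid 3 is kept with missing image fields because it has no image. Whether such image-less reports exist in the real data is not something this sketch establishes.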
3. Data cleaning & preprocessing
As our goal is to caption X-ray images with relevant findings, the following columns (which correspond to the doctor’s diagnoses) serve as our primary outcome variables:
- findings: Doctor’s findings from the X-ray.
- impression: Final impression made by the doctor.
a. Dropping columns and rows
Our first step was to drop columns that we judged would add little to future analysis (see code cell below):
- image: Mostly contained what was stated more succinctly in projection
- indication: Mostly composed of information from findings and impression
- comparison: High missingness. It seemed to refer to data external to, and unavailable from, the dataset.
Aside from columns, we also dropped rows that were missing both outcome variables (findings and impression), as these have no diagnoses to train on.
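A minimal sketch of these two drops, assuming the merged data frame uses the column names listed above (the rows here are invented for illustration):

```python
import pandas as pd

# Invented rows that mimic the shape of the merged data frame.
df = pd.DataFrame({
    "uid": [1, 2, 3],
    "image": ["Xray Chest PA and Lateral"] * 3,
    "indication": ["Chest pain", None, "Cough"],
    "comparison": [None, None, "None."],
    "findings": ["Lungs are clear.", None, None],
    "impression": ["No acute disease.", None, "Normal chest."],
})

# Drop the columns judged uninformative for captioning.
df = df.drop(columns=["image", "indication", "comparison"])

# how="all" removes a row only when BOTH outcome columns are missing;
# a row with just one of them present is kept.
df = df.dropna(subset=["findings", "impression"], how="all").reset_index(drop=True)
```

Only the row with uid 2 is removed, since it is the only one with neither findings nor an impression.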
b. Merging findings and impression to create a richer caption
After dropping unimportant rows and columns, our next step was to merge the two outcome variables into a richer representation called caption.
This gives an exhaustive final report on the X-ray image by merging the two diagnoses as shown below:
Note: For rows that had a NaN for either outcome variable, this resulted in a ‘nan’ string in the generated caption. This issue is addressed in the next section.
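To see where the ‘nan’ string comes from, here is one way the concatenation could be written. The exact expression used in the project is not shown in this article, so treat this as an assumption; the point is only that formatting a missing value as text yields the literal word "nan".

```python
import pandas as pd

df = pd.DataFrame({
    "findings": ["Lungs are clear.", float("nan")],
    "impression": ["No acute disease.", "Normal chest."],
})

# String formatting turns a missing value (NaN) into the text "nan",
# which then ends up inside the caption.
df["caption"] = [
    f"{findings} {impression}"
    for findings, impression in zip(df["findings"], df["impression"])
]

print(df["caption"].tolist())
# ['Lungs are clear. No acute disease.', 'nan Normal chest.']
```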
c. Text preprocessing
After creating the caption column, we cleaned and prepared the data before feeding it into our model.
Several columns had various issues such as excess whitespace, special characters, and placeholders (as shown below):
Example 1: “Normal chest x-XXXX.”
Example 2: “Cardiomegaly/borderline; Pulmonary Artery/enlarged”
To resolve these, we performed the following steps:
- Removed special characters and placeholders (e.g., XXXX-year-old and XXXX) using regular expressions
- Removed the string ‘nan’ as mentioned above from the caption column
- Removed numbering in impression and findings
- Cleared excess whitespace between words and punctuation
- Removed leading and trailing commas and full stops
- Converted all letters to lower case
- Replaced contractions, e.g., couldn’t was changed to could not
- Finally, we removed all punctuation except full stops and commas. We retain full stops and commas as these serve as separators for sentences, symptoms, etc.
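The steps above can be sketched as a single function using only Python's re module. The function name clean_caption, the contraction map, and the regular expressions are all illustrative assumptions, and the steps are applied in an order that is convenient for the code rather than the order listed.

```python
import re

# Hypothetical, deliberately small contraction map; a real one would be longer.
CONTRACTIONS = {"couldn't": "could not", "doesn't": "does not", "isn't": "is not"}


def clean_caption(text: str) -> str:
    # Lower-case and normalise curly apostrophes so contractions match.
    text = text.lower().replace("\u2019", "'")
    # Expand contractions before apostrophes are stripped as punctuation.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop anonymisation placeholders such as "xxxx" and "xxxx-year-old".
    text = re.sub(r"\bxxxx(?:-year-old)?\b", " ", text)
    # Drop the literal word "nan" left behind by missing values.
    text = re.sub(r"\bnan\b", " ", text)
    # Drop list numbering such as "1." or "2."; note this simple pattern
    # would also strip a number that happens to end a sentence.
    text = re.sub(r"(?:^|\s)\d+\.(?=\s)", " ", text)
    # Replace everything except letters, digits, whitespace, "." and ",".
    text = re.sub(r"[^a-z0-9\s.,]", " ", text)
    # Collapse runs of whitespace, then remove spaces before punctuation.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([.,])", r"\1", text)
    # Trim leading and trailing spaces, commas and full stops.
    return text.strip(" .,")


print(clean_caption("Cardiomegaly/borderline; Pulmonary Artery/enlarged"))
# cardiomegaly borderline pulmonary artery enlarged
```

Applied to a caption such as "nan Normal chest.", the same function returns "normal chest", which is how the ‘nan’ issue from the previous section is resolved in this sketch.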
We also handled missing values in the findings and impression columns using the following process:
- If MeSH and Problems were recorded as ‘normal,’ fill the findings NA with ‘no unusual findings.’
- In other cases, mainly corresponding to the findings and impression columns, fill the NA with ‘no findings’ and ‘no impression,’ respectively.
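A minimal sketch of this fill logic on invented rows. It assumes "recorded as normal" means that both MeSH and Problems equal the word "normal", ignoring case; the real data may need a looser match.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "MeSH": ["normal", "Cardiomegaly", "normal"],
    "Problems": ["normal", "Cardiomegaly", "normal"],
    "findings": [None, None, "Lungs are clear."],
    "impression": ["Normal chest.", "Enlarged heart.", None],
})

# Rows where both label columns say "normal" (assumed exact match).
is_normal = df["MeSH"].str.lower().eq("normal") & df["Problems"].str.lower().eq("normal")

# Missing findings on a normal study get the more informative filler.
df.loc[is_normal & df["findings"].isna(), "findings"] = "no unusual findings"

# Everything still missing gets the generic fillers.
df["findings"] = df["findings"].fillna("no findings")
df["impression"] = df["impression"].fillna("no impression")
```

The order matters: the ‘no unusual findings’ rule has to run before the generic fill, otherwise there would be no missing findings left for it to act on.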
Below are some examples of the text after cleaning:
4. Splitting the dataset for future analysis
Finally, the cleaned dataset was split into frontal and lateral (image) data frames and then into training and test sets for each.
Note: Because an individual can have multiple records, all contributing to the same doctor’s diagnosis, we split the dataset on the unique person uid to avoid data leakage between the training and test sets!
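One way to implement a uid-based split is to sample uids rather than rows, then carve out the frontal and lateral frames from each side. The 80/20 ratio, the random seed, and the toy rows below are assumptions for illustration; the article does not state the values actually used.

```python
import random

import pandas as pd

# Invented rows: several uids have both a frontal and a lateral image.
df = pd.DataFrame({
    "uid": [1, 1, 2, 2, 3, 4, 5, 5],
    "projection": ["Frontal", "Lateral", "Frontal", "Lateral",
                   "Frontal", "Frontal", "Frontal", "Lateral"],
})

# Shuffle the unique uids and reserve a fraction of them for testing,
# so all images of one person land on the same side of the split.
uids = sorted(df["uid"].unique().tolist())
random.Random(42).shuffle(uids)           # assumed seed
n_test = max(1, round(0.2 * len(uids)))   # assumed 80/20 split
test_uids = uids[:n_test]

is_test = df["uid"].isin(test_uids)
train, test = df[~is_test], df[is_test]

# Four frames keyed by (view, split), e.g. splits[("Frontal", "train")].
splits = {
    (view, name): part[part["projection"] == view]
    for view in ("Frontal", "Lateral")
    for name, part in (("train", train), ("test", test))
}
```

Using the same uid assignment for both views means a person's frontal image can never be in training while their lateral image is in test, which is the leakage the note above warns about.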
The data is now ready for future modeling and analyses! See Part 2: Exploring and Engineering X-Ray Data Features.