Part 1: Cleaning and Pre-processing Chest X-Ray Data

Korede Akande
4 min read · Nov 5, 2021


This is the first article in a five-part series on Using Computer Vision and NLP to Caption X-Rays

The goal of this project is to measure the similarity of machine-predicted captions to actual captions provided by doctors. Our process has been broken down into the following topics:

The code is hosted and useable at this GitHub repository.

Photo by JESHOOTS.COM on Unsplash

Introduction

The performance of a model is only as good as the data it is fed. This post outlines our approach to extracting, preprocessing, and splitting the Chest X-Rays Indiana University dataset in preparation for future modeling.

Table of contents

  1. Extracting and accessing the data
  2. Importing and merging the dataset
  3. Data cleaning and preprocessing
  4. Splitting the data (frontal/lateral and training/test)

1. Extracting and accessing the data

The Chest X-Rays (Indiana University) dataset for this modeling problem was obtained from Kaggle. It consists of two components:

2. Importing & merging the dataset

Both datasets were read into pandas data frames and merged on the uid column. Specifically, we performed a left merge with the reports data frame as the left table, thus retaining only images that had reports tied to them:
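A minimal sketch of this merge is shown below. It uses two tiny in-memory tables in place of the Kaggle files; the `filename` and `projection` columns and all values are assumptions made for illustration, and only the `uid` key and the left merge come from the description above.

```python
import pandas as pd

# Stand-in tables; in practice these would be read from the Kaggle CSV files.
# Column names other than `uid` are illustrative assumptions.
reports = pd.DataFrame({
    "uid": [1, 2, 3],
    "findings": ["Heart size normal.", None, "Lungs are clear."],
    "impression": ["No acute disease.", "Normal chest.", None],
})
projections = pd.DataFrame({
    "uid": [1, 1, 2, 4],
    "filename": ["1_a.png", "1_b.png", "2_a.png", "4_a.png"],
    "projection": ["Frontal", "Lateral", "Frontal", "Frontal"],
})

# Left merge with reports as the left table: every report is kept, and an
# image row survives only if its uid matches a report (uid 4 is dropped).
merged = reports.merge(projections, on="uid", how="left")
print(merged)
```

Note that a left merge also keeps reports with no matching image (uid 3 here), with missing values in the image columns, so such rows may need separate handling.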

A preview of the dataset after merging. An individual can have multiple images (e.g., frontal and lateral); hence, the duplicated rows.

The merged dataset (a snippet of which is shown above) has 7,466 rows and 10 columns.

3. Data cleaning & preprocessing

As our goal is to caption X-ray images with relevant findings, the following columns (which correspond to the doctor’s diagnoses) serve as our primary outcome variables:

  • findings: Doctor’s findings from the X-ray.
  • impression: Final impression made by the doctor.

a. Dropping columns and rows

Our first step was to drop columns that we judged would not add anything significant to future analysis:

  • image: Mostly contained what was stated more succinctly in projection
  • indication: Mostly comprised information already present in findings and impression
  • comparison: High missingness. It also appeared to refer to data that is external to, and unavailable in, the dataset.

Aside from columns, we also dropped rows that were missing both outcome variables (findings and impression), as these have no diagnoses to train on.
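The two drops described above could be sketched in pandas as follows. The column names are taken from the text; the sample rows are invented for illustration, and this is one plausible way to express the step rather than the exact code used in the project.

```python
import pandas as pd

# Invented sample rows using the column names mentioned in the article.
df = pd.DataFrame({
    "uid": [1, 2, 3],
    "image": ["Xray Chest PA and Lateral", "Chest, 2 views", "PA and lateral"],
    "indication": ["Chest pain", "Cough", None],
    "comparison": [None, None, "None."],
    "findings": ["Lungs are clear.", None, None],
    "impression": ["No acute disease.", "Normal chest.", None],
})

# Drop the three columns judged uninformative for captioning.
df = df.drop(columns=["image", "indication", "comparison"])

# Drop rows where BOTH outcome columns are missing (how="all");
# a row missing only one of them is kept.
df = df.dropna(subset=["findings", "impression"], how="all").reset_index(drop=True)
print(df)
```

Using `how="all"` rather than the default `how="any"` is what keeps rows that still have one usable diagnosis.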

b. Merging findings and impression to create a richer caption

After dropping unimportant rows and columns, our next step was to merge the two outcome variables to provide a richer representation called caption.

This gives an exhaustive final report on the X-ray image by merging the two diagnoses as shown below:

Caption column generation. Image by author

Note: For rows that had a NaN for either outcome variable, this resulted in a ‘nan’ string in the generated caption. This issue is addressed in the next section.
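To illustrate how the ‘nan’ string can arise, here is a small sketch that builds a caption by formatting both columns as text. The concatenation approach and the sample rows are assumptions; the project may have combined the columns differently.

```python
import pandas as pd

# Invented sample rows; the second one has a missing findings value.
df = pd.DataFrame({
    "findings": ["Lungs are clear.", float("nan")],
    "impression": ["No acute disease.", "Normal chest."],
})

# Formatting each value as text joins the two diagnoses, but a missing
# value (NaN) is rendered as the literal string "nan".
df["caption"] = [f"{f} {i}" for f, i in zip(df["findings"], df["impression"])]
print(df["caption"].tolist())
```

The second caption starts with the stray text "nan", which is why a later cleaning step has to strip it out.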

c. Text preprocessing

After creating the caption column, we cleaned and prepared the data before feeding it into our model.

Several columns had various issues such as excess whitespace, special characters, and placeholders (as shown below):

Example 1: “Normal chest x-XXXX.”

Example 2: “Cardiomegaly/borderline; Pulmonary Artery/enlarged”

To resolve these, we performed the following steps:

  • Removed special characters and placeholders (e.g., XXXX-year-old and XXXX) using regular expressions
  • Removed the string ‘nan’ as mentioned above from the caption column
  • Removed numbering in impression and findings
  • Cleared excess whitespace between words and punctuation marks
  • Removed leading and trailing commas and full stops.
  • Converted all letters into lower case
  • Replaced contractions, e.g., couldn’t was changed to could not
  • Finally, we removed all punctuation except full stops and commas. We retained these because they serve as separators for sentences, symptoms, etc.
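The steps above can be sketched as a single function built on Python's `re` module. The function name `clean_text`, the small contraction map, the exact patterns, and the order of operations are all assumptions made for illustration, so treat this as a starting point rather than the project's actual cleaning code.

```python
import re

# A small, illustrative contraction map; a fuller list would be needed in practice.
CONTRACTIONS = {"couldn't": "could not", "can't": "cannot", "doesn't": "does not"}


def clean_text(text: str) -> str:
    """Apply the cleaning steps listed above to one caption string."""
    # Normalise curly apostrophes and convert to lower case.
    text = text.replace("’", "'").lower()
    # Expand contractions before apostrophes are stripped.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove placeholders such as "xxxx-year-old" and "xxxx".
    text = re.sub(r"\bxxxx(?:-year-old)?\b", " ", text)
    # Remove the literal "nan" left over from missing values.
    text = re.sub(r"\bnan\b", " ", text)
    # Remove list numbering such as "1. " at the start or after whitespace.
    text = re.sub(r"(?:^|(?<=\s))\d{1,2}\.(?=\s)", "", text)
    # Replace everything except letters, digits, full stops, commas and spaces.
    text = re.sub(r"[^a-z0-9., ]", " ", text)
    # Collapse repeated whitespace, then remove spaces before punctuation.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([.,])", r"\1", text)
    # Strip leading and trailing spaces, commas and full stops.
    return text.strip(" .,")


print(clean_text("Cardiomegaly/borderline; Pulmonary Artery/enlarged"))
print(clean_text("1. XXXX-year-old male , normal heart size .  2. Couldn’t exclude pneumonia ."))
```

Lower-casing first keeps the later patterns simple, and contractions are expanded before punctuation removal because the apostrophe is needed to recognise them.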

We also handled missing values in the findings and impression columns using the following process:

  • If MeSH and Problems were both recorded as ‘normal,’ we filled missing findings with ‘no unusual findings.’
  • In all other cases, we filled missing values in the findings and impression columns with ‘no findings’ and ‘no impression,’ respectively.
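A hedged sketch of these two filling rules in pandas follows. The `MeSH`, `Problems`, `findings`, and `impression` column names come from the text; the sample rows and the case-insensitive comparison are assumptions.

```python
import pandas as pd

# Invented sample rows covering each rule.
df = pd.DataFrame({
    "MeSH": ["normal", "Cardiomegaly", "normal"],
    "Problems": ["normal", "Cardiomegaly", "normal"],
    "findings": [float("nan"), float("nan"), "lungs are clear"],
    "impression": ["normal chest", "enlarged heart", float("nan")],
})

# Rows labelled normal in both MeSH and Problems (compared case-insensitively).
is_normal = (
    df["MeSH"].str.lower().eq("normal") & df["Problems"].str.lower().eq("normal")
)

# Rule 1: a missing findings value on a normal study gets a specific filler.
df.loc[is_normal & df["findings"].isna(), "findings"] = "no unusual findings"

# Rule 2: any remaining gaps get the generic fillers.
df["findings"] = df["findings"].fillna("no findings")
df["impression"] = df["impression"].fillna("no impression")
print(df[["findings", "impression"]])
```

The specific rule is applied first so that the generic `fillna` only touches rows the first rule did not cover.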

Below are some examples of the text after cleaning:

A random sample of observations from different columns after cleaning

4. Splitting the dataset for future analysis

Finally, the cleaned dataset was split into frontal and lateral (image) data frames, and each of these was then split into training and test sets.

Note: Given that an individual can have multiple records, all contributing to the same doctor diagnosis, we split the dataset based on the unique person uid to avoid data leakage between the training and test sets!
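One way to sketch a uid-based split is shown below. The helper name `split_by_uid`, the 20% test fraction, the seed, and the `Frontal`/`Lateral` values in a `projection` column are assumptions for illustration; only the idea of splitting on unique uid comes from the note above.

```python
import random

import pandas as pd


def split_by_uid(df, test_frac=0.2, seed=42):
    """Split rows into (train, test) so that no uid appears in both."""
    uids = sorted(df["uid"].unique())
    random.Random(seed).shuffle(uids)
    n_test = max(1, round(len(uids) * test_frac))
    test_uids = uids[:n_test]
    in_test = df["uid"].isin(test_uids)
    return df[~in_test], df[in_test]


# Invented sample: uid 1 has two frontal images, the rest have one of each view.
df = pd.DataFrame({
    "uid": [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "projection": ["Frontal", "Frontal", "Lateral", "Frontal", "Lateral",
                   "Frontal", "Lateral", "Frontal", "Lateral", "Frontal", "Lateral"],
})

# Split by view first, then by uid within each view.
frontal = df[df["projection"] == "Frontal"]
lateral = df[df["projection"] == "Lateral"]
frontal_train, frontal_test = split_by_uid(frontal)
lateral_train, lateral_test = split_by_uid(lateral)
```

Because whole uids are assigned to one side, all images of a given person land together in either the training set or the test set for that view.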

The data is now ready for future modeling and analyses! See part 2 on Exploring and Engineering New X-Ray Data Features

Korede Akande

Data Scientist at Spotify. Bachelors in Computational Sciences from Minerva University