OSIC Pulmonary Fibrosis Progression

Purnima Chowrasia
Published in The Startup · 7 min read · Jan 29, 2021

Describing an approach that differs a little from the existing solutions.

Photo by Victor Garcia on Unsplash

Table of Contents:

  • Introduction
  • Mapping as an ML Problem
  • About the Data
  • Evaluation Metrics
  • Exploratory Data Analysis
  • Data Preparation and Modelling
  • Conclusion & Future Work
  • References

1. Introduction

Imagine that your breathing becomes consistently labored and shallow. After visiting the hospital, your doctor confirms that you are suffering from pulmonary fibrosis (PF). PF is a lung disease with no known cause and no known cure; it occurs when lung tissue becomes damaged and scarred. As the condition worsens, breathing becomes progressively more difficult, and the lung damage it causes is irreparable. The present approach to dealing with PF is to ease the symptoms with medication and therapy; for some patients, a lung transplant is also an option. If health experts knew the prognosis of the disease beforehand, they could formulate better treatment plans. The Open Source Imaging Consortium (OSIC), a non-profit organization, organized a competition on Kaggle to solve this progression-prediction task using machine learning techniques.

2. Mapping as an ML Problem

Given image data (CT scans of the lungs), metadata, and a baseline FVC measurement as input, the challenge is to develop ML techniques that predict how pulmonary fibrosis will progress over the upcoming weeks, i.e., to predict the FVC and a Confidence value for each future week.

3. About the Data

The data comes from the competition page on Kaggle. The files provided were:

  • train.csv - the training set, contains the full history of clinical information.
  • test.csv - the test set, contains only the baseline measurement.
  • train/ - this folder contains the training patients' baseline CT scan in DICOM format.
  • test/ - this folder contains the test patients' baseline CT scan in DICOM format.
  • sample_submission.csv - demonstrates the submission format.

Columns in train.csv and test.csv:

  • Patient - a unique ID for each patient (also the name of the patient's DICOM folder)
  • Weeks - the relative number of weeks pre/post the baseline CT (may be negative)
  • FVC - the recorded lung capacity in ml
  • Percent - a computed field that approximates the patient's FVC as a percentage of the typical FVC for a person with similar characteristics
  • Age - the patient's age
  • Gender - the patient's gender
  • SmokingStatus - the patient's smoking status (e.g., currently smokes, ex-smoker, never smoked)

Columns in sample_submission.csv:

  • Patient_Week - a unique Id formed by concatenating the Patient and Week columns (i.e. ABC_22 is a prediction for patient ABC at week 22)
  • FVC - the predicted FVC in ml
  • Confidence - a confidence value of your prediction (also has units of ml)
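As a sketch of this format, the Patient_Week IDs can be built by concatenating each patient ID with every target week. The -12..133 week range (146 weeks per patient) matches the 730 rows of the sample file for 5 test patients; the patient IDs and the placeholder FVC/Confidence values below are made up for illustration.

```python
import pandas as pd

# Hypothetical baseline table with two patients; real IDs are long hashes.
test = pd.DataFrame({"Patient": ["ABC", "XYZ"], "FVC": [2800, 3100]})

# One prediction row per patient per week: weeks -12 through 133 inclusive.
rows = [
    {"Patient_Week": f"{p}_{w}", "FVC": 2000, "Confidence": 100}
    for p in test["Patient"]
    for w in range(-12, 134)
]
submission = pd.DataFrame(rows, columns=["Patient_Week", "FVC", "Confidence"])
```

With 2 hypothetical patients this produces 2 × 146 = 292 rows in the required format.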

4. Evaluation Metrics

The metric used in this competition is a modified version of the Laplace Log Likelihood. You can read about the Laplace distribution from here.

σ_clipped = max(σ, 70)
Δ = min(|FVC_true − FVC_pred|, 1000)
metric = −(√2 · Δ) / σ_clipped − ln(√2 · σ_clipped)

Here σ is the standard deviation (the submitted Confidence value). Details about this evaluation metric can be found here.

About FVC: Forced Vital Capacity (FVC) is the amount of air that can be forcibly exhaled after a full inhalation, measured using a device known as a spirometer.
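As a minimal sketch, the modified Laplace Log Likelihood from the competition's evaluation page can be implemented as follows; σ is clipped below at 70 ml and the absolute error above at 1000 ml before scoring.

```python
import numpy as np

def laplace_log_likelihood(fvc_true, fvc_pred, sigma):
    """Modified Laplace Log Likelihood, averaged over all predictions."""
    sigma_clipped = np.maximum(sigma, 70)                   # sigma clipped below at 70 ml
    delta = np.minimum(np.abs(fvc_true - fvc_pred), 1000)   # error clipped at 1000 ml
    return np.mean(-np.sqrt(2) * delta / sigma_clipped
                   - np.log(np.sqrt(2) * sigma_clipped))

# A perfect prediction with sigma = 70 yields the best attainable score,
# -ln(70 * sqrt(2)) ≈ -4.595; larger errors or sigmas score lower (more negative).
best = laplace_log_likelihood(np.array([2800.0]), np.array([2800.0]), np.array([70.0]))
```

Note that because of the clipping, reporting a Confidence below 70 ml can never improve the score.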

5. Exploratory Data Analysis

  • The train.csv file has 1,549 rows and 7 columns, the test.csv file has 5 rows and 7 columns, and the sample_submission.csv file has 730 rows and 3 columns.
  • The train data contains records for 176 unique patients. Each patient appears a maximum of 10 times and a minimum of 6 times, and each record gives the patient's FVC value for a given week along with other information about that patient.
The Weeks attribute can take negative values, indicating an FVC measurement recorded before the baseline CT scan
FVC and Percent Distribution
  • While analyzing the DICOM data, I found that an unequal number of scans per patient is given in the DICOM image folder. The average number of CT scans per patient is 187.64.
Scans per Patient given in test folder
  • Every DICOM image comes with meta information attached, which can be accessed easily using the pydicom library, a Python package for reading and writing DICOM medical files. Using the function below, I extracted the metadata for each DICOM file:
Code to extract meta info from Dicom images
  • I also performed segmentation on the lung images. Segmentation involves several steps, and I took help from this Kaggle notebook. The steps are:
  1. Normalize the image.
  2. Cluster to separate the lungs from everything else.
  3. Threshold the image.
  4. Morphology - erosion followed by dilation.
  5. Label the different regions and mark them with different colors.
  6. Create a lung mask.
  7. Apply the mask to the original image to get the final masked image.
Dicom Image after segmentation
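The steps above can be sketched for a single slice as follows. This is a minimal version assuming NumPy, scikit-learn, and SciPy; the kernel sizes and the keep-regions-away-from-the-border heuristic are illustrative choices, not the exact values from the referenced notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy import ndimage

def make_lungmask(img):
    """Segment the lungs from one CT slice, following the steps above."""
    # 1. Normalize the image to zero mean, unit variance.
    img = (img - np.mean(img)) / (np.std(img) + 1e-8)

    # 2. K-means (k=2) on the central region separates lung (dark)
    #    from surrounding tissue and bone (bright).
    h, w = img.shape
    middle = img[h // 4: 3 * h // 4, w // 4: 3 * w // 4]
    km = KMeans(n_clusters=2, n_init=10).fit(middle.reshape(-1, 1))
    threshold = np.mean(km.cluster_centers_)

    # 3. Threshold: 1 where the pixel is darker than the cutoff (candidate lung).
    thresh_img = (img < threshold).astype(np.int8)

    # 4. Morphology: erosion removes small specks, dilation restores lung size.
    eroded = ndimage.binary_erosion(thresh_img, np.ones((4, 4)))
    dilated = ndimage.binary_dilation(eroded, np.ones((10, 10)))

    # 5./6. Label connected regions and keep those not touching the image
    #       border (outside-body air touches the border and is discarded).
    labels, n = ndimage.label(dilated)
    mask = np.zeros_like(img, dtype=np.int8)
    for region in range(1, n + 1):
        ys, xs = np.where(labels == region)
        if ys.min() > 0 and ys.max() < h - 1 and xs.min() > 0 and xs.max() < w - 1:
            mask[labels == region] = 1

    # 7. Apply the mask to the original (normalized) image.
    return mask * img
```

Running this over every slice of a patient's scan yields the masked images like the one shown above.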

6. Data Preparation and Modelling

The main challenge in preparing the data for the next step (creating a deep learning model) was the unequal number of CT scans per patient. In addition, each patient's FVC was recorded over several weeks, and the number of recorded weeks differs between patients. Because of these discrepancies in the number of DICOM images and metadata records, it is difficult to create one single model that fits all patients. Creating one model per patient is not feasible either, because we need a model that generalizes well over all patients and can be used on unseen data without any excess computational cost.

CNN Architecture used:

Before discussing how the data is fed into the model, I would like to discuss the architecture I used and how it differs from other existing solutions on the competition page. After skimming through the various deep learning models that other Kagglers tried, I learned that EfficientNet gave them good results, and some used deep CNN models with residual blocks in between. The architecture I created was inspired by this Kaggle notebook. Since I don't have much computational power, I decided to train my model inside the Kaggle environment and keep it relatively shallow (otherwise a ResourceExhausted error occurs).

Overall CNN architecture used

I created a total of three input branches (two branches feed image data and one feeds tabular data), rather than feeding just one image as one input and the tabular data as another. The reasoning behind using two images as inputs is that each patient has many DICOM files (some patients have hundreds), so why randomly select only one image to feed into the model? Feeding more images at once means more data for the model to learn from. Due to the lack of computational power, however, I could not sample 10-20 images and feed them all as input, so I created just one extra input branch through which one more DICOM image can be passed to the model (hmm.. something is better than nothing 🙃).

  • Below is the code defining the CNN architecture/model.
  • Below is the code defining a class that acts as a data generator. The main advantage of this kind of generator is that you don't have to fit the whole dataset in RAM while training: each time it is called, the generator produces a small batch of the required data and sends it to the model.
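A minimal sketch of the three-branch idea in Keras follows. The layer sizes, the 64×64 input resolution, and the four tabular features are illustrative assumptions, not the exact architecture from the original notebooks, and random arrays stand in for the real DICOM loading inside the generator.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

IMG_SHAPE = (64, 64, 1)  # downsampled CT slice; the real resolution is a design choice
N_TAB = 4                # e.g. baseline FVC, Percent, Age, encoded SmokingStatus

def image_branch(inp):
    """A small convolutional branch for one CT slice."""
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return x

img1 = layers.Input(IMG_SHAPE)   # first DICOM slice
img2 = layers.Input(IMG_SHAPE)   # second DICOM slice
tab = layers.Input((N_TAB,))     # tabular features

merged = layers.concatenate([
    image_branch(img1),
    image_branch(img2),
    layers.Dense(16, activation="relu")(tab),
])
out = layers.Dense(1)(merged)    # single regression output
model = Model(inputs=[img1, img2, tab], outputs=out)

class PatientSequence(tf.keras.utils.Sequence):
    """Yields small batches so the full dataset never has to sit in RAM."""
    def __init__(self, patients, batch_size=8):
        super().__init__()
        self.patients = patients
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.patients) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.patients[idx * self.batch_size:(idx + 1) * self.batch_size]
        n = len(batch)
        # In the real pipeline each entry would load two DICOM slices and the
        # tabular features for one patient; random arrays stand in here.
        x = [np.random.rand(n, *IMG_SHAPE),
             np.random.rand(n, *IMG_SHAPE),
             np.random.rand(n, N_TAB)]
        y = np.random.rand(n, 1)
        return x, y
```

The generator can be passed straight to `model.fit`, which pulls one batch at a time instead of materializing the whole training set.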

One essential point to mention here is that rather than predicting the FVC and Percent values directly (the main aim of this project), we predict the coefficient 'a', which is the slope of the line FVC = a*Week + intercept. The rest of the code and details about all the steps can be found in this GitHub repo here.
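With made-up numbers, fitting the per-patient target coefficient reduces to a one-degree least-squares fit of FVC against week:

```python
import numpy as np

def fit_slope(weeks, fvc):
    """Least-squares fit of FVC = a * week + intercept for one patient."""
    a, intercept = np.polyfit(weeks, fvc, deg=1)
    return a, intercept

# Hypothetical patient: FVC declining 10 ml per week from a 2800 ml baseline.
weeks = np.array([0, 9, 11, 13, 22, 33])
fvc = 2800 - 10 * weeks
a, b = fit_slope(weeks, fvc)
```

Once the model predicts 'a' for a new patient, the FVC at any future week is recovered as a * week + intercept, where the intercept comes from the baseline measurement.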

7. Conclusion & Future Work

After submitting my final approach on Kaggle, I got a score of -6.9034, which can be considered a good score given the computational power restriction. I tried to feed all the DICOM images to the model somehow, but it didn't work due to the unequal number of images per patient and the resource constraints. I also thought of trying an LSTM model, treating the FVC and Percent values as a time series since the FVC values were recorded over a period of time; the issue with this approach is that each time series is very short (some have 10 time steps and some have just 6). Finally, I decided to go with the approach of predicting the coefficient, which other Kagglers had already tried. As future work, a deeper CNN could be tried, and rather than feeding just two images through two input branches, multiple branches could be created to feed in more DICOM images.

8. References
