Nodulee Overview: Detecting early-stage lung cancer

Jovan Sardinha
Published in Weights and Biases
May 9, 2017

Nodulee is a collaboration between Ashish Malhotra and Jovan Sardinha.

Background

Lung cancer is one of the most common types of cancer (for both men and women) in Canada and the United States [1][2]. Most lung cancer cases are diagnosed at advanced stages (2–4), where the survival rate is only 17% [3]. Early detection is critical because it offers one of the best chances of survival.

Low-dose computed tomography (CT) screening has been shown to reduce the probability of death by 20% through early detection [4]. However, CT scans of early-stage cases produce high false positive rates of around 25% [5].

Source: http://www.datasciencebowl.com/

The Problem

Can we use machine learning to help radiologists identify signs of early-stage cancer and predict its probability with lower false positive rates?

Our Approach

The following diagram provides an overview of our approach:

Nodulee: Overall approach

A more detailed version of this approach will be provided in subsequent blog posts. All our code is publicly available here.

Technologies used

This solution was built in Python 3 using TensorFlow 1.0 (GPU enabled), pydicom, pylidc and XGBoost.

We used an NC12 instance on Azure: 12-core CPU, 2 x K80 GPU (1 physical card), 2 TB of hard disk space.

Datasets used

LIDC-IDRI dataset: consists of diagnostic and lung cancer screening CT scans with marked-up annotated lesions.

LUNA16 dataset: subset of the LIDC-IDRI data that was used for training the nodule identification model.

Data Science Bowl 2017 dataset: contains CT scans with probability of stage 1 cancer.

Evaluation Metric: How we measure success

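The main metric used to evaluate overall model performance was a two-class log loss, defined as:

$$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where, for each patient i,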

n : number of patients in the test set

y : truth label (1 for cancer, 0 for not cancer)

ŷ : predicted probability of the image belonging to a patient with cancer (a value between 0 and 1, where values near 1 indicate cancer)

log() is the natural (base e) logarithm
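The two-class log loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from our pipeline: the function name `log_loss`, the clipping constant `eps` (added so that a prediction of exactly 0 or 1 does not produce log(0)) and the sample labels and predictions are all assumptions made for the example.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Two-class log loss averaged over all patients.

    y_true: truth labels (1 for cancer, 0 for not cancer)
    y_pred: predicted probabilities of cancer, each in [0, 1]
    eps:    hypothetical clipping constant that keeps log() finite
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one patient is required")

    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the prediction away from exactly 0 and 1.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Made-up labels and predictions for four patients.
labels = [1, 0, 0, 1]
predictions = [0.9, 0.2, 0.1, 0.6]
score = log_loss(labels, predictions)
print(f"log loss: {score:.4f}")
```

Lower is better: a perfect model scores close to 0, while a model that predicts 0.5 for every patient scores log(2), roughly 0.693.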

This metric was used because it heavily penalises extremely confident false positives and false negatives in a binary classification problem. As a result, models that do not generalize well are pushed to predict close to the naive score.
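To make the penalty concrete, the sketch below prints the loss for a single cancer patient as a model grows more confident in the wrong answer, then compares a constant prediction with an overconfident one. We read "naive score" here as the score of a model that predicts the same base rate for every patient; that reading, the helper names `patient_loss` and `constant_baseline`, and the four made-up labels are assumptions for illustration only.

```python
import math


def patient_loss(y, p):
    """Log-loss contribution of one patient with label y and prediction p (0 < p < 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# A cancer patient (y = 1): the loss grows quickly as the model becomes
# more confident that there is no cancer.
for p in (0.5, 0.1, 0.01, 0.001):
    print(f"predicted {p:>5}: loss {patient_loss(1, p):.3f}")


def constant_baseline(labels):
    """Score of predicting the base rate for every patient.

    Assumes the labels contain both classes, so the rate is strictly
    between 0 and 1.
    """
    rate = sum(labels) / len(labels)
    return sum(patient_loss(y, rate) for y in labels) / len(labels)


# Hypothetical test set: one cancer case out of four patients.
labels = [1, 0, 0, 0]
naive = constant_baseline(labels)

# A model that is very confident and wrong on two of the four patients.
confident_preds = [0.01, 0.01, 0.01, 0.99]
overconfident = sum(
    patient_loss(y, p) for y, p in zip(labels, confident_preds)
) / len(labels)

print(f"naive baseline:      {naive:.3f}")
print(f"overconfident model: {overconfident:.3f}")
```

In this toy case the overconfident model scores far worse than the constant baseline, which is why a model that cannot generalize does better by hedging its predictions toward the base rate.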

Stay tuned for subsequent posts, where we outline how we built the various parts of the pipeline and the insights we gained during the process.

[1] “The Faces of Lung Cancer.” The Faces of Lung Cancer — Canada. Lung Cancer Canada, 1 Nov. 2015. Web. 13 Mar. 2017.

[2] “Lung Cancer Facts Sheet.” American Lung Association, 10 Jan. 2017. Web. 10 Mar. 2017.

[3] Siegel RL, Miller KD, Jemal A. “Cancer statistics, 2016.” CA: A Cancer Journal for Clinicians. 2016; 66:7–30.

[4] Aberle DR, Adams AM, Berg CD, et al. “Reduced lung-cancer mortality with low-dose computed tomographic screening.” N Engl J Med. 2011; 365:395–409.

[5] Low-dose CT has historically resulted in high false positive rates of around 25% (Aberle et al., N Engl J Med. 2011; 365:395–409).
