As mentioned in our previous post (Nodulee Overview), our pipeline consists of the following stages:
The raw data provided by LIDC-IDRI, LUNA and DSBowl consists of CT scans in DICOM format. A single scan contains N slices (images) that together cover the entire upper torso. N can vary from 100 to 600 depending on the patient, and the spacing between slices also varies with the machine used to perform the scan.
Our pre-processing code resamples all scans to a uniform spacing, so that the models learn a spacing-independent representation of the data.
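The resampling step can be sketched roughly as follows. This is a minimal illustration, not our actual pre-processing code: the function name `resample_to_uniform_spacing`, the 1 mm target spacing and the use of linear interpolation via `scipy.ndimage.zoom` are all assumptions made for the example.

```python
import numpy as np
from scipy import ndimage


def resample_to_uniform_spacing(volume, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a (z, y, x) CT volume to an (approximately) uniform voxel spacing.

    `spacing` is the original voxel size in mm along (z, y, x).
    The 1 mm target is an assumption for this sketch.
    """
    shape = np.array(volume.shape, dtype=float)
    spacing = np.asarray(spacing, dtype=float)
    new_spacing = np.asarray(new_spacing, dtype=float)

    # The output must have an integer number of voxels, so round the target
    # shape first and derive the zoom factors from it.
    new_shape = np.round(shape * spacing / new_spacing)
    zoom_factors = new_shape / shape

    # Spacing actually achieved after rounding the shape.
    actual_spacing = spacing / zoom_factors

    # order=1 is linear interpolation; mode="nearest" repeats edge voxels.
    resampled = ndimage.zoom(volume, zoom_factors, order=1, mode="nearest")
    return resampled, actual_spacing
```

For example, a scan with 2.5 mm between slices and 0.75 mm pixels ends up with 2.5 times as many slices and 0.75 times as many pixels along each in-plane axis.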
The slices are also segmented to isolate just the lungs from the scan of the whole torso. For example, here is a scan before segmentation:
Here is the same scan after segmentation:
As is evident, segmentation helps by removing the parts of the data outside the lungs since they are irrelevant when detecting lung cancer. This helps reduce noise in the data.
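One common way to do this kind of segmentation is to threshold on Hounsfield units and discard the air surrounding the patient. The sketch below illustrates that idea only; it is not necessarily the method used in our code. The function name `segment_lungs`, the -320 HU threshold and the -1000 HU fill value are assumptions for the example.

```python
import numpy as np
from scipy import ndimage


def segment_lungs(volume_hu, threshold_hu=-320, fill_value=-1000):
    """Keep air-like regions enclosed by the body; blank out everything else.

    `volume_hu` is a (z, y, x) array in Hounsfield units.
    The threshold and fill value are assumptions for this sketch.
    """
    # Voxels darker than the threshold are air-like (lungs or outside air).
    air = volume_hu < threshold_hu
    labels, _ = ndimage.label(air)

    # Components touching the in-plane (y/x) borders are air around the patient.
    border = np.concatenate([
        labels[:, 0, :].ravel(), labels[:, -1, :].ravel(),
        labels[:, :, 0].ravel(), labels[:, :, -1].ravel(),
    ])
    outside = np.unique(border)
    outside = outside[outside != 0]
    lung_mask = air & ~np.isin(labels, outside)

    # Fill enclosed holes so dense structures inside the lungs
    # (vessels, nodules) stay part of the mask.
    lung_mask = ndimage.binary_fill_holes(lung_mask)

    segmented = np.where(lung_mask, volume_hu, fill_value)
    return segmented, lung_mask
```

Filling holes matters here: nodules are denser than lung tissue, so a plain threshold would remove exactly the structures the later stages need to see.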
- Pre-process LUNA data (link)
- Pre-process LIDC-IDRI data (link)
- Pre-process Data Science Bowl data (link)
Nodule Identification Model
This model takes a small volume (chunk) from a lung scan (3D image) as input and classifies the chunk into two classes:
- Class 0 : Chunk does not contain a nodule
- Class 1 : Chunk contains a nodule
Since the Data Science Bowl dataset does not contain data about individual nodules, we had to search for external data sources. We found that we could leverage the data from the LUNA16 grand challenge for this purpose. We used its annotations of individual nodules to train a 3D Convolutional Neural Network (3D-CNN) to detect nodules with high accuracy (90%) (code).
To get the chunks with the highest probability of containing nodules, we split each 3D scan into 48 x 48 x 48 chunks and ran a forward pass through the previously trained network to calculate the ‘nodule probability’ for each chunk. These probabilities are used to select the chunks that are fed into the next stage of our pipeline.
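The chunking and ranking step might look something like the sketch below. It assumes non-overlapping chunks and pads the scan with -1000 HU so every axis is a multiple of 48; neither detail is taken from our code. `split_into_chunks`, `top_k_chunks` and `predict_proba` are hypothetical names, with `predict_proba` standing in for the trained network's forward pass.

```python
import numpy as np

CHUNK = 48  # chunk edge length in voxels


def split_into_chunks(volume, size=CHUNK, pad_value=-1000):
    """Split a (z, y, x) volume into non-overlapping size^3 chunks.

    The volume is padded at the far end of each axis (an assumption for
    this sketch) so that its shape is a multiple of `size`.
    Returns the stacked chunks and the (z, y, x) origin of each one.
    """
    pad = [(0, (-s) % size) for s in volume.shape]
    padded = np.pad(volume, pad, mode="constant", constant_values=pad_value)

    chunks, origins = [], []
    for z in range(0, padded.shape[0], size):
        for y in range(0, padded.shape[1], size):
            for x in range(0, padded.shape[2], size):
                chunks.append(padded[z:z + size, y:y + size, x:x + size])
                origins.append((z, y, x))
    return np.stack(chunks), origins


def top_k_chunks(chunks, predict_proba, k=5):
    """Return indices and scores of the k chunks with the highest score.

    `predict_proba` is a placeholder for the trained model: it takes the
    stacked chunks and returns one nodule probability per chunk.
    """
    probs = np.asarray(predict_proba(chunks))
    order = np.argsort(probs)[::-1][:k]
    return order, probs[order]
```

The origins are kept alongside the chunks so that a high-scoring chunk can be traced back to its location in the scan.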
- Pre-process LUNA16 data and split into chunks (link)
- Train LUNA model (link)
- Load trained model and predict on pre-processed DSBowl data (link)
Nodule Malignancy Model
This model takes a chunk from a lung scan and classifies the chunk into one of five classes:
- Class 0 : Chunk does not contain a cancerous nodule
- Class 1 : Chunk contains a malignant (cancerous) nodule of type 1
- Class 2 : Chunk contains a malignant nodule of type 2
- Class 3 : Chunk contains a malignant nodule of type 3
- Class 4 : Chunk contains a malignant nodule of type 4
Classes 1–4 describe the malignancy of the nodule (higher is more malignant). We used the LIDC-IDRI dataset provided by the National Cancer Institute. This data contains lung scans along with the locations of nodules within those scans. It also contains the malignancies of those nodules.
We used the malignancy data about individual nodules to train another 3D CNN to predict the malignancy of a chunk (code).
The trained model was used to predict malignancy for chunks which contain nodules (obtained from the previous stage).
- Pre-process LIDC data and extract chunks along with their respective malignancy labels (link)
- Train malignancy model (link)
- Load trained model and predict malignancy for chunks from previous stage
Cancer Prediction Model
This final stage consists of an XGBoost model that uses the malignancies from the previous stage, along with hidden features, to predict the overall probability of a patient having cancer. The hidden features were extracted from the activations of the penultimate layer of the nodule malignancy model.
The output of this model is a number between 0 (not cancerous) and 1 (cancerous), indicating the probability of cancer.
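To make the inputs to this stage more concrete, here is one plausible way to turn per-chunk outputs into a fixed-length feature vector per patient. The aggregation scheme is purely an assumption for illustration (expected malignancy per chunk, the top few scores, and mean-pooled hidden features) and is not taken from our code; `patient_features` is a hypothetical helper.

```python
import numpy as np


def patient_features(malignancy_probs, hidden_features, top_k=3):
    """Build one feature vector per patient from per-chunk model outputs.

    malignancy_probs: (n_chunks, 5) class probabilities from the malignancy model
    hidden_features:  (n_chunks, d) penultimate-layer activations
    The aggregation below is an assumed scheme, for illustration only.
    """
    malignancy_probs = np.asarray(malignancy_probs, dtype=float)
    hidden_features = np.asarray(hidden_features, dtype=float)

    # Expected malignancy class per chunk: sum over classes of class * probability.
    scores = malignancy_probs @ np.arange(malignancy_probs.shape[1])

    # Indices of the most suspicious chunks, highest score first.
    order = np.argsort(scores)[::-1][:top_k]

    # Pad with zeros if the patient has fewer than top_k chunks.
    top_scores = np.zeros(top_k)
    top_scores[:len(order)] = scores[order]

    # Average the hidden features of those chunks.
    pooled = hidden_features[order].mean(axis=0)

    return np.concatenate([top_scores, [scores.max(), scores.mean()], pooled])
```

Vectors like this, stacked across patients, would form the feature matrix that a gradient-boosted classifier such as XGBoost is trained on against the per-patient cancer labels.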
We envision a product where the predicted probabilities give radiologists an automated second opinion when screening lung scans for cancer. In some cases the Nodulee system may even help radiologists catch nodules that are too small to be seen with the human eye (< 3 mm in size).