Automatic lung cancer detection on computed tomography scans with RadIO

Lung cancer is the world’s deadliest cancer and it takes countless lives each year. Fortunately, early detection can drastically improve survival rates. Computer-aided diagnostics could be the key here, as it might help to detect tumors automatically and thus on a wider scale.

Left: CT scan. Center: tumor location. Right: evolution of a neural network prediction

Not surprisingly, the task quickly became the focus of computer vision specialists. As a result, in recent years we have witnessed the launch of two open cancer detection competitions: the Luna competition and the Kaggle Data Science Bowl 2017.
It turns out that the data science community has a lot to offer to radiologists. Despite that, a flexible and customizable framework for lung cancer research has been lacking. The new open-source RadIO library attempts to fill this gap. It incorporates a set of preprocessing and augmentation operations as well as a zoo of proven neural network models, so a deep learning algorithm for cancer detection can be written as a short piece of clean, easy-to-read, reproducible and fast Python code, like this:

By the end of this article you will have a working understanding of the code above and will be able to build a cancer detection system yourself.

How does it all work?

Dataset: indexing scans on disk

First of all, you should set up a Dataset, a structure that indexes a dataset of CT scans on disk:

With Dataset, you don’t have to write boilerplate code for splitting the dataset into training and test parts:

while batch generation takes one line of code:
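To illustrate what indexing, splitting and batch generation amount to, here is a minimal library-independent sketch. The `ScanIndex` class and its methods are hypothetical names invented for this example; they are not RadIO's API and say nothing about how RadIO implements its Dataset.

```python
import glob
import os
import random


class ScanIndex:
    """Hypothetical index of scan identifiers (not RadIO's Dataset)."""

    def __init__(self, ids):
        self.ids = list(ids)

    @classmethod
    def from_glob(cls, pattern):
        # Use file names without extension as scan identifiers.
        paths = sorted(glob.glob(pattern))
        return cls(os.path.splitext(os.path.basename(p))[0] for p in paths)

    def split(self, train_share=0.8, seed=0):
        # Shuffle a copy reproducibly, then cut it into two parts.
        ids = self.ids[:]
        random.Random(seed).shuffle(ids)
        cut = int(round(len(ids) * train_share))
        return ScanIndex(ids[:cut]), ScanIndex(ids[cut:])

    def batches(self, batch_size):
        # Yield consecutive chunks of identifiers; the last may be shorter.
        for start in range(0, len(self.ids), batch_size):
            yield self.ids[start:start + batch_size]


# In-memory identifiers stand in for files found on disk.
index = ScanIndex(f"scan_{i:03d}" for i in range(10))
train, test = index.split(train_share=0.8)
batch_sizes = [len(b) for b in train.batches(batch_size=3)]
```

Only identifiers are stored here, so the index stays small no matter how large the scans are; the actual data is read later, when a batch is processed.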

Pipelines and actions

In RadIO, setting up a Pipeline precedes any computation. A pipeline is only a plan of what is going to happen with the data. To put it simply, a pipeline is a sequence of actions. Each action in RadIO implements a specific operation on data. For instance, loading scans from disk into memory is implemented by the load action. Another example is the unify_spacing action, which serves two purposes:

  • making different scans isotropic in their scales
  • making the shapes of different scans equal to each other, so that they can be fed into a neural network

The effect of unify_spacing is much clearer if you compare two scans before

Before: shapes of scans differ, [117, 512, 512] vs [345, 512, 512]. The scales of the scans also differ, as the grids suggest.

and after the action:

After: scans have the same shape [92, 256, 256]. Scale (which corresponds to the lung size in the real world) is also the same.
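The two steps behind this kind of equalization (resampling to a common voxel spacing, then cropping or padding to a common shape) can be sketched in NumPy as follows. This is an illustration under simplifying assumptions, not RadIO's implementation of unify_spacing: the function names are hypothetical, the resampling is nearest-neighbour, and padding uses -1000 HU, the value of air.

```python
import numpy as np


def resample_to_spacing(scan, spacing, new_spacing):
    """Nearest-neighbour resampling of a 3D scan to a new voxel spacing."""
    scan = np.asarray(scan)
    factors = np.asarray(spacing, dtype=float) / np.asarray(new_spacing, dtype=float)
    new_shape = np.maximum(np.round(np.array(scan.shape) * factors).astype(int), 1)
    # For every output index along an axis, pick the nearest source index.
    idx = [
        np.minimum((np.arange(n) / f).astype(int), size - 1)
        for n, f, size in zip(new_shape, factors, scan.shape)
    ]
    return scan[np.ix_(*idx)]


def fit_to_shape(scan, shape, fill=-1000):
    """Center-crop or pad a scan so that it has exactly the given shape."""
    out = np.full(shape, fill, dtype=scan.dtype)
    src, dst = [], []
    for have, want in zip(scan.shape, shape):
        n = min(have, want)
        src.append(slice((have - n) // 2, (have - n) // 2 + n))
        dst.append(slice((want - n) // 2, (want - n) // 2 + n))
    out[tuple(dst)] = scan[tuple(src)]
    return out


# Toy scan: 10 slices, 2.0 mm apart, with 1.0 mm pixels in each slice.
scan = np.zeros((10, 20, 20), dtype=np.int16)
resampled = resample_to_spacing(scan, spacing=(2.0, 1.0, 1.0), new_spacing=(1.0, 2.0, 2.0))
unified = fit_to_shape(resampled, shape=(16, 12, 12))
```

After both steps, every scan has the same array shape and every voxel covers the same physical volume, which is what a network with a fixed input shape needs.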

RadIO implements a lot of preprocessing operations, including loading data from disk, resizing scans to a different shape, and dumping preprocessed data to disk (see the full list). Importantly, RadIO's actions are quite fast. For a deeper understanding of how Python code can be fast, check out the article about preprocessing in RadIO (spoiler: the keywords are numba, parallelization and asynchronicity).


The six actions mentioned earlier are enough to build a full-fledged preprocessing pipeline:

This preprocessing pipeline covers data loading, equalization of shapes and scales, and normalization of voxel densities with normalize_hu. It also makes use of cancer annotations and sets up cancer masks, the target for a segmentation network. Just pass some scans through the preprocessing pipeline and see the results for yourself:

According to the Luna annotations, the scan (left picture) contains a cancerous tumor. The corresponding mask (right picture) reflects just this.
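The two numeric steps of this preprocessing, density normalization and mask creation, can be sketched in NumPy. This is an illustration under stated assumptions rather than RadIO's code: the function names are hypothetical, the Hounsfield window of [-1000, 400] is an assumed choice, and the nodule is drawn as a sphere from a center and diameter already converted to voxel units.

```python
import numpy as np


def clip_and_scale_hu(scan, min_hu=-1000.0, max_hu=400.0):
    """Clip densities to an (assumed) HU window and scale them to [0, 1]."""
    scan = np.clip(np.asarray(scan, dtype=np.float32), min_hu, max_hu)
    return (scan - min_hu) / (max_hu - min_hu)


def spherical_mask(shape, center, diameter):
    """Binary mask with ones inside a sphere given in voxel coordinates."""
    zz, yy, xx = np.ogrid[:shape[0], :shape[1], :shape[2]]
    dist2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return (dist2 <= (diameter / 2) ** 2).astype(np.float32)


densities = clip_and_scale_hu(np.array([-2000, -1000, -300, 400, 3000]))
mask = spherical_mask(shape=(8, 8, 8), center=(4, 4, 4), diameter=4)
```

Values outside the window are saturated at 0 or 1, so extreme densities do not dominate the network input, and the mask has the same shape as the scan, so it can serve directly as a voxel-wise target.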

Augmentation and neural network training: two more pipelines

Data augmentation is another important step in developing a computer vision solution. A simple augmentation pipeline includes cropping out interesting (and small) parts of scans with sample_nodules, rotating scans by a random angle, and random scaling with unify_spacing:
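As a library-independent illustration of the cropping step only, the sketch below cuts a small crop around a known nodule center with a random shift, so that the nodule does not always sit in the middle of the crop. The function `crop_around` and its parameters are hypothetical and are not the sample_nodules action; rotation and rescaling are left out.

```python
import numpy as np


def crop_around(scan, center, crop_shape, max_shift, rng):
    """Cut a crop of `crop_shape` whose center is jittered around `center`."""
    starts = []
    for c, size, full in zip(center, crop_shape, scan.shape):
        shift = rng.integers(-max_shift, max_shift + 1)
        start = c + shift - size // 2
        # Keep the crop fully inside the scan.
        starts.append(int(np.clip(start, 0, full - size)))
    slices = tuple(slice(s, s + n) for s, n in zip(starts, crop_shape))
    return scan[slices], tuple(starts)


rng = np.random.default_rng(0)
scan = np.zeros((32, 64, 64), dtype=np.float32)
center = (2, 30, 60)  # nodule center in voxel coordinates, close to two borders
crop, origin = crop_around(scan, center, crop_shape=(16, 16, 16), max_shift=4, rng=rng)
```

As long as `max_shift` is smaller than half the crop size, the nodule center stays inside the crop, even when the crop has to be pushed back from the border of the scan.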

Lastly, we set up a pipeline that performs initialization and training of a V-Net-type model from the RadIO model zoo. To begin with, you need to configure the model by defining the shape of its inputs and the loss function:
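To make the role of the loss function concrete, here is a NumPy sketch of the Dice loss, a common choice for segmentation models that was introduced together with V-Net. Whether this is the loss used in the configuration here is an assumption; the function below is a plain illustration of the formula and not part of RadIO.

```python
import numpy as np


def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient between a predicted mask and a target mask."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    intersection = np.sum(pred * target)
    # eps keeps the ratio defined when both masks are empty.
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


perfect = dice_loss([1, 1, 0, 0], [1, 1, 0, 0])
disjoint = dice_loss([1, 1, 0, 0], [0, 0, 1, 1])
partial = dice_loss([1, 1, 0, 0], [1, 0, 0, 0])
```

Because the loss depends on the overlap relative to the size of the two masks, it stays informative even when cancerous voxels make up only a tiny fraction of a scan.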

You can now attach the model to a pipeline and schedule a training action:

Putting it all together

At this point, most of the work is done. You only need to combine preprocessing, augmentation and training in one workflow

and apply the workflow to the luna-dataset (importantly, real computations start only now):

Note that there are two ways of running the computations. The first one involves iterating over batches of a specific size, as in the example above. The second one uses run (which stands for one run through the whole dataset):
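The "plan first, compute later" idea and the two ways of running can be sketched in plain Python. The `LazyPipeline` class below is a hypothetical toy written for this article's reader, not RadIO's Pipeline: it records actions without executing them, lets two plans be joined with `+`, and only does work when batches are requested.

```python
class LazyPipeline:
    """Toy deferred pipeline: stores actions, executes them on demand."""

    def __init__(self, actions=None):
        self.actions = list(actions or [])

    def add(self, func, **kwargs):
        # Record the action and its parameters; nothing is computed here.
        return LazyPipeline(self.actions + [(func, kwargs)])

    def __add__(self, other):
        # Joining two plans gives one longer plan.
        return LazyPipeline(self.actions + other.actions)

    def gen_batch(self, items, batch_size):
        # First way: iterate over processed batches one by one.
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            for func, kwargs in self.actions:
                batch = [func(x, **kwargs) for x in batch]
            yield batch

    def run(self, items, batch_size):
        # Second way: one pass through all items, results discarded.
        for _ in self.gen_batch(items, batch_size):
            pass


calls = []

def scale(x, factor):
    calls.append("scale")
    return x * factor

def shift(x, offset):
    calls.append("shift")
    return x + offset

preprocessing = LazyPipeline().add(scale, factor=2)
augmentation = LazyPipeline().add(shift, offset=1)
workflow = preprocessing + augmentation
calls_before_run = len(calls)  # the plan exists, but nothing has run yet
batches = list(workflow.gen_batch([1, 2, 3, 4, 5], batch_size=2))
```

Iterating with `gen_batch` is convenient when you want to inspect each processed batch, while `run` suits actions whose effect is a side effect, such as a training step that updates model weights.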

The whole training process may take up to several days. However, even after a couple of epochs (several hours of training) you may enjoy a view of decreasing loss:

and nice prediction dynamics:

Concluding remarks

As you can see, building a cancer detection system with RadIO comes down to setting up a Pipeline, a plan of the computations to come. In turn, this plan can be divided into three separate workflows: preprocessing, augmentation and model training.

Still, RadIO has a lot more to offer. For an overview of its capabilities, check out the documentation and visit our GitHub repo. You may also benefit from reading the tutorials, as they provide working examples of code. For example, this tutorial explains how you can tweak the preprocessing and augmentation parts, while this one introduces you to the RadIO zoo of neural network models for customizing the model-training part.

Nothing stops you now from developing a cancer detection system, be it for fun or to assist radiologists at the nearest cancer research center.