How to make CT-scan preprocessing fast and easy

Kirill Emelyanov
Data Analysis Center
5 min read · Feb 1, 2018

Deep learning often means working with datasets that don’t fit into memory, and in these cases efficiency is key. 3D scans from computed tomography (CT scans) are cumbersome: they take a lot of memory (~300–400 MB per scan) and require many preprocessing steps before they can be fed into a neural network.

Suppose a scan’s shape is 256 x 256 x 256: it consumes as much memory as 256 two-dimensional images. And you never need just one scan; usually you have to work with hundreds or thousands of them.

Fortunately, the RadIO framework handles preprocessing of CT images, making it fast and convenient.

The RadIO framework has three main features:

1. Storing data and preprocessing methods in batches

Whenever you want to transform data, it is faster and easier to handle a batch of scans at once. On the one hand, this approach allows scans to be preprocessed in parallel. On the other, it simplifies and reduces the amount of code for training neural networks, because they commonly accept batches as input. Storing data together with the methods that work on it in one structure is a well-known approach called encapsulation:

Encapsulation concept

The RadIO framework provides the CTImagesMaskedBatch class, which is used to store CT scans and the corresponding masks. Internally, the data in CTImagesMaskedBatch is represented by stacking the 3D arrays of scans or masks along the z-axis into a ‘batch-like’ structure:

Skyscraper structure used for storing CT scans in batch

This representation is convenient because it allows scans with different sizes along the z-axis to be stored in a single 3D array.
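A minimal numpy sketch of the idea (the shapes here are made up for illustration):

```python
import numpy as np

# two toy "scans" with different numbers of slices but the same in-plane size
scan_a = np.zeros((120, 256, 256))
scan_b = np.zeros((95, 256, 256))

# the "skyscraper": both scans stacked along the z-axis in a single 3D array
skyscraper = np.concatenate([scan_a, scan_b], axis=0)     # shape (215, 256, 256)

# per-scan z-boundaries let us slice an individual scan back out
bounds = np.cumsum([0, scan_a.shape[0], scan_b.shape[0]])  # [0, 120, 215]
scan_a_view = skyscraper[bounds[0]:bounds[1]]
```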

Here is a list of CTImagesMaskedBatch components and methods:

Components:

  • images — source CT scans or crops
  • masks — masks corresponding to the source CT scans
  • spacing — number of millimeters per pixel along each dimension
  • origin — world coordinates of the scans

Methods:

  • batch data loading and saving: load / dump
  • scan data transforms: resize, segment, normalize_hu
  • augmentation methods: sample_nodules, rotate, central_crop
  • mask creation and transforms: create_mask, binarize_mask, resize

For example, here is how a slice of a CT scan and the corresponding mask with a cancerous nodule can be displayed side by side.
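The snippet below is a rough sketch: it assumes a batch has already been loaded with masks created, and that batch.get(index, component) returns the 3D array of a single scan (check the RadIO documentation for the exact accessor in your version); the slice number is arbitrary:

```python
import matplotlib.pyplot as plt

# assumption: `batch` is a loaded CTImagesMaskedBatch with masks already created
scan = batch.get(0, 'images')
mask = batch.get(0, 'masks')

fig, (ax_scan, ax_mask) = plt.subplots(1, 2, figsize=(10, 5))
ax_scan.imshow(scan[scan.shape[0] // 2], cmap='gray')   # middle axial slice
ax_scan.set_title('CT slice')
ax_mask.imshow(mask[mask.shape[0] // 2], cmap='gray')   # corresponding mask slice
ax_mask.set_title('Nodule mask')
plt.show()
```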

2. Chaining actions in readable and flexible pipelines

Usually you want to apply the same chain of methods to every generated batch. The straightforward approach is to call that chain inside a for or while loop for every batch. For example, if you want to resize scans and normalize HU values, you can do something like this:
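A minimal sketch of such a loop, assuming the LUNA scans are stored as .mhd/.raw files; the path, batch size and target shape below are placeholders:

```python
from radio import CTImagesMaskedBatch
from dataset import FilesIndex, Dataset   # the package shipped with RadIO (later renamed batchflow)

# build an index over the LUNA files and wrap it into a dataset of CT batches
luna_index = FilesIndex(path='/path/to/LUNA/*.mhd', no_ext=True)
luna_dataset = Dataset(index=luna_index, batch_class=CTImagesMaskedBatch)

for batch in luna_dataset.gen_batch(batch_size=4, shuffle=True, n_epochs=1):
    batch = (batch
             .load(fmt='raw')                          # read the .mhd/.raw scans
             .resize(shape=(128, 256, 256))            # bring all scans to one shape
             .normalize_hu(min_hu=-1000, max_hu=400))  # clip and rescale HU values
    # ...feed batch.images to the network here
```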

Another approach is a bit more intricate, but clearer and more expressive: using “pipelines”, chains of actions applied to the whole dataset.

Here is an example based on the Pipeline class from the dataset package:
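A sketch of such a pipeline; note that it only describes the chain of actions and does not execute anything by itself:

```python
from dataset import Pipeline   # in later versions: from batchflow import Pipeline

# declare the chain of actions; nothing runs until the pipeline is bound to a dataset
preprocessing = (Pipeline()
                 .load(fmt='raw')
                 .resize(shape=(128, 256, 256)))
```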

This pipeline describes a simple chain of operations for loading and resizing scans from the LUNA dataset.

It’s really easy to run this pipeline on every scan from the CT dataset:
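Using the luna_dataset from the earlier snippet, binding and running the pipeline might look like this:

```python
# bind the pipeline to the dataset and execute it batch by batch
(luna_dataset >> preprocessing).run(batch_size=4, shuffle=False, n_epochs=1)
```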

You can even train your neural network on the fly.
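For example, one can iterate over the preprocessed batches and feed them into any framework’s training step; train_step below is a placeholder for your own training function:

```python
training = luna_dataset >> preprocessing

for batch in training.gen_batch(batch_size=4, shuffle=True, n_epochs=10):
    # batch.images is already resized; masks or nodule crops could be added
    # to the pipeline with create_mask / sample_nodules
    train_step(batch.images)   # placeholder for a framework-specific training call
```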

Using pipelines separates the description of the chain of methods from its execution. The same approach is used in TensorFlow, so the concept of “pipelines” can be considered a natural extension of the TensorFlow dataflow model.

3. Making actions really fast

Parallel preprocessing of CT scans is an essential feature because it significantly boosts performance. Modern processors have several cores that can be used to preprocess the scans in a batch in parallel. However, Python’s global interpreter lock (GIL) forces pure-Python code to run in a single thread at a time. Fortunately, it is sometimes possible to overcome this issue, for instance with numba decorators or by calling external functions that release the GIL.

Releasing GIL in external functions

Here is an example of a function that rotates a CT scan using scipy, whose compiled routines release the GIL:
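A sketch of such a function; scipy.ndimage.rotate does its work in compiled code and releases the GIL, so several of these calls can run in parallel threads (the axes and interpolation order below are just one reasonable choice):

```python
from scipy.ndimage import rotate as scipy_rotate

def rotate_3d(image, angle, axes=(1, 2)):
    """Rotate one 3D scan by `angle` degrees in the plane given by `axes`."""
    # reshape=False keeps the output shape equal to the input shape
    return scipy_rotate(image, angle, axes=axes, reshape=False, order=3)
```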

The dataset library provides the inbatch_parallel decorator, which enables automatic threading. Using it, the implementation of the rotate action looks like this:
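A condensed sketch of how such an action might look, reusing rotate_3d from the previous snippet. The real RadIO implementation is more involved; in particular, the get accessor used to pull a single scan out of the batch is an assumption here:

```python
from radio import CTImagesMaskedBatch
from dataset import action, inbatch_parallel   # in later versions: batchflow


class CustomBatch(CTImagesMaskedBatch):
    """CTImagesMaskedBatch extended with a threaded rotate action."""

    @action
    @inbatch_parallel(init='indices', target='threads')
    def rotate(self, index, angle, axes=(1, 2)):
        # one thread per scan; scipy releases the GIL inside rotate_3d,
        # so the threads actually run in parallel
        image = self.get(index, 'images')   # assumed accessor for one scan
        image[...] = rotate_3d(image, angle, axes)
```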

Our custom batch inherits from CTImagesMaskedBatch and extends it with a rotate method. The method is wrapped with the action decorator from the dataset module because the Pipeline API requires it.

In our benchmarks, this trick with the inbatch_parallel decorator gives an over 4x speed boost compared with the single-threaded version.

Releasing GIL with numba

Another example is a maximum filter (similar to a 3D max pooling operation) implemented via numba:
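A sketch of such a filter: a naive implementation compiled by numba with nogil=True, so it can run in several threads at once:

```python
import numpy as np
from numba import jit


@jit(nopython=True, nogil=True)
def maximum_filter_3d(image, size):
    """Naive 3D maximum filter: every voxel becomes the maximum of its
    size x size x size neighborhood (clipped at the borders)."""
    out = np.empty_like(image)
    nz, ny, nx = image.shape
    r = size // 2
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                z0, z1 = max(z - r, 0), min(z + r + 1, nz)
                y0, y1 = max(y - r, 0), min(y + r + 1, ny)
                x0, x1 = max(x - r, 0), min(x + r + 1, nx)
                out[z, y, x] = np.max(image[z0:z1, y0:y1, x0:x1])
    return out
```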

Now, let’s add a maximum_filter action to our CustomBatch:
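Continuing the sketch above (the per-scan accessor is, again, an assumption):

```python
class CustomBatch(CTImagesMaskedBatch):
    """Custom batch with both threaded actions."""

    # ... rotate action from the previous snippet ...

    @action
    @inbatch_parallel(init='indices', target='threads')
    def maximum_filter(self, index, size=3):
        # the numba-compiled filter releases the GIL, so the threads
        # spawned by inbatch_parallel run truly in parallel
        image = self.get(index, 'images')
        image[...] = maximum_filter_3d(image, size)
```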

Thanks to inbatch_parallel from the dataset package and numba’s jit decorator, we again gain a speed boost.

Many preprocessing methods of CTImagesMaskedBatch that can be significantly sped up by parallel execution are wrapped with inbatch_parallel in a similar manner. Among them are the rotation and resize action-methods, as well as the mask creation and patch extraction methods, which are additionally wrapped with numba’s jit decorator with nogil=True to enable GIL release.

Conclusion

The RadIO framework provides an interface for expressive description of preprocessing and data augmentation of CT scans. It handles heavy sets of CT scans efficiently in parallel, leaving all the dirty work under the hood, and lets you focus on ‘what’ the preprocessing should do instead of ‘how’.
