Data Science Bowl 2017 — Space-Time tricks

Mamy André-Ratsimbazafy
5 min readFeb 13, 2017

--

Here we are again for the third post on my journey to Deep Learning.

Like all super-heroes, the data scientist sometimes needs to call upon greater powers to solve the issue at hand. Predicting lung cancer for the Data Science Bowl competition by Booz Allen Hamilton and Kaggle is the perfect playground to learn such powers.

What powers? Today I am here to talk to you about how to manipulate space and time: compressing the data, Just-In-Time compiling and vectorizing your code, and parallelizing your data science loops. Bonus: running array operations on the GPU.

Today’s heroes are: Numpy, Bcolz, Zarr, Numba and Joblib

Note: this story was initially published on my blog at https://andre-ratsimbazafy.com/data-science-bowl-2017-space-time-tricks/. Unfortunately Medium doesn’t allow syntax highlighting of Python code. Read it on my blog for “The way it’s meant to be played” (Nvidia slogan).

Space manipulation

So here you are, ready to challenge this great competition and pocket $500,000. Then you discover that you have to download a 70 GB .7z file, and that uncompressed it’s more than 150 GB.

And now you’re wondering how to store your preprocessed data after watershedding, connected component thresholding or Region of Interest generation.

Fear not, because you have three ways to store NumPy arrays in a compressed manner.

First way — Pure Numpy

The first way is straightforward with pure NumPy and the numpy.savez_compressed function. You can load the data back with numpy.load.
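A minimal sketch, mirroring the bcolz functions below (patient_id and outFolder are placeholder names, not from the original post):

import numpy as np

def save_npz(data_array, patient_id, outFolder):
    # 'data' is the keyword under which the array is stored in the .npz archive
    np.savez_compressed(outFolder + patient_id + '.npz', data=data_array)

def load_npz(patient_id, inFolder):
    # np.load returns a dict-like NpzFile, index it with the keyword used above
    return np.load(inFolder + patient_id + '.npz')['data']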

Second way — Bcolz

The second way is with bcolz. Code says more than a hundred words.

Define a save_bcolz function

import bcolz

def save_bcolz(data_array, patient_id, outFolder):
    outFile = outFolder + patient_id + '.bcolz'
    z = bcolz.carray(
        data_array,
        cparams=bcolz.cparams(clevel=9, cname="zstd", shuffle=2), # "zstd" is the state-of-the-art compressor
        dtype='int16', # if possible save as integer ("int16") for maximum compression
        rootdir=outFile
    )
    z.flush() # make sure data is written to disk

data_array should be a NumPy array, bcolz.cparams are the compression parameters, and rootdir is the path on disk. The data will be saved in a directory, not in a single compressed file.

Load the bcolz data:

def load_bcolz(patient, inFolder):
    return bcolz.open(inFolder + patient, mode='r')[:]

mode='r' means the data is opened read-only, and [:] extracts the NumPy array from the bcolz file.
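As a quick sanity check, a hypothetical round trip with the two functions above could look like this (scan and patient_001 are placeholder names):

save_bcolz(scan, 'patient_001', './data/bcolz_preproc/')       # scan is an int16 NumPy array
restored = load_bcolz('patient_001.bcolz', './data/bcolz_preproc/')
assert (restored == scan).all()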

Third way — Zarr

Zarr is an alternative to bcolz. If you’re familiar with HDF5, it strives to support similar features, like groups. Besides saving data in a directory like bcolz, it can also save data in a single file.

Define a save_zarr function. In this example I will show how to use groups to save the ~1600 patients’ data.

import zarr

# First set some global variables.
# A store (DirectoryStore, ZipStore, MemoryStore) is where you wish to store the data:
# a directory on disk, a single file on disk (slower), or in-memory (not persistent).
# Then one or more groups to save your data. It's like having a filesystem
# (folder/subfolder/data), check the group documentation:
# https://zarr.readthedocs.io/en/latest/api/hierarchy.html
ZARR_STORE_PREPROC = zarr.DirectoryStore('./data/compressed_preproc.zarr')
ZARR_GROUP_PREPROC = zarr.hierarchy.open_group(store=ZARR_STORE_PREPROC, mode='w-')

def save_zarr(id_patient, image):
    ZARR_GROUP_PREPROC.array(id_patient, image,
                             chunks=(128, 128, 128),
                             compressor=zarr.Blosc(clevel=9, cname="zstd", shuffle=2))

id_patient is the name I will use when I reload the data later. image is a NumPy ndarray (here a 3D array). Chunks define how the data is cut, to optimize storage space and decompression time.

After saving the data I suggest you change the permissions to read-only. Zarr allows you to do on-disk computation, so it’s very possible to modify on-disk data by mistake when manipulating zarr arrays afterwards.

Load the zarr data:

# First the global variables.
# Check that you load the data read-only with mode='r'.
PREP_STORE = zarr.DirectoryStore('./data/compressed_preproc.zarr')
PREP_GROUP = zarr.hierarchy.open_group(store=PREP_STORE, mode='r')

def load_zarr(patient):
    return PREP_GROUP[patient][:]

[:] extracts the NumPy array from the zarr array.
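For example, a hypothetical round trip with the functions above (scan and patient_001 are placeholder names) could be:

save_zarr('patient_001', scan)        # scan is a 3D NumPy array
restored = load_zarr('patient_001')
assert (restored == scan).all()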

Conclusion

Compression of the raw images is slightly better than .7z (and directly usable from Python). Compressing Guido’s preprocessing output with bcolz, the data only takes 1.16 GB. Compressing Ankasor’s preprocessing output with zarr, the data only takes 0.76 GB.

Crazy !

Time manipulation

Okkkaay, space is done. While trying some kernels like Guido’s or Ankasor’s, you probably realised that just preprocessing the data would take you a whole week. Oops.

And at one point you also realize that their preprocessing steps were only using one core of your multicore CPU, and no GPU, even though you were manipulating images.

Okay let’s solve that.

First speed bump — Vectorize your code with Numba and Just-in-time compiling

By default, Python is an interpreted language: when you run Python code, it is executed line after line. You can enable a lot of optimizations if the code is analyzed and compiled ahead of being run (and, where possible, vectorized).

The easiest way to do that for a data science project is by using Numba and the @autojit decorator. Note: general Python code that doesn’t use NumPy has a lot of other options, like PyPy or Cython.

How? By importing autojit and adding @autojit before the computational functions you want to accelerate.

from numba import autojit

@autojit
def crop_center(img, cropx, cropy):
    # Crop a 2D image around its center
    y, x = img.shape
    startx = x//2 - (cropx//2)
    starty = y//2 - (cropy//2)
    return img[starty:starty+cropy, startx:startx+cropx]
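A hypothetical call, just to show the JIT behaviour (the 512×512 array is an arbitrary example):

import numpy as np

img = np.random.rand(512, 512)
cropped = crop_center(img, 128, 128)  # the first call triggers compilation, subsequent calls reuse the compiled code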

Bonus: Use your GPU to go even faster

By importing cuda from numba, you can use the @cuda.jit decorator to run your code on the GPU. Check the documentation for what is supported.
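As a minimal sketch (not from the competition code), an element-wise kernel could look like this:

from numba import cuda
import numpy as np

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)           # absolute index of this thread
    if i < arr.size:           # guard threads beyond the array bounds
        arr[i] += 1.0

data = np.zeros(1024, dtype=np.float32)
threads_per_block = 256
blocks = (data.size + threads_per_block - 1) // threads_per_block
add_one[blocks, threads_per_block](data)  # Numba transfers the array to the GPU and back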

Second speed bump — Run loop in parallel with joblib

Loops are an excellent opportunity to parallelize if there is no interdependence between loop iterations (no n = n+1).

The joblib library makes that quite easy. Let’s say you have a list of images, want to apply a preprocessing function to all those images, and get back a new list of preprocessed images.

from joblib import Parallel, delayed

# preproc_function is a function that accepts a single image as argument and returns an image
images = Parallel(n_jobs=-1)(delayed(preproc_function)(image) for image in images)

Note: if your preprocessing function does not return a value, you can just use:

Parallel(n_jobs=-1)(delayed(preproc_function)(image) for image in images)

Third speed bump — Delay preprocessing

Lastly, if you’re really strapped for CPU, you can use computation graphs with dask. Basically, when you do y = function(x), instead of computing y right away, dask stores the “computation graph”. Then, when you actually need the result, it optimizes the resources needed; it can even distribute the computation across multiple computers.
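A minimal sketch with dask.delayed, reusing the hypothetical preproc_function and images from the joblib example above:

from dask import delayed, compute

lazy_results = [delayed(preproc_function)(image) for image in images]  # builds the graph, computes nothing yet
results = compute(*lazy_results)  # the whole graph is executed here, in parallel where possible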

I’ll let you check the documentation.

That’s all folks, Happy deep learning


Mamy André-Ratsimbazafy

Data Scientist, Ethereum & open-source dev, Go player, ex-finance and non-profits | @m_ratsim @GoAndStrat