Several tips on building Machine Learning pipelines

Based on processing EEG signals in Python for seizure prediction.

Krzysztof Leszczyński
Acta Schola Automata Polonica
8 min read · Dec 3, 2018


Designing and running pipelines that take a couple of days to complete can be intimidating. One small bug at the end of the pipeline may cost you all the results you have been waiting days for. I learned some tough lessons while building a model for seizure prediction.

Seizure prediction means predicting epileptic seizures from EEG signals. Every second of a recording brings more than 250 kB of data, and a single observation needs at least 10 minutes of signal, so for this problem I ended up processing 1.6 TB of EEG. That was enough data to make the pipeline run for over a week, and I wasn't using any deep neural network, definitely not. Just reading the data from disk and computing some basic statistics on the signal (e.g. correlation) took a week. I was working on an Ubuntu server with 32 GB of RAM, an AMD FX(tm)-6300 six-core CPU and an NVIDIA TITAN X 12 GB GPU. The data was stored on an external HDD and the pipeline was developed in Python.

The seizure prediction problem brings up a universal question: how can we effectively develop such a computationally demanding pipeline? How do we avoid the most common mistakes? I'm going to share a couple of life hacks that will make your development less painful.

Design

Modules

I had spent a couple of weeks designing the architecture before starting the implementation (which was far too much, by the way). I think the most important thing is to split the pipeline into reasonable parts, ones that can be run separately. Every problem we attack needs at least: Preprocessing, Feature Extraction and Cross-Validation. Usually the first two take a significant amount of computation time, while the last one needs to be executed many times to optimize the parameters of the classifier.

That's why it is highly important to cache results after every part. It is also very helpful for development to encapsulate every part, i.e. put it into a separate class with some inputs (variables from previous steps) and outputs (variables passed to the next step). There are open source libraries for this, such as Luigi or Airflow, but you can easily implement it yourself.
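A minimal sketch of such an encapsulated, cached step (the class names and the pickle-based caching scheme are my own illustration, not taken from Luigi or Airflow):

import os
import pickle


class PipelineStep:
    """One encapsulated part of the pipeline with cached outputs."""

    def __init__(self, cache_dir="cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def run(self, inputs):
        cache_path = os.path.join(self.cache_dir, type(self).__name__ + ".pkl")
        if os.path.exists(cache_path):
            # This step already ran: reuse its cached output.
            with open(cache_path, "rb") as f:
                return pickle.load(f)
        outputs = self.compute(inputs)
        with open(cache_path, "wb") as f:
            pickle.dump(outputs, f)
        return outputs

    def compute(self, inputs):
        raise NotImplementedError


class Preprocessing(PipelineStep):
    def compute(self, inputs):
        # e.g. filter and resample the raw EEG signal here
        return {"clean_signal": inputs["raw_signal"]}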

Splitting the pipeline into separate parts is one way of reducing the chaos, but not the only one.

Parameters

Every pipeline is controlled by a long list of params, e.g. the number of neurons, the path to the data, the ratio between the training and testing datasets and much, much more. It is very easy to get lost and introduce chaos, especially when we work on more than one environment: a local machine and a computing server. My previous projects often got very messy at some point because params were scattered throughout the pipeline without any centralized view. A holistic view of the params is very important: there must be only one file containing them. I wouldn't like to be in the shoes of a developer who has to introduce a new analytic to a project where every class (preprocessing, feature A, feature B, etc.) has its params edited directly. On top of that, I suggest splitting params into two groups:

  • technical params, dependent on the server environment, e.g. path to the data, verbose flag, logging level, path to the configuration file, path to the output folder
  • characteristics of the model, independent of the server environment and changed often, e.g. hyperparameters (learning rate, regularization rate, number of trees), model definition (i.e. type of algorithm), architecture of the neural network, etc.

I strongly encourage you to keep technical params as command-line arguments:

python main.py -p /Volume/disc/data -l DEBUG -o ./xgboost -w workflows/xgboost_conf.py
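One way to parse these flags is with argparse; a sketch (the long option names are my own guesses, only the short flags appear in the command above):

import argparse

parser = argparse.ArgumentParser(description="Seizure prediction pipeline")
parser.add_argument("-p", "--data-path", required=True, help="path to the data")
parser.add_argument("-l", "--log-level", default="INFO", help="logging level")
parser.add_argument("-o", "--output-dir", default="./output", help="path to the output folder")
parser.add_argument("-w", "--workflow", required=True, help="path to the model characteristics file")
args = parser.parse_args()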

The characteristics of the model, on the other hand, should be stored in a file added to the git repository. I store it as a pure Python file, which allows storing architecture in addition to numeric/string params, i.e. you can have the classifier itself as a parameter (whether it is RandomForest, XGBoost or a neural network).
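Such a file can be a plain Python module; the exact contents below are only illustrative:

# workflows/xgboost_conf.py -- model characteristics, kept under version control
from xgboost import XGBClassifier

TRAIN_TEST_RATIO = 0.7

# The classifier itself is a parameter: swap it for RandomForest or a neural network.
CLASSIFIER = XGBClassifier(learning_rate=0.15, reg_lambda=0.9, n_estimators=300)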

Moreover, you should save the model characteristics file in the output folder. Then you can easily find out which model was used to generate the results, and you no longer have to name the output folder xgboost_eta_015_lambda_09_feature_selection_v2_standard_scaler_test_03.

Furthermore, adding the characteristics file to the git repository lets you run your hypothesis on the cluster as soon as you have tested it on a small amount of data on your local computer.

Implementation

Once we have a basic design of the pipeline, we can jump into the implementation. I'm not going to describe all the technology I used in my project; you probably know the Python data science toolkit (pandas, numpy, sklearn, etc.), and there are lots of tutorials on how to use it. Instead, I'd like to present a couple of easy tricks that will make your life easier. Not at the beginning, but at the moment of "OMG?! How could these results not have been saved!"

Logging

Let's start with logging. You might think "what is interesting about logging? I'll just print() what I need and redirect the output to a file with python main.py > output.log". But wait a second: if you run the pipeline today, your output is going to be generated next week; it deserves at least a couple of minutes spent on good logging logic.

First of all, whatever you log on the screen, log it into an output file as well. This way you can both monitor the computation (e.g. watch progress or just check that everything is OK) and save the output for later investigation. There is nothing more frustrating than noticing that the pipeline running on the server died and having no idea why. You don't know if there was a bug (and which bug?) or just a server restart. Here is how you can set up logging to both the console and a file (output.log):
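A minimal version of such a setup, using Python's standard logging module (the format string is just an example):

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    handlers=[
        logging.FileHandler("output.log"),  # keep a persistent copy on disk
        logging.StreamHandler(),            # ...and still print to the console
    ],
)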

Then in every class of your project you can reuse the logger simply with:
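For instance:

import logging

logger = logging.getLogger(__name__)  # one named logger per module/class

logger.info("Feature extraction started")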

Yes, there is an alternative with tee in bash, i.e. python main.py | tee output.log. But with the approach above you no longer have to remember to redirect the output; I'm sure you know what can happen if you forget just once (bye bye results, welcome another week of waiting)!

Prevent stupid errors

Always create every folder from the output path at the beginning of the pipeline; there is nothing more irritating than losing all results and logs at the end of the pipeline because of:

FileNotFoundError: [Errno 2] No such file or directory: 'output/results.csv'

You can easily check if the folder exists and create it with:
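For example, os.makedirs does both in one call (the path here is illustrative):

import os

output_dir = "output/xgboost"            # illustrative path
os.makedirs(output_dir, exist_ok=True)   # creates missing parents, no error if it already exists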

Furthermore, make sure you catch KeyboardInterrupt. It's very helpful to be able to stop your pipeline while still saving the current results and metadata. It's even better if your pipeline can save its state and continue the work when you rerun it.

The easiest way of catching KeyboardInterrupt is:
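A sketch, with small placeholder functions standing in for the real work:

import time

def load_batches():
    return range(1000)                  # placeholder for your data iterator

def process(batch):
    time.sleep(0.1)                     # placeholder for the expensive step
    return batch * 2

results = []
try:
    for batch in load_batches():
        results.append(process(batch))
except KeyboardInterrupt:
    print("Interrupted by user, saving partial results")
finally:
    with open("partial_results.txt", "w") as f:   # dump whatever was computed so far
        f.write("\n".join(str(r) for r in results))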

Saving results

Again, you might ask "What's the big deal, I'll just dump a .csv file at the end of the computation!", but remember: your week-long pipeline demands more from you!

When you generate and save results to a .csv file, I recommend always saving it iteratively. Imagine you are computing features from the data. Saving the whole matrix at the end of the job is a really bad idea. Such a job is all-or-nothing: either it is 100% successful and you get the whole matrix, or you get nothing. Remember that unexpected data corruption is very common and you should assume that every problem you are solving has it. If you save your results iteratively, data corruption is not so painful. After your pipeline fails, you can check whether the results you have already generated are fine. Moreover, you can start working on the next part of the pipeline that depends on the failed job; you can actually start it even while the previous part is still running, because its results are generated iteratively.

Furthermore, you can monitor the job live. Iteratively dumping results is especially useful for computations that have no fixed number of iterations, e.g. random search of hyperparameters, or for evaluating models with a big variance of results. Here is example code for iterative result saving:
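A sketch of what that can look like; the evaluation below is a random stand-in for a real cross-validation score:

import csv
import os
import random

def evaluate(params):
    return random.random()              # placeholder for a cross-validation score

os.makedirs("output", exist_ok=True)    # create the folder up front, as advised above
with open("output/results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["iteration", "learning_rate", "score"])
    for i in range(1000):               # e.g. random search with no fixed budget
        params = {"learning_rate": random.uniform(0.01, 0.3)}
        score = evaluate(params)
        writer.writerow([i, params["learning_rate"], score])
        f.flush()                       # each row hits the disk immediately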

Remote work

For the seizure prediction problem I had to run the pipeline on a remote server. There were two main reasons: resources, i.e. more RAM/CPU/GPU/disk space, and convenience, since it was impractical to keep my personal computer running the pipeline for weeks. This is a very common setup for machine learning applications. Let me introduce the sshfs tool.

There are two common development cycles. The first is pushing code to a remote repository from the local computer and pulling it on the server. It gets very annoying when you have to play with some parameters of the model and circle back through the remote repository every time.
The second is developing the pipeline with vim over ssh, but we all know how painful it is to maintain a big project in vim.

I find it very convenient to mount a remote file system on the local computer:

sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3 kleszczynski@192.168.0.1:/home/kleszczynski/work/SeizurePrediction /home/krzysztof/remote_folder

Then you can use your favorite IDE and easily edit code directly on the remote machine.

You can find the description of ServerAliveInterval and ServerAliveCountMax here.

For unmounting:

umount -f PATH                # for Linux OS
diskutil unmount force PATH   # for Mac OS X

A potential drawback of this method is that it requires a decent network connection. Moreover, you have to unmount the folder whenever you turn off or sleep the local computer and mount it again afterwards, otherwise you end up with a dead folder (/home/krzysztof/remote_folder in this example) that you can't easily enter or delete.

Summary

To conclude, one last piece of advice: don't worry too much about designing the whole pipeline before implementing it, because it is very hard to predict all the weirdness in the data. For seizure prediction I spent over a month designing the architecture, but when I started implementing and playing with the data I had to redesign the architecture and reimplement everything. An iterative workflow (build a quick proof of concept of the whole pipeline and fix issues after every run) is better than waterfall (create perfect preprocessing, then feature extraction, and so on). It can easily turn out that parts of the pipeline don't fit each other, e.g. while implementing the classifier you may figure out that you need completely different aspects of the data from preprocessing. Such things happen; don't worry if your carefully designed pipeline hits unexpected issues in the data. That's why it is better to implement all parts of the pipeline together.

I'd like to thank OPIUM for mentoring and for sharing the computation server used to build the seizure prediction model, and also Daftcode for sharing their machine learning library.

If you enjoyed this post, please hit the clap button below and follow our publication for more interesting articles about ML & AI.
