Choosing between ML models using pipes for code reuse

Jonathan Loscalzo
Hexacta Engineering
5 min read · Feb 28, 2020


While browsing posts, we can find many recipes on how to start an ML project: the popular "Hello World!".

Usually those recipes involve the following steps:
1. Data Gathering
2. Data Preparation
3. Train your Model
4. Evaluate it
5. Improve…

At the end, depending on the type of project, we either submit our predictions (if it is a competition) or continue with the deployment step (if it is a real-world problem).

We say to ourselves "Let's play with it" and follow the "recipe".
An hour later we end up with a big ball of mud, so to speak: the well-known spaghetti code. Our experiment finishes irreproducible and error-prone, nothing to be proud of, to be honest.

Furthermore, there are hidden steps in our flow, such as model selection and hyperparameter tuning, which won't appear in the deployed model even though they are just as important.

So, how do we evolve our code so that it is ordered, reproducible, and flexible? Can we get a modularized solution? How?

We can address these problems by leveraging sklearn's pipeline utilities.

Think of the whole pipe as a big module made up of other tiny pipes. Each pipe has its own purpose, such as feature selection, feature transformation, or prediction.

The better you code today, the easier it will be to understand in the future.

In this post I will give an intuition of how to solve a classification problem with a reproducible and modular approach built with sklearn and pipes.

The Problem: Pump it Up, Data Mining the Water Table

We will build our approach around a challenge hosted by DrivenData. The goal is to predict the operating condition (functional, non functional, functional needs repair) of each water point in the dataset.

1. Analyzing the dataset

A more general analysis of the data can be found on the challenge site. In this section we avoid repeating that investigation, but it is good practice to do it ourselves to understand the data.

In general terms, and as a first approximation, we will drop some features that seem to carry similar information.

We will use a technique derived from feature interaction analysis to find columns with redundant information; when a column turns out to be redundant, we will drop it.

In other words, we will reduce the number of features to get better model performance.

Note: in tree-based models, higher cardinality results in more splits, which in turn results in more complexity. In any case, other tricks for selecting or dropping features exist, such as SelectKBest, SelectFromModel, WOE, and IV.

In our dataset, we end up dropping the following: ‘payment_type’, ‘extraction_type_group’, ‘management_group’, ‘quantity_group’, ‘source_type’, ‘waterpoint_type_group’.

The reasoning behind these drops is that some columns can be inferred from others, such as payment and payment_type.
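A minimal sketch of how such a redundancy check might look (the exact technique in the original analysis may differ; the is_redundant helper and the CSV path are mine):

```python
import pandas as pd

# Path is an assumption for illustration.
df = pd.read_csv("train.csv")

# Two categorical columns are redundant when one maps 1:1 onto the other,
# i.e. every value of col_a corresponds to exactly one value of col_b.
def is_redundant(df, col_a, col_b):
    return (df.groupby(col_a)[col_b].nunique() == 1).all()

# If payment_type is fully determined by payment, we can safely drop it.
print(is_redundant(df, "payment", "payment_type"))

df = df.drop(columns=["payment_type", "extraction_type_group", "management_group",
                      "quantity_group", "source_type", "waterpoint_type_group"])
```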

2. Lean Thinking? Build your first system quickly, then iterate.

Basically, this means applying an agile methodology.

The generalized concept is:

  • Build a model as soon as possible.
  • Measure the error to evaluate how we are doing.
  • Iterate: find better hyperparameters, gather more data, do feature engineering, or change the model.

3. Automate the workflow: Pipelines!

In an ML flow, some steps are repetitive and must be applied at different stages, such as feature selection or transformation.
In other words, to predict over an unseen dataset we must apply every preprocessing step (transformation) and only then predict with our trained model.

These transformation steps usually bring code smells (even more so if we use Jupyter notebooks). Many projects start as a Jupyter notebook, and you can tell by their quality: code spread all over without even simple abstractions such as functions.
Don't get me wrong, Jupyter notebooks are a great resource for demonstrating things and sharing knowledge, but there are better tools for engineering data science software.

So, what if instead of spreading our code all over we create modules and compose them? We would be writing clean code.
To achieve this, we mainly need to get acquainted with three utilities:

  • sklearn.pipeline.Pipeline: to compose pipes; we could also use the make_pipeline helper.
  • sklearn.pipeline.FeatureUnion: to concatenate the outputs of several pipes.
  • sklearn.preprocessing.FunctionTransformer: to turn our own functions into transformers and make our pipes more flexible.

FunctionTransformer can be used to build column selectors:
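For instance, a minimal selector could look like this (the helper names select_columns and column_selector are illustrative, not from the original gist):

```python
from sklearn.preprocessing import FunctionTransformer

def select_columns(X, columns):
    # Keep only the requested columns of a pandas DataFrame.
    return X[columns]

def column_selector(columns):
    # validate=False lets the DataFrame pass through without numpy conversion.
    return FunctionTransformer(select_columns, validate=False,
                               kw_args={"columns": columns})

# Usage: column_selector(["population"]).fit_transform(df)
```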

The next snippet shows how to build the preprocessing pipeline.
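A sketch of that pipeline, reusing the column_selector helper from above (the column lists and the extracted date features are assumptions about the dataset, not the original gist):

```python
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from category_encoders import OrdinalEncoder

NUMERIC_COLS = ["amount_tsh", "gps_height", "population", "latitude", "longitude"]
TEXT_COLS = ["basin", "region", "extraction_type", "payment", "quantity", "source"]

def extract_date_parts(X):
    # Turn the raw date_recorded column into numeric year/month features.
    dates = pd.to_datetime(X["date_recorded"])
    return pd.DataFrame({"year": dates.dt.year, "month": dates.dt.month})

date_pipe = Pipeline([
    ("select", column_selector(["date_recorded"])),
    ("parts", FunctionTransformer(extract_date_parts, validate=False)),
])

numeric_pipe = Pipeline([
    ("select", column_selector(NUMERIC_COLS)),
    ("impute", SimpleImputer(strategy="mean")),  # the mean is learnt on fit()
])

text_pipe = Pipeline([
    ("select", column_selector(TEXT_COLS)),
    ("encode", OrdinalEncoder(handle_unknown="value")),  # tolerates unseen labels
])

preprocess_pipe = FeatureUnion([
    ("date", date_pipe),
    ("numeric", numeric_pipe),
    ("text", text_pipe),
])
```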

The OrdinalEncoder in the text pipe comes from *category_encoders*. It is more robust than LabelEncoder when we encounter unseen labels.

Our preprocessing pipeline is split up into three smaller pipes: date, numeric, and text preprocessing! It is easy to reproduce, easy to test, and easy to build more pipes on top of.

In a typical pipeline we concatenate transformers, and the last step is an estimator. So, in a reusable and generic way, a full pipeline could be:
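A generic builder might look like this (make_full_pipeline is my name for it; clone gives each pipeline its own fresh copy of the shared preprocessing steps):

```python
from sklearn.base import clone
from sklearn.pipeline import Pipeline

def make_full_pipeline(classifier):
    # Compose the shared preprocessing with any estimator as the last step.
    return Pipeline([
        ("preprocess", clone(preprocess_pipe)),
        ("classifier", classifier),
    ])
```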

The method above lets us build many pipelines with many classifiers! It is much cleaner than scattering code across many cells.

4. Selecting a model at a glance

Now that we have our pipeline up and running, we have to choose the model that best fits our data so that we can predict new cases. To train our first models we are going to use cross_val_score from sklearn:
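A sketch of that comparison loop, shaped to match the output further below (the number of folds and hyperparameters are guesses; X is the cleaned feature DataFrame and y the status_group labels):

```python
from datetime import datetime
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

classifiers = [XGBClassifier(), LGBMClassifier(),
               RandomForestClassifier(n_estimators=100),
               CatBoostClassifier(verbose=0)]

for clf in classifiers:
    print("*" * 10, "Start", "*" * 10)
    start = datetime.now()
    # cross_val_score refits the full pipe on each fold: transformers
    # and estimator are trained on the training folds only.
    scores = cross_val_score(make_full_pipeline(clf), X, y,
                             scoring="accuracy", cv=3)
    print(clf.__class__.__name__, ":", scores.mean(), "+/-", scores.std())
    print("Time spent :", datetime.now() - start)
    print("*" * 10, "End", "*" * 10)
```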

We evaluate a metric for each of them and select the one that performs best. In this case we evaluate accuracy, although we know about the issues it has with imbalanced datasets.

But wait: why are we using the pipeline for each classifier? Transformers, like estimators, should be learnt from the training set and then applied to the held-out test set for prediction; pipelines make this easy! Every time you fit the full pipe, you train the transformers and the estimator together.

Remember: easy to reproduce is the key

Fitting transformers on the training set only is related to data leakage: in short, data leakage happens when information from the test set leaks into the training process. For instance, our numeric pipe has a SimpleImputer transformer that fills missing values with the mean; that mean should be estimated from the TRAIN set only! This is easy to get right with pipelines.
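For instance (a minimal sketch, assuming the make_full_pipeline helper from above):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = make_full_pipeline(RandomForestClassifier())
pipe.fit(X_train, y_train)          # the imputer's mean is computed from TRAIN only
predictions = pipe.predict(X_test)  # that same mean is reused; nothing leaks from test
```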

Running the cross-validation loop produced the following output:

********** Start **********
XGBClassifier : 0.7336868686868687 +/- 0.0029033449613791182
Time spent : 0:00:26.835271
********** End **********
********** Start **********
LGBMClassifier : 0.7723063973063974 +/- 0.0009899791533999399
Time spent : 0:00:10.892493
********** End **********
********** Start **********
RandomForestClassifier : 0.8024242424242424 +/- 0.0021431445901952087
Time spent : 0:00:18.336419
********** End **********
********** Start **********
CatBoostClassifier : 0.7148653198653198 +/- 0.0023408544566768295
Time spent : 0:00:07.668704
********** End **********

As we can see, RandomForest has the best accuracy, although it was slower to train than LightGBM and CatBoost. Despite this, we will select it.

In the next part we will search for the hyperparameters that perform best, train a final model, and submit our predictions.

