e2eml — A full ML pipeline with just a few lines of code?

Thomas Meißner
Jul 20, 2021 · 8 min read


In this article I will not only share the library, but also the struggles I went through during this project. The purpose of this article is to help others avoid making the same mistakes.

ML is not ML

After I started working as an analyst, building machine learning models came up rather soon. Without any prior experience I had to research how to do classification and regression. Many tutorials show “how to do” these things, but only cover parts of the full pipeline, leaving the impression that these tasks are easy: just get some data, create some features, maybe scale the data and call “fit” and “predict”.
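
In code, this tutorial-style recipe often boils down to just a few lines. A generic scikit-learn sketch, purely for illustration and not taken from any particular tutorial:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# get some data and split off 20 percent as a test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# maybe scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# call "fit" and "predict"
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)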

This is sufficient unless you want a sophisticated model.

Comparing my first models with AutoML outputs and with others on Kaggle showed me how much more performance you can actually squeeze out of a model. However, getting that performance requires a significant amount of time. Knowing the best algorithms, choosing the right model, preprocessing the data to suit the model, optimizing hyperparameters and evaluating your model requires not only knowledge, but also coding time. And at this point we don’t even have a pipeline, but rather some code sitting in a Jupyter notebook, which can only predict on the 20 percent test split. As you might have noticed, implementing business knowledge has not even been mentioned yet, even though it should take up significant room in your considerations.

Why e2eml?

e2eml was actually created to improve my understanding of Python. At some point I realized that I had never created a custom class and did not know much about object-oriented programming. However, model building and especially deployment require software engineering skills and best practices, potentially including object-oriented programming. Personally I prefer project-oriented learning, so I decided to learn OOP by working towards a goal. The idea for e2eml was born.

AutoML is not new and there are fantastic libraries like H2O and PyCaret out there. By no means will I try to convince you to drop them for e2eml. This article simply presents the process and outcome of my project.

The big struggle and learnings from it

In this section I want to share my struggles and the learnings from them, hoping that others will not fall into the same traps.

Learning I: Plan wisely

Writing a script or a Jupyter notebook to answer a business question is one thing, but writing a whole library is on a different scale. It requires even more planning and decision making. Looking back, I underestimated the importance of planning. I only created a rough plan of what the end product should look like and how to get there, which required quite some rethinking and refactoring later on. Originally I wanted to build the library around the RAPIDS ecosystem (which I really like), but dropped this idea as it comes with hard(ware) requirements.

Learning II: Be realistic about what is an MVP

Initially I wanted e2eml to cover:

  • CPU-based preprocessing
  • GPU-based preprocessing
  • Classification
  • Regression
  • Time series

However, this would have at least doubled the development time. So I decided to drop GPU-based preprocessing and time series modelling (for now).

Learning III: Unit testing saves time

In the beginning I added many features, but did not write any dedicated tests. I ran models on some test data in a Jupyter notebook and everything ran through, until I changed the test data at some point and realized that many bugs had gone unseen. I then started to create at least some high-level tests as part of the library. This not only helped to find more bugs, but also accelerated the testing process. However, the test coverage in e2eml is not very strong yet and still needs improvement.

Learning IV: Working less can save time

As this was a project close to my heart, I invested every free minute into it. This meant writing code before and after work, during the weekend and sometimes “on the fly” when I had a few minutes in between. Writing code on the fly or while tired is a very bad idea. Fixing the bugs I introduced this way probably cost more time in the end than simply waiting until I had sufficient time and mental resources.

Learning V: Have a more experienced mentor

On a platform called MentorCruise* I found Stephen Gabriel, who mentors me regularly. He is an experienced software engineer and challenged me throughout the project. This really helped me understand what, how and why I should do things differently in order to create a better library.

Learning VI: Have your users in mind

A good product is user-oriented. In the beginning I planned the MVP around some shiny ideas. However, during the process I rethought who might actually use this library and adjusted it towards the potential audience. For example: users of this library might need a quick solution, so I tried to make the installation as easy as possible. In the beginning, users would have had to download spacy on their own and also needed to specify whether LGBM and XGBoost had been installed with GPU acceleration or not. Both of these things now happen automatically.

Learning VII: One project is not enough

After having made all of these mistakes I will actually need another project to solidify my learnings. I am pretty sure that I would find other mistakes or bad patterns I have not even noticed yet.

About the product itself

This section will cover e2eml itself as it is the result of this project.

The vision

e2eml shall build a full, sophisticated ML pipeline in a few lines of code. This includes:

  • preprocessing (rare features, collinearity, categorical data, outliers and much more)
  • some feature creation (binning, category-based features etc.)
  • model building
  • hyperparameter optimization
  • evaluation

To the core: How e2eml works

If you were patient enough to read until here, you are probably waiting for the actual e2eml showcase. Let’s get into it:

To make the library as accessible as possible I published it on PyPI. I recommend creating a fresh environment for it. Then you can install the library with:

pip install e2eml
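
If you follow the fresh-environment recommendation, a minimal setup with Python’s built-in venv could look like this before installing (conda works just as well; the environment name below is only an example):

python -m venv e2eml_env           # create the environment
source e2eml_env/bin/activate      # activate it (on Windows: e2eml_env\Scripts\activate)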

It is recommended to install XGBoost and LGBM with GPU acceleration into the environment. The models and also the SHAP values will run a lot faster. However, this is optional. e2eml will automatically detect this and adjust its behaviour accordingly (e.g. not calculating SHAP values, but using the built-in feature importance instead).
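
For illustration, such a detection can be done by attempting a tiny GPU-backed fit and falling back on failure. A rough sketch of this general pattern (not necessarily how e2eml implements it internally):

import numpy as np
import xgboost as xgb

def xgb_gpu_available():
    # Try to fit a minimal model with the GPU tree method. If the installed
    # XGBoost build lacks GPU support (or no GPU is present), this raises an
    # error and we report that only CPU behaviour is available.
    try:
        xgb.XGBClassifier(tree_method='gpu_hist', n_estimators=1).fit(
            np.array([[0.0], [1.0]]), np.array([0, 1]))
        return True
    except Exception:
        return False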

Next we need to import e2eml:

from e2eml.classification import classification_blueprints as cb
from e2eml.full_processing.postprocessing import save_to_production, load_for_production
from e2eml.test.classification_blueprints_test import load_titanic_data
import pandas as pd # we need Pandas for CSV import

Next we import the Titanic dataset. In this case this happens with some additional feature creation in advance:

# load Titanic data
test_df, test_target, val_df, val_df_target, test_categorical_cols = load_titanic_data()

It does not matter how you import your data. e2eml just needs a dataframe to be passed to it (with your target variable being a column of it).
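
For your own project this can be as simple as reading a CSV with pandas (the file and column names below are hypothetical placeholders):

# any dataframe will do, as long as the target is one of its columns
my_df = pd.read_csv('my_training_data.csv')
my_target = 'churned'  # name of the target column inside my_df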

Now we instantiate our e2eml class:

# Instantiate class
titanic_auto_ml = cb.ClassificationBluePrint(datasource=test_df,
                                             target_variable=test_target,
                                             categorical_columns=test_categorical_cols,
                                             preferred_training_mode='auto',
                                             tune_mode='accurate')

Please note that you can define which columns are categorical or datetime, but this is optional; otherwise e2eml will detect them automatically. ‘preferred_training_mode’ is optional as well and is shown here with its default value: with ‘auto’, e2eml will automatically detect whether LGBM and XGBoost run with GPU acceleration. ‘tune_mode’ defines whether hyperparameters will be validated with one-fold (‘simple’) or ten-fold cross-validation (‘accurate’). ‘datasource’ and ‘target_variable’ are mandatory (target_variable is simply the name of the target column within the datasource dataframe).
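
Since everything except ‘datasource’ and ‘target_variable’ is optional, a minimal instantiation could look like the sketch below, leaving column detection and the training mode to e2eml (here with ‘simple’ tuning to shorten runtimes):

# minimal setup: let e2eml detect categorical/datetime columns itself
titanic_auto_ml = cb.ClassificationBluePrint(datasource=test_df,
                                             target_variable=test_target,
                                             tune_mode='simple')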

Now we run the model blueprint:

# Run chosen blueprint
titanic_auto_ml.ml_bp01_multiclass_full_processing_xgb_prob(preprocessing_type='nlp') #'nlp' adds an additional feature engineering step

That’s it. e2eml will now do the preprocessing, feature creation, model training, hyperparameter tuning, evaluation and logging.

We can save our whole pipeline and also load it again:

# Save pipeline
save_to_production(titanic_auto_ml, file_name='titanic_automl_instance')
# load stored pipeline
titanic_auto_ml_loaded = load_for_production(file_name='titanic_automl_instance')

In the last step we run the blueprint again to predict on new data:

# predict on new data
titanic_auto_ml_loaded.ml_bp01_multiclass_full_processing_xgb_prob(val_df, preprocessing_type='nlp')
# access predicted labels
val_y_hat = titanic_auto_ml_loaded.predicted_classes['xgboost']
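
If you kept hold-out labels around (as load_titanic_data does with val_df_target), a quick sanity check with scikit-learn is straightforward, assuming the predicted labels share the encoding of the stored targets:

# optional: compare predictions against the held-out labels
from sklearn.metrics import accuracy_score, matthews_corrcoef
print('Accuracy:', accuracy_score(val_df_target, val_y_hat))
print('MCC:', matthews_corrcoef(val_df_target, val_y_hat))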

That’s it. We have created a full ML pipeline in a few lines of code. During runtime some useful information is printed out: time stamps of the blueprint steps, the selected features, hyperparameter trial information, feature importance (here via SHAP, as XGBoost ran on GPU) and an ROC curve together with different evaluation metrics.

As of now you can select between three different preprocessing pipelines and multiple models/model ensembles to be trained on top. If these are not enough, it is also possible to create custom pipelines. Example notebooks and further information can be found on GitHub and PyPI.

Will e2eml be maintained?

Yes. e2eml has reached its MVP state and will be maintained and enhanced. If you have any questions, feel free to drop me a message via LinkedIn.

Some last words

As of now the library covers classification and regression blueprints. It is not truly end-to-end yet, as deployment is missing. The creation of an API endpoint as part of the library will be investigated.

The library is not designed to run on weak hardware. Runtimes are rather long due to the hyperparameter optimization (except for the logistic and linear regression blueprints).

Please note that such a library does not guarantee a good model. Even the best libraries will still require your business knowledge. The purpose of this library is to reduce your model building time so you can invest more time into business thinking.

I hope you like the article. Feel free to share your thoughts. Happy model building. :-)

*Just to be clear: I only mentioned MentorCruise as part of the process. MentorCruise had no involvement in being named here.
