How to Build an End-to-End Custom Machine Learning Pipeline

Hasan Basri Akçay
Published in DataBulls · 7 min read · Oct 10, 2022

A brief introduction to constructing a custom machine learning pipeline

Image by storyset on Freepik — www.freepik.com

What Is a Machine Learning Pipeline?

An ML pipeline is a technique for constructing an end-to-end workflow of steps such as feature cleaning, encoding, extraction, and selection, and it increases the reusability of your code by condensing all ML steps into a single model. Pipelines also help you avoid mistakes such as using the wrong version of an encoding model or forgetting to repeat, at inference time, a step that was applied during training. ML pipelines become even more significant when you consider how many steps a successful ML project requires on real-world problems. Moreover, reusability, reliability, and efficiency are major parts of Machine Learning Operations (MLOps), which makes the ML pipeline an important part of MLOps.

Pipeline Examples

Sklearn has excellent methods for machine learning steps, such as Column Transformer, Standard Scaler, One-Hot Encoder, Simple Imputer, etc., and these methods ease data scientists' work. To keep the reading time short, we don't explain every method in detail, but many of them are used for building ML pipelines, as in the example below. Briefly, the following pipeline uses Min-Max Scaler for data scaling, KNN Imputer for missing-value handling, Ordinal Encoder to convert categorical features into numerical features, and Linear Regression as the machine learning model for predictions.
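The original notebook cells are not reproduced here, so the following is a minimal sketch of such a pipeline; the column names (age, income, city, gender) are hypothetical placeholders:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

# Hypothetical column names -- replace them with your own dataset's columns.
numeric_features = ["age", "income"]
categorical_features = ["city", "gender"]

# Numeric branch: fill missing values with KNN, then scale to [0, 1].
numeric_transformer = Pipeline(steps=[
    ("imputer", KNNImputer(n_neighbors=2)),
    ("scaler", MinMaxScaler()),
])

# Categorical branch: map string categories to integer codes.
categorical_transformer = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# The full pipeline: preprocessing followed by a linear model.
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression()),
])
```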

We used Deepnote, a well-designed, Jupyter-like web-based notebook environment that supports multi-user development, to share the code.

After construction, the pipeline behaves like any Sklearn machine learning model: the fit, predict, and predict_proba methods of Sklearn models are available on the pipeline. Furthermore, it is easy to evaluate with cross-validation and easy to tune with Grid Search CV, Random Search CV, or Optuna. In the example below, the Sklearn cross_val_score method is used with the pipeline without any error.
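A sketch of that evaluation, reusing the pipe object from the previous snippet on hypothetical toy data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score

# Hypothetical toy data matching the column names above.
X = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 29, 38, 44],
    "income": [40e3, 52e3, np.nan, 61e3, 75e3, 43e3, 58e3, 67e3],
    "city": ["London", "Paris", "Paris", "Berlin",
             "London", "Berlin", "Paris", "London"],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
})
y = pd.Series([1.2, 2.3, 3.1, 2.8, 3.9, 1.7, 2.5, 3.4])

# The pipeline plugs straight into cross_val_score like any estimator.
scores = cross_val_score(pipe, X, y, cv=3)
print(scores.mean())
```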

Everything looks fine so far; however, real-world datasets are more complex than toy datasets. Data scientists usually perform problem-specific data cleaning to create a good model, and sometimes even data cleaning is not enough. When the goal is not reached after data cleaning, they use feature engineering techniques to build more accurate models. As a result, the classical Sklearn methods alone often don't solve your problem, which is why you need to create your own functions. In this article, you will find out how to add a custom function to the pipeline.

To be successfully used by most businesses, artificial intelligence needs to be less focused on building models and more focused around data. — Andrew Ng

Custom Machine Learning Pipeline

There are a lot of models in Sklearn for machine learning pipelines, such as Iterative Imputer, Normalizer, Label Encoder, etc. These models are well documented on the Sklearn website, so we focus only on custom models in this article.

Feature Cleaning in the Pipeline

Data cleaning, also called feature cleaning, is the process of correcting, compressing, or removing records from the dataset, and it comes in many forms, such as missing-value handling and converting string features to numeric features. Missing-value imputation and feature encoding already have Sklearn methods, so in this part we focus primarily on outlier handling and on correcting misspelled words.

1. Outlier Handling

Outliers are data points that lie far from the rest of the data, and they hurt predictions, especially for linear models such as Linear Regression, Ridge, and Logistic Regression. Therefore, data scientists should clean the outliers, but doing this manually is hard: the minimum and maximum thresholds of each feature would have to be calculated, saved to a file, and updated on every training run. Consequently, we create a Python class for outlier elimination below. The reusable Outlier Handler class works with the Sklearn Pipeline method and has one parameter, a cut point for the range of the probability distribution. As a result, data scientists can save a lot of time by using this class.
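The original class is not shown here; below is a minimal sketch of an Sklearn-compatible transformer, assuming the cut point q is a quantile and that outliers are clipped to the learned thresholds rather than removed (a transformer inside a pipeline cannot drop rows without desynchronizing X and y):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierHandler(BaseEstimator, TransformerMixin):
    """Clip each numeric column to its [q, 1 - q] quantile range."""

    def __init__(self, q=0.01):
        self.q = q  # cut point of the probability distribution

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learn per-column thresholds on the training data only.
        self.lower_ = np.nanquantile(X, self.q, axis=0)
        self.upper_ = np.nanquantile(X, 1 - self.q, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Clipping keeps X and y aligned, unlike dropping rows.
        return np.clip(X, self.lower_, self.upper_)
```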

2. Spell Correction

In real-world datasets there can be many misspelled words; humans can understand these spelling mistakes, but machines cannot: a model treats them as different words. For this reason, engineers should fix the spelling mistakes, but finding all of them manually can be hard, especially in big datasets. That's why data scientists use Fuzzywuzzy, a Python library that matches strings according to a similarity score; however, Fuzzywuzzy on its own is not suitable for Sklearn pipelines. For this reason, the Spell Corrector class is created below with two parameters, "th" and "verbose". The "th" parameter is a threshold for the Fuzzywuzzy score, while "verbose" controls the verbosity: if it is true, the training process is written to the console. The Spell Corrector class has the same methods as other Sklearn models and makes machine learning engineers' work easier.
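The original Spell Corrector is likewise not reproduced; here is a minimal sketch under the assumption that the vocabulary of correct spellings is whatever values appear in the training data, and that only string columns are corrected:

```python
import pandas as pd
from fuzzywuzzy import process  # pip install fuzzywuzzy
from sklearn.base import BaseEstimator, TransformerMixin

class SpellCorrector(BaseEstimator, TransformerMixin):
    """Replace unseen string values with their closest known spelling."""

    def __init__(self, th=80, verbose=False):
        self.th = th            # minimum Fuzzywuzzy score to accept a match
        self.verbose = verbose  # print every replacement when True

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Values seen during training become the vocabulary of each
        # string column.
        self.vocab_ = {
            col: X[col].dropna().unique().tolist()
            for col in X.select_dtypes(include="object").columns
        }
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, vocab in self.vocab_.items():
            def correct(value):
                if pd.isna(value) or value in vocab:
                    return value
                match, score = process.extractOne(str(value), vocab)
                if score >= self.th:
                    if self.verbose:
                        print(f"{col}: '{value}' -> '{match}' ({score})")
                    return match
                return value
            X[col] = X[col].map(correct)
        return X
```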

Feature Engineering in the Pipeline

Deep learning models can learn feature patterns on their own, but for classical machine learning models, engineers have to find the patterns themselves. Data scientists look for them by adding new features to the dataset, such as arithmetic features, time features, and ML predictions. Once those patterns are captured, the model's score improves; therefore, creating new features is one of the most important steps in data science, but it is hard to do inside Sklearn pipelines. That's why we created an example below that adds arithmetic features and ML predictions. The Add Features class has four parameters: fold, random_state, n_clusters, and features. The fold parameter is the number of splits for K-fold, random_state and n_clusters are parameters of the KMeans model, and features is the list of numeric features the KMeans model is trained on. Which features should be added changes from problem to problem, so data scientists can modify this class for their own problems and use it in Sklearn pipelines.
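A simplified sketch of such a class follows; the arithmetic features (sum and difference of the first two selected columns) are illustrative choices, and the out-of-fold prediction logic implied by the fold parameter is omitted here for brevity:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddFeatures(BaseEstimator, TransformerMixin):
    """Append arithmetic features and a KMeans cluster label as new columns."""

    def __init__(self, fold=5, random_state=42, n_clusters=8, features=None):
        # `fold` mirrors the article's interface (K-fold out-of-fold
        # predictions); this simplified sketch does not use it.
        self.fold = fold
        self.random_state = random_state
        self.n_clusters = n_clusters
        self.features = features  # numeric columns KMeans is trained on

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Fit KMeans on the selected numeric columns of the training data.
        self.kmeans_ = KMeans(n_clusters=self.n_clusters,
                              random_state=self.random_state,
                              n_init=10).fit(X[self.features])
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        a, b = self.features[0], self.features[1]
        X["sum_feat"] = X[a] + X[b]     # arithmetic features
        X["diff_feat"] = X[a] - X[b]
        # Cluster label as an extra model-based feature.
        X["cluster"] = self.kmeans_.predict(X[self.features])
        return X
```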

Feature Selection in the Pipeline

Generally, machine learning models don't need all of the input data: some features can slow your model down or, even worse, reduce its score. For these reasons, data scientists select the important features. There are many feature selection algorithms, such as SelectKBest, Variance Threshold, Permutation Importance, and Featimp. On big datasets, feature selection is usually done in the exploratory data analysis phase for faster preprocessing, but on small datasets the difference in execution time is negligible; therefore, we use SelectKBest inside the pipeline.

You can see the pipeline-building process below, where the Outlier Handler, Spell Corrector, and Add Features classes are used. First the preprocessor is created, then the important features are selected, and the last part of the pipeline, the regressor, where Linear Regression is used for estimation, returns the result. After the pipeline is constructed, you can easily save your model with Pickle or Joblib, which means every data science step can be saved in a single file; this makes your work more professional and your code more readable.
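Putting the pieces together, a sketch of the final pipeline could look like the following. It reuses the hypothetical classes and column names from the earlier sketches and assumes X_train and y_train contain no missing values:

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Preprocessor: custom transformers first, then encoding, then outlier clipping.
preprocessor = Pipeline(steps=[
    ("spelling", SpellCorrector(th=80)),
    ("features", AddFeatures(n_clusters=4, features=["age", "income"])),
    ("encoder", ColumnTransformer(
        [("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                                unknown_value=-1), ["city", "gender"])],
        remainder="passthrough",
    )),
    ("outliers", OutlierHandler(q=0.01)),
])

pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("selector", SelectKBest(score_func=f_regression, k=5)),
    ("regressor", LinearRegression()),
])

pipe.fit(X_train, y_train)

# The whole workflow -- cleaning, features, selection, model -- in one file.
joblib.dump(pipe, "pipeline.joblib")
```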

Testing is always necessary in data science projects, so data scientists may want to run a final test on the pipeline. NumPy's assertion method, which checks the equality of two arrays, is a good option for testing results: if the two arrays are not equal, the function raises an AssertionError. As you can see below, the cross-validation scores are compared with the assert_array_equal method before and after saving the model; consequently, the model is confirmed for use in production.
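A sketch of that final check, reusing pipe, X_train, and y_train from above; since every estimator has a fixed random_state and the default KFold split is deterministic, the scores before and after saving should match exactly:

```python
import numpy as np
import joblib
from sklearn.model_selection import cross_val_score

# Scores from the in-memory pipeline.
scores_before = cross_val_score(pipe, X_train, y_train, cv=3)

# Scores from the pipeline reloaded from disk.
loaded = joblib.load("pipeline.joblib")
scores_after = cross_val_score(loaded, X_train, y_train, cv=3)

# Raises an AssertionError if the two score arrays differ anywhere.
np.testing.assert_array_equal(scores_before, scores_after)
```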

Conclusions

This article was about adding your own functions to a machine learning pipeline, and as its results show, we can create a machine learning product with the Sklearn pipeline method and problem-specific functions. These functions can be used for outlier handling, spell correction, feature extraction, and feature selection, or you can modify them and use them for whatever you need.

Within the article's scope, the product's code becomes more readable, reusable, and suitable for MLOps; for this reason, the machine learning models will live longer and stay more accurate.
