Published in Analytics Vidhya

Scikit-Learn Pipeline Transformers — The hassle of transforming target variables (Part 1)

After working for quite a while on time series forecasting with Scikit-Learn, I came across a major issue with the implementation of standardized Pipelines: creating entire Pipelines that can transform target variables into features (e.g. lags, moving averages) in a production environment.

After some extensive and unfruitful research on the topic, I decided to try solving this issue without resorting to messy, somewhat hardcoded pre-processing functions that have to be called before the actual model.

The idea of Part 1 of this article is to show how I managed to overcome this problem with a simple yet elegant solution.

The usual approach

Whenever you come across an article or story about time series forecasting, there is always the “transform to supervised” function: you pass it a DataFrame or array and a few parameters, and it converts a one-dimensional time series into a supervised learning dataset.

The example below, extracted from Machine Learning Mastery, depicts such a function.
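Roughly, it follows the well-known series_to_supervised pattern. A simplified sketch for a univariate series (my own paraphrase, not the exact original code) looks like this:

```python
import pandas as pd


def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a univariate time series as a supervised learning dataset.

    data:   list, 1D array or single-column DataFrame of observations
    n_in:   number of lag observations to use as inputs (X)
    n_out:  number of observations to use as outputs (y)
    """
    df = pd.DataFrame(data)
    cols, names = [], []
    # input sequence: t-n_in, ..., t-1
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names.append(f"var(t-{i})")
    # forecast sequence: t, t+1, ..., t+n_out-1
    for i in range(n_out):
        cols.append(df.shift(-i))
        names.append(f"var(t+{i})" if i > 0 else "var(t)")
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    if dropnan:
        agg = agg.dropna()
    return agg


# each remaining row becomes [var(t-2), var(t-1), var(t)]
print(series_to_supervised([10, 20, 30, 40, 50], n_in=2))
```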

There is absolutely nothing wrong with using this function for prototyping or testing. The major drawback is that it cannot be used inside a Scikit-Learn transformer within a Pipeline.

From now on I will assume that you know how transformers work, and also know that if you chain transformers in a pipeline, the only argument passed along is your X (feature array), not your target variable y. The only exceptions are label transformers and the TransformedTargetRegressor.
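For reference, here is a quick sketch of what that exception looks like with scikit-learn's TransformedTargetRegressor: it transforms y only around the final regressor, so it still does not let an intermediate transformer turn y into new columns of X.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# y is transformed with func before fitting and mapped back with
# inverse_func at prediction time, but only around the regressor.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)

X = np.arange(1, 21).reshape(-1, 1)
y = np.arange(1, 21, dtype=float) ** 2
model.fit(X, y)
print(model.predict([[21]]))
```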

The bottom line is: you cannot use a transformer that takes part of your target variable array and turns it into part of your X array. You transform either one or the other.

If you are not sure what I am talking about, go ahead and try the following code by yourself and check the results:
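A minimal sketch of that situation (class and variable names are illustrative, not the original gist), where a transformer tries to build a feature out of y:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LagFromTarget(BaseEstimator, TransformerMixin):
    """Tries to build a new feature out of y inside transform()."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # The pipeline never forwards y to transform(), so y arrives as None
        return [x + y for x in X]


X = [1, 2, 3, 4]
y = [10, 20, 30, 40]

pipe = Pipeline([("lag", LagFromTarget())])
pipe.fit_transform(X, y)
```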

If you try to run this code you will get the following error:

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

That happens because the y argument of the transform function is not passed on through the pipeline.

For that exact reason you can’t make all of your data transformations happen in a single, neat pipeline that can be stored as an artifact on any MLOps platform. Your only option is to let the feature transformation take place before the model itself is created.

OK, got it! So… what do we do?

Well, first things first: what does a transformer actually do?

By definition it takes in any number of arguments during instantiation and, afterwards, takes an object, transforms it, and returns the transformed object. Usually these objects are our DataFrames or numpy arrays; however, the trick here is to understand that these objects can be literally anything!
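In scikit-learn terms, that contract boils down to a skeleton like the one below (names are illustrative). Note that nothing in a custom transformer forces X to be a DataFrame or an array:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class MyTransformer(BaseEstimator, TransformerMixin):
    """Bare-bones custom transformer."""

    def __init__(self, some_param=1):
        # arguments are stored at instantiation time
        self.some_param = some_param

    def fit(self, X, y=None):
        # learn whatever is needed from X (nothing, in this skeleton)
        return self

    def transform(self, X, y=None):
        # X can be any Python object; the contract is simply
        # "take an object in, return the transformed object"
        return X
```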

So, with that in mind, why don't we create a simple object that contains both X and y, and transform that?

Let's try the same example from above, but using this little trick:
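A minimal sketch of the idea (the Data container and LagFeature transformer below are my own names, standing in for the original notebook's code):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class Data:
    """Simple container bundling the features and the target together."""

    def __init__(self, X, y):
        self.X = X
        self.y = y


class LagFeature(BaseEstimator, TransformerMixin):
    """Appends a lagged copy of y as a new column of X."""

    def __init__(self, lag=1):
        self.lag = lag

    def fit(self, data, y=None):
        return self

    def transform(self, data, y=None):
        lagged = np.roll(data.y, self.lag)
        lagged[: self.lag] = np.nan        # the first `lag` values are unknown
        data.X = np.column_stack([data.X, lagged])
        return data                        # pass the whole container along


X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * np.arange(10, dtype=float)

pipe = Pipeline([("lag1", LagFeature(lag=1)),
                 ("lag2", LagFeature(lag=2))])

out = pipe.fit_transform(Data(X, y))
print(out.X.shape)   # (10, 3): the original column plus the two lag features
```

Each step receives the whole container, uses y to enrich X, and hands the same object to the next step, so the target travels through the pipeline together with the features.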

And there you go! Now you have a transformer that can take both X and y and transform them without breaking a sweat. Note that this strategy can be extended to extremely complicated pipelines and it still works.

You can even attach data from your intermediate steps to your object, for example in a list, in case you need to debug the output data and don’t want to go around writing print statements and putting breakpoints all over your transformations (see the sketch below).
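Continuing the hypothetical Data/LagFeature sketch from above, that could look like this:

```python
class LoggingLagFeature(LagFeature):
    """Same as LagFeature, but keeps a snapshot of X after each step."""

    def transform(self, data, y=None):
        data = super().transform(data, y)
        # store a copy of the intermediate feature matrix on the container,
        # so it can be inspected after the pipeline has run
        data.history = getattr(data, "history", []) + [data.X.copy()]
        return data
```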

Concluding remarks

In this story we covered the basics of how to trick Scikit-Learn pipelines into transforming our data using both features and target variables. Now that we have the data transformation pipelines, Part 2 of this story will demonstrate how we can pass this object to our models without having to create intermediate steps, allowing us to use the fit() and predict() methods directly.

Part 2:

https://carlos-schwabe-44885.medium.com/scikit-learn-pipeline-transformers-the-hassle-of-transforming-target-variables-part-2-ec8546c33ac6
