Scikit-Learn Pipeline Transformers — The hassle of transforming target variables (Part 1)
After working for quite a while on time series forecasting with Scikit-Learn, I came across a major issue with standardized Pipelines: building a complete Pipeline that can turn target variables into features (e.g. lags, moving averages) in a production environment.
After extensive and fruitless research on the topic, I decided to try solving the issue without resorting to messy, somewhat hardcoded pre-processing functions that have to be called before the actual model.
The idea of part 1 of this article is to present how I managed to overcome this problem using a simple, yet elegant solution.
The usual approach
Whenever you read an article or story about time series forecasting, there is almost always a “Transform to supervised” function: you pass it a DataFrame or array plus some parameters, and it converts a one-dimensional time series into a supervised learning dataset.
The example below, taken from Machine Learning Mastery, depicts such a function:
There is absolutely nothing wrong with using this function for prototyping or testing. The major drawback is that it cannot be used as a Scikit-Learn transformer inside a Pipeline.
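For readers who have not seen one, here is a minimal sketch of what such a function typically looks like. It is not the Machine Learning Mastery code; the name to_supervised, its n_lags parameter and the column labels are assumptions made for illustration.

```python
import pandas as pd


def to_supervised(series, n_lags=1):
    """Hypothetical helper: turn a 1-D series into lag features plus a target column."""
    s = pd.Series(series, dtype="float64")
    # One column per lag, oldest lag first (t-n_lags, ..., t-1).
    columns = {f"t-{lag}": s.shift(lag) for lag in range(n_lags, 0, -1)}
    # The current value is the target to predict.
    columns["t"] = s
    frame = pd.DataFrame(columns)
    # The first n_lags rows have no complete history, so drop them.
    return frame.dropna().reset_index(drop=True)


supervised = to_supervised([10, 20, 30, 40, 50], n_lags=2)
print(supervised)
```

Each row now holds the two previous values as features and the current value as the target.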
From now on I will assume that you know how transformers work, and that you know that when you chain transformers in a pipeline, the only thing handed from one step to the next is your X (features array), not your target variable y. The pipeline does pass y to fit, but never to transform, and no step can return a modified y. The only exceptions are the label transformers and TransformedTargetRegressor.
The bottom line is: you cannot use a transformer that takes part of your target variable array and turns it into part of your X array. You transform either one or the other.
If you are not sure what I am talking about, go ahead and try the following code by yourself and check the results:
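A minimal sketch that reproduces the problem could look like the following. The AddTarget transformer and the toy arrays are made up for illustration, and the try/except is only there so the script prints the error and finishes instead of stopping with a traceback.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class AddTarget(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that tries to use y inside transform."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # The pipeline calls transform(X) only, so y is still None here.
        return X + y


X = np.array([[1], [2], [3]])
y = np.array([10, 20, 30])

pipe = Pipeline([("add_target", AddTarget())])
pipe.fit(X, y)  # y reaches fit, but it is never forwarded to transform

error_message = None
try:
    pipe.transform(X)
except TypeError as err:
    error_message = str(err)
    print(f"TypeError: {err}")
```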
If you try to run this code you will get the following error:
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
That happens because the pipeline calls transform with X only, so the y argument is never passed on and keeps its default value of None.
For that exact reason you can’t make all your data transformations happen in a single, neat pipeline that can be stored as an artifact on any MLOps platform. Your only option here is to let the feature transformation take place before the model is created.
OK, got it! So… what do we do?
Well, first things first: what does a transformer actually do?
By definition it takes any number of arguments during instantiation and, afterwards, takes an object, transforms it, and returns the transformed object. Usually these objects are our DataFrames or numpy arrays. The trick here is to understand that these objects can be literally anything!
So, with that in mind, why don't we create a simple object that contains both X and y, and transform it?
Let's try the same example from above, but using this little trick:
And there you go! Now you have a transformer that can take both X and y and transform them without breaking a sweat. Note that this strategy can be extended to far more complicated pipelines and it still works.
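Here is a minimal sketch of the idea, assuming a small container class and two custom transformers. The names Dataset, LagTarget and DropIncompleteRows are hypothetical and exist only for this illustration; they are not part of Scikit-Learn.

```python
from dataclasses import dataclass

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


@dataclass
class Dataset:
    """Hypothetical container that carries X and y through the pipeline."""
    X: pd.DataFrame
    y: pd.Series


class LagTarget(BaseEstimator, TransformerMixin):
    """Adds lagged copies of y to X as new feature columns."""

    def __init__(self, lags=(1, 2)):
        self.lags = lags

    def fit(self, data, y=None):
        return self

    def transform(self, data):
        X = data.X.copy()
        for lag in self.lags:
            X[f"y_lag_{lag}"] = data.y.shift(lag)
        return Dataset(X=X, y=data.y)


class DropIncompleteRows(BaseEstimator, TransformerMixin):
    """Drops rows with missing lag values, keeping X and y aligned."""

    def fit(self, data, y=None):
        return self

    def transform(self, data):
        keep = data.X.notna().all(axis=1)
        return Dataset(X=data.X[keep], y=data.y[keep])


data = Dataset(
    X=pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0, 5.0]}),
    y=pd.Series([10.0, 20.0, 30.0, 40.0, 50.0]),
)

pipe = Pipeline([("lags", LagTarget(lags=(1, 2))), ("drop", DropIncompleteRows())])
result = pipe.fit_transform(data)  # the container travels in the X slot
print(result.X)
print(result.y)
```

As far as the pipeline is concerned, the container is just "X", so each step receives the full object and is free to read and rewrite both parts of it.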
You can even store the output of your intermediate steps on the object itself, for example in a list, in case you need to debug the data and don’t want to scatter print statements and breakpoints all over your transformations.
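One way this could look is sketched below, under the assumption that the container has an extra list attribute. TracedDataset, its history field and the AddLag transformer are hypothetical names used only here.

```python
from dataclasses import dataclass, field

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


@dataclass
class TracedDataset:
    """Hypothetical container that also records intermediate outputs."""
    X: pd.DataFrame
    y: pd.Series
    history: list = field(default_factory=list)


class AddLag(BaseEstimator, TransformerMixin):
    """Adds one lagged copy of y to X and logs a snapshot of the result."""

    def __init__(self, lag=1):
        self.lag = lag

    def fit(self, data, y=None):
        return self

    def transform(self, data):
        X = data.X.copy()
        X[f"y_lag_{self.lag}"] = data.y.shift(self.lag)
        # Keep a labelled copy of this step's output for later inspection.
        snapshot = (f"AddLag(lag={self.lag})", X.copy())
        return TracedDataset(X=X, y=data.y, history=data.history + [snapshot])


data = TracedDataset(
    X=pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0]}),
    y=pd.Series([10.0, 20.0, 30.0, 40.0]),
)

pipe = Pipeline([("lag_1", AddLag(lag=1)), ("lag_2", AddLag(lag=2))])
traced = pipe.fit_transform(data)

# Inspect what each step produced, in order.
for step_name, snapshot in traced.history:
    print(step_name, list(snapshot.columns))
```

Copying a DataFrame at every step costs memory, so a trace like this is probably best kept for debugging runs.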
In this story we started with the basics of how to trick Scikit-Learn pipelines into transforming our data using both features and target variables. Now that we have the data transformation pipelines, part 2 of this story will demonstrate how we can pass this object to our models without having to create intermediate steps, allowing us to directly use