Box-Cox Transformation, Explained
Python and scikit-learn Transformer implementation
The Box-Cox transformation reshapes the data so that its distribution is as close to a normal distribution as possible, that is, so that the histogram looks like a bell.
This technique has its place in feature engineering because not all types of predictive models are robust to skewed data, so it is worth trying when experimenting. It probably won’t provide a spectacular improvement, but at the fine-tuning stage it can still nudge the evaluation metric in the right direction.
All code examples can also be found in this Colab notebook
Box-Cox Equation in code
The transformation itself has the following formula
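For reference, the standard one-parameter Box-Cox definition, for a positive value $x$ and a parameter $\lambda$, is usually written as:

$$
y(\lambda) =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln x, & \lambda = 0
\end{cases}
$$

The $\lambda = 0$ branch is the limit of the first branch as $\lambda \to 0$, which is why the logarithm appears as a special case.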
Let’s express it in code using the Python standard library
Or using the NumPy package
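A minimal sketch of the standard-library version, assuming a single positive input value. The function name `box_cox` and the sample values are chosen here for illustration and may differ from the original notebook:

```python
import math


def box_cox(x: float, lmbda: float) -> float:
    """Apply the Box-Cox transformation to one positive value."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for x > 0")
    if lmbda == 0:
        # Limit of (x**lmbda - 1) / lmbda as lmbda approaches 0
        return math.log(x)
    return (x ** lmbda - 1) / lmbda


# Illustrative values only
values = [1.0, 2.0, 4.0]
transformed = [box_cox(v, 0.5) for v in values]
```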
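A vectorized sketch with NumPy, under the same assumption that every input is strictly positive. The name `box_cox_np` is hypothetical:

```python
import numpy as np


def box_cox_np(x, lmbda):
    """Apply the Box-Cox transformation element-wise to an array of positive values."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox is only defined for x > 0")
    if lmbda == 0:
        # Logarithm is the limiting case for lambda = 0
        return np.log(x)
    return (np.power(x, lmbda) - 1.0) / lmbda


# Illustrative values only
result = box_cox_np([1.0, 4.0, 9.0], 0.5)
```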
I have the data, but how to select the lambda?
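The idea is not complicated: we run a normality test on the transformed data for several lambdas in the range (customarily) [-5, 5], then choose the lambda whose test result is the best. An out-of-the-box solution is provided by the SciPy package, whose boxcox function picks the lambda that maximizes the log-likelihood function instead of running an explicit normality test.
When the second argument (lmbda) is not passed to the boxcox function, the lambda is estimated from the data and returned together with the transformed values.
The only problem we encounter when using this implementation is the requirement that all input values be greater than zero. Subtracting the minimum of the dataset is not quite enough, because the smallest value would then be exactly zero; we have to shift the values so that the minimum ends up strictly positive, for example by subtracting the minimum and adding a small positive constant.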
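A minimal sketch of the SciPy call, using synthetic log-normal data generated only for illustration (the variable names are not taken from the original notebook):

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed data, used here only as a stand-in for a real feature
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Without lmbda, boxcox estimates lambda and returns (transformed data, lambda)
transformed, fitted_lambda = stats.boxcox(data)

# With lmbda given, boxcox returns only the transformed data
same = stats.boxcox(data, lmbda=fitted_lambda)
```

Because the sample is log-normal, the fitted lambda should land near zero, which corresponds to the plain logarithm.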
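For comparison, the manual search described earlier (try several lambdas and keep the one with the best normality score) could look roughly like this. The Shapiro-Wilk test, the 0.1 grid step, and the helper name `pick_lambda` are assumptions made for this sketch; any other normality test could be substituted.

```python
import numpy as np
from scipy import special, stats


def pick_lambda(x, lambdas=None):
    """Return the lambda whose Box-Cox output scores highest on Shapiro-Wilk."""
    x = np.asarray(x, dtype=float)
    if lambdas is None:
        lambdas = np.linspace(-5, 5, 101)  # customary search range
    best_lambda, best_score = None, -np.inf
    for lmbda in lambdas:
        y = special.boxcox(x, lmbda)       # transform with a fixed lambda
        score, _ = stats.shapiro(y)        # W statistic: closer to 1 is more normal
        if score > best_score:
            best_lambda, best_score = float(lmbda), score
    return best_lambda


# Synthetic right-skewed data, for illustration only
rng = np.random.default_rng(1)
sample = rng.lognormal(size=500)
chosen = pick_lambda(sample)
```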
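One way to handle non-positive values is sketched below. The helper `shift_to_positive` and its `eps` offset are hypothetical; the offset is a modelling choice and affects the fitted lambda.

```python
import numpy as np
from scipy import stats


def shift_to_positive(x, eps=1.0):
    """Shift x so that its minimum equals eps when the data contain values <= 0."""
    x = np.asarray(x, dtype=float)
    shift = eps - x.min() if x.min() <= 0 else 0.0
    return x + shift, shift


# Illustrative data containing a negative value and a zero
data = np.array([-3.0, 0.0, 1.5, 4.0, 10.0])
shifted, shift = shift_to_positive(data)

# All values are now strictly positive, so boxcox accepts them
transformed, fitted_lambda = stats.boxcox(shifted)
```

The same shift has to be stored and reapplied to any new data before transforming it with the fitted lambda.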
Example for population by state in 2007
The full version of the code can be found in this online notebook; here I will only comment on the results.
On the left, we see the distribution of our input data. A keen eye will notice that taking the logarithm (middle column) already brings our data close to the normal distribution, but the best effect is achieved by the Box-Cox transformation (right column).
Box-Cox as a Scikit-learn transformer
Let’s implement it as a ready-to-use scikit-learn transformer, so you can use it in a Pipeline or FeatureUnion. It also fits naturally into a train/test split: lambda is estimated in fit on the training data and then reused in transform on the test data. Remember, lambda has to be picked using the training dataset only.
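A minimal sketch of such a transformer follows. The class name `BoxCoxTransformer`, the `eps` parameter, and the per-column shift are assumptions of this sketch, not necessarily the original implementation. It expects a 2-D array and fits one lambda per column.

```python
import numpy as np
from scipy import special, stats
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    """Fit one Box-Cox lambda per column on training data and reuse it later."""

    def __init__(self, eps=1.0):
        self.eps = eps  # offset used when a column contains values <= 0

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        mins = X.min(axis=0)
        # Shift only the columns that contain non-positive values
        self.shifts_ = np.where(mins <= 0, self.eps - mins, 0.0)
        # stats.boxcox without lmbda returns (transformed, fitted lambda)
        self.lambdas_ = np.array(
            [stats.boxcox(X[:, j] + self.shifts_[j])[1] for j in range(X.shape[1])]
        )
        return self

    def transform(self, X):
        shifted = np.asarray(X, dtype=float) + self.shifts_
        if np.any(shifted <= 0):
            # New data fell below the range seen during fit
            raise ValueError("Shifted data must be strictly positive")
        # special.boxcox applies a fixed lambda without re-estimating it
        return np.column_stack(
            [special.boxcox(shifted[:, j], self.lambdas_[j]) for j in range(shifted.shape[1])]
        )


# Synthetic data for illustration: lambda is learned from the training part only
rng = np.random.default_rng(42)
X_train = rng.lognormal(size=(200, 2))
X_test = rng.lognormal(size=(50, 2))

pipeline = Pipeline([("boxcox", BoxCoxTransformer()), ("scale", StandardScaler())])
train_out = pipeline.fit_transform(X_train)
test_out = pipeline.transform(X_test)
```

For production use, scikit-learn also ships `PowerTransformer(method="box-cox")`, which fits per-column lambdas as well but, like SciPy, requires strictly positive input.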
Originally published at https://radekbialowas.com on February 7, 2022.