Using Sklearn’s PowerTransformer

Patrick L. Cavins
8 min read · May 13, 2019


An example of data before and after it has been transformed using a power transformer [1]

Transforming data is an essential part of the data scientist's tool kit. Since transformations change the data, we need to be careful about how and why we apply them. In this article, we are going to walk through using sklearn's PowerTransformer module, which contains both the Box-Cox transformation and the Yeo-Johnson transformation.

The most common statistical assumption data scientists make is that their data comes from a Gaussian distribution, and many of the most common machine learning models are built on that assumption. Because many ML tools require Gaussian-like data, the first check before implementing a model is to determine whether the data is Gaussian-like. Tests for normality fall into two broad categories: visual checks (histograms, Q-Q plots) and statistical methods (the Shapiro-Wilk test, D'Agostino's K² test).

Checking for Normality

For this code-along, I am going to be working with part of the Ames Housing dataset. If you have access, feel free to use your own copy; if not, I have provided a link here. We could use any number of the different features, but I have decided to use Garage Area. A quick EDA reveals that there is a single null value; for ease, I went ahead and replaced that null value with zero. There are also numerous zeros reported in the data; let's just assume that the zeros represent homes which do not have a garage.

# Imports used throughout this code-along
import numpy as np
import pandas as pd

# Load the data
df = pd.read_csv('./train.csv')

# Check for nulls / EDA
nulls = df['Garage Area'].isnull().sum()
print(f'The number of nulls: {nulls}')

# Replace the null with 0
df['Garage Area'] = df['Garage Area'].replace(np.nan, 0)
nulls = df['Garage Area'].isnull().sum()
print(f'The number of nulls: {nulls}')

# Drop zeros from the series (a zero means the home has no garage)
garage_area = df['Garage Area']
garage_area = garage_area[garage_area != 0]

So let's say that you are thinking about using the power transformer method. Not only do we want to test for normality before applying the transformation, but we might also want to gather some additional information about the current distribution of the data, like kurtosis and skew. The simplest way to start is to plot a histogram.

# Plotting imports
import matplotlib.pyplot as plt
import seaborn as sns

# Plot a histogram of the data
ax = sns.distplot(garage_area, kde=True)
ax.set(xlabel='Garage Area (ft²)', ylabel='Counts', title='Frequency of Garage Sizes (ft²)')
plt.show()
Histogram of Garage Area feature from Ames, IA dataset

As mentioned above, kurtosis and skew are two common terms you might run into when describing normality. In the histogram of the garage area variable, we can clearly see that the distribution is not normal. We can formally calculate the skew of the distribution using SciPy's stats module; in this case it is 0.80.

Table summarizing heuristics about skew:

  • |skew| < 0.5 → fairly symmetrical
  • 0.5 ≤ |skew| < 1.0 → moderately skewed
  • |skew| ≥ 1.0 → highly skewed

A good heuristic for evaluating this value is in the table above; by it, the garage area distribution is moderately skewed. A less common descriptive statistic you might run across is kurtosis. Like skew, kurtosis describes the shape of the distribution. One of the reasons we talk less about kurtosis is that the nomenclature surrounding it can be confusing. To simplify things, let's just talk about positive and negative excess kurtosis. A positive excess kurtosis (greater than 0) implies the distribution's shoulders are too narrow, and a negative excess kurtosis (less than 0) means the distribution's shoulders are too wide. In this case, we have a kurtosis equal to 0.81, and therefore the distribution exhibits positive excess kurtosis.
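As a quick sketch, both statistics can be computed with SciPy, assuming the garage_area series from above (note that scipy.stats.kurtosis returns excess kurtosis by default):

from scipy import stats

# Skew: 0 for a perfectly symmetric distribution; ~0.80 here
print(f'Skew: {stats.skew(garage_area):.2f}')

# Kurtosis: SciPy reports *excess* kurtosis (a normal distribution scores 0); ~0.81 here
print(f'Kurtosis: {stats.kurtosis(garage_area):.2f}')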

Now that we have visually inspected and described the current distribution of the garage area variable, we can use a statistical approach called D'Agostino's K² test. In the simplest terms, this statistical test combines the kurtosis and skew of the distribution and compares them to those of a normal distribution. In our case the p-value returned was equal to 3.07 x 10^-45, so we can reject the null hypothesis: the sample does not look Gaussian.
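A minimal sketch of that test, again assuming the garage_area series from above (scipy.stats.normaltest implements D'Agostino's K² test):

from scipy import stats

# Null hypothesis: the sample comes from a normal distribution
stat, p_value = stats.normaltest(garage_area)
print(f'K^2 statistic: {stat:.2f}, p-value: {p_value:.2e}')

alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis: the sample does not look Gaussian.')
else:
    print('Fail to reject the null hypothesis: the sample looks Gaussian.')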

When it comes to statistical tests for normality, both Shapiro-Wilk and D'Agostino's, I want to include an important caveat: with small samples, say fewer than 50 observations, normality tests have little power. In those cases, visual inspection and the skew and kurtosis of the data are very important. You can also consider using an outlier test like Grubbs' test, and then retest the skew and kurtosis.
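SciPy has no built-in Grubbs' test, so the sketch below hand-rolls the standard two-sided version (it assumes the underlying data is roughly normal, which is part of the test's own assumptions):

import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier (hand-rolled sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Test statistic: largest absolute deviation from the mean, in sample std units
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value derived from the t-distribution
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit, g > g_crit

g, g_crit, is_outlier = grubbs_test(garage_area)
print(f'G = {g:.2f}, critical value = {g_crit:.2f}, outlier detected: {is_outlier}')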

Using Sklearn’s Power Transformer Module

Now that we have assessed the normality of our data, let's move on to using the PowerTransformer module in sklearn. As the name implies, we are going to change (or transform) the data in our input variable, garage area. This might sound worrisome, but remember that we change data all the time, for example when converting between units like centimeters and meters, or between Celsius and Fahrenheit. The key difference is that those conversions do not change the distribution of the data, while a power transformation will. The sklearn preprocessing module supports two different power transformations (demonstrated in the sketch after this list):

  • Box-Cox transformation: can be used on strictly positive values only
  • Yeo-Johnson transformation: can be used on both positive and negative values
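A quick sketch of that difference using SciPy's one-dimensional equivalents (scipy.stats.boxcox raises a ValueError on non-positive input, while scipy.stats.yeojohnson accepts it); the toy array is purely illustrative:

import numpy as np
from scipy import stats

data_with_negatives = np.array([-3.0, -1.0, 0.0, 2.0, 5.0, 8.0])

# Yeo-Johnson handles zero and negative values
transformed, lmbda = stats.yeojohnson(data_with_negatives)
print(f'Yeo-Johnson lambda: {lmbda:.2f}')

# Box-Cox requires strictly positive input
try:
    stats.boxcox(data_with_negatives)
except ValueError as e:
    print(f'Box-Cox failed: {e}')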

Why Use the Power Transformers Module?

When making a linear regression model, we make some assumptions about the data we are using. These assumptions are summarized by the L.I.N.E. acronym (Linearity, Independence, Normality, Equality of variance). The two that a power transformation can help satisfy are:

  • N = Normality: the errors follow a normal distribution
  • E = Equality of variance: homoscedasticity

The Box-Cox Transformation

The Box-Cox transformation is defined as:

y(λ) = (y^λ − 1) / λ, for λ ≠ 0
y(λ) = ln(y), for λ = 0

Before we get started with the sklearn module, let's code through an example using the math directly. In this example, I chose two arbitrary values for lambda, 0.1 and 1.5, just to demonstrate the effect of the Box-Cox transformation.
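A minimal sketch of that manual calculation, assuming the garage_area series from earlier (the lambda values are the arbitrary ones named above):

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def box_cox(y, lmbda):
    """Apply the Box-Cox formula directly (y must be strictly positive)."""
    if lmbda == 0:
        return np.log(y)
    return (y ** lmbda - 1) / lmbda

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, lmbda in zip(axes, [0.1, 1.5]):
    sns.distplot(box_cox(garage_area, lmbda), kde=True, ax=ax)
    ax.set(title=f'Lambda = {lmbda}')
plt.show()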

Distribution of Garage Area with Lambda = 0.1
Distribution of Garage Area with Lambda = 1.5

As you can see in the above histograms, the value of lambda greatly affects the resultant distribution of the feature. Because the transformed distributions are now on different scales, it is difficult to compare them. A good strategy, and the default in sklearn's PowerTransformer module, is to standardize the data after transforming it.

Visualizing Changes

If you are interested in seeing how the lambda parameter affects the size of the transformation, I suggest normalizing the data first with a technique like the Z-score or min-max scaling. For this example, I went ahead and used the Z-score. Because the Z-score gives the data a mean of zero, it introduces negative values, and therefore we must switch from Box-Cox to Yeo-Johnson.

The plots below show the garage area feature normalized, transformed, and then plotted using seaborn's kdeplot.
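A sketch of that comparison, assuming the garage_area series from earlier (scipy.stats.yeojohnson applies a fixed lambda when one is passed; the lambda values match the figure below):

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Z-score normalization introduces negatives, so Yeo-Johnson is required
z_scored = (garage_area - garage_area.mean()) / garage_area.std()

fig, ax = plt.subplots(figsize=(8, 4))
for lmbda in [0.1, 0.58, 1.5]:
    sns.kdeplot(stats.yeojohnson(z_scored.values, lmbda=lmbda),
                label=f'Lambda = {lmbda}', ax=ax)
ax.set(xlabel='Normalized Garage Area', ylabel='Density')
plt.legend()
plt.show()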

How Lambda Affects the Distribution of the Normalized Garage Area Feature

If we visually inspect these plots, we can see that the largest lambda value (1.5) does not decrease the right shoulder, while the smaller lambda values (0.1 and 0.58) visibly do.

Sklearn Implementation

We have now effectively walked through SciPy's power transformation methods using a single feature, garage area. If we wanted to transform multiple features at once, though, the SciPy methods will not work, because they take in a single array. Herein lies the strength of the sklearn implementation: you can pass multiple features into the module. Going back to the Ames dataset, let's now look at garage area and lot area. Going through the same process, we need to clean up the data a little bit. To do this with multiple features, we can use a boolean mask, which removes each row where the garage area is zero along with the corresponding row of lot area at the same time.
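A minimal sketch of that cleanup, assuming the same train.csv and column names ('Garage Area', 'Lot Area') used above:

import pandas as pd

df = pd.read_csv('./train.csv')

# Keep both columns together so rows stay aligned
features = df[['Garage Area', 'Lot Area']].fillna(0)

# Boolean mask: drop rows where the garage area is zero (no garage)
mask = features['Garage Area'] != 0
features = features[mask]
print(features.shape)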

Fit & Transform Data

If you are familiar with other sklearn modules, then the workflow for PowerTransformer will make complete sense. The first step is to instantiate the model. When you instantiate the model, take note of a couple of defaults in the parameters:

  • method = 'yeo-johnson'
  • standardize = True (applies a zero-mean, unit-variance normalization to the output)

Next, the .fit method will calculate the optimal lambdas for the features you have included; those values can be accessed via the .lambdas_ attribute. Then, using the .transform method, we apply those calculated lambdas to the features and pass the transformed data into a new data frame. A good sanity check is to make sure the workflow has worked as you assumed; do this by rerunning stats.normaltest and comparing the outputs. The original p-value was equal to 3.07 x 10^-45, and after the transformation it is equal to 0.130. Therefore we fail to reject the null hypothesis.
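A sketch of that workflow, assuming the cleaned features data frame from the previous step:

import pandas as pd
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Instantiate with the defaults: method='yeo-johnson', standardize=True
pt = PowerTransformer()

# Fit learns one optimal lambda per feature
pt.fit(features)
print(f'Lambdas: {pt.lambdas_}')

# Transform applies the lambdas (and standardizes the output)
transformed = pd.DataFrame(pt.transform(features), columns=features.columns)

# Sanity check: rerun the normality test on each transformed feature
for col in transformed.columns:
    stat, p_value = stats.normaltest(transformed[col])
    print(f'{col}: p-value = {p_value:.3f}')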

Transforming (Yeo-Johnson) the features, Garage Area and Lot Area

Conclusions

As you can see from this code-along, SciPy and sklearn both provide methods for power transformations. One key benefit of the sklearn implementation is that you can pass multiple features into the transformer at once. I have some other notebooks in the GitHub repo that you might find useful.


References:

[1] http://www.kmdatascience.com/2017/07/box-cox-transformations-in-python.html
