Supervised Learning on Python — Predicting Customer Churn 2

Deal with the skewness & Create dummy variables

--

In the previous chapter, we talked about the process of data preparation and feature visualization. We are going to finish the rest of the EDA in this chapter.

Deal with the skewness

Before we start to deal with this issue, let’s talk about what skewness is and why we need to handle it.

In reality, lots of the raw data we get is skewed. If the data is skewed, many statistical models might not provide an appropriate answer. In skewed data, the tail region may act as an outlier for the statistical model and we know that outliers adversely affect the model’s performance. Hence, there is a need to remove the skewness from the data.
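Just for intuition, here is a tiny synthetic sketch (made-up data, not from our dataset) showing what .skew() reports: a symmetric distribution has a skew close to 0, while a long right tail pushes it well above 0.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(size=10000))
right_tailed = pd.Series(rng.exponential(size=10000))

print(round(symmetric.skew(), 3))     # close to 0
print(round(right_tailed.skew(), 3))  # clearly positive, around 2 for an exponential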

Let’s check the skewness of the numerical variables in our dataset by using .skew() and visualize them with sns.distplot:

# Plot the distributions of the three numerical features
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(30, 5))
sns.distplot(telco["tenure"], ax=ax1, color='#36688D')
sns.distplot(telco["MonthlyCharges"], ax=ax2, color='#36688D')
sns.distplot(telco["TotalCharges"], ax=ax3, color='#36688D')

# Check the skewness of each feature
tenure = round(telco["tenure"].skew(), 5)
print("the skew of tenure is", tenure)
MonthlyCharges = round(telco["MonthlyCharges"].skew(), 5)
print("the skew of MonthlyCharges is", MonthlyCharges)
TotalCharges = round(telco["TotalCharges"].skew(), 5)
print("the skew of TotalCharges is", TotalCharges)

According to the graphs and results shown above, we infer that TotalCharges has a long tail and its .skew() value is relatively far from 0. In conclusion, we are only going to transform TotalCharges. To transform this positively skewed (a.k.a. right-skewed) variable, we use a square root transformation:

telco["TotalCharges"] = np.sqrt(telco["TotalCharges"])

Ta-da! After taking the square root, the skew is 0.3077, a significant improvement from the pre-transformation value of 0.9632. The comparison of the distributions is given below.

Left: before transformation / Right: after square root transformation
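To double-check the numbers quoted above, here is a quick sketch that simply recomputes the skew on the transformed column:

# Re-check the skewness after the square root transformation
print("the skew of TotalCharges after transformation is", round(telco["TotalCharges"].skew(), 5))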

Scale the numerical features

Another thing to do before modeling is to scale the numerical features. As usual, let’s talk about WHY we need to scale the features first.

I like the picture above as it precisely explains why there is a need to scale the numerical features before we proceed further with machine learning. Imagine there are two numerical variables with different units, one representing weight in kg and the other representing salary in €: the algorithm might give more weight to the variable with the higher magnitude (e.g. 50,000 € > 5 kg) and produce misleading results.

Have a look at Sudharsan’s article; he explains really well why, when and how to scale the features.

A manual way to do this is to subtract the average value and divide by the standard deviation. But we will just use StandardScaler from scikit-learn to help us standardize the numerical variables. This gives us columns with a mean of 0 and a standard deviation of 1 (in general, most values will fall between -3 and 3).
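Just to illustrate the manual approach (a sketch only; below we use StandardScaler instead, and note that pandas’ .std() uses the sample standard deviation while StandardScaler uses the population one, so the results differ very slightly), assuming the numerical list of column names defined earlier:

# Manual z-score standardization: subtract the mean, divide by the standard deviation
manual_scaled = (telco[numerical] - telco[numerical].mean()) / telco[numerical].std()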

telco[numerical].info()

from sklearn.preprocessing import StandardScaler

# Create scaler
scaler = StandardScaler()
# Transform the numerical features (fit_transform returns a NumPy array,
# so we wrap it back into a DataFrame to keep the column names and index)
standardized = pd.DataFrame(scaler.fit_transform(telco[numerical]),
                            columns=numerical, index=telco.index)
# Drop the non-scaled numerical columns
telco = telco.drop(columns=numerical)
# Merge the non-numerical and the scaled numerical columns
telco1 = pd.merge(telco, standardized, left_index=True, right_index=True, how='left')

# Plot the standardized distributions
fig, (ax7, ax8, ax9) = plt.subplots(ncols=3, figsize=(30, 5))
sns.distplot(telco1["tenure"], ax=ax7, color='#36688D')
sns.distplot(telco1["MonthlyCharges"], ax=ax8, color='#36688D')
sns.distplot(telco1["TotalCharges"], ax=ax9, color='#36688D')
We can see that the values on the x-axis now range roughly between -3 and 3.
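As a quick sanity check (a small sketch), we can also confirm that the scaled columns now have a mean of roughly 0 and a standard deviation of roughly 1:

# Sanity check: the scaled columns should have mean ~0 and std ~1
print(telco1[numerical].describe().loc[["mean", "std"]].round(3))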

Create the dummy variables

We mentioned how to scale the numerical features; now it’s time for us to handle the categorical features. The reason why we need to convert categorical variables into numeric ones is a machine learning constraint: we are going to use the scikit-learn library later in the machine learning part to make the predictions, and this package does not transform categorical data to numeric automatically. Hence, we need to dummy encode all the categorical variables.

Luckily, pandas has a function which can split a categorical variable into several variables (depending on the number of levels present in the variable) with values of 0 or 1, which makes them a lot easier to quantify and compare.

Normally we can easily create the dummies by using the function pd.get_dummies as below.

P.S. The drop_first argument returns k-1 dummies out of k categorical levels by dropping the first level, in order to avoid multicollinearity (e.g. Gender would be split into Male and Female, which are perfectly correlated with each other; you only need one to explain the other).

# One-hot encode the categorical variables
telco1 = pd.get_dummies(data=telco1, columns=categorical, drop_first=True)
telco1.head()
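To make the drop_first behaviour concrete, here is a minimal toy sketch (the data is made up, not from the telco dataset):

# Toy example: one categorical column with two levels
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})
# Without drop_first we get two perfectly correlated dummy columns
print(pd.get_dummies(toy, columns=["Gender"]))
# With drop_first only Gender_Male is kept; Female is implied when it is 0
print(pd.get_dummies(toy, columns=["Gender"], drop_first=True))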

But in this dataset we need to run a few commands, shown below, before we use pd.get_dummies, because of all the feature engineering we did in the first chapter. We need to make sure that all the features we created earlier are included in the "categorical" list.

categorical = list(telco.select_dtypes('object').columns.drop('Churn'))
# we will exclude customerID, since we don't need to dummify it
categorical = categorical[1:]

So what is missing here? The variables that we binned from the numerical features in chapter 1. So we use the .append method to add these elements back into the list.

categorical.append('tenure_amend')
categorical.append('MonthlyCharges_amend')

Now all the categorical features that we want to dummify are there, and we can apply the pd.get_dummies function mentioned above to do the one-hot encoding.

telco1 = pd.get_dummies(data=telco1, columns=categorical, drop_first=True)

And that’s it! IMPRESSIVE! The EDA part is finally all done.

It’s been a looooong way, but finally 60% of the work is done. In this chapter, we learned three essential EDA processes: handling skewness, scaling, and dummification. For our next chapter, which is also the final one, I am sure that you are ready to go through the machine learning and business strategy with us!

STAY TUNED!
