Feature Engineering: Encode your categorical data with One-Hot Encoding.

Learn one-hot encoding and redundancy removal techniques to optimize churn-prediction models.

Asish Biswas
AnalyticSoul
6 min read · Jun 11, 2024


Welcome back! In the previous lessons, we introduced the concept of customer churn, explored different types of churn, and performed an initial analysis of our telecom dataset. Today, we’ll take a crucial step in our data preparation process: feature engineering. Specifically, we’ll focus on one hot encoding, a powerful technique to transform categorical data into a format suitable for machine learning algorithms. Afterwards, we will eliminate unnecessary features from our dataset to reduce potential biases.

One Hot Encoding

One hot encoding transforms categorical variables into a binary matrix, where each category is represented by a unique binary vector. This approach is particularly useful for algorithms that cannot work with categorical data directly, such as logistic regression.

Essentially, one-hot encoding creates a binary column for each category; the column of the active category is set to 1 and all the other columns are set to 0. Let’s take an example where we have a column “Color” with three categories (Blue, Green, and Red). After applying one-hot encoding, we get one binary column per color. Each column is activated (value=1) on the records where that particular color was present in the original column.

For categorical features with no ordinal relationship, plain integer encoding creates an issue: it implies an ordering between categories, and many algorithms will treat that ordering as meaningful. Because of this, model performance may suffer.

One Hot Encoding

If we observe the encoded output closely, we see that one column is redundant: if Color_Green and Color_Red are 0, we can safely assume that Blue is the active category. This redundant feature can introduce biases into the model. For this reason, pandas’ get_dummies() accepts a boolean parameter drop_first (scikit-learn’s OneHotEncoder offers a similar drop='first' option). If we pass True, the first encoded column of each feature is dropped automatically.
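As a quick illustration, here is a minimal sketch of this behavior using a small, hypothetical Color frame (not our telecom data) and pandas’ get_dummies(), with and without drop_first:

import pandas as pd

# hypothetical example data, not part of the telecom dataset
df_demo = pd.DataFrame({'Color': ['Blue', 'Green', 'Red', 'Green']})

# full one-hot encoding: one binary column per category
print(pd.get_dummies(df_demo, columns=['Color'], dtype=int))

# drop_first=True removes the first category column (Color_Blue);
# a row with Color_Green=0 and Color_Red=0 implies Blue
print(pd.get_dummies(df_demo, columns=['Color'], drop_first=True, dtype=int))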

Benefits of One Hot Encoding

One hot encoding provides several advantages:

  • Eliminates Ordinal Relationships: It ensures that no ordinal relationships are implied between categories, which is crucial for categorical data.
  • Improves Model Performance: Many machine learning algorithms perform better with numerical input, making one hot encoded data more suitable for modeling.

Identify numeric and categorical features

Let’s separate the categorical and numeric columns into lists. We’ll treat features with fewer than 5 unique values as categorical and the rest as numeric, since they behave more like continuous values.

First, we drop the CustomerID column as it doesn’t add any value to our model. Then we identify the categorical features: a feature is considered categorical when it has fewer than 5 unique values. Finally, we collect the remaining (non-categorical) features as the numerical list.

import pandas as pd

# load the dataset
df_telco = pd.read_csv('data/telco_customer_churn.csv', header=0)

# remove the 'CustomerID' column, it adds no predictive value
df_telco.drop(columns=['CustomerID'], inplace=True)

# identify categorical columns (fewer than 5 unique values), excluding the target
n_unique = df_telco.nunique()
categorical = n_unique[n_unique < 5].keys().tolist()
categorical.remove('Churn')
print('Categorical columns:\n', categorical)

# identify numerical column names, excluding the target
numerical = [x for x in df_telco.columns if x not in categorical]
numerical.remove('Churn')
print('\nNumeric columns:\n', numerical)
Categorical columns

Implementing One-Hot encoding

To implement one-hot encoding we use pandas’ get_dummies() function. We set drop_first=True to drop the first encoded column of each feature, since it is redundant with the other encoded columns.

df_telco_encoded = pd.get_dummies(
    data=df_telco,
    columns=categorical,
    drop_first=True,
    dtype=int
)

df_telco_encoded.head()

In this sample of encoded data, we can see that the Gender_Male column is present but Gender_Female is not. Similarly, SeniorCitizen_Yes and Partner_Yes are present while SeniorCitizen_No and Partner_No are not. That’s the effect of drop_first=True.

Remove redundant features

Now, we’ll take our feature engineering process a step further by identifying and removing redundant features from our telecom dataset. This step is crucial to ensure that our model remains efficient and free of unnecessary complexity.

Redundant features are those that provide little to no additional information to the predictive model. They can lead to overfitting, where the model performs well on training data but poorly on unseen data. Removing these features helps in:

  • Improving Model Performance: By simplifying the model, we can enhance its generalizability and reduce the risk of overfitting.
  • Reducing Computational Cost: Fewer features mean less computational resources are needed for training the model.
  • Enhancing Interpretability: A simpler model with fewer features is easier to understand and interpret.

Pairwise correlation

A pairwise correlation matrix is a great way to identify inter-feature dependencies. It is essentially a matrix holding the correlation coefficient for every possible pair of features. We learned about pairwise correlation back in chapter 3; here we revisit it.

We create a helper function wrap_axis_labels() to format the labels of the x-axis and y-axis of the plot. We also create a mask for the upper triangle of the correlation matrix to avoid redundant information. Then the heatmap is created with the correlation matrix, and the axis labels are formatted accordingly.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def wrap_axis_labels(ticklabel):
    # break long column names across lines for readable axis labels
    return [x.get_text().replace('_', '\n') for x in ticklabel]


# Compute the correlation matrix
corr_matrix = df_telco_encoded.corr().round(2)

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(16, 14))

# Plot the lower triangle of the correlation matrix as a heatmap
s = sns.heatmap(corr_matrix,
                annot=True,
                cmap='RdBu',
                vmin=-1,
                vmax=1,
                mask=mask)
s.set_xticklabels(wrap_axis_labels(s.get_xticklabels()), rotation=90)
s.set_yticklabels(wrap_axis_labels(s.get_yticklabels()), rotation=0)
plt.title('Correlation Matrix')
plt.show()
Correlation matrix

In our correlation matrix, we see that some column pairs are highly correlated. A correlation coefficient of 1 (or -1) is known as a perfect correlation: one feature is an exact linear function of the other, so it carries no additional information. Such redundant features may create biases in the model, and the way to handle this situation is simply to remove them.
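If you prefer to surface these pairs programmatically rather than reading them off the heatmap, a minimal sketch along the following lines works on the corr_matrix computed above (the 0.99 threshold is an assumption; adjust it as needed):

import numpy as np
import pandas as pd

# keep only the upper triangle (excluding the diagonal) so each pair appears once
upper = corr_matrix.abs().where(
    np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
)

# list column pairs whose absolute correlation is (near-)perfect
perfect_pairs = [
    (row, col, corr_matrix.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] >= 0.99
]
print(perfect_pairs)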

Removing redundant features

The column InternetService_No already carries the information in the other internet-related “No internet service” columns, and the phone-service feature likewise covers MultipleLines_No phone service. Therefore we can simply remove the redundant columns identified by the correlation matrix.

# remove perfectly correlated features (keep only one of each pair)
corr_cols = [
    'OnlineSecurity_No internet service',
    'OnlineBackup_No internet service',
    'DeviceProtection_No internet service',
    'TechSupport_No internet service',
    'StreamingTV_No internet service',
    'StreamingMovies_No internet service',
    'MultipleLines_No phone service'
]
df_telco_encoded.drop(columns=corr_cols, inplace=True)

Now it’s your turn. Feel free to explore and experiment using the Jupyter notebook.

In this lesson, we covered one of the most important feature engineering techniques, one-hot encoding, and applied it to the categorical data in our telecom dataset. We also refined our dataset by removing redundant features.

In the next lesson, we will focus on model building, where we will use our prepared dataset to construct and train a logistic regression model to predict customer churn. Stay tuned!

What’s next?

Join the community

Join our vibrant learning community on Discord! You’ll find a supportive space to ask questions, share insights, and collaborate with fellow learners. Dive in, collaborate, and let’s grow together! We can’t wait to see you there!

Thanks for reading! If you like this tutorial series make sure to clap (up to 50!) and let’s connect on LinkedIn and follow me on Medium to stay updated with my new articles.

Support me at no extra cost by joining Medium via this referral link.
