The Ultimate Guide to Encoding Numerical Features in Machine Learning

Paresh Patil
5 min read · Aug 18, 2023


Table of Contents:

· Discretization
· Why Discretization?
· Types of Discretization
· Binarization

Encoding numerical features refers to the process of representing numerical data in a format suitable for machine learning algorithms.

Unlike categorical features, which are represented as discrete values or text labels, numerical features are continuous and have a range of possible values.

However, there are scenarios where numerical features can be encoded or transformed to enhance the performance of machine learning models.

Suppose your data has a column called "Age". How can you convert it into categorical data?

For example:

A few days ago, I was working on a small ML problem with a Google Play Store dataset. It had a column called "Number of Installs" with values like 23, 9,934, 56,534, and 1,010,244.

This raw numerical representation will not help you much. If you plot the distribution, it is heavily skewed: most apps have very few downloads, while a handful have a very high number. So what I did was convert the numerical data into categories by creating bins.

After that, my whole problem became easier, and the results also started improving.
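As a sketch of the binning idea (with made-up install counts like the ones above, and hypothetical range labels), pandas' `pd.cut` can map raw counts to a handful of labelled ranges:

```python
import pandas as pd

# Hypothetical install counts like the ones described above
installs = pd.Series([23, 9_934, 56_534, 1_010_244])

# Map raw counts to labelled ranges with hand-picked bin edges
labels = ["<1K", "1K-10K", "10K-100K", "100K+"]
categories = pd.cut(
    installs,
    bins=[0, 1_000, 10_000, 100_000, float("inf")],
    labels=labels,
)
print(categories.tolist())  # ['<1K', '1K-10K', '10K-100K', '100K+']
```

Each raw count now falls into one of four categories, which is much easier for a model to work with when the raw values span several orders of magnitude.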

Whether you ever have to convert numerical data into categorical data is very problem-specific.

Two well-known techniques for converting numerical data to categorical data:

  1. Discretization (also called binning)
  2. Binarization

Discretization

Discretization, also known as binning, is the process of transforming continuous variables into discrete variables by creating a set of contiguous intervals or bins that cover the range of values in the variable.

Each interval represents a category or label, and the continuous values are then assigned to their corresponding bins.

Here, "bin" is simply an alternative name for "interval".

Why Discretization?

  1. To handle outliers: by grouping extreme values into the lowest and highest bins, discretization limits the influence of outliers.
  2. To improve the value spread: equal-frequency binning, in particular, redistributes densely packed values more evenly across bins.

Types of Discretization

Unsupervised Binning

Unsupervised binning splits the range of values into a specified number of bins without taking the target variable into account. Depending on the technique, the objective is bins of either equal width (covering intervals of the same size) or equal depth (containing the same number of data points).

Techniques of unsupervised binning:

  1. Equal Width Binning (Uniform Binning): With this method, the value range is divided into a predetermined number of bins that are each the same width. The distribution of data inside each bin is not considered. If the data is not spread uniformly across the range, this method may produce an unequal distribution of data points within bins.

Benefit: handles outliers while leaving the spread of the data unchanged.
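A minimal sketch of equal-width binning with pandas' `pd.cut` (hypothetical data, where one large value skews the range):

```python
import pandas as pd

# Hypothetical data with one large value
values = pd.Series([1, 3, 5, 7, 20, 95])

# Four bins of identical width spanning min..max; width = (95 - 1) / 4 = 23.5
binned, edges = pd.cut(values, bins=4, retbins=True, labels=False)
print(edges)            # four equal-width intervals spanning the range
print(binned.tolist())  # [0, 0, 0, 0, 0, 3] - most points crowd into the first bin
```

Note how the uneven distribution leaves the middle bins empty: equal width says nothing about equal occupancy.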

2. Equal Frequency Binning (Quantile Binning): Equal-frequency binning separates the range of values into bins that each contain the same number of data points. An equitable distribution of data points across bins is the goal of this approach.

Benefit: handles outliers and makes the value spread uniform.
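Equal-frequency binning can be sketched with pandas' `pd.qcut` (hypothetical data):

```python
import pandas as pd

# Hypothetical data with one large value
values = pd.Series([1, 3, 5, 7, 20, 95])

# Three quantile bins: each bin receives the same number of points (2 here)
binned = pd.qcut(values, q=3, labels=False)
print(binned.tolist())  # [0, 0, 1, 1, 2, 2]
```

Compared with equal-width binning on the same data, every bin is now equally populated, and the outlier 95 simply lands in the top bin.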

3. K-Means Binning: K-means binning is a method for discretizing continuous variables using the k-means clustering algorithm. Similar data points are grouped into k clusters, where k is the required number of bins, and the bin boundaries are derived from the cluster centers.
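K-means binning is available in scikit-learn via `KBinsDiscretizer` with `strategy='kmeans'`; a small sketch with made-up values that form three natural clusters:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical 1-D feature with three natural clusters
X = np.array([[1], [2], [3], [50], [51], [99], [100]], dtype=float)

# strategy='kmeans' places bin edges between 1-D k-means cluster centers
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
X_binned = disc.fit_transform(X)
print(X_binned.ravel())  # clusters roughly {1,2,3}, {50,51}, {99,100}
```

Unlike equal-width or equal-frequency binning, the edges here adapt to where the data naturally clumps.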

Supervised Binning

Supervised binning, sometimes referred to as optimal binning, takes into account the relationship between the variable you wish to discretize and the target variable.

Decision tree binning is a typical method for supervised discretization.

It is a valuable technique for feature discretization in classification problems, since it builds bins that optimize the separation between the target classes.
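One way to sketch decision tree binning (with a hypothetical feature and target): fit a shallow `DecisionTreeClassifier` on the single feature and reuse its learned split thresholds as bin edges:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 1-D feature and binary target
X = np.array([[5], [10], [15], [40], [45], [80], [85]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 0, 0])

# A shallow tree learns the split thresholds that best separate the classes;
# those thresholds then serve as supervised bin edges
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
edges = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaf nodes
bins = np.digitize(X.ravel(), edges)
print(edges)  # tree-derived bin edges
print(bins)   # bin index for each sample
```

Because the tree splits where class purity improves most, the resulting bins line up with the class boundaries in a way no unsupervised method can guarantee.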

Custom Binning

Custom binning involves manually defining the bin boundaries based on domain knowledge or specific requirements.
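A minimal sketch of custom binning with pandas' `pd.cut`, using hypothetical age-group boundaries as the domain knowledge:

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([4, 15, 23, 37, 59, 71])

# Hand-picked edges reflecting domain knowledge about life stages
bins = [0, 12, 19, 35, 60, 120]
labels = ["child", "teen", "young adult", "adult", "senior"]
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())
# ['child', 'teen', 'young adult', 'adult', 'adult', 'senior']
```

The edges and labels here are entirely up to you; the point of custom binning is that they encode knowledge the data alone cannot supply.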

How to implement?

In the Scikit-Learn library, there is a class called KBinsDiscretizer.

# Import the necessary modules
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Sample 2-D input (scikit-learn expects shape (n_samples, n_features))
X = np.array([[1.0], [7.0], [15.0], [42.0], [100.0]])

# Create an instance of the KBinsDiscretizer class, specifying the desired parameters
discretizer = KBinsDiscretizer(n_bins=5, strategy='uniform', encode='ordinal')

# Fit the discretizer to your data
discretizer.fit(X)

# Transform the data into the discrete bin representation
X_discretized = discretizer.transform(X)


Binarization

In the process of "binarization", continuous variables are transformed into binary values (0 or 1) based on a predetermined threshold. Using this method, you can identify whether a data point is above or below the threshold.

Take an annual income column, for instance, with the threshold set at 600,000. Any annual income below 600,000 is assigned the value 0, signifying that the person is not in the taxable zone, while any income greater than or equal to 600,000 is assigned the value 1, signifying that the person is in the taxable zone.

How to implement?

In Scikit-learn, you can use the Binarizer class to perform binarization.

# Import the necessary modules
import numpy as np
from sklearn.preprocessing import Binarizer

# Sample input (scikit-learn expects shape (n_samples, n_features))
X = np.array([[2.0], [6.0], [9.5]])

# Create an instance of the Binarizer class with the desired threshold;
# copy=True keeps the input array unchanged and returns a new one
binarizer = Binarizer(threshold=6.0, copy=True)

# Transform the data: values above 6.0 become 1, the rest 0
X_binarized = binarizer.transform(X)

# You can further manipulate the X_binarized array as needed for your analysis.

