10. Data Transformation in Machine Learning Part 2: Binning (Discretization) and Binarization

Vinod Kumar G R
3 min readJan 20, 2024


You have seen different data transformation techniques in the last article, such as the Log, Reciprocal, Square, and Square Root transformers.

We'll continue our discussion with the two remaining important techniques.

Data Transformation Techniques:

  1. Binning / Discretization
  2. Binarization

Binning/discretization and binarization are techniques used to transform continuous numerical data (height, weight, mass, temperature, energy, speed, length, etc.) into discrete or binary representations.

1. Binning / Discretization

Binning is the process of grouping a set of continuous or numerical data points into a smaller number of discrete “bins” for analysis.

What are Bins? Bins are intervals or ranges into which you divide the range of your continuous numerical data.

Why do we create Bins?

  • Simplification: Binning simplifies the data by converting a range of values into a smaller set of discrete categories, making it easier to understand and interpret.
  • Handling Non-Linearity: Some machine learning algorithms may assume linear relationships, and binning can help capture non-linear patterns.
  • Dealing with Outliers: Binning can also be useful for handling outliers by placing extreme values into specific bins.

Here is an example: consider the AGE feature. Instead of using individual ages, you might create bins like “0–10,” “11–20,” and so on, grouping ages into categories or bins.
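As a quick sketch of this idea, pandas provides a `cut` function for exactly this kind of manual binning. The ages and bin edges below are made-up illustrative values, not from the article's dataset:

```python
import pandas as pd

# Hypothetical ages to bin into decade groups
ages = pd.Series([4, 15, 23, 37, 8, 52, 19])

# Bin edges and labels; pd.cut uses (left, right] intervals by default
bins = [0, 10, 20, 30, 40, 50, 60]
labels = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60"]

age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())
# ['0-10', '11-20', '21-30', '31-40', '0-10', '51-60', '11-20']
```

Each individual age is replaced by the label of the interval it falls into, which is the categorical output binning produces.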

When using the KBinsDiscretizer class in scikit-learn for binning, you will encounter a parameter named 'strategy'. This parameter controls how the bin widths are defined during discretization, and the available strategies are 'uniform', 'quantile', and 'kmeans'. By experimenting with these strategies in the provided Colab notebook, you can plot the transformed data and see the impact of each strategy on the resulting bins.
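A minimal sketch of the three strategies on a toy array (the data here is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed data: a few small values and one large outlier
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [50.0]])

for strategy in ["uniform", "quantile", "kmeans"]:
    # 'ordinal' encoding returns the bin index for each value
    kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    X_binned = kbd.fit_transform(X)
    print(strategy, X_binned.ravel())
```

With 'uniform', the bins have equal width, so the outlier at 50 pulls most values into the first bin; 'quantile' puts roughly the same number of points in each bin; 'kmeans' places bin edges based on 1-D k-means cluster centers.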

2. Binarization:

Binarization is the process of converting numerical data into binary form, typically 0s and 1s. It involves setting a threshold value, and any data point above the threshold is marked as 1, while those below or equal to the threshold are marked as 0.

Let's take temperatures as an example: anything above a certain temperature is considered “hot” (1), and anything at or below it is considered “not hot” (0).
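This temperature example can be sketched with scikit-learn's Binarizer, which marks values strictly above the threshold as 1 and the rest as 0. The temperatures and the 30-degree threshold below are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical temperatures in degrees Celsius
temps = np.array([[18.0], [25.0], [31.0], [22.0], [35.0]])

# Values strictly above 30 become 1 ("hot"), the rest 0 ("not hot")
binarizer = Binarizer(threshold=30.0)
hot = binarizer.fit_transform(temps)
print(hot.ravel())  # [0. 0. 1. 0. 1.]
```

The single `threshold` parameter is all Binarizer needs, which reflects how much simpler binarization is compared to binning.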

You can see the practical examples in the Colab notebook below.

Key Differences between the two:

  1. Nature:
  • Binning transforms a continuous variable into discrete categories or bins.
  • Binarization transforms numerical values into binary values (0 or 1) based on a threshold.

  2. Output:
  • Binning results in categorical features representing different bins.
  • Binarization results in binary features (0 or 1) based on a specified threshold.

  3. Method:
  • Binning involves creating predefined intervals and assigning values to those intervals.
  • Binarization involves setting a threshold and transforming values based on whether they are above or below it.

  4. Flexibility:
  • Binning allows for more flexibility in defining intervals and capturing patterns in data.
  • Binarization is more rigid, simply categorizing values as either 0 or 1 based on a threshold.

In conclusion, our exploration of data binning and binarization in machine learning underscores the versatility and significance of tailoring our data to align with the demands of diverse models. From the structured discretization introduced by binning to the simplicity of binary representation through binarization, each technique serves a crucial role in reshaping our datasets.

Previous article: 9. Data Transformation in ML

Next article: 11. …………….
