Feature Engineering in Machine Learning (Part 1)

Handling Numeric Data with Binning

Sogo Ogundowole
Hacktive Devs
3 min read · Mar 3, 2019


Intro

In machine learning, data comes in many different forms. To get proper insight from it and to train an efficient model, the data usually has to go through some tweaks first so we can get the best out of it.

Feature engineering sits right between “data” and “modeling” in the machine learning pipeline for making sense of data. Most data scientists and machine learning engineers agree that data cleaning and feature engineering take up most of the time in a machine learning workflow.

Though this process is very important, only a few people emphasize it. There are a couple of deep principles involved in it, and they will be explained in this series.

Binning

Binning is the process of converting numeric data into categorical data. It is one of the methods used in feature engineering. Binning comes in very handy for numeric features, especially ones with a wide range. A number of ways binning can be done include:

  • Fixed-Width Binning
  • Quantile Binning
  • Binning by Instinct

Fixed-Width Binning

In fixed-width binning, each bin covers a specific numeric range. For example, we can group a person’s age into decades: 0–9 years old falls into one bin, 10–19 years into the next, and so on. When the numbers span multiple orders of magnitude, it may be better to group by powers of 10 (or powers of any constant): 0–9, 10–99, 100–999, 1000–9999.
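
A minimal sketch of both ideas with pandas and NumPy (the ages and counts below are made-up values, purely for illustration):

    import numpy as np
    import pandas as pd

    # Made-up values, purely for illustration
    ages = pd.Series([2, 15, 23, 37, 45, 61, 78, 92])
    counts = pd.Series([3, 42, 875, 1200, 9, 60000])

    # Fixed-width bins of 10 years: 0-9 -> bin 0, 10-19 -> bin 1, ...
    age_bin = ages // 10

    # For values spanning several orders of magnitude, bin by powers of 10:
    # 0-9 -> bin 0, 10-99 -> bin 1, 100-999 -> bin 2, ...
    count_bin = np.floor(np.log10(counts)).astype(int)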

Quantile Binning

Fixed-width binning will not be effective if there are large gaps in the range of the numerical feature: many bins will end up empty, with no data in them. This problem can be solved by positioning the bins based on the distribution of the data.

This can be done using the quantiles of the distribution. Quantiles are values that divide the data into equal portions. For example, the median divides the data in half: half the data points are smaller than the median, and half are larger. The quartiles divide the data into quarters, the deciles into tenths, and so on.

The example below demonstrates how to compute the deciles of a numerical feature:
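
A minimal sketch with pandas, using a made-up feature to show the idea:

    import pandas as pd

    # Hypothetical numeric feature with a wide range, purely for illustration
    data = pd.Series([8, 12, 15, 22, 30, 45, 80, 150, 400, 900, 2500])

    # Deciles: the 10th, 20th, ..., 90th percentiles of the distribution
    deciles = data.quantile([i / 10 for i in range(1, 10)])
    print(deciles)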

To compute the quantiles and map data into quantile bins, we can use the Pandas library. pandas.DataFrame.quantile and pandas.Series.quantile compute the quantiles. pandas.qcut maps data into a desired number of quantiles.
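
For example, pandas.qcut can assign each value to one of a desired number of quantile bins (the same hypothetical feature is reused here):

    import pandas as pd

    # Same hypothetical feature as above
    data = pd.Series([8, 12, 15, 22, 30, 45, 80, 150, 400, 900, 2500])

    # Map every value into one of 4 quantile bins (quartiles);
    # labels=False returns the bin index instead of the interval itself
    quartile_bin = pd.qcut(data, q=4, labels=False)
    print(pd.DataFrame({"value": data, "bin": quartile_bin}))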

Binning by Instinct

This involves manually binning the data based on your own personal insight into it, setting the ranges you would like to bin the data into. This should not be a constant practice, but it is often useful when the data points span irregular or uneven ranges, as in the sketch below.
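
A minimal sketch of what this might look like, assuming an age column with hand-picked edges and labels (the values, ranges, and labels here are assumptions made for illustration):

    import pandas as pd

    # Made-up ages, purely for illustration
    df = pd.DataFrame({"age": [3, 17, 25, 34, 52, 68, 81]})

    # Hand-picked edges and labels, chosen from our own understanding of the data
    bin_edges = [0, 12, 19, 35, 60, 100]
    bin_labels = ["child", "teen", "young_adult", "adult", "senior"]

    # column_bin holds the bin each row falls into
    df["column_bin"] = pd.cut(df["age"], bins=bin_edges, labels=bin_labels)
    print(df)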

column_bin stands for the column that will hold the bins.

The bin edges can be adjusted to suit whatever range your data falls into.

Conclusion

Binning saves us a lot of stress, from getting insights out of our data to preparing it for modeling. Binning may sound simple, but that doesn’t reduce its strong effect in feature engineering.

Thanks for reading 🤗
