Data Discretization

Devanshi Patel
Published in CodeX
4 min read · Jun 7, 2022

What is it?

According to Wikipedia, discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numerical evaluation and implementation on digital computers.

Data discretization is defined as the process of converting continuous data attribute values into a finite set of intervals, with minimal loss of information, and associating each interval with a specific data value or conceptual label.


Why is it needed?

  1. Improves the quality of discovered knowledge.
  2. Easy maintainability of the data.
  3. Many data mining (DM) algorithms can only handle discrete attributes and therefore require discretized data.
  4. Reduces the running time of various data mining tasks such as association rule discovery, classification, and prediction.
  5. Prepares data for further analysis, e.g., classification.
  6. Discretization is considered a data reduction mechanism because it diminishes data from a large domain of numeric values to a subset of categorical values.

Steps of Discretization

Step 1: Sort the continuous values of the feature to be discretized.

Step 2: Evaluate a cut point for splitting, or adjacent intervals for merging.

Step 3: Split or merge intervals of continuous values according to some defined criterion.

Step 4: Stop when a predefined condition is met.
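As a rough illustration (not from the original article), the four steps can be sketched in Python, using equal-width cut points as the splitting criterion; the function name and sample values are my own:

```python
import numpy as np

def discretize(values, n_bins=4):
    """Illustrative equal-width discretization following the four steps."""
    # Step 1: sort the continuous values of the feature.
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    # Step 2: evaluate cut points (here: equal-width boundaries).
    lo, hi = sorted_vals[0], sorted_vals[-1]
    width = (hi - lo) / n_bins
    cuts = [lo + i * width for i in range(1, n_bins)]
    # Step 3: split the values into intervals at the cut points.
    labels = np.digitize(values, cuts)
    # Step 4: stop once the requested number of bins is reached.
    return labels, cuts

labels, cuts = discretize([2, 5, 7, 9, 13, 15, 18, 20], n_bins=3)
# cuts == [8.0, 14.0]; labels assign each value to bin 0, 1, or 2
```

Other methods differ mainly in how Step 2 chooses cut points (frequency, clustering, class statistics), not in the overall loop.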

Typical Methods of Data Discretization

  • Binning
  • Histogram analysis
  • Cluster analysis
  • Decision tree analysis
  • Correlation analysis

Binning

  • Binning is a top-down splitting technique based on a specified number of bins.
  • Binning is an unsupervised discretization technique.
  • The main challenge in this discretization is to choose the number of intervals or bins and how to decide on their width.
  • Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.
  • Attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each bin's values with the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively.
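A minimal sketch of equal-frequency binning followed by smoothing by bin means, assuming pandas is available; the sample values are illustrative:

```python
import pandas as pd

# Sorted attribute values (e.g., prices); binning assumes sorted input.
data = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency binning: three bins of three values each.
bins = pd.qcut(data, q=3, labels=False)

# Smoothing by bin means: replace each value with its bin's mean.
smoothed = data.groupby(bins).transform("mean")
# smoothed: [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Replacing `"mean"` with `"median"` gives smoothing by bin medians; `pd.cut` instead of `pd.qcut` gives equal-width bins.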

Histogram Analysis

  • Histogram analysis is an unsupervised discretization technique because it does not use class information.
  • A histogram partitions the values of an attribute, A, into disjoint ranges, called buckets or bins.
  • If each bucket represents only a single attribute–value/frequency pair, the buckets are called singleton buckets. Singleton buckets are useful for storing high-frequency outliers.
  • Histograms are effective at approximating sparse data, dense data, as well as highly skewed and uniform data.
  • The histograms described before for single attributes can be extended for multiple attributes. Multidimensional histograms can capture dependencies between attributes. These histograms have been found effective in approximating data with up to five attributes.
  • There are two types of histograms: Equal-width(or distance) and Equal-frequency(or equal-depth).
  • In an equal-width histogram, the width of each bucket range is uniform: the range is divided into N intervals of equal size. If A and B are the lowest and highest values of the attribute, the interval width is W = (B − A)/N and the interval boundaries are A + W, A + 2W, …, A + (N − 1)W.
  • In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly the same number of contiguous data samples). The range is divided into N intervals, each containing approximately the same number of samples. It works well for data scaling, but managing categorical attributes can be tricky.
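The two bucket-boundary rules can be sketched with NumPy; the data values here are illustrative, not from the article:

```python
import numpy as np

# Illustrative attribute values (sorted).
data = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
N = 3

# Equal-width: N intervals of size W = (B - A) / N.
A, B = data.min(), data.max()
edges_width = A + (B - A) / N * np.arange(N + 1)  # [5, 75, 145, 215]

# Equal-frequency (equal-depth): boundaries at the data quantiles,
# so each bucket holds roughly len(data) / N samples.
edges_freq = np.quantile(data, np.linspace(0, 1, N + 1))
```

Note how the skewed tail (204, 215) leaves the middle equal-width bucket nearly empty, while the equal-frequency boundaries adapt so every bucket keeps four samples.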

Cluster Analysis

  • Cluster analysis is a popular data discretization method.
  • A clustering algorithm can be applied to discretize a numeric attribute, A, by partitioning the values of A into clusters or groups based on similarity, storing only the cluster representation (e.g., centroid and diameter).
  • It partitions the data set into clusters.
  • There are many choices of clustering definitions and clustering algorithms, e.g., the K-Means and K-Medoids algorithms.
  • Properties of clusters: (i) All the data points in a cluster should be similar to each other. (ii) The data points from different clusters should be as different as possible.
  • Evaluation metrics for clustering include inertia and the Dunn index.
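A minimal 1-D k-means sketch of this idea in plain NumPy (rather than a library call); the function, seed, and sample values are my own for illustration:

```python
import numpy as np

def kmeans_discretize(values, k=2, iters=20, seed=0):
    """Discretize a numeric attribute by k-means: each value is replaced
    by its cluster label, and only the centroids (the cluster
    representation) need to be stored."""
    x = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = rng.choice(x, size=k, replace=False)
    for _ in range(iters):
        # Assign each value to its nearest centroid (similarity step).
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Recompute each centroid as the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean()
    return labels, centroids

labels, centroids = kmeans_discretize([1, 2, 3, 50, 51, 52], k=2)
# The two clusters {1, 2, 3} and {50, 51, 52} get centroids 2 and 51.
```

In practice a library implementation (e.g., scikit-learn's KMeans) would be used; the point is that the discretized attribute is just the cluster label, with the centroids kept as the interval representatives.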

Correlation Analysis

  • It is a supervised discretization method, i.e., it uses class information.
  • It uses bottom-up merging, i.e., it finds the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge.
  • It is also known as the ChiMerge algorithm.
  • It proceeds recursively, finding the best neighboring intervals (those with similar class distributions) and merging them, until a predefined stopping condition is met.
  • Steps: (i) Each distinct value of a numeric attribute A is initially considered to be its own interval, and χ² tests are performed on every pair of adjacent intervals. (ii) The adjacent intervals with the lowest χ² value are merged, because a low χ² value for a pair indicates similar class distributions. (iii) This merging proceeds recursively until a predefined stopping criterion is met.
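The steps above can be sketched in a simplified form; the real ChiMerge algorithm stops at a χ² significance threshold, whereas this illustrative version (my own simplification) stops at a maximum interval count:

```python
import numpy as np

def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    obs = np.array([counts_a, counts_b], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row * col / obs.sum()
    exp[exp == 0] = 1e-9  # avoid division by zero for empty classes
    return ((obs - exp) ** 2 / exp).sum()

def chimerge(values, classes, max_intervals=3):
    """Simplified ChiMerge: one interval per distinct value, then repeatedly
    merge the adjacent pair with the lowest chi-square statistic."""
    labels = sorted(set(classes))
    # Step (i): each distinct value is its own interval, with class counts.
    intervals = []
    for v in sorted(set(values)):
        counts = [sum(1 for x, c in zip(values, classes) if x == v and c == lab)
                  for lab in labels]
        intervals.append(([v, v], counts))
    # Steps (ii)-(iii): merge most-similar neighbors until the stop condition.
    while len(intervals) > max_intervals:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))  # lowest chi-square = most similar classes
        (lo_a, _), ca = intervals[i]
        (_, hi_b), cb = intervals[i + 1]
        intervals[i:i + 2] = [([lo_a, hi_b], [a + b for a, b in zip(ca, cb)])]
    return [tuple(rng) for rng, _ in intervals]

ivs = chimerge([1, 2, 3, 10, 11, 12, 20, 21],
               ["a", "a", "a", "b", "b", "b", "a", "a"], max_intervals=3)
# ivs == [(1, 3), (10, 12), (20, 21)]
```

Note how the class labels drive the result: values 20-21 are kept separate from 10-12 because their class distribution differs, which no unsupervised method above could detect.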

Thank you for reading!
