Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) Algorithm in Machine Learning

Aman Gupta · Published in Geek Culture · Jun 2, 2021

Introduction

The exploratory nature of data analysis and data mining makes clustering one of the most common tasks in such projects. These projects frequently come from application areas like biology, text analysis, and signal analysis, and they involve ever larger datasets, both in the number of examples and in the number of attributes. The biggest challenge with clustering in real-life scenarios is the volume of the data and the consequent increase in complexity and in the computational power required.
These problems have opened an area of research into algorithms able to cope with this data overload. Some solutions come from data preprocessing, either by transforming the data to a lower-dimensional manifold that represents its structure, or by summarizing the dataset with a smaller subset of examples that carries equivalent information. A different perspective is to modify the classical clustering algorithms, or to derive new ones, so that they can cluster larger datasets. This perspective relies on many different strategies: techniques such as sampling, on-line processing, summarization, data distribution, and efficient data structures have been applied to the problem of scaling clustering algorithms.
In this article we will discuss one such algorithm: BIRCH, which stands for Balanced Iterative Reducing and Clustering using Hierarchies, developed in 1996 by Tian Zhang, Raghu Ramakrishnan, and Miron Livny.

Explanation

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can handle large datasets by first generating a small, compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of the full dataset. BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other algorithm can then work on. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present.

The BIRCH clustering algorithm consists of two stages:

  1. Building the CF Tree: BIRCH summarizes the large dataset into smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS), where 'N' is the number of data points in the cluster, 'LS' is the linear sum of the data points and 'SS' is the squared sum of the data points in the cluster. It is possible for a CF entry to be composed of other CF entries. Optionally, we can condense this initial CF tree into a smaller CF tree.
  2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree. A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of CF entries in the child nodes. Optionally, we can refine these clusters.

Because of this two-step process, BIRCH is also called two-step clustering. Let us now dive deeper into these steps to get a holistic understanding.

Cluster Features

BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to represent a larger set of data points. For clustering purposes, these summary statistics constitute a CF, and represent a sufficient substitute for the actual data.
A CF is a set of three summary statistics that represent a set of data points in a single cluster.
These statistics are as follows:

Count: The number of data values in the cluster.

Linear Sum: The sum of the individual coordinates. This is a measure of the location of the cluster.

Squared Sum: The sum of the squared coordinates. This is a measure of the spread of the cluster.

Together with the count, the linear sum and the squared sum are sufficient to recover the mean and variance of the points in the cluster.
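
To make this concrete, here is a minimal sketch in Python (using NumPy) that builds a CF triple from a handful of 2-D points, merges two CFs, and recovers the centroid and spread from the summary alone; the helper names cf_of and merge are purely illustrative, not part of any library.

```python
import numpy as np

# Minimal, illustrative CF triple (N, LS, SS); not a full BIRCH implementation.
def cf_of(points):
    """Summarize a set of points as (N, LS, SS)."""
    points = np.asarray(points, dtype=float)
    n = len(points)                # N: count of points
    ls = points.sum(axis=0)        # LS: per-coordinate linear sum
    ss = (points ** 2).sum()       # SS: sum of squared coordinates
    return n, ls, ss

def merge(cf_a, cf_b):
    """CF triples are additive: merging two sub-clusters just adds the triples."""
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

n, ls, ss = cf_of([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = ls / n                                  # mean recovered from N and LS
spread = ss / n - np.dot(centroid, centroid)       # mean squared distance to the centroid
print(n, centroid, spread)                         # 3 [3. 4.] 5.333...
```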

CF Tree

The building process of the CF Tree can be summarized as:

  1. For each given record, BIRCH compares the location of that record with the location of each CF in the root node, using either the linear sum or the mean of the CF. BIRCH passes the incoming record to the root node CF closest to the incoming record.
  2. The record then descends to the non-leaf child nodes of the root-node CF selected in step 1. BIRCH compares the location of the record with the location of each non-leaf CF and passes the incoming record to the non-leaf CF closest to it.
  3. The record then descends to the leaf child nodes of the non-leaf CF selected in step 2. BIRCH compares the location of the record with the location of each leaf and tentatively passes the incoming record to the leaf closest to it.
  4. Perform one of (a) or (b):

(a) If the radius of the chosen leaf, including the new record, does not exceed the threshold T, then the incoming record is assigned to that leaf. The leaf and all of its parent CFs are updated to account for the new data point.

(b) If the radius of the chosen leaf, including the new record, does exceed the threshold T, then a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for the new data point.

If step 4(b) is executed and the leaf node already holds the maximum of L leaf entries, then the leaf node is split into two leaf nodes. The two most distant leaf CFs are used as seeds, with the remaining CFs assigned to whichever new leaf node is closer. If the parent node is also full, it is split as well, and so on up the tree. Note that the radius of a sub-cluster can be calculated without knowing the data points, as long as we have the count N, the linear sum LS, and the squared sum SS; a sketch of this check follows below. This allows BIRCH to evaluate whether a given data point belongs to a particular sub-cluster without needing to scan the original dataset.
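
A hedged sketch of that radius test, assuming the radius is computed from the CF triple as R = sqrt(SS/N − ‖LS/N‖²); the THRESHOLD value and the helper names are illustrative assumptions, not part of any library API.

```python
import numpy as np

THRESHOLD = 0.5   # the threshold T; an illustrative value

def radius(n, ls, ss):
    """Radius of a sub-cluster computed from its CF triple only."""
    centroid = ls / n
    return np.sqrt(max(ss / n - np.dot(centroid, centroid), 0.0))

def try_absorb(cf, point):
    """Step 4: tentatively add `point`; keep it only if the radius stays within T."""
    n, ls, ss = cf
    candidate = (n + 1, ls + point, ss + np.dot(point, point))
    if radius(*candidate) <= THRESHOLD:
        return candidate, True            # 4(a): the leaf absorbs the record
    return cf, False                      # 4(b): the caller starts a new leaf

leaf = (2, np.array([2.0, 2.0]), 4.02)    # CF of two points near (1, 1)
leaf, absorbed = try_absorb(leaf, np.array([1.1, 0.9]))
print(absorbed, round(radius(*leaf), 3))  # True 0.105
```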

Clustering the Sub-Clusters

Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters (the CF leaf nodes) to combine them into clusters. The task becomes much easier because the number of sub-clusters is far smaller than the number of data points. When a new data value is added, the summary statistics can be updated cheaply, which keeps the computation efficient.
In the original paper, the authors used agglomerative hierarchical clustering.
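
As a rough sketch of this stage, assuming leaf_cfs is the list of (N, LS, SS) triples produced by the tree-building step (the values below are made up), the leaf centroids can be handed to an off-the-shelf agglomerative clusterer such as scikit-learn's AgglomerativeClustering:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Assumed output of the tree-building stage: one (N, LS, SS) triple per leaf entry.
leaf_cfs = [
    (50, np.array([55.0, 48.0]), 112.0),
    (40, np.array([-38.0, 41.0]), 95.0),
    (60, np.array([61.0, -57.0]), 130.0),
    (45, np.array([49.0, 44.0]), 101.0),
]

# Cluster the handful of sub-cluster centroids instead of the ~200 raw points.
centroids = np.array([ls / n for n, ls, _ in leaf_cfs])
labels = AgglomerativeClustering(n_clusters=3).fit_predict(centroids)
print(labels)   # one final-cluster label per sub-cluster
```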

Parameters of BIRCH

There are three parameters in this algorithm that need to be tuned. Unlike K-means, the optimal number of clusters (k) need not be supplied by the user, as it can be determined by the algorithm.

threshold : The maximum radius a sub-cluster in a leaf node of the CF tree may have after a new sample is merged into it. If merging the new sample would push the radius of the closest sub-cluster above this value, a new sub-cluster is started instead.

branching_factor : This parameter specifies the maximum number of CF sub-clusters in each node (internal node).

n_clusters : The number of clusters to be returned after the entire BIRCH algorithm is complete i.e., number of clusters after the final clustering step. If set to None, the final clustering step is not performed and intermediate clusters are returned.
Scikit-learn offers a robust implementation for Python.
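
A short, illustrative run of scikit-learn's Birch on a toy dataset (the parameter values and the make_blobs settings below are assumptions for demonstration, not recommendations):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Toy data: 2,000 points drawn from 4 blobs.
X, _ = make_blobs(n_samples=2000, centers=4, cluster_std=0.8, random_state=42)

# threshold, branching_factor and n_clusters as described above.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_), "sub-clusters in the CF tree")
print(np.bincount(labels))   # size of each final cluster
```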

Conclusion


One drawback of BIRCH clustering comes from the tree structure inherent in the CF tree: the clustering solution may depend on the input ordering of the data records. To guard against this, the data analyst may wish to apply BIRCH to a few different random orderings of the data and look for consensus among the results, as sketched below. A benefit of BIRCH clustering, however, is that the analyst is not required to select the best choice of k, the number of clusters, as is the case with some other clustering methods. Rather, the number of clusters in a BIRCH clustering solution is an outcome of the tree-building process.
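
A quick, illustrative way to probe that order sensitivity is to run BIRCH on a few random orderings of the same data and compare the labelings, for example with the adjusted Rand index (the dataset and parameter values below are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

labelings = []
rng = np.random.default_rng(0)
for _ in range(3):
    order = rng.permutation(len(X))                      # a random input ordering
    labels = Birch(threshold=0.5, n_clusters=3).fit_predict(X[order])
    labelings.append(labels[np.argsort(order)])          # map labels back to original row order

# Pairwise agreement close to 1.0 suggests the solution is stable to ordering.
for i in range(len(labelings)):
    for j in range(i + 1, len(labelings)):
        print(i, j, adjusted_rand_score(labelings[i], labelings[j]))
```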

Please feel free to share your suggestions.
Thanks for Reading! Stay Safe!
