An Overview of BIRCH Algorithm

Kottapalli Sai swaroop
12 min read · Dec 29, 2022

Introduction:

Finding useful patterns in large datasets is a widely studied problem, and one of the most studied tasks in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Many clustering algorithms exist, but most do not adequately address the problems of very large datasets and of minimizing I/O costs. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed to be suitable for very large databases.

What is BIRCH Algorithm?

BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering possible with the available resources (i.e., available memory and time constraints).

Given a large set of multi-dimensional data points, the data space is usually not uniformly occupied. Data clustering identifies the sparse and the crowded regions, and hence discovers the overall distribution patterns of the dataset. Moreover, the derived clusters can be visualized more efficiently and effectively than the original dataset.

There are two types of attributes involved in the data to be clustered:

  1. Metric
  2. Non-metric

Here we consider metric attributes, as in most of the Statistics literature, where the clustering problem is formalized as follows:

Given the desired number of clusters K, a dataset of N points, and a distance-based measurement function (e.g., the weighted total or average distance between pairs of points in clusters; we will discuss these later in this article), we are asked to find a partition of the dataset that minimizes the value of the measurement function. This is a non-convex discrete optimization problem (the search tends to reach local minima rather than the global minimum). Due to an abundance of local minima, there is typically no way to find a globally minimal solution without trying all possible partitions.
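To make the optimization explicit, one possible instantiation of the measurement function (the average pairwise distance mentioned above, written here with squared Euclidean distances) is:

\min_{\{C_1, \dots, C_K\}} \; \sum_{k=1}^{K} \frac{1}{|C_k|\,(|C_k|-1)} \sum_{\substack{x_i, x_j \in C_k \\ i \neq j}} \lVert x_i - x_j \rVert^2

where {C_1, …, C_K} ranges over all partitions of the N points into K clusters. The specific form of the objective is an illustrative assumption; any distance-based measurement function of this kind fits the problem statement.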

We adopt the problem definition used in Statistics, but with an additional, database-oriented constraint: the amount of memory available is limited (typically much smaller than the dataset size) and we want to minimize the I/O required. A related point is that it is desirable to take into account the amount of time a user is willing to wait for the results of the clustering algorithm.

We present a clustering method named BIRCH and demonstrate that it is especially suitable for very large databases. Its I/O cost is linear in the size of the dataset: a single scan of the dataset yields a good clustering, and one or more additional passes can (optionally) be used to improve the quality further. The BIRCH architecture also offers opportunities for parallelism and for interactive or dynamic performance tuning based on knowledge about the dataset gained over the course of the execution. Finally, BIRCH is the first clustering algorithm proposed in the database area that addresses outliers.

Existing approaches for clustering:

Previous approaches, (1) probability-based (like most approaches in Machine Learning) and (2) distance-based (like most work in Statistics), do not adequately consider the case where the dataset is too large to fit in main memory.

1. Probability-based approaches:

They typically assume that probability distributions on separate attributes are statistically independent of each other. In reality, this is far from true: correlation between attributes exists, and sometimes this kind of correlation is exactly what we are looking for. Probability-based representations of clusters also make updating and storing the clusters very expensive, especially if the attributes have a large number of values, because their complexities depend not only on the number of attributes, but also on the number of values for each attribute.

2. Distance-based approaches:

They assume that all data points are given in advance and can be scanned frequently. They totally or partially ignore the fact that not all data points in the dataset are equally important with respect to the clustering purpose, and that data points which are close and dense should be considered collectively instead of individually. They are global or semi-global methods at the granularity of data points. That is, for each clustering decision they inspect all data points or all currently existing clusters equally, no matter how close or far away they are, and they use global measurements, which require scanning all data points or all currently existing clusters.

Hence none of them achieves linear time scalability with stable quality.

Contributions of BIRCH:

An important contribution is the formulation of the clustering problem in a way that is appropriate for very large datasets, by making the time and memory constraints explicit. In addition, BIRCH has the following advantages over previous distance-based approaches.

  1. BIRCH is local (as opposed to global) in that each clustering decision is made without scanning all data points or all currently existing clusters. It uses measurements that reflect the natural closeness of points, and at the same time, can be incrementally maintained during the clustering process.
  2. BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes. A dense region of points is treated collectively as a single cluster, while points in sparse regions are treated as outliers and optionally removed.
  3. BIRCH makes full use of available memory to derive the finest possible sub clusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency). The clustering and reducing process is organized and characterized by the use of an in-memory, height-balanced, and highly occupied tree structure. Due to these features, its running time is linearly scalable.

BIRCH is an incremental method that does not require the whole dataset in advance, and only scans the dataset once.

Background:

Before getting to the BIRCH algorithm, we need some background on the distance metrics used in BIRCH. We begin by defining the centroid, radius, and diameter of a cluster. Given N d-dimensional data points in a cluster, {Xi} where i = 1, 2, …, N, the centroid X0, radius R, and diameter D of the cluster are defined as:
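X_0 = \frac{\sum_{i=1}^{N} X_i}{N}

R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}

D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}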

R is the average distance from member points to the centroid. D is the average pairwise distance within the cluster. They are two alternative measures of the tightness of the cluster around the centroid. Next, we define five alternative distances for measuring the closeness of two clusters.

Given the centroids of two clusters, X01 and X02, the centroid Euclidean distance D0 and the centroid Manhattan distance D1 of the two clusters are defined as:
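D0 = \left( (X_{01} - X_{02})^2 \right)^{1/2}

D1 = |X_{01} - X_{02}| = \sum_{i=1}^{d} \left| X_{01}^{(i)} - X_{02}^{(i)} \right|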

Given N1 d-dimensional data points in a cluster {Xi}, where i = 1, 2, …, N1, and N2 data points in another cluster {Xj}, where j = N1+1, N1+2, …, N1+N2:

The average inter-cluster distance D2, average intra-cluster distance D3 and variance increase distance D4 of the two clusters are defined as:
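D2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (X_i - X_j)^2}{N_1 N_2} \right)^{1/2}

D3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (X_i - X_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}

D4 = \sum_{k=1}^{N_1+N_2} (X_k - X_0)^2 \;-\; \sum_{i=1}^{N_1} (X_i - X_{01})^2 \;-\; \sum_{j=N_1+1}^{N_1+N_2} (X_j - X_{02})^2

where X_0 is the centroid of the merged cluster and X_{01}, X_{02} are the centroids of the two individual clusters, so D4 measures the increase in the total squared deviation caused by merging them.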

D3 is actually D of the merged cluster. For the sake of clarity, we treat X0, R, and D as properties of a single cluster, and D0, D1, D2, D3, and D4 as properties between two clusters, and state them separately. Users can optionally preprocess data by weighting or shifting along different dimensions without affecting the relative placement.

Clustering Feature and CF Tree: The Clustering Feature and the CF Tree are the core of BIRCH’s incremental clustering.

Clustering Feature: A Clustering Feature (CF) is a triple summarizing the information that we maintain about a sub cluster. Given N d-dimensional data points in a cluster {Xi}, the CF is defined as CF = (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the square sum of the N data points. CF vectors are additive: when two disjoint sub clusters are merged, the CF of the merged cluster is simply the sum of the two CF vectors, which is what makes incremental insertion and merging cheap.
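As a concrete illustration, here is a minimal Python sketch (names and structure are my own, not from the article or the original implementation) of a CF triple that is updated incrementally and from which the centroid, R, and D defined above can be derived. SS is kept as a scalar sum of squared norms, which is enough for these formulas.

import numpy as np

class ClusteringFeature:
    """Minimal sketch of a CF triple (N, LS, SS) for one sub cluster.

    n  - number of points absorbed so far
    ls - linear sum of the points (d-dimensional vector)
    ss - square sum of the points (scalar sum of squared norms)
    """

    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)
        self.ss = 0.0

    def add_point(self, x):
        # Incrementally absorb a single data point.
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        # CF additivity: merging two disjoint sub clusters adds their triples.
        self.n += other.n
        self.ls = self.ls + other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||X0||^2: average squared distance from points to the centroid.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)): average squared pairwise distance.
        if self.n < 2:
            return 0.0
        num = 2 * self.n * self.ss - 2 * float(self.ls @ self.ls)
        return float(np.sqrt(max(num / (self.n * (self.n - 1)), 0.0)))

# Small demo: three 2-D points.
cf = ClusteringFeature(d=2)
for p in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
    cf.add_point(p)
print(cf.centroid(), cf.radius(), cf.diameter())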

CF Tree:
A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T. Each non leaf node contains at most B entries of the form [CFi, childi], where i = 1, 2, …, B, “childi” is a pointer to its i-th child node, and CFi is the CF of the sub cluster represented by this child. So a non leaf node represents a cluster made up of all the sub clusters represented by its entries. A leaf node contains at most L entries, each of the form [CFi], where i = 1, 2, …, L. In addition, each leaf node has two pointers, “prev” and “next”, which are used to chain all leaf nodes together for efficient scans. A leaf node also represents a cluster made up of all the sub clusters represented by its entries. But all entries in a leaf node must satisfy a threshold requirement with respect to a threshold value T: the diameter (or radius) has to be less than T.

The tree size is a function of T: the larger T is, the smaller the tree. We require a node to fit in a page of size P. Once the dimension d of the data space is given, the sizes of leaf and non leaf entries are known, and then B and L are determined by P. So P can be varied for performance tuning.

Such a CF tree is built dynamically as new data objects are inserted. It is used to guide a new insertion into the correct sub cluster for clustering purposes, just as a B+ tree is used to guide a new insertion into the correct position for sorting purposes. The CF tree is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a sub cluster (which absorbs many data points with diameter or radius under a specific threshold T).
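To make the node layout concrete, here is a small structural sketch. The names and parameter values are illustrative only; a non leaf node holds at most B [CF, child] entries, a leaf node holds at most L CF entries plus the prev/next chain pointers, and every leaf entry must keep its diameter (or radius) below T.

from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

# Illustrative CF tree parameters (values are arbitrary, not prescribed by BIRCH):
B = 50       # branching factor: maximum entries per non leaf node
L_MAX = 50   # maximum entries per leaf node
T = 0.5      # threshold on the diameter (or radius) of each leaf sub cluster

@dataclass
class LeafNode:
    # Each entry is the CF of one sub cluster whose diameter (or radius) is < T.
    entries: List[Any] = field(default_factory=list)   # e.g. ClusteringFeature objects
    prev: Optional["LeafNode"] = None                   # chain pointers so that all
    next: Optional["LeafNode"] = None                   # leaves can be scanned efficiently

@dataclass
class NonLeafNode:
    # Each entry pairs the CF summarizing a child with a pointer to that child,
    # so a non leaf node represents the cluster made up of all its children.
    entries: List[Tuple[Any, Any]] = field(default_factory=list)  # [(CF, child), ...]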

Insertion into a CF Tree:

We now present the algorithm for inserting an entry into a CF tree. Given an entry “Ent”, it proceeds as follows:

1. Identifying the appropriate leaf: Starting from the root, it recursively descends the CF tree by choosing the closest child node according to a chosen distance metric: D0, D1, D2, D3, or D4 as defined above.

2. Modifying the leaf: When it reaches a leaf node, it finds the closest leaf entry, say Li, and then tests whether Li can “absorb” “Ent” without violating the threshold condition. If so, the CF vector for Li is updated to reflect this. If not, a new entry for “Ent” is added to the leaf. If there is space on the leaf for this new entry, we are done; otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries based on the closest criterion.

3. Modifying the path to the leaf: After inserting “Ent” into a leaf, we must update the CF information for each non leaf entry on the path to the leaf. In the absence of a split, this simply involves adding CF vectors to reflect the addition of “Ent”. A leaf split requires us to insert a new non leaf entry into the parent node to describe the newly created leaf. If the parent has space for this entry, then at all higher levels we only need to update the CF vectors to reflect the addition of “Ent”. In general, however, we may have to split the parent as well, and so on up to the root. If the root is split, the tree height increases by one.

4. Merging refinement: Splits are caused by the page size, which is independent of the clustering properties of the data. In the presence of a skewed data input order, this can affect the clustering quality and also reduce space utilization. A simple additional merging step often helps ameliorate these problems: suppose there is a leaf split, and the propagation of this split stops at some non leaf node Nj, i.e., Nj can accommodate the additional entry resulting from the split. We now scan node Nj to find the two closest entries. If they are not the pair corresponding to the split, we try to merge them and the corresponding two child nodes. If there are more entries in the two child nodes than one page can hold, we split the merging result again. During the re-splitting, in case one of the seeds attracts enough merged entries to fill a page, we simply put the remaining entries with the other seed. In summary, if the merged entries fit on a single page, we free a node for later use and create one more entry space in node Nj, thereby increasing space utilization and postponing future splits; otherwise we improve the distribution of entries in the two closest children.
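The leaf-level part of this procedure (step 2) can be sketched in a few lines of Python. This is a deliberately simplified, flat version: it keeps one leaf as a list of CF triples (n, linear sum, square sum), shows the absorb-or-add decision against the threshold T, and the farthest-pair split when the leaf overflows. Tree descent, path updates, and the merging refinement are omitted, and all names and parameter values are illustrative.

import numpy as np

T = 0.5      # threshold on the sub cluster diameter
L_MAX = 4    # maximum number of entries a leaf may hold before it must split

def centroid(cf):
    n, ls, ss = cf
    return ls / n

def diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return float(np.sqrt(max((2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1)), 0.0)))

def absorb(cf, x):
    n, ls, ss = cf
    return (n + 1, ls + x, ss + float(x @ x))

def insert_into_leaf(leaf, x):
    """Insert point x into `leaf` (a list of CF triples); returns one or two leaves."""
    x = np.asarray(x, dtype=float)
    if leaf:
        # Step 2a: find the closest existing entry (centroid Euclidean distance, D0).
        i = min(range(len(leaf)), key=lambda k: np.linalg.norm(centroid(leaf[k]) - x))
        candidate = absorb(leaf[i], x)
        # Step 2b: absorb x if the threshold condition still holds.
        if diameter(candidate) < T:
            leaf[i] = candidate
            return [leaf]
    # Step 2c: otherwise add a new entry; split the leaf if it overflows.
    leaf.append((1, x.copy(), float(x @ x)))
    if len(leaf) <= L_MAX:
        return [leaf]
    # Split: take the farthest pair of entries as seeds, redistribute the rest.
    cents = [centroid(cf) for cf in leaf]
    pairs = [(a, b) for a in range(len(leaf)) for b in range(a + 1, len(leaf))]
    s1, s2 = max(pairs, key=lambda p: np.linalg.norm(cents[p[0]] - cents[p[1]]))
    new1, new2 = [leaf[s1]], [leaf[s2]]
    for k, cf in enumerate(leaf):
        if k in (s1, s2):
            continue
        closer_to_s1 = np.linalg.norm(cents[k] - cents[s1]) <= np.linalg.norm(cents[k] - cents[s2])
        (new1 if closer_to_s1 else new2).append(cf)
    return [new1, new2]

# Demo: feed a few 2-D points through a single leaf.
leaf = []
for p in [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 10.0)]:
    leaf = insert_into_leaf(leaf, p)[0]   # a real CF tree would propagate splits upward
print([cf[0] for cf in leaf])             # points absorbed by each sub cluster: [2, 2, 1]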

The BIRCH Clustering Algorithm:

With all the required background covered, we can now dive into the BIRCH algorithm itself. The figure below presents an overview of the BIRCH algorithm.

The main task of Phase 1 is to scan all data and build an initial in-memory CF tree using the given amount of memory and recycling space on disk. This CF tree tries to reflect the clustering information of the dataset as finely as possible under the memory limit. With crowded data points grouped as fine sub clusters, and sparse data points removed as outliers, this phase creates an in-memory summary of the data.

Phase 1 starts with an initial threshold value, scans the data, and inserts points into the tree. If it runs out of memory before it finishes scanning the data, it increases the threshold value and rebuilds a new, smaller CF tree by re-inserting the leaf entries of the old tree. After the old leaf entries have been re-inserted, the scanning of the data (and insertion into the new tree) is resumed from the point at which it was interrupted. The figure below shows the control flow of Phase 1.
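For readers who want to experiment, scikit-learn ships a BIRCH implementation whose threshold and branching_factor parameters correspond to T and B. The sketch below (with made-up data and parameter values) mimics Phase 1’s incremental scan by feeding the data in chunks through partial_fit.

import numpy as np
from sklearn.cluster import Birch

# Made-up data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2)) for c in (0.0, 4.0, 8.0)])

# threshold plays the role of T, branching_factor the role of B.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
model.partial_fit(X[:300])    # first chunk: the CF tree is built incrementally
model.partial_fit(X[300:])    # later chunks keep inserting into the same tree
labels = model.predict(X)

print(len(model.subcluster_centers_), "leaf sub clusters")
print(np.bincount(labels))    # points per final cluster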

After Phase 1, subsequent computations in later phases will be:
1. Fast, because (a) no I/O operations are needed, and (b) the problem of clustering the original data is reduced to the smaller problem of clustering the sub clusters in the leaf entries.
2. Accurate, because (a) a lot of outliers are eliminated, and (b) the remaining data is reflected with the finest granularity that can be achieved given the available memory.
3. Less order-sensitive, because the leaf entries of the initial tree form an input order with better data locality than the arbitrary original input order.

Phase 2 (optional): We have observed that the existing global or semi-global clustering methods applied in Phase 3 have different input size ranges within which they perform well in terms of both speed and quality. So potentially there is a gap between the size of Phase 1 results and the input range of Phase 3. Phase 2 serves as a cushion and bridges this gap: Similar to Phase 1, it scans the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping crowded sub clusters into larger ones.

The undesirable effects of a skewed input order and of splitting triggered by page size can make the tree unfaithful to the actual clustering patterns in the data. This is remedied in Phase 3 by using a global or semi-global algorithm to cluster all leaf entries. We observe that existing clustering algorithms for a set of data points can be readily adapted to work with a set of sub clusters, each described by its CF vector. For example, with the CF vectors known, we can naively calculate the centroid as the representative of a sub cluster, treat each sub cluster as a single point, and use an existing algorithm without modification; or, to be a little more sophisticated, we can treat a sub cluster of n data points as its centroid repeated n times and modify an existing algorithm slightly to take the counting information into account; or, to be general and accurate, we can apply an existing algorithm directly to the sub clusters, because the information in their CF vectors is usually sufficient for calculating most distance and quality metrics.

In Phase 3, we adapt an agglomerative hierarchical clustering algorithm by applying it directly to the sub clusters represented by their CF vectors. It uses the accurate distance metric D2 or D4, which can be calculated from the CF vectors, during the whole clustering, and has a complexity of O(N²). It also provides the flexibility of allowing the user to specify either the desired number of clusters or a desired diameter (or radius) threshold for clusters.
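As a rough illustration of this phase (using the naive “treat each sub cluster as a single point at its centroid” adaptation described above, rather than the CF-based D2/D4 distances), one can build the CF tree first and then run an off-the-shelf hierarchical algorithm on the leaf sub cluster centroids. Data and parameter values below are made up.

import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(300, 2)) for c in (0.0, 4.0, 8.0)])

# Phase 1: build the CF tree only (no global clustering step yet).
tree = Birch(threshold=0.5, branching_factor=50, n_clusters=None).fit(X)
centroids = tree.subcluster_centers_        # one row per leaf sub cluster

# Phase 3: global (hierarchical) clustering over the much smaller set of centroids.
global_labels = AgglomerativeClustering(n_clusters=3).fit_predict(centroids)
print(len(centroids), "sub clusters grouped into", global_labels.max() + 1, "clusters")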

After Phase 3, we obtain a set of clusters that captures the major distribution patterns in the data. However, minor and localized inaccuracies might exist because of the rare misplacement problem, and because Phase 3 is applied to a coarse summary of the data. Phase 4 is optional and entails the cost of additional passes over the data to correct those inaccuracies and refine the clusters further. Note that up to this point, the original data has only been scanned once, although the tree and outlier information may have been scanned multiple times.

Phase 4 uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds to obtain a set of new clusters. Not only does this allow points belonging to a cluster to migrate, but it also ensures that all copies of a given data point go to the same cluster. Phase 4 can be extended with additional passes if desired by the user, and it has been proved to converge to a minimum. As a bonus, during this pass each data point can be labeled with the cluster it belongs to, if we wish to identify the data points in each cluster. Phase 4 also provides us with the option of discarding outliers: a point that is too far from its closest seed can be treated as an outlier and not included in the result.
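A minimal sketch of this redistribution step, assuming we already have the Phase 3 centroids as seeds: assign every original point to its nearest seed (essentially one assignment pass of k-means) and, optionally, mark points beyond an illustrative cutoff as outliers.

import numpy as np

def redistribute(X, seeds, outlier_cutoff=None):
    # Distance from every point to every seed: shape (n_points, n_seeds).
    dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    closest = dists.min(axis=1)
    if outlier_cutoff is not None:
        labels = np.where(closest > outlier_cutoff, -1, labels)  # -1 marks outliers
    return labels

# Tiny demo with made-up points and two seeds.
X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.1], [9.0, 9.0]])
seeds = np.array([[0.1, 0.05], [4.0, 4.0]])
print(redistribute(X, seeds, outlier_cutoff=2.0))   # e.g. [0, 0, 1, -1]

This completes the pipeline: a single scan of the data builds the CF tree, the optional condensing and global clustering phases operate on the in-memory summary, and the final optional pass refines and labels the clusters.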
