Clustering
Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups. Essentially, the goal is to segregate data with similar attributes and traits into clusters.
Clustering has a wide range of applications and use cases across many domains, both vertical and horizontal. Examples include market segmentation, recommendation systems, social media analysis, and anomaly detection.
There are two broad approaches to clustering: Hard clustering, where each data point either belongs to a cluster completely or not at all, and Soft clustering, where instead of placing each data point into exactly one cluster, a probability or likelihood of belonging to each cluster is assigned.
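As an illustration of the difference, here is a minimal sketch on made-up 1-D data with two illustrative cluster centres; the softmax weighting used for the soft assignments is just one possible choice, not a standard algorithm:

```python
import math

# Toy 1-D data and two illustrative cluster centres (made-up values).
points = [1.0, 1.5, 5.0, 5.5]
centres = [1.0, 5.0]

def hard_assign(x):
    # Hard clustering: the point belongs to exactly one cluster,
    # here simply the nearest centre.
    return min(range(len(centres)), key=lambda k: abs(x - centres[k]))

def soft_assign(x):
    # Soft clustering: one likelihood per cluster. A softmax of negative
    # squared distances is used here, purely as an illustration.
    weights = [math.exp(-(x - c) ** 2) for c in centres]
    total = sum(weights)
    return [w / total for w in weights]

print([hard_assign(x) for x in points])  # → [0, 0, 1, 1]
print(soft_assign(1.5))                  # two probabilities summing to 1
```

Note that the soft assignments carry more information: a point halfway between the two centres gets roughly equal likelihood for both clusters, which the hard assignment cannot express.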

There are many approaches to clustering, all built around the key idea of defining and measuring 'similarity' among data points. The most popular algorithms fall into a few categories, namely:
- Connectivity models: based on the idea that data points closer together in data space are more similar to each other than points lying farther apart. They either start with each data point in its own cluster and aggregate clusters as the distance decreases, or start with a single cluster and partition it as the distance increases. The choice of distance function is subjective. These models are easy to interpret but do not scale well. Ex: hierarchical clustering (HC) and its variations.
- Centroid models: iterative algorithms in which similarity is derived from the closeness of a data point to the centroids of the clusters. The number of clusters must be specified in advance, so prior knowledge of the data helps. These models run iteratively to seek a local optimum. Ex: K-Means.
- Distribution models: based on the probability that all data points in a cluster belong to the same distribution. Ex: the Expectation-Maximization algorithm (e.g. with multivariate normal distributions).
- Density models: these search the data space for regions of varying density of data points, isolating regions of different density and assigning the data points within each region to the same cluster. Ex: DBSCAN.
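To make the density-based idea concrete, here is a minimal, non-optimized DBSCAN-style sketch on toy 2-D points; the `eps` and `min_pts` values are illustrative, and a real implementation would use a spatial index rather than scanning all points:

```python
def region_query(points, i, eps):
    # Indices of all points within eps of point i (including i itself).
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    # labels: None = unvisited, -1 = noise, >= 0 = cluster id
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1  # noise (a border point may be re-labelled later)
            continue
        cluster += 1                 # start a new dense region
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:                 # grow the region outward
            j = seeds.pop()
            if labels[j] == -1:      # former noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # core point: expand from it
                seeds.extend(j_neighbours)
    return labels

points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0),
          (5.0, 5.0), (5.0, 5.5), (5.5, 5.0),
          (10.0, 10.0)]
print(dbscan(points, eps=1.0, min_pts=2))  # → [0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1, while the isolated point is labelled noise (-1), something centroid models like K-Means cannot express.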

The most popular is K-Means (K-M): specify the desired number of clusters, assign each data point to a cluster at random, compute the cluster centroids, then re-assign each data point to the closest centroid and re-compute the centroids. These last two steps are repeated until no further improvement is obtained (i.e. a local optimum is reached; K-Means is not guaranteed to find the global optimum).
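The steps above can be sketched in plain Python (1-D points for brevity; the random-assignment initialization follows the description above, and the empty-cluster re-seeding is one of several possible fixes):

```python
import random

def k_means(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly assign each point to one of k clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(iters):
        # Step 2: compute each cluster's centroid (mean of its points).
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids.append(sum(members) / len(members))
            else:
                centroids.append(rng.choice(points))  # re-seed empty cluster
        # Step 3: re-assign each point to the nearest centroid.
        new_labels = [min(range(k), key=lambda c: (p - centroids[c]) ** 2)
                      for p in points]
        if new_labels == labels:  # no change: a local optimum is reached
            break
        labels = new_labels
    return labels, centroids

points = [1.0, 1.1, 0.9, 8.0, 8.1, 7.9]
labels, centroids = k_means(points, k=2)
print(labels)  # the two tight groups end up in different clusters
```

Which cluster gets which id depends on the random initialization, which is exactly why repeated runs of K-Means can produce different outputs.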


Conversely, Hierarchical Clustering (HC) builds a hierarchy of clusters. At the start, each data point is assigned to its own cluster. Then, at each step, the two nearest clusters are merged into one. The algorithm terminates when only a single cluster is left.
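A minimal sketch of this agglomerative procedure, using the distance between cluster centroids as one possible closeness criterion (1-D points for brevity; a real implementation would not recompute all pairwise distances each step):

```python
def agglomerative(points):
    # Each point starts in its own cluster; record every merge until
    # a single cluster remains.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = sum(clusters[i]) / len(clusters[i])
                cj = sum(clusters[j]) / len(clusters[j])
                d = abs(ci - cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

for a, b, d in agglomerative([1.0, 1.2, 5.0, 5.3]):
    print(a, b, round(d, 2))  # nearest pairs merge first
```

The recorded merges (which clusters joined, and at what distance) are exactly the information a dendrogram visualizes.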

Results can be shown using a dendrogram:

The number of clusters that best conveys the different groups can be chosen by observing the dendrogram. A common heuristic for the best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum vertical distance without intersecting a cluster. Here is an example:
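One way to automate this heuristic is to look at the heights at which successive merges occur and cut the dendrogram at the largest jump; a sketch, assuming the merge distances are given in ascending order:

```python
def choose_n_clusters(merge_distances):
    # merge_distances: the heights at which successive merges happened,
    # ascending (n points produce n - 1 merges).
    # Cut at the largest jump between consecutive heights: merges below
    # the jump are kept, the rest are undone.
    gaps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]
    merges_kept = gaps.index(max(gaps)) + 1
    n_points = len(merge_distances) + 1
    return n_points - merges_kept

# Illustrative heights: two cheap merges, then one very expensive one,
# suggesting the data falls naturally into 2 clusters.
print(choose_n_clusters([0.2, 0.3, 4.05]))  # → 2
```

This is only a heuristic: in practice the dendrogram is usually inspected visually, and domain knowledge often overrides the largest-gap rule.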

It is worth pointing out that, in HC, the decision to merge two clusters is usually made on the basis of the closeness of these clusters. Several metrics exist for measuring the closeness of two clusters, with Euclidean distance being the most used: ||a-b||₂ = √(Σᵢ(aᵢ-bᵢ)²).
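The Euclidean distance is straightforward to compute directly from its definition:

```python
import math

def euclidean(a, b):
    # ||a - b||_2 = sqrt(sum_i (a_i - b_i)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # → 5.0 (the classic 3-4-5 triangle)
```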
Regarding the main differences in practice between K-M and HC:
- HC does not handle large data volumes as well as K-M does: the time complexity of a K-M iteration is linear in the number of points (O(n)), while HC is quadratic (O(n²)).
- The outputs produced by running K-M multiple times might differ (due to the random initialization), whereas HC results are reproducible.
- K-M requires prior knowledge of K; in HC, one can stop at whatever number of clusters one determines to be appropriate by interpreting the dendrogram.
