Clustering: A Very Short, Selective Summary

Grouping similar items, i.e., clustering, is a fundamental human activity for organizing, making sense of, and drawing conclusions from data. Across many scientific fields, clustering goes by different names but serves a common purpose: helping humans explore, interpret, and summarize data.

Clustering is unsupervised machine learning: there are no labels and no ground truth. Is there, for example, one true clustering of all the articles the New York Times has ever published? Humans have goals, and how useful the clusters are to those goals is all that matters.

Humans are pattern finders. A human can look at collections of items and group them in various ways. For example, a collection of dogs could be grouped according to their size or the color of their fur. Both clusterings are equally valid. Clustering by size may be more useful if the purpose is to decide which dogs get walked together.

When there are too many items for a human to cluster alone, computers can step in and help, but they need more instructions first. Items need to be represented in a machine-readable way, and human judgments need to be replaced with computation.

The first critical decision when preparing a collection for clustering by a computer is how to represent the items in a machine-readable format, e.g., as images or as sets of attributes, also called features, such as weight, height, and fur color. This representation affects every subsequent step of the clustering process. Since the clustering is performed for a human with a goal, the representation ideally captures the aspects of the items that are relevant to that goal.
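
As a minimal sketch (the dogs and feature values here are made up), a machine-readable representation might simply be a small table of feature values per item:

    import numpy as np

    # Hypothetical dogs as feature vectors: weight in kg, height in cm, and
    # fur color encoded as an integer category (0 = black, 1 = brown, 2 = white).
    feature_names = ["weight_kg", "height_cm", "fur_color"]
    dogs = np.array([
        [30.0, 60.0, 0],   # a large black dog
        [ 8.0, 25.0, 1],   # a small brown dog
        [32.0, 62.0, 2],   # a large white dog
    ])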

Further customizing the item representation for the human's goal, also known as feature extraction, is optional but often very helpful. Feature extraction means computing new features for each item from the original representation. These new features may better capture the aspects of the items that are relevant to the goal. For example, if the goal is to partition dogs by fur color and dogs are represented by images, one might extract the most common color in each dog image.
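
A hedged sketch of that example, assuming dogs are represented as RGB image arrays; treating the single most frequent pixel value as "the most common color" is a simplification of real color extraction:

    import numpy as np

    def most_common_color(image):
        # Return the most frequent RGB pixel in an image array of shape (H, W, 3).
        pixels = image.reshape(-1, 3)
        colors, counts = np.unique(pixels, axis=0, return_counts=True)
        return colors[np.argmax(counts)]

    # Hypothetical 2x2 "image" of a mostly brown dog.
    image = np.array([[[139, 69, 19], [139, 69, 19]],
                      [[139, 69, 19], [255, 255, 255]]], dtype=np.uint8)
    print(most_common_color(image))   # -> [139  69  19]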

A second optional step is feature selection, the process of determining which features can be ignored because they are irrelevant to the human's goal. For example, if the goal is to partition dogs by fur color and dogs are represented by collections of features, then only the fur color feature may be selected, ignoring height and weight.
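
Continuing the hypothetical feature table from the earlier sketch, feature selection can be as simple as keeping only the relevant column:

    import numpy as np

    # Hypothetical feature table: columns are weight_kg, height_cm, fur_color.
    feature_names = ["weight_kg", "height_cm", "fur_color"]
    dogs = np.array([[30.0, 60.0, 0],
                     [ 8.0, 25.0, 1],
                     [32.0, 62.0, 2]])

    # Feature selection for a fur-color goal: keep only the fur_color column.
    fur_color = dogs[:, [feature_names.index("fur_color")]]
    print(fur_color)   # shape (3, 1)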

Hopefully, at the end of this process, some items are closer to each other than they are to others, with respect to some aspect of the items the human cares about. If a computer is to verify this, one needs to define exactly what "closer" means. The computer needs a human to define a distance function that quantifies how close or far items are from each other.
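
One hedged sketch of a distance function, assuming items are numeric feature vectors; Euclidean distance is only one common choice among many:

    import numpy as np
    from scipy.spatial.distance import euclidean, cdist

    a = np.array([30.0, 60.0])   # one hypothetical dog (weight, height)
    b = np.array([ 8.0, 25.0])   # another

    # Distance between two items.
    print(euclidean(a, b))

    # Pairwise distances between all items in a small collection.
    items = np.array([[30.0, 60.0], [8.0, 25.0], [32.0, 62.0]])
    print(cdist(items, items))   # 3x3 matrix of Euclidean distances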

The next step is choosing a clustering method, i.e., a general strategy. Clustering means finding groups of items that are closer to each other than they are to items in other groups. There is no one best method for all clustering problems, but some methods produce clusters that support the human’s problem-specific goal better than others.

Some methods have an objective function that quantifies some desirable characteristic of clusters to maximize or some undesirable characteristic of clusters to minimize. One objective function sums up how different each item is from its cluster center. Presumably, the smaller the differences, the better the clustering. Specific clustering algorithms may define cluster centers differently, tally up the differences in different ways, and follow different processes to minimize the objective function.
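
A minimal sketch of one such objective, assuming numeric feature vectors and Euclidean distance: the within-cluster sum of squared distances that k-means minimizes (scikit-learn reports it as inertia_):

    import numpy as np
    from sklearn.cluster import KMeans

    def within_cluster_sum_of_squares(X, labels, centers):
        # Sum of squared distances from each item to its assigned cluster center.
        return sum(np.sum((X[labels == c] - center) ** 2)
                   for c, center in enumerate(centers))

    X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(within_cluster_sum_of_squares(X, km.labels_, km.cluster_centers_))
    print(km.inertia_)   # the same quantity, as computed by scikit-learn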

Some methods are more easily characterized by their process than their objective function. For example, hierarchical clustering methods are characterized as either:

  • agglomerative or bottom-up, by merging individual items into small clusters, then merging small clusters into larger clusters, etc.
  • divisive or top-down, by splitting the entire collection of items into clusters, then splitting those clusters into further clusters, until each cluster contains only one item.

The objective functions used in these methods help determine which clusters to merge or split.
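
A hedged sketch of the agglomerative (bottom-up) strategy on toy data, using scipy; Ward's criterion is one of several possible merge rules, not the only choice:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 5.1], [9.0, 1.0]])

    # Bottom-up: repeatedly merge the two closest clusters (Ward's criterion here).
    Z = linkage(X, method="ward")

    # Cut the resulting hierarchy into a chosen number of flat clusters.
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(labels)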

Some statistical clustering methods, such as mixture models, assume that the items are samples from underlying, unseen distributions of items. By making some assumptions about what those distributions might look like, an algorithm can map items to clusters to maximize a statistical measure of fit between the observed data and the unobserved distributions.
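
A hedged sketch using a Gaussian mixture model from scikit-learn; the assumption here is that each cluster's items are samples from an unseen Gaussian distribution, and the toy data is generated to match that assumption:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Toy data drawn from two hypothetical underlying Gaussians.
    X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
                   rng.normal(loc=6.0, scale=1.0, size=(50, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gmm.predict(X[:5]))         # most likely cluster per item
    print(gmm.predict_proba(X[:5]))   # degree of membership in each cluster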

Many methods require the human to provide additional values and thresholds, such as how similar two items need to be to be grouped together or how many clusters to look for. Some methods incorporate techniques that anticipate outliers and attempt to be robust to them.
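
As one concrete illustration (DBSCAN, a density-based method not named above): eps is a human-supplied closeness threshold, min_samples is a density threshold, and items that fit nowhere are flagged as noise rather than forced into a cluster:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[1.0, 1.0], [1.1, 0.9], [1.2, 1.1],
                  [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
                  [20.0, 20.0]])           # an outlier

    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
    print(labels)   # the outlier receives the label -1 ("noise")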

The types of clusters produced by clustering algorithms can be categorized in several ways, such as:

  • Hard versus soft: If each item is mapped to a single cluster, the method is hard. If items can belong to multiple clusters in different degrees, the method is soft (both are contrasted in the sketch after this list).
  • Partition versus overlap versus hierarchy: If each item belongs to one and only one cluster, the method produces strict partitions. If items may belong to more than one cluster, then the method produces overlapping clusters. If each item belongs to one cluster and clusters are strictly contained by other clusters, the method is hierarchical.
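
A brief sketch contrasting hard and soft output on the same toy data, using k-means (one label per item) and a Gaussian mixture (a degree of membership per cluster); this is illustrative, not an exhaustive taxonomy:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [2.0, 2.0]])

    # Hard: each item gets exactly one cluster label.
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

    # Soft: each item gets a degree of membership in every cluster.
    print(GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X))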

Some methods do not produce clusters of items, but rather clusters of components within items. For example, Latent Dirichlet Allocation (LDA) does not cluster entire documents. Instead, it hard-assigns each word occurrence in a document to a single cluster, typically called a topic. At the level of documents, however, this can look like a soft clustering: an individual document might be 50% hospital-related words, 30% government-related words, and 20% negotiation-related words.
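
A hedged sketch with scikit-learn's LatentDirichletAllocation on a few made-up sentences; the topic count and documents are hypothetical, and real corpora need far more data to yield sensible topics:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "hospital nurse doctor patient surgery",
        "government election senate policy vote",
        "hospital policy government funding doctor vote",
    ]

    X = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)

    # Each row is one document's mixture of topics (proportions summing to 1).
    print(lda.fit_transform(X))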

Clusters are hard to evaluate objectively. It is expensive and time consuming to bring in humans with goals to evaluate how useful the clusters are to them. Cluster validation refers to the metrics used to approximate the quality of clusters with little or no human input. These metrics can be the basis for quickly, if approximately, choosing one method over another, one parameter value over another, or one clustering over another.
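
One widely used validation metric is the silhouette score, which compares each item's distance to its own cluster against its distance to the nearest other cluster; a hedged sketch of using it to compare two cluster counts:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9],
                  [9.0, 0.0], [9.2, 0.1]])

    for k in (2, 3):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))   # higher is (approximately) better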

Another way to produce more helpful clusterings is to involve the human in an interactive process. The clustering algorithm produces clusters and gives the human one or more ways to provide feedback. Feedback could take the form of indicating what is good or bad about the clustering, or the human could reach in and directly move an item from one cluster to another. The method then reruns with this new information and produces a new clustering for the human to evaluate and give feedback on.
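
A minimal sketch of that loop, assuming a hypothetical get_feedback callback through which the human moves items between clusters; this illustrates the interaction pattern, not any specific published method:

    import numpy as np
    from sklearn.cluster import KMeans

    def interactive_clustering(X, k, get_feedback, max_rounds=5):
        centers = None
        labels = None
        for _ in range(max_rounds):
            if centers is None:
                km = KMeans(n_clusters=k, n_init=10, random_state=0)
            else:
                # Seed the next run with centers adjusted by the human's feedback.
                km = KMeans(n_clusters=k, init=centers, n_init=1)
            labels = km.fit_predict(X)

            moves = get_feedback(labels)   # hypothetical: {item_index: new_cluster}
            if not moves:
                break
            for i, c in moves.items():
                labels[i] = c
            # Recompute centers from the corrected labels (assumes no cluster is emptied).
            centers = np.vstack([X[labels == c].mean(axis=0) for c in range(k)])
        return labels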

One aspect of clustering that does not fall into the traditional conversation about features, distance functions, and methods is interpretability. A clustering may have good quality scores with respect to those cluster validation metrics, but if it is difficult for the human to understand what each cluster contains, they may have trouble using it to achieve their goals.
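
One simple interpretability aid (a sketch, not a standard requirement) is to show a representative item per cluster, for example the item closest to each cluster center:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin

    X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Index of the item nearest each cluster center: one exemplar per cluster.
    print(pairwise_distances_argmin(km.cluster_centers_, X))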

There are thousands of published methods that exhibit combinations of these properties and strategies. When comparing methods, it is helpful to break the comparison down by the representation, features, distance function, objective function, and algorithm they use, as well as the type of clusters they produce.