A Study of Unsupervised Clustering

prashanth
Jul 7, 2020


Introduction

Clustering is a technique for identifying similar entities in a dataset. Entities within a cluster share more characteristics with one another than with entities in other clusters.

Clustering is a common first step for many advanced analytics and machine learning initiatives, so it is imperative that data scientists understand its nuances before using it. This series will take the reader through a study of clustering methods.

Part I of our study will:

· Differentiate clustering objectives,

· Examine the characteristics of clustering models,

· Briefly survey popular clustering methods, and

· Review appropriate clustering validation techniques

Part II of our study will take a case study approach to:

· Demonstrate and compare a selection of clustering algorithms on a categorical dataset

Before we start with clustering, let us understand the taxonomy below.

Figure: Machine learning taxonomy (source: https://in.mathworks.com/help/stats/machine-learning-in-matlab.html)

Machine learning algorithms can be divided into Unsupervised and Supervised learning.

Supervised learning is used when you know what the right answers are (you already have a dataset of input and output variables) and you want to train your algorithm to learn and predict those answers. The dataset used to train algorithms to learn the mapping from inputs to output labels is called the training dataset.

Unsupervised learning is based on the idea that computer algorithms can train themselves to identify inherent, complex structure and patterns without human guidance. It is used when nobody knows what the right answers are, and it helps us learn the inherent structure of the dataset. Clustering falls under unsupervised learning.

Part 1: Understanding Clustering

Clustering Objectives

Hard Clustering

This objective results in non-overlapping clusters. For example, consider the clustering of customers for a retail store. Hard clustering methods would place each customer into exactly one cluster.

Soft Clustering

This objective results in overlapping clusters. Soft clustering methods might place a single entity in multiple clusters, often with a confidence (belief) associated with its membership in each of the target clusters. Returning to the example of clustering customers for a retail store: each customer receives a set of probabilities that represent the belief (likelihood) of their membership in each of the resulting clusters.
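
As an illustration of soft membership, here is a minimal sketch using fuzzy c-means from the R package e1071; the article does not prescribe a particular soft clustering algorithm, and the dataset and number of clusters are illustrative assumptions:

```r
# Soft clustering sketch: fuzzy c-means (package e1071).
# Each point receives a membership weight for every cluster instead of a single label.
library(e1071)

X <- as.matrix(iris[, 1:4])            # numeric features of a toy dataset
fit <- cmeans(X, centers = 3, m = 2)   # m > 1 controls how "fuzzy" the memberships are

head(round(fit$membership, 3))         # rows sum to 1: belief of membership per cluster
```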

Characteristics of clustering models

Clustering can get complex depending on the nature of the business, the data varieties involved, and the learning approach. There is no generic best or right clustering algorithm; the right algorithm is the one that suits your data and business needs. Success with different clustering models, and the decision of which model to select, are closely interwoven with an understanding of the characteristics of clustering models.

Dissimilarity or Similarity measures

Dissimilarity (or similarity) measures are the cornerstone of clustering methods: they determine how entities in a dataset will be clustered. There are distinct notions of dissimilarity, giving rise to the four families of clustering methods discussed below (a small example of computing a dissimilarity matrix follows the list).

  1. Connectivity methods, a.k.a. hierarchical clustering: These models are based on the notion that data points closer together in the data space are more similar to each other than data points lying farther away. Examples are agglomerative hierarchical clustering, divisive clustering, and their variants.
  2. Centroid methods: These are iterative clustering algorithms in which similarity is determined by the distance of a data point to the centroid of a cluster. The centroid may not itself be a member of the data set. Examples include K-means, K-median, and K-mode clustering and their variants.
  3. Distribution methods: In these models, similarity is defined as the probability that all data points in a cluster belong to the same distribution (e.g., Gaussian). These models often suffer from overfitting. A popular example is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
  4. Density methods: These models are based on the notion of density of data points in the data space. They search the data space and isolate areas of different density into different clusters. Popular examples of density models are DBSCAN and OPTICS.
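
As a minimal illustration, the following base R sketch computes a pairwise dissimilarity matrix; the choice of metric is one of the levers that distinguishes the method families above, and the dataset is illustrative:

```r
# Dissimilarity matrix sketch with base R's dist().
X <- as.matrix(iris[, 1:4])

d_euclidean <- dist(X, method = "euclidean")   # default notion of dissimilarity
d_manhattan <- dist(X, method = "manhattan")   # an alternative notion

as.matrix(d_euclidean)[1:3, 1:3]               # pairwise distances between the first 3 points
```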

Data Varieties:

The choice of clustering is often influenced by the data varieties under consideration. For example, the Euclidean distance similarity measure is appropriate for numerical values but cannot be used directly when analyzing social media data. This is why understanding the data sources, and the data varieties within them, becomes important before determining candidate clustering models. The suite of candidate clustering models should depend on the data varieties included. Below we list a few different data types.

  1. Numerical data: e.g., price, quantity
  2. Categorical data: e.g., sex, zip code, product purchases
  3. Text data: e.g. web and social media feeds
  4. Multimedia data: e.g., image, audio, and video. (Flickr, YouTube)
  5. Time Series data: e.g., sensor readings, stock market prices
  6. Sequence data: e.g., biological genome sequences
  7. Stream Data: e.g., IoT /sensor streams, web click streams
  8. Graphs and homogeneous networks
  9. Heterogeneous Networks
  10. Uncertain data

Brief Survey of Popular Clustering Algorithms

Numeric clustering algorithms:

Numerical clustering methods are the go-to methods for most data scientists. To that end, it is essential to understand numerical clustering before diving into other types of clustering, because any data type (text, categorical, image) can be represented in numerical form, a step we refer to as encoding the data. Once the input data is represented in numerical form, we can use any of the numerical clustering methods. In Part II of this study, we will see exactly how that can be achieved using a one-hot encoder. Below are the most popular numerical methods.
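
As a preview of that idea, here is a minimal base R sketch of one-hot (dummy) encoding with model.matrix; the column names and values are illustrative, not from a real dataset:

```r
# One-hot encoding sketch: categorical columns become numeric indicator columns.
df <- data.frame(sex     = factor(c("F", "M", "F")),
                 product = factor(c("A", "B", "A")))

# contrasts.arg forces a full set of indicator columns for every factor level
X <- model.matrix(~ . - 1, data = df,
                  contrasts.arg = lapply(df, contrasts, contrasts = FALSE))
X   # rows are now numeric vectors, so numerical clustering methods apply
```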

K-means: The K-means algorithm produces a fixed, user-defined number of clusters, each associated with a center; each data point belongs to the cluster with the nearest center.

R Package: stats (function kmeans)

Dissimilarity Matrix: Distance function
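
A minimal sketch with base R's kmeans(), assuming an illustrative numeric dataset and k = 3:

```r
# K-means sketch (function kmeans in the stats package).
X <- as.matrix(iris[, 1:4])

set.seed(42)                                # results depend on random initialization
fit <- kmeans(X, centers = 3, nstart = 25)  # k = 3 user-defined clusters

fit$centers                                 # one center per cluster
table(fit$cluster)                          # hard assignment: each point in exactly one cluster
```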

Affinity Propagation: This clustering algorithm is based on message passing between data points. It finds a subset of points to act as exemplars based on (dis)similarities and assigns each point in the data set to the closest exemplar.

R Package: apcluster

Dissimilarity Matrix: Distance function like negatively squared Euclidean distance
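
A minimal sketch with the apcluster package mentioned above; negDistMat(r = 2) supplies the negative squared Euclidean similarity, and the dataset is illustrative:

```r
# Affinity propagation sketch (package apcluster).
library(apcluster)

X <- as.matrix(iris[, 1:4])
fit <- apcluster(negDistMat(r = 2), X)   # exemplars emerge from message passing

length(fit@exemplars)                    # number of clusters found (not fixed in advance)
head(fit@clusters[[1]])                  # points assigned to the first exemplar
```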

Mean Shift: This is a sliding-window-based algorithm that attempts to find dense areas of data points. It is a centroid-based algorithm, meaning that the goal is to locate the center point of each group/class; it works by updating candidate center points to be the mean of the points within the sliding window. These candidate windows are then filtered in a post-processing stage to yield the final set of centers.

R Package: MeanShift

Dissimilarity Matrix: Probability density function to place centroids.
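
To make the sliding-window idea concrete, below is a from-scratch sketch of the mean-shift update for a single point; this is not the MeanShift package API, and the flat kernel and bandwidth h are illustrative assumptions:

```r
# Mean-shift sketch: repeatedly move a candidate center to the mean of the
# points inside its window until it stops moving (a density mode).
mean_shift_point <- function(x, pts, h, max_iter = 100, tol = 1e-6) {
  for (i in seq_len(max_iter)) {
    dists <- sqrt(rowSums((pts - matrix(x, nrow(pts), ncol(pts), byrow = TRUE))^2))
    new_x <- colMeans(pts[dists <= h, , drop = FALSE])  # mean of points in the window
    if (sqrt(sum((new_x - x)^2)) < tol) break            # converged to a density mode
    x <- new_x
  }
  x
}

X <- as.matrix(iris[, 1:2])
modes <- t(apply(X, 1, mean_shift_point, pts = X, h = 0.5))
# Points whose windows converge to (nearly) the same mode form one cluster;
# a post-processing step merges near-duplicate modes.
```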

Spectral Clustering: This is a graph-based method that uses the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.

R Package: spectralClustering

Dissimilarity Matrix: Similarity graph between N objects to cluster. Compute the first k eigenvectors of its Laplacian matrix to define a feature vector for each object.
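
The following from-scratch sketch mirrors that recipe (Gaussian similarity graph, Laplacian eigenvectors, then k-means in the reduced space); it is illustrative and not the spectralClustering package API:

```r
# Spectral clustering sketch: embed points using Laplacian eigenvectors, then k-means.
X <- as.matrix(iris[, 1:4])
k <- 3

S <- exp(-as.matrix(dist(X))^2 / 2)   # Gaussian similarity matrix (bandwidth assumed = 1)
D <- diag(rowSums(S))                 # degree matrix
L <- D - S                            # unnormalized graph Laplacian

eig <- eigen(L, symmetric = TRUE)     # eigenvalues returned in decreasing order
n <- nrow(X)
U <- eig$vectors[, (n - k + 1):n]     # eigenvectors of the k smallest eigenvalues

set.seed(42)
table(kmeans(U, centers = k, nstart = 25)$cluster)
```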

Agglomerative Clustering: This method starts with each point in its own cluster and then, for each cluster, uses some criterion to choose another cluster to merge with. It repeats this process until only one cluster remains, yielding a hierarchy that branches down to a leaf for each point in the dataset.

R Package: stats (function hclust)

Dissimilarity Matrix: Distance function
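
A minimal sketch with base R's hclust(); the Ward linkage and the cut into three clusters are illustrative choices:

```r
# Agglomerative hierarchical clustering sketch (function hclust in stats).
X <- as.matrix(iris[, 1:4])

d   <- dist(X)                        # dissimilarity matrix (Euclidean)
fit <- hclust(d, method = "ward.D2")  # merge clusters bottom-up

plot(fit)                             # dendrogram: one leaf per data point
clusters <- cutree(fit, k = 3)        # cut the tree into 3 flat clusters
table(clusters)
```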

DBSCAN: Clustering algorithm that finds clusters through density-based expansion of seed points.

R Package: dbscan

Dissimilarity Matrix: Density Reachability matrix

Density reachability: a point q is said to be directly density-reachable from another point p if the distance between them is below a specified threshold ε and p is surrounded by sufficiently many points. A point q is density-reachable from p if there is a chain of directly density-reachable points connecting them.
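
A minimal sketch with the dbscan package; eps and minPts below are illustrative and should be tuned for real data (e.g., via a k-nearest-neighbor distance plot):

```r
# DBSCAN sketch (package dbscan).
library(dbscan)

X <- as.matrix(iris[, 1:4])
fit <- dbscan(X, eps = 0.5, minPts = 5)

table(fit$cluster)   # cluster 0 holds noise points not density-reachable from any seed
```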

Categorical clustering algorithms:

As described earlier, data types are not always numerical. Listed below are some popular clustering algorithms used when we are dealing with categorical data types. Some of these algorithms can be used for more than one data type; for example, the Gower method is popular when dealing with datasets with mixed data types. In Part II of this study we will explore some of these algorithms in detail and contrast their performance relative to different objectives.

  1. ROCK (RObust Clustering using linKs): A clustering algorithm belonging to the class of agglomerative hierarchical clustering algorithms. It uses the concept of links to measure the similarity/proximity between a pair of data points.
  2. K-Mode: k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes.
  3. K-Median: It is a variation of k-means clustering where instead of calculating the mean one calculates the median to determine its centroid.
  4. k-Prototypes: This clustering algorithm combines k-means and k-modes and is used when we have both numerical and categorical data.
  5. Gower: This approach is also used for mixed data types and is based on the Gower distance. The idea behind Gower distance is that, for each variable type, a distance metric that works well for that type is used and scaled to fall between 0 and 1; then a linear combination using user-specified weights (most simply an average) is calculated to create the final distance matrix (see the sketch after this list).
  6. K-means after encoding the data: The standard k-means algorithm is not directly applicable to categorical data, for various reasons. So we convert the data by encoding and then apply k-means to it. With data encoding, a categorical feature becomes an array whose size is the number of possible choices for that feature.
  7. SOM (Self-Organizing Maps): An artificial neural network used for clustering. The goal of SOM is to find a set of centroids (reference or codebook vectors in SOM terminology) and to assign each object in the data set to the centroid that provides the best approximation of that object. There is one neuron associated with each centroid.
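
As a preview, here are minimal sketches of the k-modes and Gower approaches, assuming the klaR and cluster packages; the toy data and parameters are illustrative:

```r
# k-modes on purely categorical data (package klaR).
library(klaR)
library(cluster)

cats <- data.frame(sex     = c("F", "M", "F", "M"),
                   product = c("A", "B", "A", "B"))
km <- kmodes(cats, modes = 2)          # centroids are modes; distance counts mismatches
km$cluster

# Gower distance + PAM (k-medoids) on mixed numeric/categorical data (package cluster).
mixed <- data.frame(price = c(10, 200, 15, 180),
                    sex   = factor(c("F", "M", "F", "M")))
g  <- daisy(mixed, metric = "gower")   # per-variable distances scaled to [0, 1]
pm <- pam(g, k = 2)                    # cluster directly on the dissimilarity matrix
pm$clustering
```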

Clustering validation statistics

Clustering validation is one of the most challenging and vital tasks when doing cluster analysis. A robust cluster validation should include all three of the following facets (see Charrad et al. (2014), Brock et al. (2008), and Theodoridis and Koutroumbas (2008)):

  1. Internal cluster validation, which uses the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can also be used for estimating the number of clusters and choosing the appropriate clustering algorithm without any external data (a short example follows this list).
  2. External cluster validation, which consists in comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. It measures the extent to which cluster labels match externally supplied class labels. Since we know the number of clusters in advance (from the cardinality of the cluster label set), this approach is used for selecting the right clustering algorithm for a specific data set. Note that this is different from supervised learning, where the external labels are provided as mechanisms to train the algorithm. Instead, here the external labels are employed to validate the classification fit and robustness of the clustering algorithm.
  3. Relative cluster validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g., varying the number of clusters k). It is typically used to determine the optimal number of clusters.
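
As a minimal example of internal validation, the sketch below computes the average silhouette width of an illustrative k-means fit using the cluster package; the NbClust package from Charrad et al. (2014) offers a broader battery of indices:

```r
# Internal validation sketch: average silhouette width (package cluster).
library(cluster)

X <- as.matrix(iris[, 1:4])
d <- dist(X)

set.seed(42)
fit <- kmeans(X, centers = 3, nstart = 25)

sil <- silhouette(fit$cluster, d)   # per-point silhouette widths
mean(sil[, "sil_width"])            # closer to 1 means better-separated clusters
```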

References:

· Cluster evaluation: https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

· DBSCAN: https://en.wikipedia.org/wiki/DBSCAN

Further read:

· Numerical Clustering: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68

· Spectral Clustering: https://www.quora.com/What-is-an-intuitive-explanation-of-spectral-clustering-in-the-context-of-machine-learning

· Spectral Clustering : https://towardsdatascience.com/spectral-clustering-for-beginners-d08b7d25b4d8

· SOM clustering: http://lib.tkk.fi/Diss/2002/isbn951226093X/article4.pdf
