Data Mining → Clustering

Diwakar Dhungana
Nov 3 · 4 min read

Clustering is the grouping of a particular set of objects or entities based on their characteristics, aggregating them according to their similarities.

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined; instead, the grouping is accomplished by finding similarities between the data according to characteristics found in the actual data. The resulting groups are called clusters.

Given a database D = {t1, t2, …, tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
Here k is the number of clusters.
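To make the definition concrete, here is a tiny sketch of such a mapping f for a toy database with k = 2, using Euclidean distance as dis (the points and the two representative points are made up for illustration):

```python
import math

# A toy database D of five objects and a fixed k = 2 (values are made up).
D = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0), (9.0, 9.0), (8.0, 9.5)]
reps = [(0.5, 0.3), (8.5, 9.2)]  # one representative point per cluster

def dis(ti, tj):
    """Euclidean distance between two objects."""
    return math.dist(ti, tj)

# The mapping f: D -> {1, ..., k}: each object is assigned to the
# cluster whose representative point is nearest.
f = {ti: 1 + min(range(len(reps)), key=lambda j: dis(ti, reps[j])) for ti in D}
```

Here f sends the three points near the origin to cluster 1 and the two points near (9, 9) to cluster 2.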

Data Clustering

A cluster is a collection of data objects in which the objects are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Cluster analysis is the process of finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters.

Clustering is unsupervised classification: there are no predefined classes.

How are the data clustered?
There are various algorithms used to cluster data. Some of them are:
1. K-Means Clustering
2. Mean-Shift Clustering
3. Density-Based Spatial Clustering (DBSCAN)

  1. K-Means Clustering:
    Each cluster is represented by the center (mean) of the cluster.
    Algorithm:
    1. Choose k, the number of clusters to be determined
    2. Choose k objects randomly as the initial cluster centers
    3. Repeat
    3.1 Assign each object to its closest cluster center
    3.1.1 Using Euclidean distance
    3.2 Compute new cluster centers
    3.2.1 Calculate the mean points
    4. Until
    4.1 No cluster center changes, OR
    4.2 No object changes its cluster
Clustering using K-Means
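The steps above can be sketched in pure Python. The point set, k = 2, and the fixed random seed below are made up for illustration; real work would use a tuned library implementation such as scikit-learn's KMeans:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster 2-D points into k groups by iterative center refinement."""
    rng = random.Random(seed)
    # Step 2: choose k objects at random as the initial cluster centers.
    centers = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 3.1: assign each object to its closest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Step 3.2: recompute each center as the mean of its members
        # (keep the old center if a cluster went empty).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when no cluster center changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Two well-separated blobs; k-means recovers them as two clusters of 3.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7.5)]
centers, clusters = kmeans(points, k=2)
```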

Are there any weaknesses of K-Means Clustering?
1. Applicable only when the mean is defined.
2. Need to specify k, the number of clusters, in advance
2.1 Run the algorithm with different k values
3. Unable to handle noisy data and outliers
4. Works best when clusters are of approximately equal size

Hierarchical Vs Partitioning Clustering

Hierarchical Clustering:
A nested set of clusters is created; each level in the hierarchy has its own separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With this kind of clustering, the desired number of clusters is not an input.

Hierarchical Clustering are of two types:
1. Agglomerative Clustering
2. Divisive Clustering

1. Agglomerative Clustering
It starts with as many clusters as there are records, each cluster having only one record. Pairs of clusters are then successively merged until the number of clusters reduces to k. At each stage, the pair of clusters that are nearest to each other are merged. If the merging is continued to the end, it terminates in a hierarchy of clusters topped by a single cluster containing all the records.
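The merging loop can be sketched as follows. The article does not specify how "nearest pair of clusters" is measured, so single-linkage (the distance between the two closest members) is an assumed choice here, and the point set is made up:

```python
import math

def agglomerative(points, k):
    """Merge the nearest pair of clusters until only k clusters remain."""
    # Start with as many clusters as records: each point is its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters whose closest members are nearest
        # (single-linkage distance between clusters).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the nearest pair into one cluster.
        clusters[a].extend(clusters.pop(b))
    return clusters

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
print(agglomerative(points, k=2))
# → [[(0, 0), (0, 1), (1, 0)], [(5, 5), (5, 6)]]
```

This quadratic version only illustrates the merging idea; practical implementations (e.g. scipy.cluster.hierarchy) are far more efficient.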

Agglomerative vs Divisive Clustering

2. Divisive Clustering
This algorithm takes the opposite approach from the agglomerative methods. It starts with all the records in one cluster and then tries to split that cluster into smaller pieces.

Partitioning Clustering:
It is a method used to classify observations within a data set into multiple groups based on their similarity.

Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances between each object and its cluster center is minimized.
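A minimal sketch of computing this sum of squared distances for a given partition, where each term is the squared distance of an object from its cluster center (the toy partition below is made up):

```python
import math

def sse(clusters, centers):
    """Sum of squared distances of each object from its cluster center."""
    return sum(
        math.dist(p, c) ** 2
        for cl, c in zip(clusters, centers)
        for p in cl
    )

# A toy 2-cluster partition: each point is one unit from its center.
clusters = [[(0, 0), (0, 2)], [(5, 5), (7, 5)]]
centers = [(0, 1), (6, 5)]
print(sse(clusters, centers))  # → 4.0
```

Partitioning methods such as K-Means search for the assignment that makes this quantity small.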

The K-Means example shown above is a method that uses partitioning clustering.

Outliers
Outliers are points with values much different from those of the remaining set of data. They may represent errors in the data, or they could be correct data values that are simply much different from the remaining data.

They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced into some cluster. This process may result in the creation of poor clusters, by combining two existing clusters and leaving the outlier in its own cluster.

Outlier
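One simple heuristic for spotting such points (an assumed illustration, not a method from the article): flag anything whose distance from the overall mean is much larger than the average such distance:

```python
import math

def flag_outliers(points, threshold=2.0):
    """Flag points whose distance from the overall mean is more than
    `threshold` times the average such distance (a simple heuristic)."""
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    dists = [math.dist(p, mean) for p in points]
    avg = sum(dists) / len(dists)
    return [p for p, d in zip(points, dists) if d > threshold * avg]

data = [(1, 1), (1, 2), (2, 1), (2, 2), (30, 30)]  # (30, 30) sits far away
print(flag_outliers(data))  # → [(30, 30)]
```

Density-based algorithms such as DBSCAN, mentioned above, handle this more robustly by labelling low-density points as noise instead of forcing them into a cluster.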

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
