Vector Dot Based Similarity Mean Clustering

Deep Learning has clearly changed the world the way we dream.It clearly revolutionized the field of machine learning and increase its popularity among top tech Giants.After Deep Learning,a Supervised Learning Technique the researchers now concentrating on other Learning Mechanisms Unsupervised Learning ,Reinforcement Learning and Evolutionary Algorithms. Clustering is a well significant Unsupervised Learning task.Where the algorithm have to predict the labels or find the clusters in the data.It has a wide range of applications in Market Segmentation and in very wide variety of fields.Researchers think that labeling or clustering techniques can lead to one of the major tasks of general intelligence which is “creativity”.I have recently started studying these techniques one of the basic and most beautiful clustering techniques is K-means during its study I simply got an idea to improve it.Lets start with K-means

K-Means Clustering

K Means Clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity.In k means clustering, we have the specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster. Then, the algorithm iterates through two steps:
-Reassign data points to the cluster whose centroid is closest.

-Calculate new centroid of each cluster.
These two steps are repeated till the within cluster variation cannot be reduced any further.
The within cluster variation is calculated as the sum of the euclidean distance between the data points and their respective cluster centroid
Formal Definition of K-Means:-
Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e. variance).
The Algorithm is described as below:-

The algorithm clearly uses the Euclidean Distance as a metric to find the nearest mean.If we clearly observe the underlying semantic significance of the Euclidean Metric what it is trying to do is it is using distance metric to find the similar once based on hypothesis that the more similar the feature are the more closely plotted in a n-dimensional space.Clearly one can see that Euclidean distance is computationally expensive task.What if we think of other metrics that helps us to obtain the similarity that this distance metric is trying to accomplish.Studying Netflix Recommendation they have used a Vector Dot product to suggest the Movies to there users.

A dot product between two vectors a and b is defined as follows

a.b= ‖a‖ ‖b‖ cosθ

The dot product is maximum if both the vectors are identical or similar

i.e squ(‖a‖)

It clearly says that the more closer the Data points in a n-dimensional space the maximum Dot product score they obtain.

Building on this idea can we replace the Euclidean Distance Metric with Vector Dot Product??

Just like K-means random initialization of cluster take place

1)Now we perform Dot product of all the data points and assign to the cluster which has maximum Dot product score with.The Replaced equation might look like this

### c(i)=arg max{X.μ}

2)Calculate new centroid of each cluster.

The following Algorithm is tested in a random data that Follows a Gaussian Distribution

If this Vector Dot product similarity works then we can clearly reduce the computation time of these algorithms making them stronger and efficient.This is simply a taught experiment and I dont know whether somebody has got this idea or Someone already pulished and this is my first blog post.

Like what you read? Give Guru Charan a round of applause.

From a quick cheer to a standing ovation, clap to show how much you enjoyed this story.