Unsupervised Learning: K-Means Clustering
Unsupervised learning is the training of an artificial intelligence (AI) algorithm on information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance.

In unsupervised learning, an AI system may group unsorted information according to similarities and differences even though there are no categories provided. AI systems capable of unsupervised learning are often associated with generative learning models, although they may also use a retrieval-based approach (which is most often associated with supervised learning). Chatbots, self-driving cars, facial recognition programs, expert systems, and robots are among the systems that may use either supervised or unsupervised learning approaches.
In this blog, we are going to learn an unsupervised learning technique that can be used to understand unlabeled data and to improve almost any kind of business. So let's get started:

K-means is a technique for finding clusters inside data according to similarity. It uses the Euclidean distance between data points to measure how far apart they are. Let's do it practically:
Import all the required libraries and use the read_csv function to make a DataFrame from the dataset in income.csv. This income.csv contains three columns: NAME, AGE, and INCOME.
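A minimal sketch of this step. Since the blog's income.csv isn't reproduced here, the sample rows below are made up; only the column names (NAME, AGE, INCOME) come from the original dataset.

```python
import pandas as pd
from io import StringIO

# Stand-in for income.csv: hypothetical sample values,
# with the blog's three columns NAME, AGE, INCOME.
csv_data = """NAME,AGE,INCOME
Rob,27,70000
Michael,29,90000
Mohan,29,61000
Ismail,28,60000
Kory,42,150000
David,39,155000
"""

df = pd.read_csv(StringIO(csv_data))
print(df.head())
```

In practice you would simply call `pd.read_csv("income.csv")` on the real file.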

Let's plot all this data in a graph using matplotlib so that we can see the behavior of the data.
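A quick sketch of the scatter plot, again using hypothetical stand-in rows for income.csv (AGE on the x-axis, INCOME on the y-axis):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for income.csv
df = pd.DataFrame({
    "NAME": ["Rob", "Michael", "Mohan", "Ismail", "Kory", "David"],
    "AGE": [27, 29, 29, 28, 42, 39],
    "INCOME": [70000, 90000, 61000, 60000, 150000, 155000],
})

# One dot per person: age against income
plt.scatter(df["AGE"], df["INCOME"])
plt.xlabel("AGE")
plt.ylabel("INCOME")
plt.show()
```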

As we can easily see, there appear to be 3 kinds of clusters in the data. But it is still not that easy to settle on the clusters by eye, because there could be more than 3 or fewer than 3 clusters. Let's see how. For that, first create the K-means classifier.

We made the KMeans classifier with n_clusters=3; this is just a guess based on the visualization. If we use .fit_predict(), it will tell us, according to n_clusters=3, which data point comes under which category. As we can see, it gives outputs like 2, 0, 1; these are the categories predicted by the KMeans algorithm. I simply merged the predictions into the DataFrame.
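A sketch of this step with scikit-learn, still on the made-up stand-in rows (the variable names `km` and `y_predicted` are mine, not from the original):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for income.csv
df = pd.DataFrame({
    "NAME": ["Rob", "Michael", "Mohan", "Ismail", "Kory", "David"],
    "AGE": [27, 29, 29, 28, 42, 39],
    "INCOME": [70000, 90000, 61000, 60000, 150000, 155000],
})

km = KMeans(n_clusters=3, n_init=10, random_state=0)

# fit_predict returns one cluster label (0, 1 or 2) per row
y_predicted = km.fit_predict(df[["AGE", "INCOME"]])

# merge the predictions into the DataFrame as a new column
df["cluster"] = y_predicted
print(df)
```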

Use cluster_centers_ to see the centers of the clusters predicted by our classifier, and use matplotlib to see the data points, the clusters, and their centers.
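A sketch of that visualization, assuming the same stand-in data and a fitted model; the colors and the star marker for centroids are my choices, not from the original:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical stand-in for income.csv
df = pd.DataFrame({
    "NAME": ["Rob", "Michael", "Mohan", "Ismail", "Kory", "David"],
    "AGE": [27, 29, 29, 28, 42, 39],
    "INCOME": [70000, 90000, 61000, 60000, 150000, 155000],
})

km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(df[["AGE", "INCOME"]])

# cluster_centers_ holds one (age, income) pair per cluster
print(km.cluster_centers_)

# Plot each cluster in its own color, then overlay the centers
for label, color in zip(range(3), ["green", "red", "black"]):
    part = df[df["cluster"] == label]
    plt.scatter(part["AGE"], part["INCOME"], color=color)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            color="purple", marker="*", s=200, label="centroid")
plt.xlabel("AGE")
plt.ylabel("INCOME")
plt.legend()
plt.show()
```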

It has predicted some of the cluster centers wrong. That is because the units used for income and age are totally different, and Euclidean distance works purely on the principle of distance, so these kinds of errors occur. To get rid of this, we have to bring all the data points onto the same scale. We will use the min-max scaler in this case to achieve that.
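A sketch of the scaling step with scikit-learn's MinMaxScaler, which maps each feature into the 0–1 range (again on the hypothetical stand-in rows):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for income.csv
df = pd.DataFrame({
    "NAME": ["Rob", "Michael", "Mohan", "Ismail", "Kory", "David"],
    "AGE": [27, 29, 29, 28, 42, 39],
    "INCOME": [70000, 90000, 61000, 60000, 150000, 155000],
})

# Rescale AGE and INCOME into 0-1 so neither feature
# dominates the Euclidean distance
scaler = MinMaxScaler()
df[["AGE", "INCOME"]] = scaler.fit_transform(df[["AGE", "INCOME"]])
print(df)
```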

So now you can see all the data has been converted to the same scale. Now, once again, make a classifier.

Create a classifier and get the predictions and their cluster centers.
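Putting the two steps together, a sketch of re-clustering on the scaled data (same stand-in rows; note the centers now come out in 0–1 units):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for income.csv
df = pd.DataFrame({
    "NAME": ["Rob", "Michael", "Mohan", "Ismail", "Kory", "David"],
    "AGE": [27, 29, 29, 28, 42, 39],
    "INCOME": [70000, 90000, 61000, 60000, 150000, 155000],
})

# Scale first, then cluster on the scaled features
df[["AGE", "INCOME"]] = MinMaxScaler().fit_transform(df[["AGE", "INCOME"]])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(df[["AGE", "INCOME"]])

print(df)
print(km.cluster_centers_)  # centers are now in scaled 0-1 units
```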

Now it has correctly identified the proper clusters. But the story isn't over yet, my friend.
How would you determine the number of clusters? We will iterate a loop from 1 to n, find the SSE (sum of squared distances) for each value of k, and plot a graph.
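The loop described above can be sketched as follows; scikit-learn exposes the SSE as the fitted model's `inertia_` attribute. The data is still the hypothetical stand-in, and the upper bound of 5 is my choice for this tiny sample:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for income.csv
df = pd.DataFrame({
    "NAME": ["Rob", "Michael", "Mohan", "Ismail", "Kory", "David"],
    "AGE": [27, 29, 29, 28, 42, 39],
    "INCOME": [70000, 90000, 61000, 60000, 150000, 155000],
})
df[["AGE", "INCOME"]] = MinMaxScaler().fit_transform(df[["AGE", "INCOME"]])

# inertia_ is the SSE: the sum of squared distances from
# each point to its closest cluster center
sse = []
k_range = range(1, 6)
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(df[["AGE", "INCOME"]])
    sse.append(km.inertia_)

plt.plot(list(k_range), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()
```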

We have to choose the elbow part of the graph, the point where the SSE stops dropping sharply as k increases.

Congratulations, we have found the number of clusters in the dataset using the SSE and the elbow method.
Hope you enjoyed. Have a great day!
