Learn Data Mining by Applying it on Excel(Part 2)

Albert Assaad
Analytics Vidhya
Published in
6 min readFeb 1, 2021

Welcome to part 2 of this series, after we applied in our previous post(here) the Linear Regression method, it’s time now for a new method called K-Means Clustering.

K-Means Clustering:

K-Means is a classification or grouping method, imagine that you own a retail or a wholesale company with many products and this company has lots of customers(more than 1000)and you decide to group the customers into 4 groups depending on their purchase of the company’s major 4 or 5 products, maybe to give each group a different discount amount. This is where the K-means clustering method is used, K is the number group you want to cluster your data in our example is 4.

Each row contains the name of the customer and values for each product, this row is called a record since it’s composed of several numerical values it may be also called a data point.

In k-means clustering, each cluster(group) is described by the centroid (or mean) of the data points of the cluster.Suppose, for example, that a cluster has three data points expressed as three vectors (6,7,8,9,10), (7,8,9,10,11) and (10,9,8,7,6), the centroid of this cluster is ((6+7+10)/3),((7+8+9)/3),((8+9+8)/3),((9+10+7)/3),((10+11+6)/3) which is (7.7, 8 , 8.3 , 8.7). Please note that all the values in this method must be numerals.

Let’s start our example by downloading the excel sheet (here) which contains a list of 39 customers that we want to divide into 4 clusters. We may need to try several different k values to see which one that better suits our function.

source:https://aihub.cloud.google.com/u/0/p/products%2F0e0d2ed0-5563-4639-b348-53a83ac4ff4e

After opening the Excel sheet you will get:

1- Column A2 which will have customer names

2- Columns B2:F2 the yearly sales per 5 different products

3- Column H we the name of the 5 products

4- Columns I2:L2 we inserted C1, C2, C3, C4 in capital since we need 4 clusters(the value of k) under each column we inserted a random number within the range of sales for each product.

5- Column R2 will have the name of the cluster that is nearest which will decide to which group it belongs.

6- From I3 to L7 we inserted k data points as the initial centroids. we can insert them randomly or pick them based on the data we have in B3:F41

7- Column N2:Q2 we inserted the clusters name C1, C2, C3, C4 in capital, and under each column we will calculate the distance between the data points and the centroid. To do that we use the euclidean function, so if we have 2 data points(x1,x2,x3,x4) and (y1,y2,y3,y4) their Euclidean distance is calculated as:

to apply the above, we inserted the following function that will store the distance between the point Customer1(100000,30000,90000,8100000,4050000) and the centroid in column I3:I7 C1(5000,30000,8000,6000000,190000), in N1 write the following function: =SQRT(($B3-I$3)²+($C3-I$4)²+($D3-I$5)²+($E3-I$6)²+($F3-I$7)²) this will compute the difference between Customer1 and Cluster1 “C1”

Now Continue by dragging and autofill from N3 till Q3 then select cells N3:Q3 and autofill till N41:Q41. Now we have all the distances between the Customer dataset and all centroids.

8- In column R we need to insert the cluster to which point the Dataset is close we can achieve that by entering this function in cell R3: =INDEX($N$2:$Q$2,1,MATCH(MIN(N3:Q3),N3:Q3,0)) this function will get the text from row N2:Q2 depends on the minimum value from N3 to Q3.

9-Drag and autofill from cell R3 till R41 now we will get each dataset is closest to which point thus to which cluster it’s belonging.

10- Now we need to start over in order to continue the clustering process we need to copy the current sheet and rename it to k2 we do that we right-click on the sheet name “k1” at the bottom “move or copy” then tick on “create copy” rename the newly copied sheet to “k2”

11- In k2 we need a formula in cell I3 to recalculate the mean of attribute Product 1 Yearly sales for all data points inside cluster 1, we can achieve that by the following formula =AVERAGEIFS(‘k1’!$B$3:$B$41,’k1’!$R$3:$R$41,I$2) that will calculate the average value if the datapoint belongs to C1 in (k1!R3:R41) but since we will repeat this process several times by creating more sheets it’s better to use the Indirect, ADDRESS functions, and operator &. so we just change the value of one cell better then writing the function and selecting the values each time we create a worksheet, we can achieve that by:

  • In k2 sheet cell G1 insert “k1” in lowercase
  • In G3 to G7 fill 2,3,4,5,6 respectively those will be the column reference for the values in sheet k1 since column B is column 2 , column C is column 3 …these numbers will be used as a reference for columns B, C, D, E, F
  • In I3 replace the formula with this one: =AVERAGEIFS(INDIRECT($G$1&”!”&ADDRESS(3,$G3,1)):INDIRECT($G$1 & “!” &ADDRESS(41,$G3,1)),INDIRECT($G$1 &”!$R$3" ):INDIRECT($G$1&”!$R$41"),I$2), as we can realize that the INDIRECT function will get the string next to it in our case we took the value of g1 which is “k1” and we concatenate it with ! and Address(3,g3,1) which will return K1!B3, now later when we work on sheet k3 we just need to change the value in G1 to k2 as easy as that.

12- Autofill the cells from I3 to L7 new centroids are defined

13- The difference between the data points and newly calculated centroids are automatically calculated from cells N3 till Q41 and the shortest Cluster is being automatically inserted in cells R3:R41

14- In Cell S1 enter number 1 and autofill till cell S41, in T2 enter the text “Old Cluster” and in U2 enter the text “Difference”

15- In Cell T3 enter the following formula: =INDIRECT($G$1&”!R”&S3) that will get the value from sheet k1!R3 cell and autofill till cell S41

16- In cell U3 enter this formula: =IF(T3=R3,0,1) this will check if the datapoint still belongs to the same cluster(group) as before or not autofill till cell U41.

17-we need to repeat from point 10 till 16 until we have all the values in cells U3:U41 are all 0, don’t forget to replace in the newly copied sheets the value in G1 with the name of the previous sheet.

In this example, I created 8 worksheets in the order I reached the result, you can find the final excel sheet (here), as you can see below the difference for all datapoints are 0 which means that it didn’t change its group and didn’t move to another one. Now you can filter the values in column R and see which Customers belong to C1, C2, C3, and C4.

The below link is an example of K-means iterations as we can see the data points change their cluster until they stick and reach their final one.

That’s all for the K-Mean function, I tried to simplify it the best, If you have any question please mention them in the comment.

<==Learn Data Mining by Applying it on Excel(Part 1)

--

--

Albert Assaad
Analytics Vidhya

Multi-platform developer (Android,IOS,Dynamics Nav,C#.net,Vb.net,Python) interested in everything technology.