AI & ML: How to Combine KMeans and DBSCAN. A Methodological Approach for Segmentation

Norberto CIOFFI
Analytics Vidhya
14 min read · Dec 29, 2020


- Or how attending a lecture by a Nobel laureate on exoplanets led me to think about a data approach for AI & ML clustering methods -

By Norberto Cioffi [1] [2] [3]

Context and Idea Proposal (or a kind of abstract)

A few weeks ago I attended a conference held by the Nobel Prize winner Didier Queloz[4] at the University of Geneva in Switzerland[5] (and we even had the chance to have a coffee together[6]). Along with Michel Mayor, he won the 2019 Nobel Prize in Physics for the 1995 discovery of the first exoplanet, 51 Pegasi b. More than 25 years later, we have recorded more than 4000 exoplanets[7].

In Astronomy, as in Science in general, the application of Artificial Intelligence & Machine Learning algorithms for classification (supervised and unsupervised techniques[8]) is widespread[9] [10]. More specifically, there is particular interest in Artificial Intelligence (AI) and Machine Learning (ML) techniques to cluster exoplanets[11].

Moreover, segmentation and clustering techniques are used within the corporate and industrial world, for example to group and segment potential clients (e.g., to tailor ad hoc marketing campaigns or to plan seasonal production[12]).

Some of the main AI & ML algorithms used for unsupervised clustering are KMeans and DBSCAN[13]. However, as with all unsupervised algorithms, there is a trade-off to accept when deciding which algorithm is optimal and how many clusters it is reasonable to define. Finally, a crucial part of setting up the analysis is driven by knowledge of the domain in which the study takes place.

Taking as a working example a dataset of more than 900 exoplanets, I propose a methodological approach to compare different clustering algorithms (and the choice of their optimal solutions via the Elbow[14] and Knee[15] Methods) and give some guidelines on how to select the most convenient data pre-treatment (log transformation[16], standardization, normalization[17]).

The idea proposed in this article is to choose the number of clusters by selecting the pre-treatment strategy that leads to the least discrepancy between the two algorithms’ outputs. In other words, it is a 3-step process:

1. Apply one selected data pre-treatment.

2. Run both algorithms (KMeans and DBSCAN) and collect each output for every choice made in step 1.

3. Having run all the pre-treatment options (step 1) and collected all the related clustering outputs (step 2), choose as the final optimum the case where the two algorithms’ outcomes are the most similar.
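To make the process concrete, here is a minimal, runnable sketch of this loop in Python on synthetic stand-in data. The candidate k range, the fixed EPS value, and the data itself are placeholder assumptions for illustration, not the settings used later in the article (where EPS is derived per dataset with the Knee Method):

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the exoplanet table (974 rows, 3 features)
X_raw = np.random.default_rng(0).lognormal(size=(974, 3))

# Step 1: candidate pre-treatments (three of the four, for brevity)
treatments = {
    "standardization": StandardScaler().fit_transform(X_raw),
    "normalization": MinMaxScaler().fit_transform(X_raw),
    "natural log": np.log(X_raw),
}

# Step 2: run both algorithms and collect the cluster counts
results = {}
for name, X in treatments.items():
    best_k = max(range(4, 9), key=lambda k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)))
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    n_dbscan = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is noise
    results[name] = (best_k, n_dbscan)

# Step 3: keep the pre-treatment where the two outputs agree most
best = min(results, key=lambda n: abs(results[n][0] - results[n][1]))
print(results, "->", best)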

The dataset description, Jupyter, and Python code

Data are publicly available through different online resources. As of this writing (October 2020), the number of identified exoplanets is 4368. Each planet is cataloged through several characteristics. For the analysis I decided to consider only 3 of them[18], all available for each record[19], for a total of 974 exoplanets, described as follows:

1. Planet Mass, from here on also used as the X axis

2. Star Mass, from here on also used as the Y axis

3. Orbital Period[20], from here on also used as the Z axis

The analysis was run using Python in a Jupyter notebook environment[21].

The Python code I wrote can be found here, via Google Drive[22].

The methodological approach: a schematic description

For those working on Machine Learning, and Artificial Intelligence in general, it is common practice to prepare the dataset before starting the analysis. The implemented steps are generally related to cleaning and harmonizing data formatting, handling blank data (e.g., imputing missing values with the average and/or removing records that are not fully described), and making data comparable by applying data shape transformations[23].

In this work I applied 4 data transformations (a code sketch follows the list):

1. Standardization

2. Normalization

3. Log Transformation

4. “Ad Hoc” Log Transformation[24]
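Here is a minimal sketch of the 4 transformations on a stand-in dataframe. The column names and the “ad hoc” scaling factors are the ones appearing in the article’s own code in section 5.1; the data itself is synthetic:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Synthetic stand-in; column names follow the code in section 5.1
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.lognormal(size=(974, 3)),
                  columns=["final_mass", "star_mass", "orbital_period"])

standardized = StandardScaler().fit_transform(df)  # 1. mean 0, std 1
normalized = MinMaxScaler().fit_transform(df)      # 2. rescaled to [0, 1]
logged = np.log(df)                                # 3. same natural log on all axes
ad_hoc = pd.DataFrame({                            # 4. per-axis scaled logs
    "final_mass": 5 * np.log(df.final_mass),
    "star_mass": 10 * np.log(df.star_mass),
    "orbital_period": 20000 * np.log(df.orbital_period),
})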

For each of the 4 choices I then ran the 2 Clustering algorithms (KMeans and DBSCAN) and obtained for each case[25]:

a. For KMeans: the optimal number of clusters, obtained by applying the Elbow Method and Silhouette Analysis

b. For DBSCAN: the optimal number of clusters, obtained by applying the Knee Method to set the optimal EPS value[26]

Out of a total of 8 clustering outputs (4x2), the optimal choice was made by considering which data transformation brought the closest results for KMeans and DBSCAN (concretely, as described further on, the “Ad Hoc” Log Transformation, which produced 5 clusters for both algorithms).

A final remark, important to the one writing these lines, is the following. Even if some theoretical criticisms of this approach can be raised, it is common practice in business to take a pragmatic, in some cases even “artisanal”, approach (in other words, one based on the experience of doing) instead of being driven by theory alone. Lastly, the ultimate judgment on the quality of the obtained outcome should always come from someone with knowledge of the domain in which the study has been run[27].

Coding Output and Graphs

(- conclusions about the methodology are at the end of these lines, in the dedicated paragraph -)

1 Initial Data Set

The choice to analyse the dataset in 3 dimensions is not at all random: it makes it possible to draw graphs for better comprehension. Here below you can see the 974 planets plotted on a 3D map (as dots).
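For reference, a 3D scatter plot of this kind can be drawn with matplotlib along these lines (the points below are synthetic stand-ins, not the real catalog):

import numpy as np
import matplotlib.pyplot as plt

pts = np.random.default_rng(2).lognormal(size=(974, 3))  # stand-in for the catalog

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], s=5)
ax.set_xlabel("Planet Mass")
ax.set_ylabel("Star Mass")
ax.set_zlabel("Orbital Period")
plt.show()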

The following part is structured as follows:

1. Choose the data transformation (4 in total)

2. Plot obtained data

3. Run KMeans Analysis:

a. Elbow Method and related graph: preliminary choice of the optimal number of clusters

b. Silhouette Analysis: results and plot

c. Definition of final clusters number

d. Plot of clustered Dataset

e. Print-out of some data statistics (mean, number of elements per cluster, etc.)

4. Run DBSCAN Analysis:

a. Knee Method Plot

b. Identified Optimal EPS rate

c. Plot of clustered Dataset

d. Identification of final clusters number

e. Print-out of some data statistics (mean, number of elements per cluster, etc.)

5. Comparison of Results

2.1 Standardization

Standardization implies rescaling the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. Below is the plotted data:

2.2. KMeans Analysis

2.2.1 Elbow Method to find the best number of clusters

Optimal number of clusters: 6 to 8
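A minimal sketch of the Elbow Method, assuming X holds the standardized data (replaced here by a synthetic stand-in):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(974, 3))  # stand-in for standardized data

ks = range(2, 12)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]
plt.plot(ks, inertias, marker="o")  # the "elbow" of this curve suggests k
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.show()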

2.2.2 Silhouette Analysis

For n_clusters = 6 the average silhouette_score is: 0.8705326724380881
For n_clusters = 7 the average silhouette_score is: 0.8730096408923297
For n_clusters = 8 the average silhouette_score is: 0.7982196885773016
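Print-outs of this kind can be produced along these lines (again with a synthetic stand-in for the standardized data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(4).normal(size=(974, 3))  # stand-in for standardized data

for n_clusters in (6, 7, 8):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    print(f"For n_clusters = {n_clusters} "
          f"the average silhouette_score is: {silhouette_score(X, labels)}")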

2.2.3 Optimal Clusters Number

Within the 6 to 8 range of the silhouette analysis, the final number of clusters taken is 7, the one with the best silhouette score (0.873).

2.2.4 Plot of clustered Dataset

2.2.5 KMeans Statistics

Average of Data per cluster:

Cluster Label   Planet Mass   Star Mass   Orbital Period
0               -0.217158     -0.058892    -0.061369
1                1.871795     -0.215893    26.603112
2                6.021872     -0.076752     0.069941
3               -0.199618     20.329225    -0.079844
4                2.355313     -0.042638     0.115091
5               -0.169162      7.995528    -0.079860
6                0.750149      0.041935     7.961525

Count of Elements per cluster:

Cluster Label   Elements
0               910
1               1
2               17
3               2
4               39
5               2
6               3

2.3. DBSCAN Analysis

2.3.1. Finding the Optimal EPS rate with knee method

2.3.2. Identified Optimal EPS rate

EPS Rate: 0.15358041639814596
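One common way to implement the Knee Method is sketched below, using the kneed package (an implementation of the Kneedle algorithm cited in [15]). Whether this exact package was used for the article is an assumption, as are the choice k = 5 and the stand-in data:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from kneed import KneeLocator  # pip install kneed

X = np.random.default_rng(5).normal(size=(974, 3))  # stand-in for standardized data

# Sorted k-distance curve (note: the nearest "neighbour" of each point is itself)
k = 5
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kth_distances = np.sort(distances[:, -1])

# The knee of this curve is a common heuristic for DBSCAN's EPS
knee = KneeLocator(range(len(kth_distances)), kth_distances,
                   curve="convex", direction="increasing")
print("EPS Rate:", kth_distances[knee.knee])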

2.3.3 Run DBSCAN Analysis with selected EPS
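A minimal sketch of this step, assuming min_samples = 5 (scikit-learn's default; the article does not state its value) and a synthetic stand-in dataset:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

data = pd.DataFrame(np.random.default_rng(6).normal(size=(974, 3)),
                    columns=["Planet Mass", "Star Mass", "Orbital Period"])

# EPS as found by the Knee Method above
data["Cluster Label"] = DBSCAN(eps=0.1536, min_samples=5).fit_predict(data)

# Per-cluster statistics as printed below (-1 is DBSCAN's noise label)
print("Average of Data per cluster:\n", data.groupby("Cluster Label").mean())
print("Count of Elements per cluster:\n", data.groupby("Cluster Label").size())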

2.3.4 Identification of final clusters number

The selected EPS identified only 1 Cluster [28] [29]

2.3.5 DBSCAN Statistics

Average of Data per cluster (-1 is the Noise Cluster):

Cluster Label   Planet Mass   Star Mass   Orbital Period
-1               2.510945      0.690743     0.859376
0               -0.224693     -0.061811    -0.076902

Count of Elements per cluster (-1 is the Noise Cluster):

Cluster Label   Elements
-1              80
0               894

3.1 Normalization

Normalizing data implies rescaling each feature to a fixed range, here 0 to 1 (min-max scaling).

3.2. KMeans Analysis

3.2.1 Elbow Method to find the best number of clusters

Optimal number of clusters: 6 to 8

3.2.2 Silhouette Analysis

For n_clusters = 6 the average silhouette_score is: 0.546915511486785
For n_clusters = 7 the average silhouette_score is: 0.520452001127997
For n_clusters = 8 the average silhouette_score is: 0.507950916276848

3.2.3 Optimal Clusters Number

Within the 6 to 8 range of the silhouette analysis, the final number of clusters taken is 6, the one with the best score (0.5469).

3.2.4 Plot of clustered Dataset

3.2.5 KMeans Statistics

Average of Data per cluster:

Cluster Label   Planet Mass   Star Mass   Orbital Period
0                0.435509      0.320256     0.823894
1                0.020016      0.054814     0.995913
2                0.970420      0.095111     0.154775
3                0.139757      0.282285     0.941569
4                0.738071      0.288113     0.580821
5                0.057810      0.868372     0.400254

Count of Elements per cluster:

Cluster Label   Elements
0               119
1               479
2               41
3               243
4               72
5               20

3.3 DBSCAN Analysis

3.3.1. Finding the Optimal EPS rate with knee method

3.3.2. Identified Optimal EPS rate

EPS Rate: 0.09639028435584383

3.3.3 Run DBSCAN Analysis with selected EPS

3.3.4 Identification of final clusters number

The selected EPS identified 2 Clusters

3.3.5 DBSCAN Statistics

Average of Data per cluster (-1 is the Noise Cluster):

Cluster Label   Planet Mass   Star Mass   Orbital Period
-1               0.490175      0.458527     0.464900
0                0.153068      0.166143     0.932434
1                0.992676      0.053397     0.070132

Count of Elements per cluster (-1 is the Noise Cluster):

Cluster Label   Elements
-1              55
0               893
1               26

4.1 Natural Log Transformation

As previously mentioned, one of the techniques used to make data comparable and bring about a more Gaussian distribution is to apply a log transformation to the dataset. Here I imposed the same transformation on all 3 dimensions. Find below the dataset plotted accordingly:

4.2. KMeans Analysis

4.2.1 Elbow Method to find the best number of clusters

Optimal number of clusters: 4 to 6

4.2.2 Silhouette Analysis

For n_clusters = 4 the average silhouette_score is: 0.4224896238902955
For n_clusters = 5 the average silhouette_score is: 0.3442082847922259
For n_clusters = 6 the average silhouette_score is: 0.3459242410026600

4.2.3 Optimal Clusters Number

Within the 4 to 6 range of the silhouette analysis, the final number of clusters taken is 4, the one with the best score (0.422).

4.2.4 Plot of clustered Dataset

4.2.5 KMeans Statistics

Average of Data per cluster:

Cluster Label   Planet Mass   Star Mass   Orbital Period
0               -4.142526     -0.591658     1.767688
1                0.317616      0.009636     1.207144
2               -2.179556     -0.127672     3.480561
3                2.236817     -0.294047     7.923827

Count of Elements per cluster:

Cluster Label   Elements
0               229
1               492
2               171
3               82

4.3. DBSCAN Analysis

4.3.1. Finding the Optimal EPS rate with knee method

4.3.2. Identified Optimal EPS rate

EPS Rate: 1.2810780029446476

4.3.3 Run DBSCAN Analysis with selected EPS

4.3.4 Identification of final clusters number

The selected EPS identified only 1 Cluster.

4.3.5 DBSCAN Statistics

Average of Data per cluster (-1 is the Noise Cluster):

Cluster Label   Planet Mass   Star Mass   Orbital Period
-1               0.088657     -0.910813     3.859141
0               -1.088861     -0.127528     2.188625

Count of Elements per cluster (-1 is the Noise Cluster):

Cluster Label   Elements
-1              67
0               907

5.1 “Ad Hoc” Log Transformation

As in section 4, I applied a log transformation to the 3 dimensions, but not the same one for all. Here below are the 3 plots, one per dimension, each showing the raw data (blue line) and a specific log transformation chosen to bring about a reasonably Gaussian distribution (orange line).

For clarity, in this case I report the Python code:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# df is the exoplanet dataframe loaded earlier in the notebook
plt.figure()
sns.distplot(df.final_mass)
sns.distplot(5 * np.log(df.final_mass))
plt.title("Planet Mass Distribution")

plt.figure()
sns.distplot(df.star_mass)
sns.distplot(10 * np.log(df.star_mass))
plt.title("Star Mass Distribution")

plt.figure()
sns.distplot(df.orbital_period)
sns.distplot(20000 * np.log(df.orbital_period))
plt.title("Orbital Period Distribution")
plt.show()

See below the graph of the complete transformed dataset:

5.2. KMeans Analysis

5.2.1 Elbow Method to find the best number of clusters

Optimal number of clusters: 5 to 7

5.2.2 Silhouette Analysis

For n_clusters = 5 the average silhouette_score is: 0.5883037969041494
For n_clusters = 6 the average silhouette_score is: 0.5881160930365811
For n_clusters = 7 the average silhouette_score is: 0.5553145596447594

5.2.3 Optimal Clusters Number

Within the 5 to 7 range of the silhouette analysis, the final number of clusters taken is 5, the one with the best score (0.588).

5.2.4 Plot of clustered Dataset

5.2.5 KMeans Statistics

Average of Data per cluster:

Cluster Label   Planet Mass   Star Mass   Orbital Period
0               -3.664519     -0.573782     23948.345346
1                8.680214     -5.099493    184058.496680
2              -11.329849     -2.822090     56338.626899
3                1.404428     -1.770407    114206.177884
4               -6.571787     -6.743543    -20009.716705

Count of Elements per cluster:

Cluster Label   Elements
0               524
1               46
2               283
3               77
4               44

5.3. DBSCAN Analysis

5.3.1. Finding the Optimal EPS rate with knee method

5.3.2. Identified Optimal EPS rate

EPS Rate: 3293.17249989343

5.3.3 Run DBSCAN Analysis with selected EPS

5.3.4 Identification of final clusters number

The selected EPS identified 5 Clusters.

5.3.5 DBSCAN Statistics

Average of Data per cluster (-1 is the Noise Cluster):

Cluster Label   Planet Mass   Star Mass   Orbital Period
-1               6.540385     -3.869472    110369.970896
0               -6.563493     -1.600403     34919.907776
1               -0.414042     -4.336668    109044.003308
2               -5.798799     -0.592534     96547.550417
3               11.809209     -0.191690    139739.798212
4                2.571507     -4.356497    160996.038502

Count of Elements per cluster (-1 is the Noise Cluster):

Cluster Label   Elements
-1              70
0               841
1               22
2               13
3               18
4               10

6. Comparison of Results

The table below lists the optimal number of clusters obtained in each iteration. It is clear that for lines 1 to 3 the discrepancy between the two algorithms is high. However, for line 4 the numbers of clusters are not so different (and, specifically for this dataset, both equal 5).

Data Transformation       KMeans   DBSCAN
1. Standardization        7        1
2. Normalization          6        2
3. Natural Log            4        1
4. “Ad Hoc” Log           5        5

It is also quite interesting how the comparison of the 2 clustering plots for case 4 leads, de facto, to the same qualitative conclusion:

CONCLUSIONS

Within unsupervised Machine Learning techniques, clustering and segmentation are widely used in many fields. One crucial aspect is identifying the correct number of clusters.

The proposed clustering methodology highlights how data pre-treatment can influence the outcome of a clustering exercise.

Here, 4 different data transformations were undertaken. For each of them, KMeans and DBSCAN were run, each tuned with widely used methods to optimize its output.

The final optimal result should be chosen by considering which transformation brings about the least discrepancy between KMeans and DBSCAN.
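As a possible refinement, not used in this article, the discrepancy could also be measured on the cluster assignments themselves rather than on the cluster counts, for example with the adjusted Rand index:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X = np.random.default_rng(7).normal(size=(974, 3))  # stand-in for a transformed set

kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
print("Agreement (ARI):", adjusted_rand_score(kmeans_labels, dbscan_labels))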

However, the final remark is that the interpretation of the chosen outcome must be analyzed and approved by an expert of the domain in which the study has taken place.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

[1] The opinions expressed in these lines are those of the author. They do not reflect the opinions or views of any other person mentioned.

[2] Images are under free sharing agreements and here available: https://pixabay.com/illustrations/artificial-intelligence-brain-think-4550606/ and https://time.com/wp-content/uploads/2015/11/exoplanets_by_jaysimons-d9dv6th-large.jpg

[3] Let me thank Malaika and Caroline for their precious support in correcting my, sometimes funny, written English. Any remaining mistake is entirely imputable to the author.

[4] https://en.wikipedia.org/wiki/Didier_Queloz

[5] https://youtu.be/Tacbw-p468Q?t=1259

[6] https://drive.google.com/file/d/1urexAR_ZL-QJ-RYWZarB8xvSX8gupovQ/view?usp=sharing

[7] http://exoplanet.eu/catalog/

[8] https://www.youtube.com/watch?v=cfj6yaYE86U

[9] https://astrostatistics.psu.edu/RLectures/clustering_classification_Jessi.pdf

[10] https://academic.oup.com/nar/article/46/14/e83/4990634

[11] Ing-Guey Jiang, Li-Chin Yeh, Wen-Liang Hung and Miin-Shen Yang, “Data analysis on the extrasolar planets using robust clustering”, link

[12] www.researchgate.net/publication/271302240_Customer_Segmentation_Using_Clustering_and_Data_Mining_Techniques

[13] Theoretical references, among others, can be found in this book: https://sebastianraschka.com/books.html

[14] https://www.youtube.com/watch?v=lbR5br5yvrY

[15] Ville Satopaa et al., “Finding a ‘Kneedle’ in a Haystack: Detecting Knee Points in System Behavior”, 2011

[16] https://www.youtube.com/watch?v=ZKaOfJIjMRg

[17] https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/

[18] Concerning unit of measurements and other details, references are here: http://exoplanet.eu/catalog/

[19] This is not the case for all the records mentioned in the CSV file, available at this link (top right, CSV option): http://exoplanet.eu/catalog/#

[20] For further details about the nomenclature used, other definitions can be found here: https://exoplanetarchive.ipac.caltech.edu/docs/API_exomultpars_columns.html

[21] https://jupyter.org/about

[22] https://drive.google.com/file/d/1UZXD1ZO61UJecqfJqIZl5H-A7vskgkTT/view?usp=sharing

[23] The reasoning behind it and related techniques can be found on the internet, for example here: https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff#:~:text=Standardization%20is%20useful%20when%20your,regression%2C%20and%20linear%20discriminant%20analysis.

[24] A better explanation will be given later in the article. The idea is, however, to define for each chosen dimension the most appropriate log transformation to obtain a “more” Gaussian distribution.

[25] The theoretical field of application of KMeans and/or DBSCAN is beyond the scope of these lines. It is the author’s choice to use them side by side. It is also common practice during an analysis to try different techniques and later interpret the results based on knowledge of the analysed domain. For a more comprehensive review of clustering algorithms, some references are here: https://scikit-learn.org/stable/modules/clustering.html

[26] Theoretical explanations of those techniques are beyond the scope of these lines; they are considered a prerequisite. The notes above have already quoted some related resources.

[27] A funny video about “strange” correlations and theoretical misinterpretations can be found here: https://www.youtube.com/watch?v=WNsLcg2GQMY

[28] While for KMeans the optimal number of clusters is a parametric choice (defined using the Elbow and Silhouette methods), for DBSCAN it is the result of the selected EPS (found with the Knee Method). Theoretical references can be found in the notes quoted above.

[29] If identified by the algorithm, DBSCAN labels a cluster as -1 (blue in the picture above), called Noise: the intuition is that those data points are noise in the dataset and should be removed from the analysis.
