Cluster Analysis on Adult Dataset

Ceyda Akbulut
Published in Geek Culture · 12 min read · Jan 23, 2022


Thanks to Hyosang Kim (FU Berlin) and Raniyaharini Rajendran (FU Berlin) for this teamwork!

Code: https://colab.research.google.com/drive/1EdAXpKzQ8GzWKgPQ1EHay80P0AHTms-t#scrollTo=rujaEpoG0T2h

We will perform cluster analysis on the dataset used for the previous article on Medium. Let's recall the general information about this set.

Dataset characteristics: Multivariate

Attribute characteristics: Categorical, Integer

Number of Instances (Total): 48842

Number of Instances (Train): 32561

Number of Instances (Test): 16281

Number of Attributes: 14

Data Preprocessing and Transformation

We applied the same steps as in the last article for data integration, data cleaning, and identifying or removing outliers.

Feature Selection

First, we analyse the type of each attribute. The age field is treated as a numerical value. The education and education-num fields carry the same meaning in different forms, so we treat them as ordinal-categorical values. For the remaining columns, see the figure below.

Since our dataset consists of features of different types and our method is unsupervised learning, we have to design the approach we will follow. In general, there are different methods for feature selection. If the dataset includes only numerical values, PCA can be useful; if the dataset has only categorical values, CA or MCA are ways to analyse the correspondence between attributes. In our situation, factor analysis of mixed data (FAMD) could be a good fit, but there is very limited information about the FAMD method (the prince library). Therefore, we used the PCA and MCA methods separately after splitting our dataset into numerical and categorical columns. After PCA and MCA we ended up with different sets of components (3 components for PCA, 5 for MCA). As we know, after feature selection the dataset should still consist of comparable features. So we decided to change our approach to selecting effective features.

The main idea is to detect the correlations between attributes. Our new methods are a correlation matrix for the numerical columns and the chi-square test for the categorical columns.

The chi-square test of independence is used to determine whether there is a significant relationship between two nominal (categorical) variables. Therefore, we will only consider nominal and binary values for this method. Education and education-num have the same meaning in different forms, so we will consider only one of them from now on.

categorical_main_data = main_data[['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']]

numerical_main_data = main_data[['fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']]

numerical_main_data['age'] = pd.to_numeric(main_data['age'])

Correlation for numerical values

The correlation coefficient ranges between -1 and +1. The meaning of this range is:

  • -1: perfect negative linear correlation
  • +1: perfect positive linear correlation
  • 0: no correlation

Correlation can be derived using following formula:

Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))

Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features.
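As a minimal sketch, the matrices shown below can be reproduced roughly as follows, assuming numerical_main_data is the DataFrame built above (the seaborn heatmap is our choice for illustration, not necessarily the exact code of the notebook):

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation matrix of the numerical columns
corr_matrix = numerical_main_data.corr()

# Heatmap with the coefficients annotated, similar to the figures below
sns.heatmap(corr_matrix, annot=True, fmt=".6f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of numerical attributes")
plt.show()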

Figure: Correlation matrix with colors
Figure: Correlation matrix with colors and numerical values

The correlation coefficients along the diagonal of the table are all equal to 1 because each variable is perfectly correlated with itself, and the correlation matrix is perfectly symmetrical. In general, correlation coefficient values below 0.3 are considered weak, 0.3–0.7 moderate, and above 0.7 strong. As we can see, the maximum off-diagonal value is 0.119828 and the minimum is -0.004473. These correlations are weak, so we cannot say that any of the numerical attributes are correlated.

Chi-square test for categorical values

The chi-square independence test is a procedure for testing whether two categorical variables are related in some population.

The chi-square test statistic is calculated as Chi2 = SUM( (Observed - Expected)^2 / Expected ), where the expected frequencies are those under the assumption of independence.

Null hypothesis: Selected attributes are independent.

Alternative hypothesis: Selected attributes are dependent.

If p > 0.05, we fail to reject the null hypothesis (the attributes are treated as independent); if p ≤ 0.05, we reject it in favour of the alternative.

workclass-marital-status

We start the analysis with two categorical columns: workclass and marital-status.
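A minimal sketch of this test with scipy, assuming categorical_main_data is the DataFrame defined earlier (the notebook implementation may differ slightly):

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of the two categorical columns
contingency = pd.crosstab(categorical_main_data['workclass'],
                          categorical_main_data['marital-status'])

# Chi-square test of independence
chi2, p_value, dof, expected = chi2_contingency(contingency)
print('Chi2:', chi2)
print('The p-value of the test:', p_value)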

Chi2:

1765.0402635587825

The p-value of the test:

0.0

Result: 0 < 0.05, so we reject the null hypothesis; workclass and marital-status are dependent.


workclass-occupation

Chi2:

12315.810397342453

The p-value of the test:

0.0

Result: 0 < 0.05, so we reject the null hypothesis; workclass and occupation are dependent.


We collect all chi-square test results (p-values) in a list as below, with p < 0.05 cases highlighted in red. We see that workclass-marital-status, workclass-occupation, workclass-relationship, marital-status-occupation, marital-status-relationship, occupation-native-country, occupation-sex, and relationship-sex are dependent. In these situations, we have to remove one attribute of each pair.

After checking p values, we decided to remove occupation and relationship from our dataset.

For the ordinal data, we continue with the education column. The final version of our dataset:

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

sex: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad Tobago, Peru, Hong, Holland-Netherlands.

Dimension Reduction

FAMD

Before performing clustering, we need to transform the dataset into a format the algorithms can work with. Because the dataset consists of both numerical and categorical features, factor analysis of mixed data (FAMD) is useful for integrating the two types of features so that they can be used for clustering. Unlike one-hot encoding (OHE), FAMD gives balanced weights to the variables when computing the values of the projected inertia. As we have already done the feature selection above, the point of this step is purely to process the dataset.

Parameter Tuning of FAMD

n_components: the number of components

To choose the number of principal components used for FAMD, we fit the model with different parameters and observe the explained inertia as a function of the number of components. As the graph above shows, a very high number of components is needed for 90% of the variance to be explained.

n_iter: the number of iterations used for computing SVD

We kept the default n_iter, since n_components is already high and increasing the number of iterations would worsen the running time.
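A minimal sketch of this step with the prince library, where selected_data stands for our reduced dataset of mixed numerical and categorical columns and the value of n_components is only a placeholder (the attribute names for inspecting the explained inertia vary between prince versions):

import prince

# FAMD on the mixed-type dataset; a large n_components was needed to
# explain around 90% of the inertia, so 30 here is only a placeholder
famd = prince.FAMD(n_components=30, n_iter=3, random_state=42)
famd = famd.fit(selected_data)

# Coordinates of each row in the reduced space, used later for clustering
famd_components = famd.row_coordinates(selected_data)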

Clustering

Choice of algorithms

K-means Clustering:

The k-means clustering method is an unsupervised ML technique used to identify clusters of data objects for a dataset.

Parameter Tuning

The hyper-parameters for k-means clustering as given by sklearn are: n_clusters, init, n_init, max_iter, tol, verbose, random_state, copy_x, and algorithm.

  • n_clusters — 3, the optimal number of clusters to be formed
  • init — k-means++ selects initial cluster centres for k-means clustering to speed up convergence
  • n_init — The number of times the algorithm will run with different centroid seeds (default=10). The final result will be the best output of the n_init consecutive runs in terms of inertia.
  • max_iter — Maximum number of iterations in a single run (default=300)
  • tol — Relative tolerance with regard to the change in cluster centres between two consecutive iterations used to declare convergence (default=1e-4)
  • verbose — default=0
  • random_state — Determines random number generation for centroid initialization (here it is declared as None)
  • copy_x — Relevant when centering the data (here it is True). If copy_x is True, then the original data is not modified. If False, the original data is modified and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.
  • algorithm — default="auto" (kept for backward compatibility). The classical EM-style algorithm corresponds to "full". The "elkan" variation is more efficient on data with well-defined clusters by using the triangle inequality, but it is more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

For now “auto” (kept for backward compatibility) chooses “elkan” but it might change in the future for a better heuristic.
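Putting these parameters together, a minimal sketch of the K-means step looks as follows, assuming famd_components holds the FAMD coordinates from the previous section (the variable name is ours):

from sklearn.cluster import KMeans

# K-means with the parameter values discussed above
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10,
                max_iter=300, tol=1e-4, random_state=None)
kmeans_labels = kmeans.fit_predict(famd_components)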

Evaluation measures

Here, we have used two groups of measures to evaluate the model performance:

  • Intrinsic measures — Silhouette Score, Calinski Harabasz Score

To measure a cluster’s fitness within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set.

The Calinski-Harabasz score is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion; higher values indicate better-defined clusters.

  • Extrinsic measures — Homogeneity Score, V measure Score

A clustering result satisfies homogeneity if all of its clusters contain only data points that are members of a single class. It is bounded between 0 and 1, with low values indicating low homogeneity.

V measure score can be used to evaluate the agreement of two independent assignments on the same dataset.
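A minimal sketch of computing these four scores with scikit-learn: the intrinsic scores only need the data and the predicted labels, while the extrinsic scores need a reference labelling, for which we assume here the income column of the Adult dataset (income_labels is our placeholder name):

from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             homogeneity_score, v_measure_score)

# Intrinsic measures: data plus predicted cluster labels
print('Silhouette:', silhouette_score(famd_components, kmeans_labels))
print('Calinski-Harabasz:', calinski_harabasz_score(famd_components, kmeans_labels))

# Extrinsic measures: compare the clusters against a reference labelling
print('Homogeneity:', homogeneity_score(income_labels, kmeans_labels))
print('V-measure:', v_measure_score(income_labels, kmeans_labels))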

DBSCAN:

Why does it suit our dataset?

  1. (Partitioning Clustering) K-means, K-medoids: since K-means is already used above, it is more informative to add a different type of clustering.
  2. (Hierarchical Clustering) Agglomerative, DIANA: inefficient for large datasets.
  3. (Density Clustering) DBSCAN:
  • Pros: resistant to noise and able to handle clusters of different shapes and sizes.
  • Cons: it fails to identify clusters of varying densities, and it has problems with high-dimensional data due to the curse of dimensionality. Our dataset, however, is not high-dimensional, since the number of observations is much bigger than the number of features.


Parameter Tuning of DBSCAN

min_samples: minimum number of instances

This is the minimum number of instances required to form a cluster. Following the common heuristic for tuning this parameter, it should be equal to or greater than the dimensionality of the data. As a first attempt, we set min_samples = 2 * n_components.

eps: the maximum distance between two points for them to be considered neighbours

This is determined by running the K-nearest-neighbours algorithm: we calculate the average distance between each point and its K nearest neighbours, where K is the min_samples defined above, and then plot these distances in sorted order.
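A minimal sketch of this k-distance computation, again assuming famd_components holds the FAMD coordinates:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 260  # first attempt: min_samples = 2 * n_components

# Distance from every point to its k nearest neighbours
nn = NearestNeighbors(n_neighbors=k).fit(famd_components)
distances, _ = nn.kneighbors(famd_components)

# Average k-NN distance per point, sorted, to look for the elbow
avg_distances = np.sort(distances.mean(axis=1))
plt.plot(avg_distances)
plt.ylabel('Average distance to %d nearest neighbours' % k)
plt.show()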

The graph shows the points sorted by their average distance to their 260 nearest neighbours. Until the distance reaches a value of around 8, the increase along the sorted points is gradual, so we set the epsilon value to 8.

However, the setting eps = 8, min_samples = 260 produces only one cluster plus outliers. Therefore, we adjusted the parameters by choosing a new min_samples, re-running the K-NN procedure, and picking the optimal epsilon value that yields at least 3 clusters: 2 main clusters and one for outliers.

Final parameters set for DBSCAN

eps: 6.8

min_samples: 130
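With these values, the clustering step itself is short in scikit-learn; the following sketch also counts the resulting clusters and noise points (famd_components is again our placeholder for the FAMD coordinates):

from sklearn.cluster import DBSCAN

# DBSCAN with the final parameter setting; label -1 marks the outliers
dbscan_labels = DBSCAN(eps=6.8, min_samples=130).fit_predict(famd_components)

n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = int((dbscan_labels == -1).sum())
print(n_clusters, 'clusters,', n_noise, 'noise points')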

Evaluation measures

DBSCAN has its own way of measuring clustering performance, called DBCV. It is a better-suited evaluation measure because it can handle noise and non-globular clusters.

However, DBCV is only available as a code implementation on GitHub, and running it on our data did not finish.

Therefore, we used the same evaluation measures as for K-means clustering, although they are likely to be biased towards higher values here.

Bias and Fairness in Clustering

Bias in a trained machine-learning model can originate from the data itself, which reflects existing prejudices. This historical unfairness is what we try to measure in our trained models. A model is fair when its errors are distributed similarly across protected groups. In our task, the protected attribute is 'sex', which refers to the different physiological characteristics of males and females.

Fairness in clustering differs from fairness in classification: it focuses on the distribution of instances within a cluster, which should remain balanced even when the instances are split by the protected attribute. It is easier to understand by referring to this picture: http://aitime-lundao.oss-cn-beijing.aliyuncs.com/AitimeReport/20210826/1629963170842

Therefore, we are going to look at the distribution of instances within each cluster when they are split by the protected attribute. The instances with the same cluster label should be similarly distributed whether or not we condition on the protected attribute.

Dataset = fair_data

Protected attribute = 'sex' {' Female', ' Male'}

Cluster labels = y_hat

Here, y_hat changes depending on the cluster model and the parameter settings.

Measures of Fairness — Statistical Parity Difference (SPD)

This measure considers the clustering model unbiased if the predicted cluster label is independent of the protected attribute S, so that

P(y_hat | S) == P(y_hat)

This can be measured by checking the difference between the probability of the prediction y_hat conditioned on S = ' Male' and on S = ' Female'. The equation for the measurement is:

SPD = P(y_hat = 1 | S = ' Male') - P(y_hat = 1 | S = ' Female')

Ideally, subjects in both the protected and unprotected groups have an equal probability of being assigned to a given cluster.

Fair range

To evaluate the fairness from each clustering model, we refer to the following paper for the fair range (https://arxiv.org/pdf/2110.13029.pdf).

If the measures fall into the range between -0.1 and 0.1, we can say that the model does not contain a bias towards the protected attribute S in our dataset. In other words, if the measures are below -0.1 or above 0.1, we can say that the model is biased with regard to the attribute S.
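A minimal sketch of how the SPD per cluster label can be computed and checked against this fair range, assuming fair_data contains the raw 'sex' values of the Adult dataset (which carry a leading space) and a column of cluster assignments such as 'kmeans_y_0', as described below:

def statistical_parity_difference(df, label_col, cluster):
    # P(y_hat = cluster | Male) - P(y_hat = cluster | Female)
    male = df[df['sex'] == ' Male']
    female = df[df['sex'] == ' Female']
    return (male[label_col] == cluster).mean() - (female[label_col] == cluster).mean()

# SPD for every cluster label produced by one K-means setting
for cluster in sorted(fair_data['kmeans_y_0'].unique()):
    spd = statistical_parity_difference(fair_data, 'kmeans_y_0', cluster)
    print(cluster, round(spd, 3), 'fair' if -0.1 <= spd <= 0.1 else 'biased')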

SPD from K-Means

Dataset: fair_data

Protected attribute: ‘sex’ column from dataset

Y_hat: ‘kmeans_y_0’, ‘kmeans_y_1’ columns from dataset

Result tables

n_clusters = 3
n_clusters = 5

SPD from DBSCAN

Dataset: fair_data

Protected attribute: ‘sex’ column from dataset

Y_hat: ‘dbscan_y_0’, ‘dbscan_y_1’ columns from dataset

Result tables

eps = 6.8, min_samples = 130
eps = 6.5, min_samples = 140

The results show that, as we adjust the parameters so that the number of clusters grows, the overall SPD becomes smaller. However, this does not prove that a model with more clusters is less biased: with more clusters the instances are spread more thinly, so the distributional differences between groups naturally shrink as well.

Also, comparing the SPD tables between two different clustering methods, there was no meaningful difference.

The four tables show that our models cannot guarantee perfectly fair clustering, since up to two labels per SPD table fall outside the fair range (-0.1, 0.1).

Reference

K-means clustering

https://realpython.com/k-means-clustering-python/#how-to-perform-k-means-clustering-in-python

https://towardsdatascience.com/clustering-with-k-means-1e07a8bfb7ca

Understanding FAMD

https://www.datacamp.com/community/tutorials/introduction-factor-analysis

https://www.wikiwand.com/en/Factor_analysis

http://data-mining-tutorials.blogspot.com/2013/03/factor-analysis-for-mixed-data.html

Parameter tuning

https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd

Fair clustering

https://arxiv.org/pdf/2110.13029.pdf

http://aitime-lundao.oss-cn-beijing.aliyuncs.com/AitimeReport/20210826/1629963170842
