How to remove bias from your dataset before applying a clustering algorithm?

William Blaufuks
fifty-five | Data Science
11 min read · Nov 2, 2021

Introduction

Clustering consists of dividing a population of individuals (or, more generally, data points) into a number of groups, in such a way that the individuals of one group are more similar to each other than to those in other groups. In simple words, the aim is to gather individuals with similar traits and assign them to the same cluster.

To evaluate whether two individuals are similar or very different, this unsupervised learning algorithm analyzes how close they are in a P-dimensional space, where P is the number of variables that define each individual. Hence, the choice of the variables used to describe the population is essential, because it intrinsically shapes the results the clustering algorithm will produce afterwards.
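Concretely, with the usual Euclidean distance (the one implicitly used by both k-means and PCA later in this article), two individuals x and y described by P variables are compared through

d(x, y) = \sqrt{\sum_{p=1}^{P} (x_p - y_p)^2}

so every variable added to the description contributes one more squared term to this sum.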

From a business point of view, we want to take into account as much information as possible. But from a clustering perspective, we need to make sure that all the information is equitably represented in the clustering space, in order to create an appropriate separation of the population. In this article, I will show you a method that reconciles both perspectives, by encoding a population described in a dataset without bias, before dividing it into clusters. This technique requires you to fully understand the logic behind the Principal Component Analysis (PCA) algorithm; if you are not familiar with it, I advise you to read this article before going any further.

I] The problem

Let’s say that I want to cluster some kindergarten children. These children (Figure 1) are described by their age (in months), their size (in cm), their weight (in kg), and the ages at which they crawled and walked for the first time (in months).

Figure 1: Children dataset.

If we used the “old” way to do clustering, we would have to work in two steps (sketched in code just after this list):

  • First, we scale all the variables (for instance using standardization, bringing the means to 0 and the standard deviations to 1). This way, the average “separation” produced by each variable over the data points becomes comparable. In other words, we give all the variables a similar inertial power (how much separation they can produce) in the clustering space.
  • Then, we apply the clustering algorithm directly over the dataset of scaled variables.
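As a minimal sketch of these two steps with scikit-learn (assuming the children data sits in a pandas DataFrame of numerical columns; the file name and the number of clusters are placeholders, not taken from the article):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical: load the children dataset (one row per child, numerical columns only)
df = pd.read_csv("children.csv")

# Step 1: standardize every variable (mean 0, standard deviation 1)
X_scaled = StandardScaler().fit_transform(df)

# Step 2: cluster directly on the scaled variables (k chosen arbitrarily here)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_scaled)
```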

We should get a clustering that looks appropriate, and we could consider this use case done. However, the result of this protocol heavily depends on the variables we chose to put in the initial dataset. So how can we be sure that these choices allowed us to describe this population objectively?

If we do a quick analysis of the variables through the correlation matrix (Figure 2), we can see that two groups of highly correlated variables emerge: first, the ages, sizes and weights of the children, and second, the ages of their first crawl and first walk.

Figure 2: Correlation matrix from the children dataset.
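As a quick sketch of this check (reusing the hypothetical df from the previous snippet):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlations between every pair of variables
corr = df.corr()

# Heatmap in the spirit of Figure 2
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```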

Now, we could simply look at these correlations, perhaps try to make some sense of them, and leave it at that. Or we could realize that the fact that some variables are correlated means that those variables carry the same information. In this case, the age, size and weight all relate to how “grown” the children are, while the first crawl and first walk relate to how “precocious” they are. In other words, there are only 2 independent insights in this dataset, and they are displayed in a biased way, through an unequal number of variables (here, three for the growth and two for the precocity).

The problem with the “old” protocol is that the scaling was done on the variables, not on the independent insights. As a result, those insights are not necessarily scaled equitably: some may have more inertial power than others, and hence more weight in the making of the clusters. This is why the old protocol cannot be satisfactory.

To illustrate this idea, let’s compare the separation produced by a single piece of information, depending on how many times that information is represented as a variable. In Figure 3, we visualize a population described by the age once (in red) or twice (in orange). While the information displayed in both plots is fundamentally the same, additional inertia is given to the orange points: for two given individuals, their distance is larger through the orange points than through the red ones (e.g., d2 > d1).

Figure 3: Population described by their (scaled) ages, first through one dimension (red), and then through two dimensions (orange).
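To make this concrete: if two children i and j were described only by their standardized age a, their distance would be d_1 = |a_i - a_j|. Duplicating that same column adds a second identical coordinate, so the distance becomes

d_2 = \sqrt{(a_i - a_j)^2 + (a_i - a_j)^2} = \sqrt{2}\,|a_i - a_j| = \sqrt{2}\,d_1

The information is unchanged, but every pairwise separation produced by the age is inflated by a factor of √2 (and by √m if the same information is represented m times).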

To top it all off, in practice the situation is even more delicate. In a real business case, you will usually be given access to a messy source of raw data, and after a bit of cleaning you will have to aggregate all this blurry information into insightful variables.

For instance, let’s consider online navigation on a website. From the activity of a user (i.e., all the clicks and choices made on each page of the website, at each timestamp), you can create a considerable number of relevant variables: time spent on different categories of articles, frequency of the clicks, share of the web activity depending on the hour of the day, the day of the week, the month of the year… The list goes on, and you can easily obtain more than 100 different variables that could all give some insightful knowledge about the population analyzed.

In the end, the distribution of the information through those variables can be extremely uneven, and applying the old protocol could lead to very biased results.

This puts us in a difficult position:

  • On the one hand, we don’t want to delete variables: even if they are partially correlated with one another, they all bring insights that we don’t want to throw away. Deleting some of them would also raise the question of how to choose the ones we keep and the ones we don’t, which is another subject…
  • On the other hand, while doing the clustering, we don’t want to give more separational power to some information just because it is more represented in the set of variables that we chose in a biased manner.

II] The solution

Since the problem is that the independent pieces of fundamental knowledge are spread, in an unbalanced way, across the variables of our dataset, the idea behind the solution is to use those independent pieces of knowledge directly to describe the population. This is where PCA becomes very useful.

We’ll consider here a numerical dataset, but we could also work with numerical and/or categorical variables using Factorial Analysis of Mixed Data (FAMD, see remark 1. at the end). The algorithm to implement follows three simple steps:

  1. First, we standardize all the variables of the original dataset. This gives them comparable scales, since there is no reason to initially give more inertial power to some variables than to others.
  2. Second, we apply PCA over the set of scaled variables. PCA iteratively looks for the orthogonal axes of maximal inertia in the Euclidean space; these axes can be interpreted as the independent insights of the dataset, ordered from the most represented in the initial variables to the least represented. As a result, we get a set of new variables: the principal components. By construction, the first components have more inertial power than the last ones.
  3. Finally, now that the correlated pieces of knowledge hidden in the dataset have been regrouped, we want to give all of them the same importance. Whether a component’s inertial power is initially high or low only reflects how much it was represented through the initial variables, not how fundamentally important it is. So, to give equal importance to these components, we standardize them as well, making sure the insights they each translate have the same amount of separational power in the clustering space.

After this, we have a dataset of new variables that keeps all the important information while removing the bias we previously introduced!
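Here is a minimal sketch of the three steps with scikit-learn (the raw numerical DataFrame df and the values of n_components and n_clusters are placeholders, not taken from the article):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 1) Standardize the original variables
X_scaled = StandardScaler().fit_transform(df)

# 2) Project onto the principal components (choice of n_components: see remark 2.)
components = PCA(n_components=3).fit_transform(X_scaled)

# 3) Standardize the components so every independent insight gets the same inertial power
Z = StandardScaler().fit_transform(components)

# Cluster on this unbiased encoding
labels = KMeans(n_clusters=3, random_state=0).fit_predict(Z)
```

Note that steps 2) and 3) taken together are essentially PCA whitening, which scikit-learn also exposes through PCA(whiten=True).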

III] The proof

The objective of this part is to design an experiment that shows whether our solution actually works. To do so, we implemented the following protocol:

Let’s consider a dataset of N individuals described by 3 fundamental variables (A, B, C), all already standardized. We created this fundamental dataset in such a way that 3 “obvious” clusters naturally appear (Figure 4 below). The clustering obtained (using k-means with k=3 over (A, B, C)) will be called the fundamental clustering. It is the clustering we will eventually want to recover, because it is based on the 3 variables that fundamentally describe this population.

Figure 4: Fundamental clustering of the dataset.

After this, we add M (M≥1) “noised” copies of column A to the dataset. Each noisy copy A*i is obtained by adding Gaussian noise (μ=0, σ fixed) to A, then standardizing the column. This new dataset (A, B, C, A*1, …, A*M) is called the production dataset, since it mimics the data available in production, where fundamental information can be unevenly spread across the variables at hand (here, the information from A is much more represented than that from B and C).
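A sketch of this construction (the article does not give the exact generator behind Figure 4, so the blob parameters, N, M and σ below are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Fundamental dataset: N points, 3 variables (A, B, C), 3 well-separated clusters
X, _ = make_blobs(n_samples=1000, n_features=3, centers=3, random_state=0)
fundamental = pd.DataFrame(StandardScaler().fit_transform(X), columns=["A", "B", "C"])

# Fundamental clustering: k-means with k=3 over (A, B, C)
fundamental_labels = KMeans(n_clusters=3, random_state=0).fit_predict(fundamental)

# Production dataset: (A, B, C) plus M standardized noisy copies of A
M, sigma = 10, 0.5
production = fundamental.copy()
for i in range(1, M + 1):
    noisy = fundamental["A"] + rng.normal(0, sigma, size=len(fundamental))
    production[f"A*{i}"] = (noisy - noisy.mean()) / noisy.std()
```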

  • Case 1: the old protocol. We apply the k-means algorithm (with k=3) directly over the production dataset and obtain the production clustering. This clustering might differ from the fundamental clustering, because the information from A is much more represented in this dataset and hence has more inertial power (as in Figure 3).
  • Case 2: the new protocol. We apply PCA with n_components=3 (see remark 2.) over the production dataset, and we standardize the 3 principal components obtained. We then apply k-means (k=3) over this PCA dataset to get the PCA clustering. If our approach is correct, this clustering should be extremely similar to the fundamental clustering, even when the production clustering is not, because we will have removed the bias introduced by the overrepresentation of variable A in the production dataset. (Both cases are sketched in code just after this list.)
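Continuing the sketch above (the production columns are already standardized by construction), the two cases read:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Case 1 (old protocol): k-means directly over the production dataset
production_labels = KMeans(n_clusters=3, random_state=0).fit_predict(production)

# Case 2 (new protocol): PCA with 3 components, re-standardized, then k-means
components = PCA(n_components=3).fit_transform(production)
pca_labels = KMeans(n_clusters=3, random_state=0).fit_predict(
    StandardScaler().fit_transform(components)
)
```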

☀ Now, how can we evaluate the similarity between two clusterings?

Computationally, doing a clustering comes down to assigning a label to every observation of the dataset, the labels being integers between 1 and k (where k is the number of clusters).

However, the value of those labels is arbitrary: any permutation of the values still defines the same clustering. So, to measure the real similarity, we compare the labels of the first clustering with all the possible permutations of the labels of the second clustering. We keep the permutation for which the two sets of labels match most often, and we call the matching ratio the proportion of lines that are identical. For instance, with the example described below, we would have a matching ratio of 90% between the two clusterings (for the permutation where only the line C does not match).

Figure 5: Comparison of two clusterings of the same population
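A brute-force sketch of this matching ratio (it tries all k! permutations, which is perfectly fine for k=3; the function name is mine):

```python
from itertools import permutations
import numpy as np

def matching_ratio(labels_a, labels_b, k=3):
    """Best share of identical labels over all relabellings of the second clustering.
    Assumes labels are integers in 0..k-1, as returned by scikit-learn's KMeans."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    best = 0.0
    for perm in permutations(range(k)):
        relabelled = np.array([perm[label] for label in labels_b])
        best = max(best, float(np.mean(labels_a == relabelled)))
    return best

# e.g. matching_ratio(fundamental_labels, production_labels)
#  vs. matching_ratio(fundamental_labels, pca_labels)
```

For larger k, the optimal relabelling could be found without enumerating permutations (for example with the Hungarian algorithm, scipy.optimize.linear_sum_assignment), or the comparison could be replaced by a permutation-invariant score such as the adjusted Rand index.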

To observe the results, we plot the matching ratios, first between the production clustering and the fundamental clustering (in blue), then between the PCA clustering and the fundamental clustering (in red), as a function of M (the number of noisy repetitions of column A in the production dataset). Finally, we repeat this process for 4 different values of σ, and we obtain the following curves:

Figure 6: Evolution of the matching ratios of the production and PCA clusterings with the fundamental clustering, depending on M, for 4 different values of σ.

We can see in these graphs that:

  • The matching ratio between the fundamental and production clusterings (in blue) always follows 2 phases: first constant and very close to 1 (total agreement), then constant but much lower than 1. This means that, once the information behind variable A becomes overwhelmingly present, it drowns out the knowledge behind B and C, and we can no longer recover the fundamental clusters from the production dataset.
  • However, the matching ratio between the fundamental and PCA clusterings (in red) is always constant and very close to 1. This proves that using the scaled principal components enabled us to remove the bias introduced by the noisy repetitions of A!
  • Finally, we also note that, as σ gets bigger, the blue line takes longer to collapse, and the red line sits slightly lower. This is because, when σ is bigger, the variables A*i are mostly composed of noise: it takes more repetitions of A*i for the knowledge behind A to drown out the information behind B and C, and the clusters obtained differ slightly from before because of the noise added to the dataset.

We now have proof that our solution can remove the bias from our dataset and recover the fundamental clustering, based on the real information behind the population!

Remarks:

  1. If you have both numerical and categorical variables in your dataset, you can replace steps 1) and 2) in II] with the Factorial Analysis of Mixed Data algorithm. FAMD is the generalization of PCA to mixed numerical and categorical variables. To learn everything about it, you can read my previous article on the subject here.
  2. Choice of n_components for the PCA: previously, we chose n_components=3 because, deep down, we knew that this dataset only needed 3 axes to describe all its information. In practice, this won’t be the case and you’ll have to make a choice. Many approaches are possible; personally, I use the following reasoning: if we kept all K components (where K is the number of variables in the original dataset), the components would have, on average, an explained variance ratio of 1/K. Since we want to keep the components that actually bring insights from the dataset, but not the ones that only translate some small random noise, we can keep the components whose explained variance ratio is greater than α · 1/K, with α ≤ 1 (see the sketch just below). In practice, α = 0.5 or α = 0.25 both work well.
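A sketch of this rule (the helper name is mine; df is again a numerical DataFrame):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def select_n_components(df, alpha=0.5):
    """Keep the components whose explained variance ratio exceeds alpha / K."""
    X_scaled = StandardScaler().fit_transform(df)
    K = X_scaled.shape[1]
    pca = PCA(n_components=K).fit(X_scaled)
    # Ratios are sorted in decreasing order, so this counts the leading components
    return int((pca.explained_variance_ratio_ > alpha / K).sum())
```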

Conclusion

By looking for the principal components and scaling them, we are now able to remove the bias naturally introduced in a dataset and to extract the fundamental information into well-balanced insights, in order to adequately encode the population for a clustering use case. The encoding of the population is an essential matter in unsupervised learning, and this method helps to make sure that the clusters obtained will reflect the reality we want to capture.
