I began reading this paper knowing very little about clustering and algorithmic fairness. So, I decided to write this reflection as a mini research paper of sorts. The content in this short piece comes from sources online including research papers, but also Wikipedia and other websites.

Clustering groups similar objects together. It is widely used in pattern recognition and in exploratory data analysis, where it helps summarize the characteristics that distinguish the groups of objects. It is an iterative process of knowledge extraction that involves trying various algorithms, with plenty of trial and error.

There are different kinds of clustering. I summarized these to better understand how clustering is used.

• Connectivity-based/hierarchical clustering

You can see the clusters directly when they are drawn as a tree diagram (a dendrogram) that shows how smaller clusters merge into larger ones. These methods do not accommodate outliers very well.

• Centroid-based clustering (relevant to the k-means, k-median, and k-center clustering below)

The number of clusters has to be chosen in advance, which is a big drawback of this technique. It works by minimizing the squared distance between each object and its nearest cluster center.

• Distribution-based clustering

Objects are grouped based on distribution models: each object is assigned to the probability distribution it most likely came from. This can tell us about the correlation and dependence between object attributes.

• Density-based clustering

Clusters are the dense regions on the graph, separated by sparser regions. Different groups with different densities may be found.
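To make the centroid-based idea above more concrete, here is a minimal sketch of the usual alternating procedure (often called Lloyd's algorithm): assign every point to its nearest center, then move each center to the average of its points. The function names, the toy data, and the choice of starting centers are my own assumptions for illustration; this is not code from the paper.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd_kmeans(points, initial_centers, iterations=10):
    """Alternate between assigning points and recomputing centers.

    The number of clusters is fixed up front by len(initial_centers).
    """
    centers = list(initial_centers)
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its points.
        # An empty cluster keeps its previous center.
        centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = lloyd_kmeans(points, initial_centers=[points[0], points[3]])
print(centers)
print(clusters)
```

Starting from one point in each group, the two centers settle on the averages of the two groups. Different starting centers can lead to different final clusters, which is why initialization matters so much for this family of methods.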

Outliers may create their own cluster or influence the results of the formed clusters. We can consider ignoring them or removing them before we process the data and cluster. This might cause problems in the real world, though, if the outliers together actually represent a single group. As a trivial example, if one cluster of people say their favorite greeting is “Hello”, another cluster says it is “How are you doing?”, and three people say it is “Good morning”, “Buenos días”, and “Guten Morgen”, the latter three should all be grouped in one cluster because they are saying the same thing, albeit in different languages.

I want to note that, from what I read, space-based clustering does not seem to be something used in industry. That is, we do not group objects by drawing dividing lines on the graph, like having a 3x3 grid of 9 squares and thus 9 clusters. It is also not standard practice in industry to divide objects into groups based on human partiality. That would be akin to how the United States of America was divided up into 50 states rather arbitrarily, based on culture, money, leaders of the day, and the wants and needs of the people.

The paper also talks about k-means, k-median, and k-center clustering. In k-means, the cluster center (also called the centroid) is the average of all the points in the cluster, so it changes as each object is added.
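As a small illustration of how an average-based center moves when a point is added, here is a sketch that computes a centroid and then updates it with a running mean. The helper names and the sample points are assumptions made up for this example.

```python
def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def add_point(center, count, new_point):
    """Running-mean update: centroid and size after adding one point."""
    new_count = count + 1
    new_center = tuple(c + (x - c) / new_count for c, x in zip(center, new_point))
    return new_center, new_count


# Hypothetical cluster of three points.
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
center = centroid(cluster)
print(center)  # (1.0, 1.0)

# Adding a fourth point pulls the centroid toward it.
updated, size = add_point(center, len(cluster), (5.0, 1.0))
print(updated, size)  # (2.0, 1.0) 4
```

The running update gives the same result as recomputing the average over all four points, so the center never needs the full list of earlier points.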

• k-median

Minimizes sum of distances to cluster center

• k-means

Minimizes sum of squared distances to cluster center

• k-means++

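An initialization scheme for k-means rather than a separate objective. The first centroid is picked uniformly at random from the data points, and each later centroid is picked from the data points with probability proportional to its squared distance from the nearest centroid already chosen. This spreads out the starting centroids and reduces the initialization sensitivity of plain k-means, where the starting centroids affect the final formed clusters. (Can make a huge difference.)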

• k-center

Minimizes maximum distance of any point to cluster center
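The objectives listed above differ only in how the point-to-center distances are combined, which a short sketch can show side by side. The last function sketches the standard k-means++ seeding rule (each new center is drawn with probability proportional to squared distance from the nearest center chosen so far). All function names and the toy data are assumptions for illustration, and Euclidean distance is assumed throughout.

```python
import math
import random


def nearest_distance(point, centers):
    """Euclidean distance from a point to its closest center."""
    return min(math.dist(point, c) for c in centers)


def k_median_cost(points, centers):
    """Sum of distances to the nearest center."""
    return sum(nearest_distance(p, centers) for p in points)


def k_means_cost(points, centers):
    """Sum of squared distances to the nearest center."""
    return sum(nearest_distance(p, centers) ** 2 for p in points)


def k_center_cost(points, centers):
    """Largest distance from any point to its nearest center."""
    return max(nearest_distance(p, centers) for p in points)


def kmeans_pp_seed(points, k, rng):
    """k-means++ style seeding; assumes at least k distinct points."""
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Points far from every chosen center get a larger weight;
        # already chosen points have weight zero and are never redrawn.
        weights = [nearest_distance(p, centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical toy data with two fixed centers.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
print(k_median_cost(points, centers))  # distances 0 + 1 + 5 + 0
print(k_means_cost(points, centers))   # squares   0 + 1 + 25 + 0
print(k_center_cost(points, centers))  # worst single distance

seeds = kmeans_pp_seed(points, k=2, rng=random.Random(0))
print(seeds)
```

On this data the point (3.0, 4.0) contributes 5 to the k-median cost, 25 to the k-means cost, and sets the k-center cost on its own, which shows how k-means and k-center put more weight on faraway points than k-median does.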

This paper offers the world a method for increasing fairness in clustering tasks, based on group representativeness, especially when the groups are dissimilar in size. It pains me to think of how slowly research in academia reaches industry, because the method this paper provides could be adopted by organizations that should try to improve representation in areas like race, gender, nationality, sexual orientation, class, and education level. I wonder if the government should play a bigger role in understanding the research that comes out of the money it provides to research institutions, and see how it can use the results to encourage fairness in how public and private organizations make decisions and treat customers.