Creating Clusters for Educational Institutions

Casey Tabatabai
INST414: Data Science Techniques
4 min read · May 12, 2022

In this assignment I construct clusters of educational institutions from around the world based on their global rankings for quality of education and alumni employment. I wanted data that could produce useful clusters for generating insights into how different institutions resemble one another. I chose quality of education as one variable because it indicates how strong an institution's education is considered to be globally, and alumni employment as the other because it indicates how often an institution's alumni find success in their field. The insight I want to extract is the similarity between institutions, which can help people who wish to attend, or send their child to, an institution similar to one they previously attended.

My data comes from the Kaggle dataset titled "World University Rankings", created by Myles O'Neill. The data is provided as a CSV file, which I loaded into a Pandas DataFrame with the read_csv function. The clustering method I use in this assignment is k-means clustering with Euclidean distance as the similarity metric. K-means clustering divides data into clusters based on how similar the points are to each other, and Euclidean distance measures how far apart two feature vectors are, so institutions with small distances between their ranking vectors end up in the same cluster.
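Below is a minimal sketch of how the data can be loaded and the two ranking columns pulled out. The file name cwurData.csv and the column names quality_of_education and alumni_employment are assumptions based on the Kaggle dataset's layout, not something fixed by this write-up.

```python
import pandas as pd

# Load the rankings CSV into a DataFrame.
# File and column names are assumptions based on the Kaggle dataset layout.
df = pd.read_csv("cwurData.csv")

# Keep only the two ranking columns used as clustering features.
features = df[["quality_of_education", "alumni_employment"]]
print(features.head())
```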

I selected the k value for the number of clusters using the elbow method. The elbow method chooses k by plotting the k-means model's inertia for a range of k values and finding where the curve changes from a steep to a shallow decrease. Cluster quality improves as inertia decreases, so the elbow method is a good way to pick k because it visualizes the point where adding more clusters stops paying off. A sketch of this procedure is shown below.
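This is a rough sketch of the elbow-method loop, assuming the features DataFrame from the earlier snippet; the exact range of k values tried here is an assumption.

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit k-means for a range of k values and record the inertia
# (sum of squared Euclidean distances to the nearest centroid).
inertias = []
k_values = range(2, 60)
for k in k_values:
    model = KMeans(n_clusters=k, random_state=0, n_init=10)
    model.fit(features)
    inertias.append(model.inertia_)

# Plot inertia against k; the "elbow" is where the curve flattens out.
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()
```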

DataFrame for k value and inertia score
Plot to optimize k value using elbow method

The k value I chose for this assignment is 40. As the DataFrame and plot above show, the inertia values begin to plateau around the 40 mark.
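With k chosen, the final model can be fit and each institution assigned a cluster label. This sketch reuses the df and features objects assumed in the earlier snippets.

```python
# Fit the final k-means model and attach a cluster label to each institution.
kmeans = KMeans(n_clusters=40, random_state=0, n_init=10)
df["cluster"] = kmeans.fit_predict(features)
```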

Each cluster within my dataset represents a set of institutions that share similarities in regards to quality of education and alumni employment.

Cluster 1 DataFrame

For example, the institutions in cluster 1 all have an education rank between 134 and 221 and an employment rank between 12 and 86. In a dataset of 1,000 institutions, those are fairly narrow ranges, so the cluster groups genuinely similar institutions.

Cluster 10 DataFrame

The cluster that features the most top institutions is cluster 10. Every institution in this cluster has an education rank between 1 and 40 and an employment rank between 1 and 64. The cluster comprises many Ivy League schools in the United States, such as Harvard University and Yale University, as well as top institutions outside the country like the University of Oxford and the University of Tokyo.
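Pulling out the rows for a single cluster is a simple filter on the label column. The institution column name is an assumption based on the Kaggle dataset, and the numeric labels themselves depend on k-means initialization, so "cluster 10" here is just an illustration.

```python
# Inspect the institutions assigned to one cluster, e.g. cluster 10.
# The "institution" column name is an assumption; labels vary with initialization.
cluster_10 = df[df["cluster"] == 10]
print(cluster_10[["institution", "quality_of_education", "alumni_employment"]]
      .sort_values("quality_of_education"))
```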

The main software I used to facilitate my analysis was scikit-learn (sklearn), a Python library that includes many tools for data analysis. scikit-learn provided the k-means implementation I used to cluster my data.

One issue I ran into when working with my data was duplicate values. My data originally had 2,000 rows, so I believed there were rankings for 2,000 different institutions. However, after visualizing the data and putting it into clusters, I noticed institutions with the same names. Looking into the data further, I realized it contained rankings for both 2014 and 2015. I solved this by removing the 2014 rows in Excel, since I wanted to use the more recent data.
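The same fix could be done in pandas instead of Excel, before selecting the feature columns. The year column name is an assumption based on the Kaggle dataset.

```python
# Keep only the most recent year's rankings so each institution appears once.
# The "year" column name is an assumption based on the Kaggle dataset.
df = df[df["year"] == 2015].reset_index(drop=True)
```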

One limitation of my data is that only the top 367 institutions are ranked for education quality, and only the top 567 are ranked for alumni employment. Every institution outside the top 367 for education quality was assigned a rank of 367, and likewise every institution outside the top 567 for alumni employment was assigned a rank of 567. I noticed this in the clustering, because many of the institutions with ranks of exactly 367 and 567 ended up clustered together. If the data included full rankings for both variables, I believe the clustering would be much more accurate.
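A quick way to see how much of the data sits at those caps is to count the rows at the capped ranks; the cap values below are taken from the write-up, and the column names are again assumptions.

```python
# Count institutions stuck at the capped ranks, since ties at the cap
# will tend to be grouped into the same clusters.
capped_education = (df["quality_of_education"] == 367).sum()
capped_employment = (df["alumni_employment"] == 567).sum()
print(capped_education, capped_employment)
```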

My main takeaway from this analysis is how useful clustering is for determining similarity within a dataset. If someone had a great experience at a particular institution, they can use clustering to find similar institutions based on a variety of variables. If I ever need to find items similar to a specific set of data again, I will certainly use clustering.
