Clustering datasets: “World Happiness Report up to 2022”

Yang Lu
INST414: Data Science Techniques
4 min read · Apr 5, 2022

I wanted to see which countries are similar in happiness levels. The dataset I used is “World Happiness Report up to 2022” by Mathurin Aché from Kaggle. I used “Ladder score” and “Healthy life expectancy” as the features for similarity. The K value I used was 20, as it appeared close to the elbow point found using the elbow method.
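The elbow point can be eyeballed from an inertia curve. Here is a minimal sketch of that step; synthetic blobs stand in for the real happiness DataFrame, and the column names follow the dataset:

```python
# Sketch of the elbow method used to pick k; synthetic blobs stand in
# for the happiness DataFrame loaded from the Kaggle CSV.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=6, random_state=0)
df = pd.DataFrame(X, columns=["Healthy life expectancy", "Ladder score"])

ks = range(1, 31)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(df[["Healthy life expectancy", "Ladder score"]])
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(ks), inertias, marker="o")  # the "elbow" is where the curve flattens
plt.xlabel("k")
plt.ylabel("inertia")
```

The elbow is a judgment call: inertia always drops as k grows, so you look for the k where the drop stops paying off.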

Plotted on its own, the data looks like a scatter plot better suited to regression analysis than to clustering. I clustered it anyway using the sklearn library:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

model = KMeans(n_clusters=20)
matrix = df[["Healthy life expectancy", "Ladder score"]]
model.fit(matrix)
df["cluster"] = model.labels_
plt.scatter(matrix["Healthy life expectancy"], matrix["Ladder score"], c=df["cluster"])

The resulting plot was just a colorful scatter plot, with no easily visible logic to the clustering.

As I wanted to know how the data was clustered, and to be able to use the clusters, I used a counter to print through each cluster:

counter = 0
while counter < len(set(df["cluster"].values)):
    df2 = df.loc[df["cluster"] == counter]
    print("CLUSTER:", counter)
    print(df2[["Country name", "Regional indicator", "Healthy life expectancy", "Ladder score"]])
    print(df2["Regional indicator"].value_counts())
    counter += 1

There is some logic to the clustering: countries in the same cluster tend to share regional indicators and have similar healthy life expectancies and ladder scores, although it is not perfect.

The main problem with using this method of clustering for these data is that it is inconsistent: the cluster groups change with each run. This is most likely due to not setting a random-state seed. I also tried another clustering method from sklearn, Agglomerative Clustering, with 20 clusters.
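As a side note, the K-means inconsistency itself can be removed by seeding the algorithm. A minimal sketch, with random points standing in for the happiness features:

```python
# Passing random_state makes KMeans cluster labels reproducible across runs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # stand-in for the two feature columns

labels_a = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
assert (labels_a == labels_b).all()  # identical assignments every run
```

With a fixed `random_state`, the K-means cluster numbers above would stay stable between runs.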

from sklearn.cluster import AgglomerativeClustering

num_clusters = 20
model = AgglomerativeClustering(n_clusters=num_clusters)
matrix=df[["Healthy life expectancy","Ladder score"]]
model.fit(matrix)
df["cluster"] = model.labels_
plt.scatter(matrix["Ladder score"], matrix["Healthy life expectancy"], c=df["cluster"])

There are some differences in how the data is clustered, and the cluster groups are consistent this time. Here is the hierarchical dendrogram:
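A sketch of how such a dendrogram can be drawn with scipy's hierarchy tools, assuming `matrix` holds the two feature columns (random points stand in here):

```python
# Draw a dendrogram of the agglomerative merge tree with scipy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
matrix = rng.normal(size=(40, 2))  # stand-in for the 2-column feature matrix

Z = linkage(matrix, method="ward")  # ward matches sklearn's default linkage
dendrogram(Z)
plt.ylabel("merge distance")
```

Cutting the tree at a given height yields the flat clusters, which is what `AgglomerativeClustering(n_clusters=20)` does internally.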

The country of interest is the United States. As such, the main question is which countries are similar in happiness to the United States, as measured by the dataset’s “Ladder score” and “Healthy life expectancy”. I chose healthy life expectancy because it implies a lot about a country’s population health, which allows for research into whether happiness level plays a role in life expectancy. The United States falls in cluster 19.

To test the similarity of the countries in cluster 19, I used cdist from scipy, with Healthy life expectancy and Ladder score as the coordinates.

import scipy.spatial.distance

cdist_df = df.loc[df["cluster"] == 19].set_index("Country name")
cdist_df = cdist_df[["Healthy life expectancy", "Ladder score"]]
points = list(zip(cdist_df["Healthy life expectancy"], cdist_df["Ladder score"]))
sim=scipy.spatial.distance.cdist(points, points, 'euclidean')
cluster_list=['United States', 'Lithuania',"Colombia","Hungary","Nicaragua","Peru","Bosnia and Herzegovina","Vietnam"]
sim_df = pd.DataFrame(sim, columns = cluster_list)
sim_df["Country Name"]=cluster_list
sim_df=sim_df.set_index("Country Name")
print(sim_df)

These countries are quite similar with respect to healthy life expectancy and ladder score. The country most similar to the United States in happiness and life expectancy is Lithuania.

I would not bet on the above answer. The limitation of this method is that it only measured similarity within a single cluster. It may be better to run cdist against all the countries in the dataset. The main problem with that, however, is that there are ~149 countries in the dataset, and running cdist over all of them without a better way of visualizing the result would only produce a cluttered, lower-quality plot.
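That alternative is workable if, instead of plotting the whole distance matrix, you simply rank each country's nearest neighbours. A sketch, with a few made-up rows standing in for the real ~149-country DataFrame:

```python
# Rank every country by distance to the United States, without clustering.
# The numeric values below are hypothetical stand-ins, not the real data.
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({
    "Country name": ["United States", "Lithuania", "Finland", "Peru"],
    "Healthy life expectancy": [68.2, 67.9, 71.0, 66.7],
    "Ladder score": [6.98, 6.45, 7.82, 5.84],
})

points = df[["Healthy life expectancy", "Ladder score"]].to_numpy()
dist = cdist(points, points, "euclidean")
sim_df = pd.DataFrame(dist, index=df["Country name"], columns=df["Country name"])

# Closest countries to the United States, nearest first (excluding itself).
nearest = sim_df["United States"].drop("United States").sort_values()
print(nearest.head())
```

This sidesteps the cluster boundary entirely: a country just outside cluster 19 can still show up as a near neighbour.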
