Data Exploration using Unsupervised Machine Learning — Cluster Analysis

Kavin Ammankattur Palaniappan
The Startup
Published in
8 min read · Jun 12, 2020

The Philippine Statistics Authority (PSA) spearheads the conduct of the Family Income and Expenditure Survey (FIES) nationwide. This survey, undertaken every three years, aims to provide data on family income and expenditure, including, among others, levels of consumption by item of expenditure, sources of income in cash, and related information affecting income and expenditure levels and patterns in the Philippines. The published data is used in this study of unsupervised learning: the methods below are applied to explore the data and surface any interesting hidden structure in it.

Introduction:
Unsupervised learning is a type of machine learning that looks for undetected patterns in a data set with no pre-existing labels. Two of the main methods used in unsupervised learning are principal component analysis and clustering. Clustering is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups. Among the 100+ published clustering algorithms, two of the most reliable types are studied here: K-means clustering (centroid-based) and hierarchical clustering (connectivity-based).

Analysis:
The dataset contains over 25 variables, primarily the income and expenditures of individual households in the Philippines. A correlation matrix is plotted to identify the relationships between the variables and to decide on the most important ones for the study.

Fig.1 Correlation matrix of all variables

From Fig.1, the variables food, meat, restaurant, communication, transportation, total income, total expenses, rental value, and housing stand out as important, so a new data frame (subset) is constructed from these variables and used for further analysis. When two variables are highly correlated, it is also intuitive to examine how the two cluster together.
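The article does not show the code for this step; as an illustrative sketch, a Pearson correlation matrix can be computed in pure Python. The column names and numbers below are made-up toy stand-ins for the FIES variables, not values from the actual dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Matrix of pairwise Pearson correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy columns standing in for the FIES variables
data = {
    "total_income": [120, 250, 90, 400, 310],
    "food_expense": [60, 110, 50, 160, 140],
    "age":          [25, 40, 30, 55, 48],
}
corr = correlation_matrix(data)
```

Strongly correlated pairs (here, income and food expense) are exactly the candidates the article keeps for the clustering subset.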

The next step is to check whether the chosen dataset is suitable for cluster analysis. To verify this, the Hopkins statistic is computed on the dataset, as shown in Fig.2.

Fig.2 Clusterability analysis

Since the output value is close to 1 (well above 0.5), we can conclude that the dataset is significantly clusterable.

Next, a few variables from the dataset are plotted to see whether any patterns exist. From Fig.3 it looks as though almost all the variables are discrete and show some kind of spread or pattern. For further analysis, let us explore whether the Total_income and Total_expense variables reveal any insights about the rest of the data.

Fig.3 patterns across variables

Scaling:
Before starting the clustering analysis, the variables in the data set are rescaled (normalized) so that they share a common scale. The data is mapped into a fixed range using a linear transformation, which produces better-quality clusters and ultimately improves the accuracy of the clustering algorithms. Fig.4 shows a glimpse of the scaled data.

Fig.4 Scaled Data
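The article describes a linear transformation into a fixed range; min-max scaling is one common choice that fits this description (the exact scaler used is not stated, and the income figures below are hypothetical):

```python
def minmax_scale(column):
    """Linearly rescale a numeric column into the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical raw household incomes (pesos), before scaling
incomes = [11285, 198235, 82785, 107589, 189322]
scaled = minmax_scale(incomes)
```

Applying the same transformation to every selected column puts income (hundreds of thousands) and, say, communication expenses (hundreds) on equal footing, so no single variable dominates the distance calculations.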

K-means Clustering:
K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into k groups. An important step before running k-means is deciding on the number of clusters; for this, the elbow method, a heuristic for determining the optimal number of clusters, is used. The within-cluster sum of squared errors (WSS), i.e. the sum of squared distances from each data point to its cluster centroid, is calculated for a range of values of k, and the k at which the rate of decrease in WSS first sharply diminishes (the "elbow") is chosen.

Fig.5 Elbow method (K=3)

From Fig.5, the optimum value for K is 3. As the number of clusters increases, the WSS value decreases, so K is selected on the basis of the rate of decrease in WSS. From clusters 1 to 3 in the graph above there is a sudden, large drop in WSS; after 3, the drop is minimal, and hence 3 is chosen as the optimal K.
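The article's analysis appears to use R's kmeans; as an illustration, the elbow computation can be sketched in pure Python with Lloyd's algorithm plus a few random restarts for stability. The (income, expense) points below are hypothetical, generated as three well-separated groups so that the elbow lands at k = 3:

```python
import math
import random

def kmeans(points, k, n_init=30, iters=100, seed=0):
    """Lloyd's algorithm, best of n_init random restarts; returns (labels, wss)."""
    rng = random.Random(seed)
    best_labels, best_wss = None, float("inf")
    for _ in range(n_init):
        centroids = rng.sample(points, k)
        labels = [0] * len(points)
        for _ in range(iters):
            # assign each point to its nearest centroid
            labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
            # recompute each centroid as the mean of its members
            new = []
            for j in range(k):
                members = [p for p, l in zip(points, labels) if l == j]
                new.append([sum(c) / len(members) for c in zip(*members)]
                           if members else centroids[j])
            if new == centroids:
                break
            centroids = new
        wss = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
        if wss < best_wss:
            best_labels, best_wss = labels, wss
    return best_labels, best_wss

# Hypothetical scaled (income, expense) data with three well-separated groups
gen = random.Random(2)
pts = [[gen.gauss(cx, 0.2), gen.gauss(cy, 0.2)]
       for cx, cy in [(0, 0), (5, 0), (2.5, 4)] for _ in range(30)]

# WSS drops sharply up to k = 3, then flattens: the "elbow" is at 3
wss_by_k = {k: kmeans(pts, k)[1] for k in range(1, 7)}
```

Plotting `wss_by_k` against k reproduces the shape of Fig.5: a steep fall to k = 3 followed by a nearly flat tail.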

To start with, one-dimensional k-means clustering is performed, as illustrated in Fig.6. The variables chosen are age, clothing and footwear expenses, food expenses, and the total income of the families.

Fig.6 One-dimensional K-means cluster analysis

Here K is chosen to be 3, so three clusters can be seen. For the total_income variable, the families can be categorized as low, medium, and high income. For the age plot, the groups are young, middle-aged, and elderly; for food expenses, low, medium, and high spenders.

Fig.7 Size of the clusters in Food expenses (reversed)

The number of families in each food-expense category is shown in Fig.7. The Philippine government could use this result to decide which services to offer low spenders, for instance food laws and regulations that improve quality of life, since food is something universal that all people can relate to and enjoy. Still, only limited and fairly obvious conclusions can be drawn from a one-dimensional cluster; for the clusters to carry more meaning and richer interpretations, two-dimensional clustering is studied next.

First, let us see whether the clusters by income reveal any insights about the other variables. The 'nstart' option of the 'kmeans' function attempts multiple initial configurations and reports the best one. As Fig.8 shows, for nstart=1 the total within-cluster sum of squares is 733.162, while for nstart=100 it is 680.838. The value nstart=100 is therefore preferred; verifying this across all the variables over multiple trials consistently gave a significantly lower within-cluster sum of squares.

Fig.8 Plots with different initial conditions ( nstart=1 and nstart=100)

The clusters in the second plot of Fig.8 can be interpreted as 'low earners and low spenders' on clothing, 'medium and high earners and low-to-medium spenders', and 'medium and high earners and medium-to-high spenders'.
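The nstart behaviour described above can be mimicked outside R: run k-means from several random initializations and keep the solution with the lowest total within-cluster sum of squares. A minimal pure-Python sketch on hypothetical four-group data (the function names and data are invented for illustration):

```python
import math
import random

def kmeans_once(points, k, seed, iters=100):
    """One k-means (Lloyd) run from a random initialization; returns its total WSS."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append([sum(c) / len(members) for c in zip(*members)]
                       if members else centroids[j])
        if new == centroids:
            break
        centroids = new
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

def kmeans_nstart(points, k, nstart):
    """Best (lowest) WSS over nstart initializations, like R's kmeans(nstart=...)."""
    return min(kmeans_once(points, k, seed) for seed in range(nstart))

# Hypothetical scaled data with a few groups
gen = random.Random(3)
pts = [[gen.gauss(cx, 0.3), gen.gauss(cy, 0.3)]
       for cx, cy in [(0, 0), (4, 0), (0, 4), (4, 4)] for _ in range(25)]

w1 = kmeans_nstart(pts, 3, 1)      # single initialization
w100 = kmeans_nstart(pts, 3, 100)  # best of 100: never worse than a single run
```

Because the 100 restarts include the single run's initialization, the best-of-100 WSS can only match or improve on it, which is exactly the 733.162 vs 680.838 pattern reported in Fig.8.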

The common inference from the two plots in Fig.9 is: 'young people who earn less and spend less' (413 and 419), 'elderly people who earn less and spend less' (509 and 509), and 'middle-aged people who earn more and spend more' (78 and 72).

Fig.9 Age vs (Total_income and Total_expenses)

Fig.10 shows income versus communication expenses. It can be interpreted as 'very low earners and low spenders on mobile bills', 'medium earners and low-to-medium spenders', and 'high earners and high spenders'.

Fig.11 shows income versus food expenses. It can be interpreted as 'very low earners and low spenders on food', 'low-to-medium earners and low-to-medium spenders', and 'medium-to-high earners and high spenders on food'.

Fig.10 Income vs communication expense, Fig.11 Income vs Food expenses

Fig.12 shows income versus housing expenses. It can be interpreted as 'very low earners and very low spenders on housing', 'low-to-medium earners and low-to-medium spenders', and 'medium-to-high earners and medium-to-high spenders on housing'.

Fig.13 shows income versus transportation expenses. It can be interpreted as 'very low earners and very low spenders on transportation', 'low-to-medium earners and low-to-medium spenders', and 'medium-to-high earners and medium-to-high spenders on transportation'.

Fig.12 Income vs Housing expenses, Fig.13 Income vs Transportation expenses

Hierarchical Clustering:
Hierarchical clustering is an unsupervised clustering algorithm that builds clusters with a predominant ordering from top to bottom. Its main output is the dendrogram, which shows the hierarchical relationships between the clusters. Just as the elbow method is used with the k-means algorithm to determine the optimal number of clusters, the dendrogram can be leveraged here to approximate the number of clusters. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum vertical distance without intersecting a cluster. For the chosen dataset, the dendrogram is displayed in Fig.14, with the red line (in the left picture) indicating the cut at height 20; hence the chosen cluster count is 3. The second picture, shown for ease of understanding, highlights the three different clusters.

Fig.14 Dendrograms
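To make the bottom-up idea concrete, here is a minimal pure-Python sketch of agglomerative clustering with average linkage on hypothetical 2-D points. Merging until three clusters remain is equivalent to cutting the dendrogram at the corresponding height (the original analysis presumably used R's hclust; the data and names below are invented for illustration):

```python
import math

def agglomerative(points, n_clusters):
    """Bottom-up hierarchical clustering with average linkage.
    Repeatedly merge the two closest clusters until n_clusters remain,
    which corresponds to cutting the dendrogram at that level."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # average pairwise distance between the two clusters' members
        return sum(math.dist(points[i], points[j])
                   for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        x, y = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical scaled (income, expense) points forming three visible groups
pts = [[0.10, 0.10], [0.12, 0.09], [0.11, 0.15],
       [0.80, 0.80], [0.82, 0.78],
       [0.50, 0.90], [0.48, 0.88]]
groups = agglomerative(pts, 3)
sizes = sorted(len(g) for g in groups)  # cluster sizes, analogous to Fig.15
```

Note that this naive O(n³) implementation is only for illustration; on the full FIES data one would use an optimized library routine, which also relates to the performance caveat for hierarchical clustering discussed below.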

Fig.15 shows the sizes of the clusters obtained from the hierarchical clustering discussed above: 964 for the cluster in green, 28 for the cluster in blue, and 8 for the cluster in pink.

Fig.15 cluster sizes from hierarchical clustering

When the results of the hierarchical clustering are plotted, they are not very satisfactory. In Fig.16, the first cluster can be interpreted as 'low earners and low spenders', but the second and third clusters do not show any pattern and appear randomly distributed. The same goes for Fig.17. The underlying reason may be the large amount of data involved; hierarchical clustering typically performs comparatively poorly on large datasets.

Fig.16 Income vs Housing expenses, Fig.17 Income vs Transportation expenses

Conclusion:
To recapitulate, clustering in the context of unsupervised learning has been explored in detail using family data from the Philippines, across many possible variables. These results can be used by authorities and government officials in planning the economy over the coming years. Compared to hierarchical clustering, K-means clustering gives more accurate and useful results for the dataset studied here.

Datasets:
The dataset was processed and made available by the Philippine Statistics Authority on Kaggle, where it can be accessed.


A Supply Chain and Data Enthusiast with a Masters in Industrial Engineering. At present, working as a Buyer Planner in an Electronics mfg company in BC, Canada.