Practicing Clustering Techniques on Survey Dataset

Tama Handika
11 min read · Nov 4, 2017


Hello All,

Last time we practiced a regression problem, which falls under supervised learning. You can check that article here.

Today, we will explore a different category of machine learning: unsupervised learning algorithms. As the name suggests, we don’t have the luxury of knowing the right answer (the target variable) upfront.

We will practice clustering using a student evaluation survey dataset. Our goal is to group the students based on the similarity of their answers on the survey. Notice that we don’t know in advance how many clusters (groups) of students there will be. In fact, we will use different clustering methods to decide the best “natural” number of groups for this dataset.

I used the word “natural” because theoretically we can group the observations into any number of groups, as long as that number is smaller than or equal to the number of observations. To be “natural”, the clusters need to follow the patterns in the observations. In this article, we will cross-check the results of each clustering algorithm to decide on the best “natural” clustering.

You can download the complete source code here

This article will explore the following concepts :

  1. Use PCA to visualize the clustering result.
  2. Compare the clustering result with and without PCA.
  3. Define the number of clusters using the Elbow method & a Dendrogram.
  4. Compare the result of K-Means vs Agglomerative Clustering.
  5. Do deeper analysis with the clustering result.

Outline

  1. Data Exploration
  2. Data Preparation
  3. Training Models (K-Means & Agglomerative Clustering)
  4. Conclusion

1. Data Exploration

You can check the meaning of each question here
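Before exploring anything, the dataset has to be loaded into a pandas DataFrame. A minimal sketch, assuming the csv file is named turkiye-student-evaluation.csv (the actual filename in the repository may differ):

import pandas as pd

# Load the survey responses into a DataFrame.
# The filename is an assumption; it may differ in the repository.
df_train = pd.read_csv('turkiye-student-evaluation.csv')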

First things first, let’s check the dimension of this dataset :

>>> df_train.shape
(5820, 33)

We have 5820 observations (students) with 33 features (28 survey questions and 5 attributes).

We can check the first 5 rows with :

df_train.head()

And basic statistical attributes with :

df_train.describe()

Although we can conveniently do a quick check on the dataset with the methods above, I prefer to open the csv directly to do some basic analysis. We can see that this dataset has 3 instructors, who taught 13 classes.

Let’s do some analysis to understand this dataset better. We can compare the number of responses by those 3 Instructors :

hist_total_response_by_instr()

Most students were taught by the third instructor, but this doesn’t necessarily mean he taught most of the classes. There could be a large gap in the number of students between classes, so let’s check it :

hist_total_response_by_class()
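These two helpers live in the repository rather than in the article; a minimal sketch of what they might look like, assuming the instructor and class columns are named instr and class (the names are assumptions, check the csv header):

import matplotlib.pyplot as plt

def hist_total_response_by_instr():
    # Number of survey responses received by each instructor.
    df_train['instr'].value_counts().sort_index().plot(kind='bar')
    plt.xlabel('Instructor')
    plt.ylabel('Number of responses')
    plt.show()

def hist_total_response_by_class():
    # Number of survey responses received by each class.
    df_train['class'].value_counts().sort_index().plot(kind='bar')
    plt.xlabel('Class')
    plt.ylabel('Number of responses')
    plt.show()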

The number of responses for some classes is clearly much lower compared to the rest. Doing a little more analysis, we can see that the 3rd instructor taught 7 classes : class no 3, 4, 5, 8, 9, 12, and 13.
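We can verify this directly from the data. A short sketch, again assuming the instr and class column names:

# Which classes did the third instructor teach?
# ('instr' and 'class' are assumed column names.)
print(sorted(df_train[df_train['instr'] == 3]['class'].unique()))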

Let’s move on to the series of questions answered by the students. There are 28 questions, each answered from 1 (very bad) to 5 (very good). The first thing I would like to know is the grand mean of all questions answered by all students :

X_questions = df_train.iloc[:,5:33]
question_means = X_questions.mean(axis = 0)
grand_mean = question_means.mean()

We got a grand mean around the middle value (3.186156). Let’s zoom in on the mean of each question :
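The plot itself is generated in the source code; a minimal sketch of how the per-question means could be drawn with matplotlib:

import matplotlib.pyplot as plt

# Bar plot of the mean answer for each of the 28 questions,
# with the grand mean drawn as a reference line.
question_means.plot(kind='bar')
plt.axhline(grand_mean, color='red', linestyle='--', label='grand mean')
plt.xlabel('Question')
plt.ylabel('Mean answer (1-5)')
plt.legend()
plt.show()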

The mean of each question tends to be around the middle value (3). Looking at the graph above, the standard deviation should be small. Let’s do a quick check to be sure :

std_by_questions = question_means.std()  # spread of the per-question means

The standard deviation is 0.109106, which is reasonably small and in line with the graph above. This means that the quality of each class is more or less the same according to the survey answers.

2. Data Preparation

We are lucky that the Q1-Q28 features are Likert-type, scaled from 1–5. Hence, we can fit them to our models directly. Let’s remove all non-question features, because we will group the students based on their answers in the survey :

X = df_train.iloc[:,5:33]

Recall that we don’t know how many clusters there will be in this data, or whether the clusters our model generates will be “natural” or not, so we will need to confirm it using visualization.

However, we have 28 features, which means a 28-dimensional space. It would be very hard for us to visualize (and understand) this many dimensions, so people usually use dimensionality reduction methods to transform the data into a 2-dimensional space (x, y).

Specifically, we will use Principal Component Analysis (PCA) to do the dimensionality reduction. PCA can help us identify patterns based on the correlation between features. This algorithm aims to preserve maximum variance using fewer dimensions than the original data.

Under the hood, PCA draws orthogonal axes (principal components) in the directions of maximum variance. The image below illustrates this concept :

Courtesy of https://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/

One drawback of using PCA is that the new features it generates (x and y in this case) carry less variance than the original features. This means we might lose some information and lower the predictive power of our model.

Hence, in this article we will use PCA to visualize the clustering result, then remove PCA (use the original features) and compare the outcomes.

from sklearn.decomposition import PCA

pca = PCA(n_components = 2, random_state=1)
X_pca = pca.fit_transform(X)

X_pca will have 5820 observations with only 2 features (dimensions). Let’s check the amount of variance we preserved from the original data :

>>> pca.explained_variance_ratio_.cumsum()[1]
0.86713816788910847

We managed to keep around 87% of the variance, which is not very good but not bad either (people tend to prefer an explained variance ≥ 90%). We don’t have to worry too much about this because we will not apply PCA for the final clustering result (and we will compare the results).

Another disadvantage of PCA is that it transforms the features into abstract ones. This means we lose interpretability, making us unable to answer many business-related questions.

3. Training Models : K-Means

You can check the complete code in training/TrainKMeans.py

Let’s roll our sleeves up and start with one of the most popular clustering algorithms : K-Means clustering. In a nutshell, this algorithm clusters each observation based on its distance to the nearest centroid.

Each centroid is initially picked at random from the observations (a sample/row) and is moved on each iteration to the mean (for continuous features) or the mode (for categorical features) of the observations assigned to it. The number of centroids equals the number of clusters.

The inner workings of this algorithm are summarized by the following steps (a minimal code sketch follows the list) :

  1. Pick the number of clusters (we will use the Elbow method). Let’s call this number k.
  2. Randomly pick k observations as initial centroids.
  3. Assign each observation to the nearest centroid (computed by Euclidean distance).
  4. Move each centroid to the center (average) of the observations assigned to it.
  5. Repeat steps 3 & 4 until the clusters do not change (convergence) or the maximum number of iterations is reached.
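To make these steps concrete, here is a minimal NumPy sketch of the algorithm (for illustration only; in practice we use sklearn’s KMeans, which also handles details such as empty clusters and smarter initialization):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=1):
    # Toy K-Means for illustration; not production code.
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    # Step 2: randomly pick k observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each observation to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the observations assigned to it.
        # (Empty clusters are not handled here.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids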

The clustering result of this algorithm depends on the random initial centroids. This means the result might not be the same if we run the algorithm multiple times. Fortunately, the K-Means model in sklearn implements a more advanced version of this algorithm, called K-Means++.

K-Means++ places the initial centroids far away from each other, which leads to faster convergence and more consistent results. This is the default value of the init parameter.

The next thing on our to-do list is to perform the Elbow method. This method allows us to pick the best number of clusters (k) by computing the Sum of Squared Errors (SSE) of the clustering, also called distortion. A smaller SSE means the observations within each cluster are closer to their centroid.

Our goal is to identify the value of k beyond which the distortion stops decreasing rapidly (the “elbow”). In this article, we will fit the K-Means model with k from 1 to 5 and compare the change in distortion for each k :

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

distortions = []
K_to_try = range(1, 6)

for i in K_to_try:
    model = KMeans(
        n_clusters=i,
        init='k-means++',
        n_jobs=-1,
        random_state=1)
    model.fit(X_pca)
    distortions.append(model.inertia_)

plt.plot(K_to_try, distortions, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Distortion')
plt.show()

Notice that we create the model inside the for loop and use n_clusters=i to iterate through each value of K_to_try (range(1, 6) generates the integers 1–5). The code above will generate the following graph :

The “elbow” seems to be located at k = 3. This means the distortion will not decrease significantly if we tell the algorithm to use a larger number of clusters; the clustering result would only become more complicated and less “natural”.

Let’s train the K-Means with 3 clusters :

# use the best K from the elbow method
model = KMeans(
    n_clusters=3,
    init='k-means++',
    n_jobs=-1,
    random_state=1)

model = model.fit(X_pca)

y = model.predict(X_pca)

And visualize the result by scatter plot :
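The plotting helper is in the repository; a minimal sketch of the scatter plot, colouring each observation by its predicted cluster:

import matplotlib.pyplot as plt

# Plot the two PCA components, coloured by the predicted K-Means cluster.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=10)
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            c='red', marker='x', s=100, label='centroids')
plt.xlabel('PCA component 1')
plt.ylabel('PCA component 2')
plt.legend()
plt.show()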

We can see that the observations are tightly grouped (this confirms our analysis in the Data Exploration section earlier) and that the clusters are reasonably “natural”.

Let’s see what those clusters mean in the context of the answers given by the students. First, let’s check the number of observations belonging to each cluster :

>>> print(collections.Counter(y))
Counter({2: 2358, 1: 2222, 0: 1240})

Next, let’s compare with the clustering result without applying PCA. We simply train the model using the original features (all 28 dimensions) :

# model_k is a new KMeans instance with the same parameters as before
# (n_clusters=3, init='k-means++')
model_k = model_k.fit(X)

y_final = model_k.predict(X)

Notice that we fit the model with X instead of X_pca. Let’s compare the number of students in each cluster and check the difference :

Final K Means Result (no PCA) : 
Counter({1: 2358, 2: 2223, 0: 1239})

Important detail here : somehow, y and y_final ended up with different label mappings. The observations with y==1 are the same observations as those with y_final==2. I confirmed this by generating a csv for each of them. You can check them in output/raw_result.csv and output/raw_result_pca.csv

Only one observation ended up in a different cluster compared to the result with PCA (it moved from the 1st cluster to the 2nd). This means that reducing the data to 2 dimensions with PCA does not decrease the clustering performance significantly.

Let’s compute the average answer of each cluster to see the difference between them :
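The exact computation lives in the repository; a minimal sketch of one way to do it, assuming X holds the 28 question columns and y_final holds the cluster labels from the no-PCA run:

import numpy as np

# Average answer and spread of the 28 questions per cluster (labels 0, 1, 2).
for label in np.unique(y_final):
    answers = X[y_final == label]  # all question answers of the students in this cluster
    print('Mean of cluster {} : {}, STD : {}'.format(
        label + 1, answers.values.mean(), answers.values.std()))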

Mean of cluster 1 : 1.43439409662, STD : 0.470279735218
Mean of cluster 2 : 3.0183418151, STD : 0.302760863977
Mean of cluster 3 : 4.34051474841, STD : 0.466456137353

We can conclude that the K-Means algorithm, combined with the Elbow method, successfully clustered the students based on their average answers on the survey.
Note that “cluster 1” refers to the observations with label 0 (y_final == 0), and so on.

Now that we know the cluster of each observation, we can analyse those clusters more deeply to identify correlations with other features/variables. For example, we can see that the majority (868/1239 = 0.701) of dissatisfied students (the 1st cluster) were taught by instructor number 3.

He seems to be the culprit; let’s question him! But wait a minute: recall that most of the students (3601) in this survey were also taught by this instructor, so the calculation above is not fair to him (and probably misleading). Let’s calculate the percentage of his students who are dissatisfied instead : 868/3601 = 0.241.
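A cross-tabulation of instructor against cluster label makes calculations like these straightforward. A short sketch, again assuming the instructor column is named instr:

import pandas as pd

# Rows: instructor, columns: K-Means cluster label, cells: number of students.
cluster_by_instr = pd.crosstab(df_train['instr'], pd.Series(y_final, name='cluster'))
print(cluster_by_instr)

# Share of instructor 3's students that fall into the dissatisfied cluster (label 0).
print(cluster_by_instr.loc[3, 0] / cluster_by_instr.loc[3].sum())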

My point is, the university might be able to use these clusters as additional information. For example, they might want to conduct deeper interviews with students from the first cluster to find out why they are not satisfied with the teaching process, and see if there is any correlation with other variables.

3. Training Models : Agglomerative Clustering

You can check the complete code in training/TraingAgglClustWard.py

Next, we will explore the Agglomerative Clustering algorithm. One advantage of this algorithm is its ability to produce dendrograms (we will plot one later), which can help us interpret the clustering result. Another advantage of the dendrogram is that it tells us the best number of clusters (k), so we don’t need to try out each value of k using the Elbow method.

In a nutshell, Agglomerative Clustering starts with each observation as an individual cluster and merges clusters based on their distance (similarity), pair by pair, iteratively.

There are many ways to compute the similarity of clusters; we will focus on Ward’s linkage in this article. Ward’s linkage merges the pair of clusters that leads to the minimum increase in the Sum of Squared Errors (SSE).
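For illustration, the increase in SSE caused by merging two clusters can be computed directly from their sizes and centroids; scipy does this internally, so the sketch below is only meant to make the criterion concrete:

import numpy as np

def ward_merge_cost(cluster_a, cluster_b):
    # Increase in SSE caused by merging two clusters (Ward's criterion):
    # (n_a * n_b / (n_a + n_b)) * ||centroid_a - centroid_b||^2
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    n_a, n_b = len(a), len(b)
    diff = a.mean(axis=0) - b.mean(axis=0)
    return (n_a * n_b) / (n_a + n_b) * np.dot(diff, diff)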

Similar to the K-Means implementation earlier, we will apply PCA for visualisation purposes. Let’s start by generating the dendrogram :

import scipy.cluster.hierarchy as hier
import matplotlib.pyplot as plt

dendrogram = hier.dendrogram(hier.linkage(X_pca, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('students')
plt.ylabel('Euclidean distances')
plt.show()

The code above will generate the following dendrogram :

Notice that there are two colours : green and red (excluding blue, which is the root). This means the best number of clusters k according to Ward’s linkage (method = ‘ward’) is two.

The best number of clusters will be different if we change the linkage algorithm. For example, in the source code I also tried complete linkage, which led to 4 clusters.

Let’s fit the model with the best k we found earlier :

from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters = 2,
                                affinity ='euclidean',
                                linkage ='ward')
y = model.fit_predict(X_pca)

And check the clusters by scatter plot :

We will need to do further investigation to decide whether this clustering is appropriate for this dataset. Let’s print the number of observations in each cluster :

Counter({0: 3502, 1: 2318})

Around 60% of the students belong to the first cluster. Similar to the K-Means implementation, we are going to check the difference between the two clusters by comparing their means :

Mean of cluster 1 : 2.44248184711, STD : 0.829040130493
Mean of cluster 2 : 4.30968815481, STD : 0.480492591784

I prefer the result of K-Means with k=3 from the Elbow method because having three different groups seems more meaningful (Dissatisfied, Neutral, and Satisfied). The smaller number of clusters (Dissatisfied/Neutral & Satisfied) from Ward’s linkage leads to a larger gap between the means of the clusters.

4. Conclusion

First things first, let’s summarize what we have explored :

  1. Use PCA to visualize the clustering result.
    Checked. We used PCA both on K-Means and Agglomerative Clustering.
  2. Compare the clustering result with and without PCA.
    Checked. Only one observation moved from the first cluster to the second. This suggests that applying PCA does not decrease the clustering power significantly.
  3. Define the number of clusters using the Elbow method & a Dendrogram.
    Checked. We practiced using both in this article.
  4. Compare the result of K-Means vs Agglomerative Clustering.
    Checked. Both algorithms give very different clustering results, with different numbers of clusters.
  5. Do deeper analysis with the clustering result.
    Checked. I applied a similar analysis for both algorithms.

Ultimately, the decision whether to have two or three clusters of students depends on the business context. Do you want to treat the “Neutral” and “Dissatisfied” groups differently? If the answer is yes, then we stick with the K-Means result; if not, the two clusters from Ward’s linkage will do.

Thanks for reading this article.
