# Practicing Clustering Techniques on Survey Dataset

Hello All,

We have practiced Regression problem last time, which is a category of Supervised learning. You can check the article **here ****.**

Today, we will explore a different category of Machine Learning, the Unsupervised learning algorithms. As the name suggest, we don’t have the luxury of knowing the right answer (target variable) upfront.

We will practice clustering using student evaluation survey dataset. Our goal is to group the students based on the similarity of their answers on the survey. Notice that we don’t know how many cluster (group) of students will be. In fact, we will use different methods of clustering to decide the best “natural” number of group of this dataset.

I used the word “natural” because theoretically we can group the observations with any number of group, as long as the number of group is smaller or equal the number of observations. The clusters will need to follow observation’s pattern to be “natural”. In this article, we will cross-check the result of our clustering by each algorithms to decide the best “natural” cluster.

**You can download the complete source code ****here**

This article will explore the following concepts :

- Use
*PCA*to visualize the clustering result. - Compare the clustering result with and without
*PCA*. - Define the number of clusters using
*Elbow method*&*Dendogram*. - Compare the result of
*K-Means*vs*Agglomerative Clustering*. - Do deeper analysis with clustering result.

**Outline**

- Data Exploration
- Data Preparation
- Training Models (
*K-Means & Agglomerative Clustering*) - Conclusion

**Data Exploration**

You can check the meaning of each questionshere

First things first, let’s check the dimension of this dataset :

>>> df_train.shape

(5820, 33)

We have **5820** observations (students) with **33** features (28 survey questions and 5 attributes).

We can check the first 5 features with :

df_train.head()

And basic statistical attributes with :

df_train.describe()

Although we can conveniently do a quick check on the dataset with the methods above, I prefer to open the *csv *directly to do some basic analysis. We can see that this dataset have 3 Instructors, who taught 13 classes.

Let’s do some analysis to understand this dataset better. We can compare the number of responses by those 3 Instructors :

Most student taught by the third Instructor, but this don’t necessarily mean he taught most of the classes. There could be a large gap of the number of students between each classes, so let’s check it :

The number of responses of some classes is clearly much lower compared to the rest. Doing a little bit more analysis, we could see that the 3rd instructor taught 7 classes : class no 3, 4, 5, 8, 9, 12, and 13.

Let’s move on to the series of questions answered by the students. There are 28 questions, each answered from 1 (very bad) to 5 (very good). The first thing I would like to know is the grand mean of all questions answered by all students :

X_questions = df_train.iloc[:,5:33]

question_means = X_questions.mean(axis = 0)

grand_mean = question_means.mean()

We got a middle value grand mean (3.186156). Let’s zoom in to the mean of each question :

The mean of each question tend to be around the middle value (3). Looking at the graph above, the standard deviation should be small. Let’s do a quick check to be sure :

std_by_questions = question_means.std()

The standard deviation is 0.109106, which is reasonably small and inline with above graph. This mean that the quality of each class is more or less the same according to the survey answers.

**2. Data Preparation**

We are lucky that the Q1-Q28 features are *Likert-type, *which are scaled from 1–5. Hence, the only thing we can fit them to our models directly. Let’s remove all non-question features because we will group the student based on their answer in the survey :

X = df_train.iloc[:,5:33]

Recall that we don’t know how many cluster will be on this data and whether the clusters our model generated will be “natural” or not, so we will need to confirm it by using visualization.

However, we have 28 features that will be 28 dimensional space. It will be very hard for us to visualize (and understand) this much dimensions, so people usually use *dimensionality reduction* methods to transform them to 2 dimensional space (*x, y*).

Specifically, we will use *Principal Component Analysis* (*PCA*) to do the dimensionality reduction. *PCA* can help us to identify patterns based on the correlation between features. This algorithm aims to find maximum variance using fewer dimension than the original data.

Under the hood, *PCA* will draw the orthogonal axes (principal component) as the direction of maximum variance. The image below illustrate this concept :

One drawback of using *PCA* is the new features generated by *PCA* (*x, y* in this case) will have less variance than the original features. This means we might lose some information and lower the predictive power of our model.

Hence, in this article we will use *PCA* to visualize the clustering result and then remove the *PCA* (use default features) and compare the difference outcome between them.

pca = PCA(n_components = 2, random_state=1)

X_pca = pca.fit_transform(X)

*X_pca* will have 5820 observations with only 2 features (dimensions). Let’s check the amount of variance we preserved from the original data :

>>> pca.explained_variance_ratio_.cumsum()[1]

0.86713816788910847

We managed to save around 87% of the variance, which is not very good but not bad either (people tend to prefer *explained variance* ≥ 90%). We don’t have to worry to much about this because we will not apply *PCA* for the final clustering result (and we will compare their result).

Another disadvantage of *PCA* is it transforms the features into abstract features. This mean we will lose data interpretability, hence making us unable to answer many business-related questions.

**3. Training Models : K-Means**

You can check the complete code intraining/TrainKMeans.py

Let’s roll our slevees up and start by one of the most popular clustering algrorithm : The* K-Means* clustering. In a nutshell, this algorithm will cluster each observation based on their distance to the nearest *centroid*.

*Centroid* is randomly picked from the observations (a sample/row) initially and will be moved on each iteration based on the *mean* (for continuous features) or the *modus* (for categorical features) of the observations that were assigned to it. The number of *centroid* will be the number of the clusters.

The inner working of this algorithm is summarized by the following steps :

- Pick the number of cluster (we will use
*Elbow method*). Let’s call this number*k*. - Randomly pick
*k*observations as initial*centroids.* - Assign each observation based on the nearest
*centroid*(computed by Euclidean distance). - Move the centroid to the center (average) of the observations that were assigned to it.
- Repeat step 3 & 4 until the cluster do not change (convergence) or maximum iteration is reached.

The clustering result of this algorithm depend on the random intial *centroids*. This mean the result might not be the same if we run this algorithm multiple times. Fortunately, the *K-Means* model in *sklearn* have implemented more advanced version of this algorithm, called *K-Means++*.

*K-Means++ *will place the initial centroids far away from each other, which lead to faster convergence and more consistent result. This is the default value of *init* parameter.

The next thing on our to do list is to perform Elbow method. This method allow us to pick the best number of clusters (*k*) by computing the *Sum of Squared Error *of each cluster (also called *distortion*). Smaller value of *SSE* mean the distance between *centroid* and each observation within the cluster is closer.

Our goal is to identify the *k *number of cluster where the distortion decrease most rapidly. In this article, we will try fitting the *K-Means* model with *k *from 1–5, and compare the change of distortion of each *k* :

distortions = []

K_to_try = range(1, 6)

for i in K_to_try:

model = KMeans(

n_clusters=i,

init='k-means++',

n_jobs=-1,

random_state=1)

model.fit(X_pca)

distortions.append(model.inertia_)

plt.plot(K_to_try, distortions, marker='o')

plt.xlabel('Number of Clusters (k)')

plt.ylabel('Distortion')

plt.show()

Notice that we put the model in the *for* loop and use* n_clusters=i *to iterate through each value of* K_to_try *(*range(1, 6)* will generate a list of integers of 1–5). The code above will generate the following graph :

The “elbow” seems to be located at *k* = 3. This mean the distortion will not be decreased significantly if we tell the algorithm to use a larger number of clusters, hence the clustering result might be too complicated and not “natural”.

Let’s train the *K-Means* with 3 clusters :

# use the best K from elbow method

model = KMeans(

n_clusters=3,

init='k-means++',

n_jobs=-1,

random_state=1)

model = model.fit(X_pca)

y = model.predict(X_pca)

And visualize the result by scatter plot :

We could see that the observations are tightly grouped (this confirmed our analysis in the Data Exploration earlier) and the clusters are reasonably “natural”.

Let’s see what those clusters mean in the context of answers given by those students. Let’s check the number of observations belonged to each cluster :

>>> print(collections.Counter(y))

Counter({2: 2358, 1: 2222, 0: 1240})

Next, let’s compare the clustering result without applying *PCA. *We simply train the model using the default features (with 28 dimensions) :

model_k = model_k.fit(X)

y_final = model_k.predict(X)

Notice that we fit the model with *X* instead of *X_pca*. Let’s compare the number of students in each cluster and check the difference :

Final K Means Result (no PCA) :

Counter({1: 2358, 2: 2223, 0: 1239})

Important detail here: Somehow, y and y_final mapped with different label. The observations with y==1 is the same observations as y_final==2. I confirmed this by generating a csv for each of them. You can check it inoutput/raw_result.csvandoutput/raw_result_pca.csv

Only one observation has different cluster compared to the result with *PCA *(an observation moved from the 1st cluster to the 2nd). This mean reducing the data to 2 dimensions by *PCA* don’t decrease the clustering performance significantly.

Let’s compute the average answer of each cluster to see the difference between them :

Mean of cluster 1 : 1.43439409662,STD :0.470279735218

Mean of cluster 2 : 3.0183418151,STD :0.302760863977

Mean of cluster 3 : 4.34051474841,STD :0.466456137353

We can conclude that the *K-Means* algorithm, combined with *Elbow* method successfully cluster the students based on their average answer on the survey.

Note that cluster 1 is observations with *y*=0 and so on.

Having known the cluster of each observation, we can analyse those clusters deeper to identify their correlation with other features/variables. For example, we can see that the majority (868/1239 = **0.701**) of dissatisfied students (the 1st cluster) were taught by Instructor number 3.

He seems to be the culprit, let’s question him! But wait a minute, recall that most of the students (3601) in this survey were also taught by this Instructor, so the calculation above is not fair to him (and probably misleading). Let’s calculate the percentage of dissatisfied students taught by him instead : 868/3601 = **0.241**.

My point is, the University might be able to use these clusters as additional information, for example they might want to perform deeper interview with students from the first cluster to find out why they don’t satisfy with the teaching process, and see if there any correlation with other variables.

**3. Training Models : Agglomerative Clustering**

You can check the complete code intraining/TraingAgglClustWard.py

Next, we will explore *Agglomerative Clustering* algorithm. One advantage of using this algorithm is it’s ability to plot *dendograms *(we will plot it later), which can help us to interpret the clustering result. Another advantage of using *dendogram* is it will tell us the number of best cluster (*k*) hence we don’t need to try out each value of *k *using Elbow method.

In a nutshell, *Agglomerative Clustering *will assign each observation as individual cluster and merge those clusters based on their distance (similarity) pair by pair, iteratively.

There are many ways to compute the similarity of the clusters, we will focus on *Ward’s linkage* algorithm in this article. *Ward’s linkage* will merge clusters that lead to minimum increase of *Sum of Square Error* (SSE).

Similar with *K-Means* implementation earlier, we will apply *PCA* for visualisation purpose. Let’s start by generating the *dendogram* :

dendrogram = hier.dendrogram(hier.linkage(X_pca, method = 'ward'))

plt.title('Dendrogram')

plt.xlabel('questions')

plt.ylabel('Euclidean distances')

plt.show()

The code above will generate the following *dendogram* :

Notice that there are two colours : green and red (exclude blue, which is the root). This mean the best *k* number of clusters according to *Ward’s linkage *(*method = ‘ward’*)* *is two.

The best *k* number of clusters will be different if we change the *linkage *algorithm. For example, in the source code I also tried *complete linkage *which lead to 4 clusters.

Let’s fit the model with the best *k* we found earlier :

model = AgglomerativeClustering(n_clusters = 2,

affinity ='euclidean',

linkage ='ward')

y = model.fit_predict(X_pca)

And check the clusters by scatter plot :

We will need to do further investigation to decide whether this clustering is appropriate for this dataset. Let’s print the number of each cluster :

Counter({0: 3502, 1: 2318})

Around 60% students belong to the first cluster. Similar to *K-Means *implementation, we are going to check the difference between the two clusters by checking their *mean* :

Mean of cluster 1 : 2.44248184711,STD :0.829040130493

Mean of cluster 2 : 4.30968815481,STD :0.480492591784

I prefer the result of *K-Means *with *k=3* from *Elbow* method because having three different groups seems to be more meaningful (Dissatisfied, Neutral, and Satisfied). Smaller number of clusters (Dissatisfied/Neutral & Satisfied) from the *Ward’s linkage* lead to a larger gap between the *mean* of each cluster.

**4. Conclusion**

First things first, let’s summarize what we have explored :

- Use
*PCA*to visualize the clustering result..*Checked**We used PCA both on K-Means and Agglomerative Clustering.* - Compare the clustering result with and without
*PCA*..*Checked**Only one observation moved from the first cluster to the second. This suggest that applying PCA do not decrease the clustering power significantly.* - Define the number of clusters using
*Elbow method*&*Dendogram*..*Checked**We practiced using both in this article.* - Compare the result of
*K-Means*vs*Agglomerative Clustering*..*Checked**Both algorithm give very different clustering result, with different number of clusters.* - Do deeper analysis with clustering result.
*Checked**. I applied similar analysis for both algorithm.*

Ultimately, the decision whether to have two or three clusters of students depend on the **business decision**. Do you want to treat the “Neutral” and “Dissatisfied” group differently? If the answer is yes, then we will stick with *K-Means *clustering, and vice versa.

Thanks for reading this article.

References :

**Documentation scikit-learn: machine learning in Python - scikit-learn 0.19.1 documentation**

*Edit description*scikit-learn.org

**Python Machine Learning - Second Edition | PACKT Books**

*Unlock modern machine learning and deep learning techniques with Python by using the latest cutting-edge open source…*www.packtpub.com

http://archive.ics.uci.edu/ml/datasets/Turkiye+Student+Evaluation

https://www.kaggle.com/blazetama/clustering-use-case-using-turkey-student-data