Using machine learning to decide which students need early intervention

Nico
Coinmonks
9 min read · Jul 31, 2018


Schools are a place where we should build a safe environment for students to support their growth potential.

Introduction

There are amazing data products across different industries. On one side, many industries have been reshaped by the industrial revolution or by recent advances in deep learning. On the other side we have the education sector, where we have been using the same teaching and learning methods for decades. Not a lot has changed over the years, and students usually don’t get the help they deserve. At least, that is what I have observed in Germany.

Usually, the worst-performing students in schools don’t get the help they need. I am not sure what the root cause of this problem is, but I experienced the same issue when I was back in school. That is a topic for another blog post ;).

In this blog post, I want to build a student intervention system using machine learning, with a dataset provided by Udacity in one of their Nanodegree programs.

What kind of task is this — classification or regression?

Before we start to explore the data, we first need to define the output of the project. Let’s recall the task: we need to build a model for student intervention. As a teacher, I would expect to upload my students’ data periodically and learn which students need intervention. In that case, I would like to receive a list of students (perhaps by name) with a flag indicating whether or not they need intervention. Based on this observation, we will build a classification model.

What does the data look like?

The dataset used in this blog post contains various information about students. In detail, it includes social, demographic and school-related features, as well as information about the students’ grades. The dataset was collected from two Portuguese schools using school reports and questionnaires. The subjects for the performance evaluation are physics and math. There is also a “passed” label which tells whether a student passed or failed the exam. In total there are 395 observations and 31 attributes.

Let’s explore the data

Let’s start with the data exploration. Our goal in this step is to look more closely at the data in an explorative way. We will analyse and visualise the data. At this step we should also take a look at the quality of the data, in the sense of:

  1. Are there missing values?
  2. What types of data do we have?
  3. Do we need to clean the data?

We will answer these questions step by step. But first, let’s take a look at some attributes of the data.

The first thing we take a closer look at is the age of the students. The youngest students are 15 years old, while the oldest are 22. Most of the students are between 15 and 18 years old. Students aged 19 and older are not very common in the school; they represent a minority of all students. I could imagine that these students failed some classes, which would explain why we have such samples in the dataset.
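A quick sketch of how such a distribution can be computed with pandas. The ages below are a small hypothetical stand-in, not the real column from the dataset:

```python
import pandas as pd

# Hypothetical stand-in for the real "age" column of the student dataset.
ages = pd.Series([15, 16, 16, 17, 17, 18, 19, 22], name="age")

# Count how many students fall into each age group.
distribution = ages.value_counts().sort_index()
print(distribution)
print("min:", ages.min(), "max:", ages.max())
```

With the real dataset, `ages` would simply be `df["age"]`.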

Age of the students

Fortunately, there is an attribute called failures which describes the number of failed classes. This attribute can take the values 0, 1, 2 and 3. As we can see in the diagram below, every age group contains students who failed a class. It is very interesting that most of the students aged 19 and older have failed a class; only a minority of students in these groups didn’t fail any class.
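One way to reproduce this comparison is a cross-tabulation of age against failure status. Again, the data below is a tiny hypothetical sample, just to show the mechanics:

```python
import pandas as pd

# Hypothetical sample of the "age" and "failures" attributes.
df = pd.DataFrame({
    "age": [15, 16, 17, 19, 20, 22],
    "failures": [0, 0, 1, 2, 1, 3],
})

# Has each student failed at least one class?
df["failed_any"] = df["failures"] > 0

# Cross-tabulate failure status by age, as in the diagram.
ct = pd.crosstab(df["age"], df["failed_any"])
print(ct)
```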

Failed/passed class by age

I showed you two sample visualisations. At this point, I don’t want to dive deeper into the exploration because there are many possibilities to explore this dataset in more detail. If you are interested in more visualisations, just take a look at the notebook.

Through the explorative analysis we have learned a lot about the dataset. Now it is time to answer the questions of this section:

  1. Are there missing values? While exploring, I didn’t notice any missing values or NaNs in the data. I assume that the dataset was previously cleaned by Udacity.
  2. What types of data do we have? In the dataset, we have binary, nominal and numeric attributes. One example of a binary attribute is “sex”, which has the values “F” for female and “M” for male. An example of a nominal attribute is “Mjob”, which refers to the job of the student’s mother and is divided into the categories “teacher”, “health”, “civil services”, “at home” and “other”. One example of a numeric attribute is the “age” of the student.
  3. Do we need to clean the data? We don’t have to clean the data, since there are no missing values or outliers. However, before training an algorithm, there is still some work to do: the feature processing. The binary as well as the nominal data has to be encoded into numeric values. We will explore this in the next section.
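The first two checks can be done in a couple of lines with pandas. The miniature frame below is a hypothetical slice of the dataset, just to illustrate the calls:

```python
import pandas as pd

# Hypothetical miniature version of the student dataset.
df = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Mjob": ["teacher", "health", "other"],
    "age": [15, 17, 18],
})

# 1. Missing values per column (all zero in the cleaned dataset).
print(df.isnull().sum())

# 2. The data type of each column.
print(df.dtypes)
```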

Preparation of the data

This section deals with the preparation of the data. We will address the feature processing of the nominal and binary data. For categorical data, we usually have to apply some kind of encoding.

We have 30 feature attributes and 1 target attribute.

  • numeric attributes: age, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health, absences
    Since these attributes already have a numeric representation, we don’t have to apply any kind of transformation.
  • binary attributes: school, sex, address, famsize, schoolsup, famsup, paid, nursery, higher, internet, romantic, passed
    The binary attributes have either the value “yes” or “no”. We will convert “no” to 0 and “yes” to 1.
  • nominal attributes: Mjob, Fjob, reason, guardian
    For the nominal attributes we have a wide range of possible values. The pandas library provides the method get_dummies(), which transforms every value of a nominal attribute into a new column. E.g. the attribute guardian has the values “mother”, “father” and “other”; applying the method adds the new columns “guardian_mother”, “guardian_father” and “guardian_other”. The values of the new columns are binary.
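Both encodings can be sketched on a small hypothetical sample with one binary and one nominal attribute:

```python
import pandas as pd

# Hypothetical sample with one binary and one nominal attribute.
df = pd.DataFrame({
    "schoolsup": ["yes", "no", "yes"],
    "guardian": ["mother", "father", "other"],
})

# Binary attribute: map "no" -> 0 and "yes" -> 1.
df["schoolsup"] = df["schoolsup"].map({"no": 0, "yes": 1})

# Nominal attribute: one new binary column per category.
df = pd.get_dummies(df, columns=["guardian"])
print(df.columns.tolist())
# ['schoolsup', 'guardian_father', 'guardian_mother', 'guardian_other']
```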

After the transformation of the attributes, we have a total of 48 attributes. The next step is to split the data into training and test sets: 75% train and 25% test. I used the train_test_split method from the sklearn framework.
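The split itself is a one-liner. The random feature matrix below stands in for the encoded dataset (395 rows, 48 columns); the seed is an assumption for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the encoded feature matrix and target vector.
X = np.random.rand(395, 48)
y = np.random.randint(0, 2, size=395)

# 75% train / 25% test, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)
```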

Model selection — What model has the best performance?

There are three machine learning algorithms which we will train and compare with each other. First, I will describe the algorithms, and then I will train and compare them:

  1. Naive Bayes is a classification algorithm that uses the probabilities underlying Bayes’ theorem to map the input to a certain output. The model is trained using maximum likelihood for parameter estimation.
    One advantage of Naive Bayes is that it requires only a small amount of data to estimate the parameters for classification, in comparison to other machine learning algorithms.
    The biggest disadvantage of Naive Bayes shows up when there are no occurrences of a certain class label together with a certain attribute value, because the computed probability is then zero. This is caused by the assumption of conditional independence: when a probability of zero is multiplied with the other probabilities, the result will be zero. In practice this is usually mitigated with smoothing.
  2. K-Nearest-Neighbours (KNN) is used for classification and regression. The input is compared with its k nearest neighbours to determine the output, and all features are treated equally. Since the labelled data forms regions in feature space, KNN is fairly robust to noisy training data; this can be improved further by weighting the distances. However, KNN becomes less effective when there are many features, unless enough data is provided (the curse of dimensionality). Before training KNN, a decision has to be made about the number of nearest neighbours. At prediction time it is a very expensive algorithm in comparison to others, because each query has to be computed against the training data.
  3. A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. The underlying technique is the kernel trick, which transforms the data so that boundaries between the possible outputs can be found. Depending on the kernel used, the boundaries are either linear or non-linear. On the one hand, a non-linear boundary can capture much more complex relationships in the data; on the other hand, training then has a higher latency. The algorithm is very effective in high-dimensional spaces because of its ability to model complex relationships. However, if the number of features is much greater than the number of samples in the dataset, SVM will perform poorly; this limitation comes from the choice of the kernel. If we have complex relationships in the data, SVM does a very good job. Its behaviour is a bit like a black box, because it is very difficult to interpret the boundary plane.

I am using sklearn for training and testing the models. We will use GaussianNB as the implementation of Naive Bayes. For KNN, we use a neighbourhood size of 5. For the SVM, we use the Support Vector Classifier (SVC) provided by sklearn.
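The training loop for the three models can be sketched as follows. Since the real dataset is not included here, the sketch uses a synthetic classification problem of the same shape (395 samples, 48 features):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data; the real features come from the encoded dataset.
X, y = make_classification(n_samples=395, n_features=48, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

models = {
    "GaussianNB": GaussianNB(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(),
}

# Fit each model and report its F1 score on the test set.
for name, model in models.items():
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))
    print(f"{name}: F1 = {score:.3f}")
```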

We are interested in different metrics, such as training time, prediction time and the F1 score. The F1 score is the harmonic mean of precision and recall. If you have never heard of the F1 score, it is important to know that the closer the F1 score is to 1.00, the better the model. For further information, I suggest taking a look at the Wikipedia article. The table below shows the results of training and testing the algorithms.
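To make the metric concrete, here is the F1 score computed on two tiny hypothetical label vectors, once via sklearn and once by hand from precision and recall:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Tiny hypothetical label vectors to illustrate the metric.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)

# The F1 score is the harmonic mean of precision and recall.
print(f1)                    # 0.75
print(2 * p * r / (p + r))   # same value, computed by hand
```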

Results

The GaussianNB model has the best F1 score on the test data, followed by the KNN model. The SVC has the worst performance on the test data. The training and prediction times are very low across all algorithms; the main reason is the small number of data samples we are using. If we increased the data size, we would probably end up with the same ordering of the algorithms with respect to training and prediction time.
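One simple way to measure these times is with `time.perf_counter`, shown here for GaussianNB on random stand-in data:

```python
import time

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Random stand-in for the encoded training data.
X = np.random.rand(300, 48)
y = np.random.randint(0, 2, size=300)

model = GaussianNB()

# Measure training time.
start = time.perf_counter()
model.fit(X, y)
train_time = time.perf_counter() - start

# Measure prediction time.
start = time.perf_counter()
model.predict(X)
predict_time = time.perf_counter() - start

print(f"train: {train_time:.4f}s, predict: {predict_time:.4f}s")
```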

Lessons learned & looking ahead

In this blog post, we learned a lot about machine learning and its wide range of applications. We explored a simple dataset about the performance of students, which included social, demographic and school-related features. Based on this dataset, we applied some transformations and built a model that was trained to predict whether a student needs help.

Concerning the machine learning, we could also try out other algorithms and compare the results, for example decision trees or neural networks. We could also look for other datasets and combine them with the existing data.

Now we could build a service that teachers can use to predict which students need more help. In today’s classrooms it is hard for a teacher to build a deep relationship with every student. By using data and such a system, teachers may be able to spend more time with the students who are not doing that great in school. It is time to bring more innovation to the education sector!

If you are interested in the source code, don’t hesitate to take a look at my repository on GitHub.

I hope you enjoyed reading this article. I appreciate any feedback in the comments. Have a great day! Nico
