Using regression techniques to predict a student’s grade for a course

Danilo Najkov
5 min read · Jun 22, 2022


This post explores several machine learning models for predicting a student's grade in a Probability and Statistics course based on their previous grades. We will also take a look at how to deploy these models to production with Python for free.

Contents:

  • Exploring the dataset
  • Preparing the data
  • Exploring different models:
    – Deep learning regression
    – Random forest regressor
    – Support vector regressor
  • Comparing and evaluating the models
  • Creating a web API that uses the model

The dataset

The dataset includes 226 rows (different students), each with 8 features and the class value. Since data about final course grades isn't available, the target is derived from the initial theory grades with some modification.

Features

The features are the student's grades in past courses from the first and second semesters, 8 in total: VVKN, SP, DM, ONVD, OOP, K, AOK, AIPS.

Missing values

Some students don't take all of these courses, so the missing value is generated as an average of similar courses.

Example: for students who don't take the DM course, the value is generated as the average of DS1 and DS2, two courses that cover the same material in more depth.
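A minimal sketch of this imputation in pandas, assuming the raw spreadsheet also contains DS1 and DS2 columns for the students who took those courses:

import pandas as pd

# Assumption: the raw spreadsheet has DS1 and DS2 columns for students
# who took those two courses instead of DM.
raw = pd.read_excel('ocenki.xlsx', header=0)

# Fill missing DM grades with the average of DS1 and DS2
raw['DM'] = raw['DM'].fillna(raw[['DS1', 'DS2']].mean(axis=1))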

Preparing the data

Fortunately, the data is relatively clean: a grade can only take one of six values, 5–10 (F–A). The values are already integers, so we don't need any kind of encoding, and we don't need to scale the data either (we'll verify this right after loading it).

Let's start with the necessary imports, read the dataset, and remove the unnecessary columns.

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Load the data and drop the identifying columns
dataset = pd.read_excel('/content/ocenki.xlsx', header=0)
dataset = dataset.drop(['Name', 'Id'], axis=1)

# Shuffle the rows
dataset = dataset.iloc[np.random.permutation(len(dataset))]
dataset.head()
Part of the dataset
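As a quick sanity check for the claim above, we can confirm that every column really contains only integer grades from 5 to 10 (a small sketch; adjust it if your spreadsheet has extra columns):

# Every value should be an integer grade between 5 and 10,
# so no encoding or scaling is required.
assert dataset.isin(range(5, 11)).all().all()
print(dataset.dtypes)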

Correlation between the features and the class

corr = dataset.corr()
corr.style.background_gradient(cmap='coolwarm')
Correlation between the columns

As we can see, there is high correlation between the features. This redundancy can slow down training, but since the features are also highly correlated with the class we are predicting, we can't remove them. The dataset is also relatively small, so it shouldn't be a big issue.

We can see that there is a high correlation between the Probability and Statistics course (VIS) and other mathematical courses like DM and K.

It also makes sense that students who get a better grade in VIS care about their grades in general and tend to score higher in all the other courses.

# Split the data into train and test sets
features = dataset.drop('VIS', axis=1)
classes = dataset['VIS']
X_train, X_test, y_train, y_test = train_test_split(features, classes, test_size=0.2)

Models

Deep learning regression

Although the dataset is very small for using deep learning models, we can still get good results with this technique.

I will be using Keras and TensorFlow to train a deep neural network with 2 hidden layers, mean squared error loss, and the RMSprop optimizer.

# Build the model with 2 hidden layers
model = Sequential()
model.add(Dense(12, activation='relu', input_shape=(8,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

model.compile(loss='mse', optimizer=tf.keras.optimizers.RMSprop(), metrics=['mae', 'mse'])

# Fit for 200 epochs
history = model.fit(X_train, y_train, epochs=200, validation_data=(X_test, y_test), verbose=1)

Let's graph the error and the loss during training, then evaluate the model.

import matplotlib.pyplot as plt


def plot_graph_loss(history):
    # summarize history for error
    plt.plot(history.history['mae'])
    plt.plot(history.history['val_mae'])
    plt.title('model error')
    plt.ylabel('error')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()

# graph error and loss, then evaluate on the test set
plot_graph_loss(history)
model.evaluate(X_test, y_test, batch_size=128)
MAE during training
Loss during training
1/1 [==============================] - 0s 22ms/step - loss: 0.8523 - mae: 0.6958 - mse: 0.8523

We are getting a mean absolute error of about 0.69 with this approach.

We also need to save the model so we can deploy it in an API. Since I am using Google Colab, I can easily save it to Google Drive.

# Save to Google Drive
from google.colab import drive
import joblib

drive.mount('/content/gdrive')
model.save('/content/gdrive/My Drive/VIS_Predictors/deep_learning')
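To double-check the export, the model can be loaded back and used for a quick prediction (a sketch reusing the path from above):

from tensorflow import keras

# Load the SavedModel directory written by model.save above
loaded = keras.models.load_model('/content/gdrive/My Drive/VIS_Predictors/deep_learning')
loaded.predict(X_test[:1])  # predict the grade of a single student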

Random forest regressor

Initialize a random forest with 100 decision trees and train it on the same data.

from sklearn.ensemble import RandomForestRegressor

# Init model with 100 trees
forest_model = RandomForestRegressor(n_estimators=100)
forest_model.fit(X_train, y_train)

A helper function to calculate the mean absolute error (MAE) on the test set.

def mae(real, pred):
    # average absolute difference between true and predicted grades
    score = 0
    for x, y in zip(real, pred):
        score += abs(x - y)
    return score / len(real)

mae(y_test, forest_model.predict(X_test))

We are getting an MAE of 0.58.
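A nice side effect of random forests is that we can inspect which past courses drive the prediction. This wasn't part of the original analysis, but it's a quick sketch:

import pandas as pd

# Relative importance of each past course in the forest's splits
importances = pd.Series(forest_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))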

Let’s also save this model.

joblib.dump(forest_model,'/content/gdrive/MyDrive/VIS_Predictors/random_forest')

Support vector regressor

Initialize a Support Vector Regressor with C=1 and epsilon=0.2.

from sklearn.svm import SVR

svr = SVR(C=1.0, epsilon=0.2)
svr.fit(X_train, y_train)
mae(y_test, svr.predict(X_test))

We are getting an MAE of 0.57, the best so far.
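The C and epsilon values above were hand-picked. A small grid search could squeeze out a little more; the grid values below are illustrative, not tuned:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Search a small grid around the hand-picked C=1.0, epsilon=0.2
grid = GridSearchCV(
    SVR(),
    param_grid={'C': [0.5, 1.0, 2.0, 5.0], 'epsilon': [0.1, 0.2, 0.5]},
    scoring='neg_mean_absolute_error',
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)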

# save the model
joblib.dump(svr,'/content/gdrive/My Drive/VIS_Predictors/svr')

Comparing the models

Considering this is a small, limited dataset, it makes sense that all the models produce similar mean absolute errors on the test set. Support vector regression has the best MAE at 0.57. With some hyperparameter tuning, the deep learning model could probably improve on its 0.69.

These models could be further improved with feature engineering: combining similar courses into one feature, extracting the gender of the student, or using the type of courses the student has chosen. A quick sketch of the first idea follows.
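For example, related courses could be averaged into single features. The groupings below are my assumption for illustration, not something derived from the dataset:

# DM and K are mathematical courses (as noted earlier); treating
# SP, OOP, and AIPS as a programming group is an assumption.
engineered = features.copy()
engineered['math_avg'] = engineered[['DM', 'K']].mean(axis=1)
engineered['prog_avg'] = engineered[['SP', 'OOP', 'AIPS']].mean(axis=1)
engineered = engineered.drop(['DM', 'K', 'SP', 'OOP', 'AIPS'], axis=1)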

A big drawback for all of these models is the size of the dataset. 226 rows of data are too few to make truly meaningful predictions. Ideally, we would use much more information about each student when training the models.

Deploying the model as a web API

Since we saved all the models, deploying them for free is really easy.

We can create a Flask web API that loads a saved model and responds to prediction requests.

import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/', methods=['GET'])
def index():
    return 'Hello'

@app.route('/prediction', methods=['POST'])
def predict():
    # Load the saved model (loading it once at startup would be faster)
    lr = joblib.load(PATH_TO_SAVED_MODEL)
    if lr:
        # The request body is a JSON list of the student's 8 grades
        data = request.json
        prediction = list(lr.predict([data]))
        return jsonify({'prediction': str(prediction[0])})
    else:
        return 'Failed to load'

if __name__ == '__main__':
    app.run()
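Once the server is running, the endpoint can be tested with a simple POST request. This sketch assumes the default Flask port and the 8 grades in training-column order; the grade values are made up:

import requests

# The body is a JSON list of the student's 8 past grades
grades = [8, 7, 9, 8, 10, 7, 8, 9]
resp = requests.post('http://127.0.0.1:5000/prediction', json=grades)
print(resp.json())  # e.g. {'prediction': '8.4'}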

If you want to host it yourself, just start the server and that's it. However, you can also host it for free on Azure; it is pretty straightforward if you follow the guide here.

Final notes

In this post, we looked into creating and deploying basic regression models using different techniques.

I have hosted the web API and created a React web app, available here, where students can predict whether they will pass the Probability and Statistics course. It uses the random forest regressor from this post.
