Is consistency a key to academic performance? Predicting student final grades using previous grades, regression and Python

David Leslie Wilkinson
4 min read · Jul 31, 2022


The student-mat.csv dataset

The student-mat.csv dataset is a well-known file containing 33 columns and 395 rows. Each row represents a different student, and each column is a different attribute. This dataset is very interesting because, among other things, it reveals a lot about what contributes to academic performance. Here are some of the columns (attributes) available in the file.

  • Age.
  • Family size.
  • Mother’s educational level.
  • Father’s educational level.
  • Study time.
  • Failures.
  • Health.
  • Absences.
  • First-period grade (G1).
  • Second-period grade (G2).
  • Final grade (G3).

What is regression?

Regression is a statistical method that models the relationship between independent and dependent variables so that the dependent variable can be predicted. In machine learning, the independent variables are called "features," whereas the dependent variable is called the "label." Regression is used to answer questions like "how much?" and "how many?" For example, how much money could we make? There are, generally speaking, three types of regression problems.

  1. Single linear: one feature that is linearly related to a label.
  2. Single non-linear: one feature that is not linearly related to a label.
  3. Multi-featured: more than one feature but only one label.
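To make the first case concrete, here is a minimal sketch of a single linear regression, fitted with NumPy on made-up study-hours numbers (not from the dataset):

```python
import numpy as np

# Toy data (illustrative only): hours studied vs. exam score,
# roughly following score = 10 * hours + 40
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
score = np.array([51, 59, 71, 80, 89, 101], dtype=float)

# Fit a single linear regression: one feature (hours), one label (score)
slope, intercept = np.polyfit(hours, score, deg=1)
print(f"score ≈ {slope:.2f} * hours + {intercept:.2f}")
```

The fitted slope and intercept come out close to the 10 and 40 used to generate the toy data, which is exactly what a single linear regression should recover.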

In this post, I train a regression model using three variables (G1, G2 and G3) from the student-mat.csv dataset, with G1 and G2 as the features and G3 as the label. Since there are two features and one label, this project is a multi-featured regression problem.

Train the regression algorithm using multiple features (G1 and G2) from the student-mat.csv dataset

Step 1: Open up a new Python Kernel from a folder containing the student-mat.csv file

Step 2: Import necessary libraries

import pandas as pd
import numpy as np
import sklearn
from sklearn import linear_model
import seaborn as sns
import plotly.express as px

Step 3: Access the student-mat.csv dataset

data = pd.read_csv("student-mat.csv", sep=";")
data.shape
data.head(5)

Step 4: View the correlation between the variables

After visualizing the relationships, I notice that columns G1 and G2 are more strongly correlated with G3 than any other attribute. This is interesting because it means that lower G1 grades tend to go with lower G3 grades (final grades), and vice versa. Likewise, higher G2 grades tend to go with higher G3 grades. Could this mean that consistency is a key to academic performance? In other words, students who perform well earlier tend to perform well later. While more investigation is needed, this is an interesting insight.

sns.heatmap(data.corr(numeric_only=True))  # numeric_only skips the text columns (required in pandas 2.0+)
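To see the numbers behind the heatmap, you can sort the G3 column of the correlation matrix. The sketch below uses a tiny made-up frame so it runs on its own; on the real dataset you would apply the same two lines to `data`:

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (illustrative values only)
data = pd.DataFrame({
    "G1": [10, 12, 8, 15, 11],
    "G2": [11, 13, 7, 16, 12],
    "G3": [11, 14, 6, 16, 12],
    "absences": [4, 0, 10, 2, 6],
})

# Correlation of every numeric column with the final grade, strongest first
corr_with_g3 = data.corr(numeric_only=True)["G3"].sort_values(ascending=False)
print(corr_with_g3)
```

On the real file, G1 and G2 appear at the top of this list, which is what the heatmap shows visually.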

Step 5: Plot G1, G2 and G3 on scatter plots to visualize their relationships

fig = px.scatter(data, x="G1", y="G3", color="G2", hover_data=["G3"],
                 width=600, height=400,
                 title="Visualizing first and final grades of all students",
                 labels={"G1": "1st Grade", "G3": "Final Grade"},  # replaces the default column-name labels
                 template="simple_white")
fig.show()

fig = px.scatter(data, x="G2", y="G3", color="G2", hover_data=["G3"],
                 width=600, height=400,
                 title="Visualizing second and final grades of all students",
                 labels={"G2": "2nd Grade", "G3": "Final Grade"},
                 template="simple_white")
fig.show()

Step 6: Select the features and label

data = data[["G1", "G2", "G3"]]
predict = "G3"
X = np.array(data.drop(columns=[predict]))  # the positional axis argument was removed in pandas 2.0
y = np.array(data[predict])

Step 7: Randomly split the data into a “Train” dataset (90%) and a “Test” dataset (10%)

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

Step 8: Train the regression algorithm on the “Train” portion of the student-mat.csv dataset

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)  # R² on the held-out test set
print(acc)
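One caveat: for `LinearRegression`, `score` returns the R² coefficient of determination, not a classification-style accuracy. A tiny standalone check with made-up numbers confirms it matches `r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up points lying close to y = 2x, so R² should be near 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.1, 5.9, 8.0])

model = LinearRegression().fit(X, y)
print(model.score(X, y))              # R² via .score()
print(r2_score(y, model.predict(X)))  # the same value via r2_score
```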

Step 9: View the coefficients and y-intercept of the model

print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

Step 10: Make some predictions using the “Test” dataset

predictions = linear.predict(x_test)

for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])

The final model

Final student grade = (First-period grade × 0.159) + (Second-period grade × 0.995) − 2.0719
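As a sanity check, the fitted equation can be evaluated directly. The coefficients below are the ones from my run; they will shift slightly with each random train/test split:

```python
# Coefficients from the run above (they vary with each random train/test split)
coef_g1, coef_g2, intercept = 0.159, 0.995, -2.0719

def predict_final(g1, g2):
    """Predict the final grade G3 from the first- and second-period grades."""
    return coef_g1 * g1 + coef_g2 * g2 + intercept

# A student with consistent grades of 14 is predicted to stay around 14
print(round(predict_final(14, 14), 2))  # prints 14.08
```

Notice how small the G1 coefficient is compared with the G2 coefficient: once the second-period grade is known, it dominates the prediction.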

Let’s make sense of the model

While more work is needed to improve this model, such as adding more features, it can already be used to estimate a student's final grade. The analysis above also suggests that academic consistency is one of the strongest indicators of academic success: among all the attributes in this dataset, the earlier grades G1 and G2 correlate with the final grade G3 more than anything else.
