Is consistency a key to academic performance? Predicting student final grades using previous grades, regression and Python
The student-mat.csv dataset
The student-mat.csv dataset is a well-known file from the UCI Machine Learning Repository containing 33 columns and 395 rows. Each row represents a different student and each column a different attribute. This dataset is interesting because, among other things, it reveals a lot about what contributes to academic performance. Here are some of the columns (attributes) available in the file.
- Age.
- Family size.
- Mother’s educational level.
- Father’s educational level.
- Study time.
- Failures.
- Health.
- Absences.
- First-period grade (G1).
- Second-period grade (G2).
- Final grade (G3).
What is regression?
Regression is a statistical method used to model the relationship between independent and dependent variables so that the dependent variable can be predicted. In machine learning, the independent variables are called "features" and the dependent variable is called the "label." Regression is used to answer questions like "how much?" and "how many?" For example, how much money could we make? There are, generally speaking, three types of regression problems.
- Single linear: one feature that is linearly related to a label.
- Single non-linear: one feature that is not linearly related to a label.
- Multi-featured: more than one feature but only one label.
In this post, I train a regression model using three variables (G1, G2 and G3) from the student-mat.csv dataset. G1 and G2 serve as the features and G3 as the label, so this project is a multi-featured regression problem.
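As a toy illustration of the multi-featured case (several features, one label), here is a minimal scikit-learn sketch. The grade-like numbers below are made up for illustration and are not taken from the dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two features (think G1 and G2) and one label (think G3).
# These numbers are invented purely to show the mechanics.
X = np.array([[5, 6], [10, 11], [14, 15], [8, 9]])
y = np.array([6, 11, 15, 9])

model = LinearRegression()
model.fit(X, y)

# One coefficient per feature, plus a single intercept.
print(model.coef_.shape)            # (2,)
print(model.predict([[12, 13]]))    # close to 13, matching the pattern above
```

The only structural difference from single-featured regression is that `X` has one column per feature, and the fitted model has one coefficient per column.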
Train the regression algorithm using multiple features (G1 and G2) from the student-mat.csv dataset
Step 1: Open a new Python kernel in the folder containing the student-mat.csv file
Step 2: Import necessary libraries
import pandas as pd
import numpy as np
import sklearn.model_selection  # provides train_test_split, used below
from sklearn import linear_model
import seaborn as sns
import plotly.express as px
Step 3: Access the student-mat.csv dataset
data = pd.read_csv("student-mat.csv", sep=";")
data.shape
data.head(5)
Step 4: View the correlation between the variables
After visualizing the relationships, I notice that columns G1 and G2 correlate with G3 more strongly than any other attribute. In other words, students with lower G1 grades tend to have lower G3 (final) grades, and students with higher G2 grades tend to have higher G3 grades. Could this mean that consistency is a key to academic performance? In other words, students who performed well previously tend to perform well in the future. While more investigation is needed (correlation is not causation), this is an interesting insight.
sns.heatmap(data.corr(numeric_only=True))  # numeric_only avoids errors on the text columns
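To see the exact numbers behind the heatmap, you can also sort the correlation column for G3 directly. The sketch below uses a small made-up DataFrame (not the real CSV) so the idea is visible at a glance:

```python
import pandas as pd

# Toy stand-in for a few columns of student-mat.csv (made-up numbers).
df = pd.DataFrame({
    "G1": [10, 12, 8, 15, 7, 14],
    "G2": [11, 12, 9, 14, 6, 15],
    "G3": [11, 13, 8, 15, 6, 15],
    "absences": [4, 0, 10, 2, 12, 1],
})

# Correlation of every numeric column with the final grade, strongest first.
corr_with_g3 = df.corr()["G3"].sort_values(ascending=False)
print(corr_with_g3)
```

On the real dataset, the same one-liner (`data.corr(numeric_only=True)["G3"].sort_values(ascending=False)`) ranks G2 and G1 at the top, which is what the heatmap shows visually.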
Step 5: Plot G1, G2 and G3 on scatter plots to visualize their relationships
fig = px.scatter(data, x="G1", y="G3", color="G2", hover_data=["G3"],
                 width=600, height=400, title="Visualizing first and final grades of all students",
                 labels={  # replaces the default column-name labels
                     "G1": "1st Period Grade", "G3": "Final Grade"},
                 template="simple_white")
fig.show()
fig = px.scatter(data, x="G2", y="G3", color="G2", hover_data=["G3"],
                 width=600, height=400, title="Visualizing second and final grades of all students",
                 labels={  # replaces the default column-name labels
                     "G2": "2nd Period Grade", "G3": "Final Grade"},
                 template="simple_white")
fig.show()
Step 6: Select the features and label
data = data[["G1", "G2", "G3"]]
predict = "G3"
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
Step 7: Randomly split the data, holding out 10% as a "Test" dataset and keeping the rest as the "Train" dataset
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
Step 8: Train the regression algorithm on the “Train” portion of the student-mat.csv dataset
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)  # R^2 on the test set, not classification accuracy
print(acc)
Step 9: View the coefficients and y-intercept of the model
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
Step 10: Make some predictions using the “Test” dataset
predictions = linear.predict(x_test)
for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])
The final model
Final student grade = (first-period grade * 0.159) + (second-period grade * 0.995) - 2.0719
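The fitted equation can be hard-coded as a plain function for quick what-if questions. Note that the exact coefficients will vary slightly between runs because the train/test split is random; the numbers below are the ones reported in Step 9:

```python
# The fitted model from Step 9, written out as a plain function.
# Coefficients will differ slightly on each random train/test split.
def predict_final_grade(g1: float, g2: float) -> float:
    return (g1 * 0.159) + (g2 * 0.995) - 2.0719

# A hypothetical student with G1 = 12 and G2 = 14:
print(round(predict_final_grade(12, 14), 2))  # 13.77
```

The size of the two coefficients tells the same story as the correlation heatmap: G2 (weight 0.995) dominates the prediction, while G1 (weight 0.159) contributes a smaller correction.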
Let’s make sense of the model
While more work is needed to improve this model, such as increasing the number of features or evaluating it on more than one random split, it can already be used to predict a student's final grade. The analysis above also suggests that consistent past performance is a strong indicator of academic success.
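One inexpensive improvement is to replace the single 90/10 split with k-fold cross-validation, which averages the R^2 score over several splits and is less sensitive to one lucky or unlucky draw. The sketch below uses synthetic grade-like data (not the real CSV) so it runs standalone:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for G1/G2/G3: grades 0-20 with small period-to-period drift.
rng = np.random.default_rng(0)
g1 = rng.integers(0, 21, size=100)
g2 = np.clip(g1 + rng.integers(-3, 4, size=100), 0, 20)
g3 = np.clip(g2 + rng.integers(-2, 3, size=100), 0, 20)

X = np.column_stack([g1, g2])
scores = cross_val_score(LinearRegression(), X, g3, cv=5)  # one R^2 per fold
print(scores.mean())
```

Swapping `X` and `g3` for the real `X` and `y` from Step 6 gives a steadier estimate of the model's quality than the single `linear.score(...)` call in Step 8.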