High School Math Performance Prediction
Introduction
In math education, a major concern for universities and educators is that too many students fail to reach a satisfactory level in mathematics. Universities and educators report high failure, drop, and withdrawal rates. This is a problem for students because low performance in math keeps them from pursuing their degrees and careers. It is a problem for universities and educators because it means they are not teaching students successfully, not retaining them, and not meeting their needs, all of which hurts the profitability and attractiveness of the institution.
If we can gain insight into which factors most help or hurt student performance in math, we have the potential to address these problems. If we can build models that predict whether a student will pass or fail, predict numerical scores on math assessments, and predict the overall strength and promise of a student, then universities and educators can use these models to place students at the appropriate level, to select students for admission, and to understand which factors can be improved to help students succeed.
In this paper, we will apply data science and machine learning to a dataset representing the math performance of students from two Portuguese high schools.
In a previous article, which can be found at High School Math Performance Regression, I applied regression methods to the dataset to predict the value of G3. In the present paper, I would like to separate the G3 scores into five classes and try to classify a student as falling into one of five classes depending on their G3 score. This becomes a 5-class classification problem, and we can apply machine learning classification methods to this problem.
Data Preparation
The data file was separated by semicolons rather than commas. I replaced the semicolons with commas, copied and pasted everything into Notepad, and then converted the result to a csv file by following the steps from an online guide.
Now, I have a nice csv file.
There are 30 attributes that include things like student age, parent’s education, parent’s job, weekly study time, number of absences, number of past class failures, etc. There are grades for years 1, 2, and 3; these are denoted by G1, G2, and G3. The grades range from 0–20. G1 and G2 can be used as input features, and G3 will be the main target output.
Some of the attributes are ordinal, some are binary yes-no, some are numeric, and some are nominal. We do need to do some data preprocessing. For the binary yes-no attributes, I will encode them using 0’s and 1’s. I did this for schoolsup, famsup, paid, activities, nursery, higher, internet, and romantic. The attributes famrel, freetime, goout, Dalc, Walc, and health are ordinal; the values for these range from 1 to 5. The attributes Medu, Fedu, traveltime, studytime, failures are also ordinal; the values range from 0 to 4 or 1 to 4. The attribute absences is a count attribute; the values range from 0 to 93. The attributes sex, school, address, Pstatus, Mjob, Fjob, guardian, famsize, reason are nominal. For nominal attributes, we can use one-hot encoding. The attributes age, G1, G2, and G3 can be thought of as interval attributes.
I one-hot encoded each nominal attribute, one at a time, exporting the dataframe as a csv file each time and relabeling the columns as I went. Finally, I reordered the columns.
Here is the python code:
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv('C:\\Users\\ricky\\Downloads\\studentmath.csv')
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,32].values
labelencoder_X = LabelEncoder()

# Encoding binary yes-no attributes (columns 15 to 22)
for col in range(15, 23):
    X[:,col] = labelencoder_X.fit_transform(X[:,col])

# Encoding nominal attributes as integer labels
for col in [0, 1, 3, 4, 5, 8, 9, 10, 11]:
    X[:,col] = labelencoder_X.fit_transform(X[:,col])

# One-hot encoding a nominal column (here column 0)
# categorical_features was removed in scikit-learn 0.22, so this call needs an older release
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()

# Exporting the encoded data as a csv file
df = DataFrame(X)
export_csv = df.to_csv(r'C:\Users\Ricky\Downloads\highschoolmath.csv', index = None, header=True)
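For readers on a recent scikit-learn release, the same kind of encoding can be sketched with pandas alone. This is not the code I ran above; it is a minimal illustration on a made-up three-row dataframe, using a few of the dataset's column names (school, Mjob, internet, age) with invented values.

```python
import pandas as pd

# Toy rows: the column names come from the dataset, the values are made up
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "Mjob": ["teacher", "health", "other"],
    "internet": ["yes", "no", "yes"],
    "age": [15, 17, 16],
})

# Binary yes-no attribute -> 0/1
toy["internet"] = toy["internet"].map({"no": 0, "yes": 1})

# Nominal attributes -> one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["school", "Mjob"], dtype=int)
print(encoded.columns.tolist())
```

Each nominal column is replaced by one indicator column per category, so one indicator per attribute still has to be dropped later to avoid the dummy variable trap.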
At this point, the final column of our dataset consists of integers for G3. Scores 16–20 will form class 1, scores 14–15 will form class 2, scores 12–13 will form class 3, scores 10–11 will form class 4, and scores 0–9 will form class 5. We can create a final column of classes 1–5 by converting each score to one of the classes.
Here is the python code for doing it:
#Defining a function that converts G3 to one of the five classes
def filter_class(score):
    if score<10:
        return 5
    elif score<12:
        return 4
    elif score<14:
        return 3
    elif score<16:
        return 2
    else:
        return 1

#defining a new column called 'class' and dropping column 'G3'
dataset_trap['class'] = dataset_trap['G3'].apply(filter_class)
dataset_trap = dataset_trap.drop(['G3'], axis=1)
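The same binning can also be written with pandas' cut function. The sketch below is only a cross-check, assuming integer G3 scores from 0 to 20; the bin edges are my translation of the class boundaries described above, and it confirms that both approaches agree on every possible score.

```python
import pandas as pd

def filter_class(score):
    """Convert a G3 score (0-20) to one of the five classes."""
    if score < 10:
        return 5
    elif score < 12:
        return 4
    elif score < 14:
        return 3
    elif score < 16:
        return 2
    else:
        return 1

scores = pd.Series(range(21))  # every possible G3 value, 0 to 20

# Right-closed bins: (-1, 9], (9, 11], (11, 13], (13, 15], (15, 20]
binned = pd.cut(scores, bins=[-1, 9, 11, 13, 15, 20], labels=[5, 4, 3, 2, 1]).astype(int)

# Both approaches assign the same class to every score
print((binned == scores.apply(filter_class)).all())
```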
Now, our dataset is ready for us to apply classification methods on it.
Logistic Regression
We define X and Y using dataset_trap. Then, we split the dataset into a training set and a test set, apply feature scaling to X_train and X_test, fit logistic regression to the training set, and predict the test set results.
Here is the python code:
#Define X and Y using dataset_trap
X = dataset_trap.iloc[:,:-1].values
Y = dataset_trap.iloc[:,-1].values

#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
#Scale the test set with the mean and variance learned from the training set
X_test = sc_X.transform(X_test)

#Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train,Y_train)

#Predicting the Test set results
Y_pred = classifier.predict(X_test)
Now we have the predicted classes for the test set. We can see how accurate our model is by looking at the confusion matrix.
The numbers in the diagonal of the confusion matrix count the number of correct classifications. So, to find how accurate our model is, we would add the diagonal entries and divide by the total number of test set results, which is 79.
Here is the python code for creating the confusion matrix:
#Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
#Accuracy: correct predictions (the diagonal) divided by the 79 test set results
cm.trace()/79
In this case, the accuracy is 50%, which is not very impressive.
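To make the diagonal-over-total calculation concrete, here is a small sketch on ten made-up labels (not the student data). It divides by cm.sum() rather than a hard-coded test set size, and checks the result against scikit-learn's accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted classes for ten hypothetical students
y_true = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
y_pred = np.array([1, 2, 3, 4, 5, 2, 2, 4, 4, 1])

cm = confusion_matrix(y_true, y_pred)

# Diagonal entries are the correct classifications; cm.sum() is the number of samples
accuracy = cm.trace() / cm.sum()
print(accuracy, accuracy_score(y_true, y_pred))
```

Using cm.sum() keeps the calculation correct if the size of the test set changes.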
K Nearest Neighbors
The k-nearest neighbors model was trained on our training set using the Euclidean distance and k=5 neighbors. The python code is pretty much the same as the one for logistic regression except we replace the logistic regression model with the k nearest neighbors model. Here is the full python code:
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Importing the dataset
dataset = pd.read_csv("studentmathdummified.csv")

#Avoiding the dummy variable trap
#Dropping GP, Male, urban,LE3, Apart,mother_at_home, father_at_home, reason_course, guardian_other
dataset_trap = dataset.drop(dataset.columns[[0,2,4,6,8,10,15,20,26]],axis=1)

#Defining a function that converts G3 to one of the five classes
def filter_class(score):
    if score<10:
        return 5
    elif score<12:
        return 4
    elif score<14:
        return 3
    elif score<16:
        return 2
    else:
        return 1

#defining a new column called 'class' and dropping column 'G3'
dataset_trap['class'] = dataset_trap['G3'].apply(filter_class)
dataset_trap = dataset_trap.drop(['G3'], axis=1)

#Define X and Y using dataset_trap
X = dataset_trap.iloc[:,:-1].values
Y = dataset_trap.iloc[:,-1].values

#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
#Scale the test set with the mean and variance learned from the training set
X_test = sc_X.transform(X_test)

#Fitting K nearest neighbors to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p =2)
classifier.fit(X_train,Y_train)

#Predicting the Test set results
Y_pred = classifier.predict(X_test)

#Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
cm.trace()/79
The accuracy of our model is 28%, which is really bad.
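The code asks for the Minkowski metric with p=2, which is the Euclidean distance mentioned above. As a quick sanity check, the sketch below fits both settings on synthetic blobs (an assumption for illustration, not the student data) and compares their predictions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 5-class data standing in for the student dataset
X, y = make_blobs(n_samples=200, centers=5, n_features=4, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

knn_minkowski = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2).fit(X_train, y_train)
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean').fit(X_train, y_train)

# Minkowski distance with p=2 is the Euclidean distance, so the predictions match
same_predictions = np.array_equal(knn_minkowski.predict(X_test), knn_euclidean.predict(X_test))
print(same_predictions)
```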
Support Vector Machines
Support vector machines are designed to perform classification when there are only two classes. However, there are ways to use SVMs when there are more than two classes, as in our case with five. One way is the one-versus-one scheme, in which we build an SVM model for every possible pair of classes. For K-class classification, there are K choose 2 pairs, so in our case there are 10 pairs of classes. We build 10 support vector classifiers and predict the class of a given test point by applying all 10 classifiers to it and choosing the class to which the test point is assigned most often. The python code is the same as for the previous classifiers except that we replace the classifier with the support vector classifier:
#Fitting support vector classifier to the Training set using one-versus-one
from sklearn.svm import SVC
classifier = SVC(kernel='linear')
classifier.fit(X_train,Y_train)
The accuracy of our model is 62%.
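To show what the one-versus-one scheme does, here is a hand-rolled sketch of the voting on synthetic blobs (an assumption for illustration, with class labels 0 to 4 rather than our 1 to 5). SVC handles this internally, and its tie-breaking may differ from the simple argmax used here, so this illustrates the idea rather than scikit-learn's exact procedure.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic 5-class data; make_blobs labels the classes 0..4
X, y = make_blobs(n_samples=200, centers=5, n_features=4, random_state=0)
classes = np.unique(y)
pairs = list(combinations(classes, 2))  # K choose 2 = 10 pairs for K = 5

# votes[i, c] counts how many pairwise classifiers assign point i to class c
votes = np.zeros((len(X), len(classes)), dtype=int)
for a, b in pairs:
    mask = (y == a) | (y == b)
    pair_clf = SVC(kernel='linear').fit(X[mask], y[mask])
    winners = pair_clf.predict(X)
    for c in (a, b):
        votes[:, c] += (winners == c)

# Predicted class = the class with the most votes
y_vote = classes[votes.argmax(axis=1)]
print(len(pairs), (y_vote == y).mean())
```

Each of the 10 pairwise classifiers casts exactly one vote per point, so every row of votes sums to 10.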
Another way to use SVMs when there are more than two classes is the one-versus-all scheme, also called one-versus-rest. In this scheme, K models are built by pairing each class with the rest; in our case, 5 models are built. To predict the class for a given test point, we apply all 5 models to the test point and choose the class for which the perpendicular distance of the test point from the maximal margin hyperplane of the class's corresponding model is largest. In other words, we choose the class for which the corresponding model most confidently classifies the test point as of that class. Here's the python code:
#Fitting support vector classifier to the Training set using one-versus-rest
from sklearn.svm import LinearSVC
classifier = LinearSVC()
classifier.fit(X_train,Y_train)
The accuracy of our model is 56%.
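The "most confident model wins" rule can be checked directly. In the sketch below, on synthetic blobs (an assumption for illustration), decision_function returns one signed score per class, and taking the class with the largest score reproduces what predict returns. The score is proportional to the distance from each model's hyperplane rather than the exact perpendicular distance.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic 5-class data standing in for the student dataset
X, y = make_blobs(n_samples=200, centers=5, n_features=4, random_state=0)

clf = LinearSVC(max_iter=10000, random_state=0).fit(X, y)

# One column of signed scores per class: one "class versus rest" model each
scores = clf.decision_function(X)

# Pick the class whose model gives the largest score
y_argmax = clf.classes_[scores.argmax(axis=1)]
print(scores.shape, np.array_equal(y_argmax, clf.predict(X)))
```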
I tried using the rbf kernel and got 44% accuracy.
#Fitting support vector classifier to the Training set using one-versus-one and rbf kernel
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train,Y_train)
The default regularization parameter C is 1. I raised the regularization parameter to 4 and got an accuracy of 45.5%. Raising the regularization parameter to 10 gives an accuracy of 48%.
#Fitting support vector classifier to the Training set using one-versus-one and rbf kernel and C=10
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', C=10, random_state=0)
classifier.fit(X_train,Y_train)
Increasing the value of C means we’re being more strict about how many points are misclassified; it corresponds to having a smaller-margin separating hyperplane. Put another way, by increasing C, we’re decreasing the leeway for violations of the margin. Lowering the value of C corresponds to being more lenient with misclassifications; it corresponds to having a larger-margin separating hyperplane. As we increased the value of C, we saw that the accuracy went up.
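Rather than trying values of C one at a time, the comparison can be run in a loop. The sketch below is a hypothetical setup on synthetic data, not the student dataset, so the accuracies it prints say nothing about our problem. It scores the rbf kernel with 5-fold cross validation for several values of C, with the scaler inside a pipeline so that it is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 5-class data standing in for the student dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

mean_accuracy = {}
for C in [0.1, 1, 4, 10, 30]:
    # Larger C penalizes margin violations more heavily
    model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=C))
    mean_accuracy[C] = cross_val_score(model, X, y, cv=5).mean()

for C, acc in mean_accuracy.items():
    print(C, round(acc, 3))
```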
I also tried the sigmoid kernel, with C=30, and got a 62% accuracy. Lowering the value of C to less than 30 or raising the value of C to higher than 30 gives poorer accuracy.
#Fitting support vector classifier to the Training set using one-versus-one and sigmoid kernel and C=30
from sklearn.svm import SVC
classifier = SVC(kernel='sigmoid', C=30, random_state=0)
classifier.fit(X_train,Y_train)
Decision Trees
In this method, we’re going to grow one tree. The splitting criterion is chosen to be entropy, and feature scaling is not necessary. When I applied the decision tree classifier to the test set, I got an accuracy of 72%.
#Fitting decision tree classifier to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy')
classifier.fit(X_train,Y_train)
I set the minimum number of samples required to split an internal node to 10 and the minimum number of samples required to be at a leaf node to 5. This improved the accuracy to 77%.
#Fitting decision tree classifier to the Training set with minimum number of samples required to split 10 and minimum number of leaf samples 5
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', min_samples_split=10, min_samples_leaf=5)
classifier.fit(X_train,Y_train)
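To see what these two settings do to the tree itself, the sketch below fits an unconstrained and a constrained tree on synthetic data (an assumption for illustration) and reports the number of leaves and the smallest leaf in each. The helper smallest_leaf is my own name, not part of scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic 5-class data standing in for the student dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

full_tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
constrained_tree = DecisionTreeClassifier(criterion='entropy', min_samples_split=10,
                                          min_samples_leaf=5, random_state=0).fit(X, y)

def smallest_leaf(tree):
    """Return the number of training samples in the smallest leaf (hypothetical helper)."""
    t = tree.tree_
    is_leaf = t.children_left == -1  # leaves have no children
    return int(t.n_node_samples[is_leaf].min())

for name, tree in [('full', full_tree), ('constrained', constrained_tree)]:
    print(name, tree.get_n_leaves(), smallest_leaf(tree))
```

Requiring at least 5 samples per leaf stops the tree from carving out a leaf for each individual training point, which is one plausible reason the constrained tree generalized better on the test set.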
Random Forests
In this method, we’re going to grow a bunch of trees. The splitting criterion is chosen to be entropy, and feature scaling is not used. When I applied the random forest classifier to the test set, using 10 trees, I got an accuracy of 62%:
#Fitting random forest classifier to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(criterion='entropy', n_estimators=10)
classifier.fit(X_train,Y_train)
I set the minimum number of samples required to split an internal node to 10 and the minimum number of samples required to be at a leaf node to 5. I also increased the number of trees to 100. This improved the accuracy to 74%:
#Fitting random forest classifier to the Training set with 100 trees and minimum samples required to split 10 and minimum samples at a leaf 5
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(criterion='entropy', n_estimators=100, min_samples_split=10, min_samples_leaf=5)
classifier.fit(X_train,Y_train)
To improve accuracy even more, I set the max_features to None:
#Fitting random forest classifier to the Training set with 100 trees, min_samples_split=10, min_samples_leaf=5, and max_features=None
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(criterion='entropy', n_estimators=100, max_features=None, min_samples_split=10, min_samples_leaf=5)
classifier.fit(X_train,Y_train)
I got an accuracy of 78%.
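Setting max_features=None means every split may consider all of the features, instead of a random subset of about the square root of the number of features. The sketch below, on synthetic data with 16 features (an assumption for illustration), inspects how many features each tree is allowed to look at per split under the two settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 5-class data with 16 features standing in for the student dataset
X, y = make_classification(n_samples=300, n_features=16, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

sqrt_forest = RandomForestClassifier(n_estimators=10, max_features='sqrt', random_state=0).fit(X, y)
all_forest = RandomForestClassifier(n_estimators=10, max_features=None, random_state=0).fit(X, y)

# max_features_ is the number of features each tree considers at a split
sqrt_per_split = sqrt_forest.estimators_[0].max_features_
all_per_split = all_forest.estimators_[0].max_features_
print(sqrt_per_split, all_per_split)
```

With all features available at every split, the trees differ only through bootstrap sampling, so the forest behaves like bagged decision trees.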
Model Selection
In order to determine which model is the best, we will perform k-fold cross validation (k=10) for each model and pick the one that has the best accuracy.
For logistic regression, I got an accuracy of 57%.
For k-nearest neighbors with k=5, I got an accuracy of 43%.
For support vector classifier, one-versus-one, I got an accuracy of 64%.
For support vector classifier, one-versus-rest, I got an accuracy of 57%.
For support vector classifier, one-versus-one with rbf kernel, I got an accuracy of 53%.
For support vector classifier, one-versus-one with rbf kernel and C=10, I got an accuracy of 57%.
For support vector classifier, one-versus-one with sigmoid kernel and C=30, I got an accuracy of 60%.
For a single decision tree, I got an accuracy of 66%.
For a single decision tree with min_samples_split=10 and min_samples_leaf=5, I got an accuracy of 69%.
For random forest with 10 trees, I got an accuracy of 64%.
For random forest with 100 trees, min_samples_split=10, and min_samples_leaf=5, I got an accuracy of 70%.
For random forest with 100 trees, min_samples_split=10, min_samples_leaf=5, max_features=None, I got an accuracy of 76%.
Here is the python code for applying k-fold cross validation:
#Applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=classifier,X=X_train, y=Y_train, cv=10)
accuracies.mean()
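One variation worth considering is to put the scaler and the classifier into a single pipeline, so that the scaling is refit on the training part of each fold. The sketch below is not what I ran above; it uses synthetic data and a shuffled StratifiedKFold with 10 splits (both assumptions for illustration), and reports the spread of the fold accuracies along with the mean.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 5-class data standing in for the student dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# Scaling happens inside each fold, so no fold sees statistics from its held-out part
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

accuracies = cross_val_score(model, X, y, cv=folds, scoring='accuracy')
print(accuracies.mean(), accuracies.std())
```

The standard deviation gives a sense of whether a gap of a few percentage points between two models is larger than the fold-to-fold noise.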
Comparing the accuracies of each model, we see that random forests with 100 trees, min_samples_split=10, min_samples_leaf=5, max_features=None has the highest accuracy. We might wonder whether entropy is the best criterion for splitting and whether 100 trees is the best number of trees to use; we might also wonder about what the best value for max_features is. I performed a grid search for criterion among entropy and gini, for n_estimators among 10,100,500, for max_features among ‘auto’, None, ‘log2’, and 1.
#grid search
from sklearn.model_selection import GridSearchCV
#'auto' is equivalent to 'sqrt' for classifiers; scikit-learn 1.3 and later accept only 'sqrt'
parameters = [{'criterion':['entropy'],'n_estimators':[10,100,500],'max_features':['auto',None,'log2',1]},
              {'criterion':['gini'],'n_estimators':[10,100,500],'max_features':['auto',None,'log2',1]}]
grid_search=GridSearchCV(estimator = classifier, param_grid=parameters, scoring='accuracy',cv=10)
grid_search=grid_search.fit(X_train, Y_train)
best_accuracy=grid_search.best_score_
best_parameters=grid_search.best_params_
The result is a best accuracy of 76.9% and best parameters criterion=’gini’, max_features=None, and n_estimators=500.
For random forest with criterion=’gini’, 500 trees, min_samples_split=10, min_samples_leaf=5, max_features=None, I got an accuracy of 77.5%. I increased min_samples_split to 50 and got an accuracy of 78.2%.
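The manual change to min_samples_split could also be folded into the grid search. The sketch below is a scaled-down, hypothetical version on synthetic data: it uses 20 trees and 3 folds to keep the run short, and 'sqrt' in place of 'auto', so its output does not reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 5-class data standing in for the student dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

param_grid = {
    'criterion': ['entropy', 'gini'],
    'max_features': ['sqrt', None],
    'min_samples_split': [10, 50],  # searched here instead of being tuned by hand
}

forest = RandomForestClassifier(n_estimators=20, min_samples_leaf=5, random_state=0)
search = GridSearchCV(estimator=forest, param_grid=param_grid, scoring='accuracy', cv=3)
search.fit(X, y)

print(search.best_score_, search.best_params_)
```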
Conclusion
In this paper, we applied logistic regression, k-nearest neighbors, support vector classifiers, decision trees, and random forests to the 5-class classification problem of predicting which of five classes each student’s third year score would fall under. We found that the best performing model, among the ones we examined, is the random forest classifier with criterion=’gini’, 500 trees, min_samples_split=50, min_samples_leaf=5, max_features=None. The accuracy achieved was 78.2%.
In our regression analysis of the dataset in a previous paper, we found that some of the most significant attributes were grades in years 1 and 2, quality of family relationships, age, and the number of absences. The random forest regression with 500 trees turned out to be one of the best performing models with 87–88% accuracy (R squared). We also saw a strong linear relationship between the grade in year 3 and the grades in years 1 and 2.
Whether or not the attributes G1, G2, quality of family relationships, age, and number of absences are always significant in every school, in every time period, and in every country is an open question. Can the insights gathered here be generalized beyond the two Portuguese high schools we considered? What other attributes, besides the ones we considered, might be significant in determining math performance? These are open questions worth pursuing to further understand and resolve the issue of poor math performance.
The dataset can be found here:
https://archive.ics.uci.edu/ml/datasets/student+performance
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5–12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978–9077381–39–7.