Predicting Student Performance

Published in Analytics Vidhya, Mar 6, 2020

Predicting student success is tricky. It’s difficult to account for the various factors in home life, personal life, mental health, quality of education, and study habits. Nevertheless, I feel that there are certain universal factors that help and hinder students. While these factors may grow or diminish in the face of cultural and personal differences from student to student, there is still a non-zero correlation between them and student success, which makes them worth studying.

It is with this in mind that I began this project. My data comes from the UCI Machine Learning Repository, where it was contributed by Paulo Cortez of the Universidade do Minho. It consists of student demographic, educational, and social features collected from two secondary schools in Portugal. My work with this data was inspired by the work of Paulo Cortez and Alice Silva, available here.

The Process

My goal was to predict a binary class: whether a student will pass or fail at the end of the year, represented in the data as whether G3, the final semester grade of the year, is greater than or equal to 10. I also wanted to do so in a way that could be valuable in an actual classroom. It is my belief that an educator could use a process similar to this one to better facilitate early intervention for students who may be at risk academically.
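The pass/fail target described above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: the function name `label_pass`, the constant `PASS_MARK`, and the sample grades are made up for the example, and the 0 to 20 grade range is the scale used in the UCI dataset.

```python
PASS_MARK = 10  # G3 >= 10 counts as a pass, per the definition above


def label_pass(g3: int) -> int:
    """Return 1 (pass) if the final grade meets the pass mark, else 0 (fail)."""
    return 1 if g3 >= PASS_MARK else 0


# Hypothetical final grades on a 0-20 scale (not taken from the dataset)
final_grades = [0, 9, 10, 15, 20]
labels = [label_pass(g) for g in final_grades]
print(labels)
```

Note that "pass" is the positive class (1) here, which matters for how Precision is read later on.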

During this process, I fit both XGBClassifier and LogisticRegressionCV models, and did so twice: once without the earlier semester grades (G1 & G2) and once with them. This let me compare how effective early intervention could be with and without taking concurrent academic performance into account.
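The "fit twice" setup might look roughly like the sketch below. Everything here is an assumption made for illustration: the data is synthetic (not the UCI dataset), only two stand-in features (`failures`, `absences`) are used, the scaling step and split sizes are my own choices, and XGBClassifier is left out so the example depends only on scikit-learn. The same loop would apply to any classifier with `fit` and `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; NOT the UCI student dataset
rng = np.random.default_rng(0)
n = 400
failures = rng.integers(0, 4, size=n)
absences = rng.integers(0, 30, size=n)
g1 = np.clip(np.round(12 - 2 * failures + rng.normal(0, 3, n)), 0, 20)
g2 = np.clip(np.round(g1 + rng.normal(0, 1.5, n)), 0, 20)
g3 = np.clip(np.round(g2 + rng.normal(0, 1.5, n)), 0, 20)
y = (g3 >= 10).astype(int)  # 1 = pass, 0 = fail

# Two feature sets: one excludes the earlier grades, one includes them
feature_sets = {
    "without G1/G2": np.column_stack([failures, absences]),
    "with G1/G2": np.column_stack([failures, absences, g1, g2]),
}

scores = {}
for name, X in feature_sets.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    # LogisticRegressionCV picks its regularization strength by internal
    # cross-validation, here scored on precision
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(cv=5, scoring="precision", max_iter=1000),
    )
    model.fit(X_train, y_train)
    scores[name] = precision_score(y_test, model.predict(X_test))
    print(f"{name}: test precision = {scores[name]:.2f}")
```

Because the data is simulated, the printed numbers say nothing about the real results reported below; the point is only the structure of the comparison.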

My chosen metric for the performance of these models is Precision: the number of True Positive Predictions divided by the Total Positive Predictions (True Positives plus False Positives), where "pass" is the positive class.

This means I’m minimizing False Positive Predictions: students who are likely to fail but are predicted to pass, and who therefore may not receive the help they need. There is far less harm in falsely predicting the failure of a student than in falsely predicting the success of one, and far less benefit in accurately predicting which students will pass than in accurately predicting which will fail.
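To make the metric concrete, here is a small hand-rolled version of Precision. It is a sketch for illustration only; the function name and the toy labels are invented, and returning 0.0 when there are no positive predictions is just a convention chosen for this example.

```python
def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions; convention chosen for this sketch
    return tp / (tp + fp)


# Toy labels: 1 = pass, 0 = fail
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]

# 3 true positives and 1 false positive (a failing student predicted to pass)
print(precision(y_true, y_pred))  # 0.75
```

The single False Positive in the toy data is exactly the case this metric is meant to penalize: a student who fails but was predicted to pass.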

The Results

For my baseline, I chose the ZeroR or Majority Class Baseline. The class with the most observations is used as the result for all predictions. This gives us a Precision of .78, not bad for what is essentially a guess, but with plenty of room for improvement.
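A short sketch of why the baseline lands where it does: when the majority class is "pass", ZeroR predicts "pass" for everyone, so its Precision is simply the share of students who actually pass. The counts below (78 passes out of 100) are illustrative numbers chosen to reproduce the .78 figure, not the dataset's real class counts, and `zero_r_predict` is a name made up for this example.

```python
from collections import Counter


def zero_r_predict(y_train, n_predictions):
    """Predict the most common training label for every observation."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return [majority_label] * n_predictions


# Illustrative labels: 78 passes (1) and 22 fails (0)
y = [1] * 78 + [0] * 22
preds = zero_r_predict(y, len(y))

tp = sum(1 for t, p in zip(y, preds) if p == 1 and t == 1)
fp = sum(1 for t, p in zip(y, preds) if p == 1 and t == 0)
baseline_precision = tp / (tp + fp)
print(baseline_precision)  # 78 / (78 + 22) = 0.78
```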

I applied cross-validation to both of my models, and from both received significant improvement upon this initial score. My XGBClassifier, however, consistently returned equivalent or lower scores than LogisticRegressionCV, and was prone to overfitting, and so won’t be included going forward.

Without Accounting For Earlier Grades

My model returned a consistent Precision of ~.85 across validation and testing rounds. This is a significant increase over the baseline, given no prior knowledge of a student’s grades. The increase comes solely from information a teacher may have on an incoming student (e.g. attendance records, age, previous failures) before that student has done any work for that teacher.

With Accounting For Earlier Grades

When first and second semester grades are included, validation scores climb to ~.96, and my final test score climbs to ~.95. This is again a significant climb in Precision, though in a real-world scenario that increase may be less valuable, since by that point it may be too late to provide effective early intervention for struggling students.

The Insights

Through the process of exploring this data and creating this model, I discovered several interesting, if unsurprising, things.

First, the level of education a student’s parents had was a dramatic indicator of that student’s academic success. The children of educated parents had consistently higher scores and significantly higher rates of passing. This could be an indicator of the greater means available to educated families, or of the greater ability educated parents have to help their children learn.

Effect Of Parent’s Education On Student Probability Of Passing

Second, the effect of weekly study time on student success displays a very interesting characteristic: while greater amounts of time spent studying do, on average, increase the chance of a student passing their exams, past a certain point further study is strongly correlated with increased variability in student scores.

Effect Of Study Time On Student Probability Of Passing

For many students it seems there may be a point of diminishing returns, where further time spent in study may not be leading to better retention, possibly coming at the expense of other factors (sleep, stress, etc.) that have an impact on performance.
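One way to look at both effects at once is to group students by study-time level and compare the pass rate with the spread of grades in each group. The sketch below uses a handful of invented (study-time level, G3) pairs purely to show the calculation; the level coding and all values are hypothetical, not drawn from the dataset.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (study-time level, final grade G3) pairs
records = [
    (1, 8), (1, 11), (1, 9),
    (2, 10), (2, 12), (2, 11),
    (3, 14), (3, 13),
    (4, 6), (4, 18), (4, 15),
]

# Collect the final grades observed at each study-time level
grades_by_level = defaultdict(list)
for level, g3 in records:
    grades_by_level[level].append(g3)

summary = {}
for level in sorted(grades_by_level):
    grades = grades_by_level[level]
    summary[level] = {
        # share of students at this level with G3 >= 10
        "pass_rate": mean(1 if g >= 10 else 0 for g in grades),
        # population standard deviation of grades at this level
        "grade_sd": pstdev(grades),
    }
    print(level, summary[level])
```

A rising pass rate alongside a rising standard deviation at the highest level would be the pattern described above: more study helps on average, but outcomes become less predictable.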

Third, the number of absences, while on average an indicator of student success, approaches ludicrous levels of variability at the extreme end of the spectrum.

Effect Of Absences On Student Probability Of Passing

Very large numbers of absences are associated with decreased performance, as would be expected; it’s hard to learn from a class you’re rarely in. However, large numbers of absences are far less useful in predicting student success than small ones.

Frequently missing school can be an indicator of destabilizing factors in home life, or of a lack of interest or desire to succeed. However, it could also be the result of, among other things, chronic illness or injury, which may not necessarily preclude a student from doing the things necessary to succeed regardless of attendance.

Conclusion

There seem to be a number of factors, academic and otherwise, that could be used to predict student success with a reasonable amount of Precision. I believe that these factors have a certain universal nature that would allow an educator to apply a model like the one detailed above in order to better aid students that may be struggling. This model could be applied at varying times to varying degrees of efficacy, but should in all cases provide greater value than mere guesswork.

I’d like to acknowledge again that the data used above, available here, was made available courtesy of Paulo Cortez, Universidade do Minho.

My work, available here, was inspired by the work of Paulo Cortez and Alice Silva, available here.