GradeGuardian: ML to predict student and school performance

Second Prize @ OpenEd.ai Hackathon 2017

Andrew Arpasi

Published in

OpenEd.ai

5 min read · Nov 15, 2017

Before you dive into this post, why not check out this demo video of our awesome application:

Motivation/Problem

There are many varying levels of school quality across India, as well as many different factors affecting student performance.

You were probably a student at one point (or maybe you are one now). Have you ever wondered how different factors in your place of education actually affect your performance, or which specific things were causing you to perform a certain way academically?

As university students ourselves, we realized that there are many levels of influence, ranging from the students themselves, to academic advisors, to policymakers. How can we bring all of these different groups together and provide an easy solution for all sides to boost education levels?

The solution

We wanted to build an intelligent platform that lets everyone, including policymakers, educators, advisors, and students, get involved in education.

We began by drafting out ideas, and first came up with an advisor chatbot for academic advisors to predict their students’ risk status using an ML model. We then expanded our project by creating an interactive map for policymakers to predict how different factors influence dropout rates across Indian schools.

Thus GradeGuardian was born: a suite of tools for predicting educational performance and risk level for students, while also using student data to provide deeper insights for education policymakers.

How it works

Here you can see a layout of how our application’s components work together:

Machine Learning within GradeGuardian

GradeGuardian uses machine learning in two main ways: models built with scikit-learn to validate the datasets, and AWS ML to deploy the production model.

scikit-learn and Dataset Validation

Datasets were gathered from several sources, including the Centers for Disease Control and Prevention (CDC), Kaggle, and the Ministry of Statistics.

DataSets Deep Dive

CDC Dataset: Our first attempt at a predictor of school performance. It had over 90 questions to ask students. Training models on this dataset gave inconclusive results, and asking students to answer so many questions through a chatbot seemed unnecessarily burdensome. We tried cleaning the dataset to include only complete entries, but that still did not yield a model with sufficient accuracy (we wanted 85% or better).

Kaggle Dataset: Used as our predictor of school performance. It asks only 31 questions about socioeconomic factors and study habits. Cross-validation of a linear regression model showed satisfactory accuracy, so this dataset was used in the project.

The Ministry of Statistics Dataset: Compiled from each state's report card on dropout rates and amenities. Used to show policymakers the effect of each amenity on dropout rates.

Cross Validation and Dataset Validation Example

Sample code for model generation is available to view in the MLModelGeneration repository.

Models were then cross-validated against the test partition.
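To make the validation step concrete, here is a minimal, dependency-free sketch of k-fold cross-validation. It is not the code from the MLModelGeneration repository (which used scikit-learn). The toy data, the `nearest_neighbor_predict` stand-in model, and the choice of five folds are all assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbor_predict(train_X, train_y, x):
    """Hypothetical stand-in model: return the label of the closest training point."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[distances.index(min(distances))]


def cross_validate(X, y, k=5, seed=0):
    """Return the accuracy measured on each held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        # Train only on the samples that are NOT in the current test fold.
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct = sum(
            nearest_neighbor_predict(train_X, train_y, X[i]) == y[i]
            for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return scores


# Toy data (made up): two numeric features per student, label 1 = "at risk".
rng = random.Random(42)
X = (
    [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
    + [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
)
y = [0] * 50 + [1] * 50

fold_scores = cross_validate(X, y, k=5)
mean_accuracy = sum(fold_scores) / len(fold_scores)
print("fold accuracies:", fold_scores)
print("mean accuracy:", round(mean_accuracy, 3))
```

The mean of the per-fold accuracies is the number you would compare against a target such as the 85% threshold mentioned above.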

Code Example of MLModelGeneration/trees.py:

Models used

Linear Regression
Logistic Regression
K Nearest Neighbors
Random Trees and Forests (MLModelGeneration/trees.py)
SVM (MLModelGeneration/SVN.ipynb)

AWS ML

AWS ML was easy to implement. Here are the steps we followed:

We loaded the CSV files containing all of the educational data from both datasets
We created a new data source
We created a new logistic regression model, which AWS makes very easy
After the model finished training, we set up the endpoint
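Once the endpoint exists, querying it might look roughly like the sketch below. It assumes a client that behaves like `boto3.client("machinelearning")`, whose `predict` call takes `MLModelId`, `Record`, and `PredictEndpoint`. The model ID, endpoint URL, and attribute names (`studytime`, `absences`) are placeholders, not values from our project, and `FakeClient` exists only so the example runs without AWS credentials.

```python
def build_record(answers):
    """Amazon ML real-time records map attribute names to string values."""
    return {name: str(value) for name, value in answers.items()}


def predict_at_risk(client, model_id, endpoint_url, answers):
    """Ask a real-time endpoint for a binary prediction.

    Returns (at_risk, score). Assumes a binary model whose positive
    label is "1"; adjust if your labels differ.
    """
    response = client.predict(
        MLModelId=model_id,
        Record=build_record(answers),
        PredictEndpoint=endpoint_url,
    )
    prediction = response["Prediction"]
    label = prediction["predictedLabel"]
    score = prediction["predictedScores"][label]
    return label == "1", score


class FakeClient:
    """Offline stand-in for the AWS client (hypothetical canned response)."""

    def predict(self, MLModelId, Record, PredictEndpoint):
        self.last_record = Record
        return {
            "Prediction": {
                "predictedLabel": "1",
                "predictedScores": {"1": 0.91},
            }
        }


client = FakeClient()
at_risk, score = predict_at_risk(
    client,
    "ml-EXAMPLE",                        # placeholder model ID
    "https://example.invalid/endpoint",  # placeholder endpoint URL
    {"studytime": 1, "absences": 12},    # hypothetical attributes
)
print(at_risk, score)
```

With real credentials, `client` would instead be created with `boto3.client("machinelearning")`, and the model ID and endpoint URL would come from the AWS console.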

Frontend

We have an elegant frontend that neatly organizes the different tools in our application. It was built using Angular 4 and Material Design Bootstrap. We also used Chart.js for data visualization and Raphael.js to make an interactive SVG-based map. Our user interface tries to embrace the elegance of Google’s Material Design, and we feel that this enhances the user experience.

Master Plan

We have created a Master Plan which shows trends we have found within our data sets, which are also clearly reflected in the predictions tool. This can be seen here. Our white paper with our findings can also be found here.

Predictions for Policy Makers

Using our ML models and datasets, we have created an interface for policymakers to predict dropout rates. By entering a percentage for each of 29 different factors, policymakers can see how those factors affect the predicted dropout rate. The map is based on the states/territories of India, and the colors reflect student performance and dropout rate.
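As a rough illustration of this kind of "what if" prediction, the sketch below assumes a simple linear model over a few factors. This post does not state which model backs the policymaker tool, and the factor names, intercept, and coefficients here are all invented; they are not taken from our trained model or from the Ministry of Statistics data.

```python
# All names and numbers below are invented for illustration only.
INTERCEPT = 12.0  # predicted dropout rate (%) when every factor is at 0%
COEFFICIENTS = {
    "schools_with_electricity": -0.04,
    "schools_with_drinking_water": -0.03,
    "schools_with_girls_toilet": -0.02,
}


def predict_dropout(factors):
    """Linear prediction of dropout rate, clamped to a valid percentage."""
    raw = INTERCEPT + sum(
        COEFFICIENTS[name] * pct for name, pct in factors.items()
    )
    return max(0.0, min(100.0, raw))


def what_if(factors, name, new_pct):
    """Change in predicted dropout rate if one factor moves to new_pct."""
    changed = dict(factors)
    changed[name] = new_pct
    return predict_dropout(changed) - predict_dropout(factors)


# Hypothetical state profile: percentage of schools with each amenity.
state = {
    "schools_with_electricity": 60.0,
    "schools_with_drinking_water": 90.0,
    "schools_with_girls_toilet": 80.0,
}
baseline = predict_dropout(state)
delta = what_if(state, "schools_with_electricity", 85.0)
print(f"baseline: {baseline:.2f}%  change: {delta:+.2f} points")
```

A negative change means the model predicts a lower dropout rate after the adjustment, which is the kind of comparison the interface is meant to surface.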

Chatbot and Advisor Console

We wanted a streamlined and personable interface for collecting student information, and we found that a chatbot is one of the best ways to do this. Our chatbot starts by asking all of the necessary questions, using API.ai to process a variety of user-entered answers. Afterwards, a small-talk feature lets users keep chatting with it if they wish.

This chatbot collects all of the data needed by our student risk model. The data is stored in a MongoDB collection, which is exposed through endpoints accessible by the advisor console. It is then sent from the database to our ML server, which uses AWS ML to predict whether or not a student is at risk, as well as the student's likely future grades based on their habits.

The advisor console is a place for an academic advisor or other school administrator to view which students are at risk, and look at their data to find out specifically which areas are causing their risk.
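The flow from stored answers to the advisor's view can be sketched as follows. Plain Python lists stand in for the MongoDB collection, and `toy_predict` is a made-up rule standing in for the call to the ML server. The field names (`answers`, `absences`, `at_risk`, `risk_score`) are hypothetical and do not reflect our actual schema.

```python
def score_students(students, predict):
    """Attach a risk prediction to each stored chatbot submission.

    `students` stands in for documents read from the database;
    `predict` stands in for the request to the ML server.
    """
    scored = []
    for doc in students:
        at_risk, score = predict(doc["answers"])
        scored.append({**doc, "at_risk": at_risk, "risk_score": score})
    return scored


def at_risk_view(scored):
    """Rows an advisor console might list: at-risk students, highest score first."""
    flagged = [doc for doc in scored if doc["at_risk"]]
    return sorted(flagged, key=lambda doc: doc["risk_score"], reverse=True)


def toy_predict(answers):
    """Hypothetical rule used only so the sketch runs offline."""
    score = min(1.0, answers["absences"] / 20)
    return score >= 0.5, score


# Made-up submissions, as the chatbot might have stored them.
students = [
    {"name": "A", "answers": {"absences": 2}},
    {"name": "B", "answers": {"absences": 14}},
    {"name": "C", "answers": {"absences": 18}},
]
console_rows = at_risk_view(score_students(students, toy_predict))
for doc in console_rows:
    print(doc["name"], doc["risk_score"], doc["answers"])
```

Keeping the original answers on each row is what would let an advisor look past the score and see which habits are driving the risk.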

What’s next for GradeGuardian

We have many ideas for GradeGuardian and believe that it has a lot of potential. We could improve our algorithm by using more datasets to add more features to predictions, and update the chatbot and user interface to make them more user-friendly and school-specific.

We could also expand to other education systems around the world, and possibly even market our product to colleges and universities looking to gain insight on student performance.

The Team