GRE Admission Prediction using Machine Learning

srinathn
7 min read · Jul 22, 2020

Introduction

This post looks at which features matter most for getting admitted to graduate programs at universities around the world. We wrote it because we are students preparing for the GRE ourselves, and the analysis should help us too.

Dataset

The dataset includes the key features of an applicant's profile:

  1. GRE Scores (out of 340)
  2. TOEFL Scores (out of 120)
  3. University Rating (out of 5)
  4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
  5. Undergraduate CGPA (out of 10)
  6. Research Experience (either 0 or 1)
  7. Chance of Admit (ranging from 0 to 1)

You can access the dataset from here.

Main Insights

  • CGPA is by far the most important factor in Graduate Admissions.
  • An additional point on the GRE increases chances of graduate admission by about 0.14%.
  • An additional point on the TOEFL increases chances of graduate admission by about 0.26%.
  • Performing research is only reasonably beneficial if the college rating is low.

This article will dive into how these results were achieved using the following machine learning techniques:

  • Building a Logistic Regression model to predict the chance of admission, then analysing and visualising its coefficients.
  • Using Permutation Importance and a Decision Tree Regressor model to find feature importance.
  • Using single-variable and grid Partial Dependence Plots (PDPs) to see how one or more variables affect the chance of admission.
  • Using a heat-map and a pair plot to visualize the relationships between the features.
  • Evaluating a RandomForest model and Logistic Regression with MAE (Mean Absolute Error).

Tools Used

  1. seaborn
  2. pandas
  3. numpy
  4. matplotlib
  5. scikit-learn

Models

  1. Linear Regression
  2. Logistic Regression
  3. RandomForest
  4. Permutation Importance

Data

The data head:

We can also use pandas' built-in describe():
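The original screenshots of the data are not reproduced here, so below is a rough, stdlib-only sketch of what looking at the first rows and summarising each column involves. The rows are made up for illustration (they are not values from the real dataset), and the column names, including the split of SOP and LOR into separate columns, are assumptions. With pandas you would simply call head() and describe() on the DataFrame.

```python
import csv
import io
import statistics

# Made-up rows for illustration only; column names are assumptions.
SAMPLE_CSV = """GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
330,115,5,4.5,4.0,9.4,1,0.90
312,104,3,3.0,3.5,8.2,0,0.68
300,99,2,2.5,2.0,7.6,0,0.52
321,110,4,4.0,4.5,8.9,1,0.80
"""


def load_rows(text):
    """Parse CSV text into a list of dicts with float values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: float(value) for name, value in row.items()} for row in reader]


def describe(rows):
    """Return count, mean, std, min and max per column (a small subset of describe())."""
    summary = {}
    for name in rows[0]:
        column = [row[name] for row in rows]
        summary[name] = {
            "count": len(column),
            "mean": statistics.mean(column),
            "std": statistics.stdev(column),  # sample standard deviation
            "min": min(column),
            "max": max(column),
        }
    return summary


rows = load_rows(SAMPLE_CSV)
for row in rows[:2]:  # the first rows, similar to head()
    print(row)
for name, stats in describe(rows).items():
    print(name, stats)
```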

Linear Regression

Linear regression is a statistical method for deriving a formula that predicts the value of one variable from another when a relationship exists between the two.

The formula for simple linear regression is that of a straight line: y = mx + c

The variables y and x are the ones whose relationship we want to determine.

The two variables are named as follows:

  1. y : Dependent variable
  2. x : Independent variable

The equation above is the slope-intercept form of a line, in which y is the dependent variable, c is the intercept, m is the slope, and x is the independent variable.

So, given values of the independent variable x, the regression model computes the values of c and m that minimize the difference between the actual values of the dependent variable y and the predicted values of y. Ordinary least squares, the usual choice, minimizes the sum of the squared differences rather than the absolute differences.

In the plot, the x-axis shows GRE Score, the y-axis shows Chance of Admit, and the red line shows the predicted values.
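To make the fitting step concrete, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python. The (GRE score, chance of admit) pairs are made up for illustration; in practice scikit-learn's LinearRegression does the same job.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x  # the fitted line passes through the means
    return m, c


# Made-up (GRE score, chance of admit) pairs, for illustration only.
gre = [300, 310, 320, 330, 340]
chance = [0.50, 0.62, 0.71, 0.84, 0.93]

m, c = fit_line(gre, chance)
predicted = [m * x + c for x in gre]  # points on the fitted (red) line
print(f"slope m = {m:.4f}, intercept c = {c:.3f}")
```

The slope m is read as the change in predicted chance of admit for each additional GRE point.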

Logistic Regression

The MAE (Mean Absolute Error) for Logistic Regression was 4.2% in predicting graduate admission chances (out of 100%).
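For reference, MAE is the average of the absolute differences between the actual and predicted values. A minimal sketch, using made-up actual and predicted chances rather than the model's real outputs:

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values for illustration only.
actual = [0.90, 0.68, 0.52, 0.80]
predicted = [0.86, 0.71, 0.58, 0.79]

error = mae(actual, predicted)
print(f"MAE = {error:.3f} ({error * 100:.1f}% on a 0-100% scale)")
```

Because the target is a chance between 0 and 1, multiplying the MAE by 100 gives the error on a 0 to 100% scale.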

Random Forest

I used the random forest algorithm for this regression problem. Random forests train each tree independently, using a random sample of the data. This randomness makes the model more robust than a single decision tree and less likely to overfit the training data.

n_estimators is the number of trees in the forest. Because Random Forest is an ensemble method that builds multiple decision trees, this parameter controls how many trees are built.

The Mean Absolute Error for the random forest model is 0.0512 (5.1%).
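To illustrate the idea of many independently trained trees averaged together, here is a heavily simplified sketch. Each "tree" is a one-split regression stump, trained on a bootstrap sample and a randomly chosen feature. This is a toy stand-in, not scikit-learn's RandomForestRegressor, and the data is made up.

```python
import random


def fit_stump(rows, targets, feature):
    """Pick the threshold on one feature that minimizes squared error."""
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all sampled values identical: always predict the mean
        mean = sum(targets) / len(targets)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_estimators, seed=0):
    """Train n_estimators stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        indices = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        sample_rows = [rows[i] for i in indices]
        sample_targets = [targets[i] for i in indices]
        feature = rng.randrange(len(rows[0]))  # random feature for this stump
        forest.append(fit_stump(sample_rows, sample_targets, feature))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps."""
    preds = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return sum(preds) / len(preds)


# Made-up rows of [GRE score, CGPA] and chances of admit.
rows = [[300, 7.5], [305, 7.8], [312, 8.2], [318, 8.6], [325, 9.0], [332, 9.4]]
targets = [0.50, 0.55, 0.66, 0.74, 0.83, 0.91]

forest = fit_forest(rows, targets, n_estimators=50)
print(round(predict(forest, [320, 8.8]), 3))
```

Here n_estimators plays the same role as in the text: it sets how many trees (stumps, in this sketch) are averaged.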

As you can see, CGPA is the most important criterion for graduate admission, followed by GRE and SOP scores.

Permutation Importance

Permutation Importance is a method of evaluating feature importance by randomly shuffling one column at a time and measuring the resulting decrease in accuracy. Columns whose shuffling causes the biggest decrease should be more important than those whose shuffling barely changes the accuracy.

As before, CGPA is very important, followed by GRE and TOEFL scores.
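The shuffling procedure can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: toy_model is a hypothetical stand-in for a trained model, the rows are made up, and importance is measured as the increase in MAE rather than a drop in accuracy. scikit-learn provides sklearn.inspection.permutation_importance for real models.

```python
import random


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Average increase in MAE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mae(targets, [model(r) for r in rows])
    importances = []
    for feature in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[feature] for r in rows]
            rng.shuffle(column)  # break the link between this column and the target
            shuffled = [r[:feature] + [v] + r[feature + 1:]
                        for r, v in zip(rows, column)]
            score = mae(targets, [model(r) for r in shuffled])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical model: depends on CGPA only and ignores the second column."""
    return 0.1 * row[0] - 0.1


# Made-up rows of [CGPA, unrelated column].
rows = [[7.5, 3], [8.0, 1], [8.4, 4], [8.8, 2], [9.2, 5], [9.6, 0]]
targets = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, targets)
for name, value in zip(["CGPA", "unrelated"], importances):
    print(f"{name}: {value:.4f}")
```

Shuffling the column the model ignores leaves the error unchanged, so its importance is zero, while shuffling CGPA makes the predictions worse.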

Visualisation

Pair Plot

A pair plot lets us see both the distribution of single variables and the relationships between pairs of variables. It builds on two figures, the histogram and the scatter plot: the histogram shows the distribution of a single variable, while the scatter plot shows the relationship between two variables.

HeatMap

The heat-map is a way of representing the data in a 2-dimensional form. The data values are represented as colors in the graph. The goal of the heat-map is to provide a colored visual summary of information.
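The heat-map image is not reproduced here. Assuming it shows pairwise correlations between the features (a common choice, but an assumption on our part), the numbers behind each coloured cell could be computed as in this sketch. The column values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of correlations: one value per heat-map cell."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up columns for illustration only.
columns = {
    "GRE Score": [300, 312, 321, 330],
    "CGPA": [7.6, 8.2, 8.9, 9.4],
    "Chance of Admit": [0.52, 0.68, 0.80, 0.90],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

A plotting library such as seaborn then maps each value in the matrix to a colour.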

SHAP Force Plots

SHAP Force Plots let us see why a particular example received the score it did.

In this example, the final chance of admission was 0.75. Factors in red pushed the chance up, and factors in blue pushed it down. The length of each bar represents how strong that factor's effect was.
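The red and blue bars are per-feature contributions that, added to a base value, give the final prediction. For a linear model with independent features these contributions have a simple closed form, weight × (value − average value), which the sketch below uses. The weights, bias, and rows are hypothetical, and the shap library handles general models differently.

```python
def linear_shap(weights, bias, background, row):
    """Base value and per-feature contributions for a linear model."""
    n = len(background)
    means = [sum(r[i] for r in background) / n for i in range(len(weights))]
    # Base value: the model's prediction for an "average" applicant.
    base_value = bias + sum(w * m for w, m in zip(weights, means))
    # Each feature pushes the prediction up (positive) or down (negative).
    contributions = [w * (x - m) for w, x, m in zip(weights, row, means)]
    return base_value, contributions


# Hypothetical linear model over [GRE score, CGPA]; not fitted to real data.
weights = [0.004, 0.15]
bias = -1.8

# Made-up background rows and the one applicant being explained.
background = [[300, 7.6], [312, 8.2], [321, 8.9], [330, 9.4]]
applicant = [320, 9.0]

base_value, contributions = linear_shap(weights, bias, background, applicant)
prediction = bias + sum(w * x for w, x in zip(weights, applicant))

print(f"base value: {base_value:.3f}")
for name, value in zip(["GRE Score", "CGPA"], contributions):
    direction = "up (red)" if value > 0 else "down (blue)"
    print(f"{name}: {value:+.3f} pushes the chance {direction}")
print(f"prediction: {prediction:.3f}")
```

The base value plus the sum of the contributions equals the prediction, which is why the bars in a force plot meet at the final score.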

PDP Plots

PDP Plots, or Partial Dependence Plots, show how a single feature can affect the target variable.

This Partial Dependence Plot suggests that the optimal GRE score is around the peak at 320. A possible explanation is that time spent studying for the GRE takes time away from other things (say, CGPA) that are more important. Additionally, note the large error range.

The PDP for CGPA stands out the most because of its very small error bars. There is a clear and obvious trend that a higher CGPA correlates with a higher chance of admission.
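Under the hood, a single-variable partial dependence curve is built by forcing one feature to each value on a grid for every row, then averaging the model's predictions. A minimal sketch, assuming a hypothetical toy_model and made-up rows (so it will not reproduce the peak at 320 seen in the real plot):

```python
def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        modified = [r[:feature] + [value] + r[feature + 1:] for r in rows]
        preds = [model(r) for r in modified]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(row):
    """Hypothetical model over [GRE score, CGPA]; not fitted to real data."""
    gre, cgpa = row
    return 0.004 * gre + 0.15 * cgpa - 1.8


# Made-up rows of [GRE score, CGPA].
rows = [[300, 7.6], [312, 8.2], [321, 8.9], [330, 9.4]]
gre_grid = [300, 310, 320, 330, 340]

curve = partial_dependence(toy_model, rows, feature=0, grid=gre_grid)
for score, value in zip(gre_grid, curve):
    print(score, round(value, 3))
```

Plotting the grid values against the curve gives the PDP. scikit-learn offers this for fitted estimators through sklearn.inspection.PartialDependenceDisplay.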

PDP Grid / Multidimensional PDP Plots

PDP Grid / Multidimensional PDP plots are a special gem: they show how the interaction between two variables (hence the alternative name, PDP Interaction plots) results in a certain chance of admission.
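The same averaging idea extends to two features: force both to every combination of grid values and average the predictions over the rows. The sketch below assumes a hypothetical toy_model with a GRE × CGPA interaction term and made-up rows.

```python
def partial_dependence_grid(model, rows, feature_a, feature_b, grid_a, grid_b):
    """result[i][j] = average prediction with feature_a=grid_a[i], feature_b=grid_b[j]."""
    result = []
    for value_a in grid_a:
        line = []
        for value_b in grid_b:
            preds = []
            for row in rows:
                modified = list(row)  # copy so the original row is untouched
                modified[feature_a] = value_a
                modified[feature_b] = value_b
                preds.append(model(modified))
            line.append(sum(preds) / len(preds))
        result.append(line)
    return result


def toy_model(row):
    """Hypothetical model over [GRE score, CGPA, Research] with an interaction."""
    gre, cgpa, research = row
    gre_scaled = (gre - 290) / 50
    cgpa_scaled = (cgpa - 6.8) / 3.2
    return 0.3 + 0.5 * gre_scaled * cgpa_scaled + 0.05 * research


# Made-up rows of [GRE score, CGPA, Research].
rows = [[300, 7.6, 0], [312, 8.2, 0], [321, 8.9, 1], [330, 9.4, 1]]
gre_grid = [300, 320, 340]
cgpa_grid = [7.5, 8.5, 9.5]

grid = partial_dependence_grid(toy_model, rows, 0, 1, gre_grid, cgpa_grid)
for score, line in zip(gre_grid, grid):
    print(score, [round(v, 3) for v in line])
```

Each cell of the resulting grid corresponds to one coloured cell in the grid plot.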

In the grid plot, the more yellow areas have a higher chance of admission. A low GRE score fares badly with an average CGPA. As seen before in the single-variable PDP, the range around a GRE score of 320 seems to be the ‘golden area’, especially when combined with a high CGPA.

In the grid plot, the more yellow areas have a higher chance of admission. A low GRE score fares badly for pretty much every university rating. As seen before in the single-variable PDP, the range around a GRE score of 320 seems to be the ‘golden area’, with universities rated 5 accepting people with around that score in the highest numbers.

We should be wary of the limited data: the dataset is small, and our finding that the chance of admission decreases as GRE score increases (after a certain point) does not make sense.

Another repeated result: CGPA matters significantly regardless of university rating.

Code

Let’s get started with the code. The complete project on GitHub can be found here.

Errors you may come across

  1. Check that the installed packages are updated to the latest version.
  2. Make sure the versions of all packages are compatible with each other. (You may hit this error with numpy and TensorFlow.)
  3. While plotting SHAP Force Plots you might see this error: “Visualisation omitted, Javascript library not loaded! Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on GitHub the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.”

If so, call shap.initjs() in the notebook before plotting.

Conclusion

This article is meant as a tutorial on an interesting, current topic. It shows how ML can be used to estimate the chances of admission and does not attempt to produce exact results.

Contributor

SALONI PATADIA
