
Quiet Confidence
Researching the heart of graduate school enrollment:
They say at the heart of each fear is the unknown, and that the cure for all fear is preparation; if you’re built for anything, you are afraid of nothing.
After public speaking, death, and shark attacks, I imagine grad school applications have been terrorizing people's dreams for millennia. Today I hope to alleviate some of that anxiety.
Scraping
Naturally, admission/exam data is not part of the public domain. However, if you look hard enough, good samaritans/prospective students provide all the data points one might need to get an empirically defined picture of where we stand.
Search and submit to the largest database of graduate school admission results. Find out who got in where and when in…thegradcafe.com
- A decade of exam data.
Forum posts are laid out in a structured format, and new pages can be accessed by writing a ‘next page’ query in the URL. GRE/GPA data is presented as hover text, but it is extractable by parsing the HTML.
Easy Money.
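For the curious, the two ideas above (page through results with a URL query, then dig the hover-text scores out of the HTML) can be sketched roughly like this. It is a minimal illustration under loud assumptions, not the scraper used for this post: the base URL, the `q` and `p` query parameters, the `data-scores` attribute, and the sample markup are all made-up placeholders, and the real site's markup will differ.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical base URL and query parameter names; check the real site.
BASE_URL = "https://example.com/survey/index.php"


def page_url(query: str, page: int) -> str:
    """Build the URL for one page of results."""
    return f"{BASE_URL}?{urlencode({'q': query, 'p': page})}"


# Matches labelled numbers such as "GPA: 3.80" or "V: 160" (assumed format).
SCORE_RE = re.compile(r"(GPA|V|Q):\s*([\d.]+)")


def parse_scores(text: str) -> dict:
    """Turn 'GPA: 3.80 V: 160 Q: 165' into a dict of floats."""
    return {key: float(value) for key, value in SCORE_RE.findall(text)}


class HoverScoreParser(HTMLParser):
    """Collect score strings stored in a (made-up) data-scores attribute."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        scores = dict(attrs).get("data-scores")
        if scores is not None:
            self.rows.append(parse_scores(scores))


# Invented sample markup standing in for one downloaded page.
SAMPLE_HTML = """
<table>
  <tr><td>Some University</td>
      <td><span data-scores="GPA: 3.80 V: 160 Q: 165">Accepted</span></td></tr>
  <tr><td>Other University</td>
      <td><span data-scores="GPA: 3.50 V: 155 Q: 168">Rejected</span></td></tr>
</table>
"""

parser = HoverScoreParser()
parser.feed(SAMPLE_HTML)
print(page_url("computer science", 2))
print(parser.rows)
```

In a real run you would download each `page_url(...)` in a loop (politely, with a delay) and feed the response body to the parser instead of `SAMPLE_HTML`.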
Cleaning
Since the program and university names are all user generated, the data is full of spelling mistakes, differing nomenclature, and changes in exams (old GRE -> revised GRE), so we need a little cleaning before we can analyze it.
1. Separate university names by an index of common acronyms, plus values found within brackets (UCLA, USC, MSU, etc.).
   - Regex search for all acronyms/contractions.
2. Separate by location, using proper nouns picked out by their capitalized letters (UMich Dearborn is not the same as UMich Ann Arbor).
3. Hard part: cluster by top university*, identify synonyms for the same university (acronyms, internal graduate department name, program name, misspellings), and aggregate.
   - Use the 80–20 rule:
   - Get counts of the most common universities per location.
   - Regex the comment section for those locations.
   - Repeat with lower case and punctuation removed.
4. Finally, append to the synonym list and dedup (U of A is both Arkansas and Arizona, but they are separated by location, so we can use that to dedup).
5. Repeat for major names (skipped this part, hence the 80–20 rule).
6. Filter GPA to `less than or equal to 4.3` to eliminate foreign GPAs on other scales that would distort our results. Likewise, keep only GRE totals <= 340 to focus this exercise on the revised (current) GRE.
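The cleaning steps above might look something like the sketch below. Treat it as an illustration only: the synonym table entries, the function names, and the idea of passing location in as a lower-case string are my assumptions for the example, not the actual lookup built for this post.

```python
import re
import string

# Illustrative synonym table: (normalized name, location) -> canonical name.
# Entries are examples only; a real list would be much longer.
SYNONYMS = {
    ("u of a", "arizona"): "University of Arizona",
    ("u of a", "arkansas"): "University of Arkansas",
    ("ucla", None): "University of California, Los Angeles",
}

BRACKET_RE = re.compile(r"\(([^)]+)\)")  # text inside parentheses
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(name: str) -> str:
    """Lower-case, drop punctuation, and collapse whitespace."""
    return " ".join(name.lower().translate(PUNCT_TABLE).split())


def canonical_university(raw: str, location=None) -> str:
    """Resolve a user-typed name to a canonical one, else return it trimmed."""
    # Try the full string first, then any "(UCLA)"-style bracketed acronym.
    candidates = [raw] + BRACKET_RE.findall(raw)
    for candidate in candidates:
        key = normalize(candidate)
        # Location-specific synonyms win over location-free ones.
        for loc in (location, None):
            if (key, loc) in SYNONYMS:
                return SYNONYMS[(key, loc)]
    return raw.strip()


def keep_row(gpa: float, gre_total: float) -> bool:
    """Keep GPAs on a 4.3-or-lower scale and revised-GRE totals only."""
    return gpa <= 4.3 and gre_total <= 340


print(canonical_university("U. of A", "arizona"))
print(canonical_university("U. of A", "arkansas"))
print(canonical_university("Univ. of California Los Angeles (UCLA)"))
print(keep_row(3.9, 330), keep_row(8.5, 330), keep_row(3.5, 1400))
```

Keying the table on location as well as name is what lets "U of A" resolve two different ways. Anything not in the table falls through unchanged, which fits the 80–20 approach: cover the most common names and leave the long tail alone.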
Overview
Where are we applying? (Top 50)

- Bar Columbia University, the top 10 most applied-to universities appear to reject more than they accept.
- Though it is nice to see most people aiming high and selecting elite learning institutions.
What are we applying for? (Top 50)

- A lot more PhDs than expected. Even for larger-volume majors like Physics, I expected more Masters candidates.
- Most universities offer Engineering degrees as an MS, not an MEng.
Where are we going? (Accepted)

- UIUC, UCLA, Madison, and other larger universities appear often across different majors.
- Nice to see that a few universities do not dominate any one major.
Now that we have an overview of the universities prospective students are applying to, let's dig a little deeper into the students themselves.
Predictors

- Quantitative scores are overall higher than their verbal counterparts.
- Compared to the official GRE averages of 150 for verbal and 152 for quantitative, the self-reported scores are skewed noticeably upward.
- There are definitely many biases in building a sample only from forum posts: scores are self-reported, universities are skewed toward the top tier, many posts come from the same applicant, and those who post to forums are generally more serious about attending a higher education program.

- GPA and GRE scores seem to align with the assumptions presented above: the skew in GRE scores follows the skew in GPAs (those with high GPAs get high GRE scores).
Grades and Admission Correlation
    > sapply(c("gpa", "verbal", "quant", "total"), function(x) print(summary(lm(g3[,x]~g3$decision))$coefficients))
    # gpa
                           Estimate  Std. Error    t value    Pr(>|t|)
    (Intercept)          3.67261098 0.002039816 1800.46150 0.00000e+00
    g3$decisionRejected -0.04457445 0.003253950  -13.69857 1.36217e-42
    # verbal
                           Estimate Std. Error    t value     Pr(>|t|)
    (Intercept)         158.6258600 0.05004327 3169.77396 0.000000e+00
    g3$decisionRejected  -0.8462978 0.07982988  -10.60127 3.246318e-26
    # quant
                          Estimate Std. Error    t value    Pr(>|t|)
    (Intercept)         160.092758 0.05120544 3126.47930 0.00000e+00
    g3$decisionRejected   0.494945 0.08168379    6.05928 1.38242e-09
    # total
                           Estimate Std. Error     t value    Pr(>|t|)
    (Intercept)         318.7186178 0.07348505 4337.189906 0.000000000
    g3$decisionRejected  -0.3513528 0.11722460   -2.997262 0.002726237
- I guess they aren't lying when they say the GRE is often disregarded as a measure of aptitude: rejected applicants average lower GPA, verbal, and total scores than accepted ones, but only by small margins, and their quant scores are actually higher.
- Although the p-values suggest these differences are unlikely to be down to chance alone, their magnitude, and for quant their sign, lead me to believe that the relationship is not well defined here.
- Read literally, a rejection goes with a verbal score about 0.85 points lower and a quant score about 0.5 points higher. (Doesn't pass the sanity check.)
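A note on reading those coefficients: when the only predictor is an accepted/rejected flag, R's `lm` uses "Accepted" as the baseline by default, so the intercept is the accepted group's mean and the `decisionRejected` coefficient is simply the rejected mean minus the accepted mean. The small sketch below checks that identity by hand. The scores are invented for illustration and are not the scraped data.

```python
from statistics import mean

# Made-up verbal scores, purely for illustration (not the scraped data).
accepted = [158, 160, 162, 157, 159]
rejected = [157, 159, 158, 156]

# Dummy-code the decision: 0 = accepted (baseline), 1 = rejected.
x = [0] * len(accepted) + [1] * len(rejected)
y = accepted + rejected

# Ordinary least squares for y = intercept + slope * x.
x_bar, y_bar = mean(x), mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

diff_of_means = mean(rejected) - mean(accepted)
print(f"intercept = {intercept:.2f}, mean(accepted) = {mean(accepted):.2f}")
print(f"slope = {slope:.2f}, difference of means = {diff_of_means:.2f}")
```

So the regression here is a two-group comparison of averages. A tiny p-value on a sample this large says the gap is probably not zero, not that the gap is big enough to matter.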

- Even grouping by major (only those with enough data), no definite conclusion can be reached: applicants with significantly lower scores are often accepted over higher-scoring applicants.
*UIUC’s computer science program is the most applied to, from the data extracted.
*Although the PhD program does show a trend of high GRE scores driving acceptance, it is muddied by the unclear picture among MS applicants.
- Perhaps the positively skewed GRE scores are so close together, and so close to the top, that a difference between them has no bearing on the decision (a possible area to look into, but it would require complete applications for a ceteris paribus comparison).
Offset prediction with pessimism
Since our dataset seems too skewed for multivariate analysis, we should get to the point of this exercise and answer the ultimate question.
Who was the gangster that got into my dream school with those scores?
- Somewhat of a different picture than we are used to

Conclusion
- We scraped a quarter of a million grad application forum posts
- Did our best to clean, and dedup (No guarantees here)
- Visualized some of the basic statistics: popular schools/majors and grade distributions
- Attempted to find some correlation between grades and admissions
- Celebrated the men and women who took on the house and won.
Train alongside those who know you best, learn from those you trust most, respect the wisdom of those who’ve been there before, and embrace the promise of those just on their way.