Mining the Common App: Part 1

Mike Yung
5 min read · Dec 1, 2016


The industry built around college admissions is a pretty saturated one — from private ‘elite’ tutors in Asia, to global test-prep powerhouses like Kaplan, and even to TaskRabbit-esque marketplaces for ‘essay consultation services’ (read: latent plagiarism). Yet it seems tutors, teachers, and consultants alike have yet to crack the code on admissions. Does a magic formula even exist? Probably not, but there are likely decision rules within admissions committees we don’t know much about, and there are most certainly underlying relationships across acceptances and denials that we can learn from.

AdmitSee: a college application resource.

For my final capstone project at Galvanize, I had the fortune of working with AdmitSee, an online college application resource where prospective students can browse the profiles and essays of real admitted students. With this unique data, I set out to answer two key questions:

I) Can we create a better predictive model than the existing probability-calculators for college admissions?

II) What insights can we glean from the Common App essay, both on an individual and an aggregate level?

This post addresses the first question. Part II can be found here.

One Model to Trump Them All

For starters, let’s be clear that there isn’t really a great way to validate whether a new model is better or worse than any of the existing calculators out there (e.g. StatFuse, Cappex), since that would require i) making thousands of predictions for actual, real students using those calculators; ii) waiting potentially months for those students’ admission results; and iii) computing the accuracy of each model, not to mention getting everyone’s consent to record all of this. A test like that is basically infeasible. We do know, however, that these models all use approximately the same input factors: SAT/ACT scores, GPA, demographics… the basic information you would expect. If we assume that the performance of these models is also approximately equal, then adding more dimensions of information (i.e. new features) will likely yield better results.

Constraining the Problem

After doing a train-test split on my dataset, I had about 12k students to work with. Among these, even the schools with the most data had at most a few hundred students each. Since modeling at the school level would likely be quite inaccurate, I chose to constrain the problem to predicting a single binary outcome: admission into a ‘top school’. In other words, the outcome variable is 1 if a student was accepted into any of the ‘top schools’ (defined as the Ivies, Stanford, and MIT), and 0 otherwise.
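To make that concrete, the labeling rule boils down to something like the snippet below (a minimal sketch; the school-name strings and data format are placeholders, not the actual AdmitSee schema):

```python
# Hypothetical format: each student carries a list of schools they were accepted to.
TOP_SCHOOLS = {
    "Brown", "Columbia", "Cornell", "Dartmouth", "Harvard",
    "Penn", "Princeton", "Yale", "Stanford", "MIT",
}

def label_top_school(acceptances):
    """Return 1 if any accepted school is in the 'top school' set, else 0."""
    return int(any(school in TOP_SCHOOLS for school in acceptances))

print(label_top_school(["Tufts", "MIT"]))   # -> 1
print(label_top_school(["Tufts", "NYU"]))   # -> 0
```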

Building the Ensemble Model

Here is a simplified visual representation of how I built my ensemble model. Starting with a set of about 50 raw fields, I feature-engineered a handful of potentially useful predictors, such as varsity sport involvement, winning an award, taking on a leadership position, etc. On the essay side, I employed NLP techniques to find the topic distribution of each essay (I’ll go into more depth about how this was done in the subsequent post). Additionally, I created a variable called word_sophistication, a proxy for how many ‘fancy’ words a student used in his/her essay (measured as total occurrences of sophisticated words / total word count). One might hypothesize that both extremes are negatively correlated with admissions outcomes: a value of zero might indicate a lack of wordsmanship, while a high value could point to a loquacious writer exorbitantly flamboyant in his lexical verbiage (excuse the irony). If so, the optimum lies somewhere along this spectrum, and we let the beauty of machine learning take over to find that point or range.

Visual representation of my model pipeline
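As a rough illustration, the word_sophistication feature can be computed along the following lines (the lexicon here is a tiny stand-in; the actual word list and tokenization are my assumptions):

```python
import re

# Placeholder lexicon; the real 'sophisticated word' list is much larger.
SOPHISTICATED_WORDS = {"ubiquitous", "ephemeral", "quintessential", "juxtapose"}

def word_sophistication(essay: str) -> float:
    """Occurrences of 'fancy' words divided by total word count."""
    tokens = re.findall(r"[a-z']+", essay.lower())
    if not tokens:
        return 0.0
    return sum(t in SOPHISTICATED_WORDS for t in tokens) / len(tokens)

print(word_sophistication("Change is ephemeral, yet ubiquitous."))  # -> 0.4
```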

Evaluating the Model(s)

To evaluate the model, I spoke with the folks at AdmitSee, and we agreed that precision would be the most fitting measure of a model’s success here (refer to Aside 1 below for more on classifier metrics). A model tuned for precision is more conservative about predicting an acceptance, which aligns quite well with AdmitSee’s goal of encouraging students to use their product even if they might be ‘star students’ to begin with.

A Receiver Operating Characteristic (ROC) curve illustrates the performance of a model as we vary the threshold at which we discriminate between the two classes. Basically, the goal is to maximize the area under the curve (AUC). In the graph below, we compare the performance of four models: i) Logistic Regression, ii) Random Forest, iii) a basic Ensemble Model [LR + RF], and iv) a Grand Ensemble that builds on the basic Ensemble and combines it with a new model that incorporates the essay features. Setting LR aside, the areas under the curves look visually indiscernible, but the Grand Ensemble takes the cake, with a precision of 62.8 (compared to the Ensemble’s 61.9 and RF’s 57.7).

Receiver Operating Characteristic (ROC) Curve on the Test set
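For reference, a comparison along these lines can be produced with scikit-learn roughly as follows. This is a sketch rather than my exact pipeline: the hyperparameters are placeholders, the essay-feature model is omitted, and X_train/X_test/y_train/y_test are assumed to come from the earlier split.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, roc_auc_score, roc_curve

lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=500, random_state=0)
ensemble = VotingClassifier([("lr", lr), ("rf", rf)], voting="soft")

for name, clf in [("LR", lr), ("RF", rf), ("Ensemble (LR+RF)", ensemble)]:
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)[:, 1]      # P(admitted to a top school)
    fpr, tpr, _ = roc_curve(y_test, probs)
    auc = roc_auc_score(y_test, probs)
    prec = precision_score(y_test, clf.predict(X_test))
    plt.plot(fpr, tpr, label=f"{name}: AUC={auc:.3f}, precision={prec:.3f}")

plt.plot([0, 1], [0, 1], "k--", label="chance")  # diagonal reference line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```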

Aside 1: Classifier Performance

In an earlier post I wrote on Supervised Learning, I touched briefly on Type I vs Type II errors and noted that one might choose to minimize one error over the other depending on the situation. This tension can also be understood by looking at the metrics used to evaluate a classifier’s performance. The three most common are Accuracy, Precision, and Recall.

Accuracy = correctly predicted data points / total data points

Precision = correctly predicted positives / total predicted positives

Recall = correctly predicted positives / total true positives

In other words, accuracy measures how many data points you predicted correctly, regardless of class (+/-); precision measures, out of all the data points you labeled positive, how many were actually positive; recall measures, out of all the data points that were actually positive, how many you labeled positive. It’s important to note that an inherent tradeoff exists between precision and recall: increasing one typically comes at the expense of the other.
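In scikit-learn terms, with a tiny made-up set of labels (1 = admitted to a top school, 0 = not):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # what actually happened
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # what the model predicted

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total        -> 0.625
print(precision_score(y_true, y_pred))  # TP / predicted positives -> 2/3
print(recall_score(y_true, y_pred))     # TP / actual positives    -> 0.5
```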

Interpreting the Model

Optimizing for precision is great, but what if we wanted to know how each variable affects your admissions chances? This is where Logistic Regression shines. In spite of its weaker performance, it is highly interpretable. More specifically, we can exponentiate each coefficient to get an odds ratio, i.e. the multiplicative effect of that feature on the odds of the outcome.

Since I’m bound by an NDA, I can’t really disclose the details, but I can give a quick example. The coefficient for the binary variable leader is 0.82. Exponentiating that gives us an odds ratio of 2.26. What that means is that, all else equal, holding a leadership position is associated with more than double the odds of being admitted!
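The calculation itself is a one-liner. Here’s a sketch using statsmodels (my choice for illustration, not necessarily what the project used); X and y are assumed to be the feature matrix and the top-school outcome from earlier:

```python
import numpy as np
import statsmodels.api as sm

# An unregularized fit keeps the coefficients interpretable as log-odds.
logit = sm.Logit(y, sm.add_constant(X)).fit()
odds_ratios = np.exp(logit.params)   # exp(coefficient) = odds ratio per feature
print(odds_ratios.sort_values())     # e.g. leader -> roughly 2.26
```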

Final Thoughts & Caveats

One thing to note is that our model implicitly assumes that these top schools apply the same criteria to vetting applicants every year, whereas in reality they probably update (even if slightly) what they look for in students as time passes. In terms of next steps, I’d like to perform some more feature engineering by looking at interaction effects (e.g. Varsity * Captain), and by exploring deeper effects intertwined across variables (e.g. a Hispanic student holding a leadership position in an Asian Student Society).
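When the underlying features are binary indicators, such interaction terms are just element-wise products; a toy example with hypothetical column names:

```python
import pandas as pd

# Hypothetical 0/1 indicator columns; the interaction is their product.
df = pd.DataFrame({"varsity": [1, 1, 0], "captain": [1, 0, 1]})
df["varsity_x_captain"] = df["varsity"] * df["captain"]   # 1 only if both are 1
print(df)
```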

For a technical dive, feel free to check out my GitHub repo for this project.

Read on for Part II
