Mining the Common App: Part 2
This is the second installment of Mining the Common App. As a quick refresher: I recently had the opportunity to work on a data project with AdmitSee, an online college application resource. In Part 1, I talked about the process of building a classifier that predicts a student’s probability of being admitted to a ‘top school’. In this post, we’ll address the following:
What insights can we glean from the Common App essay, on both an individual and an aggregate level?
Let’s start with a cool visual.
You might be thinking, how were the axes created/defined? Why are certain points colored the same within a shared cluster? Read on to find out how I turned the text of thousands of college essays into the graphic above…
Pre-Cleaning (discard stop words + stem)
Before we do anything, we need to do some cleaning. Eventually we’ll want to measure the importance of words within an essay and draw patterns across essays, so first we need to remove words that might throw our analysis off. More specifically, we throw out ‘stop words’ (‘you’, ‘me’, ‘him’, ‘of’, ‘the’, and so on), words we use in everyday life that carry essentially no useful information for text analysis. (Unless you’re doing something like an analysis of linguistic evolution in a digital setting, in which case they might.)

Next, we’d like to reduce words to their basic root form: plurals should be singularized, adverbs reduced to basic adjectives, among other things. There are generally two ways to go about this: stemming and lemmatizing. Stemmers tend to be more ‘strict’ (i.e. they chop off more of a word), and the resulting tokens are often not real words, e.g. familial → famili, arguing → argu. Lemmatizers are more ‘lenient’ (i.e. they don’t reduce the word as much), and the resulting tokens are usually real, interpretable words. For my dataset, I first used NodeBox’s Linguistics library to convert all verbs into present tense (NLTK isn’t great with tense correction), and then used NLTK’s SnowballStemmer to stem words. I chose a stemmer over a lemmatizer because preserving real words mattered less to me than grouping words with similar roots together.
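To make the cleaning step concrete, here’s a minimal, dependency-free sketch. The post used NodeBox Linguistics and NLTK’s SnowballStemmer; the tiny stop-word list and crude suffix-stripping stemmer below are stand-ins for those libraries, not the real thing.

```python
import re

# Tiny stand-in stop-word list; NLTK's English list has ~180 entries.
STOP_WORDS = {"the", "a", "an", "of", "you", "me", "him",
              "was", "with", "about", "and", "to", "in"}

def crude_stem(word):
    """Crude suffix-stripping stand-in for NLTK's SnowballStemmer
    (the real stemmer also maps e.g. 'arguing' -> 'argu')."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(essay):
    """Lowercase, tokenize, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", essay.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The teacher was arguing with students about familial bonds"))
# → ['teacher', 'argu', 'student', 'familial', 'bond']
```

The real pipeline swaps `crude_stem` for `SnowballStemmer("english").stem`, but the shape of the transformation is the same.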
Vectorizing the Essays (using TF-IDF)
Next, we want to represent each essay as a numerical vector, so that we can make calculations and comparisons with other essays later on. We do this by constructing an enormous matrix where the rows are essays and the columns are words (the aggregate column space is basically the entire vocabulary across all essays). Each cell represents the ‘importance’ of a particular word in a given essay. For instance, cell (247, 1928) refers to the importance of word #1928 in essay #247. Note that the ordering of words and essays has no meaning here. The ‘importance’ value can be a simple word count, but a more robust approach uses something called ‘term frequency-inverse document frequency’ (TF-IDF). This computes a normalized word count for each essay, but weights each word in inverse proportion to how often it occurs across the entire corpus. Picture an essay in your head. If you see a word that’s also widely used in every other essay, that word should be given less importance. On the other hand, if you see a word that’s unique to the essay you’re reading (i.e. rare across all essays), it should be given higher importance.
Topic Modeling (using NMF)
After vectorizing the essays, our matrix takes the form of the square grey box on the left in the graphic below (note: in reality it is not square-shaped at all; in our case the number of words far exceeds the number of essays).
It’s great that we’ve represented the essays as vectors, but in reality the matrix is highly sparse (i.e. mostly filled with zeros), so it’s still a little difficult to make meaningful calculations. Enter dimensionality reduction. The idea is to reduce the number of dimensions so that we can more easily perform operations on the data. On the plus side, it helps reduce overfitting, lowers computational costs, and mitigates the ‘curse of dimensionality’. On the flip side, we forgo some information, as we are essentially ‘throwing out’ columns of data. There are many techniques to do this; in this case, I chose non-negative matrix factorization (NMF). The basic premise of NMF is to deconstruct your original matrix into two separate matrices: a ‘long’ one and a ‘fat’ one. When you multiply the two together, you get a reconstructed matrix that is approximately equal to your original matrix. It’s called ‘non-negative’ because we don’t allow any negative values, which gives the benefit of interpretability, especially in the context of text analysis and ratings/reviews data (e.g. Netflix, Yelp). In our case, the ‘long’ one is our new dimensionality-reduced matrix, and the ‘fat’ one is a reference guide that contains semantic information (see below).
I added the bolded blue text on the side after analyzing the words, but beforehand, this matrix had no labels. Each row is in a sense a latent feature, and we can derive meaning by looking at the words in that row that have the highest value (i.e. words most associated with that particular row).
Essay Topic Distribution
Why was all that important? Well, I built a tool for AdmitSee where you can upload your essay, and it will tell you the topic distribution of your essay according to the seven topics the NMF algorithm learned. It also shows you the three essays most ‘similar’ to yours. How is similarity calculated? You can choose from two options: Euclidean distance on the topic distribution, or cosine similarity on the TF-IDF essay vector. I’ll leave the explanation of these concepts for another time/post…
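Both similarity options reduce to a couple of NumPy lines. This is a sketch under hypothetical variable names, not AdmitSee’s actual code:

```python
import numpy as np

def euclidean_dist(a, b):
    """Distance between two topic-distribution vectors (lower = more similar)."""
    return np.linalg.norm(a - b)

def cosine_sim(a, b):
    """Cosine similarity between two TF-IDF vectors (higher = more similar)."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical topic distributions over 3 topics.
uploaded = np.array([0.7, 0.2, 0.1])
corpus = np.array([[0.6, 0.3, 0.1],   # essay 0
                   [0.1, 0.1, 0.8]])  # essay 1

# Rank corpus essays by distance to the uploaded essay.
closest = min(range(len(corpus)), key=lambda i: euclidean_dist(uploaded, corpus[i]))
print(closest)  # → 0: essay 0 is nearer in topic space
```

Swapping `euclidean_dist` for `cosine_sim` (and `min` for `max`) gives the second option.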
The Landscape of College Essays
So we just took a deep dive into essays on the individual level, but what if we wanted to learn about the broader trends across colleges/universities? To do this, I first calculated the mean topic distribution for each school, resulting in a topic vs. school matrix. Then, I used PCA to reduce the dimensionality from 7 topics to 2 features, so that we could visualize the data in 2-D. Finally, I used k-means to cluster the schools into groups that can be heuristically supported. The result is as follows:
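That three-step pipeline, sketched on a made-up school-by-topic matrix (the real data is AdmitSee’s; the sizes and random distributions here are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical mean topic distributions: rows = schools, columns = 7 topics.
rng = np.random.default_rng(0)
school_topics = rng.dirichlet(np.ones(7), size=10)

# Step 1 (done above): each row already sums to 1 (a topic distribution).
# Step 2: 7 topics -> 2 features so the schools can be plotted in 2-D.
coords = PCA(n_components=2).fit_transform(school_topics)

# Step 3: group schools into clusters we can then label heuristically.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)
print(coords.shape, labels.shape)  # → (10, 2) (10,)
```

Plotting `coords` colored by `labels` produces a chart of the same shape as the one below.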
As one would expect, the liberal arts schools (e.g. Bowdoin, Middlebury, Wellesley) are clustered to the far left, nudged slightly below the horizontal axis. Schools with a STEM focus (e.g. CMU, MIT, Caltech) are clustered up top, unsurprisingly with the highest representation of science-related essays. Perhaps the most interesting insight to leave you with is that all the Ivy League schools are concentrated in the middle. This suggests that the top schools don’t look for one thing in particular, and reaffirms the general claim that they seek a diversified pool of student interests.
As exciting and rewarding as this project was, there is always room for improvement. To take this project to the next level, I would explore using Latent Dirichlet Allocation (LDA) for the topic modeling portion. LDA is a generative statistical model that assumes every essay has an underlying distribution of topics, and every topic has an underlying distribution of words. Recent literature has suggested that this probabilistic approach can yield better results, so it would be a natural next step to explore.
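As a pointer for that next step, scikit-learn ships an implementation. One practical detail worth noting: LDA models raw word counts rather than TF-IDF weights, so the vectorizer changes too. A sketch on a toy corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

essays = [
    "soccer team captain game",
    "science lab experiment data",
    "soccer game science data",
]

# LDA expects word COUNTS, so use CountVectorizer instead of TfidfVectorizer.
counts = CountVectorizer().fit_transform(essays)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is that essay's inferred distribution over topics (sums to 1),
# matching LDA's assumption that every essay mixes several topics.
print(doc_topics.shape)  # → (3, 2)
```

The topic-word side of the model lives in `lda.components_`, playing the same role as NMF’s ‘fat’ matrix.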
As AdmitSee continues to grow and collect more data, it would be interesting to see how the visualization of schools above differs between undergraduate and graduate essays, and to also look at trends over time (e.g. have certain schools shifted from more career-driven to more personality-driven essays?).