A Machine Learning Approach to Understanding Teacher Evaluations

Kelly Gola · Published in techburst · 5 min read · Oct 27, 2017

Whew! Where do I start? Anyone who's ever taught a college class is familiar with that feeling. That feeling when, about two weeks after courses are over (after grades are submitted and the office-hour bargaining has ended), an email from the department appears in your inbox telling you that… dun dun dun… evaluations are ready. Echoes of "but I worked so hard!", "it wasn't on the study guide", "I didn't understand the question", "you just don't like creationists" come roaring back. Memories of communal groaning and outrage over the length of the papers flash before the poor instructor's eyes, and if they are wise, they settle in with a glass of wine and brace themselves.

Controlled experiments have shown that when it comes to teacher evaluations, objectivity isn't always on the table. Teacher evaluations have been shown to be biased by gender, personality, and ethnicity. That is not to say that students are not concerned with learning. Learning outcomes often do predict teacher evaluations; however, evaluations vary according to the learning outcome measured. For instance, if the learning outcome is a great grade on the final (pejoratively known as "teaching to the test"), then evaluations tend to be good. But if the learning outcome is measured by performance in subsequent courses, effective teachers are actually rated lower.

That's a lot to nosh on, right? But what does that say about what students actually privilege in their evaluations? Course evaluations are often constrained by what the department or institution chooses to evaluate, not by what students feel is important. To answer this question, I turned to the internet.

For this project, I used BeautifulSoup to scrape over 13k pages from ratemyprofessors.com. Unlike department ratings, ratemyprofessors.com allows students to publicly rate professors on metrics such as difficulty of the class and "hotness" (that's right, how hot is the prof?), among other attributes. Prospective students often use these ratings to pick their classes.
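To give a flavor of that step, here's a minimal scraping sketch. The URL structure and CSS class names are placeholders for illustration only, not the site's actual markup, which changes over time.

```python
# A minimal scraping sketch; the selectors below are placeholders, not
# RateMyProfessors' real markup.
import requests
from bs4 import BeautifulSoup

def scrape_professor_page(url):
    """Pull the overall quality, difficulty, and narrative reviews from one professor page."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    quality = soup.find("div", class_="overall-quality")       # placeholder selector
    difficulty = soup.find("div", class_="difficulty-rating")  # placeholder selector
    reviews = [p.get_text(strip=True) for p in soup.find_all("p", class_="comment")]

    return {
        "quality": quality.get_text(strip=True) if quality else None,
        "difficulty": difficulty.get_text(strip=True) if difficulty else None,
        "reviews": reviews,
    }
```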

...so it’s like Waze for learning.

Analysis

Once I scraped the data, it was time to engineer some features. The prepackaged variables scraped directly from the site were:

The outcome variable: Overall Quality of the Professor (5-point scale; this rating is independent of the other ratings on the site).

And the features: a class difficulty rating (5-point scale), endorsement tags of attributes (e.g., funny, boring), hotness (yep, they get a chili pepper if the student thinks they're attractive), final grades, the number of ratings a professor received, narrative evaluations, location (city and state), and academic subject.

These were the features I started with. Because a professor's gender has been shown to influence teacher evaluations but was not available on the website, I gathered data from the Social Security Administration on the gender associated with popular first names dating back to the 1930s and assigned a gender to each professor according to their first name. For reference, this data can be found here.
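The assignment itself boils down to a name lookup. Here's a sketch, assuming the SSA's yearly baby-name files ("yobYYYY.txt", with name,sex,count rows) have been downloaded to a local "names/" folder; the professors DataFrame at the end is hypothetical.

```python
# A sketch of the name-to-gender assignment from the SSA baby-name files.
import glob
import pandas as pd

frames = [pd.read_csv(path, names=["name", "sex", "count"])
          for path in sorted(glob.glob("names/yob*.txt"))]
names = pd.concat(frames, ignore_index=True)

# For each first name, keep whichever sex accounts for more births overall.
totals = names.groupby(["name", "sex"])["count"].sum().unstack(fill_value=0)
name_to_gender = (totals["F"] > totals["M"]).map({True: "female", False: "male"})

# Hypothetical usage, where professors["first_name"] holds parsed first names:
# professors["gender"] = professors["first_name"].map(name_to_gender)
```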

To begin, I collapsed course subjects into their respective disciplines/departments. (This was a best-guess approach for classes like "Decision Heritage" and other courses that could be cross-referenced.) Endorsement tags were averaged by the number of ratings each professor received, and US states were collapsed into US regions according to the US Census Bureau. This ultimately reduced the number of features from 512 to 54 and, in turn, reduced the potential for overfitting my model.
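A sketch of that collapsing step on toy rows is below; the real mappings cover every subject, all 50 states, and the full set of endorsement tags.

```python
# A sketch of the collapsing/averaging step on toy data.
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "NY", "TX"],
    "subject": ["Calculus", "Organic Chemistry", "Calculus"],
    "num_ratings": [10, 4, 8],
    "tag_funny": [7, 1, 3],
    "tag_respected": [9, 2, 6],
})

state_to_region = {"CA": "West", "NY": "Northeast", "TX": "South", "IL": "Midwest"}  # abbreviated
subject_to_dept = {"Calculus": "Math", "Organic Chemistry": "Physical Sciences"}     # abbreviated

df["region"] = df["state"].map(state_to_region)
df["department"] = df["subject"].map(subject_to_dept)

# Average endorsement-tag counts by the number of ratings each professor received.
tag_cols = [c for c in df.columns if c.startswith("tag_")]
df[tag_cols] = df[tag_cols].div(df["num_ratings"], axis=0)
```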

Exploratory analyses showed that there are approximately twice as many male professors as female professors, that professors are generally not hot (18%… sorry, guys), and that the most commonly reviewed disciplines, in descending order, were political science, foreign languages, math, behavioral sciences, and physics and engineering. The chart below provides a breakdown of courses taught by men versus women.
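Those summaries come from one-liners like the following, assuming the assembled dataset lives in a DataFrame called data; the column names here are mine, not the site's.

```python
# Hypothetical column names; `data` stands for the assembled dataset.
print(data["gender"].value_counts(normalize=True))   # roughly two male professors per female
print(data["hot"].mean())                            # about 18% have a chili pepper
print(data["department"].value_counts().head(5))     # most frequently reviewed disciplines
```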

In order to make use of the narrative reviews, I performed latent semantic analysis by vectorizing the text with TF-IDF in scikit-learn, which weights terms according to how frequently they appear in a given document (in this case, a document is the set of narrative evaluations for a single professor), adjusted for how common the words are across all documents. From there, I extracted components from the term matrix using Singular Value Decomposition (SVD), which transforms the matrix into a lower-dimensional, interpretable semantic space. In this case, I chose to extract 100 components from my matrix. This was an exploratory decision; increasing the number of components did not improve my model and slowed down processing.
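Here's a sketch of that LSA step with scikit-learn. In the project each "document" is the concatenated narrative reviews for one professor; the toy strings below just make the snippet runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "great lectures and fair exams, really respected and funny",
    "so much intensive writing, harsh grading, boring lectures",
    "online class was disorganized, hard to reach the professor",
    "clear explanations of the math, generous with partial credit",
]

tfidf = TfidfVectorizer(stop_words="english")
term_matrix = tfidf.fit_transform(docs)          # documents x terms, TF-IDF weighted

# 100 components in the real data; the toy corpus can only support a few.
svd = TruncatedSVD(n_components=3, random_state=42)
lsa_components = svd.fit_transform(term_matrix)  # documents x semantic components
```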

Once my features were combined (SVD components and ratemyprofessors metrics), I split my data into training and test sets and ran a series of linear and decision-tree models to determine which best predicted professor ratings.
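A sketch of that comparison is below. Synthetic data stands in for the real feature matrix and ratings, and the particular set of models is illustrative rather than the exact list I ran.

```python
# Compare a few linear and tree-based regressors on a held-out test set.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=54, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "gradient boosting": GradientBoostingRegressor(random_state=42),
    "random forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 = {model.score(X_test, y_test):.2f}")  # variance explained on held-out data
```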

Ridge and Gradient Boosted Tree models explained the most variance (75% and 74%, respectively). A closer look at my data revealed multicollinearity among features (see below), which may explain why Ridge did so well. When multicollinearity exists, the standard errors of the regression weights are inflated and the model is highly sensitive to new data points. Ridge addresses this by adding a penalty term scaled by a parameter lambda, which allows tuning of the bias-variance tradeoff.
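Continuing from the split in the previous sketch, here's one way to eyeball the multicollinearity and tune the penalty strength (lambda, called alpha in scikit-learn) by cross-validation; the 0.8 correlation cutoff is an arbitrary illustration.

```python
# Continues from the train/test split above.
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

# Count strongly correlated feature pairs as a rough multicollinearity check.
corr = pd.DataFrame(X_train).corr().abs().values
n_high = int((np.triu(corr, k=1) > 0.8).sum())
print(f"{n_high} feature pairs with |r| > 0.8")

# Tune lambda (alpha) over several orders of magnitude by cross-validation.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25))
ridge.fit(X_train, y_train)
print("best alpha:", ridge.alpha_, "test R^2:", ridge.score(X_test, y_test))
```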

In addition to performing well, using Ridge allowed me to explore the nature of the relationships between my features and the outcome, which, again, was the purpose of this analysis.

The SVD-derived components had the highest coefficients. Specifically, reviews that mentioned math and intensive writing were associated with lower professor ratings. Consistent with academic studies, grades also mattered: the higher the grade, the better the rating. Mentions of online classes were associated with lower ratings, and professors with more "Respected" endorsements had higher ratings. Fortunately, hotness and gender did not contribute much to the model.
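Reading off those relationships amounts to ranking the fitted coefficients by magnitude. A sketch, continuing from the RidgeCV model above; feature_names is a stand-in for the real column labels (e.g., SVD component indices, grade, endorsement tags).

```python
# Rank Ridge coefficients by absolute size to see which features drive ratings.
import pandas as pd

feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]  # stand-in labels
coefs = pd.Series(ridge.coef_, index=feature_names)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))
```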

So that's it, data science community. If you're thinking about turning your love of data science into a college teaching career, you may want to learn to "teach to the test". An important caveat, though, is that this is a self-selected sample that may not describe the broader student population. For the sake of data science, let's hope so.
