The Average Student Does Not Exist

A look into student individuality in higher education

Liz Carlson
Gradescope Blog

--

In high-stakes testing, a student’s performance is most commonly judged by how their total score compares with the average. Data from roughly 1,500 computer science finals graded with Gradescope suggest this may be an ineffective way to characterize student performance.

In the late 1940s, the US Air Force had a problem. Its pilots were crashing their warplanes too often. After ruling out pilot error and faulty mechanics, the main hypothesis became that the average American pilot had outgrown the cockpit, which had been designed in the 1920s.

In 1950, officials commissioned a new study to measure 140 dimensions of the human body to determine the new “average pilot.” Over 4,000 young pilots had their height, chest circumference, and other measurements taken for this endeavor. The “averagarian” thinking at the time was that a majority of pilots would measure near the average on most dimensions.

One researcher doubted this approach. Lt. Gilbert S. Daniels calculated the average of 10 physical dimensions believed to be most relevant for cockpit design and determined how many pilots measured near the average for all dimensions. Daniels himself was stunned by the actual number.

Zero. Out of 4,063 pilots, not a single one fell within the average 30 percent on all 10 dimensions.

Harvard Professor Todd Rose’s book, The End of Average, which features this story, debunks the idea that determining the average amongst a group of people will provide universal insight. As he puts it, “If you’ve designed a cockpit to fit the average pilot, you’ve actually designed it to fit no one.”

Rose argues instead that most human characteristics, from size to intelligence, consist of multiple dimensions that are weakly related to one another, if at all. He calls this principle “jaggedness.”

We sought to determine whether student performance, like pilot size, is “jagged.”

Our team analyzed the results from a past final exam taken by 1506 students in John DeNero’s UC Berkeley Computer Science 61A course. It consisted of 7 questions, 26 subquestions, and 154 rubric items*, with a mean score of 46 out of 80 total points.

Do “average” students exist at a question level?

We wanted to find out whether or not students were likely to score among the average across multiple questions on the exam.

Out of 1506 exam submissions, only one student scored within the average 20 percent on all 7 questions. Furthermore, only 60 students (fewer than 1 in 25) scored near the average on 5 or more questions. In fact, 365 students, or nearly 25 percent, did not score within the average range on a single question.

We calculated whether or not students scored within the average 20% (+/-10% of the mean score) for each question.
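The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function name, the tiny data set, and the point values are all hypothetical. It also assumes that "+/-10%" means 10 percent of a question's maximum points on either side of the class mean, which is consistent with the 38/80 to 53.5/80 range quoted for average total scores.

```python
from statistics import mean


def near_average_counts(scores, max_points, band=0.10):
    """For each student, count the questions on which their score lies
    within +/- band * max_points of that question's class mean.

    Assumption: the band is measured in units of the question's maximum
    points, not as a fraction of the mean itself.
    """
    num_questions = len(max_points)
    # Class mean for each question.
    question_means = [
        mean(student[q] for student in scores) for q in range(num_questions)
    ]
    counts = []
    for student in scores:
        hits = sum(
            abs(student[q] - question_means[q]) <= band * max_points[q]
            for q in range(num_questions)
        )
        counts.append(hits)
    return counts


# Hypothetical data: 4 students x 3 questions, each question worth 10 points.
scores = [
    [5, 9, 2],
    [5.5, 3, 8],
    [2, 6.5, 5],
    [7.5, 5.5, 5],
]
max_points = [10, 10, 10]

counts = near_average_counts(scores, max_points)
# Students who are near the average on every question, and on none.
average_on_all = sum(c == len(max_points) for c in counts)
average_on_none = sum(c == 0 for c in counts)
```

In this toy data set no student is near the average on all three questions, even though every question has students close to its mean.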

Even among students with average total scores, ranging from 38/80 to 53.5/80, no fewer than 14 students scored outside the average 20 percent on every one of the 7 individual questions.

While a polarized distribution of overall scores could explain why few students scored near the average on multiple questions, that was not the case. The score distribution was unimodal and the standard deviation was 17 points.

There is no average student.

For example, we looked at two students who both earned 47.5/80 (the median score) and determined that despite having average final grades, each was an “A” through “D” student on individual questions. Furthermore, their question-level performance differed widely from each other, with a discrepancy of 25 points across their 7 question scores.

A question-level score comparison of two students who both earned 47.5 total points on the exam. The line graph represents their percentage score per question.

What’s really going on inside the “average”?

We looked at the discrepancies between students amongst the 154 rubric items on the exam to determine if “average” students understood the same things.

For context, Gradescope is a rubric-based online grading tool for assignments submitted either on paper or online. For each question, instructors create a set of rubric items, each of which consists of a point value and a description. Instructors and TAs can grade in parallel and remain consistent by grading students’ answers with the same rubric.

Example: Grading answers on Gradescope with a rubric
Rubrics can be built and modified as grading proceeds, and scoring can be either negative or positive.

Example of a student’s submission, zoomed into their answer for Q1.1 (left) and a positively scored question rubric for Q1.1 (right). Click “Next Ungraded” to use the same rubric to grade the next student’s Q1.1 answer.

We considered the middle 20 percent of students, corresponding to students who scored 44.0 through 52.0 total points on the exam. For each pair of students in this group, we measured their discrepancy by tallying how many rubric items were applied to one student but not the other.
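The pairwise tally described above amounts to the size of the symmetric difference between two students' sets of applied rubric items. The following is a rough sketch under that reading, not the study's actual code: the student ids, rubric item ids, and rubric size are made up for illustration.

```python
from itertools import combinations


def discrepancy(items_a, items_b):
    """Number of rubric items applied to exactly one of the two students."""
    return len(set(items_a) ^ set(items_b))


def pairwise_discrepancies(applied):
    """Map each unordered pair of student ids to its discrepancy count."""
    return {
        (a, b): discrepancy(applied[a], applied[b])
        for a, b in combinations(sorted(applied), 2)
    }


# Hypothetical data: rubric item ids applied to each student's exam.
applied = {
    "s1": {"1.1a", "1.1b", "2.1a", "3.2c"},
    "s2": {"1.1a", "2.1a", "2.2b", "3.2c"},
    "s3": {"1.1b", "3.1a", "3.2c"},
}
total_items = 8  # hypothetical rubric size (the exam in the article had 154)

pairs = pairwise_discrepancies(applied)
most_similar = min(pairs, key=pairs.get)
# Express each discrepancy as a fraction of all rubric items.
fractions = {pair: count / total_items for pair, count in pairs.items()}
```

Dividing a pair's count by the total number of rubric items gives the percentage figures used in this article; for example, 67 discrepancies out of 154 rubric items is about 43.5 percent.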

When you hear that two students both scored close to the average on the final exam, you might reasonably assume that they have a similar understanding of the material. We were amazed to discover that among the 308 students with average final scores, even the most similar pair had a discrepancy of nearly 10 percent of all rubric item applications. In fact, we found the discrepancy among average-scoring students could be over 40 percent — a truly significant difference in exactly what each student learned.

Two students: Same score, significantly different understanding

We looked at two students who both earned 51.5 out of 80 points on the exam. Despite earning an identical score, they had 67 rubric item discrepancies between them, or nearly 44 percent of all 154 rubric items. They essentially shared an understanding of only about half the material. In fact, each of them had more rubric items in common with the top scorer than with the other.

There were 67 rubric item discrepancies — in which a rubric item was applied to one student and not the other — between two students with the same score.

Our data shows high variability of understanding among students with average scores, yet the one-dimensional traditional grading system judges them the same. Rose specifically advocates for replacing grades — which fail to capture students’ strengths and weaknesses — with competency scores.

If a pilot earned an 80% on their flight exam, you would hope they didn’t miss the 20% about landing.

Helping instructors see beyond the average

At Gradescope, we recently released a Question and Rubric Statistics feature, which allows instructors to see how students performed at a more detailed level. Instructors are able to tag questions with concepts and then view statistics for each concept.

Instructors can tag questions with learning concepts and then view class performance by learning concept, like in the example above.
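To make the idea of concept-level statistics concrete, here is a small sketch of how per-concept class performance could be computed from tagged questions. It is purely illustrative and is not Gradescope's implementation or API: the function, the concept tags, and the scores are hypothetical, and it assumes a question's points count fully toward every concept it is tagged with.

```python
from collections import defaultdict


def concept_performance(question_concepts, max_points, scores):
    """Return, per concept, the fraction of available points the class earned.

    Assumption: a question tagged with several concepts contributes its
    full earned and available points to each of them.
    """
    earned = defaultdict(float)
    available = defaultdict(float)
    for student_scores in scores:
        for question, points in student_scores.items():
            for concept in question_concepts[question]:
                earned[concept] += points
                available[concept] += max_points[question]
    return {concept: earned[concept] / available[concept] for concept in available}


# Hypothetical tags, point values, and scores for two students.
question_concepts = {
    "Q1": ["recursion"],
    "Q2": ["recursion", "trees"],
    "Q3": ["iterators"],
}
max_points = {"Q1": 10, "Q2": 10, "Q3": 5}
scores = [
    {"Q1": 8, "Q2": 4, "Q3": 5},
    {"Q1": 6, "Q2": 2, "Q3": 1},
]

by_concept = concept_performance(question_concepts, max_points, scores)
# The concept with the lowest class-wide fraction of points earned.
weakest = min(by_concept, key=by_concept.get)
```

A view like this surfaces the concept the class as a whole struggled with, which a single average total score would hide.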

Our intention with this feature is to give instructors a better understanding of which questions or learning concepts students, in general, might have struggled with, to inform future lessons and assignments. (Was a question written in a confusing way? Does more time in class or office hours need to be devoted to teaching a specific concept?)

Our next challenge will be finding ways to help busy educators track individual progress. That way, they can better help each student with the exact concepts that student needs help with at the right moment, rather than just teaching to the “average student,” who does not actually exist.

*For our study, we removed redundant rubric items from our calculations. For instance, we eliminated the 0-point rubric item “Blank/Incorrect,” since it was the equivalent of the absence of any applied positive rubric items.

Learn more about Gradescope, and let us know your thoughts about our study below.

--

Liz Carlson
Gradescope Blog

Content @Gradescope. Interested in STEM education & how AI can improve learning outcomes.