Why Automated Scoring of Student Essays Will Not Help Students
I am a writing professor who makes software. So when I saw a recent NPR story that quoted Peter Foltz, a professor at CU Boulder and a VP of Research at Pearson, talking with great optimism about using machine learning (or AI, as the media likes to call it) to score student essays, I read the story and wrote a response on Twitter. And that tweet took off.
I don’t expect Ari Shapiro to slide into my DMs anytime soon to invite me on the air. So let me offer a few words to expand on my key point, which is that no matter how good we get at automating summative scoring, it will do little to move the needle for student learning in writing.
The thing that really matters for helping students learn is what makes “grading” essays a challenge: offering the kind of quality formative feedback students need to improve.
I have now used two words — summative vs. formative — that don’t appear at all in the NPR story. And that’s part of the problem. You can think of summative evaluation as a final score. The time on the clock in a race, or the grade on an exam. Formative evaluation is feedback that describes and evaluates based on criteria, and makes suggestions for improvement. It is much more valuable and a lot harder to give (for humans and robots)!
But here’s the important thing: it is essential for student learning in writing.
So…my tweet to NPR was really not so much a critique of the central premise of the story, which is that robots are getting better at scoring essays after analyzing thousands and thousands (maybe millions?) of human-rated samples. Rather, it is a request to consider that machine scoring will, at best, do nothing to help students learn and, at worst, could shift the focus of students’ practice routines in ways that prepare them poorly for the kind of writing they will need to do in their lives. Critics of automated scoring worry that if we only train students to produce what a machine can reliably recognize, we reduce the goal of “good communication” to just a few traits.
What We Agree On: What Matters Most is If (and How) Students Revise
Before I answer the question that is the title of my piece, let me say that I suspect the title and the focus of the NPR piece on scoring essays are, more than anything, editorial decisions designed to attract some attention. I say this because I know that Professor Foltz understands the value of formative feedback quite well. I’ve read some of his work and found it very useful, in fact. In a paper he presented at the annual meeting of the American Educational Research Association, he summed things up quite nicely this way:
“Practice can help improve students’ writing skills, most particularly when the students are supported with frequent feedback and taught strategies for planning, revising and editing their compositions.”
Practice planning and revising matters most. Feedback that scaffolds that kind of practice is the most valuable kind of assessment. On this I (and, truly, most writing teachers everywhere) agree.
The really interesting thing that happens when teachers respond to student writing and when peers give one another helpful feedback is that learners get “revision fuel.” Without it, learning sputters and stalls. Even Foltz’s work has shown that when students revise enough (focusing on making substantial changes and not just editing grammar and spelling), they improve (see Foltz, Lochbaum, & Rosenstein, 2011).
So the name of the game for getting students to improve is to get them revising. It’s not easy work. It makes sense we might want to give the work to a robot to do. Foltz and Pearson clearly see a business opportunity there, which some worry will further devalue teachers’ labor and expertise. And there are a few other important cautions.
So What’s the Harm of Asking Robots to Grade? Two Cautions from Others…
There are a number of good reasons to be cautious about using automated scoring to grade student writing. Some of these are featured in the NPR article. Reporter Tovia Smith does a nice job featuring Les Perelman, an MIT professor who has spent his career demonstrating the fragility of automated scoring. He offers the best critique of the current state of the technology. For Perelman and others who have shown that the accuracy of machine scoring is a problem, caution is warranted because we simply don’t see the machine performing its assigned task very well. The false positive rate, in particular, is very, very high.
Another good reason to be cautious is that norming students using automated scoring tends to group them in ways that can be discriminatory. Whatever else a machine-learning approach may be good at, what it does well (on purpose or by accident) is sort students into groups. This can be a good thing if we are trying to use scoring to find students early enough in a semester to encourage changes in how they are practicing, to help them learn and improve. That would be using automated scoring in a formative way, you will note. Automated scoring for state tests, the focus of the NPR story, doesn’t do this. It puts students, classes, schools, and districts into categories based on scores. But it provides little or no feedback about how to improve. Worse, the categorization may drive policy choices (access to funding, assignment of emergency managers, evaluation of teachers) that exacerbate problems.
Put these first two cautions together and you should start to get nervous, no?
My own reason that we should exercise caution is related to these first two. But I want to close not with an argument about the (in)accuracy of machine learning techniques. Les Perelman does that quite effectively. Nor will I make the ethical case for responsible data stewardship. My colleague Chris Gilliard makes that case more eloquently than I do, and you should read his work on that. The case I will make is that if we want to improve student writing, we have something better to focus on than the drafts students produce. We should look at the way learners practice writing.
What is an Essay, Anyway?
The word essay means to “try.” An essay is an exercise. A workout. Whether assigned by a teacher or by the writer herself, we should understand it first and foremost as a verb — a thing someone is doing — rather than a noun, a text someone has written.
The analogy I want you to take up here is one that many of the folks in my Twitter feed invoked when they responded to my tweet: writing is an art that we improve with practice. Like dancing or playing music. An essay offers a chance for a learner to challenge their ability, to practice, by extending vocabulary, technique, and repertoire. In music, this might mean learning new scales, working on bowing or pizzicato, and putting these together in a new etude. In writing, this can mean adopting a new register (say, science writing) and finding the key moves that are essential to writing like a scientist, such as hedging: attenuating the strength of a claim to match the available evidence. The true success of an essay is determined by whether the writer emerges better than she was before!
The goal of practicing, of essaying, is not to produce a single, textually stable signal but to build a flexible repertoire of possible moves one might make, along with the judgement to use them. Over time, learners do this with confidence in novel situations. But it only comes with practice. Lots of it. With others playing along — reading, writing, responding.
With students doing more and more writing in online systems, there is no reason we need to be focused on analyzing their draft texts alone to understand how they are practicing. When we do, we are looking in the wrong place for evidence of good writing behavior. Producing a “clean text” is usually the result of a few predictable things: 1) not trying much that is new or difficult, 2) getting good feedback, and 3) lots of revision. Essays written for high-stakes testing situations tend to be bad indicators of these things due to the constraints we put on writers in the test situation. As a result, we screen out all of the most valuable writing behaviors we want to encourage in learners. We may even miscategorize high-value “trying” as error. As writing researcher Mina Shaughnessy famously showed, the “errors” we see in drafts may in fact be indicators of growth…of a student reaching for a new expression or move but not quite getting there.
If we want to help students learn, we should use digital tools to create spaces with rich feedback opportunities and conduct our analysis in a formative way, with sustained engagement and practice routines adjusted based on performance. In those spaces, teachers and other learners are assets whose input can be used not just by Pearson for machine learning, but by learners themselves in real time!
Right now, teachers and quite often other students are better than robots at seeing and understanding what learners “try” to do when they essay. And they are far better at helping them try again, in a revision, to do better the next time.
In digital environments, we can have groups of student learners helping one another, giving great feedback and using it to practice revision. That’s currently the best approach we have. And there’s good evidence to support it (see Kellogg & Whiteford, 2009). Writing is best learned in a studio environment with engaged writers learning from and helping one another. Technology has a role to play in this for sure. But it isn’t scoring essays.
It may be that machine learning technologies have a place in feedback-rich practice environments. Foltz thinks so. My colleagues and I do too. And that’s where our collective focus should be: helping learners practice in order to get better.
I am a Professor in the Department of Writing, Rhetoric, & American Cultures and an Associate Dean for Research and Graduate Education in the College of Arts & Letters at Michigan State University. I am a Co-Inventor of Eli Review, a Peer Learning Service. I also had a hand in making ML-driven software such as the Hedge-o-Matic and the Faciloscope, among others. Some of my research can be found on my Google Scholar profile.
Foltz, P. W., K. E. Lochbaum, and M. B. Rosenstein. “Analysis of student ELA writing performance for a large scale implementation of formative assessment.” In annual meeting of the National Council on Measurement in Education, New Orleans, LA. 2011.
Foltz, Peter W., Mark Rosenstein, Nicholas Dronen, and Scott Dooley. “Automated feedback in a large-scale implementation of a formative writing system: Implications for improving student writing.” In annual meeting of the American Educational Research Association, Philadelphia, PA, vol. 4, no. 4.25, pp. 4–5. 2014.
Kellogg, Ronald T., and Alison P. Whiteford. “Training advanced writing skills: The case for deliberate practice.” Educational Psychologist 44, no. 4 (2009): 250–266.