AI grades AI
by Pieter Abbeel
Grading is about carefully teasing apart student answers to assess and give feedback on their learning. While crucial, it’s unfortunately one of the least fun instructional responsibilities. Grading fairly and consistently can be extremely time-consuming. I have long hoped that artificial intelligence would speed up grading, in part because I teach the undergraduate AI class at Berkeley. This spring, it finally happened!
We piloted an early alpha version of Gradescope’s AI-boosted grading feature for our CS 188 final exam and cut grading time by 75%.
Here is a quick summary of our exam day:
- 8:00 am to 11:00 am: Students take the exam
- 12:00 pm to 1:00 pm: We feed the 632 exams (18 pages each) through scanners. Today’s scanners are amazingly fast; just don’t forget to cut off the stapled corners. Gradescope facilitates batch uploading and easy matching of uploads with students.
- 3:00 pm to 7:00 pm: Grading
- 7:00 pm: Done grading. Time for dinner and drinks to celebrate the end of the semester. Normally we’d be ordering pizza about now to fuel us through several more hours of grading.
How AI graded our multiple-choice problems
On to the AI! The AI analyzes student answers and groups them based on similarity. Here is a screenshot of the kind of grouping it provides:
For the multiple-choice questions, the AI has been trained on tens of thousands of past student answers, and it managed to understand some pretty complicated things.
For example, some students mark a choice by filling in the bubble, others with a check mark, and yet others by crossing it off; the AI has no problem with any of these. As another example, when one marked bubble has a cross through it and another bubble is marked, it knows that the second bubble is the one actually selected.
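To make the crossed-out-bubble case concrete, here is a minimal rule-based sketch in Python. It is only an illustration: the AI described here is trained on past answers and is not a hand-written rule, and the `Bubble` record with its `marked` and `struck_out` flags is a hypothetical data model, assumed to come from some upstream mark detector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bubble:
    """Hypothetical per-option record produced by a mark detector."""
    label: str        # option label, e.g. "A"
    marked: bool      # the student put some mark on this bubble
    struck_out: bool  # the student then crossed that mark out to retract it


def selected_options(bubbles: List[Bubble]) -> List[str]:
    """Return the labels that count as selected: marked and not retracted."""
    return [b.label for b in bubbles if b.marked and not b.struck_out]


# One bubble marked and then struck out, a second bubble marked: only the second counts.
example = [
    Bubble("A", marked=True, struck_out=True),
    Bubble("B", marked=True, struck_out=False),
    Bubble("C", marked=False, struck_out=False),
]
chosen = selected_options(example)
```

The hard part in practice is deciding from the scanned image whether a cross is the student's way of selecting a bubble or a retraction of an earlier mark, which is the kind of judgment the trained model makes.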
The AI is very good, but not perfect, so our task was to verify its grouping of student answers. On a typical question, the AI made about 20 mistakes out of 600+ answers (roughly 3%), and we manually moved those answers into their appropriate groups. This is super fast to do, because the student answers are ranked by group-membership confidence.
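The idea of ranking by confidence can be sketched in a few lines of Python. This is an assumption-laden illustration, not Gradescope's implementation: the `(answer_id, group, confidence)` tuples and the function name `rank_within_groups` are hypothetical, and the sketch assumes the least confident answers are shown first so that likely mistakes surface at the top of each group.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical prediction record: (answer_id, predicted_group, confidence in [0, 1])
Prediction = Tuple[str, str, float]


def rank_within_groups(predictions: List[Prediction]) -> Dict[str, List[str]]:
    """Bucket answers by predicted group, least confident first within each group."""
    buckets = defaultdict(list)
    for answer_id, group, confidence in predictions:
        buckets[group].append((confidence, answer_id))
    # Sorting (confidence, answer_id) pairs puts the lowest confidence first.
    return {
        group: [answer_id for _, answer_id in sorted(items)]
        for group, items in buckets.items()
    }


predictions = [
    ("s1", "A", 0.99),
    ("s2", "A", 0.55),
    ("s3", "B", 0.80),
    ("s4", "A", 0.70),
]
queues = rank_within_groups(predictions)
# queues["A"] starts with "s2", the answer the model was least sure about.
```

With an ordering like this, a reviewer only needs to look closely at the first few answers in each group; once the answers are clearly correct, the rest can be skimmed.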
You might ask, ‘Why not use Scantron for multiple choice?’ The full story is for another blog post, but quickly, a couple of things: I really like students to be able to answer on the exam pages themselves, and I like the flexibility of assigning partial credit.
Clustering similar answers to grade faster
For non-multiple-choice questions, the AI was still in training (as of May 12, 2016), but we were able to use the same interface the AI presents to save significant time. Here’s a screenshot of a free-form question:
Here, we manually group students’ answers into clusters that need the same feedback. Once the grouping is done, only one representative of each group needs to be graded.
For example, for the extra-credit question in the screenshot above, we graded three groups instead of 600 individual submissions (everyone got a point, of course).
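The time savings come from grading once per group rather than once per student. A minimal Python sketch of that bookkeeping follows; the function `propagate_grades`, the group names, and the student IDs are all made up for illustration and do not reflect how Gradescope stores grades.

```python
from typing import Dict, List, Tuple

Grade = Tuple[float, str]  # (score, feedback)


def propagate_grades(
    groups: Dict[str, List[str]],
    group_grades: Dict[str, Grade],
) -> Dict[str, Grade]:
    """Copy the grade given to each group's representative to every member."""
    results = {}
    for group_name, student_ids in groups.items():
        grade = group_grades[group_name]  # graded once, for the whole group
        for student_id in student_ids:
            results[student_id] = grade
    return results


# Hypothetical grouping of an extra-credit question into three clusters.
groups = {
    "drawing": ["s1", "s2", "s3"],
    "joke": ["s4", "s5"],
    "blank": ["s6"],
}
group_grades = {
    "drawing": (1.0, "Nice drawing!"),
    "joke": (1.0, "Good one."),
    "blank": (1.0, "Free point."),
}
grades = propagate_grades(groups, group_grades)
```

Here three grading decisions cover six students; with 600 or more submissions falling into a handful of groups, the same loop is what turns hours of grading into minutes.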
Before the AI / grouping engine, Gradescope had already cut my staff’s grading times in half. Now, for questions with manual grouping, I estimate it cuts grading times by 70%, and where the AI is already well trained, by 90% or more.
I asked my fellow CS 188 instructor and teaching assistants for their quick reactions after our exam:
“It was great that it identified each student’s style of marking an option.”
- Professor Anca Dragan
“The interface for manual grouping, where we would see thumbnail images of student questions consolidated onto one page and determine which cluster each answer belonged to, was much faster than any method of grading I’ve used before.”
- TA Davis Foote
“I can’t even imagine grading the old-fashioned way.”
- TA Greg Kahn