Building Better Writers With Machine Learning

Students in school today will graduate into a world in which written communication is more important than ever before. That’s true not only for knowledge workers, but for workers in skilled industries like manufacturing as well. For more and more employees, email communication and general competence in writing are critical to success on the job.

State and national education standards and curricula increasingly require significant writing components. As a result, many school districts are increasing their investment to accelerate writing.

The trouble is, there is no fast and easy way to assess student writing. You can’t improve what you can’t measure, and writing is far more difficult to measure than reading or math — multiple choice quizzes just can’t capture it.

For teachers this means mounting pressure to invest additional time into evaluating student writing, providing feedback and addressing weaknesses in the classroom.

Making Time For Teachers

If there’s one thing that teachers can never get enough of, it’s time.

When you increase the demands on student achievement in writing, you increase the demands on teachers’ time as well.

Already, teachers spend hours grading assignments, and as a result many are reluctant to give students as much guided practice on writing as they probably need. As one teacher told us, “grading is the most painful, horrible thing. Kids are not writing enough.”

Teachers spend hours grading papers.

When students do receive feedback, it’s most commonly in the form of a single letter grade, which often leaves students feeling deflated and without clear direction on how to improve. When more feedback is given, it is most often focused on basic grammar errors and spelling conventions, rather than advice on deeper organization and higher-order thinking.

Compounding the problem, grading takes time and thus feedback is often delayed — students may not hear anything for days or weeks after an assignment.

This is simply not what it takes to accelerate writing. Students would benefit from faster turnaround and more frequent, targeted guided practice.

One of the most important recommendations from the U.S. Department of Education is that teachers frequently monitor student progress so they can “provide timely feedback to students”. If we can find a way to automate written feedback on open-ended writing, then students can accelerate their learning to become better writers.

This is the challenge we are tackling at eSpark through our product Frontier.

How to Accelerate Student Writing

Imagine a world where students receive immediate feedback that gives them confident direction on how to improve — yet teachers spend less time grading than they do today. Imagine if, instead of cranking through dozens of papers to discern how their class is doing, teachers receive a daily report on the strengths and weaknesses of their students. Teachers could use the extra time saved to address individual needs of their students one on one.

Automated writing feedback would make this possible. And it’s coming.

There is ample demand for quick, automated assessment tools in the classroom. Among the most popular edtech products are tools like Quizlet and Kahoot, which give students opportunities for gameplay and quick feedback, provided the content can be assessed using multiple choice questions. Those products don’t provide much help for more open-ended content like free writing.

There has been some progress on automated scoring for free response text. Graduate-level tests (like the GRE) have had some form of automated scoring for years, but only recently have we seen products attempting to automate feedback in the lower grades. Amazon TenMarks, for example, released automated writing feedback last year, but sadly will not be around to see it to fruition. While some other companies, such as Turnitin and Gradescope, are also making progress in this area, most classroom writing instruction still requires teachers to laboriously score every response by hand.

TenMarks offers immediate writing feedback — but is shutting down next year.

We are building a tool that uses software to let teachers assess and give feedback to their students much more quickly than is currently possible.

Our Machine Learning Experiment

Is it possible to save teachers time and improve student writing at the same time? We’ve shown that it is, thanks to our team of experienced human graders. But this is only the first step of a two-phase strategy that leverages the rapid growth of the machine learning ecosystem.

In the past year we’ve been working to apply the tools of machine learning and natural language processing to automate open-ended writing assessment.

Our system provides feedback in two ways: directly to students via the product, and as a summary report to teachers. Human graders use an in-house tool to score student essays, and that feedback is rapidly routed to teachers via email reports. Teachers who use our product, Frontier, are therefore able to act quickly to praise good writing and intervene in areas of confusion.

Our strategy follows two stages:

  1. First, we build a system that provides high value to teachers but relies on a “human in the loop” to score essays efficiently.
  2. Then, we use the data that is generated to train a model to automate increasingly difficult parts of the writing feedback process, while still providing benefits for student learning in real time.

With this approach, we plan to provide early value to our users, while learning and iterating on a machine-driven approach over time as our dataset and domain knowledge grow.

Inputs and Outputs

Any machine learning model is only as good as the data that feeds it. Working backwards from the user needs, we needed to determine the inputs (features) and outputs (labels) that would closely represent what a real teacher would need in their classroom.

Much of the research in this space has focused on scores for summative assessments rather than direct use in the classroom for teachers and students. For example, a Kaggle competition dataset assigned a single score to each of 12,000 student essays. Automated essay scoring (AES) systems tend to provide only a high-level score as well.

This single-dimension score may work for a standardized test, but falls apart for our use case. Teachers and students need data that can direct instruction — and without knowing the specific area of the essay that was good or bad, there’s not much they can do with a single score. If a student receives an 8 out of 10, what do they do with that?

We need deeper understanding — for a given prediction, we want to know why the algorithm thinks the essay is good or bad and what the student can do to fix it.

Instead of a single score on one dimension, we created a series of binary classification questions about specific aspects of student writing. By splitting the problem up into many smaller pieces, we gain explainability — and on the technical side, we also get the ability to ship incrementally and to use different techniques on different parts of the problem.

Our rubric evaluates four categories of essay writing, each of which is broken down into multiple subcategories to make scoring more explainable:

  • Spelling and grammar errors (this is the easiest piece to evaluate; it’s based on the same rules as products like Grammarly or your standard word processor)
  • Purpose (here we try to determine whether a student has addressed the prompt or gone off topic)
  • Organization (we look for transitions between ideas, introductions and conclusions and a clear beginning, middle and end)
  • Evidence (we check to see whether students have included insight from the text in their essay — rather than just making things up)
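One way to picture a rubric built from binary questions is as a small data structure. The sketch below is illustrative only: the four category names come from the list above, but the individual question names, the `EssayScore` class, and the `needs_work` helper are hypothetical and are not taken from Frontier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Category names follow the rubric in the text; the yes/no questions under
# each category are made-up examples, not the real rubric.
RUBRIC: Dict[str, Tuple[str, ...]] = {
    "spelling_and_grammar": ("no_spelling_errors", "no_grammar_errors"),
    "purpose": ("addresses_prompt", "stays_on_topic"),
    "organization": ("has_introduction", "uses_transitions", "has_conclusion"),
    "evidence": ("cites_source_text",),
}


@dataclass
class EssayScore:
    """Binary answers for one essay, keyed by rubric question."""

    essay_id: str
    answers: Dict[str, bool] = field(default_factory=dict)

    def record(self, question: str, passed: bool) -> None:
        # Reject questions that are not part of the rubric.
        known = {q for questions in RUBRIC.values() for q in questions}
        if question not in known:
            raise KeyError(f"unknown rubric question: {question}")
        self.answers[question] = passed

    def needs_work(self) -> Dict[str, List[str]]:
        """Group the questions answered 'no' by rubric category."""
        result: Dict[str, List[str]] = {}
        for category, questions in RUBRIC.items():
            failed = [q for q in questions if self.answers.get(q) is False]
            if failed:
                result[category] = failed
        return result


score = EssayScore("essay-001")
score.record("addresses_prompt", True)
score.record("uses_transitions", False)
score.record("cites_source_text", False)
print(score.needs_work())
```

Because each answer is a separate yes/no label, a failed question points to a specific thing the student can fix, and each question could in principle be handled by a different technique.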

Some aspects of the grading rubric have straightforward automated solutions (we have had automated spelling and grammar checkers for years), whereas other aspects of great writing are hard, if not impossible, for computers to assess, such as whether an essay is organized in a way that flows well.

Putting People in the Loop

Human judgment is the linchpin of great writing assessment, but the more feedback we can process automatically, the more teachers can focus on tackling higher-level issues rather than mundane errors.

We send an email to teachers with a report of their scores. To generate that report, we have hired professional graders to score writing samples against our defined rubric. By itself, this has tremendous value, allowing teachers to get snapshots of student progress the same day as the assignment, without many hours of grading.
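As a rough sketch of how graders' per-essay scores might be rolled up into a class-level report, the example below counts how often each rubric question was passed. The function name `class_report`, the question names, and the 50% threshold are all assumptions made for illustration; the actual report format is not described here.

```python
from collections import Counter
from typing import Dict, List


def class_report(
    scores: List[Dict[str, bool]], threshold: float = 0.5
) -> Dict[str, List[str]]:
    """Split rubric questions into class strengths and weaknesses.

    `scores` holds one dict per essay, mapping a rubric question to
    pass/fail. A question is listed as a strength when at least
    `threshold` of the essays scored on it passed (hypothetical rule).
    """
    passed: Counter = Counter()
    total: Counter = Counter()
    for essay in scores:
        for question, ok in essay.items():
            total[question] += 1
            passed[question] += int(ok)

    report: Dict[str, List[str]] = {"strengths": [], "weaknesses": []}
    for question in sorted(total):
        rate = passed[question] / total[question]
        key = "strengths" if rate >= threshold else "weaknesses"
        report[key].append(
            f"{question}: {passed[question]}/{total[question]} passed"
        )
    return report


# Hypothetical scores for three essays from one class.
essays = [
    {"addresses_prompt": True, "uses_transitions": False},
    {"addresses_prompt": True, "uses_transitions": False},
    {"addresses_prompt": False, "uses_transitions": True},
]
report = class_report(essays)
print(report)
```

A summary like this is what lets a teacher see at a glance which skill to reteach, without reading every essay first.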

But for our long term strategy, the human grading has another role to play: building the dataset. Over time, as human graders work through student essays, we can feed that data into machine learning models.

We started on this project basically from scratch, as nobody on the team has a strong background in machine learning. One key challenge we have tackled has been defining the tools and processes to ensure a reliable dataset. How much data do we need? And how do we make sure it is reliable?

Our review of published research suggests that we should achieve good results with more than a few hundred labeled data points for each prompt, which could mean anywhere from 5,000 to 50,000 labeled data points. But we don’t really know how much we will need until we collect a decent amount and see how it performs.

First we need an initial dataset. After an initial prototype using Google Sheets, we tried out Prodigy, a Python package designed for rapid annotation of machine learning datasets. Prodigy asks annotators only one question at a time, offers a simple interface for quick scoring, and incorporates active learning so that annotators score the most ambiguous data points to train the model more quickly.
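The idea of sending annotators the most ambiguous data points can be sketched with plain uncertainty sampling: for a binary question, the essays whose predicted probability sits closest to 0.5 are the ones the model is least sure about. This is a generic illustration, not Prodigy's API or its internal implementation; the function name and the probabilities are hypothetical.

```python
from typing import List, Tuple


def most_ambiguous(predictions: List[Tuple[str, float]], k: int) -> List[str]:
    """Return the ids of the k essays the model is least sure about.

    `predictions` pairs an essay id with the model's predicted probability
    that the answer to one binary rubric question is "yes".
    """
    # A probability near 0.5 means the model cannot tell yes from no.
    ranked = sorted(predictions, key=lambda item: abs(item[1] - 0.5))
    return [essay_id for essay_id, _ in ranked[:k]]


# Hypothetical model outputs for four unlabeled essays.
predictions = [
    ("essay-a", 0.97),
    ("essay-b", 0.52),
    ("essay-c", 0.10),
    ("essay-d", 0.41),
]
queue = most_ambiguous(predictions, k=2)
print(queue)
```

Labeling the uncertain essays first tends to teach the model more per annotation than labeling essays it already classifies confidently.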

This is an example of our first scoring interface for student writing. We asked graders one question at a time. We later built our own tool that provided more context to graders.

However, while Prodigy is great for simple decisions that can be made without much context — like whether an image is a cat or dog — it doesn’t work as well when the questions are more complex. Our dataset requires more context, such as the prompt that the student received, or the raw text from sources the student should have read. These are hard to provide in a constrained interface such as that offered by Prodigy.

We built an in-house annotation tool that gives graders all the context they need to fully evaluate an anonymized student essay; their scores are stored alongside our application data so that we can easily incorporate them into our automated emails and even the student experience.

Our in-house tool, called Predictor, gives graders full context on each student submission, and allows graders to submit notes to help with quality control.

So far we’ve used this tool to collect scores for about 3,000 essays. We’re not yet at our goal of 5,000+, but it’s enough to get started. We also feel confident that our data will be strong thanks to our decision to use very reliable graders with domain-specific knowledge of student writing, including professional graders and our internal learning design staff.

Now it’s time to grow the dataset to meet the needs of our algorithms. Gathering and labeling that many data points is a huge undertaking, but we’re taking it on for good reason.

First and foremost, we are uniquely positioned to gather this data while fulfilling an immediate need within the education system. Our human-powered tool currently provides teachers with such valuable insight while saving so much time that we’ve been told it feels like magic.

This tells us that there is a real need to provide this type of support for writing assessment at a far greater scale — the kind of scale that makes machine learning truly worth the investment.

With humans in the loop, real-time feedback is out of reach, and there are significant costs associated with scaling. Over time, as we figure out how to automate more and more parts of the assessments with machine learning, we’ll be able to make the whole process faster, more efficient and potentially even more accurate.

A second motivator is that we have the opportunity to apply machine learning techniques to a meaningful real-world problem. Our team is fortunate to have the ability to collect our own data, build a machine learning protocol and actually put it all into action.

If you’re familiar with machine learning, you’ll likely know that this is an uncommon opportunity in the field, which is still predominantly studied in academic settings, often using the same existing datasets.

This is only the beginning. We have much more work to do to fully automate feedback for students, and we continue to research emerging techniques for text classification, including word vectors, deep learning, and new libraries and models as they are released.

We’re also looking to hire or collaborate with people who have expertise in the field. If this sounds like an interesting challenge to you, drop us a line.