How to adapt static models to measure continuous learning in ed tech
When students use eSpark, they expect to receive the highest-quality materials, personalized to their learning needs. But what does highest quality mean? At eSpark, we start with expert teacher insight, and then improve and validate with hard data to define the best learning resources for each skill. In this post, we discuss Hobbes, a major component of our learning system that lets us link educational outcomes with the resources we curate, which in turn enables us to iteratively build a world-class curriculum.
eSpark curriculum covers pre-K to eighth grade in both reading and math. For the sake of simplicity, in this post we’ll look closely at our work on literacy in early grades. For example, for students to gain the prerequisites necessary for reading in first grade, they need to learn a variety of pre-reading skills such as distinguishing long from short vowel sounds in spoken single-syllable words. Although as adults we may take this skill for granted, it can be quite complex for young students as they have to master the ability to recognize specific vowel sounds, as well as the ability to distinguish the characteristics of these sounds.
Our expert learning designers scour the iPad App Store in search of the best apps for students who need additional practice in each skill, and then they use data to help them quantify the impact these activities have had on student learning. The App Store does not have metrics that signify educational quality, and the metrics that do exist — app reviews, for example — have little to do with it. Instead, we look to real data about how students perform when using an app to better understand how it works in classrooms.
We wanted to develop a metric of student learning outcomes that was consistent, precise, and available in real time to guide curriculum development. We initially relied upon standardized tests, such as Northwest Evaluation Association (NWEA) MAP tests, to inform our understanding of curriculum. However, while this data is very useful in shaping curriculum, it has two major limitations. First, standardized tests vary across districts and thus can introduce bias if we compare them nationally. Second, standardized tests typically only report data once or twice a year, which inhibits continuous improvement through data.
We evaluated numerous psychometric item response models developed over the last few decades to measure latent variables such as learning. Unfortunately, traditional item response models were designed to take a snapshot of a student’s ability. Our focus is on the student’s progress over the entire school year, so we are not just interested in a snapshot of a student, but in their continuous change through learning. So, in the spirit of innovation, we have iterated on these latent variable models to develop our own approach to measuring learning. We named this measurement Hobbes, after a beloved tiger known for his rational approach to solving problems.
Hobbes is derived from the Rasch model, an item response model used since the 1960s to measure a test taker’s ability and the probability that they would answer a question correctly. NWEA, a nonprofit testing provider used by districts across the country, adopted this model to generate RIT scores for each student on their MAP Assessment. The model is more robust than calculating the average percentage correct because it accounts for both varying student ability and differing question difficulty. It uses a logistic curve (see below) to translate those two parameters into the probability that a student would answer a question correctly (a value between 0 and 1).
How does the Rasch Model work?
This curve is described by the standard Rasch logistic function:
P(correct) = exp(beta_n - theta_i) / (1 + exp(beta_n - theta_i))
Given the difficulty theta_i of question i and the ability beta_n of student n, the function estimates the expected probability that the student would answer the question correctly.
Notice that as student ability decreases relative to question difficulty, the probability approaches 0. Conversely, as student ability increases relative to question difficulty, the probability approaches 1. Both the question difficulty and student ability parameters are normalized around the mean, so beta = theta = 0 translates to a 50% chance of answering a question correctly.
If we sample three answers from students answering questions on the “distinguish vowel sounds” learning standard, we might see that struggling student A (beta = -1.2) encountering an easy question (theta = -1.2) would still have a 50% chance of answering it correctly, because their ability matches the question difficulty. If they encounter a harder question (theta = 1.5), their probability of answering it correctly drops to about 6%. On the other hand, a student who has already mastered this skill (beta = 1.7) has a better-than-even chance (about 55%) of answering that same harder question correctly.
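The three cases above can be checked with a few lines of Python. This is only a minimal sketch of the standard Rasch probability, not our production code; the function name rasch_probability is hypothetical, and student A's ability of -1.2 is an assumption implied by the 50% figure.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the Rasch model.

    ability is the student parameter (beta) and difficulty is the
    question parameter (theta), following the notation in this post.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Parameter values from the example in the text.
# Student A's ability is assumed to equal the easy question's difficulty.
student_a = -1.2
mastered_student = 1.7
easy_question = -1.2
hard_question = 1.5

p_a_easy = rasch_probability(student_a, easy_question)
p_a_hard = rasch_probability(student_a, hard_question)
p_m_hard = rasch_probability(mastered_student, hard_question)

print(round(p_a_easy, 3), round(p_a_hard, 3), round(p_m_hard, 3))
```

Student A sits at exactly 0.5 on the easy question and at about 0.06 on the harder one, while the student who has mastered the skill is at about 0.55 on the harder question.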
When we feed student quiz data into the model, it estimates the ability parameters (beta) and the difficulty parameters (theta) by minimizing the difference between actual and predicted values. These parameters provide a snapshot of how difficult the questions were and the relative ability of the students at the time the data was collected.
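To make the estimation step concrete, here is one way such a fit could be sketched in plain Python. It assumes every student answered every question and uses simple gradient steps on the Rasch log-likelihood (a joint maximum likelihood approach), which is an illustrative choice rather than a description of how Hobbes is fit in practice. The names fit_rasch, n_iter, and lr are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, n_iter=200, lr=1.0):
    """Estimate abilities (beta) and difficulties (theta).

    responses[n][i] is 1 if student n answered question i correctly, else 0.
    Assumes a complete matrix: every student answered every question.
    """
    n_students = len(responses)
    n_questions = len(responses[0])
    beta = [0.0] * n_students
    theta = [0.0] * n_questions

    for _ in range(n_iter):
        # Residual: actual answer minus the predicted probability.
        resid = [
            [responses[n][i] - sigmoid(beta[n] - theta[i])
             for i in range(n_questions)]
            for n in range(n_students)
        ]
        # A student who beats the prediction gets a higher ability estimate.
        for n in range(n_students):
            beta[n] += lr * sum(resid[n]) / n_questions
        # A question answered correctly more often than predicted gets easier.
        for i in range(n_questions):
            theta[i] -= lr * sum(row[i] for row in resid) / n_students
        # Shift both sets of parameters so difficulties are centered at 0.
        # The model only depends on beta - theta, so predictions are unchanged.
        shift = sum(theta) / n_questions
        theta = [t - shift for t in theta]
        beta = [b - shift for b in beta]

    return beta, theta
```

One caveat of this simple approach: a student who answers every question correctly (or incorrectly) has no finite best-fit ability, so their estimate keeps drifting as iterations increase. Production-grade item response software handles such cases more carefully.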
So how did we adapt the Rasch model to measure student growth?
The typical student working on eSpark takes a pre- and post-quiz for each standard. If a student needs practice in reading foundational skills, we assign them to practice activities in these learning standards. Before each set of activities, the student takes a pre-quiz with 5 questions drawn randomly from K unique questions. After their practice activities, they are given a post-quiz with a new set of 5 questions.
These pre- and post-quiz questions are all developed to measure a student's mastery of the relevant standard. If all the questions in our quiz bank had an identical difficulty level, we could simply take the difference in average correct answers to measure student growth. However, that is not the case, nor do we want it to be: having varying difficulty levels allows us to measure a range of ability levels.
To control for question difficulty, Hobbes finds the difference between a student’s post-quiz score and the Hobbes prediction of their post-quiz score. The Hobbes-predicted post-quiz score is calculated given a student’s pre-quiz ability and the post-quiz question difficulty.
By using this new metric, we create a measure that is comparable across all ability levels and across all difficulty levels.
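The growth calculation described above can be sketched as follows. This is a simplified illustration under stated assumptions: scores are counted as the number of correct answers, the predicted score is the sum of Rasch probabilities over the post-quiz questions, and the function names and example values are hypothetical.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def predicted_post_quiz_score(pre_quiz_ability, post_quiz_difficulties):
    """Expected number of correct post-quiz answers if ability had not changed."""
    return sum(
        rasch_probability(pre_quiz_ability, d) for d in post_quiz_difficulties
    )


def hobbes_growth(post_quiz_answers, pre_quiz_ability, post_quiz_difficulties):
    """Actual post-quiz score minus the predicted post-quiz score."""
    actual = sum(post_quiz_answers)
    predicted = predicted_post_quiz_score(pre_quiz_ability, post_quiz_difficulties)
    return actual - predicted


# Hypothetical student: average pre-quiz ability, 4 of 5 post-quiz answers correct.
growth = hobbes_growth(
    post_quiz_answers=[1, 1, 0, 1, 1],
    pre_quiz_ability=0.0,
    post_quiz_difficulties=[-1.0, 0.0, 0.0, 0.0, 1.0],
)
print(growth)
```

In this made-up case the model predicts 2.5 correct answers, so a score of 4 corresponds to a growth of 1.5. Because the prediction already accounts for which questions the student happened to draw, a hard post-quiz does not unfairly depress the measured growth.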
We are also aware that different questions may target different skills that contribute to the same learning standard. For example, the distinguishing vowel sounds learning standard can only be mastered if students can identify both short and long vowel sounds. By computing Hobbes growth for each question, we are able to aggregate growth by skill. When we reviewed the Hobbes results for this same reading standard, we noticed that students are much more adept at learning to identify long vowel sounds than they are at learning to identify short vowel sounds.
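A per-question version of the same idea might look like the sketch below. It assumes each post-quiz question carries a skill label and that per-question growth is the answer (1 or 0) minus the predicted probability of a correct answer. The record layout, the function name growth_by_skill, and the sample data are all hypothetical and chosen only to show the mechanics.

```python
import math
from collections import defaultdict


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def growth_by_skill(records):
    """Average per-question growth for each skill.

    Each record is (skill, answered_correctly, pre_quiz_ability, difficulty),
    where answered_correctly is 1 or 0.
    """
    per_skill = defaultdict(list)
    for skill, correct, ability, difficulty in records:
        per_skill[skill].append(correct - rasch_probability(ability, difficulty))
    return {skill: sum(vals) / len(vals) for skill, vals in per_skill.items()}


# Made-up post-quiz answers, not real eSpark data.
sample_records = [
    ("long vowel", 1, 0.0, 0.0),
    ("long vowel", 1, 0.0, 0.0),
    ("short vowel", 1, 0.0, 0.0),
    ("short vowel", 0, 0.0, 0.0),
]
skill_growth = growth_by_skill(sample_records)
print(skill_growth)
```

With this toy input, long vowel questions average a growth of 0.5 and short vowel questions average 0.0, which is the kind of per-skill contrast the aggregation is meant to surface.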
This type of information helps us decide which apps and activities we should focus on providing students and how to frame their learning experience. Being able to accurately detect the magnitude of learning not only allows us to continuously iterate and improve on our curriculum, but it also allows us to give students pointed feedback on which skills they have mastered and which skills they need more practice on. Finally, it opens up the potential for us to more quickly adapt learning content to their needs.