Exam scores and computer science classrooms

With so much momentum behind computer science in K-12, how do we know what’s working and what’s not? How does one measure or reward teachers, schools, and districts that expand access to computer science, delivering quality education while broadening participation, especially among women and underrepresented minorities? Code.org faced this question recently when we published AP exam pass-rates from our Computer Science Principles classrooms.

Exam results are a very sensitive topic in education. We regularly hear from teachers who feel pressured by a culture of test-taking, and about political fights over the proper role of exam results in teacher accountability. Because Code.org asks our CS Principles teachers to encourage every student to take the AP Computer Science exam, we care a great deal about how their exam results are used, and we want to make sure they’re not used to draw inappropriate conclusions. We especially don’t want our nationally-reported scores to be used in a way that creates undue pressure on our teachers or their classrooms.

This is not a theoretical danger. With Code.org, the College Board, and others pushing to spread AP CS Principles to schools nationwide, there is a lot of attention on this new exam, and it’s important for the CS community to think carefully about how the exam results are used.

It’s certainly OK to celebrate exam results (ideally with standard academic caveats), and to applaud the classrooms and students that have worked hard to pass a college-level exam.

But we enter murky waters when informal suggestions are made that these national reports can be used in other ways, such as to help schools choose between curriculum options or professional development for teachers. It sounds reasonable to suggest that these national averages speak to the quality of the curriculum that was used, or to the professional development offered to teachers. Curriculum writers constantly face the question “How do I know what you do is good?” and it is tempting to answer that question by looking at national reports. Maybe exam pass-rates shouldn’t be looked at in isolation, but surely these national reports can play a part in the equation. After all, professional development and course providers need some measure of how well their resources prepare students for the exam, right? If all student groups score dismally on an exam, couldn’t one rightfully question the curriculum?

These suggestions may sound reasonable, but because the nationally-reported pass-rates are muddied by selection bias and other factors, Code.org believes that the exact opposite is true — that these ideas are not academically sound, and that it’s dangerously easy to draw incorrect conclusions from these national reports, with unintended consequences. This issue may appear subtle, but it’s important.

Student selection and other factors impact national test results

It’s easiest to explain our position by imagining two classrooms, in completely different schools. Teacher A had 90% of her students pass the exam. Teacher B had 70% of his students pass the exam. It’s easy to say that Teacher A is a better teacher, or that these test results could play a partial role in measuring her work, but educators know that this is a dangerously wrong conclusion to draw without first looking at differences between the schools, the students, their prior learning, and a whole host of other factors. It’s equally incorrect to suggest that the classroom scores speak to the quality of the curriculum that the teachers used. And these conclusions would be especially wrong if the exam itself were optional for the students.
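To make the optional-exam point concrete, here is a minimal simulation sketch. Everything in it is a made-up assumption for illustration, not real AP data: the score distribution, the `PASS_MARK` threshold, and the `opt_in_cutoff` rule are all hypothetical. Both groups are taught identically by construction; the only difference is which students choose to sit the exam.

```python
import random

PASS_MARK = 50.0  # hypothetical passing score, not an actual AP cutoff


def simulate_group(n_students, opt_in_cutoff, seed):
    """Return (pass rate among exam takers, number of exam takers).

    Every student's preparedness comes from the same distribution, so
    teaching quality is identical across groups by construction. Students
    whose preparedness is below `opt_in_cutoff` skip the optional exam.
    """
    rng = random.Random(seed)
    takers = 0
    passes = 0
    for _ in range(n_students):
        preparedness = rng.gauss(55, 15)
        if preparedness < opt_in_cutoff:
            continue  # this student does not take the exam
        takers += 1
        exam_score = preparedness + rng.gauss(0, 10)  # exam-day noise
        if exam_score >= PASS_MARK:
            passes += 1
    return passes / takers, takers


# Group A: only the more confident students opt in to the exam.
rate_a, takers_a = simulate_group(10_000, opt_in_cutoff=50, seed=1)
# Group B: every student is encouraged to take the exam.
rate_b, takers_b = simulate_group(10_000, opt_in_cutoff=float("-inf"), seed=2)

print(f"Group A: {takers_a} takers, pass rate {rate_a:.0%}")
print(f"Group B: {takers_b} takers, pass rate {rate_b:.0%}")
```

Under these assumptions Group A reports the higher pass rate even though nothing about the teaching differs, which is the kind of selection effect that a national average cannot reveal.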

Now let’s look at a real-world example, comparing three different groups of classrooms: Group A, Group B, and Group C, representing thousands of students in each. The teachers in these classrooms were prepared using different curricula and professional learning programs. If Group A scored higher than Group B, and Group B scored higher than Group C, can we draw conclusions about the quality of the curriculum or professional learning program? The chart below suggests that we can’t draw any such conclusion. (We’ve removed the names of these programs, but this is real-world data.)

One could also look at the chart above and conclude that race determines exam results, but that too would be academically unsound. Beyond factors such as gender or race (which are reported and can be disaggregated at a national level), the reported exam results are deeply impacted by factors that are unmeasured and unreported:

(1) What is the socioeconomic background of the schools or the students?
(2) How did students choose whether or not to take the computer science class?
(3) How did students choose whether or not to take the AP exam?
(4) Which teachers received professional development to prepare them to teach the course, and which ones continued on to teach the course?
(5) Did the teachers teach the course with fidelity to the curriculum? Or did they mix and match with resources?
(6) Are there geographical variances that could impact the scores?

These unknown factors create tremendous noise in the national exam data, and nobody has sufficient data to separate the signal from this noise. Even if all student groups score dismally on the exam, one cannot draw conclusions without understanding all these factors.

These nationally reported AP exam pass-rates, whether published by Code.org or by other providers, cannot help to measure the quality of curriculum and professional learning unless all the confounding factors listed above are controlled for. It is tempting to suggest that these exam pass-rates could play a role, as a partial input into a broader equation, but mixing noisy data with other factors only makes things worse and creates a false sense of academic rigor. To ensure that nobody applies such logic when looking at the Code.org reports, we’ve strengthened the caveat language in our original blog post.

The danger of unintended consequences

Computer science is not a mandatory subject — in most states it is optional for schools to offer, it is optional for students to take the course, and it is optional to take the exam.

If the quality of an optional computer science program is measured — even partially — using exam scores, this could result in undue pressure on curriculum providers, schools, and teachers, to deliver higher results. Teacher-prep could turn into exam-prep. Schools could resort to selecting the best students to try computer science, or encouraging teachers to teach to a test instead of toward creativity, expression, and love of computer science.

This danger scales at all levels. Teachers need incentives that reward them for encouraging every student, especially from traditionally underrepresented groups, to participate. Schools and districts should be encouraging all students to try computer science. And curriculum or professional learning programs such as Code.org should be encouraged to measure ourselves by the diversity and population of the students in our programs.

Because this is an optional course and an optional exam, if we measure ourselves — even partially — based on nationally reported test results, it would set up the wrong incentives for our team, our partners, and our schools.

This isn’t a small issue; it is a deeply fundamental one. At Code.org, we work extremely hard to increase diversity of participation in computer science, and it’s important that our teachers feel supported in their efforts to advance this mission. Code.org stands by our teachers, and we will resist academically incorrect suggestions to measure our work in ways that could cause undue pressure on their classrooms.

But exams are often used this way in math and English

In subjects that are required for every student, with exams that are administered to every student, it is easier to draw academically sound conclusions, without these unintended consequences — because there is less selection bias confounding the data. Students have no choice but to take the class and the exam. To use exam scores in an optional subject like CS requires a different research approach.

Proper uses of exam data in research

Of course, it is possible to use computer science exam results in a proper research setting. Just as teachers use exam results to see how students are performing, curriculum creators can use exam scores to improve their curriculum or to measure the impact of professional development. But such research requires proper controls, factoring in school-level or student-level data that is not available in the national averages, in an experiment designed by professionals who have both the experience and the sound judgement to make academic claims based on exam results.

In the words of one such researcher (Michael Marder, of the University of Texas): “The most meaningful way to make use of test scores is by carefully controlling for student characteristics prior to the beginning of the course. These incoming characteristics can include scores on mathematics and reading exams in earlier grades, free/reduced lunch status, and other measures. There are standard techniques for including such factors in a model of student success and trying to separate out the effects of particular curricula. Sometimes signal can be extracted from the noise. Studies are even more convincing if students are randomized between curricula but this is not easy to do.”

Without a perfect control group, it’s also possible to create “quasi-experimental” control groups by comparing results among similarly-situated students at similarly-situated schools. In fact, many organizations (including Code.org) are engaging in exactly this sort of research. These quasi-experimental studies can help, but not as a basis for strong statements of causality.
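As a rough illustration of comparing similarly-situated students, the sketch below simulates two hypothetical programs, "X" and "Y", that have no effect on passing at all. It is not the method used by Code.org or by any researcher quoted here; the prior-score distributions, the pass-probability formula, and the 10-point score bands are all assumptions chosen for the example. Program X simply enrolls students with higher prior scores.

```python
import random
from collections import defaultdict


def simulate_students(n, prior_mean, program, rng):
    """Hypothetical students as (program, prior_score, passed) tuples.

    Passing depends only on the prior score, never on the program.
    """
    students = []
    for _ in range(n):
        prior = rng.gauss(prior_mean, 10)
        p_pass = min(max((prior - 30) / 50, 0.0), 1.0)
        students.append((program, prior, rng.random() < p_pass))
    return students


def pass_rate(rows):
    return sum(passed for _, _, passed in rows) / len(rows)


def adjusted_gap(students, width=10):
    """Average X-minus-Y pass-rate gap within bands of similar prior score."""
    bands = defaultdict(lambda: {"X": [], "Y": []})
    for row in students:
        bands[int(row[1] // width)][row[0]].append(row)
    total = 0
    weighted = 0.0
    for groups in bands.values():
        if groups["X"] and groups["Y"]:  # compare only where both programs appear
            size = len(groups["X"]) + len(groups["Y"])
            weighted += size * (pass_rate(groups["X"]) - pass_rate(groups["Y"]))
            total += size
    return weighted / total


rng = random.Random(0)
program_x = simulate_students(20_000, 65, "X", rng)  # stronger incoming students
program_y = simulate_students(20_000, 55, "Y", rng)

raw_gap = pass_rate(program_x) - pass_rate(program_y)
adj_gap = adjusted_gap(program_x + program_y)
print(f"Raw pass-rate gap (X - Y):      {raw_gap:+.3f}")
print(f"Gap within prior-score bands:   {adj_gap:+.3f}")
```

In this simulation the raw gap is large while the within-band gap shrinks toward zero. A small residual gap can remain because students in the same band are still not identical, which is one reason such comparisons support only cautious claims.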

In the meantime, we caution teachers, schools, and districts not to factor in nationally reported exam pass-rate averages as a measure of the quality of curriculum. These results cannot help in curriculum choice, not even as part of a broader equation, because they do not speak to the quality of curriculum or professional development. Suggesting otherwise is just as dangerous as saying that Teacher A is a better teacher than Teacher B because her students scored higher on the test.

Then how should schools choose computer science curriculum?

Code.org is one of many options that schools can look at when making curriculum choices in computer science. There are many other fantastic options, and the people working on these resources are often long-time colleagues and friends. We are all part of a community that is larger than ourselves, and our movement draws strength from collaboration, not from division.

Just because Code.org is widely used does not mean we are the best option for every school. Some courses are designed with a greater focus on block-based coding, for students with less prior coding experience. Some are part of a larger pathway that integrates into other courses at the school. Geographic considerations may come into play when choosing different professional learning programs for teachers. Some computer science resources may be specifically designed for students with disabilities. Some course providers charge a fee, in return for additional services. And over time, proper research studies will use multiple factors, including exam scores, to aid in choosing between alternatives.

Classroom teachers can and should play a big role in deciding what curriculum options to look at, and the best place to start is by asking teachers in similar circumstances what has worked in their classrooms. National reports of average exam pass-rates should play no role whatsoever.

Hadi Partovi, Code.org