Dagstuhl trip report: People, programs, and programming errors

Amy J. Ko
Bits and Behavior

--

Over the years, I’ve been fortunate to have been invited to many Dagstuhl workshops. These workshops are wonderfully removed retreats in a 19th-century castle in western Germany, in which a few dozen computing researchers gather to discuss some topic and hopefully chart its future research trajectories. My first was when I was a senior doctoral student back in 2007, on end-user software engineering; I’ve since been to ones on practical software testing (2010), software development analytics (2014), human-centric development of software tools (2015), assessing learning in introductory CS (2016), programmer productivity (2017), designing and evaluating programming languages (2018), and learning and teaching programming languages (2019). And this past Sunday through Wednesday, I was fortunate to join another on human factors in programming error messages (2022!), organized by Brett A. Becker (University College Dublin), Paul Denny (University of Auckland), Janet Siegmund (TU Chemnitz), and Andreas Stefik (University of Nevada, Las Vegas).

The beautiful thing about Dagstuhl workshops is how immersive they are. There’s basically nothing to do at the castle aside from talk with the other attendees, eat and drink, and play games, and so they can be incredibly transformative to nascent communities. I have fond memories of early morning coffee with new friends, playing board games late at night, getting into philosophical arguments about the nature of programming, code, and correctness, and going for adventures into rural western Germany for food and sights.

Of course, we were still in a global pandemic and so the format wasn’t quite like the others. I opted not to travel, and so, like the other ~15 attendees online, I mostly lived in Zoom while we imagined the other ~15 attendees doing all of the fun things above together. And I also needed to manage time zones: the workshop stuck with the full Central European Time schedule, from 8:30 CET to evening sessions until around 21:00 CET, which for me meant midnight Pacific Time to about 12:30 pm, with a half day of work after each workshop day. I had to sleep sometime, so I skipped the 1:30 am-5:30 am sessions and ran on fumes for a few days. (Not something I want to do regularly!)

Because I didn’t attend the entire workshop, and I was pretty loopy from a lack of sleep, this won’t quite be a chronological trip report. Instead, I’m going to share several themes that emerged from my fuzzy consciousness as I listened to my wonderful new colleagues share their experiences, insights, theories, and ideas, and I struggled to share my own late at night and early in the morning. I hope my fellow attendees will recognize many of these ideas, as they are a messy synthesis of all of our thoughts. Apologies for not crediting everyone for their ideas—I didn’t do a good job tracking who said what and many of them are a messy assemblage of our collective conversation anyway. (For collective credit, see the attendee list). I hope they’re a useful record for the other attendees, and more importantly, everyone who was not invited to attend.

Why error messages?

The first and most obvious question the workshop had to address was why talk about error messages at all? Take, for example, this Python error:

while True print('Hello world')
> SyntaxError: invalid syntax

How could the simple statement above possibly be more important than the language design itself, the myriad diversity, equity, and inclusion issues in CS, and the grand challenges of writing correct software at scale?

The case the group made was a subtle one, woven together through numerous observations and research evidence about their hidden importance:

  • Error messages can affect self-efficacy, leading learners to give up
  • Error messages can mislead, leading to fruitless debugging sessions
  • Error messages are predominantly English, excluding much of the world
  • Error messages are frequently incomprehensible, halting productivity
  • Error messages are judgmental, leading learners to blame themselves

These and many other observations suggest that these little messages actually carry great power in many aspects of programming, productivity, diversity, and inclusion. For example, Paul Denny talked about a study in which students were attempting dozens of solutions to an error and never getting their program to compile, ultimately giving up entirely. Others demonstrated how errors were often so misleading that they led students to completely modify their program in search of something that would resolve the error. And others still showed studies in which even slight improvements to error message quality dramatically reduced debugging time and reduced errors overall.

Over the course of the workshop, some of us theorized that the reasons for their power have much to do with the outsized authority that they have in shaping learners’ problem solving strategies and activities, coupled with their poor and misleading qualities.

What are error messages?

It might seem obvious — they’re just text shown when there’s a parsing or type error, right? But the group surfaced several questions about their essential nature and whether “error” and “message” are the right words.

For example, the word “error” itself implies that a person has done something wrong or mistaken. That might be factually true with respect to the rules of the language, but there’s nothing morally, ethically, or existentially wrong about what a programmer has done by not complying with a language’s grammar. I began to realize how often “error” messages are just gaslighting (e.g., “You didn’t define this”, “But I did…” “No you didn’t…”). But far from a person violating some natural law, errors are just code that is not consistent with a grammar or is ambiguous. Talking about “errors” as “ambiguities” or “inconsistencies” might be more accurate and less judgmental.

The group also talked about how the word “message” is unhelpfully narrow. For example, it implies that error “messages” are simply some text that is transmitted to a user for reading. But the reality of error messages is that they are often situated in a richer interactive context, such as an IDE, which might be highlighting parts of a program, which might offer quick fixes, and which might also link to other definitions or documentation. These are more than messages; they are often complex user interfaces with a rich set of potential interactions for addressing the error. Talking about error “messages” as error “interfaces” might be more accurate. (“Inconsistency interfaces”, what a mouthful!)
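To make the "interface" point a little more concrete, here is a minimal Python sketch of the structured data that sits behind a one-line syntax error. It relies on the built-in `compile` function and the standard `SyntaxError` attributes (`msg`, `lineno`, `offset`, `text`). The helper names `describe_syntax_error` and `render_with_pointer` are invented for illustration, and the exact message wording and column vary across Python versions.

```python
def describe_syntax_error(source):
    """Return structured details about the first syntax error, or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as err:
        return {
            "message": err.msg,    # wording differs across Python versions
            "line": err.lineno,    # 1-based line number
            "column": err.offset,  # 1-based column; may be None
            "source_line": (err.text or "").rstrip("\n"),
        }
    return None


def render_with_pointer(details):
    """Render the offending line with a caret under the reported column."""
    column = details["column"] or 1
    pointer = " " * (column - 1) + "^"
    return "\n".join([details["source_line"], pointer, details["message"]])


details = describe_syntax_error("while True print('Hello world')\n")
print(render_with_pointer(details))
```

An editor could use the same line and column data to highlight code or attach a quick fix, which is roughly what separates an error "interface" from an error "message".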

This interactive aspect of errors was also visible in our analysis of some real student data around errors at the workshop. In many cases, the presentation of an error at a particular time completely derailed students’ work and seemed to even lead to new (and inaccurate) interpretations of the language syntax and semantics. More than just text, students seem to treat them as certain guidance, using their contents and affordances to completely change their programming strategies, rather than reflecting on their meaning before acting.

Who’s reading?

One of the most fascinating threads throughout the workshop was our reflections on the audience of a message. Some of the attendees began with a fairly rigid view that all errors should be terse, factual feedback and that such feedback was universal, as it was grounded in the reality of a programming language’s rules.

But many attendees chipped away at this reductive idea. For example, one of the simplest observations was that not everyone speaks English. An English error message to a fluent English reader might mean one thing, but to someone who knows no English it might simply be something to copy and paste into a search engine, a simple cue that something is wrong, or worse, it might mean nothing.

Error message readers also vary in their knowledge. Someone who knows a language well might only need one or two words to identify the problem, whereas someone just learning the language might have a very brittle understanding of the language syntax and need more instruction in a message about the rules of a language. And the learning progression from seeing an error for the first time to seeing it for the 1,000th time may demand different kinds of instruction and context along the way.

People who use screen readers (most commonly people who are blind, low vision, or dyslexic) also experience messages differently. Unlike sighted reading, which can be “random access” around the content of a message, screen readers convert text into a serial audio stream. And so when the most critical content is buried in the middle of a 15-second audio clip, it can greatly slow down navigation and problem solving, and fatigue someone who is listening closely. So a helpful verbose message for a sighted person might be agonizingly slow for a blind person. Some code is also unambiguous as text (“3”, “three”) but ambiguous when spoken, posing unique challenges for audio representations of text.

Sighted, English-fluent people with motor impairments might not have any issue reading a message, but they might have an issue accessing the message if it’s hidden behind a tooltip or requires opening some other window in an IDE. These interactive properties of error messages can fatigue people with motor impairments, making an already hard task harder.

And of course, all of this is intersectional: all of the factors above, plus culture, age, literacy, and dozens of other factors, all interact in unique ways, leading any given error message to be more or less comprehensible or helpful to specific individuals depending on their identity. This means that it’s quite hard to generalize to universal requirements for error messages: it depends on context, culture, identity, and more.

What makes a good message?

Despite the diversity of audiences of error messages, the group had many opinions about what makes a good error message. Some hypothesized that terse messages were best; others valued instructive messages; we identified the inherent tensions between these two principles. Some talked about comprehensibility as a product of clear writing and careful choice of terminology. We discussed some of my work with Mike Lee on personification of compilers. We talked about the inapplicability of readability scores and their English bias. I suggested many surface qualities of messages including typographic legibility, the number of accessible modalities in which a message is available, the parsability of a message’s typographic structure, and its proximity to the code to which it refers. We revealed these factors and more, but were far from agreement about what mattered. And the decades of work studying what matters had not provided much guidance.

Some hypothesized that the complexity of what makes an error good has much to do with context, and context is so richly diverse that it is hard to predict in any aggregate or universal way what might make a message “good”. This is perhaps tied to the diversity of the audience explicated above, and many programming language designers’ bias toward designing messages for people who are like them, rather than everyone else. Some called for a science of error messages to better organize all of these factors and begin theorizing about their relationships. Others seemed to be resistant to any response to diversity, insisting that purely factual messages in English were ideal and that any attempt to personalize would only confuse and increase complexity.

Must we have just one and must it be text?

The attendees included many programming language designers and so a major theme was lamenting the difficulty of supporting diverse needs and audiences. Localization is straightforward technically, but actually writing and maintaining messages in multiple languages is far from easy. Languages are often only built to support one canonical message of each kind, rather than multiple diverse levels of detail or distinct audiences. Even languages that have some of this built in, like Pyret, pose hard choices to language designers about how to ensure consistent terminology across language and API barriers and how to synchronize languages, messages, textbooks, and other instructional materials.

There was also some discussion about the tensions between the level of detail and context in a message and the performance costs. Even for small programs, some runtime errors, for example, could benefit from retaining a lot of context from execution, but such context might be prohibitively large to capture and store on slower devices with less memory.

Tying tools to all of the considerations above, we also discussed the ways that many error messages try to offer suggestions, but have to do so out of some inference about intent. Predicting intent, but also accounting for incorrect inferences, is something that no tools or languages have really succeeded at.
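As a rough illustration of suggesting a fix from inferred intent while leaving room for a wrong inference, here is a minimal Python sketch. It uses `difflib.get_close_matches` from the standard library as a stand-in for whatever inference a real tool might perform. The function name `suggest_name`, the message wording, and the default cutoff are assumptions made for this example, not a description of any existing language's behavior.

```python
import difflib


def suggest_name(unknown, known_names, cutoff=0.7):
    """Return a hedged suggestion for an undefined name, or None.

    Staying silent when nothing is similar enough is one way to
    account for incorrect inferences about intent.
    """
    matches = difflib.get_close_matches(unknown, known_names, n=1, cutoff=cutoff)
    if not matches:
        return None
    # Phrase the guess as a question rather than a fact about intent.
    return (
        f"'{unknown}' is not defined. "
        f"A similar name exists: '{matches[0]}'. Is that what you meant?"
    )


print(suggest_name("lenght", ["length", "width", "height"]))
print(suggest_name("zzz", ["length", "width", "height"]))
```

Raising `cutoff` trades fewer misleading suggestions for more silence. String similarity is only a proxy for intent, which is part of why this problem remains hard.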

Some of the more radical ideation at the workshop noted some rich interactive error message features in current languages like Pyret, but also imagined even more radical alternatives to simple text error messages. For example, maybe there is value in just having errors be unique codes and then supporting a diversity of renderings and explanations of those codes based on the user (even comical ones, like memes!). Or perhaps there are ways of explaining error messages without words at all, through animations, physical metaphors, or rich displays of possible modifications to address an error.
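The "unique codes with many renderings" idea might look something like the following Python sketch. Everything here is hypothetical: the code `E001`, the audience labels, the template wording, and the `ErrorReport` and `render` names are invented to show the shape of the idea, not taken from any existing language.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorReport:
    code: str   # stable identifier (hypothetical naming scheme)
    line: int   # 1-based line number where the problem was detected
    details: dict = field(default_factory=dict)


# Hypothetical catalogue: one code, several renderings keyed by audience.
RENDERINGS = {
    "E001": {
        "expert": "line {line}: expected ':' after '{keyword}'",
        "novice": (
            "On line {line}, the '{keyword}' statement needs a colon (:) "
            "at the end of its first line, before the indented block."
        ),
        # Front-load the key fact for serial audio output.
        "screen_reader": "Missing colon, line {line}, after {keyword}.",
    },
}


def render(report, audience="expert"):
    """Pick a rendering for the audience, falling back to the expert form."""
    templates = RENDERINGS.get(report.code, {})
    template = templates.get(audience) or templates.get("expert")
    if template is None:
        # Unknown code: show the bare code and line rather than guessing.
        return f"{report.code} (line {report.line})"
    return template.format(line=report.line, **report.details)


report = ErrorReport("E001", 1, {"keyword": "while"})
for audience in ("expert", "novice", "screen_reader"):
    print(render(report, audience))
```

Keeping the code stable and the renderings separate would also let localized or community-written explanations be added without touching the compiler itself.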

Why aren’t messages better by now?

Many at the workshop lamented that programming language error messages are still terrible. And for some languages, they are. But others noted significant improvements in quality in languages like Rust, Elm, and Python, due to community efforts. Some of these are due to market forces, but others are just due to different communities having different values. Many pointed to the perils and paradoxes of capitalism and open source: if the language is a revenue source, how do better error messages increase profit? And if it’s free, why improve anything?

Some of us remained optimistic that advocacy could make change; others were more cynical and believed profit and power would win, as they usually do. Others still wondered if researchers and their research had any role to play in any of this, or if our time might be better spent just designing new languages with better messages (whatever better means).

What’s next?

In our closing sessions, we met in small groups and brainstormed about what our community might do to advance the future of research and teaching on error messages. Some key themes emerged:

  • There is a massive, unexplored design space of error message presentations and interactions. Animations, personalized scaffolding, and more are largely unexamined and unimplemented. Part of this has to do with the difficulty of changing existing languages and creating new ones.
  • There are many unexplored forms of instruction about how to respond to error messages. Some notable ideas that emerged are peer instruction and error resolution strategy modeling. But there are many open questions about what strategies are good and whether there are any universally effective strategies.
  • Many noted the lack of theoretical explanations of the mechanism behind student error behavior and the rabbit holes they often lead to, and of the diversity of audiences reading messages. In particular, the group viewed error messages as a symptom of “upstream” confusion about language semantics and other cognitive problems. The community would be smart to invest in this analytical work, creating a stronger foundation to support and inform future studies and designs.
  • While we briefly talked about error prevention at the workshop (e.g., through block-based editors that prevent syntax errors), there is much more to do to examine the tradeoffs of static typing and type inference, which can prevent defects, but only by introducing yet more error messages.
  • There is much work to do to understand the longitudinal, expert experiences with errors; most research has been done with novices, but there may be important constraints and requirements for people who encounter errors hundreds or thousands of times.

Personally, I think that the most important, impactful, and profound next steps would be

^
Unexpected end of file

Zzzzzzz

--

Professor, University of Washington iSchool (she/her). Code, learning, design, justice. Trans, queer, parent, and lover of learning.