Designing and evaluating programming languages: Dagstuhl trip report

Amy J. Ko
Published in Bits and Behavior
Feb 9, 2018


The attendees of the Dagstuhl workshop “Evidence About Programmers for Programming Language Design”

Researchers like to meet and geek out about ideas. It’s fun, it’s social, but importantly, it’s essential to progress, because research is inherently a community endeavor. Conferences are the main way we do this, and they’re great for sharing ideas that are either well-understood or under development.

Occasionally, however, we need to meet to talk about ideas we just barely understand, to help scope out how to investigate them, and develop community, collaborators, and partnerships to study them.

In computer science, Schloss Dagstuhl is one of the best places to do this. It started in 1990 as a dedicated place for scholarly retreats, bringing together disparate fields in computing to bootstrap new discovery. I was invited to my first Dagstuhl back in 2007 as a senior graduate student, and have come to multiple retreats since, each time investigating a new topic, and leaving with new ideas, new collaborators, and new research directions for my lab.

This year, I was invited to a 5-day retreat titled Evidence About Programmers for Programming Language Design. The goal of the retreat was to tackle a simple problem: programming language designers design a lot of programming languages (PL), but most of them are designed by intuition, which leads to a lot of hard-to-learn, error-prone, and sometimes incoherent designs. These design flaws matter for many reasons:

  1. PL that are hard to learn pose barriers to engaging youth in computing.
  2. PL that are error-prone cause software defects.
  3. PL that are used to analyze data in science lead to errors in scientific discovery.
  4. All of the issues above can have immense economic costs, because retraining teachers and developers around the world is expensive.

These challenges aren’t limited to language designs from particular language designers. Industry languages like JavaScript are full of unfortunate choices, yes. However, many of the languages designed in more principled ways by academia, while safer to use, are nearly unlearnable. And there are entire genres of languages no one has yet created that might better serve diverse populations of people intimidated by or uninterested in current designs. While we know how to implement languages, designing them is still an art.

The retreat included researchers from programming languages, software engineering, human-computer interaction, and computing education, as well as people from for-profit and not-for-profit organizations concerned with language design, including Google, Microsoft, and others. While Dagstuhl retreats are usually diverse, bringing together two or more areas of computer science, this retreat was particularly diverse, bringing together three areas of CS, but also industry, statistics, and psychology.

The feeling of being at a Dagstuhl is one of vibrant intellectual inquiry. We met 9–5 every day, giving short talks, asking questions, discussing countless issues at coffee breaks, arguing about foundational questions and terminology at lunch and dinner. Debates raged on in the wine cellar, the coffee lounge, and the game room until well past midnight each night. Even when we played games together, our jokes and banter still probed the boundaries of our knowledge about PL design.

Playing Codenames after dinner while making PL jokes and drinking.

Everyone came with their own goals and with their own insights. Some came to learn. Some came to advocate for evidence standards in language evaluation. I personally came to understand the common ground across these fields and problems, and build a deeper understanding of the foundational issues in PL design and its impact on learning. Here’s what I discovered:

We know little about PL design as a process

Jonathan Aldrich and Michael Coblenz talk about type theory and PL design.

One of the most salient ideas I learned was that while we have many language designers, the processes they use are still driven largely by intuition and formal reasoning. The formal reasoning from mathematics is utterly precise, but the intuition is profoundly imprecise. Language designers often make vague arguments about the “simplicity” or “intuitiveness” of language features, without ever saying what they mean by these terms, or testing whether these claims are true.

These claims often cover a vast space of qualities. PL can be efficient not only in computation time, but also in developer time. PL can be learnable. They can be teachable. They can vary in simplicity, expressiveness, and comprehensibility. And yet, we have few precise definitions of any of these qualities, let alone methods for measuring them, or even reasoning about them. We also lack a precise understanding of how these qualities trade off against one another; for example, an expressive language might be more complex, which might decrease learnability. Or it might not. We don’t know when or why these tradeoffs would occur.

PL designers have the challenging task of thinking about all of these issues as they sketch, prototype, and iterate on a language design. Throughout, they engage in arguments about the merits of a language in all of these dimensions. For example, I might argue that “if” is a good keyword for conditionals, because in English, the word has no reasonable synonyms and has natural language semantics that are well-aligned with a conditional’s control flow semantics. Many PL designers view the quality of the process and resulting language as an outcome of the quality of the arguments behind it.

And yet, the lack of clarity about these qualities makes having these arguments hard. The ambiguity prevents PL designers from interrogating their efficacy in precise or systematic ways. Sometimes, it might be necessary to gather data to resolve a debate. This requires precise measurements of the qualities above, which we often do not have.

The methods we might use to evaluate language features are many, but the group had a large bias towards positivist methods like randomized controlled trials. Some of the HCI researchers in the room advocated for a broader set of both high- and low-ceremony methods, each supporting the evaluation of different qualities. Many, including our organizer Andreas Stefik, called for more explicit evidence standards to ensure that the quality of our methods, and therefore the quality of our conclusions, would be sufficiently high to support evidence-based language design choices.

That said, there was broad disagreement about what role evaluations might play in guiding a PL design process. Some of the PL designers in the room saw great value in relying on intuition and were unsure what studies would tell them. Some of the strong positivists in the room believed that the only valid way of knowing the true efficacy of a choice was to run an experiment to test it, and that a series of studies would allow one to hill climb from less to more efficacious designs.

I argued that a meaningful middle ground would be to create a bridge between intuition and empiricism through theory. By theory, I don’t mean grand theories with fancy names that try to explain the universe, but small theories that try to explain the strengths and weaknesses of the language features we might argue about.

Take, for example, the idea of variable names found in most modern languages. Languages didn’t always have user-defined names; we used to write assembly language with a fixed set of registers with predefined names or no names at all. Why are names useful? One theory is that they are conceptual shorthand for the semantics of a program’s behavior, accelerating our ability to read and reason about its behavior. For example, the name kitten_count clearly refers to some quantity of kittens, and implies that we are counting them. In contrast, the name x could refer to anything. Having names that evoke concepts helps us reason about the role of the variable in the larger purpose of an algorithm, accelerating our inference of that purpose. On the other hand, bad names might serve to build a wrong model of the role of a variable and behavior of an algorithm, slowing program comprehension, debugging, or modification of a program.
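To make the naming theory concrete, here is a minimal Python sketch. All three functions are hypothetical illustrations that I made up for this purpose; the first two behave identically and differ only in their names, while the third shows how a misleading name can plant a wrong model.

```python
# Hypothetical example: f and total_kitten_count behave identically
# and differ only in naming.

def f(x, y):
    # Opaque names: the reader must infer each variable's role from how it is used.
    return x * y


def total_kitten_count(litter_count, kittens_per_litter):
    # Evocative names: each variable's role in the computation is stated up front.
    return litter_count * kittens_per_litter


def kitten_count(litters):
    # Misleading name: this actually counts litters, not kittens, so a reader
    # who trusts the name builds the wrong model of what the caller receives.
    return len(litters)
```

Under the theory, a reader should infer the purpose of total_kitten_count faster than that of f, and should be slowed down by kitten_count. That is a prediction one could test, not a measured result.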

This theory of variable names can serve many roles:

  • PL designers can use these theories to analyze the benefits of their choices and to make predictions about alternative design decisions.
  • Empiricists who want to rigorously test language features can use theories like this to derive predictions, test hypotheses, and refine theories of language features.
  • Teachers can use these theories to explain to learners why a language feature was designed the way it was, and why it might be useful.

Of course, even with well-tested theories of language features, it’s still up to language designers to design. Theories and studies testing those theories will only inform a choice; they won’t make it for the designer. This is true for any decision-making setting: doctors can use theories and evidence to help make decisions, but they still have to make a decision on limited information.

We know little about the relationship between programming languages and culture

Professor Felienne Hermans of TU Delft talked about programming languages and culture.

Another interesting idea that emerged was that as designed artifacts, PL designs both emerge from culture and create culture. For example, being a Python developer means engaging with a “Pythonic” culture, which values power, simplicity, learnability, and openness. In contrast, Google’s Go culture values reliability and efficiency. These differences in values lead to different experiences, and PL designers are just as much espousing values as they are defining language semantics.

Researchers haven’t thought much about what these cultures are, how they affect the experience of using a language, and how they influence who engages with a language and who doesn’t. For example, despite decades of work on teaching Java in schools, there haven’t been any studies reflecting on what the culture of Java is, and what it means to expose students to it. Similarly, when a company adopts PHP, what values is it bringing into its organization? And what happens when people are “polyglot” programmers, or engage in “polyglot” programming? How do developers reconcile the value tensions between these cultures? If languages create cultures, and languages fall out of use, that means their cultures die. What happens to those cultures? Is the code that remains written in that language like archaeological artifacts, requiring code archaeologists to reverse engineer the original culture of the language to comprehend it? Another interesting set of questions is why people love and hate languages. Is this also about language culture? Or is it about identity, brand, or community?

The role of sociocultural factors in shaping the experience of a language seems undeniable, and virtually no one has really tried to understand this factor. This understanding could have many practical impacts on the language design process. For example, it might compel a designer to be more explicit about the values in their culture, and identify other artifacts, such as tools, IDEs, documentation, and other media that need to be consistent with that culture.

We have no theory of language learnability

There were five Andrews at the meeting! At this dinner, Andy Stefik and I had a great argument about the value of theory in language design.

Most of the attendees were specifically concerned with PL learnability, even the language designers in the group. After all, when one designs a language, one wants it used, and that requires learning. Unfortunately, like most PL qualities, we have no clear definition of language learnability, or theory of what mediates how people learn languages.

One idea that emerged is that learnability appears to concern the complexity of abstractions embedded in a language. For example, consider indefinite loops, which require someone to reason about potentially infinite executions of a block of code. What about this possibility of the infinite is hard and why is it hard? We need methods to discover these hard concepts in languages, so that we can precisely define what is irreducibly difficult about them.
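As a small illustration of why indefinite loops are hard to reason about, consider this Python sketch. The choice of example (the Collatz iteration) and the names collatz_steps and max_steps are mine, not something discussed at the retreat. Nothing in the loop header says how many times the body runs, and whether it terminates for every positive integer is a well-known open problem, so the sketch adds a step limit as a safeguard.

```python
def collatz_steps(n, max_steps=10_000):
    """Count loop iterations until n reaches 1, or return None after max_steps."""
    steps = 0
    while n != 1:  # indefinite loop: the iteration count depends on the data
        if steps >= max_steps:
            return None  # give up rather than risk looping forever
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
```

Here collatz_steps(6) returns 8, while collatz_steps(0) never reaches 1 and returns None once the limit is hit. To predict either result, a reader has to simulate the loop rather than read the answer off the code.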

Abstractions also interact with people’s prior knowledge, making the mental modeling of that complexity more or less difficult. For example, when one sees the symbol = in mathematics, it means one thing (equality), but when one sees it in JavaScript, it means another (assignment). Are these conflicts purely syntactic, or are they also semantic?
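The conflict can be sketched in a few lines. The example below uses Python rather than JavaScript, on the assumption that the point carries over, since both languages use = for assignment and a separate operator for comparison. The variable names are hypothetical.

```python
count = 5
count = count + 1  # as an equation this has no solution; as an assignment it stores 6

a = 1
b = a   # b receives the current value of a; no lasting relationship is created
a = 10  # unlike mathematical equality, this later change does not affect b

still_equal = (a == b)  # == asks the question that = asks in mathematics
```

After these lines run, count is 6, b is still 1, and still_equal is False. The symbol is familiar, but the semantics (a one-time, ordered update) are not the ones mathematics teaches.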

There appear to be many ways to mitigate abstraction complexity. One way is to reduce the complexity of an abstraction. We don’t really know what complexity means, but intuitively, we know that simply eliminating a feature from a language removes a set of difficulties, and adding one adds difficulties. One can also reduce the formality of language abstractions, which may reduce some kinds of complexity, but increase others. It’s a fascinating open question.

Another way to mitigate difficulty is at the tool level, reducing complexity by creating tooling that either hides complexity, or resolves a difficulty through better information or feedback. For example, more detailed error messages can help clarify a language’s semantics.

Interestingly, the computing education researchers in the room pointed out that complexity can also be reduced pedagogically, by providing learners with alternative representations of abstractions that mask or reduce complexity. For example, when explaining = in JavaScript, one can say “We’re not in Math Land anymore!”, which makes salient the important difference.

Many of these ideas reminded me of Papert’s ideas around different representations of the same idea. Learners may need a series of representations, progressing the learner from a shallow to a deep, informal to formal understanding of a language abstraction. Inventing, testing, sequencing, and personalizing these representations may be a fruitful pursuit for computing education researchers.

We have no theory of language teachability

Baker discussing the need to create micro-languages to avoid complexities and inconsistencies in more ubiquitous, authentic PL designs.

An even more surprising idea I encountered was that language learnability isn’t the same thing as language teachability. Baker Franke shared this idea, arguing that teachability concerns the ability to decompose a language into curriculum and pedagogy, and that PL design choices constrain this decomposition.

One idea that Baker used to explore this was the idea of a consistent narrative. This is the idea that to explain to a learner what a language is for, and why it is what it is, there must be a story that stitches together pedagogy about its parts into a larger whole. Languages vary in their ability to support a consistent narrative; more “opinionated” languages appear to have more consistent narratives. For example, Python often aims to have one way to do everything, and so the narrative around each language feature can consistently say, “this is the only way to do this, just like with everything else in Python, so don’t go searching for more.” In contrast, the rationale for a lot of JavaScript language features is historical: for loops work the way they do because of C, but we don’t know why C works that way, and we have no idea why JavaScript objects have a prototype-instance model. Baker argued that consistent narratives help learners build larger models about languages, which help them make predictions about how to use them, help them generalize knowledge, and help them retrieve resources about the language. Narratives also imply a learning progression, moving a learner through the design rationale for the language and its semantics.

Baker argued that when languages don’t have a consistent narrative, tools and pedagogy have to mask inconsistencies in order to simplify curriculum and facilitate effective learning. Because of this, IDEs are often an essential part of conveying and reinforcing a curriculum designer’s narrative about a language. The tools highlight, surface, and convey the narrative.

We took a brief field trip to Trier for walking and dinner, but the conversations continued!

We have no theory of language error-proneness

Many of the PL researchers in the group were less interested in learnability and teachability than the HCI and computing education researchers, and so many of my conversations concerned qualities that experienced users of PL encounter, such as error-proneness.

I participated in one group that set out to theorize about error-proneness. We came up with an explanation of what makes a language feature error-prone, centered around the idea of language features as abstractions.

Take division, for example, a simple abstraction of a division operation found in most PL. Abstractions hide certain aspects of execution for the benefit of simpler reasoning. Division hides things like remainders, integer truncation, and division by zero, which are details about division that we don’t usually reason about in real-number mathematics.

If in hiding that complexity, a language feature allows the developer to rarely have to reason about that internal complexity, error-proneness is low. For example, when dividing 10/2, one never has to reason beyond the mathematical idea of division. However, if in hiding that complexity, the language feature occasionally forces the developer to have to reason about that internal complexity, that reasoning will be even harder because the complexity is hidden, and therefore error-proneness is high. For example, 10/3 suddenly requires the developer to reason about integer versus floating-point division. Dividing 10/0 suddenly requires developers to reason about runtime errors from a divide by zero error.
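The division cases above can be tried directly. This is a minimal sketch in Python 3, chosen only for illustration; other languages hide different details (in C or Java, for instance, dividing two integers with / truncates). The helper safe_divide and the examples dictionary are hypothetical names.

```python
def safe_divide(numerator, denominator):
    """Return the quotient, or None when the denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # The hidden case surfaces as a runtime error the caller must anticipate.
        return None


examples = {
    "10 / 2": 10 / 2,              # 5.0: Python's / always produces a float
    "10 / 3": 10 / 3,              # a floating-point approximation of 3.333...
    "10 // 3": 10 // 3,            # 3: floor division discards the remainder
    "-7 // 2": -7 // 2,            # -4: floor division rounds toward negative infinity
    "10 / 0": safe_divide(10, 0),  # None: the error was caught by the helper
}
```

Only the first case matches the mathematical idea of division without further thought, and even there the result is 5.0 rather than 5. Each of the others forces the reader to recall a rule that the / and // symbols themselves do not show.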

These situations of “forced reasoning” about “hidden semantics” cause errors because developers need to be able to reason correctly about the behavior of an abstraction to avoid unintended behavior, otherwise known as defects. If they do not have access (e.g. via training or tools) to a correct model of that behavior, they will make mistakes. Because abstractions hide detail, they occasionally make obtaining a correct model hard.

There are many examples that can illustrate this theory. For example, many languages offer constraint abstractions, which make it easy to express declarative properties between variables. These are powerful and simple when they work, but when someone expresses a circular relationship, they force a developer to reason about the hidden semantics of constraint satisfaction algorithms to be able to debug, control, and work around the unintended side effects of cycles.
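To illustrate the constraint example, here is a toy one-way constraint evaluator in Python. It is an assumption-laden sketch, not a model of any particular constraint system: the names solve, CircularConstraintError, layout, and circular are all hypothetical, and real solvers use more sophisticated algorithms.

```python
class CircularConstraintError(Exception):
    """Raised when a constraint depends, directly or indirectly, on itself."""


def solve(constraints, name, visiting=()):
    """Evaluate one named constraint by first evaluating its inputs."""
    if name in visiting:
        # Report the chain of dependencies that led back to `name`.
        raise CircularConstraintError(" -> ".join(visiting + (name,)))
    inputs, formula = constraints[name]
    values = [solve(constraints, dep, visiting + (name,)) for dep in inputs]
    return formula(*values)


# Each entry maps a name to (names of inputs, formula over those inputs).
layout = {
    "width": ((), lambda: 200),
    "margin": ((), lambda: 10),
    "inner_width": (("width", "margin"), lambda w, m: w - 2 * m),
}

# These two declarations are mutually consistent as mathematics,
# but the evaluator cannot decide which one to compute first.
circular = {
    "left": (("right",), lambda r: r - 50),
    "right": (("left",), lambda l: l + 50),
}
```

Calling solve(layout, "inner_width") returns 180 without the developer thinking about evaluation order. Calling solve(circular, "left") raises CircularConstraintError, and understanding why requires knowing how the evaluator walks dependencies, which is exactly the hidden semantics the declarative notation was meant to spare the developer.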

Another example is memory-safe languages, which allow programmers to think in terms of abstract objects and fields instead of the linear memory on which those objects are imposed. This abstraction eliminates many non-local interactions (or alternatively, “safety rules”) that programmers have to consider in unsafe languages. The abstraction rarely breaks down in terms of correctness; the main cases where it does are interaction with code written in unsafe languages (e.g. the native code interface in Java). In these cases, memory management is even harder because one has to also reason about the abstractions imposed by memory safety in addition to memory management.

Ultimately, these examples suggest that error-proneness really emerges from hidden or complex semantics of an abstraction that a developer must nevertheless reason about correctly in order to use the abstraction correctly. We can use this theory as a thinking tool in a PL design process and as a way to generate hypotheses to test the theory. For example, one could vary, in fine-grained ways, what an abstraction hides and compare developers’ defect production.

I’ve never been to a Dagstuhl meeting so intellectually productive. How did we make so much progress in five days? I attribute a lot of it to the intellectual and experiential diversity of the attendees. Conversations with curriculum designers forced me to think about PL as a design to be deconstructed. Conversations with computing education researchers forced me to think about the models learners create about PL. Conversations with PL researchers forced me to think about accessible sources of evidence they could acquire and use in their design processes, such as versatile, tested theories of the developer experience of language features.

There are many places for this group to go. While I don’t believe we formed a new community, we did form new relationships, and we absolutely formed new ideas about designing PL. I think every individual who came will be transformed for the better, armed with broader, deeper, sounder ideas about what PL are, how we should design them, and how we can know they are good.



Amy J. Ko

Professor at the University of Washington Information School, curious about programming + learning + design + justice. Trans, queer, she/her, parent. Meow.