Designing Malloy 1 — The Syntactic Shell

Michael Toy
May 16, 2022

sea shell on a tabletop
Photo by Lloyd Tabb

One of the things that has changed the most since Malloy’s very first day is the “syntactic shell”, a phrase I invented just now. The “syntactic shell” of a programming language is the rough surface of the language. Code written in two languages that share a “shell” looks kind of the same at a glance.

I’m going to walk through some stages of the evolution of the Malloy syntactic shell, because I am the kind of computer language nerd who thinks this makes a really interesting story.

Tohu wa-bohu

The Hebrew scriptures start with the phrase “In the beginning the world was tohu wa-bohu”, or “empty and without form”. I enjoy thinking of the primordial, pre-Malloy moment in this way. There was a data language, and it was about to be spoken as words and given form, but in the beginning it existed “tohu wa-bohu”.

The first syntax for Malloy was essentially JSON. Lloyd was attempting to unearth a foundational query gesture dictionary. Using JSON to write language-like structures, he could evaluate and categorize the gestures needed by gradually transforming conceptual fragments. This allowed him to design gestures, and then attempt to write and run code using a proposed dictionary, and eventually validate that a vocabulary would actually describe a large enough set of queries to be useful. All of this without actually committing to a language.

Here’s a query written in SQL

-- SQL
SELECT
  color,
  COUNT(*) AS thing_count
FROM things
GROUP BY color

… and again in something like the original JSON Malloy

// JSON Malloy
{
  type: "query",
  sourceRef: "things",
  operations: [
    {
      type: "reduce",
      fields: [
        "color",
        { name: "thing_count", type: "count" }
      ]
    }
  ]
}
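To make the relationship between the two forms concrete, here is a minimal Python sketch of how a structure like this could be turned into SQL. It is only an illustration: the to_sql function and its rules are my own assumptions and not the translator Lloyd actually wrote, and it handles nothing beyond a "reduce" with plain fields and a "count".

```python
# Hypothetical translator for the JSON-style query above (illustration only).
QUERY = {
    "type": "query",
    "sourceRef": "things",
    "operations": [
        {
            "type": "reduce",
            "fields": ["color", {"name": "thing_count", "type": "count"}],
        }
    ],
}


def to_sql(query):
    """Turn a tiny 'reduce' query description into a SQL string."""
    select_items = []
    group_by = []
    for op in query["operations"]:
        if op["type"] != "reduce":
            raise ValueError(f"unsupported operation: {op['type']}")
        for field in op["fields"]:
            if isinstance(field, str):
                # A plain field is both selected and grouped on.
                select_items.append(field)
                group_by.append(field)
            elif field["type"] == "count":
                # A count becomes an aggregate with the given name.
                select_items.append(f"COUNT(*) AS {field['name']}")
            else:
                raise ValueError(f"unsupported field type: {field['type']}")
    sql = f"SELECT {', '.join(select_items)} FROM {query['sourceRef']}"
    if group_by:
        sql += f" GROUP BY {', '.join(group_by)}"
    return sql


print(to_sql(QUERY))
```

The point of the sketch is that the "reduce" gesture carries the grouping implicitly: naming a plain field is enough to both select it and group by it.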

Once we were pretty sure we had the gestures, the quest began for a language which would match up well with this new vocabulary.

The Tao of SQL

Our first intuition was that our language would need to appeal in some way to the people who work with data, and essentially all of them were already experts in SQL.

Digression: Ideally someone who is just learning Malloy would be able to look at sentences in the language and guess what operations are happening. Our users are data experts. They are not people who need answers to questions but know nothing about databases. One day, possibly soon, those people will describe queries in their native tongue, and a very smart AI will write the query for them. Malloy is not trying to be that query language. We’d like to imagine that the very smart AI would generate Malloy code, because it will be much more readable than SQL for someone wanting to know “how did that answer get computed?”

Initially we thought Malloy might be an extension of SQL, where code might have a SELECT statement on one line and a Malloy query on the next. For our first versions of the language, we simply mapped the Malloy gestures onto nouns and verbs that fit in the SQL language-shell. Here’s some code from that early Malloy:

DEFINE aircraft IS (
  EXPLORE (
    IMPORT lookerdata:liquor.aircraft
    JSON_SCHEMA '../lib/faatest/aircraft.json'
  )
  PRIMARY KEY tail_num
  COUNT(*) AS aircraft_count
);

A lot of the decisions made in this syntax are simply based on asking how other similar SQL statements are designed. This looks different than the Malloy that exists as I write this article, and I think there are two moments in that evolution that I want to talk about.

AS vs. IS

That definition code fragment is important, because reading code written in this version of Malloy was maybe the first time I gave myself permission to assert my own opinions about the language instead of designing the language to meet the opinions of an imaginary person who loved SQL.

Reading code in this language, I was bothered by the variance in how things are given names. In SQL, definitions are written in the form definition_of_thing AS name_of_thing (like COUNT(*) AS aircraft_count in the example above). In the Malloy DEFINE statement, I used DEFINE name_of_thing IS definition_of_thing, without thinking too much, because it looked kind of right.

After using it for a while, it occurred to me that I had probably followed the Algol language-shell pattern for definition, which is name before value, which is reversed from the SQL pattern. That observation then led to the thought, maybe Malloy should not use two different name/definition orders to assign names to things.

Which should go first, the name or the definition? Do I change the DEFINE statement so it uses AS to match SQL? That mostly followed our very short initial design specification, which was vaguely “Should be easy for SQL users”.

I tried it out … it bothered me, looking at that change, and I tried to think about why it bothered me. Trying to put words to that feeling, I found as I peeled away layers of discomfort and examined them, that I had an opinion about something that I had never spoken out loud before. It felt like I had discovered a hidden design principle.

Readers of a piece of code can look at the same code in multiple ways. In addition to reading each line carefully, it is important to be able to scan code, ignoring some details in order to understand larger things about its structure. An English reader is going to tend to skim the left edge of a piece of code, so if the information they need when scanning is at the left edge, the code will be easier to understand.

Definitions written “name IS some_complex_expression” scan very naturally: “this line defines a thing with a name and an expression”. There is no need to read the expression; you can skip to the next line unless you really want to know what the expression is.
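Here is a small Python sketch of what “skimming the left edge” buys you. It looks at only the first word of each line, which is roughly what a scanning reader does. The total_seats definition is a made-up example for illustration, and left_edge is a hypothetical helper, not part of Malloy.

```python
def left_edge(lines):
    """Return the first word of each non-empty line: all a skimming reader sees."""
    return [line.split()[0] for line in lines if line.strip()]


# The same two definitions, written definition-first (SQL style) ...
SQL_STYLE = [
    "COUNT(*) AS aircraft_count",
    "SUM(seats) AS total_seats",
]

# ... and name-first (the ordering Malloy settled on).
NAME_FIRST = [
    "aircraft_count IS COUNT(*)",
    "total_seats IS SUM(seats)",
]

print(left_edge(SQL_STYLE))   # the start of each expression
print(left_edge(NAME_FIRST))  # the name of each thing being defined
```

With the SQL ordering the left edge shows the beginnings of expressions; with the name-first ordering it shows a list of the things being defined.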

We decided to consistently use this ordering in Malloy. Every time you name something, the name comes before the definition. This produces something which slightly violates the “looks like SQL” goal, but is absolutely in line with the “is a language which is easier to use than SQL” goal.

This principle “Try to imagine someone wanting to learn something about your code while reading as little of it as possible”, which I didn’t even know existed, is now one of the things I think about every time we change the language.

This principle also plays a major role in the next change I want to talk about.

The Day My Beautiful Language Died

The other milestone change in the development of Malloy is the decision to make the Malloy shell look a little bit more like the JSON or LookML shells, with : being used to make some structure more evident.

This is a hard change for me to talk about. Maybe this change is why I am writing this article, so I can grieve, document, release, and move on. To explain, I need to rewind time to the version of Malloy that existed just before the shell changed.

Here is some actual Malloy code from that time …

define flights is (explore 'malloy-data.faa.flights'
  carriers is join on carrier
  flight_count is count()
  by_carrier is (reduce
    carriers.nickname
    flight_count
    destination_count is destination.count()
  )
)
-- run the "by_carrier" query
flights.by_carrier

We had a number of interesting models and queries running on real world data written in this version of Malloy. We were starting to feel like our experimental language was good enough to actually use. You could write short, readable, powerful sentences in Malloy, and those sentences did real work.

I particularly like this version of the language because it used very few special characters. From the SQL shell-heritage, it used parentheses to group things, and some structures would allow commas to separate items in lists, but mostly you wrote code with short, carefully selected words. It wasn’t wordy though. It felt calm to me, like reading poetry. I loved it, and I still really love it.

It was time to start showing this language to people who weren’t us. We had been trying to distance ourselves from SQL enough to think differently. Maybe we went too far, maybe we hadn’t gone far enough. We asked them to try it out, tell us what worked and what didn’t.

The feedback was amazing. Yes, these users were so excited about the way that Malloy queries were simple and powerful, and the composition and refinement approach to data programming in Malloy made people want to trade in their old tired query language for the shiny new one. Except where it didn’t.

The main complaint was really the same issue that caused us to make the leap from AS to IS. When these experienced data scientists looked at code written in Malloy with the intention of changing it, even code they themselves had written, it was too difficult to determine the structure of the code so they could focus their concentration on the one part they wanted to change.

There were two key problems. The first was that the type of a statement was hidden somewhere in the text of a definition. Look at these two lines …

carriers is join on carrier
flight_count is count()

The first statement is extending the graph of available data, with the keyword or “type” of statement (join) obscured in the middle of the sentence. The second is adding a new aggregate calculation to the current node in the graph, with the type actually invisible, because it is inferred from the definition. Someone searching the model, curious about the joins, for example, would have to carefully scrutinize every line to find all the joins. Users looking for computed measures would have to read every single line carefully.

Users are smart, and they were working around this by adding commented sections, one for joins and one for measures for example, and placing all the statements which created a certain type of object in one section.

This is one we should have figured out for ourselves. We kind of knew this, and in the models we wrote, we tended to use the “section” approach, but because we did that, we weren’t fully aware of the problem that we had worked around.

The other problem was a revelation about how a data scientist reads a query. If a data scientist is reading a query while asking “Wow Malloy is so cool, how is that interesting result computed?” one of the first things they want to know is “What are the grouping fields and computations on the grouping fields?”. In Malloy at that time, in a query like this one …

by_carrier is (reduce
  carriers.nickname
  flight_count
  destination_count is destination.count()
)

that information is only available if you read each line carefully.

So we reached back into our bag of tricks, remembering to put the important information people need to scan for to the left of the details they might not need (our shiny new design principle at work!). We started brainstorming, mixing in various new syntax ideas, and at some point we landed on something which looked like JSON and/or LookML:

explore: flights is table('malloy-data.faa.flights') + {
  join: carriers on carrier
  measure: flight_count is count()
  query: by_carrier is {
    group_by: carriers.nickname
    aggregate:
      flight_count
      destination_count is destination.count()
  }
}
query: flights -> by_carrier

Here we see that if the most important thing about a definition is its type, you can place the type at the far left. However, if you want to create a section where everything is of the same type and the most important thing is the name, you can also write your code that way (as in the aggregate: section).
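To show how much a reader (or a tool) can now learn from the left edge alone, here is a rough Python sketch that labels every statement in the model with its type by looking only at the start of each line. It is a toy line scanner written for this illustration, not how Malloy itself parses code (that is done with a real grammar), and it assumes the indentation shown in its sample text so that it can tell which lines belong to a section such as aggregate:.

```python
import re

MODEL = """\
explore: flights is table('malloy-data.faa.flights') + {
  join: carriers on carrier
  measure: flight_count is count()
  query: by_carrier is {
    group_by: carriers.nickname
    aggregate:
      flight_count
      destination_count is destination.count()
  }
}
"""

# A line that starts with "keyword:", optionally followed by a definition.
TAGGED = re.compile(r"^(\s*)(\w+):\s*(.*)$")


def scan_left_edge(source):
    """Return (statement_type, first_word) pairs using only the start of each line."""
    found = []
    section = None  # (keyword, indent) of an open "keyword:" section header
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "}":
            continue
        indent = len(line) - len(line.lstrip())
        match = TAGGED.match(line)
        if match:
            keyword, rest = match.group(2), match.group(3)
            if rest:
                # "keyword: name ..." puts the type first and the name second.
                found.append((keyword, rest.split()[0]))
                section = None
            else:
                # A bare "keyword:" opens a section of same-typed items.
                section = (keyword, indent)
        elif section and indent > section[1]:
            # Inside a section, the name is the first thing on the line.
            found.append((section[0], stripped.split()[0]))
    return found


statements = scan_left_edge(MODEL)
joins = [name for kind, name in statements if kind == "join"]
print(joins)
```

Finding all the joins no longer requires reading any definition; filtering on the leading keyword is enough.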

This new syntax made everyone on the Malloy team happy, except me. I completely agree that the problems our users had with the old syntax were real problems, and that our new syntax solves them very well. However, I still fall asleep many nights trying to shuffle the syntax and grammar of Malloy around in my head, searching for a solution which solves those problems without adding extra keywords and : and + to the syntax, because this new shell feels less peaceful and less friendly to me.

This event helped me have more sympathy for all computer language designers. There are a number of language features in the world which I thought were rushed decisions, that someone should have taken a couple of days to think more carefully about. The reality is that there are tradeoffs and compromises and it is really hard to get things exactly right.

I don’t think we are done changing Malloy. There may be another seismic syntax shift before we reach Malloy 1.0. I don’t think it will be that I have finally figured out a better way to solve this problem. As much as I would like there to be another choice, I think the other team members are right, and the part of me who doesn’t like this is wrong.

Doesn’t mean I have to like it though.

I do kind of feel better having written this. I guess it is kind of a letter from one side of my brain to the other, or sitting on a therapy couch while the wise counselor listens while I explain the troubling emotions.

Thanks for listening.

Afterwords

More About Malloy

This article is part of a series of articles I needed to write in order to get my brain back. If you got here without reading the previous articles and you would like to read more of the brain dump about my experiences while designing Malloy, go read “Designing Malloy 0 — Introduction” for links to the other articles and more general information about Malloy.

ANTLR is Amazing

I just want to call out the amazing parser generator that we use in our Malloy implementation, ANTLR. I used to know something about compilers, decades ago. I don’t remember much of that, except that there are things that I could know. I didn’t have to go back and re-learn anything, because ANTLR and the lovely TypeScript runtime for ANTLR allowed me to imagine new syntaxes and ship them to users with an amazingly short turnaround time.

The Google Disclaimer

I work for Google. The opinions stated here are my own, not necessarily those of my employer. I have feelings about this required statement which exist in parallel with that statement. If you want to waste a few minutes of your life, they are in an article I wrote called “The Google Disclaimer”.
