Designing Malloy 2 — Filters

Michael Toy
8 min readJul 19, 2022

--

Photo by Lloyd Tabb

When deriving information from data, perhaps the most important tool is selecting which data is relevant to the computation you are describing. In Malloy we call this filtering, and we are trying to make filtering sit in the language in a way which is natural and powerful.

This article is a little different because I am not entirely ready to defend our choices here. I think we are moving towards something better, I am not certain we have arrived yet.

Also, it turns out to be a little hard to think of the perfect image for an article about filtering. Lloyd came through again on this one, after I complained that there was absolutely nothing that worked.

Filtering in Malloy

Filters are really just boolean expressions. For example, here’s some Malloy to display the name of all things that are red, in an imaginary data set of things which have a color property. The boolean expression color = 'red' is a filter.

query: things -> {
project: name
where: color = 'red'
}
// SQL for this would be SELECT name WHERE color = 'red' FROM things

Filters everywhere

The first interesting thing about filters in Malloy is where they can be used. Malloy can filter a query, a measure inside a query, or the source for a query. Here are those gestures written for our imaginary data set, each of which computes the “red_count”, or number of things that are red.

// QUERY: This is the most similar to an SQL queryquery: things -> {
where: color = 'red'
aggregate: red_count is count()
}
// MEASURE: Compute the red count by filtering the actual count.
// This is useful when other parts of the query might
// need data from non-red things.
query: things -> {
aggregate: red_count is count() { where: color = 'red' }
}
// SOURCE: When it is useful to refer to a set of filtered data
// by a name, you can simply add a filter to a source
source: red_things is things + { where: color = 'red' }
query: red_things -> {
aggregate: red_count is count()
}

The method of filtering the dataset before counting is probably the most common gesture, but the ability to define filtered measures naturally in the language creates some lovely code for things like this measure which would help you determine if some values of scent are more likely than others to be found in red things.

query: things -> {
declare: red_count is count() { where: color = 'red' }
group_by: scent
aggregate: red_pct is (red_count / count()) * 100.0
}

Multiple boolean expressions with the same value

We also have an interesting extension to the syntax of boolean expressions in Malloy. Consider a more complex filter than color = ‘red’ from the previous example. For example, we might be interested in which things are colored with one of the primary colors. Most languages (including Malloy) allow you to express that as a boolean expression.

color = 'red' or color = 'blue' or color = 'green'

However this expression starts in the mind of the query writer as “is the color one of these three values”, and that intention is hidden by the time that expression is written down. You need to look at the expression for a while to see that all three comparisons are against the same field to get to the point where you understand what the query writer was trying to say.

In Malloy, the expression language is extended, so that the above expression can be written without repeating the reference to color, and accessed as a boolean with this.

dimension: isPrimary is color ? 'red'|'blue'|'green'

SQL has a similar gesture which would work for this example …

color IN('red','blue','green') as isPrimary

… but in Malloy, you can use create partial comparisons from any binary boolean comparison operator, as in …

// range comparison, all events during the crime
where: eventTime ? >= crimeStart & < crimeEnd
// range with an exclusion
where: answer ? > 0 & < 100 & != 42

Still Haven’t Found, What I’m Looking For

So that’s filtering in Malloy, it’s ubiquitous and powerful and readable. Winner! Everyone have piece of cake at the party and then let’s go work on something else.

… except I’m writing this article, because it still feels to me like we haven’t understood filtering well enough to write a syntax for it where the presentation in text matches the operations in a natural and readable way.

We used to have filters expressed like an operator on a source, using [] to contain the filter expression.

things[color = 'red']

… which I liked a lot, it was concise and readable, I would have kept this forever except we really needed to save the []syntax for actual array access. We then moved to a two character sigil,

things :[ color = 'red' ]

… but we decided that was a bit too cryptic, the Perl language uses these quite freely and it makes a language which is very hard to read for someone just starting out. It feels like we should avoid them as much as possible …

So we ended up with filtering as a set of properties which a source or a query might contain, which has all the power, but perhaps not quite the beauty. Which then lead to a quest …

Looking For … Extent

An important piece of information needed to understand a filter is the domain of the filter, and if I could wave my hand and wake up to the perfect design, the natural bracketing or containment in the language would also clearly mark where the filtering happens. Something structured like this … (but not this exact syntax) …

source: red_things is FILTER(color = 'red', things)

... because then you could use the parenthesis matching of your IDE to see what the extent of the filter was, similarly, using the same ugly but sample syntax

  measure: red_count is FILTER(color = 'red', count())

Except after a lot of writing things on whiteboards, trying out crazy ideas, and staring and different constructs, I have yet to come up with a syntax which has the property of “parens, or braces or something wrap everything affected by the filter”, which I also think looks “right” in Malloy.

I am not happy with this, it feels like the filter event is more intrinsic to data and it should somehow be more of an operator than a property, and I continue to hope I will eventually find a lovely syntax for that.

WHERE and HAVING

Another interesting design consideration is the distinction in SQL between filters applied with WHERE and filters applied with HAVING.

In SQL, a WHERE clause filters rows of the dataset constructed for the query, it references data that exists before the query is run, and the HAVING clause filters the results of the query and references data defined in the query. In an earlier version of Malloy, the author of a query was not required to make that distinction explicit, because Malloy could easily compute it from the filter expression.

It would be possible for Malloy to only have one filter keyword, say when: and you could write something like this below, and Malloy would be able to move each filter clause to the appropriate part of the query.

query: things -> {
group_by: scent
aggregate:
red_pct is count() {when: color='red'} * 100.0 / count()
when: size = 'large', red_pct > 50
}
// roughly equivalent toSELECT
scent
SUM(CASE WHEN color='red' THEN 1 ELSE NULL END) as red_pct
FROM THINGS
WHERE size = 'large'
GROUP_BY 1
HAVING red_pct > 50

After much discussion, the general consensus among us was that the clarity gained by forcing people to write

where: size = 'large'
having: red_pct > 50

was helpful for people coming from SQL, and the loss of brevity was much better than having to explain in an error message the difference between the legal filter, which can be split into a WHERE and a HAVING

when: size = 'large', red_pct > 50

and the naive but reasonable attempt at a query written by someone new, which would require this educational error message

when: size = 'large' and red_pct > 50

I was on the losing side of this argument, but I also feel like I clearly lost to the better answer. I grieve though, because in the Tao one of the principles is to have a minimal language surface, and I really wanted there to be only one, not two filtering properties.

This ties in to the previous section about the extent based syntax. Maybe if filtering was an operator, you’d use the same operator for WHERE style and HAVING style filter expressions, and the position of the operator in a query would made it clear whether the operator was an input or an output filter.

Extended Boolean Expressions

name = ‘Bob' | ‘Robert' is a simple expression using extended boolean syntax, and is pleasing to read. However as these extended boolean expressions get more complicated, it isn’t entirely clear that the partial comparisons continue to be readable. Maybe a benefit which only makes short expressions more readable isn’t really a benefit, and there is a lot of complexity in the compiler introduced to support this feature.

I don’t regret implementing this, because it is a really interesting and attractive idea. I am keeping an eye out for how users adopt this feature, and I have a fear that it will cause more problems than it solves.

The Future of Filtering

There is also a coming language feature, which there is still considerable discussion about, where a filter can be defined as an object, and passed around and applied. So if I was only interesting in red things, I could make a red filter, and pass it to a query which expects a filter, but the query would know how to best to apply the filter to all the sources it consults.

Afterwords

More About Malloy

This article is part of a series of articles I needed to write in order to get my brain back. If you got here without reading the previous articles and you would like to read more of the brain dump about my experiences while designing Malloy, go read “Designing Malloy 0 — Introduction” for links to the other articles and more general information about Malloy.

The Google Disclaimer

I work for Google. The opinions stated here are my own, not necessarily those of my employer. I have feelings about this required statement which exist in parallel with that statement. If you want to waste a few minutes of your life, they are in an article I wrote called “The Google Disclaimer”.

--

--