# Building a (not so simple) expression language part II: Scope

Apr 28, 2015 · 9 min read

(This is part II of a two part series of posts, you can find part I here)

One of the most powerful parts of the Mixpanel query language is the operator, which allows you to select events or profiles based on the value of any element in a list. The operator is just a bit more magical than the other operators in our query language, both in its power and in its implementation.

We’ve already written about building the Mixpanel expression language — the language we built inside of the Mixpanel data store to allow you to query and select data for reports. The model we built in the last post can do a lot of work, but parsing and interpreting the query takes the language to another level, both metaphorically and syntactically.

Like the basic expression language post, we’ll be using Python and JSON to talk about procedures and data, but won’t assume you’re a serious Pythonista. It will also be worth taking another look at the simple expression language post, since this post elaborates that model.

# Refresher- Mixpanel people records and Query Language 1.0

Mixpanel people records can be thought of as JSON objects that look something like this:

The key thing about JSON Objects that we glossed over a bit when we were setting up our initial expression language was that JSON is a nested, recursive format — All keys in JSON are strings, but values can be lists and dictionaries in addition to simple scalar types. To query effectively, then, we need tools that can not only handle simple scalars, but also ask and answer questions about dictionaries and lists. In the last version, we developed an Index operator, that could handle dictionaries, but we didn’t really have much support for lists.

By way of example, the albums property of my example musician record has a list describing the albums associated with that musician. What if we want to select all musicians that have albums released before 1970?

In pure Python, that’s not hard. We can do something like this:

For the non-Python speakers in the audience, it’s worth paying attention to this expression

any(a["release_year"]) < 1970 for a in albums)

The bit between the parentheses is a Python generator. It binds the variable a to each value in albums, and produces a value. The details of the expression aren’t really all that important- if Javascript and underscore.js is more your cup of tea, you can read it as:

_.any(albums, function(item) { return item.release_year < 1970; })

If you prefer Ruby, you could spell it albums.any? {|item| item["release_year"] < 1970 }

Or in Smalltalk: albums anySatisfy: [:item | (item at: “release_year") < 1970]

Or Haskell: any (\item -> (item ! “release_year") < 1970) albums

Or Perl with List::Util: any { \$_{"release_year} < 1970 } @albums

A number of other languages spell any as some, but the sense remains the same. In all of the languages above, any as written above means “return True if the given function is true for any value in the list”. What all of these constructs in all of these example languages have in common is the notion of binding. All of them, through iterators or function calls or block invocations, assign a new name to each element of the given list in turn, and then operate on the named object.

In the Mixpanel query language, any looks like this:

any(properties["albums"], item["release_year"] < 1970)

The expression that forms the first clause of any ( properties["albums"] in our example) should be something that evaluates to a list. The expression on the right side of any is where the magic is — the special expression item is bound to every element in the list on the left side. If the expression on the right evaluates to true for any binding of item , then the whole any query also evaluates to true.

The idea that item is bound means we have to take another look at how we might implement our language. In particular, the meaning of item in the right hand side of the any expression only makes sense if it’s part of an any expression. The expression item[“release_year”] < 1970 by itself doesn’t mean anything, because item means “some element in the list on the right of the enclosing any expression”, and there is no enclosing any expression. item is the first part of the Mixpanel query language that is sensitive to lexical context – that is the value of item can only be understood by looking at the expressions around and outside of the expression itself. item expressions in different parts of a query can refer to different things. Put another way — we’ve introduced scope into our little language.

We can see the scoping in action with another query — suppose we want to find all musicians who have an album track over 15 minutes long?

any(properties[“albums”], any(item[“tracks”], item[“duration”] > 15))

In this example, the item in item[“tracks”] refers to an element of properties[“albums”] , but the item in item[“duration”] refers to an individual element of item[“tracks”].

# An any expression, version 1.0

We’ve seen how programming languages handle any, and the examples we saw suggest a straightforward implementation strategy: we treat item as an expression like any other when we build the query. When we encounter an any expression, we create a new context, just like a scope frame on a programming language stack, and pass it to our children. Then, when we evaluate expressions (including item) we just return the value of the context that we were passed. In this way, we can use the stack of our host programming language as a stack for our item expressions.

Of course, we’ll also have to add any and item to our grammar (see the last post for details about the rest):

Since we want any and item to be able to appear anywhere a sub-expression can appear, we’ll need to add an additional context argument to evaluate for all of our other expression types else, too, but for the most part expressions will just pass things along to their children without doing anything.

# An any expression, version 2.0

Our 1.0 implementation works, but it has a few big drawbacks.

First of all, if the query fails because item isn’t inside of an any, the query will fail during evaluation. A Mixpanel query might be run on hundreds of machines, over millions of records, and not every subexpression is evaluated for every record. If badly formed queries are only detected at some point during the run, we might only detect a failure after a lot of work (on a lot of machines) had already been wasted.

The context of an item expression comes from the structure of the overall query expression itself, not the circumstances of its execution. We can tell what the context of a given item is just by looking at the query, regardless of the data we apply it to. That means we can tell whether or not item expressions are valid at parse time. However, we can’t know that context until we’ve parsed not only the item expression itself, but also its surrounding expressions. In fact, we’ll always finish parsing a well formed item before we finish parsing the containing any query, since a good item is part of an any query! That means we’ll need a second pass.

Since we have the context during the second pass, we can also use a little indirection to get rid of the mostly-useless context argument to evaluate. For this we’ll use the Reference class we developed in the last post- a reference is just a holder for a value (which translates to a pointer in our C implementation of the language). We’ll pass in the context in the second pass of our parse, and then update its inner value during evaluation of the query.

In our model, I’m going to call this second pass resolve. First we parse, then, once parsing is complete, we resolve the parsed expression, passing the scope context down into its nodes. If there is an item that doesn’t have an appropriate environment, we detect that before we begin to run the query, otherwise we create and bind a reference we can use to look up the value of item once we’re running. In Python, the new code might look something like this:

The big difference is that resolve runs once per query, while evaluate runs once per potential subject of the query.

As long as we’re introducing references as holders for scope, there is another issue we can deal with as well. In the last post, we discussed that indexing dictionaries by name was relatively expensive — in the worst case, we have to scan every element in the table, and even in the common case dictionary lookups involve a bit of indirection and calculation as we generate hash codes, look up buckets that may not be stored in fast memory, and finally find the value in the hash bucket. While dictionary lookup costs are small enough to ignore in many contexts, when we’re running a query against millions of records we want to save every cycle we can. A simple way to save a lot of time in queries we observed was to cache hash table lookup results in the case that any other part of the query needs the value of the same key.

# Refresher — References and Reference Binding

In the last post, we built out a caching scheme for hash table lookups by introducing PropertyReference to our language- instead of parsing property[“name”] into

Index(Properties(), “name”)

We chose to parse it as

PropertyReference(name_reference)

We introduced the PropertyReference expression so we could cache the results of lookups. Since properties never changes during a query, we could do all the lookups in advance.

We also developed a tool for resolving references, ReferenceBindings, that would manage the binding step by keeping a table of named references and bind them all at once. Using ReferenceBindings, we could create the expression for a more complicated query like this:

It turns out that the most common useful operation on an item in an any query is index lookup, and it also turns out that expressions that look up the same key more than once are also pretty common (consider item[“status”] == “Completed” or item[“status”] == “Pending” ). Since item references are always in a loop, (and since we already have a caching technique using references) item indexing is a great spot for us to optimize if we can.

Our reference scheme for properties relied on us knowing the key before we ran the query- instead of waiting to do the lookup when we evaluated the properties expression, we found all the references we would need during the parse, and created references for them. Then we bound the references as part of a separate step before we executed the query. We actually limited the expressive power of the language, forcing all property lookups on literal keys, in order to accommodate this optimization into our grammar.

# References for items

The item case differs from the property case in two ways: first of all, while in the common case we use literal keys to look up values on item, we can’t say that we do this in all cases. Also, unlike properties, which never changes during evaluation, the whole usefulness of item is that it changes its value.

The good news is if we’re going to introduce resolve as a formal second pass, we can use the same, more general mechanism for both sorts of references and both cut some complexity from our implementation and return to using a more general grammar for properties .

In our final version, we’ll pass two sets of ReferenceBindings in during our resolve step, in addition to our item reference. One set will be the property bindings, and the other will be a fresh set of bindings associated with the item scope if that scope is present. Like always, we’ll resolve the property bindings once, before we evaluate the expression, but now we’ll resolve the item bindings once for each iteration of the loop.

In Index.resolve we’ll try to use those references only if we know the literal key, and otherwise we’ll fall back to our usual behavior.

In addition to working for both items and properties, this also allows us to remove the limitations on PropertyReference in our grammar! Here’s our cleaner grammar — we can go back to treating properties like its own expression in our parses!

This is as complicated as we need to get!

any went into production as part of our query language in late 2013, and the model of iteration is now part of our core query engine. It’s particularly useful when combined with our revenue tracking features - we model revenue as a list of objects associated with a profile, and the any expression allows us to query all of the profiles with particular purchase history characteristics easily. It’s also my very favorite bit of the Mixpanel code. It’s simple and straightforward (and still really fast), but in some ways it represents a whole new dimension of power for our users.

Originally published at https://engineering.mixpanel.com on April 28, 2015.

Written by

Written by