Fun times writing data query languages

(who knew spreadsheets could be such fun, my peeps)

7 min readAug 7, 2017

I thought I would write something that described my experience implementing an innovative data query language in Hypernumbers (a next generation spreadsheet) which I designed with some help from Phil Wadler who was on our advisory board.

This design greatly influenced my understanding of how to use CRDTs in a query language design at Basho — the query language being a sub-set of standard SQL.

The background

I won’t give you the whole history of Hypernumbers, only those bits relevant to the data query language.

Primarily we wanted to make the internet programmable by allowing people to put functions into URLs, functions that could consume other URLs.

You could PUT a function into a URL and the GET the URL and it would be the result of evaluating that function:

PUT the expression = 1 + 2
GET the result 3

This can be extended to functions:

PUT the expression =sum(1, 2)
GET the results 3

But critically the function could take arguments which could in turn be other URLs — voila the user-programmable internet.

Conventionally the URLs were organised in sets like:

/some/page/A1
/some/page/A2

Doing a PUT with =sum(A1:A2) into the URL http://example.com/some/page/A3 would be equivalent to =sum(http://example.com/some/page/A1, http://example.com/some/page/A2 or using a URL that represents an array of URLs http://example.com/some/page/A1:A2.

Permissions were conventionally applied at the collection URL (or in spreadsheet parlance the page level not the cell level) eg http://example.com/some/page/

In this world a URL then can be made to represent an arbritary n-dimensional hypercube of data and we needed a mechanism to query that hypercube — and ways to apply structure to it, to make it even queryable in any realistic sense. Which is where the data query language comes in.

Hypernumbers was not a spreadsheet like any you know — it had some unusual behaviour — which I won’t labour here. If you are interested here is a video of me building a call centre using the create.phone(...) function to create a softphone in the spreadsheet.

(If you are very interested you could read the manual but I don’t think you are maybe?)

As you were — back to the spreadsheets…

Spreadsheets are a come-from expression-based functional programming language.

A ‘cell’ contains an expression like so:

=3*A1

This expression will read the current value of the cell A1 and calculate its current value. Spreadsheets are build of chains of expressions that depend on each other — and when a particular cell changes all the cells that depend on that will receive a change notification (which comes from the changed cell) and will re-evaluate themselves.

VisiCalc invented the model and Lotus-1–2–3 stole the business from them by copying the feature set exactly and reading VisiCalc file natively. When Excel pulled the same trick on Lotus-1–2–3 they decided to make their product harder to copy. By introducing Visual Basic scripting they ensured that any Excel competitor would need to clone the internals of Excel in order to be compatible — and that would be very hard, so hard that OpenOffice still has not managed to replicated Excel 97 scripting to an appropriate degree.

Once we realised that scripting was a defensive measure we switched to a pure functional approach in Hypernumbers — everything was a function.

As a consequence we needed to design a data query language and approach that could span an arbitrary set of hierarchical spreadsheets with the constraints that is could be used in spreadsheet functions.

I will tell the tale in the traditional linear way but be aware that I am lying to you…

We essentially had to solve three problems before we could build the data query language:

using schemas to provide structured data
representing relationships between schemas
generating new records that represented rows of the schema

Problem 1 — enforcing schemas

Spreadsheets are untyped structures and the various ‘database’ functions are baroque as a consequence, to put it mildly.

You can describe a spreadsheet with an infinite schema (well in our case limited by the maximum length of URLs):

CREATE TABLE "blank pages" (
       A1, any,
       A2, any,
       ...
       ZZZZZ9999 any,
       ...
       PRIMARY KEY(URL)
);

The way we enforced schemas was by forcing pages to operate as forms and saving those as templates:

Three views of a spreadsheet page — enforced by permissions

In addition to restricting the user to be able to edit some cells only we also restricted the type of entry. There were three mechanisms for this.

For general editable fields you could only enter astring | integer | float | boolean effectively the normal entry was sharp quoted in Excel terms. The string =sum(1,2,3) has to be entered as '=sum(1,2,3) if it is not be taken as a function in a normal spreadsheet.

The second mechanism was to declare a cell a drop-down — the equivalent of an enumerated data type.

The third mechanism was to declare a cell a tick-box — a degenerate form of enumerated data type — where the allowable values are booleans.

When you save a page created like this as a template you can regard it as equivalent to:

CREATE TABLE "mytemplate" (
       "Name"       value,
       "Department" enum("Marking", "Finance", "IT"),
       "Permanent"  boolean,
       B7,          derived_field,
       B15,         constant,
       PRIMARY KEY (Url)
);

where value = string | integer | float | boolean

By adding functions and values to the spreadsheet you can create derived and constant columns. Note that I have referred to the fields by their labels on the form — programmatically they would need to be referred to by their cell addresses.

This is a slight simplification — you can read the documentation.

Problem 2 — representing relationships

Thirdly we needed to represent relationship. Users traditionally represent relationships in desktop spreadsheets via the file system:

In this example the company has a one-to-many relationship with customers and customers have a one-to-many relationship with invoices and payments.

So in normal spreadsheet terms the challenge was to write a query that would say “show me all the invoices for all the customers that are more than 30 days late and where the outstanding balance is over £300”.

In Hypernumbers we would model this inside the URL heirarchy. An invoice would have a URL which would look something like this:

http://spreadsheets.acme.com/company_0001/invoice_000001

You end up with a series of projected table joins:

http://spreadsheets.acme.com/company_0001/invoice_000001http://spreadsheets.acme.com/company_0001/invoice_000002http://spreadsheets.acme.com/company_0001/invoice_000003http://spreadsheets.acme.com/company_0001/invoice_000004…http://spreadsheets.acme.com/company_0099/invoice_000001http://spreadsheets.acme.com/company_0099/invoice_000002

We now need to find a way to connect the company pagescompany_0099 to the invoice pages company_0099/invoice_000000n in a reliable, safe and secure way.

Problem 3 — creating records

We now have a schema — but how to enforce it.

Having designed a page, for instance an invoice, we would save a copy of that as a template.

Because all things are functions we can create a function which returns not a value as a response, but a form element.

These two things, templates and buttons, can be combined to create a button which when you click it makes a new instance of a template, like so:

=create_button("My Button Title", "./[new_invoice, auto, incr, invoice_]/")

This function says:

create a button with ‘My Button Title’ on it which when pressed will create a page under the current page from the new_invoice template with an automatically incrementing page number and a prefix of ‘invoice_’

(The full list of options can be seen here with more detailed instructions here).

These three elements, based on user-existing work patterns with desktop spreadsheets, allow you to stamp out identically structured forms in an arbritrarily n-dimentional hypercube that can be represented as projected table joins with composite keys.

So how to query it?

Data Query Languages under extreme constraints

The paradigm we operated in was extremely constrained — the user could PUT an expression into a URL — and that expression could be a function that operated on URLs or collections of URLs.

We developed a pattern matching representation of sets of URLs that could be used by query languages, and that representation was in turn programmable using existing spreadsheet functions.

The format we can up with looks like this:

=zsum(path_selector) where the path_selector is a URL operator.

Just say I was trying to sum the invoices on a particular company in the example above I could use a path_selector like ./[TRUE]/b9.

That would say add all the b9 cells on all pages under the current one. The structure of a path_selector is /segment_expression/segment_expression/.../segement_expression/.

The function zsum(...) walks down the URL evaluating each segment_expression. If evaluates to TRUE that page (and the pages below it) are included. (If it evaluates to FALSE the page is excluded, if it evaluates to anything else an error is thrown).

The function when walks to the next set of sub-pages in the URL and evaluates the next segment_expression against them and so on, before ending at the leaf and collecting the cells (or ranges) from the pages what matched all the way down and apply the function to them.

The segment_expression is just a spreadsheet expression that evaluates to TRUE or FALSE and you can build elaborate queries like:

=sum(./electricity/2010/[datevalue(“1/”&segment()&”/2010”, “0”) < date(“1/4/10”)]/[true]/[true]/G21)
=sum(./electricity/2010/[A1=”YES”]/[B12 > 40]/[true]/G21)

You can read about these functions (including the query debugger, which is a function called debugz of course) in more detail here.

In conclusion

Our internal quality assessment was whether the biz guy (who was not a programmer) could build a small retail mortgage bank in Hypernumbers without any additional help (except for punch out functions to get things like credit scores from external bodies).

It turned out that he could do that with the following components:

a minimal schema with enumerated data types and a very primitive type system
projected joins represented by composite keys descending from a topmost ‘left hand key’
a very limited query functionality (here implemented in a super-unusual format, but hey! you gotta do what you gotta do…)

When we came to Riak Time Series at Basho these insights would come in very handy.