Searchlight: A domain-specific language for API design

Published in

CodeX

6 min readJun 3, 2023

Creating a customized query language with controllability features

There’s something satisfying about having your own domain-specific language — in my case, I decided to build one to make my REST API easier to use, friendlier to novices, and more powerful for advanced users. Let me take you through the design and implementation of Searchlight, a customizable query language for DotNet APIs.

When I built the AvaTax REST API, I wanted to offer a standard query language for my customers. In 2016, the most popular competitive query languages were GraphQL or OData.

But there was a problem: Avalara already had a legacy SOAP API and it had its own query language, which was very similar to SQL. Lots of existing customer code had been written using this SQL-like language, and we really didn’t want to require customers to rewrite all their queries.

Of course, SQL by itself is not a safe query language — so what could I do? The solution I came up with was Searchlight, a domain specific language that was similar to SQL but offered many features to make it more safe, performant, and convenient. Let’s walk through how it works.

Large domains are easier to navigate with a Searchlight (GetArchive.net)

Basic concepts of a domain-specific language

I wanted a language that could be compatible with legacy AvaTax query code, but could still be made friendly using my own extensions. It was also a requirement that we must be absolutely certain that no SQL injection attacks could occur, and ideally we would also be able to give clear errors if someone made a mistake while writing a query.

To make this happen, I needed to write an interpreted language that could parse text and convert a request into a safe query. Most interpreted languages have a few common data structures:

A tokenizer that analyzes the input text and converts it into a stream of individual elements.
A parser that consumes the token stream and interprets them as individual instructions or statements, often called “Nodes”.
An abstract syntax tree comprised of all the nodes, which represents a parsed program.

My tokenizer relied on a few simple rules and could be written in about a hundred lines of code. The tokenizer recognized three things: text, whitespace, and symbols. To include symbols or whitespace within text, you needed to use single quotes to surround it.

With this small tokenizer, I now had the ability to convert a statement like name > 'Alice' into three tokens: name, > , and Alice. I could detect many kinds of SQL injection trickery by enforcing firm rules around single quotes. The next step was to validate whether the tokens were correct.

Validating tokens and creating the abstract syntax tree

The next step in the language was to identify the meaning of the tokens. I chose to follow the conventions of SQL: I would assume that the first token would be the name of a field, then the second token would be an operation, and the third would be a value.

This worked well, even though there are some exceptions to this rule. For example, a statements could be surrounded by parenthesis, or a complex statement could contain multiple values like age between 37 and 49. The end result is that the second pass through the token stream could easily identify which tokens corresponded to which logic elements.

In this pass, anyone who tried to use an SQL injection attack like OR 1=1 would be rejected since I could assert that the name of each field must come from a list of valid elements in my data model. In fact, I went farther than that: I allowed the API designer to hand-pick which fields were safe to query and which were not. This functionality proved extremely useful later on when we discovered that some fields within the API were unsafe to query due to the performance characteristics of their indexes in the database.

After the abstract syntax tree step, I now had a collection of nodes, each of which represented a clause within a query. If I wanted to, I could examine the abstract syntax tree and apply logic rules. I could also maintain a separation between my database schema and my API schema: Searchlight could apply renaming functions to field names. This feature was the most important of all; many of my database field names were poorly chosen and confusing to the developer. By renaming my fields I could make the API sensible and friendly to novices.

Compiling the query into an executable form

The next step in the process is to convert the syntax tree into an executable format. In my case, I created multiple executors:

An executor for Microsoft SQL Server, which was the most important use case that I wanted to optimize.
An executor for LINQ that could be used to examine IEnumerable collections.
Finally, an executor for MongoDB which allowed me to work with MongoDB using a query language I liked for personal projects.

The option of using multiple executors was a powerful function. When using Microsoft SQL Server, I could now get multiple recordsets of data from the database in a single query. My query language allowed me to fetch subsidiary data, so that I could fetch companies and specify &include=contacts. This optimized the use of my database as well as increased performance for end users.

Another important performance optimization was the use of REDIS. Since Searchlight worked identically whether I was querying SQL Server or REDIS, I could fetch a list of static objects and present it to the user interchangeably.

Improving the query syntax

You may notice that the most popular query languages for REST APIs tend towards complexity. An OData query must be carefully encoded, and a GraphQL query usually comes in the form of a complex JSON object. I wanted my query language to be friendly to novices. What could I do?

The most important step to be friendly to novices was to avoid HTML and URI encoding wherever possible. To do this, I chose to support alphabetic alternatives to all symbols so that you could write a query that would not trigger the need to encode any symbols.

Another critical choice was that I wanted to eliminate SQL-specific and NoSQL-specific text searching functions. My query language prefers generic text search like startswith, endswith, and contains. These functions are simple enough to be implemented by each database engine natively, regardless of whether it uses regular expressions or like.

An example query from my Searchlight documentation page is here:

GET /customers/?query=name startswith A and CreatedDate gt '2019-01-01' 
  and (IsApproved eq false OR (approvalCode IS NULL 
  AND daysWaiting between 5 and 10))

As you can see, it’s possible to type complex queries in this language without using symbols that would require URI encoding. My documentation encourages users to choose the textual variants of expressions rather than the symbolic versions specifically to avoid the risk of encoding.

Of course I still support symbols since I know that old developers would continue to use the symbols they had learned in the past like &&, !=, and <>.

The end result of this language is that the language ends up appearing similar to Atlassian’s Jira Query Language, a happy coincidence. Anything that makes my language friendlier to novices is welcome!

Clarity of error messages

Another critical element of my query language is that exceptions must be clearly defined and must contain an explicit reason why the query was rejected. I defined a list of 18 exceptions with precise error messages so that an API using Searchlight can explain exactly what mistake a user made and how to correct it.

I encourage API developers to capture these exceptions and convert them into error messages customized for your developer audience. Because these are strongly typed errors, you can catch SearchlightException and use its information to customize your error output.

Support for date math

A fascinating feature supported by Searchlight was the need to support date math in queries. Rather than specifying date gt 2023-05-29, Searchlight supports queries with relative values such as date > yesterday or due_date > tomorrow + 90 .

This functionality is necessary to permit storage of queries. A user can store the text of a query in the database, and that query can be executed daily on their behalf. Each time the query is executed, the date math will be converted to relevant dates.

This functionality is especially useful for webhooks — you can specify that you want to be notified any time a task that is due within the next 30 days is modified, for example.

Available on NuGet

The Searchlight language is available on NuGet and supports all variants of DotNet from NetStandard 2.0 onwards. I welcome pull requests and bug reports.

Ted Spence heads engineering at ProjectManager.com and teaches at Bellevue College. If you’re interested in software engineering and business analysis, I’d love to hear from you on Mastodon or LinkedIn.