How to write your own Domain Specific Language in Python

Oleg Komarov
Mar 26, 2020 · 8 min read
Image for post
Image for post
Photo by Charl Folscher on Unsplash

In my quest to find investment insights I decided to automate the boring stuff and simplify backtesting of financial strategies.

I started creating a textual interface for financial backtesting with a few priorities in mind. I wanted it to be simple to use but adaptable to the complexity of financial strategies.

I decided to first define a formal grammar that will serve as a stepping stone towards a fully natural language interface. I ultimately want to be able to describe a trading strategy in words, allowing for missing elements, ambiguities and all the goodies of our spoken interactions, and obtain in return a well defined backtest.

Finally, the objective of this post is not that of discussing the benefits or fallacies of backtesting. But to give an overview on how to create your own Domain Specific Language using a Parsing Expression Grammar (PEG).

Why PEG

The choice fell on a PEG grammar and parser generator, for several reasons:

Also, since I wanted to focus on the language definition, I avoided hand-writing the parser. So, I used 竜 TatSu, which generates the parser from the EBNF grammar.

Why not regular expressions

Trading strategies have a finite set of defining components, so you might ask yourself why not use regular expressions to isolate them in a textual description?

While it sounds like an easy win, I decided against it and the reason is two-fold: first, I still needed to define which strategies I would support initially and how I would grow this set while controlling the complexity of the language.

Second, the premonition in “Why you can’t parse HTML with regex” by bobince strongly resonated with what I am trying to accomplish here 😅. Arguably, that’s the best answer on Stack Overflow and a foundational piece that any software engineer should at least read once!

In other words, I felt that regular expressions would have ended in a mess of edge cases stacked together in Jenga style, and waiting to crumble under the faintest change of a fundamental premise.

A simple grammar

The best way to describe what the formal grammar for backtesting can accomplish is through an example:

BUY TSLA IF SMA(5) > SMA(50)

That statement encapsulates the following cross-over strategy:

Buy Tesla if the 5-period Simple Moving Average of its prices crosses the 50-period moving average. Close the position, or go short, if the opposite condition is triggered.

The idea, is that a sharp change in the fast moving average indicates the beginning of a trend in the direction of the cross-over. So, if it crosses the slow moving average from below, then it should indicate an increasing trend, and a decreasing one in a downward cross-over.

And here is the final result in action:

Image for post
Image for post

An input box sends the text to the PEG parser which checks the syntax and provides meaningful feedback. Once the statement is validated, I create a backtest with plots and metrics. Network latency aside this happens in near real-time.

The grammar that defines statements like the initial example is listed below (a variation of the EBNF form):

@@grammar::BAAS
@@ignorecase::True
start = statement $ ;statement
=
buysell asset if_expr [qualifier]
;
buysell
=
'buy' | 'sell'
;
asset
=
/[\w\.-]+/
;
if_expr
=
'if' indicator binop indicator
;
indicator
=
/\w+/ '(' number ')'
;
number
=
/\d+/
;
binop
=
'<=' | '<' | '>=' | '>'
;
qualifier
=
'longonly' | 'shortonly'
;

To see it in action, save the above as base.ebnf (name and extension don't really matter) and have a python environment with TatSu installed (or pip install tatsu first) and execute from the command line:

tatsu --generate-parser base.ebnf --outfile base_parser.py

Then in python:

from base_parser import BAASParsertext = 'BUY TESLA IF SMA(5) > SMA(50)'
parser = BAASParser()
parser.parse(text)

Which produces the following output:

['buy', 'TESLA',
['if',
['SMA', '(', '5', ')'],
'>',
['SMA', '(', '50', ')']
]
]

The output is now organized but not truly usable without the knowledge of the statement grammar. I’ll show how we can improve on that in a later section that explains how to annotate the parsed elements.

Grammar in detail

Before annotating or refining the grammar, let’s go over its rules.

The first two lines are TatSu directives:

@@grammar::BAAS
@@ignorecase::True

They specify the name of the grammar — BAAS — which determines the name of the parser, i.e. BAASParser, and the case sensitivity of the parser, which I set to case-insensitive.

Then, the first rule, which needs to be called start is:

start = statement $ ;

and simply says that some text input is valid if it is a statement followed by the end-of-input $, or in other words by nothing else. I chose to follow TatSu’s examples and keep the start rule essential.

The statement rule is where I define the high-level structure of the whole language:

statement = buysell asset if_expr [qualifier] ;

that is, we buy or sell something if a certain condition is verified. The qualifier is an optional argument (enclosed by square brackets) to handle long-only strategies.

I spent most of my time deciding how a trading rule should read when spelled out in this language. I wanted it to be intuitive, comprehensive and extensible.

The buysell rule is:

buysell = 'buy' | 'sell' ;

that is the statement must begin with the literal buy or sell.

The asset is any supported ticker:

asset = /[\w\.-]+/ ;

which is a regular expression matching letters, number, underscores, dots or hyphens, e.g. BTC-USD.

While I stated that I want to avoid regular expressions, there is always a good measure and context for any tool. Specifically, I don’t want to handle parser context with positional anchors or look-ahead/behind assertions. Alternatively, I could rewrite the rule as:

asset = 'a' | 'b' | 'c' | ... ;

all the way to include the whole alphabet, numbers and some punctuation in the same fashion as the EBNF — Examples on Wikipedia show. However, this would make the grammar very noisy and I doubt it would actually be a performance gain in Python.

The if condition is the core of the trading strategy:

if_expr = 'if' indicator binop indicator ;

and it should start with an if and have a single comparison rule of two technical indicators. I kept it very basic on purpose to avoid introducing left-recursion.

The comparison is captured by binop which can be any of the literals:

binop = '<=' | '<' | '>=' | '>'

and it establishes an inequality rule.

The indicator rule in the conditional expression is defined as the name of the technical indicator and a numeric parameter enclosed in brackets:

indicator = /\w+/ '(' number ')' ;

And the number is:

number = /\d+/ ;

Both the indicator name and its parameter are regular expressions where the same caveats apply as for the asset rule.

Finally, the qualifier is simply:

qualifier = 'longonly' | 'shortonly' ;

In other words, the statement rule can end with the literal longonly ( shortonly) to denote a strategy that can only buy (sell) the asset.

Parsing errors

TatSu generates a parser that is capable of meaningful parsing errors. However, the error message depends on the organization of your rules.

For instance, when I parse the statement with the missing parameter:

BUY TSLA IF SMA( > SMA(50)

I get the following error:

tatsu.exceptions.FailedParse: (1:17) Expecting <number> :
BUY TSLA IF SMA( > SMA(50)
^

Notice how the arrowhead is exactly positioned under the missing parameter. That’s because the number is a separate rule. However, if I were to incorporate the number into the indicator in the following way:

indicator = /\w+/ '(' /\d+/ ')' ;

then the error would only recognize an incomplete indicator, instead of the missing parameter. The parser will fail at the indicator rule but won’t tell us exactly where:

tatsu.exceptions.FailedParse: (1:12) Expecting <indicator> :
BUY TSLA IF SMA( > SMA(50)
^

Notice how the arrowhead is pointing at the whitespace before the indicator name.

So, don’t try to be terse but break down rules whenever you need meaningful errors.

Annotating parsed elements

As I mentioned earlier, parsing the input text into a nested list has limited usability. Thankfully, TatSu lets us annotate rule elements and outputs a dictionary instead of a list.

So instead of getting the following Abstract Syntax Tree (AST):

['buy', 'TESLA',
['if',
['SMA', '(', '5', ')'],
'>',
['SMA', '(', '50', ')']]]

We can get an annotated AST:

{'asset': 'TSLA',
'condition':
{'left': {'name': 'SMA', 'parameter': '5'},
'op': '>',
'right': {'name': 'SMA', 'parameter': '50'}},
'directive': 'buy',
'qualifier': None}

A dictionary is not only self-explanatory, compare AST['asset'] vs AST[1], but in principle it does not require prior knowledge of the grammar. In fact, we could traverse the AST, e.g. with a walker utility class, and fetch the desired node by name.

You can annotate rule elements by prepending some_name:, so we can rewrite the relevant rules as follows:

statement
=
directive:buysell asset:asset condition:if_expr [qualifier:qualifier] $
;
if_expr
=
'if' left:indicator op:binop right:indicator
;
indicator
=
name:/\w+/ '(' parameter:number ')'
;

Notice the directive:, asset:, condition:, left:, right:, binop:, qualifier:, name: and parameter: annotations.

Open questions

In this post I try to give a practical example of how to create your own grammar using the PEG parser generator TatSu. I tried be explicit about some of the choices and issues I faced. However, some of those remain open and I would love to learn how they are usually dealt within the context of PEG grammars/generators or if alternatives exist.

How do you define custom error messages?

While TatSu already provides clear error messages, I prefer to tailor error messages to the end user. So, instead of saying:

tatsu.exceptions.FailedParse: (1:12) Expecting <indicator>

I opted for:

Expecting a technical indicator name followed by its parameter

To accomplish that, I capture TatSu errors and fetch a tailored error from a hardcoded dictionary.

How do you allow an element to take values from a set of literals?

For instance, I support a set of assets which I can widen at a later stage. I could write a very long exhaustive rule with alternatives hardcoded into the grammar, i.e. TSLA | AMZN | AAPL | ....

But, this approach feels noisy and a loss of abstraction. So, I approach it with a subsequent pass against a hardcoded set.

Conclusion

I give an overview with a real case example on how to build your own Domain Specific Language in Python. The approach is useful if you need a more structured solution but don’t have the time or need to go all in.

To strike a compromise between my needs and getting the job done, I focus on the grammar definition and let TatSu create a parser for me.

So, if you have some time to spare but not a huge amount of it and need to lay down the foundations for an extensible solution, this might be the right approach for you too!

The Startup

Medium's largest active publication, followed by +755K people. Follow to join our community.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store