How to write your own Domain Specific Language in Python
In my quest to find investment insights I decided to automate the boring stuff and simplify backtesting of financial strategies.
I started creating a textual interface for financial backtesting with a few priorities in mind. I wanted it to be simple to use but adaptable to the complexity of financial strategies.
I decided to first define a formal grammar that will serve as a stepping stone towards a fully natural language interface. I ultimately want to be able to describe a trading strategy in words, allowing for missing elements, ambiguities and all the goodies of our spoken interactions, and obtain in return a well defined backtest.
Finally, the objective of this post is not that of discussing the benefits or fallacies of backtesting. But to give an overview on how to create your own Domain Specific Language using a Parsing Expression Grammar (PEG).
Why PEG
The choice fell on a PEG grammar and parser generator, for several reasons:
- friends and peers recommended it
- increasing popularity, see e.g. Guido Van Rossum’s series on PEG parsers
- tooling and their perceived ease of use, e.g. 竜 TatSu
- and finally, I liked the idea of using the Extended Backus-Naur Form (EBNF) meta-grammar to define my language, although this is not PEG specific
Also, since I wanted to focus on the language definition, I avoided hand-writing the parser. So, I used 竜 TatSu, which generates the parser from the EBNF grammar.
Why not regular expressions
Trading strategies have a finite set of defining components, so you might ask yourself why not use regular expressions to isolate them in a textual description?
While it sounds like an easy win, I decided against it and the reason is two-fold: first, I still needed to define which strategies I would support initially and how I would grow this set while controlling the complexity of the language.
Second, the premonition in “Why you can’t parse HTML with regex” by bobince strongly resonated with what I am trying to accomplish here 😅. Arguably, that’s the best answer on Stack Overflow and a foundational piece that any software engineer should at least read once!
In other words, I felt that regular expressions would have ended in a mess of edge cases stacked together in Jenga style, and waiting to crumble under the faintest change of a fundamental premise.
A simple grammar
The best way to describe what the formal grammar for backtesting can accomplish is through an example:
BUY TSLA IF SMA(5) > SMA(50)
That statement encapsulates the following cross-over strategy:
Buy Tesla if the 5-period Simple Moving Average of its prices crosses the 50-period moving average. Close the position, or go short, if the opposite condition is triggered.
The idea, is that a sharp change in the fast moving average indicates the beginning of a trend in the direction of the cross-over. So, if it crosses the slow moving average from below, then it should indicate an increasing trend, and a decreasing one in a downward cross-over.
And here is the final result in action:

An input box sends the text to the PEG parser which checks the syntax and provides meaningful feedback. Once the statement is validated, I create a backtest with plots and metrics. Network latency aside this happens in near real-time.
The grammar that defines statements like the initial example is listed below (a variation of the EBNF form):
@@grammar::BAAS
@@ignorecase::Truestart = statement $ ;statement
=
buysell asset if_expr [qualifier]
;buysell
=
'buy' | 'sell'
;asset
=
/[\w\.-]+/
;if_expr
=
'if' indicator binop indicator
;indicator
=
/\w+/ '(' number ')'
;number
=
/\d+/
;binop
=
'<=' | '<' | '>=' | '>'
;qualifier
=
'longonly' | 'shortonly'
;
To see it in action, save the above as base.ebnf
(name and extension don't really matter) and have a python environment with TatSu installed (or pip install tatsu
first) and execute from the command line:
tatsu --generate-parser base.ebnf --outfile base_parser.py
Then in python:
from base_parser import BAASParsertext = 'BUY TESLA IF SMA(5) > SMA(50)'
parser = BAASParser()
parser.parse(text)
Which produces the following output:
['buy', 'TESLA',
['if',
['SMA', '(', '5', ')'],
'>',
['SMA', '(', '50', ')']
]
]
The output is now organized but not truly usable without the knowledge of the statement grammar. I’ll show how we can improve on that in a later section that explains how to annotate the parsed elements.
Grammar in detail
Before annotating or refining the grammar, let’s go over its rules.
The first two lines are TatSu directives:
@@grammar::BAAS
@@ignorecase::True
They specify the name of the grammar — BAAS — which determines the name of the parser, i.e. BAASParser
, and the case sensitivity of the parser, which I set to case-insensitive.
Then, the first rule, which needs to be called start
is:
start = statement $ ;
and simply says that some text input is valid if it is a statement
followed by the end-of-input $
, or in other words by nothing else. I chose to follow TatSu’s examples and keep the start rule essential.
The statement
rule is where I define the high-level structure of the whole language:
statement = buysell asset if_expr [qualifier] ;
that is, we buy or sell something if a certain condition is verified. The qualifier
is an optional argument (enclosed by square brackets) to handle long-only strategies.
I spent most of my time deciding how a trading rule should read when spelled out in this language. I wanted it to be intuitive, comprehensive and extensible.
The buysell
rule is:
buysell = 'buy' | 'sell' ;
that is the statement must begin with the literal buy
or sell
.
The asset
is any supported ticker:
asset = /[\w\.-]+/ ;
which is a regular expression matching letters, number, underscores, dots or hyphens, e.g. BTC-USD.
While I stated that I want to avoid regular expressions, there is always a good measure and context for any tool. Specifically, I don’t want to handle parser context with positional anchors or look-ahead/behind assertions. Alternatively, I could rewrite the rule as:
asset = 'a' | 'b' | 'c' | ... ;
all the way to include the whole alphabet, numbers and some punctuation in the same fashion as the EBNF — Examples on Wikipedia show. However, this would make the grammar very noisy and I doubt it would actually be a performance gain in Python.
The if
condition is the core of the trading strategy:
if_expr = 'if' indicator binop indicator ;
and it should start with an if
and have a single comparison rule of two technical indicators. I kept it very basic on purpose to avoid introducing left-recursion.
The comparison is captured by binop
which can be any of the literals:
binop = '<=' | '<' | '>=' | '>'
and it establishes an inequality rule.
The indicator
rule in the conditional expression is defined as the name of the technical indicator and a numeric parameter enclosed in brackets:
indicator = /\w+/ '(' number ')' ;
And the number
is:
number = /\d+/ ;
Both the indicator name and its parameter are regular expressions where the same caveats apply as for the asset
rule.
Finally, the qualifier
is simply:
qualifier = 'longonly' | 'shortonly' ;
In other words, the statement rule can end with the literal longonly
( shortonly
) to denote a strategy that can only buy (sell) the asset.
Parsing errors
TatSu generates a parser that is capable of meaningful parsing errors. However, the error message depends on the organization of your rules.
For instance, when I parse the statement with the missing parameter:
BUY TSLA IF SMA( > SMA(50)
I get the following error:
tatsu.exceptions.FailedParse: (1:17) Expecting <number> :
BUY TSLA IF SMA( > SMA(50)
^
Notice how the arrowhead is exactly positioned under the missing parameter. That’s because the number is a separate rule. However, if I were to incorporate the number
into the indicator
in the following way:
indicator = /\w+/ '(' /\d+/ ')' ;
then the error would only recognize an incomplete indicator, instead of the missing parameter. The parser will fail at the indicator rule but won’t tell us exactly where:
tatsu.exceptions.FailedParse: (1:12) Expecting <indicator> :
BUY TSLA IF SMA( > SMA(50)
^
Notice how the arrowhead is pointing at the whitespace before the indicator name.
So, don’t try to be terse but break down rules whenever you need meaningful errors.
Annotating parsed elements
As I mentioned earlier, parsing the input text into a nested list has limited usability. Thankfully, TatSu lets us annotate rule elements and outputs a dictionary instead of a list.
So instead of getting the following Abstract Syntax Tree (AST):
['buy', 'TESLA',
['if',
['SMA', '(', '5', ')'],
'>',
['SMA', '(', '50', ')']]]
We can get an annotated AST:
{'asset': 'TSLA',
'condition':
{'left': {'name': 'SMA', 'parameter': '5'},
'op': '>',
'right': {'name': 'SMA', 'parameter': '50'}},
'directive': 'buy',
'qualifier': None}
A dictionary is not only self-explanatory, compare AST['asset']
vs AST[1]
, but in principle it does not require prior knowledge of the grammar. In fact, we could traverse the AST, e.g. with a walker utility class, and fetch the desired node by name.
You can annotate rule elements by prepending some_name:
, so we can rewrite the relevant rules as follows:
statement
=
directive:buysell asset:asset condition:if_expr [qualifier:qualifier] $
;if_expr
=
'if' left:indicator op:binop right:indicator
;indicator
=
name:/\w+/ '(' parameter:number ')'
;
Notice the directive:
, asset:
, condition:
, left:
, right:
, binop:
, qualifier:
, name:
and parameter:
annotations.
Open questions
In this post I try to give a practical example of how to create your own grammar using the PEG parser generator TatSu. I tried be explicit about some of the choices and issues I faced. However, some of those remain open and I would love to learn how they are usually dealt within the context of PEG grammars/generators or if alternatives exist.
How do you define custom error messages?
While TatSu already provides clear error messages, I prefer to tailor error messages to the end user. So, instead of saying:
tatsu.exceptions.FailedParse: (1:12) Expecting <indicator>
I opted for:
Expecting a technical indicator name followed by its parameter
To accomplish that, I capture TatSu errors and fetch a tailored error from a hardcoded dictionary.
How do you allow an element to take values from a set of literals?
For instance, I support a set of assets which I can widen at a later stage. I could write a very long exhaustive rule with alternatives hardcoded into the grammar, i.e. TSLA | AMZN | AAPL | ...
.
But, this approach feels noisy and a loss of abstraction. So, I approach it with a subsequent pass against a hardcoded set.
Conclusion
I give an overview with a real case example on how to build your own Domain Specific Language in Python. The approach is useful if you need a more structured solution but don’t have the time or need to go all in.
To strike a compromise between my needs and getting the job done, I focus on the grammar definition and let TatSu create a parser for me.
So, if you have some time to spare but not a huge amount of it and need to lay down the foundations for an extensible solution, this might be the right approach for you too!