Derivatives of Symbolic Automata Explained

A tutorial on taking the symbolic derivative of a regular expression and an introduction to some Haskell syntax

Walter Schulze
Better Programming


Megan Martha White from The White Stripes. The self-taught minimalist drummer that disappeared from public life in 2011. Photo by Dena Flows.

Symbolic derivatives are used to detect leaked security keys in emails and to analyse AWS access policies to avoid the accidental exposure of private data, according to Microsoft Research.

They also fall out naturally when you try to apply derivatives of regular expressions in practice, since they result in a manageable state space even over a large alphabet such as Unicode.

I have also “reinvented” symbolic automata without even knowing it while creating a validation language using derivatives. They simply pop out naturally when you are trying to do anything practical with derivatives for regular expressions.

In this tutorial, I will explain how to take the derivative of a symbolic regular expression.

  1. First, we need to understand the difference between a standard automaton and a symbolic automaton.
  2. Next, I will explain how this affects the algorithm for derivatives of regular expressions.
  3. Finally, I will also show how we need to extend symbolic derivatives to allow for better simplification rules, smart constructors, and production use.
  4. As a bonus, I will also touch on how to explore the state space without having any knowledge of the alphabet.

This post is a follow-up to the posts that explain:

  1. Derivatives for regular expressions and
  2. Simplification Rules, Smart Constructors and lazily building a Deterministic Finite Automaton.

Symbolic Automata

Deterministic Finite Automata (DFA) consist of two functions:

  1. A transition function takes as parameters a source state and a character and returns a new destination state.
  2. The accepting function takes a state as the parameter and returns a boolean.

For example, say we want to accept words like “regular” that start with a consonant, where all consonants are separated by vowels and all vowels are separated by consonants, and that end in a consonant. Here is the shortened regular expression for our “regular” words:

((b|c|d|f...)(a|e|...))*(b|c|d|f...)
  • As another example, “vowel” is a word that will be accepted by the regex.
  • “consonant” and “automaton” are examples that will be rejected.

The transition and accepting functions for a DFA are very simple functions; they are simply tables:
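
As a rough sketch in Haskell (assuming the state names Vowel and Conso from the drawing below, plus a Sink state for rejected words), the two tables could look like this:

-- A sketch of the two tables; only a few rows of the full table are shown.
data State = Vowel | Conso | Sink deriving (Eq, Show)

transition :: State -> Char -> State
transition Vowel 'b' = Conso
transition Vowel 'c' = Conso
transition Vowel 'd' = Conso
transition Conso 'a' = Vowel
transition Conso 'e' = Vowel
-- ... one row for every (state, character) pair in the full table ...
transition _     _   = Sink

accepting :: State -> Bool
accepting Conso = True
accepting _     = False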

It is more common to represent the DFA with a drawing of a graph:

A Deterministic Finite Automaton, where “vowel” is the start state and the double circle “conso” is the accept state.

Notice how I have only drawn the transitions for the first 5 letters of the alphabet to avoid a chaotic picture.

The symbolic automaton extends the transition function and allows it to include predicates on the input character. A predicate is a function that takes a parameter and returns a boolean. This means we could write:

(is_consonant, is_vowel)*, is_consonant

Note: we add commas for concatenation to aid readability.

The transition table now becomes much simpler as it is allowed to include predicates.
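
Sketching the same idea in Haskell and reusing the State type from the sketch above, each state now needs only one predicate per outgoing transition:

import Data.Char (isAlpha)

isVowel, isConsonant :: Char -> Bool
isVowel c = c `elem` "aeiou"
isConsonant c = isAlpha c && not (isVowel c)

-- The symbolic transition table needs only one row per predicate.
symbolicTransition :: State -> Char -> State
symbolicTransition Vowel c | isConsonant c = Conso
symbolicTransition Conso c | isVowel c     = Vowel
symbolicTransition _     _                 = Sink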

And even the drawing becomes something that we can completely draw without taking shortcuts.

Symbolic Deterministic Finite Automaton, where “vowel” is the start state and the double circle “conso” is the accept state.

Another important property of symbolic automata is that the predicates don’t have to take a character as input; they can take any symbol, even cymbals.

Cymbals as an Example of Symbols

Drawing of Travis Barker by David Choe. Sole survivor of a plane crash. I use his drum samples in my songs.

Let’s use drum notation for cymbals as an example of symbols for our symbolic automaton. First, let me give a short introduction to drum notation. Each cymbal has its own notation:

Drum notation for cymbals: high hat, second high hat, high hat open, high hat half open, high hat foot, high hat foot splash, crash cymbal, second crash cymbal, ride cymbal and second ride cymbal.

Each line with a cross represents a different cymbal or a way to play a cymbal. Following the two bars on the left, in order from left to right we have:

  1. high hat
  2. second high hat
  3. the circle represents a high hat played open
  4. the circle with a stripe represents a high hat played half open
  5. high hat played with the foot pedal
  6. the circle below represents a high hat played as a foot splash, crashing the high hats forcibly and letting them ring
  7. crash cymbal
  8. another crash cymbal
  9. ride cymbal
  10. another ride cymbal

Additionally, each of these cymbals can be played with some modifiers.

Drum notation modifiers: rim shot, ghost note, stopped, right stick, left stick, bell, edge and choked.

These modifiers tell the drummer how the cymbal should be played, again from left to right we have:

  1. rim shot: striking the head and the rim simultaneously
  2. ghost note: a very light strike
  3. stopped: touch the cymbal after playing to stop it from vibrating
  4. right stick: optionally it can be recommended which stick to use to strike the cymbal
  5. left stick
  6. bell: hit the cymbal on the bell
  7. edge: hit the cymbal on the edge
  8. choked: grab the cymbal immediately after being struck

We can use a symbolic automaton to only match rhythms that:

  • start with a high hat using the left stick,
  • where each hit is with an alternating stick,
  • and the final hit is a crash cymbal
(
  !is_right,
  (
    !is_left & (is_high_hat | is_crash),
    !is_right
  )*,
  is_crash & !is_left
)

Note that we use the predicate “not is right”: since the indications for which stick to use are optional, we can play with the left stick whenever it isn’t indicated that we should play with the right stick.

cymbolic automaton. Credit: Pictures of cymbals - SVG repo

This automaton will match the following cymbal rhythm:

Cymbal sequence of hi-hats that ends in a crash. Credit: Pictures of notation — Guide To Drum & Percussion Notation, Audio Graffiti

As you can see, these symbolic predicates give us the flexibility to have regular expressions over any type of input list. This includes the ability to compose predicates into larger predicates using boolean operators. Technically these nested functions could return any type, as long as the top function returns a boolean, but we will get more into that later.

Derivatives of Symbolic Expressions

Cymbals Eat Guitars, a band I was introduced to in the Netherlands. The keyboardist makes guitar pedals to make his keyboard sound like an electric guitar. The derivatives will now eat Cymbals.

We have previously explained derivatives of regular expressions. Now we are going to explain how to adjust this algorithm to handle predicates on symbols, instead of just characters.

Our basic operators still include:

  • empty set: ∅
  • empty string: ε
  • logical or: |
  • concat: , and
  • star: *

We replace the basic character operator with a symbolic operator, which is a predicate or function that takes a symbol as a parameter and returns a boolean:
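
As a sketch, assuming the constructor names from the previous posts, the data type could look like this, with the character operator replaced by a predicate on any symbol s:

data Regex s
  = EmptySet                -- ∅
  | EmptyStr                -- ε
  | Symbol (s -> Bool)      -- the symbolic operator: a predicate on a symbol
  | Or (Regex s) (Regex s)
  | Concat (Regex s) (Regex s)
  | ZeroOrMore (Regex s)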

The nullable function stays the same for all other operators and always returns false in the case of the symbolic operator, just as it did for the character operator.
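
A sketch of nullable over these constructors:

nullable :: Regex s -> Bool
nullable EmptySet       = False
nullable EmptyStr       = True
nullable (Symbol _)     = False  -- same as the character operator
nullable (Or a b)       = nullable a || nullable b
nullable (Concat a b)   = nullable a && nullable b
nullable (ZeroOrMore _) = True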

The derivative function also stays the same for all previous operators.

  • In the case that the predicate returns true, the symbolic operator returns the empty string and
  • in the case that the predicate returns false, the symbolic operator returns the empty set.

Again, similar to how the character operator worked.
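
Putting that together, a sketch of the derivative function could look like this; only the Symbol case differs from the character-based version:

deriv :: Regex s -> s -> Regex s
deriv EmptySet       _ = EmptySet
deriv EmptyStr       _ = EmptySet
deriv (Symbol p)     c = if p c then EmptyStr else EmptySet
deriv (Or a b)       c = Or (deriv a c) (deriv b c)
deriv (Concat a b)   c
  | nullable a = Or (Concat (deriv a c) b) (deriv b c)
  | otherwise  = Concat (deriv a c) b
deriv (ZeroOrMore a) c = Concat (deriv a c) (ZeroOrMore a)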

That is it: this is how simple it is to abstract the character operator to a symbolic predicate in the derivative algorithm for regular expressions. Just to prove I am not lying, here is the diff from the standard to the symbolic implementation in Haskell:

Diff between standard derivative for regular expressions algorithm and the symbolic implementation in Haskell.

For those unfamiliar with Haskell: (s -> Bool) represents a function from any symbol, represented by s, to a boolean.

Comparable Predicates

This algorithm works, but we are unable to perform our optimisations. For example, simplification requires that we are able to compare regular expressions:

simplification rules that require comparing regular expressions

We need to be able not only to compare whether two regular expressions are equal, but also whether one is “smaller” than the other, so that we have a consistent order to

  1. decrease the state space, which increases the chances of us hitting the memoization cache, but also to
  2. increase the chances of us finding duplicates and eliminating them using our simplification rules.

The problem is that our symbolic predicates aren’t comparable. Fixing this requires us to create a new predicate data type that includes a descriptor string, which is comparable and fully describes the predicate. Sketched in Haskell (with _desc, _func and is_char as assumed names), it could look something like this:
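
data Pred s = Pred
  { _desc :: String
  , _func :: s -> Bool
  }

instance Eq (Pred s) where
  p1 == p2 = _desc p1 == _desc p2

instance Ord (Pred s) where
  p1 `compare` p2 = _desc p1 `compare` _desc p2

-- For example, a predicate that checks equality with a single character;
-- the descriptor concatenates "eq " with the one-character string [c].
is_char :: Char -> Pred Char
is_char c = Pred ("eq " ++ [c]) (\s -> s == c)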

For those unfamiliar with Haskell: The instance Eq and Ord statements simply tell Haskell that Pred is comparable and which functions to call when the predicate is compared with the equal function == or the compare function respectively. Notice how the equal function is represented by an infix function ==. An infix function is a function with two parameters that is placed between its parameters. We can turn any two-parameter function into an infix function by placing back-ticks around the function name, see compare. Another infix function is ++, which concats the strings for the descriptor, where [c] represents the one-character string.

We can then swap out the predicate in our regular expression for the comparable one.
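
As a sketch, the Symbol case now holds the comparable Pred instead of a bare function, and the whole regular expression can derive its comparison functions:

data Regex s
  = EmptySet
  | EmptyStr
  | Symbol (Pred s)
  | Or (Regex s) (Regex s)
  | Concat (Regex s) (Regex s)
  | ZeroOrMore (Regex s)
  deriving (Eq, Ord)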

This works really well for basic expressions, but soon you will want to build up larger expressions for your symbols.

For those unfamiliar with Haskell: we can use deriving (Eq, Ord) to generate the compare and equal functions and instances for the regular expression and it is smart enough to defer to our hand-written compare and equal functions for the predicate.

Extending Symbolic Expressions for Production Purposes

Symbolic automata require that the predicates:

  1. are purely functional (always return the same result given the same input),
  2. are decidable (the functions have to guarantee that they will finish running and return a boolean),
  3. are complete (do not throw an exception and are defined for every input value) and
  4. do not emulate an incomplete function by returning an error.

Other than these requirements, we can build an arbitrary expressive language inside the predicate. For example, we might want to push the regular expression operators & and | into the symbolic predicate as and and or functions:

(
  (!more_than(length(modifiers), 1) | is_crash)*,
  is_crash & !is_left
)

We might also want to introduce other functions, like more_than and length, to specify that we don’t want a cymbal to have more than one modifier in its list of modifiers.

Back in 2015, I “reinvented” symbolic expressions without knowing that they existed, while building a validation language for protocol buffers, JSON and XML called katydid. I will write a follow-up post explaining the algorithm, but for now, all you need to know is that it had the following requirements for these expressions:

  1. The user should be able to compose existing functions to build up their own expressions. We want to give the user some expressive power.
  2. Fast to compare. This requirement came up as a production issue. Comparison was the main bottleneck during simplification, which caused a big enough slowdown for users to notice.
  3. Fast to execute. Validation speed is the main feature of the validation language.
  4. The library of functions should be extendable by users. This means our functions will be represented by a data structure, that can be constructed by a user and is not part of a huge sum type that only the language maintainers can edit.
  5. Functions can return errors. For example, an atoi function is more robust if it can return an error. Note: in the case of the predicate at the top of the expression tree, the errors are converted to a false result to preserve the complete requirement of symbolic automata. This works very well in practice.

Descriptor

Rob Rolfe, drummer for Enter Shikari. I have seen this band play in South Africa, The Netherlands and The UK.

Allowing our expressions to be nested requires us to keep track of the Abstract Syntax Tree (AST) that was used to create our expression. This is needed so that we can do a proper comparison of predicates, which is required by the simplification rules of our regular expressions. We will call this a descriptor of the expression. I stole this name from protocol buffers, which store their AST in a descriptor.proto file. I think this name is more readable and approachable than an acronym. Each expression calls a function (which has a unique name) with some parameters, which can be other expressions with their own descriptors:

data Desc = Desc
  { _name :: String
  , _params :: [Desc]
  }

Another optimisation we want to make is to precompute any part of the expression that can be computed before validation starts, that is, without taking the input into account. For example, if the user writes:

(less_than(length(modifiers), minimum(3, 2)))*

we want to simplify that to:

(less_than(length(modifiers), 2))*

We will not do more advanced simplification, such as Minterms, since this would limit the type of functions we can allow the user to write. We think it is acceptable, though, to ask the user to provide a flag indicating whether a function reads from the input, aka whether it is a reader:

data Desc = Desc
  { _name :: String
  , _params :: [Desc]
  , _reader :: Bool
  }

Finally, we want our descriptors to be optimised for fast comparisons. This requirement was only discovered through a production issue. We eventually found that these comparisons were a huge bottleneck during simplification, and fixing them also fixed the wait time for getting the initial result from the filter, which was lazily building the automaton.

data Desc = Desc
  { _name :: String
  , _hash :: Int
  , _params :: [Desc]
  , _reader :: Bool
  }

Our comparison function tries to short-circuit as quickly as possible and return less than or greater than. First it compares the hashes; only if they are equal does it compare the names; only if those are equal does it compare the number of parameters; and only if that is equal does it recurse and compare the parameters.
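
A sketch of this comparison, assuming the Desc record above and the <> operator on Ordering:

instance Eq Desc where
  d1 == d2 = compare d1 d2 == EQ

instance Ord Desc where
  compare d1 d2 =
    compare (_hash d1) (_hash d2)
      <> compare (_name d1) (_name d2)
      <> compare (length (_params d1)) (length (_params d2))
      <> compare (_params d1) (_params d2)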

If you are unfamiliar with the <> operator, Monoids or if like me you have never thought of compare as a Monoid, I’ll explain this in the next subsection, which you can optionally skip over.

Compare Monoid (Optional)

Martijn van der Merwe, percussion for Manouche and Leeway. My university roommate and the only drummer I have ever been on stage with.

Monoids are just an interface with two methods or in Haskell a typeclass with two functions:

  1. mappend (inherited from Semigroups, which only has this one function)
  2. mempty

If your type satisfies this typeclass then you can use the monoid’s mappend operator <> on your type. Using this operator properly requires your functions to also satisfy the following mathematical properties:

  1. mappend is associative: (a <> b) <> c = a <> (b <> c)
  2. mempty is an identity element: (mempty <> a) = a

Here are some examples of monoids:

  1. multiplication: (1*2)*3 = 1*(2*3) , (1*7) = 7
  2. addition: (7+4)+3 = 7+(4+3) , (7+0) = 7
  3. concat: ("" ++ "ab") = "ab", ("abc" ++ "cde") ++ "fgh" = "abc" ++ ("cde" ++ "fgh")

Another monoid that might be unfamiliar is the compare monoid:

instance Monoid Ordering where
  mempty = EQ
  LT `mappend` _ = LT
  EQ `mappend` y = y
  GT `mappend` _ = GT

The _ represents a wild card and the function definitions of mappend are prioritised in order from top to bottom, until one is found that matches the input.

It is not obvious at first, but compare does satisfy all the properties:

  1. associativity: (EQ <> LT) <> GT = EQ <> (LT <> GT) = LT
  2. identity: EQ <> GT = GT = GT <> EQ

This is perfect for short circuiting comparisons, for example, if you want to sort a list of strings by length first and then in alphabetical order:

import Data.List (sortBy)
import Data.Ord (comparing)

mySort :: [String] -> [String]
mySort list = sortBy ((comparing length) <> compare) list

-- mySort ["abc", "a", "bc", "ab"] returns ["a", "ab", "bc", "abc"]
  • comparing length returns a function that takes two strings and returns a comparison of their lengths.
  • compare is a function that compares two strings

This means that (comparing length) <> compare is equivalent to:

\s1 s2 -> comparing length s1 s2 <> compare s1 s2

This short-circuiting of comparison is exactly what we use in our comparison function for our descriptor, to make sure that it first tries to efficiently compare hashes, before doing an inefficient deep comparison.

Executing Expressions Efficiently

Alex Rodriguez, drummer for Saosin. It sounds like all the instruments are in slow-mo compared to his drumming — picture by oneshforfree

We aren’t going to interpret the descriptor each time we want to check a predicate, which means our expression needs to include a function that we can call. This function needs to take a symbol s and return a result r or an error string.

type Result r = Either String r

data Expr s r = Expr
  { desc :: Desc
  , func :: s -> Result r
  }

Comparing two expressions simply falls to comparing the two descriptors of the expressions.

instance Eq (Expr s r) where
  (Expr desc1 _) == (Expr desc2 _) = desc1 == desc2

instance Ord (Expr s r) where
  (Expr desc1 _) `compare` (Expr desc2 _) = desc1 `compare` desc2

We need to cater for constructing various expressions:

  • constants,
  • functions that read the input and
  • functions that take parameters.

The Reader function

Cobus Potgieter performs drum covers on YouTube. A long time ago, he was teaching me about tools for hacking.

Constructing our functions that read from the input is simplest if we only allow one type of reading function. This function simply reads the input and returns it.

\s -> return s

Constructing this function as an expression, first requires constructing the descriptor, so we make a helper function:

mkReaderDesc :: String -> Desc
mkReaderDesc name = Desc
  { _name = name
  , _hash = hashWithName name []
  , _params = []
  , _reader = True
  }

Our read_char function reads a character from the input and simply returns the same character:

read_char :: Expr Char Char
read_char = Expr (mkReaderDesc "$c") (\s -> return s)

Other Functions

Atom Willard, drummer for Angels and Airwaves, Against Me and The Offspring. Against Me’s audiobook is one of my favourites.

Constructing any other function also requires a helper function for constructing the descriptor:

mkDesc :: String -> [Desc] -> Desc
mkDesc name ps = Desc
  { _name = name
  , _hash = hashWithName name ps
  , _params = ps
  , _reader = any _reader ps
  }

This function hashes the name and the parameters’ descriptors. It also ensures that if any parameter is a reader, then this function is also considered a reader, which ensures we don’t attempt to precompute it.

This makes it possible to create a constant char expression:

const_char :: Char -> Expr s Char
const_char c = Expr (mkDesc ("'" ++ [c] ++ "'") []) (\_ -> return c)

The const_char function ignores the input and simply returns the character c it was constructed with: \_ -> return c.

Now that we can read a character from the input and create a constant, we need to compose them with an equal function to recreate our original character comparisons that were central to regular expressions:

eq($c, 'a')

Constructing our equal function takes two expressions that each return a character and returns an expression that returns a boolean.

Creating this descriptor requires passing in the descriptors of the parameters. Next, we create the evaluating function, which evaluates the expression parameters, checks their errors and returns the result.
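
Here is a sketch of eq_char, using the Expr and mkDesc definitions above and checking each parameter's result explicitly:

eq_char :: Expr s Char -> Expr s Char -> Expr s Bool
eq_char e1 e2 =
  Expr (mkDesc "eq" [desc e1, desc e2]) $ \s ->
    case func e1 s of
      Left err -> Left err
      Right c1 ->
        case func e2 s of
          Left err -> Left err
          Right c2 -> Right (c1 == c2)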

For those unfamiliar with Haskell: Left represents an error and Right represents a result. Leave it to Haskell to always find a way to discriminate, in this case against left-handed people. Thanks Haskell! We can do this more concisely using Haskell’s do notation, which does the error checking for us. Think of <- as an alternative to the ? in Rust, or you can go down a deep rabbit hole and read up on monads.
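
With do notation, the same sketch becomes:

eq_char :: Expr s Char -> Expr s Char -> Expr s Bool
eq_char e1 e2 =
  Expr (mkDesc "eq" [desc e1, desc e2]) $ \s -> do
    c1 <- func e1 s
    c2 <- func e2 s
    return (c1 == c2)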

We can write this even more concisely using applicative style:
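
For example, using the applicative operators <$> and <*>:

eq_char :: Expr s Char -> Expr s Char -> Expr s Bool
eq_char e1 e2 =
  Expr (mkDesc "eq" [desc e1, desc e2]) $ \s ->
    (==) <$> func e1 s <*> func e2 s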

The last function that is interesting as an example is one that can return an error: ctoi converts a character that is a digit into an integer and otherwise returns an error:
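
A sketch, using isDigit and digitToInt from Data.Char:

import Data.Char (digitToInt, isDigit)

ctoi :: Expr s Char -> Expr s Int
ctoi e = Expr (mkDesc "ctoi" [desc e]) $ \s -> do
  c <- func e s
  if isDigit c
    then return (digitToInt c)
    else Left ("not a number: " ++ [c])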

We can now write expressions like the following:

ZeroOrMore (Symbol (less_than (ctoi read_char) (const_int 5)))

This will match a string of zero or more numbers that are all less than 5, for example: “4214”.

Complete Predicates

Adam Carson, drummer for AFI. The only band left on my bucket list to see live.

Our predicates have to be complete and only return a boolean, which means they can’t throw any exceptions and can’t emulate incomplete functions by returning errors. We don’t have any exceptions and our functions are all complete (we have to trust our users to stick to this requirement), but our functions do emulate incomplete functions by returning errors.

To fix this we have to evaluate our predicate at the top of the expression tree using a special evalPred function that simply ignores the error and returns false:

type Pred s = Expr s Bool

evalPred :: Pred s -> s -> Bool
evalPred e s =
  case eval e s of
    Left _  -> False
    Right r -> r

Production Predicates

Now our predicates are ready for production, since they satisfy all the requirements:

  1. The user can nest expressions, see eq_char .
  2. The expressions are fast to compare using the precomputed _hash field in the descriptor.
  3. The expressions are fast to execute, they aren’t interpreted.
  4. The library is extendible by users, see less_than below.
  5. Functions can return errors, see ctoi and the predicate at the top ignores the error and always returns a boolean, see evalPred.

Here is the full working algorithm:

Note that we have not included the simplification rules, smart constructors, extra operators or an exhaustive library of functions, simply for the sake of brevity, but it is very simple to add these now.

Memoization and Conditionals

I gotta have more cowbell

We haven’t explained how to lazily build symbolic automata using memoization. The motivating example for writing a paper about symbolic derivatives was about being able to translate the derivative algorithm into something that an SMT solver, like Z3, can check for satisfiability. An SMT solver is really good at checking whether a formula is satisfiable - if there exists any input that would result in the formula returning true. Applying derivatives inside an SMT solver requires it to be possible to explore the whole graph without exploring the full alphabet, since this would be too inefficient.

Caleb Stanford, Margus Veanes, and Nikolaj Bjørner solved this by providing the following recursive function to the SMT solver. The SMT solver can analyse this formula without knowing the value of the string s, and can check whether the concrete regular expression R is satisfiable (able to match at least one string):

(length(s) = 0 & nullable(R)) |
(length(s) > 0 & in(s[1..],deriv(R)(s[0])))

This method also gives us the ability to support infinite alphabets and still fully explore the graph, which was impossible before introducing symbolic predicates.

To enable the usage of this formula in an SMT solver, we want to be able to take the derivative of a regular expression without providing an input character. In other words, we want to pre-calculate all we can before we provide the character as a parameter, effectively currying the derivative function efficiently.

deriv :: Regex s -> (s -> Regex s)

If we want to pre-calculate all that is possible to calculate before reading the input, we need to introduce if expressions:

data IfExpr s
  = JustRegex (Regex s)
  | IfExpr
      { _pred :: Pred s
      , _then :: IfExpr s
      , _else :: IfExpr s
      }

These expressions are either just a regular expression as the base case or they represent an if expression that can be evaluated later when the input character is available. Now we can write our derivative function that returns an if expression:

  • The empty string and empty set will always return the empty set, no matter the input.
  • The symbol will have to defer to the if expression, to wait until the character is provided to decide if the derivative should be an empty string or an empty set.
  • The other operators require writing new composition functions for the if expressions to push the derivatives down into the _then and _else parameters, as shown in the sketch below.
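
Putting those cases together, a minimal sketch of the derivative could look like this (derivIf is an assumed name; orIfExpr and concatIfExpr are the composition functions mentioned above):

derivIf :: Regex s -> IfExpr s
derivIf EmptySet       = JustRegex EmptySet
derivIf EmptyStr       = JustRegex EmptySet
derivIf (Symbol p)     = IfExpr p (JustRegex EmptyStr) (JustRegex EmptySet)
derivIf (Or a b)       = orIfExpr (derivIf a) (derivIf b)
derivIf (Concat a b)
  | nullable a = orIfExpr (concatIfExpr (derivIf a) b) (derivIf b)
  | otherwise  = concatIfExpr (derivIf a) b
derivIf (ZeroOrMore a) = concatIfExpr (derivIf a) (ZeroOrMore a)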

For example, here is a sketch of what the concatIfExpr function could look like, assuming the IfExpr constructors above:
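
concatIfExpr :: IfExpr s -> Regex s -> IfExpr s
concatIfExpr (JustRegex r)  r2 = JustRegex (Concat r r2)
concatIfExpr (IfExpr p t e) r2 = IfExpr p (concatIfExpr t r2) (concatIfExpr e r2)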

The orIfExpr function is available in the full code example, but in short, it will create a nested if which considers all the possible combinations of results of the two possible predicates.

We can introduce simplification rules for our if expression combinators, for example, orIfExpr could remove duplicate conditions. If we introduce an and operator and andIfExpr combinator, we could remove combined conditions that will always be false, for example, eq('a') and !eq('a').

Note: The way that it was implemented in the SMT solver paper differs slightly from my implementation. They created Transition Regexes, which add boolean operators to the if expression data structure to simplify the orIfExpr function, but this problem is simply pushed around between deriving time and evaluation time; the algorithms are equivalent. If you want to compare, I have also made a shortened implementation of these Transition Regexes.

Using these if expressions, we can now fully explore the graph or lazily build a symbolic automaton by creating a map from regular expressions to if expressions, even if we have an infinite alphabet of symbols. ba-dum tss.

Erik Sandin, drummer for NOFX. I think it has been 30 years since he last had heroin. I can recommend the band’s audiobook and watching their visit to South Africa.

Next

Please comment on the post with a link to your implementation of derivatives for symbolic automata in a programming language of your choice. How many operators can you introduce? How many simplification rules? Can you introduce simplification rules for if expressions?

Next, we will look at derivatives where our symbols are trees, but I still have to write that post.

Until then, we will start proving the implementation of symbolic predicates described here in Lean4. You are welcome to join us.

Thank you

  • Brink van der Merwe for taking the time to explain to me that symbolic automata are not intimidating, they are exactly what I have been implementing all along.


Golang, Haskell, Functional Programming, Interactive Theorem Proving, Automata and Music Programming. https://awalterschulze.github.io/