How to Build a RegEx Engine in Python

Part 4: The AST

Lorenzo Felletti
Python in Plain English

--

Photo by Jr Korpa on Unsplash

In this article, I will present you the process I followed to end up with the final AST of the solution, and which are its strength and flaws (because yes, following my articles you won’t create a perfect solution, sorry).

So, let’s start with the logic blocks I wanted to have to build my AST.
I felt I needed:

  • a Group block (to group various elements)
  • an Element block, or more in general a Leaf block (used for elements, wildcards, match start/end, range elements, …)
  • an Or block, to express the alternative
  • a Regex block, to express the “single point of entry” (kinda useless, it will turn out)

So, each regex will have a root node called RE, the fancy name I chose for the top-level block acting as a single point of entry.
This node is actually quite useless, but also harmless, and in the next article, I will briefly discuss why it stayed in the final solution.

Having these blocks clear in mind, I created this hierarchy of Python classes:

Hierarchy of AST nodes.

The base class is the AstNode, and the RE, Group, and Or nodes directly inherit from it.

All the other nodes will be leaves of our AST, meaning they will not have children nodes, so I created the intermediate class LeafNode inheriting from AstNode from which they all inherit, and that serves as a base class to group them all.

Oops, I made a mistake, and then another (maybe)

I decided on the fly that it would be good for the RE node to have only one child, as it cannot have more than one, so I called this property child.
Following the same idea, I called the GroupNode children property children, and the OrNode children left and right.

This was making total sense in my mind, until I had to code the engine, in which as we will see we want to iterate through the children of these nodes.

You can guess the error now: having all these different names for (basically) the same property is not a good design choice.
You cannot write node.children and move on, you have to check the type of the node, and then access the children using the right field name.

This would make the code more error-prone, uselessly longer, and unnecessarily leads to WET code.

We need to DRY this code, but how. We need to call all these properties with one name: children.

But, the parser is already using the child, children, left, and right properties, and it would be painful to rewrite the code using the newborn property.
This is when I had the “eureka!”:

Just add the children property to every class and handle the mapping under the hood.

A (mostly harmless) flaw

The problem is that, as it is implemented, if someone changes, for example, the right child using the node.right property after the node is created the change is not propagated to the children property, and vice-versa.

This is an error, actually, but I was too lazy to figure out a different solution, and after all, changing a node after its creation doesn’t make much sense, so we can supersede this error.

Conclusion

To conclude, the AST hierarchy we created is good enough to don’t end up being an obstacle to our project, even if it has a couple flaws, not very relevant in our context.

Anyway, it is crucial to notice these flaws, and trying to avoid making the mistakes leading to them next time would be good for sure.

--

--