Tutorial: Write a Finite State Machine to parse a custom language in pure Python

I was once a huge fan of FSMs (Finite State Machines) as a mechanism to keep track of states. Automata theory is the basis of class of computational problems solvable by discrete math. I had used fysom in the past but this time I wanted something home grown. I was able to write complex language parse in a couple hours using only 200 lines of code.

Work, Home, and Bed state machine example

First what is a FSM: This work, home, and bed example shows how the transitions work. Each transition changes state. However not every transition is available from every state. For example, you can’t sleep at work. WAKE UP!

My Problem is simple: Write a language parser for the syntax I invented call POSH Syntax. You can read all about this syntax and its purpose in a future article yet to be written. Just know it has something with to do with writing rules for detecting parts of speech. Why didn’t I use the traditional methods of parsing a language using lex/yacc or even Beazly’s PLY? Well the reasons are simple:

  1. More straightforward
  2. Zero dependencies
  3. Ability to have cleaner control over program flow
  4. I wanted others to have the ability to read my code

Last time I tried to write a parser was probably 20 years ago and it started looking like a mess of reg-ex and obfuscated logic. This time I wanted it to be clean, I wanted it to be in Python, and I wanted it to be entirely under 200 lines of code. Long story short, this tutorial will outline the steps taken to get me there.


1. Analyze the structure

First a couple simple examples of the POSH Syntax one per line (3 examples):

VB(noise+3)
NNS(acoustics) & RB(not)
(NNS(acoustics) & RB(not)) | JJ(muted)

Let’s not dive too deep into the meaning of my syntax. Instead we are going to focus on the stream of characters and how they may be parsed. A couple quick observations:

  • There is a prefix like “VB” then a parenthesis set.
  • Inside the parenthesis is a subject
  • each of those prefix + ( + subject + ) we can call a Rule
  • then there may be an operator that strings rules together like ‘&’ or ‘|’
  • sometimes there are groups of things also encapsulated by a parenthesis.
  • if I look left to right, character per character, I observe what options from that spot to the next will modify state

2. Draw a state diagram

It’s easiest for me if I simply start with one state and I ask myself, “self, where can I go from here?” For example if start with the state PREFIX while looking at the very first character of the first example the ‘V’ (in “VB(noise+3)”) I see I can only go two places, 1) to another character, in this case the ‘B’ or 2) I can go to the first ‘(‘ parenthesis. If I graph this out it looks like this:

Here I am saying when I go to the next character I am either still in the prefix or I move to the subject state if I ever to see the “(“ character.

We need to do this for every single state. We just keep going character per character and ask ourselves what are the possible options and what are the results of those options. Then end diagram looks like this:

POSH Syntax State Diagram with transitions

Notice I have given each transition a name, each state as well.

A note on coverage: This is where TDD (test driven development) comes it. A later goal will be to try to cover (run a “coverage” report) every transition and every state.

3. List our transitions and states and convert them to code

For the transitions I will give each a short all caps name then a lower case longer name (so I can implement those in Python later):

T_SKIP = transition_skip
T_NEW_GROUP = transition_new_group
T_APPEND_CHAR_PRE = transition_append_pre
T_ADD_OP = transition_add_op
T_ADD_OP_NEW_RULE = transition_add_op_new_rule
T_END_GROUP = transition_end_group
T_END_RULE = transition_end_rule
T_APPEND_CHAR_SUBJ = transition_append_subj
T_ADD_GROUP_OP = transition_add_op_new_group

Now for the states, give each one a short name and then a longer full text name to help identify them.

S_NEW_GROUP = “STATE: NEW_GROUP”
S_END_GROUP = “STATE: END_GROUP”
S_PRE = “STATE: PREFIX”
S_OP = “STATE: OPERATOR”
S_END_RULE = “STATE: END_RULE”
S_SUBJ = “STATE: SUBJECT”

4. Create a transition table of state changes

Our transition table must contain one entry for each:

  • start state (src)
  • end state (dst)
  • rule for transition (condition)
  • callback for the transition (callback)

In Python it will look something like this:

# For transition 1
{‘src’: S_NEW_GROUP,
‘dst’: S_PRE,
‘condition’: “[A-Za-z|+|-|\d]”,
‘callback’: T_APPEND_CHAR_PRE}

Now we number all of our possible stage changes so we are sure we don’t miss one.

POSH Syntax State Diagram with transitions

The end table looks something like this:

5. Complete our program class design

We start with three class:

  1. A Rule class: It will hold our Prefix, Suffix, and operator (left hand side)
  2. A RuleGroup: This will have a parent RuleGroup — poor man’s pushdown automata
  3. And the Rule_Parse_FSM class that main goal is to iterate over the input string and handle the transitions while callling the callbacks.

Rule Class:

RuleGroup Class:

And finally the Rule_Parse_FSM class:

A quick overview of the parse logic:

  • Run take the input string character by character and sends it to process next.
  • process takes only the relevant transitions (those with matching states to current state) and sends the character to an evaluator
  • The evaluator (iterate_re_evaluators) takes the cmd logic (which happens to be a very short regex statement) I stress Very short and not confusing, I hope. If it matches it fires a state change
  • The state changer stores a new stage and calls the call backs

Now we need callbacks that do something different depending on the transition from state to to state. We already outlined them but we need to now write each. Here is what we end up with:

It’s very convenient to pass in the fsm_obj which is an instance of Rule_Parse_FSM class. Each transition does something different. Let’s take transition_add_op for example. That is when we hit a operator like the ’&’ or the ‘|’ we find the Rule it belongs to (which will eventually be the complete left hand side of the rule) and assign that operator to the instance.

6. Test our Program

The complete program may be located here.

Running this files gives us:

N -> STATE: NEW_GROUP : STATE: PREFIX
N -> STATE: PREFIX : STATE: PREFIX
( -> STATE: PREFIX : STATE: SUBJECT
h -> STATE: SUBJECT : STATE: SUBJECT
e -> STATE: SUBJECT : STATE: SUBJECT
l -> STATE: SUBJECT : STATE: SUBJECT
l -> STATE: SUBJECT : STATE: SUBJECT
o -> STATE: SUBJECT : STATE: SUBJECT
) -> STATE: SUBJECT : STATE: END_RULE
skip ' ' in STATE: END_RULE
& -> STATE: END_RULE : STATE: OPERATOR
skip ' ' in STATE: OPERATOR
N -> STATE: OPERATOR : STATE: PREFIX
N -> STATE: PREFIX : STATE: PREFIX
( -> STATE: PREFIX : STATE: SUBJECT
w -> STATE: SUBJECT : STATE: SUBJECT
o -> STATE: SUBJECT : STATE: SUBJECT
r -> STATE: SUBJECT : STATE: SUBJECT
l -> STATE: SUBJECT : STATE: SUBJECT
d -> STATE: SUBJECT : STATE: SUBJECT
) -> STATE: SUBJECT : STATE: END_RULE
<RuleGroup: {'rules': [<Rule:  NN(hello)>, <Rule: & NN(world)>], 'level': 0, 'rule_count': 2, 'parent': None, 'op': None}>

Now we have completely implemented our state machine in 200 lines of Python code. Yay! 🍺

Again the final code may be located HERE