How I built a simple compiler — Markdown to HTML
An inspiration
A colleague of mine, who happens to be my ̶b̶o̶s̶s̶ ̶ leader wrote a really cool compiler. It’s not a cool CRUD app or complicated animation.
IT’S A COMPILER. Used in the production environment. And it’s not backed up by a community or a big company.
I think that’s pretty cool…I would like to code cool stuff...So I decided I’m going to write a compiler!
So I did…because I could do following
- Learn more about the magic of compilers
- Learn something else than a new JavaScript framework
- Achieve one of my goals.
- Code something different.
- Understand the codebase more in-depth
How?
Foundation of my compiler is built on top of Irony. It’s a .NET Language Implementation Kit. Unfortunately, there’s not a lot of info or tutorials online. Even with the little amount of online documentation, I was able to define language grammar, so I can parse the input.
I’m using Irony’s NonTerminal and Terminal classes to define my markdown.
Grammar
Now it’s time to define Grammar rules. Grammar is defined using Backus-Naur form. Irony overloads C# operators and makes it really easy to define grammar.
It’s so easy to read. Let’s dive in.
Bold text has to start with a star * followed by a Text ending with a star *. Text can be StyledText or PlainText. But StyledText can Bold, Italics or U͟n͟d͟e͟r͟s͟c͟o͟r͟e͟. Do you see the little hack? Now Bold text can also be underscore and italics! Same thing with Header. Underscore header? No problem, since it fits our grammar rules.
If you notice the definition of NonTerminals and Terminals you will see the second parameter is a type of Ast node. This way I could map my parse tree to AST. It was really a pain to achieve this. I searched StackOverflow, second page of google results…no luck. My solution might not be the best in the world but it works. It’s a function which creates an instance of the type you provide it with.
It’s much easier to work with AST than parse tree :) That’s why you do it.
From Abstract Syntax Tree to HTML
Grammar is defined, AST looks good. Now I had to do a thing which I’ve been successfully avoiding for like 5 years. I had to learn the visitor pattern. It makes much more sense now, than in a university few years ago.
This is how Irony accepts Visitor when traversing the tree.
And this is how I wrote my HTML visitor. Before you visit a node write you tag, after that he will visit every child.
When you encounter PlainTextAst, output the text we parsed
And now just visit the root
Voilà!
From this
#"H1 Heading"
#_"H1 undescore heading"_
##"H2 Heading"
*"This is bold text"* "and this is regular text"
"Following three items should be in a list, this is plaintext"
- "Plaintext list item"
- *"Bold List item"*
- _*"Underscore bold list item"*_
- _"Underscore list item"_
#"This is heading again"
To this
<div>
<h1>H1 Heading</h1>
<h1>
<u>H1 undescore heading</u>
</h1>
<h2>H2 Heading</h2>
<b>This is bold text</b>
<p>and this is regular text</p>
<p>Following three items should be in a list, this is plaintext</p>
<ul>
<li>Plaintext list item</li>
<li>
<b>Bold List item</b>
</li>
<li>
<u>
<b>Underscore bold list item</b>
</u>
</li>
<li>
<u>Underscore list item</u>
</li>
</ul>
<h1>This is heading again</h1>
</div>
It works!
Conclusion
I wanted to write a compiler from my markdown to HTML. I picked up great available tools to achieve it and did it. You don’t have to start in assembly to write your own parser, compiler, language, operating system…
Try it. Do it. Solve it.