Textadept Language Lexers

Impressions and tips on extending language support


It is beginning to look like I will be writing articles on Textadept for weeks or months to come. To keep things simple, here is part three of what I’m now calling “The Ever-Expanding Series on Textadept Guides, Tips, and Tricks.” Abbreviated as TEESOTGTAT… or, in humanly understandable form, TGTT. What can I say? I love the darn text editor.

A little back story, again. Ages ago, before my enlightenment on Textadept, I developed a massive fear of writing language lexers for text editors. It was never because I lacked the ability, but because the process was so extravagantly complex for some editors that starting or finishing a lexer became an annoyance. I would begin each attempt optimistically, yet after a few days (weeks), I would end up with a mess of regular expressions on the verge of becoming self-aware, and my ‘attempt’ at a new language lexer still would not be done. To give some credit, many of my tries did get close to what I envisioned, but they never reached a level I felt comfortable publishing. As a programmer (commonly known as a person suffering from OCD), it was frustrating to leave a lexer unfinished, and I eventually stopped trying. Then came Textadept.

What makes Textadept so special for writing language lexers is that it uses parsing expression grammars (PEGs) for syntax parsing. If you have never used PEGs, the syntax isn’t hard to learn, but you don’t even need to know it to write lexers for Textadept. In fact, thanks to the extensive helpers Textadept provides, you can write a lexer without ever knowing much about PEGs. All you really have to do is look over the excellently written lexer section of Textadept’s API documentation to get started.

However, if you want to do it well, do yourself a favor and watch Roberto’s presentation on LPEG (Lua’s PEG implementation), or take a few moments to read through the LPEG reference manual. PEGs are a complete joy to use for programming language parsing, as opposed to regular expressions (regex). The layman’s reason is that PEGs are spectacularly powerful at parsing programming languages thanks to memoization; they are practically built for the task. Memoization allows PEGs to parse syntax quite precisely with little to no redundancy during interpretation.
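For a taste of what the manual covers, here is a minimal stand-alone LPEG sketch, assuming you have the `lpeg` library installed outside Textadept (the pattern names here are my own, not from any lexer):

```lua
local lpeg = require('lpeg')
local P, R = lpeg.P, lpeg.R

-- A pattern for a signed integer: an optional minus sign, then one or more digits.
local sign = P('-')^-1   -- `^-1` means "at most one occurrence"
local digit = R('09')    -- character range 0-9
local integer = sign * digit^1

-- lpeg.match returns the position after the match, or nil on failure.
print(lpeg.match(integer, '-42'))  -- 4
print(lpeg.match(integer, 'abc'))  -- nil
```

The same small-patterns-combined-with-operators style is exactly what Textadept lexers are built from.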

The fact that Textadept uses PEGs for language parsing makes writing an entire lexer a breeze. To give an example: my Rust lexer, which covers every single context and definition of the language, is 92 lines, including whitespace, comments, and the formatting I tend to use. Heck, the default ANSI C lexer is 73 lines.

I won’t try to explain how to actually make a lexer, since the lexer API documentation is so elegantly written and easy to follow. Instead, I’ll leave you with a couple of tips that I tend to follow when writing a lexer for Textadept.

The first tip is to look at an existing lexer for a language whose syntax is similar to yours. You can save a lot of time figuring out how to get, say, bracket folding right simply by referencing how it’s done in a similar language. As an added bonus, it also shows you the best practices for writing your own lexer.
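To illustrate, bracket folding in the lexer API the above snippets use is typically declared through a `_foldsymbols` table. The sketch below is adapted from patterns I’ve seen in existing lexers; treat the exact field names as something to verify against your Textadept version’s lexer reference:

```lua
-- Sketch of fold points for a C-like language (hypothetical lexer name).
local l = require('lexer')
local M = {_NAME = 'mylang'}

-- ... the lexer's rules and tokens would go here ...

-- Fold on curly braces and on runs of '//' line comments.
M._foldsymbols = {
  _patterns = {'[{}]', '//'},                        -- what to scan lines for
  [l.OPERATOR] = {['{'] = 1, ['}'] = -1},            -- '{' opens a fold, '}' closes it
  [l.COMMENT] = {['//'] = l.fold_line_comments('//')}
}

return M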

Another tip is a follow-up to the first. If the language lexer you are writing is really just an embedding of a couple of other languages, use Textadept’s embedding syntax. Don’t go and write an entire lexer for multiple languages, only to find out you could’ve embedded them in under 20 lines.
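In the same lexer API, embedding comes down to loading the child lexer and defining start and end rules for it. Here is a rough sketch for embedding CSS inside a hypothetical HTML-style lexer; again, the field and function names come from existing lexers, so double-check them against your version’s reference:

```lua
-- Sketch: embed the stock CSS lexer inside a host lexer (hypothetical names).
local l = require('lexer')
local P = require('lpeg').P
local token = l.token

local M = {_NAME = 'myhtml'}

-- ... the host lexer's own rules would go here ...

-- Load the child lexer, then tell Textadept where it starts and stops.
local css = l.load('css')
local css_start_rule = token('style_tag', P('<style>'))
local css_end_rule = token('style_tag', P('</style>'))
l.embed_lexer(M, css, css_start_rule, css_end_rule)

return M
```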

The same can be said for using an alias for a lexer. Save yourself the trouble and time and just use the alias. My Linux kernel module lexer is literally a one-line alias to `ansi_c`:

return require("lexer").load("ansi_c", "linux")

Next, when writing your lexer, try to break the parsing down into small, single-purpose expressions. The power of LPEG, and PEGs in general, is that you can split a parsing expression across however many lines you wish, then combine the expressions with the ordered-choice (`+`) or concatenation (`*`) operators. Writing the syntax parsing this way also helps you clearly see what’s going on.

-- Strings
local sq_str = P('L')^-1 * l.delimited_range("'")
local dq_str = P('L')^-1 * l.delimited_range('"')
local raw_str = '##' * (l.any - '##')^0 * P('##')^-1
local string = token(l.STRING, sq_str + dq_str + raw_str)

Above is a quick example of separating out the parsing expressions for all the ways strings are written in Rust. The first three lines are expressions for each type of string. In the last line, I collect all the expressions I defined into the pattern for strings.

Lastly, do not be afraid of writing a language lexer for Textadept. It really is easy, and it is one of the editor’s major strengths that very few others share.

So what are you waiting for? Open up Textadept, make a new lexer file, and follow the lexer reference guide to get started!