Iro is a system used to create syntax highlighters for modern IDEs.
Syntax highlighting is the process of stylising text within a document based on rules. We have all seen some form of highlighting, such as a text editor that underlines misspelled words or, as programmers, quoted text being coloured slightly differently.
Syntax highlighting is usually achieved by subdividing text into regions one or more times, and then associating a style with each region or part of a region. Until recently there were many competing specifications for representing these rules, but the Textmate grammar definition format has emerged as the most popular choice, with excellent support in Eclipse, Atom, VSCode, Sublime and more.
For example, let’s imagine that we wish to syntax highlight the following text:
The cat entered the room and said, “I am a cat.”
Let us assume we wish to create a rule whereby the word cat is coloured blue and everything inside quotes is coloured red.
All other text is the default style. By way of disambiguation, we wish for the quote rule to override the cat rule, so all text inside quotes is red whether it contains the word cat or not.
Textmate grammar definition files (.tmlanguage) define a set of rules for splitting text using regular expressions. Textmate operates on a per-line basis. That is, it only ever attempts to match text on the current line, then moves on to the next line, and so on.
So when thinking about syntax highlighting, there are four things to consider:
- Text on the current row.
- The current match cursor position in the row.
- The current context of the matcher, and its call stack.
- Rules in the current matcher context.
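As a rough illustration, the four items above could be held in a small Python structure. This is only a sketch: the names `MatcherState`, `context_stack` and `RULES_BY_CONTEXT`, and the context name "quoted", are invented here and are not part of Iro or Textmate.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MatcherState:
    """Hypothetical container for the state a line-based highlighter tracks."""
    line: str                                   # text on the current row
    cursor: int = 0                             # current match cursor position in the row
    context_stack: List[str] = field(default_factory=lambda: ["main"])  # call stack of contexts

    @property
    def current_context(self) -> str:
        # The active context is always the one on top of the stack.
        return self.context_stack[-1]

    def remaining(self) -> str:
        # Only text from the cursor to the end of the line can be matched.
        return self.line[self.cursor:]


# Rules are looked up by context name (illustrative regular expressions).
RULES_BY_CONTEXT: Dict[str, List[str]] = {
    "main": [r"(cat)", r"(\")"],
    "quoted": [r"(\")"],
}

state = MatcherState(line='The cat said, "hi"')
state.cursor = 4
print(state.remaining())
print(RULES_BY_CONTEXT[state.current_context])
```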
Iro (link) is a web-based tool that simplifies the act of creating syntax highlighting definition files in many different syntax highlighting formats — including the dominant Textmate format.
The value of Iro is that it provides an additional layer of abstraction, a debugger, and a visualiser. It can save a lot of time, because you can generate multiple syntax highlighting formats from a single definition, and the debugger is particularly convenient.
An Iro syntax highlighter looks like this (rules omitted for brevity at this point):
We must now fill this Iro grammar definition with rules and understand the matching algorithm. In this example, we require only two rule types.
- A pattern match rule — Attempts to match text from the current cursor position on the current line.
- An inline push rule — Attempts to match a regular expression from the current cursor position on the current line, and if matched, enters a new context that can only be exited from by matching another regular expression (may be on a subsequent line).
Iro requires a certain level of knowledge of regular expressions, and I cannot recommend learning them strongly enough. If you do not have a good understanding of regular expressions, you may still be able to follow along, but it will be harder.
Textmate based syntax highlighters, Iro, and many other syntax highlighting technologies use the following matching approach.
Start of Algorithm (simplified version)
At the current cursor position on the current line, the highlighting engine executes all regular expressions in the current context. The rule whose match starts closest to the cursor gets to consume the text; if several rules match at the same position, the one declared earliest in the list wins.
If after executing all regular expressions in the current context, no regular expression has matched the remainder of the line from the cursor position, then the remainder of the line is emitted with the default (non styled) style, and the cursor moves to the beginning of the next line (if there is one).
If, after executing all regular expressions in the current context, the winning match starts one or more characters after the cursor position, then the characters before the match are emitted with the default (non styled) style, and the rule associated with the matched regular expression then runs from the start of the match.
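The simplified algorithm above can be sketched in a few lines of Python. This is an approximation for illustration, not Iro's or Textmate's actual implementation: the function names `next_match` and `scan_line` are invented, there is only one context, and patterns are assumed never to match the empty string.

```python
import re
from typing import List, Optional, Tuple


def next_match(line: str, cursor: int, patterns: List[str]) -> Optional[Tuple[int, "re.Match"]]:
    """Return (rule_index, match) for the match that starts closest to the cursor.

    Ties are broken by rule order: the strict '<' keeps the earliest rule.
    """
    best = None
    for index, pattern in enumerate(patterns):
        m = re.compile(pattern).search(line, cursor)
        if m is not None and (best is None or m.start() < best[1].start()):
            best = (index, m)
    return best


def scan_line(line: str, patterns: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Split one line into (text, rule_index) segments; None means default style."""
    segments = []
    cursor = 0
    while cursor < len(line):
        found = next_match(line, cursor, patterns)
        if found is None:
            # Nothing matches: the remainder of the line is emitted unstyled.
            segments.append((line[cursor:], None))
            break
        index, m = found
        if m.end() == m.start():
            raise ValueError("patterns must not match the empty string")
        if m.start() > cursor:
            # Text before the match is emitted with the default style.
            segments.append((line[cursor:m.start()], None))
        segments.append((m.group(0), index))
        cursor = m.end()  # continue from the end of the consumed text
    return segments


print(scan_line("a cat and a catalog", [r"(cat)", r"(cat\w+)"]))
```

In the usage line, both rules match at the start of "catalog", and the first rule wins because it is declared earlier, leaving "alog" unstyled.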
The “pattern” rule-type
The simplest type of matching rule is a pattern matching rule. It simply matches a regular expression, and then assigns one or more styles to the entirety of the match, or sub-regions within the matched text. Upon completion of emitting the text, the cursor is moved to the end of the matched text, and we go back to the “Start of Algorithm” section, the match context unchanged, but the match cursor adjusted.
Here is an example of matching “cat” using a standard pattern rule. We will omit the “styles” section for brevity, but assume it is unchanged from the previous snippet.
The rule adds a pattern block. The pattern block starts with the “:” character and has two key attributes: the regex attribute, which describes a regular expression (and must use the “\=” operator), and the styles block, which lists the styles to be applied to the regular expression match (and must use the “=” operator).
“(cat)” without the quotes is a regular expression. The brackets are special characters in a regular expression and represent a matching group. Matching groups are required for all matches in Iro: every part of a match regular expression must be contained within brackets.
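One way to picture why every part of the expression must sit inside a group is to pair each group with a style. The sketch below assumes that styles are assigned to groups by position and that groups are not nested; the helper `style_groups` and the style name ".plural" are made up for illustration.

```python
import re
from typing import List, Tuple


def style_groups(match: "re.Match", styles: List[str]) -> List[Tuple[str, str]]:
    """Pair each capture group of a match with the style in the same position."""
    groups = match.groups()
    # If the groups do not add up to the whole match, some text sits outside
    # a group and would have no style to receive.
    if "".join(g or "" for g in groups) != match.group(0):
        raise ValueError("every part of the match must sit inside a group")
    if len(groups) != len(styles):
        raise ValueError("expected exactly one style per group")
    return [(text, style) for text, style in zip(groups, styles) if text]


m = re.search(r"(cat)(s)", "two cats")
print(style_groups(m, [".cat", ".plural"]))
```

With the expression written as `(cat)s` instead, the trailing "s" is outside any group and the helper raises an error.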
“(cat)” represents a “c” followed by “a” followed by “t”. If that pattern of text is available anywhere on the current line, then a match occurs. If there are no matching patterns that match earlier than the “c” then all text prior to the match will be emitted without a style, and the “c” “a” and “t” will be emitted with the “.cat” style. The match cursor will be moved forward to the character after the matched “t”. At the moment, there is only one match pattern and one match context, so if there isn’t “cat” available on the current line then the cursor will move to the next line (if available) and the contents of the current line will be emitted without style.
The snippet shown above will not actually work, as it references a non-existent style, “.cat”. The name “.cat” is arbitrary and the style could have any name, but naming styles after their use makes sense. If the style is only for the word “cat”, it makes sense to name it “.cat”.
Styles are groups of attributes that apply to a region of the text. Styles are non-overlapping: every character in a syntax highlighted piece of text is emitted either with a single style or with none. A renderer can use the colour contained in a style, or use an external style identifier to look up the colour in an externally defined theme.
In this example, we will create a style with explicit colours. In the boilerplate we set that the current theme would have a white background, so we can set up the style based on that assumption.
All colours contained within styles are meant for debug rendering purposes only. Textmate, Ace Editor and other libraries have their own method for assigning colours to text — with which Iro is completely compatible.
We add the following styles:
Notice we use the incorrect spelling of colour ;-).
Now we click the render preview button on Iro.
We can see the following render:
The render is correct in that the only rule is to find text that matches “cat” and to assign the “.cat” style to that text, that style specifying “blue” as its render colour. RGB colours in the format “#000” to “#fff” are also supported.
Inline Push Rules
An inline push rule is a rule that moves the parser into a new state, with the start and end of that state demarcated by regular expressions. We call this type of rule an “inline push” because we are pushing a new context onto the context stack.
A stack is a programming concept where items can be added to a list, but can only be removed in the reverse of the order in which they were added (last in, first out). Iro (and almost all syntax highlighters) uses a stack to remember which state to revert to when a pop rule matches.
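A Python list behaves as a stack when only its end is used, which makes the idea easy to try out. The context names "quoted" and "escape" below are illustrative only.

```python
context_stack = ["main"]            # the outermost context is always at the bottom

context_stack.append("quoted")      # push: an inline push rule matched
context_stack.append("escape")      # push: a nested rule matched (hypothetical)

first_popped = context_stack.pop()  # the most recently pushed context leaves first
second_popped = context_stack.pop() # then the one pushed before it
current = context_stack[-1]         # the matcher is back where it started

print(first_popped, second_popped, current)
```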
In Iro, an inline push requires a regular expression to be matched to enter the inner context, then the inner context can have any number of rules defined internally, but the first rule MUST BE A POP. The pop has priority for returning to the context in which the inline push is declared (in this case main).
For the quoted text rule, we use an inline push that is initiated with a double quote character (\”) and popped with another double quote character (\”). These two characters and all characters in between will be assigned the “.quoted” style. Iro also supports different styles for the start and end characters, and escaped characters within a quoted block. You can find information on advanced use in the user guide.
Here is the updated main context:
The “styles” block defines the styles to assign to the main “regex” regular expression (.quoted). The “default_style” is a catch-all for any text that is not matched by a rule in the inner context.
In this case, we define no rules other than the pop, so all non pop matches will also be assigned to the “.quoted” style. Finally, we have an inline pop rule, which looks for the quote character, and if matched will emit the character with the (.quoted) style AND will pop out of the inner context back to the context that contains the inline_push rule (in this case “main”).
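To see how the pattern rule, the inline push and the pop interact, here is a small Python simulation of the two-rule grammar described so far. It is a sketch under stated assumptions rather than Iro's engine: the function `highlight` and the context name "quoted" are invented, straight double quotes are used in the sample text, and an empty style string stands for the default style.

```python
import re
from typing import List, Tuple

CAT = re.compile(r"(cat)")
QUOTE = re.compile(r"(\")")


def highlight(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (text, style) segments for the given lines."""
    stack = ["main"]            # the context stack persists across lines
    out = []
    for line in lines:
        cursor = 0
        while cursor < len(line):
            in_quotes = stack[-1] == "quoted"
            default = ".quoted" if in_quotes else ""
            # Inside quotes only the pop rule exists, so "cat" is never matched there.
            rules = [(QUOTE, ".quoted")] if in_quotes else [(CAT, ".cat"), (QUOTE, ".quoted")]
            best = None
            for rx, style in rules:
                m = rx.search(line, cursor)
                if m and (best is None or m.start() < best[0].start()):
                    best = (m, style)
            if best is None:
                out.append((line[cursor:], default))
                break
            m, style = best
            if m.start() > cursor:
                out.append((line[cursor:m.start()], default))
            out.append((m.group(0), style))
            cursor = m.end()
            if m.re is QUOTE:
                if in_quotes:
                    stack.pop()             # closing quote: back to "main"
                else:
                    stack.append("quoted")  # opening quote: enter the inner context
    return out


for text, style in highlight(['The cat entered the room and said, "I am a cat."']):
    print(repr(text), style or "(default)")
```

Because the inner context contains no rule for "cat", the quoted text takes the default style of the inner context, which is how the quote rule ends up overriding the cat rule.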
With the addition of the inline push, here is our new preview render:
To emit a Textmate-compatible .tmlanguage file we must associate Textmate scopes with our internal style settings:
A Textmate scope describes text in terms of programming language concepts. Because scopes are not defined in terms of a fixed set of colours, editors that use Textmate definitions can offer any number of themes whilst using the same grammar for tokenising the text. Textmate scopes are documented in various places but they are also listed in the Iro user guide.
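The separation between scopes and colours can be sketched as a lookup from a dotted scope name to a theme entry, where the most specific matching prefix wins. This is a simplified model of how themes select scopes, and the themes, colours and the function `resolve_colour` are invented for the example; the scope names follow common Textmate conventions but are not taken from the grammar in this article.

```python
from typing import Dict


def resolve_colour(scope: str, theme: Dict[str, str], default: str = "#000") -> str:
    """Find the colour for a scope, trying the longest dotted prefix first."""
    parts = scope.split(".")
    for n in range(len(parts), 0, -1):
        prefix = ".".join(parts[:n])
        if prefix in theme:
            return theme[prefix]
    return default  # no theme entry applies to this scope


# Two hypothetical themes styling the same tokens differently.
light_theme = {"string": "red", "keyword": "blue"}
dark_theme = {"string.quoted": "orange"}

print(resolve_colour("string.quoted.double", light_theme))
print(resolve_colour("string.quoted.double", dark_theme))
```

The grammar only ever emits the scope name; each theme decides independently what that scope looks like.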
In Iro, the text colour options are provided for convenience of debugging, but after a grammar is defined, scopes should be assigned to each and every defined style.
Emitting Textmate Grammar
The web version of Iro will generate a Textmate compatible grammar file (alongside 5 other formats) each time a preview is rendered.
There is an indicator label in the bottom right of the screen to indicate if the current definition or preview render is fresh or stale.
To access the Textmate definition, go to the Textmate tab, and from there, you may copy and paste the definition text (in plist XML format):
Alternatively, Sublime 3, Atom and Ace definition formats are available, if you should so wish. All three use the same scope associations, set by “textmate_scope = …” for each style in the Iro styles block.
The Scope Report
Iro supports outputting to a variety of grammar definition file formats. There are (currently) essentially three families.
- Ace family
- Textmate family
- Pygments family
The Ace family inherits from the Textmate family, so defining a Textmate scope disambiguates an emitted Ace Editor grammar file. But defining just a Textmate scope (as we have) means that when emitting grammar definition files for Pygments or Rouge (both use the same scope family), Iro does not know which scope to assign to the internal Iro styles.
We do not wish to force authors to provide scopes for syntax highlighters they have no interest in using (which invites entering nonsense for its own sake), so we do the next best thing and provide a coverage report.
The coverage report gives a health check on coverage. Green is full scope coverage for all defined Iro styles and all emitted grammar files. Yellow is partial. Red is zero.
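The traffic-light idea can be expressed as a small function. How Iro computes the report internally is not described here, so this is only a guess at the logic: the function `coverage_status`, the dictionary layout, the scope values and the family name "other" are all assumptions made for the example.

```python
from typing import Dict


def coverage_status(styles: Dict[str, Dict[str, str]], family: str) -> str:
    """Classify how many styles define a scope for the given family."""
    covered = sum(1 for scopes in styles.values() if family in scopes)
    if styles and covered == len(styles):
        return "green"   # every style has a scope for this family
    if covered == 0:
        return "red"     # no style has a scope for this family
    return "yellow"      # some styles have a scope, some do not


# Hypothetical styles: each maps a family name to the scope chosen for it.
styles = {
    ".cat": {"textmate": "keyword"},
    ".quoted": {"textmate": "string", "pygments": "String"},
}

for family in ("textmate", "pygments", "other"):
    print(family, coverage_status(styles, family))
```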
The scope report is accessed via the “…” option on the right hand side of the web editor. It is only valid if you see the “Fresh” label at the bottom right.
The report correctly indicates that the Textmate family has full coverage (all green below the Textmate heading), but that the Pygments family is ambiguous. A family is reported as ambiguous when two or more Iro styles share the same output scope for that syntax highlighting family.
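The ambiguity check itself amounts to grouping styles by their output scope and looking for groups with more than one member. The sketch below assumes that a style without a scope for a family falls back to a shared default, which would explain why two styles with only Textmate scopes collide for Pygments; that fallback, the name "Text", and the function `ambiguous_scopes` are assumptions rather than documented Iro behaviour.

```python
from collections import defaultdict
from typing import Dict, List


def ambiguous_scopes(styles: Dict[str, Dict[str, str]], family: str,
                     fallback: str = "Text") -> Dict[str, List[str]]:
    """Return each output scope that is shared by two or more styles."""
    by_scope = defaultdict(list)
    for name, scopes in styles.items():
        # Assumed behaviour: a missing scope collapses to one shared fallback.
        by_scope[scopes.get(family, fallback)].append(name)
    return {scope: names for scope, names in by_scope.items() if len(names) > 1}


# Hypothetical styles with Textmate scopes only, as in the walkthrough so far.
styles = {
    ".cat": {"textmate": "keyword"},
    ".quoted": {"textmate": "string"},
}

print(ambiguous_scopes(styles, "textmate"))
print(ambiguous_scopes(styles, "pygments"))
```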
The solution to this ambiguity is to define Pygments styles. Common Pygments styles are documented in the user guide. In the below snippet, we choose two unique Pygments scopes arbitrarily. Experimentation may be required to assign the correct scope to the correct style. Iro may be used to define styles at a higher level of granularity than the underlying scopes (two or more Iro styles to a single underlying scope). In these cases, the suppress_scope_conflict_warning=”true” option can be provided to explicitly flag that the collision is acceptable.
Now that we have added these new scopes, the scope conflict report is clear. All syntax highlighting formats should be valid (although bear in mind, this is a beta project so there may be bugs).
And that covers a vertical slice of functionality for the Iro syntax highlighter editor / debugger / generator.