How we ported HighlightJS to Dart

Alexey Inkin
Published in Akvelon
13 min read · May 9, 2023

Background

HighlightJS is one of the most popular packages for code highlighting and is used on many websites that display code snippets. At Akvelon, we needed the same for apps written in Dart.

Java code highlighted in a Flutter app.

There is a Dart package called highlight that was made by porting HighlightJS, but it was abandoned and had not been maintained for 2 years. We needed a fresh version with bugs fixed, so we decided to fork and revive that package and then catch up on the HighlightJS changes.

This resulted in the highlighting and flutter_highlighting packages, which you can now use too. In this article, we share the details of how we made them.

HighlightJS Architecture

The original JavaScript package consists of its core and 197 language definitions.

Each language is declared as an object conforming to the Language interface. It does not contain a whole syntax tree definition, only the rules to parse comments, string literals, and some other things sufficient for highlighting. This is more efficient than a full syntax tree parser.

Each such rule is called a Mode. There are modes for C-style comments, for string literals, for keywords, etc. A Mode may contain other Modes. For instance, a doc comment may contain references to variables, and a string literal may contain variable interpolation. A Mode may even contain a piece in another language: think of a doc comment with Markdown code. To accommodate this, the Language interface extends the Mode interface.
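As a rough illustration of this nesting, here is a minimal sketch in TypeScript. The field names scope, begin, end, contains, and subLanguage appear in the definitions shown later in this article; the interfaces themselves are trimmed-down assumptions, and the markers used for the embedded piece are made up for the example.

```typescript
// Simplified, hypothetical shapes; the real HighlightJS types have many more fields.
interface Mode {
  scope?: string;          // name used to color the match
  begin?: string | RegExp; // how the match begins
  end?: string | RegExp;   // how the match ends
  contains?: Mode[];       // modes allowed inside this one
  subLanguage?: string;    // another language embedded in this mode
}

// A Language is itself a Mode, so it can be nested like any other rule.
interface Language extends Mode {
  name: string;
}

// A variable reference that may appear inside a doc comment.
const variableReference: Mode = { scope: "variable", begin: /\[\w+\]/ };

// A piece in another language; the <md> markers are invented for this sketch.
const embeddedMarkdown: Mode = { begin: "<md>", end: "</md>", subLanguage: "markdown" };

// A doc comment that holds both kinds of nested content.
const docComment: Mode = {
  scope: "comment",
  begin: "///",
  end: "$",
  contains: [variableReference, embeddedMarkdown],
};

const exampleLanguage: Language = { name: "Example", contains: [docComment] };
```

Because Language extends Mode, the parser can treat the root of a language and any nested rule in the same way.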

The core of the package contains the parser and some other processors of the Language objects. The output of the parser is a Result object, which can be turned into an HTML string.
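To make the last step concrete, here is a minimal sketch of turning a tree of highlighted pieces into HTML. The ResultNode type is a hypothetical stand-in for the real Result structure; the hljs- prefix mirrors the class names HighlightJS emits by default, and the escaping is reduced to the three characters that matter for this example.

```typescript
// Hypothetical tree node: plain text or a scoped span with children.
type ResultNode = string | { scope: string; children: ResultNode[] };

// Minimal escaping for this sketch only.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Renders one node; nested scopes become nested spans.
function toHtml(node: ResultNode): string {
  if (typeof node === "string") return escapeHtml(node);
  const inner = node.children.map((child) => toHtml(child)).join("");
  return `<span class="hljs-${node.scope}">${inner}</span>`;
}

// A string literal with an interpolation inside it.
const tree: ResultNode[] = [
  "x = ",
  { scope: "string", children: ['"a', { scope: "subst", children: ["${b}"] }, '"'] },
];

const html = tree.map((node) => toHtml(node)).join("");
console.log(html);
```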

Dart Architecture

The architecture of HighlightJS is mirrored in Dart for ease of maintenance. We have the core, the language definitions, and the Result class with the same fields.

Additionally, we have a Flutter package that wraps the whole thing in a widget. It accepts the code string and builds a RichText widget with a tree of colored TextSpan objects.

Scavenging the Abandoned Package

The Porting Tool

The original Dart package was ported from HighlightJS 9.18, while the current version is 11.7, so a lot has changed.

Because language definitions change the most, the package comes with a porting script for them. However, the tool no longer works out of the box. With minor changes, we could port many languages, but others required extensive work.

The Core

There is no porting tool for the core because it does not follow a simple format like the language definitions, and there is no general transpiler from JavaScript to Dart.

The core of HighlightJS had diverged to the point that it was easier to discard the old Dart core and write a new one. This is why we chose to transpile the core from JavaScript to Dart manually. We replicated all classes and functions, mostly line by line.

We rely on tests to see if anything has changed in the JavaScript core that needs to be reflected in the Dart core.

Porting the Language Definitions

An Example Language

In HighlightJS, each language is defined in a separate file as a factory function that returns an object conforming to the Language interface. For instance, here is the definition of Dockerfile syntax:

/** @type LanguageFn */
export default function(hljs) {
const KEYWORDS = [
"from",
"maintainer",
"expose",
"env",
"arg",
"user",
"onbuild",
"stopsignal"
];
return {
name: 'Dockerfile',
aliases: [ 'docker' ],
case_insensitive: true,
keywords: KEYWORDS,
contains: [
hljs.HASH_COMMENT_MODE,
hljs.APOS_STRING_MODE,
hljs.QUOTE_STRING_MODE,
hljs.NUMBER_MODE,
{
beginKeywords: 'run cmd entrypoint volume add copy workdir label healthcheck shell',
starts: {
end: /[^\\]$/,
subLanguage: 'bash'
}
}
],
illegal: '</'
};
}

The key property here is contains, which is a list of the Mode objects that can be parsed from text. In this example, only the last element is specific to the language, while the other 4 are reusable constants because they are common to many languages:

  • HASH_COMMENT_MODE matches comments starting with #
  • APOS_STRING_MODE matches single-quoted string literals
  • QUOTE_STRING_MODE matches double-quoted string literals
  • NUMBER_MODE matches number literals

For instance, this is the definition for double-quoted string literals:

export const QUOTE_STRING_MODE = {
scope: 'string',
begin: '"',
end: '"',
illegal: '\\n',
contains: [BACKSLASH_ESCAPE]
};

Here:

  • scope will be translated to the CSS class name of a colored span
  • begin is how the match begins
  • end is how the match ends
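The three properties can be illustrated with a deliberately tiny matcher. This is only a sketch under simplifying assumptions: SimpleMode and highlightFirst are hypothetical names, only the first occurrence is handled, and nested modes, illegal, and HTML escaping are ignored. The real parser in the core is far more involved.

```typescript
// Hypothetical, much-simplified mode: begin and end are treated as regex sources.
interface SimpleMode {
  scope: string;
  begin: string;
  end: string;
}

// Finds the first occurrence of the mode and wraps it in a colored span.
function highlightFirst(code: string, mode: SimpleMode): string {
  const begin = new RegExp(mode.begin).exec(code);
  if (begin === null) return code;
  const afterBegin = begin.index + begin[0].length;
  // The end pattern is only searched for after the begin match.
  const end = new RegExp(mode.end).exec(code.slice(afterBegin));
  if (end === null) return code;
  const matchEnd = afterBegin + end.index + end[0].length;
  return (
    code.slice(0, begin.index) +
    `<span class="hljs-${mode.scope}">` +
    code.slice(begin.index, matchEnd) +
    "</span>" +
    code.slice(matchEnd)
  );
}

const quoteString: SimpleMode = { scope: "string", begin: '"', end: '"' };
const highlighted = highlightFirst('print("hi") // done', quoteString);
console.log(highlighted);
```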

Automation

The easiest way to parse and transpile such language definitions is to write a tool in TypeScript. This tool is a client of the HighlightJS library. It loops through the language definitions and calls each one’s factory function to get a JavaScript object. This object is longer than the source definition because all constants for common modes are expanded in place.

For the above Dockerfile syntax definition, we get this Language object at runtime:

{
"name": "Dockerfile",
"aliases": ["docker"],
"case_insensitive": true,
"keywords": ["from", "maintainer", "expose", "env", "arg", "user", "onbuild", "stopsignal"],
"contains": [
{
"scope": "comment",
"begin": "#",
"end": "$",
"contains": [
{
"scope": "doctag",
"begin": "[ ]*(?=(TODO|FIXME|NOTE|BUG|OPTIMIZE|HACK|XXX):)",
"end": {},
"excludeBegin": true,
"relevance": 0
},
{
"begin": "[ ]+((?:I|a|is|so|us|to|at|if|in|it|on|[A-Za-z]+['](d|ve|re|ll|t|s|n)|[A-Za-z]+[-][a-z]+|[A-Za-z][a-z]{2,})[.]?[:]?([.][ ]|[ ])){3}"
}
]
},
{
"scope": "string",
"begin": "'",
"end": "'",
"illegal": "\\n",
"contains": [
{
"begin": "\\\\[\\s\\S]",
"relevance": 0
}
]
},
{
"scope": "string",
"begin": "\"",
"end": "\"",
"illegal": "\\n",
"contains": [
{
"begin": "\\\\[\\s\\S]",
"relevance": 0
}
]
},
{
"scope": "number",
"begin": "\\b\\d+(\\.\\d+)?",
"relevance": 0
},
{
"beginKeywords": "run cmd entrypoint volume add copy workdir label healthcheck shell",
"starts": {
"end": {},
"subLanguage": "bash"
}
}
],
"illegal": "</"
}

Then, the idea is to walk through this JavaScript object and to write the definition of the Language object in Dart based on that.
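A minimal sketch of such a walk is shown below. The names dartString and toDart are hypothetical, every nested object is assumed to become a Mode(...) call, and only strings, numbers, booleans, arrays, objects, and regular expressions are handled. Regular expressions are read through their source property here because they serialize to {} in plain JSON, as seen in the listing above.

```typescript
// Escapes a JavaScript string into a double-quoted Dart string literal.
function dartString(text: string): string {
  const escaped = text.replace(/[\\"$\n]/g, (ch) => (ch === "\n" ? "\\n" : "\\" + ch));
  return `"${escaped}"`;
}

// Recursively writes a value as Dart source. Assumption: every object is a Mode.
function toDart(value: unknown, indent: string = ""): string {
  if (value instanceof RegExp) return dartString(value.source);
  if (typeof value === "string") return dartString(value);
  if (typeof value === "number" || typeof value === "boolean") return String(value);
  const inner = indent + "  ";
  if (Array.isArray(value)) {
    const items = value.map((item) => `${inner}${toDart(item, inner)},\n`).join("");
    return `[\n${items}${indent}]`;
  }
  if (typeof value === "object" && value !== null) {
    const record = value as Record<string, unknown>;
    const fields = Object.keys(record)
      .map((key) => `${inner}${key}: ${toDart(record[key], inner)},\n`)
      .join("");
    return `Mode(\n${fields}${indent})`;
  }
  throw new Error(`Unsupported value: ${String(value)}`);
}

// The double-quoted string mode, written the way JavaScript sees it at runtime.
const quoteStringMode = {
  scope: "string",
  begin: '"',
  end: '"',
  illegal: "\\n",
  contains: [{ begin: /\\[\s\S]/, relevance: 0 }],
};

const dartSource = "final QUOTE_STRING_MODE = " + toDart(quoteStringMode) + ";";
console.log(dartSource);
```

Property names that are not valid Dart identifiers, such as on:begin, would need extra handling that this sketch leaves out.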

Porting the Common Modes

If we simply generate the Dart equivalent of this object definition, it will be just as long. We can simplify this if we detect those common modes that were just expanded by JavaScript.

In the case of Dockerfile, we should identify HASH_COMMENT_MODE, APOS_STRING_MODE, QUOTE_STRING_MODE, and NUMBER_MODE.

To be able to use these building blocks in the Dart language definitions, we must port them to Dart.

We do this by inspecting the global hljs object because each common mode is an exported constant that ends up being a property of hljs.
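One way this inspection could look is sketched below. To stay self-contained it uses a stand-in object (fakeHljs) instead of the real library, and it assumes that common modes are upper-case properties holding objects and that a language definition reuses the very same object instances, so they can be recognized by identity. Both assumptions would need checking against the real hljs object.

```typescript
// Stand-in for the global hljs object; the real one exposes many more properties.
const fakeHljs: Record<string, unknown> = {
  HASH_COMMENT_MODE: { scope: "comment", begin: "#", end: "$" },
  NUMBER_MODE: { scope: "number", begin: "\\b\\d+(\\.\\d+)?", relevance: 0 },
  highlight: () => "not a mode",
};

// Maps each exported mode object to its constant name, keyed by object identity.
function collectCommonModes(hljs: Record<string, unknown>): Map<object, string> {
  const names = new Map<object, string>();
  for (const key of Object.keys(hljs)) {
    const value = hljs[key];
    // Assumption: constants are UPPER_SNAKE_CASE objects; functions are skipped.
    if (/^[A-Z_]+$/.test(key) && typeof value === "object" && value !== null) {
      names.set(value, key);
    }
  }
  return names;
}

const commonModes = collectCommonModes(fakeHljs);

// A 'contains' list that reuses the shared constants by reference.
const dockerContains: unknown[] = [fakeHljs.HASH_COMMENT_MODE, fakeHljs.NUMBER_MODE, { begin: "x" }];

// Shared constants are reported by name, everything else as an inline mode.
const described = dockerContains.map((mode) =>
  typeof mode === "object" && mode !== null && commonModes.has(mode)
    ? commonModes.get(mode)
    : "(inline mode)",
);
console.log(described);
```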

Then we write a common_modes.dart file with definitions like these:

final QUOTE_STRING_MODE = Mode(
scope: "string",
begin: "\"",
end: "\"",
illegal: "\\n",
contains: [
Mode(
begin: "\\\\[\\s\\S]",
relevance: 0,
),
],
);

Or with a pass to detect nested common modes:

final QUOTE_STRING_MODE = Mode(
scope: "string",
begin: "\"",
end: "\"",
illegal: "\\n",
contains: [BACKSLASH_ESCAPE],
);

Porting the Actual Language Definitions

Now that we ported the common modes, which are the building blocks of a language, we can do the same with the language definitions.

However, a huge difference is that language definitions may have circular references.

A good example is the Dart language, which allows interpolating strings with arbitrary expressions, which may in turn contain strings, and so on.

This definition creates a circular reference (comments are mine):

const BRACED_SUBST = {
className: 'subst',
variants: [
{
begin: /\$\{/,
end: /\}/
}
],
keywords: 'true false null this is new super'
};

const STRING = {
className: 'string',
variants: [
// (Skipping 4 other literal variants.)
// We are interested in those allowing interpolation like this one:
{
begin: '\'\'\'',
end: '\'\'\'',
contains: [
hljs.BACKSLASH_ESCAPE,
SUBST,
BRACED_SUBST // This allows us to parse ${something}
]
},
// (Skipping 3 other literal variants.)
]
};

BRACED_SUBST.contains = [
hljs.C_NUMBER_MODE,
STRING // This circles the reference.
];

This means that we cannot just iterate this object to write the Dart definition because the recursion will be infinite.

To break the circular references, we will use the circular-json package. It is an object serializer that detects if some object is repeated in the structure.

This simple snippet shows what the package does:

const CircularJSON = require('circular-json');

const object = {};
object.arr = [
object, object
];
object.arr.push(object.arr);
object.obj = object;

const serialized = CircularJSON.stringify(object);
// '{"arr":["~","~","~arr"],"obj":"~"}'

In this example:

  • The circular references to the root object are serialized as "~" strings
  • The circular reference to the arr property is serialized as the "~arr" string

In general, it replaces all repeating objects with the path of their first occurrence in the structure.
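The idea can be reproduced in a few lines without the package. The sketch below is not the circular-json implementation; encodePaths and stringifyWithPaths are hypothetical names, and real strings that happen to begin with ~ would be ambiguous here. It only shows how remembering the path of the first occurrence breaks the cycles.

```typescript
// Replaces every repeated object with the path of its first occurrence.
function encodePaths(value: unknown, path: string[], seen: Map<object, string>): unknown {
  if (typeof value !== "object" || value === null) return value;
  const known = seen.get(value);
  if (known !== undefined) return known; // seen before: emit its path instead
  seen.set(value, "~" + path.join("~"));
  if (Array.isArray(value)) {
    return value.map((item, index) => encodePaths(item, [...path, String(index)], seen));
  }
  const source = value as Record<string, unknown>;
  const result: Record<string, unknown> = {};
  for (const key of Object.keys(source)) {
    result[key] = encodePaths(source[key], [...path, key], seen);
  }
  return result;
}

function stringifyWithPaths(value: unknown): string {
  return JSON.stringify(encodePaths(value, [], new Map<object, string>()));
}

// The same structure as in the snippet above.
const sample: Record<string, unknown> = {};
const sampleArr: unknown[] = [sample, sample];
sampleArr.push(sampleArr);
sample.arr = sampleArr;
sample.obj = sample;

const encoded = stringifyWithPaths(sample);
console.log(encoded); // {"arr":["~","~","~arr"],"obj":"~"}
```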

When we serialize the Dart language definition this way, we get this JSON below (comments are mine):

{
"name": "Dart",
"keywords": { /* Skipping a lot here */ },
"contains": [
{
"className": "string",
"variants": [
// (Skipping the 4 variants of raw strings.)
// The definition of a multiline single-quoted string:
{
"begin": "'''",
"end": "'''",
"contains": [
// (Skipping 2 non-important elements.)
// This is "BRACED_SUBST" constant:
{
"className": "subst",
"variants": [
{
"begin": "\\$\\{",
"end": "\\}"
}
],
"keywords": "true false null this is new super",
"contains": [
// (Skipping 1 non-important element.)
// This is the reference to the 'all strings' definition.
// This path reads as:
// 1. Take the root object.
// 2. Take its 'contains' property.
// 3. Take its 0th array element.
"~contains~0"
]
}
]
},
// The definition of a multiline double-quoted string is much
// shorter because it references the objects encountered before:
{
"begin": "\"\"\"",
"end": "\"\"\"",
"contains": [
// (Skipping 2 non-important elements.)
// This is the reference to BRACED_SUBST constant.
// This path reads as:
// 1. Take the string definition at ~contains~0.
// 2. Take its variant at index 4 (multiline single-quote).
// 3. Take its 'contains' property.
// 4. Take its array element at index 2.
"~contains~0~variants~4~contains~2"
]
},
// (Skipping 2 other string definitions, single-lined)
]
},
// (Skipping 8 things other than string literals highlighted in Dart).
]
}

In this JSON, we get a lot of such tokens: ~contains~0~variants~4~contains~2

Sometimes they save us from circular references. Other times, they just shorten the language definition by avoiding repetition of long JSON fragments.

Also while serializing, we replace occurrences of common modes with their names to make the definition even shorter:

{
"name": "Dart",
"keywords": { /* Skipped a lot here */ },
"contains": [
// (Skipped all string definitions.)
"C_LINE_COMMENT_MODE",
"C_BLOCK_COMMENT_MODE",
"C_NUMBER_MODE",
// (Skipped some more.)
]
}

This whole language definition can now be parsed back with the ordinary JSON.parse() into a non-circular object. We can walk it and write the Dart equivalent:

final dart = Language(
// This is the dictionary of all repeated parts.
refs: {
// This is the "BRACED_SUBST" construct:
'~contains~0~variants~4~contains~2': Mode(
className: "subst",
variants: [
Mode(begin: "\\\$\\{", end: "\\}"),
],
keywords: "true false null this is new super",
contains: [
C_NUMBER_MODE,
Mode(ref: '~contains~0'),
],
),
'~contains~0~variants~4~contains~1': Mode( /* Skipping */),
'~contains~0': Mode(
className: "string",
variants: [
// (Skipping the 4 variants of raw strings.)
// The definition of a multiline single-quoted string:
Mode(
begin: "'''",
end: "'''",
contains: [
BACKSLASH_ESCAPE,
Mode(ref: '~contains~0~variants~4~contains~1'),
Mode(ref: '~contains~0~variants~4~contains~2'),
],
),
// The definition of a multiline double-quoted string:
Mode(
begin: "\"\"\"",
end: "\"\"\"",
contains: [
BACKSLASH_ESCAPE,
Mode(ref: '~contains~0~variants~4~contains~1'),
Mode(ref: '~contains~0~variants~4~contains~2'),
],
),
// (Skipping 2 other string definitions, single-lined.)
],
),
},
// End of the dictionary.
// Below is mostly the equivalent of the original definition:
name: "Dart",
keywords: { /* Skipping */ },
contains: [
Mode(ref: '~contains~0'),
// (Skipping 8 things other than string literals highlighted in Dart).
],
);

In this definition, Modes come in three forms:

  • Mode(begin: "...", end: "...", ...) is the traditional definition mirroring what was in the JavaScript
  • Mode(ref: "~contains~...") is a reference to the dictionary entry
  • C_NUMBER_MODE, BACKSLASH_ESCAPE, and the like are the constants from common_modes.dart

Recovering the Circular References

At runtime, a language must be ‘compiled’ before it can be used for highlighting.

This means that all the ‘reference’ modes should be replaced with their dictionary entries: each Mode(ref: "~contains~...") is replaced with the corresponding Mode object from the Language.refs map.

With circular references back in place, we can do recursive highlighting of string literals that contain interpolation that contains string literals, etc.
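The replacement step could look roughly like the sketch below. It is written in TypeScript rather than Dart so that it matches the other examples; RefMode, RefLanguage, and compileLanguage are hypothetical names, and only the contains and variants lists are walked.

```typescript
// Hypothetical, trimmed-down shapes of the generated classes.
interface RefMode {
  ref?: string;
  scope?: string;
  contains?: RefMode[];
  variants?: RefMode[];
}
interface RefLanguage extends RefMode {
  refs: Record<string, RefMode>;
}

// Replaces every { ref } placeholder in a list with its dictionary entry.
function resolveList(list: RefMode[] | undefined, refs: Record<string, RefMode>): void {
  if (list === undefined) return;
  for (let i = 0; i < list.length; i++) {
    const key = list[i].ref;
    if (key !== undefined) {
      const target = refs[key];
      if (target === undefined) throw new Error(`Unknown ref: ${key}`);
      list[i] = target;
    }
  }
}

function compileLanguage(language: RefLanguage): void {
  const visited = new Set<RefMode>();
  const visit = (mode: RefMode): void => {
    if (visited.has(mode)) return; // the structure is circular: stop at modes already seen
    visited.add(mode);
    resolveList(mode.contains, language.refs);
    resolveList(mode.variants, language.refs);
    (mode.contains || []).forEach((child) => visit(child));
    (mode.variants || []).forEach((child) => visit(child));
  };
  visit(language);
}

// A string that contains an interpolation that contains strings again.
const dartLike: RefLanguage = {
  refs: {
    "~contains~0": { scope: "string", contains: [{ ref: "~contains~0~contains~0" }] },
    "~contains~0~contains~0": { scope: "subst", contains: [{ ref: "~contains~0" }] },
  },
  contains: [{ ref: "~contains~0" }],
};

compileLanguage(dartLike);
const stringMode = dartLike.contains![0];
const substMode = stringMode.contains![0];
console.log(substMode.contains![0] === stringMode); // true: the circle is closed again
```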

At this point, we are ready to highlight most of the languages that HighlightJS supports. The actual parsing of code and matching against modes is done in the package core that we transpiled to Dart manually.

Testing

HighlightJS has golden tests. For each supported language, it has snippets of input code and the reference HTML to be produced when highlighting them.

We need to take those snippets and feed them to the Dart package to see if it highlights them to the same HTML as the original HighlightJS.

It is as simple as:

  1. Clone the HighlightJS repository
  2. Check out the tag with the version we are porting
  3. Find all input snippets
  4. Run the Dart highlighting
  5. Compare with the reference HTML

This setup is done once for all languages, so no language-specific work is required.
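The comparison at the heart of these steps can be sketched as a small loop. Everything here is an assumption for illustration: the cases are kept in memory instead of being read from the cloned repository, GoldenCase and runGoldenTests are hypothetical names, fakeHighlight stands in for the real highlighter, and trailing whitespace is ignored when comparing.

```typescript
// One golden case: an input snippet and the reference HTML for it.
interface GoldenCase {
  language: string;
  caseName: string;
  code: string;
  expectedHtml: string;
}
interface Mismatch extends GoldenCase {
  actualHtml: string;
}

// Runs every case through the given highlighter and collects the mismatches.
function runGoldenTests(
  cases: GoldenCase[],
  highlight: (code: string, language: string) => string,
): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const golden of cases) {
    const actualHtml = highlight(golden.code, golden.language);
    // Assumption: trailing whitespace differences are not meaningful.
    if (actualHtml.trim() !== golden.expectedHtml.trim()) {
      mismatches.push({ ...golden, actualHtml });
    }
  }
  return mismatches;
}

// Stand-in highlighter that only knows about numbers.
const fakeHighlight = (code: string): string =>
  code.replace(/\d+/g, (digits) => `<span class="hljs-number">${digits}</span>`);

const goldenCases: GoldenCase[] = [
  { language: "demo", caseName: "numbers", code: "x = 42", expectedHtml: 'x = <span class="hljs-number">42</span>\n' },
  { language: "demo", caseName: "strings", code: 'y = "a"', expectedHtml: 'y = <span class="hljs-string">"a"</span>' },
];

const failures = runGoldenTests(goldenCases, fakeHighlight);
console.log(failures.map((failure) => failure.caseName)); // [ 'strings' ]
```

Keeping the actual output next to the expected one in each mismatch is what makes the later side-by-side comparison possible.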

The original Dart package did that. We additionally made the tool write the actual output when it does not match the golden one, so every mismatch can be inspected on disk.

Broken golden tests in 4 languages.

For each broken language, we have a directory with all broken tests. For each of them, we have three files: the original code snippet, the actual highlighted HTML, and the expected highlighted HTML, so we can easily compare them.

Fixing Tricky Languages

At this point most of the languages were working, and the tests showed us what exactly was still broken. We inspected each broken language to identify and port rarely used mechanisms in HighlightJS.

Callbacks

Unfortunately, not all language definitions are declarative. Some modes use callbacks when matching against code.

For instance, in PHP a multi-line string starts with a token that will also end it (called the “Heredoc” syntax):

<?php
echo <<<THIS_TOKEN_BEGINS_AND_ENDS_THE_STRING
a
b
c
THIS_TOKEN_BEGINS_AND_ENDS_THE_STRING;

It’s hard to come up with a declarative solution to match such patterns. So this is how HighlightJS defines a Mode for such syntax:

const HEREDOC = hljs.END_SAME_AS_BEGIN({
begin: /<<<[ \t]*(\w+)\n/,
end: /[ \t]*(\w+)\b/,
contains: hljs.QUOTE_STRING_MODE.contains.concat(SUBST),
});

The END_SAME_AS_BEGIN function adds two callbacks to the passed Mode:

/**
* Adds end same as begin mechanics to a mode
*
* Your mode must include at least a single () match group as that first match
* group is what is used for comparison
* @param {Partial<Mode>} mode
*/
export const END_SAME_AS_BEGIN = function(mode) {
return Object.assign(mode,
{
/** @type {ModeCallback} */
'on:begin': (m, resp) => { resp.data._beginMatch = m[1]; },
/** @type {ModeCallback} */
'on:end': (m, resp) => { if (resp.data._beginMatch !== m[1]) resp.ignoreMatch(); }
});
};

This results in the following effective definition of this Mode:


const HEREDOC = {
begin: /<<<[ \t]*(\w+)\n/,
end: /[ \t]*(\w+)\b/,
'on:begin': (m, resp) => {
resp.data._beginMatch = m[1];
},
'on:end': (m, resp) => {
if (resp.data._beginMatch !== m[1])
resp.ignoreMatch();
},
contains: [ /* Skipping complex things here. */ ],
};

Note the entries on:begin and on:end.

The first callback is called when the beginning of the Mode is matched. It stores the token that matched the regular expression.

The second callback is called when the end of the Mode is matched. It makes the core ignore the match if the ending token differs from the one that started the match.
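The effect of the two callbacks can be imitated with a plain loop. This is a sketch of the mechanics only, not the HighlightJS core: heredocBody is a hypothetical helper, and after an ignored match it simply continues with the next candidate.

```typescript
// Returns the heredoc body, or null when no matching end token is found.
function heredocBody(code: string): string | null {
  const begin = /<<<[ \t]*(\w+)\n/.exec(code);
  if (begin === null) return null;
  const beginMatch = begin[1]; // this is what 'on:begin' stores
  const bodyStart = begin.index + begin[0].length;
  const end = /[ \t]*(\w+)\b/g;
  end.lastIndex = bodyStart;
  let candidate = end.exec(code);
  while (candidate !== null) {
    // This is what 'on:end' checks: a different token means the match is ignored.
    if (candidate[1] === beginMatch) {
      return code.slice(bodyStart, candidate.index);
    }
    candidate = end.exec(code);
  }
  return null;
}

// 'a', 'b', and 'EOT2' all match the end pattern, but only 'EOT' closes the string.
const phpCode = "echo <<<EOT\na\nb EOT2\nEOT;\n";
const heredoc = heredocBody(phpCode);
console.log(JSON.stringify(heredoc)); // "a\nb EOT2\n"
```

Without the comparison, the first word of the body would already end the string, because the end pattern matches any word.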

Since these callbacks contain arbitrary code, we cannot port them automatically. Our solution was to port them manually and to have a dictionary to map the JavaScript function bodies to Dart function names.

The most common callbacks were produced by functions like END_SAME_AS_BEGIN. For them we made the following map:

const commonCallbacks = new Map<string, string>([
[hljs.END_SAME_AS_BEGIN({})["on:begin"]!.toString(), "endSameAsBeginOnBegin"],
[hljs.END_SAME_AS_BEGIN({})["on:end"]!.toString(), "endSameAsBeginOnEnd"],
// Skipping some more.
]);

In JavaScript, calling toString() on a function returns its source code. We use this fact to populate the map.

We added a new check when serializing a language definition to JSON. We inspect all properties and find callbacks. If the callback’s body is found in this map, we replace it with the Dart function name.
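The lookup can be sketched as follows. To stay self-contained, the example defines its own endSameAsBegin as a stand-in for hljs.END_SAME_AS_BEGIN; the types and the dartCallbackName helper are hypothetical. Note that comparing source text only works while the function bodies are identical character for character.

```typescript
type Callback = (
  match: string[],
  response: { data: { _beginMatch?: string }; ignoreMatch(): void },
) => void;

interface CallbackMode {
  "on:begin"?: Callback;
  "on:end"?: Callback;
}

// Stand-in for hljs.END_SAME_AS_BEGIN, so the sketch needs no dependency.
function endSameAsBegin(mode: CallbackMode): CallbackMode {
  const onBegin: Callback = (m, resp) => { resp.data._beginMatch = m[1]; };
  const onEnd: Callback = (m, resp) => { if (resp.data._beginMatch !== m[1]) resp.ignoreMatch(); };
  return Object.assign(mode, { "on:begin": onBegin, "on:end": onEnd });
}

// The source text of each known callback is mapped to a Dart function name.
const template = endSameAsBegin({});
const commonCallbackNames = new Map<string, string>([
  [template["on:begin"]!.toString(), "endSameAsBeginOnBegin"],
  [template["on:end"]!.toString(), "endSameAsBeginOnEnd"],
]);

// Returns the Dart function name, or undefined if the callback is not a common one.
function dartCallbackName(callback: Callback): string | undefined {
  return commonCallbackNames.get(callback.toString());
}

// A fresh mode gets new function objects, but their source text is the same.
const heredocMode = endSameAsBegin({});
const beginName = dartCallbackName(heredocMode["on:begin"]!);
const endName = dartCallbackName(heredocMode["on:end"]!);
console.log(beginName, endName);
```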

Other callbacks are specific to the languages they are used in. For example, in the definition of the JavaScript language there is a huge callback that checks if something is a JSX tag.

As with any callback, we manually transpiled it to Dart. For language-specific callbacks, we just generate the Dart function names from the path where the callback was defined:

language_javascript_contains_0_contains_0_variants_0_onBegin
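Generating such a name takes only a few lines. In this sketch, callbackFunctionName is a hypothetical helper; it assumes that the language id is already a valid identifier part and that the path is available as a list of property names and array indexes.

```typescript
// Builds a Dart-friendly function name from the place where a callback was found.
function callbackFunctionName(
  language: string,
  path: Array<string | number>,
  key: string,
): string {
  // 'on:begin' becomes 'onBegin'.
  const camelKey = key.replace(/:(\w)/g, (_match, letter: string) => letter.toUpperCase());
  return ["language", language, ...path, camelKey].join("_");
}

const generatedName = callbackFunctionName(
  "javascript",
  ["contains", 0, "contains", 0, "variants", 0],
  "on:begin",
);
console.log(generatedName); // language_javascript_contains_0_contains_0_variants_0_onBegin
```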

Auto-detection

HighlightJS can auto-detect the language. Initially, we wanted to skip this to complete the explicit highlighting faster. But it turned out that auto-detection is necessary even for that.

For example, the XML language can treat the style tag in two ways:

  • If it contains CSS, it should be highlighted as CSS and nested XML should not be parsed.
  • Otherwise, treat it as an ordinary tag and highlight its content recursively as XML.

This is how these rules are defined:

{
className: 'tag',
/*
The lookahead pattern (?=...) ensures that 'begin' only matches
'<style' as a single word, followed by a whitespace or an
ending bracket.
*/
begin: /<style(?=\s|>)/,
end: />/,
keywords: { name: 'style' },
contains: [ TAG_INTERNALS ],
starts: {
end: /<\/style>/,
returnEnd: true,
subLanguage: [
'css',
'xml'
]
}
},

Therefore, even to highlight a snippet as explicit XML, we must be able to detect different languages in its content.
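Choosing between the listed sub-languages can be pictured as trying each candidate and keeping the one that scores best. The sketch below assumes that the score is a relevance number produced by highlighting; detectSubLanguage, toyScore, and the tie-breaking rule are hypothetical and far simpler than the real auto-detection.

```typescript
// Hypothetical result of highlighting a snippet as one language.
interface Attempt {
  language: string;
  relevance: number;
}

// Tries every candidate and keeps the one with the highest relevance.
// In this sketch, the candidate listed first wins a tie.
function detectSubLanguage(
  code: string,
  candidates: string[],
  score: (code: string, language: string) => number,
): Attempt {
  let best: Attempt = { language: candidates[0], relevance: score(code, candidates[0]) };
  for (const language of candidates.slice(1)) {
    const relevance = score(code, language);
    if (relevance > best.relevance) best = { language, relevance };
  }
  return best;
}

// Toy scorer: counts constructs typical of each language. A real relevance
// value would come from the modes matched while highlighting.
function toyScore(code: string, language: string): number {
  const pattern = language === "css" ? /[\w-]+\s*:\s*[^;]+;/g : /<\/?\w+/g;
  return (code.match(pattern) || []).length;
}

const cssGuess = detectSubLanguage("body { color: red; margin: 0; }", ["css", "xml"], toyScore);
const xmlGuess = detectSubLanguage("<b>bold</b>", ["css", "xml"], toyScore);
console.log(cssGuess.language, xmlGuess.language); // css xml
```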

We ported the auto-detection part of the core as well, and that fixed a few more languages for us.

The Improvements Over the Original Dart Package

Here is how we improved on the original package, which had gone unmaintained for 2 years:

  • Improved syntax error tolerance, a bonus of the newer HighlightJS. In many languages, any missing quote used to kill the highlighting of the entire document. So did a missing colon in Python. This heavily impaired use in code editors, where code is incomplete while it is being typed. Now this works:
Old package on the left, new package on the right.
  • Improved highlighting details in languages with new callbacks. With the older HighlightJS, which lacked the end-same-as-begin check, PHP and some other languages produced false positives on string endings. That led to errors that were extremely hard to find:
Old package on the left, new package on the right.
  • Got some languages up to date. For instance, the old Dart definition did not highlight the required keyword, and the old Java definition did not support multi-line strings.
  • Added new languages: C, LaTeX, NestedText, Node REPL, Python REPL, WebAssembly, and Wren.
  • Made the porting tool in TypeScript instead of JavaScript so it is easier to maintain.
  • Got a clear and documented workflow for porting so the package will never be abandoned again.

Applications for this Package

A lot of apps can make use of code syntax highlighting, including but not limited to:

  • Code editors
  • Messengers
  • Documentation viewers

What will you make with it? Please drop a link to your app in a comment.

We are working on a code editor that uses this package for highlighting code as you edit it. Want to learn about it? Follow us to get notified!

Alexey Inkin
Akvelon

Google Developer Expert in Flutter. PHP, SQL, TS, Java, C++, professionally since 2003. Open for consulting & dev with my team. Telegram channel: @ainkin_com