Handling bi-directional text in the Phrase CAT web editor

Petr Šimon
May 27, 2020

The world’s languages use a number of different scripts, most of which are written from left to right (LTR), as in Czech, English, or Spanish.
A number of languages, however, are written in the opposite direction, from right to left (RTL), for example Arabic, Hebrew, and Persian.

Writers and translators sometimes need to mix languages written in opposite directions and thus create bi-directional text, e.g. a text in Arabic may contain a word or an entire phrase in English and vice versa.

Modern text editors quite often implement the Unicode bi-directional algorithm, which defines the flow of the text when rendered on the screen, but even then there are ambiguous situations that only the author can resolve.

The Phrase CAT web editor, henceforth the web editor, is built on web technologies and features our own implementation of a rich-text component. The work on this component started several years ago when we couldn’t find any solution that would be suitable for our needs and thus we created the editor from scratch.

The web editor doesn’t rely on native browser technology for rich-text editing such as contenteditable. We handle all input manually and draw our own cursor, selections, highlights, etc. This means that we have to manually handle input of not only Latin scripts but also input from CJK IMEs (stay tuned for an article on this topic) and RTL.

Both Phrase editors supported RTL from the very start. The Phrase CAT desktop editor took advantage of an implementation of RTL handling in the Qt library it is built on, and the web editor offered basic support for RTL, but without support for bi-directional text. The translators that were producing bi-directional texts had to resort to all sorts of workarounds. It took a while until requests from customers started to pile up and we had to take a stand.

Let’s take a closer look at what bi-directionality entails, and why it deserves lengthy prose to clarify.

Notation

“By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and in effect increases the mental power of the race.” — Alfred North Whitehead, “An Introduction to Mathematics”, p. 39, New York: Oxford University Press, 1958

Before we start, let’s define a few symbols:

  • lowercase letters or L are LTR: latin, abc, LLL
  • uppercase letters or R are RTL: CIBARA, WERBEH, RRR
  • (n) means position in text: (0)a(1)b(2)c(3)
  • ? inaccessible cursor position
  • T Phrasetag: LLTRR or RRTLLTRR
  • [] selection: ab[cd]ef, “cd” is selected
  • N neutral character (space, tab,…)
  • W weak character (num., punct., minus, degree, …)

And finally, the position of a cursor is represented by:

  • | generic cursor position, e.g. ab|cd
  • > LTR cursor, e.g. LL>LL
  • < RTL cursor, e.g. RR<RR

Character types

According to the Unicode bi-directional algorithm, every character has a so-called bi-directional type. These types fall into three categories:

  • Strong type — a character that is always left-to-right or right-to-left, e.g. common characters such as “A”, “B”, “C” or most characters of the Arabic or Hebrew alphabets
  • Weak type — numbers and a few other characters, e.g. the plus and minus signs, the degree sign, or number separators such as the comma and full stop
  • Neutral type — e.g. spaces, tabs, and most other punctuation

The Unicode bi-directional algorithm uses these character types to derive a sensible display of the characters on a computer screen. We’ll discuss a few examples further on in the article.
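As a quick way to inspect these types, the following minimal sketch uses Python's standard unicodedata module, which reports the bidi class of a character. The grouping of classes into strong, weak and neutral follows the Unicode bidirectional algorithm; the helper name bidi_category is only an illustration and not part of any editor.

```python
import unicodedata

# Grouping of the Unicode bidi classes into the three categories named above.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_category(ch: str) -> str:
    """Return 'strong', 'weak', 'neutral' or 'other' for a single character."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    return "other"  # explicit formatting characters, or unassigned code points


if __name__ == "__main__":
    # Latin letter, Hebrew alef, Arabic meem, digit, minus, degree sign, space, "!"
    for ch in ["a", "\u05d0", "\u0645", "7", "-", "\u00b0", " ", "!"]:
        print(repr(ch), unicodedata.bidirectional(ch), bidi_category(ch))
```

Running the loop prints the raw bidi class next to the category, which makes it easy to check how a particular punctuation mark or symbol is classified.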

Logical & Visual Positions

The logical order is the order of characters as stored in the memory of a computer which is typically represented as left-to-right order of characters, e.g.

(1) ARABIC english HEBREW

represents the logical order of two RTL words and one English word between them. This is the order in which the user types the characters on the keyboard.

The visual order is the order of characters as displayed on the screen. The same example would be rendered on the screen as

(2) WERBEH english CIBARA

Notice that sample (2) is read from right to left, starting with “CIBARA”, which is the reversed form of “ARABIC”. The reader’s eyes start at “A” and move to the left until they reach the space after “C”, then jump to “e”, i.e. the start of the word “english”, read it from left to right, then jump again to “H” of “WERBEH” and continue reading from right to left.
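The step from logical to visual order can be sketched in a few lines of Python. This is a deliberately simplified model and not the full Unicode algorithm: it assumes the notation of this article (lowercase is LTR, uppercase is RTL, everything else is neutral), ignores weak characters, and handles only two nesting levels. The function names are hypothetical.

```python
def char_direction(ch: str) -> str:
    """Article notation: lowercase = LTR, uppercase = RTL, anything else = neutral."""
    if ch.isalpha():
        return "R" if ch.isupper() else "L"
    return "N"


def resolve_neutrals(types: list, base: str) -> list:
    """Neutrals take the direction both neighbours agree on, else the base direction."""
    resolved = list(types)
    i, n = 0, len(types)
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = resolved[i - 1] if i > 0 else base
        after = types[j] if j < n else base
        direction = before if before == after else base
        for k in range(i, j):
            resolved[k] = direction
        i = j
    return resolved


def visual_order(text: str, base: str = "R") -> str:
    """Return the characters of `text` in the order they appear on screen, left to right."""
    directions = resolve_neutrals([char_direction(ch) for ch in text], base)
    runs = []  # list of [direction, characters] in logical order
    for ch, direction in zip(text, directions):
        if runs and runs[-1][0] == direction:
            runs[-1][1].append(ch)
        else:
            runs.append([direction, [ch]])
    # In an RTL paragraph the runs are laid out from right to left.
    ordered = reversed(runs) if base == "R" else runs
    pieces = []
    for direction, chars in ordered:
        # LTR runs keep their internal order, RTL runs are mirrored.
        pieces.append("".join(chars if direction == "L" else reversed(chars)))
    return "".join(pieces)


if __name__ == "__main__":
    print(visual_order("ARABIC english HEBREW"))  # sample (1) in an RTL document
```

With the default RTL base direction, sample (1) comes out as sample (2). A real implementation would follow the rules of the Unicode bidirectional algorithm instead of this shortcut.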

Typing in bi-directional text

Typing and cursor movement naturally mirror the order in which characters are read.

When typing in LTR text, the cursor moves uniformly from left to right and stays obediently behind the character that just appeared on the screen. The same holds for typing in RTL text, only in reverse: the cursor moves from right to left, always staying “behind” the new character, i.e. to its left. Notice that the notions of “behind” and “in front of” switch with text direction and become even more complicated in bi-directional text.

The uniformity of cursor movement, however, is lost in bi-directional text. Please note that we will be talking about typing in RTL documents, which dictates the general direction in case of ambiguous context (we’ll discuss this further below).

When the writer starts typing ARABIC in an empty RTL document, this is what she should see on the screen:

  • | (a cursor)
  • <A (first Arabic character and RTL cursor behind it)
  • <BA, <CBA (typing continues)

Nothing surprising here.

It is naturally possible to type LTR text into an RTL document:

abc> (cursor is at the end of the text; compare with >abc where the cursor is at the beginning of the text in RTL document)

Something unexpected happens once the user types RTL character A and thus starts working with bi-directional text:

<Aabc

The cursor changes direction to RTL and moves behind the RTL character which is, as we pointed out above, to the left of the RTL characters.

Adding another RTL character would result in:

<BAabc

and another LTR character in

d>BAabc and then de>BAabc

Another situation which should clearly illustrate how typing in bi-directional text behaves is when the cursor is at the beginning of the LTR text in an RTL document:

>abc

This position coincides with the beginning of the RTL document so when the user adds RTL characters A, B, C, we won’t be surprised to see:

abc<A, abc<BA, abc<CBA

In other words, the RTL characters appear at the beginning of the document, i.e. where the cursor was. This can be visually confusing at first, but practice makes perfect, either while developing bi-directional support in a text editor or actually typing something meaningful…

The last illustration is the cursor movement with space in play. When a single LTR character is typed in an RTL paragraph, the cursor positions are as follows:

(0)a(1) with LTR cursors >a>

The 0 naturally marks the beginning of the text, number 1 marks the second position, etc. Programmers typically count from 0 to make things more interesting…

With the cursor at the end of the text a>, we add a space and this happens:

< a

The reason is that space is, from the point of view of bi-directionality, a completely ambiguous character — called neutral — which in RTL context is disambiguated as RTL. In our example, the context is determined by the RTL nature of the document.

Mapping cursor positions

Now that we have a grasp on typing Arabic and mixing some English into it, we are ready to talk a bit more about the intricacies of making all this actually work in a rich-text editor.

The main problem when developing support for bi-directional text is where to put the cursor and generally how to display text selections. After all, a cursor is just a collapsed selection.

The position of a cursor naturally suggests to the writer where letters will appear when she starts typing on the keyboard. Typically, each new letter appears at the current cursor position and the cursor is pushed by the width of the inserted character either to the right (in LTR text) or to the left (in RTL).

The mapping between logical and visual positions in a uni-directional text is one-to-one, e.g.

(3) english: (0)e(1)n(2)g(3)l(4)i(5)s(6)h(7)

(4) ARABIC: (6)C(5)I(4)B(3)A(2)R(1)A(0)

Unfortunately, mapping logical to visual positions in bi-directional text is not as clear-cut. A single logical position can have two meaningful visual counterparts and it’s up to the developers of editors to decide how to resolve this situation because users need a visual aid to know where to type and where the new characters are bound to appear.

After an intensive search engine session, it seemed that there is no standard way of doing this that we could rely on. We found a few discussions and several different approaches to the problem and, naturally, we experimented with the support for bi-directional text provided by different editors. See Links at the bottom for the most useful resources we used.

Since Phrase also offers a desktop editor (implemented using the Qt library), we wanted to provide a coherent writing experience to our users who switch between both editors. Therefore, some decisions we made were influenced by the bi-directional support in the Qt library.

Let’s take a look at a few examples and, first of all, let’s clarify why it is even possible for a cursor position to have two different display positions. All texts are just linear sequences of characters after all… at least in the memory of a computer.

The rationale for the one-to-many relation between logical and visual positions is suggested by the need to jump when we read bi-directional text such as in the sample (2). The respective ambiguous cursor positions can be represented as follows (only relevant positions for illustrative purposes are shown)

(5) WERBE(16)H(15) (14,7)e(8)nglis(13)h(?) (6)CIBARA(0)

The logical positions 7 and 14 share the same visual position.

Position 7 is the LTR position in front of the word “english”. When the user types an LTR character such as “x” it will appear before “e”, rendering “xenglish”. When she types an RTL character such as “م” it will appear behind “h”, rendering “englishم”.

Position 14 is an RTL position behind the word “english”. When the user types an LTR character, it will appear behind “english”, but an RTL character appears before “english”.

To understand this apparently confusing behavior, we have to realize that “in front of” and “behind” have different meanings based on the text direction. In short, the position 14 means “behind” in RTL context and position 7 means “in front of” in LTR context.
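The two candidates for a logical position can be computed from the edges of its neighbouring characters. The sketch below is an illustration under stated assumptions, not the editor's code: the directional runs of sample (5) are given by hand (spaces already resolved as RTL), every character is one unit wide, and the names layout and caret_candidates are hypothetical.

```python
def layout(runs, base="R"):
    """runs: (direction, length) pairs in logical order.

    Returns one (visual_index, direction) pair per logical character."""
    total = sum(length for _, length in runs)
    chars = []
    if base == "R":
        right = total  # an RTL paragraph is filled from the right edge
        for direction, length in runs:
            start = right - length
            if direction == "L":
                indices = range(start, right)
            else:
                indices = range(right - 1, start - 1, -1)
            chars.extend((v, direction) for v in indices)
            right = start
    else:
        left = 0
        for direction, length in runs:
            if direction == "L":
                indices = range(left, left + length)
            else:
                indices = range(left + length - 1, left - 1, -1)
            chars.extend((v, direction) for v in indices)
            left += length
    return chars


def leading_edge(visual_index, direction):
    """Edge at which the character starts when read in its own direction."""
    return visual_index if direction == "L" else visual_index + 1


def trailing_edge(visual_index, direction):
    """Edge at which the character ends when read in its own direction."""
    return visual_index + 1 if direction == "L" else visual_index


def caret_candidates(chars, position):
    """Visual edges (with cursor direction) at which a logical position could be drawn."""
    candidates = []
    if position > 0:  # "behind" the previous character
        candidates.append((trailing_edge(*chars[position - 1]), chars[position - 1][1]))
    if position < len(chars):  # "in front of" the next character
        candidates.append((leading_edge(*chars[position]), chars[position][1]))
    return candidates


# Sample (5): "ARABIC " (RTL), "english" (LTR), " HEBREW" (RTL) in an RTL document.
chars = layout([("R", 7), ("L", 7), ("R", 7)])

if __name__ == "__main__":
    for position in (3, 7, 14):
        print(position, caret_candidates(chars, position))
```

Inside a uni-directional run both candidates coincide, for example at position 3. Positions 7 and 14 each yield edges 7 (left of “e”) and 14 (right of “h”) with opposite cursor directions. Choosing the second candidate whenever a following character exists places 7 and 14 at the same spot and leaves the edge right of “h” unused, which matches sample (5); this is an inference from the sample, not a documented rule.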

It’s obvious that the shared positions such as 7 and 14 are always at places where text direction changes. These places delineate embedding levels, whose behavior is controlled by the rules defined by the Unicode bidirectional algorithm.

It’s beyond the scope of this article to discuss these rules and any such discussion would probably end up being just a repetition of the rules themselves.

The very short tale of four cursors

Having clarified where we want to display the cursor, we are still left with the task of how to graphically distinguish positions 7 and 14, so that the user knows what to expect when she starts typing.

There are two commonly used solutions. In the first, a small triangle is displayed at the top of the cursor, facing either left or right and signaling the direction of the current embedding level. The second is to display two cursors: one bold, signaling the dominant direction, the other opaque, marking the location of a character with the opposite directionality. The latter approach is used, e.g., in the Firefox and Google Chrome browsers.

Since we wanted to stay as close as possible to the Phrase CAT desktop editor, we’ve chosen the first option with the little triangle-like mark at the top of the cursor.

[Figure: RTL cursor in the Phrase CAT web editor]
[Figure: RTL cursor in the Phrase CAT desktop editor]
[Figure: LTR cursor in the Phrase CAT web editor]
[Figure: LTR cursor in the Phrase CAT desktop editor]
[Figure: Two types of cursors as displayed in Google Chrome console]

Inaccessible positions

Notice that one position in sample (5) is marked by ?. That’s an inaccessible position. Using that position would feel slightly unnatural, because when the user moves the cursor from position 6, it seems more likely she wants the cursor to be at the beginning of the LTR text, i.e. left of “e”. When choosing what to do in this situation, we also took inspiration from the implementation in the Qt library, which likewise prevents the cursor from being placed at certain positions. This way our users have a similar typing experience when switching between editors.

Selections

Once cursor positions were sorted out, a more formidable enemy manifested before our eyes — selections. With example (2), we discussed how a reader’s eyes don’t move in a single direction when scanning bi-directional text, but jump over a few characters and start moving in the opposite direction, e.g. when reading an English word embedded in an Arabic text. The eyes start on the right, track the Arabic word leftwards, then jump to the beginning of the English word, follow it rightwards, then jump again and so on.

We concluded that selections should preserve the visual order and follow the movement of a cursor as if the writer repeatedly pressed an arrow key and thus must be rendered discontinuously accounting for the non-linear jumps in the text.

Discontinuous selections are created as follows:

  • Given text RLLR| where | marks the original position of a cursor
  • Use keyboard shortcut shift+left and extend the selection marked by []
  • It goes like this: R|LL[R], R[L]|L[R], R|[LLR], |[RLLR]

Discontinuous selection using a mouse, dragging from right to the left:

  • R[LL|R], R[L]|L[R], R|LL[R], |[RLLR]

Notice that keyboard and mouse selections behave slightly differently. That’s because when using the keyboard the users direct the cursor to move along the visual coordinates, i.e. making jumps in text, whereas the mouse moves across all coordinates from right to left without any jumps (is that even conceivable?) and thus when crossing LTR text, it deselects previously selected text.

Notice also that the cursor position doesn’t always have to coincide with the selection borders, such as in the step R|LL[R].

To draw a discontinuous selection, we need to calculate coordinate ranges for all selected characters, or more generally, for all of the characters within a single embedding level, and then join overlapping and adjacent ranges.
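Joining the ranges is a classic interval merge. The following sketch assumes that the horizontal extent of every character is already known, indexed by logical position, and that all characters sit on one line; the function name and the 10-unit character width are made up for the example.

```python
def selection_ranges(boxes, start, end):
    """Merge the boxes of the logical characters start..end-1 into visual ranges.

    boxes[i] is the (left, right) extent of logical character i on screen."""
    picked = sorted(boxes[i] for i in range(start, end))
    merged = []
    for left, right in picked:
        if merged and left <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged


# Text RLLR in an RTL document, every character 10 units wide.
# Logical character 0 is the rightmost R, logical character 3 the leftmost R.
boxes = [(30, 40), (10, 20), (20, 30), (0, 10)]

if __name__ == "__main__":
    for end in range(1, 5):
        print(end, selection_ranges(boxes, 0, end))
```

Extending the logical selection from the start of the text one character at a time gives one range, then two separate ranges, then one range again, which corresponds to the keyboard steps RLL[R], R[L]L[R], R[LLR] and [RLLR] shown above.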

Jumps within text

From the above discussion, it follows that even quite general movements in text, such as HOME and END jumps, must be treated with respect to the bi-directional nature of the text. When HOME or END is pressed, the cursor should move to the logical start or the logical end of the text, respectively.

  • LTR: (HOME)abc(END)
  • RTL: (END)CIBARA(HOME)
  • BIDI: latin(END)CIBARA(HOME)latin

Implementation details

As mentioned above, the rich text component in the Phrase CAT web editor relies on a very low-level implementation of input and cursor handling. The blinking cursor the user sees is drawn manually: the user inputs characters into a hidden input element, we process each character in a specific way (which becomes relatively complex in case of CJK) and insert it into another element that displays rich text represented as HTML.

In order to accomplish this, we need to convert frequently between positions in text and pixel coordinates. For example, when we want to correctly place the cursor after a user clicks on the text, this has to happen:

  • get coordinates from the mouse click
  • calculate the position in text from coordinates (note that the user most likely doesn’t click exactly between two letters, so we need to find coordinates of the nearest position between letters)
  • set collapsed selection, i.e. cursor
  • calculate coordinates from position (can be cached from previous calculation)
  • draw selection/cursor
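The steps above can be sketched as follows. The sketch works in one dimension only and assumes that the x coordinate of the cursor for every logical position has already been measured and cached; the names and the tie-breaking rule (the lower logical position wins) are assumptions made for the example.

```python
def nearest_position(click_x, caret_xs):
    """Return the logical position whose cursor coordinate is closest to the click.

    caret_xs[p] is the cached x coordinate of the cursor at logical position p."""
    return min(range(len(caret_xs)), key=lambda p: (abs(caret_xs[p] - click_x), p))


def place_cursor(click_x, caret_xs):
    """Turn a click into a collapsed selection plus the coordinate to draw it at."""
    position = nearest_position(click_x, caret_xs)
    return {"anchor": position, "focus": position, "x": caret_xs[position]}


# Three characters, 10 units wide each.
ltr_carets = [0, 10, 20, 30]  # "abc"
rtl_carets = [30, 20, 10, 0]  # "ABC" in an RTL document

if __name__ == "__main__":
    print(place_cursor(14, ltr_carets))
    print(place_cursor(14, rtl_carets))
```

In bi-directional text two logical positions can share one coordinate, so a real editor needs a deliberate rule for such ties rather than the arbitrary one used here.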

The original implementation

Our original implementation for converting mouse clicks to positions in text was based on splitting text into two spans and shifting characters from one span to the other using binary search, measuring the Euclidean distance between the mouse click coordinates and the [top, left] coordinates of the rectangle that bounds the second span.

This algorithm is fairly fast and works well in uni-directional text (both LTR and RTL).

Obviously, we are relying heavily on the way the Document Object Model (DOM) is rendered on the screen and on the API that the DOM provides. When researching bugs caused by bi-directional texts, we realized that we can’t keep splitting text into two and measuring coordinates of the rectangles bounding those spans, because when bi-directional text is rendered, the linear text (now split between two spans) has to be rendered discontinuously and the bounding rectangles overlap.

Let’s illustrate that with an example. Given our familiar text in both logical and visual representations:

  • Logical: ARABIC1 en|glish ARABIC2
  • Visual: 2CIBARA en|glish 1CIBARA

we want to find the coordinates of the cursor position marked by |. We split and wrap the text in two spans and read the second element’s [top, left] coordinates. This turns into two representations, logical in HTML and visual as rendered in the browser:

  • Logical: <span>ARABIC1 en</span><span>glish ARABIC2</span>
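  • Visual: 2CIBARA [english 1CIBARA], where [] marks the bounding rectangle of the first span; the rectangle of the second span covers 2CIBARA english, so the two overlap across the whole word “english”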

The exact logical HTML representation is as follows:

HTML representing split text wrapped in spans

To see how browsers render this piece of HTML, we can open the dev tools and hover with the mouse over one of the spans, e.g. the second one, and this is what we would see in Google Chrome:

[Figure: Bounding rectangle rendered when hovering on the second span]

No wonder that users were unable to click into positions such as en|glish, because the way we calculated the cursor position was completely unreliable in bi-directional text.
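The overlap can be reproduced without a browser. The sketch below assumes that the bounding rectangle of a span is the union of the boxes of its characters, that every character is 10 units wide, and that the text sits on a single line. The short text “AB cd EF” and its logical-to-visual mapping are worked out by hand for an RTL paragraph, where it is displayed as “FE cd BA”.

```python
def bounding_range(visual_indices, char_width=10):
    """Horizontal extent of the smallest rectangle covering all given characters."""
    left = min(visual_indices) * char_width
    right = (max(visual_indices) + 1) * char_width
    return (left, right)


def overlap(a, b):
    """Return the shared part of two ranges, or None if they do not overlap."""
    left, right = max(a[0], b[0]), min(a[1], b[1])
    return (left, right) if left < right else None


# logical_to_visual[i] is the on-screen column of logical character i.
bidi_map = [7, 6, 5, 3, 4, 2, 1, 0]  # "AB cd EF" shown as "FE cd BA"
ltr_map = [0, 1, 2, 3, 4, 5]         # "abcdef" shown as is

split = 4  # "AB c" goes into the first span, "d EF" into the second
first = bounding_range(bidi_map[:split])
second = bounding_range(bidi_map[split:])

if __name__ == "__main__":
    print(first, second, overlap(first, second))
    print(overlap(bounding_range(ltr_map[:3]), bounding_range(ltr_map[3:])))
```

In the bi-directional case both rectangles contain the whole word “cd”, so the top-left corner of the second span says nothing about where the split point is. In the uni-directional case the rectangles merely touch.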

Feel free to play with the following JSFiddle snippet to see for yourself.

Overlapping bounding rectangles in bi-directional text

The new solution

While wrapping our heads around RTL and bi-directionality and learning how it’s handled by browsers, we experimented with a few approaches and concluded that the following is the most versatile for our case.

We found that characters wrapped in a span element behave similarly to regular characters, i.e. they retain bi-directional behavior. In order to measure character coordinates reliably, we needed to wrap characters in spans individually, and it turned out to be most efficient to wrap all characters at once and inject the resulting fragment into the DOM to limit jank. In particular, to be able to quickly draw the discontinuous selections, we need all character coordinates pre-calculated and cached.

Once we wrap all characters, we retrieve their coordinates and map them to positions in text, i.e. visual positions.
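A rough sketch of the idea, written in Python for brevity: the markup for all characters is built in one pass, and the coordinates read back from the browser are stored in a lookup table. The data-pos attribute, the function names, and the shape of the measured data are assumptions for illustration; in the browser the measuring step itself would be done with DOM APIs.

```python
from html import escape


def wrap_characters(text: str) -> str:
    """Build one HTML fragment in which every character has its own span."""
    return "".join(
        f'<span data-pos="{index}">{escape(ch)}</span>'
        for index, ch in enumerate(text)
    )


def cache_coordinates(measured):
    """measured: (pos, left, right) tuples, e.g. read back from the rendered spans."""
    return {pos: (left, right) for pos, left, right in measured}


if __name__ == "__main__":
    print(wrap_characters("a<b"))
    print(cache_coordinates([(0, 30, 40), (1, 10, 20)]))
```

Building the whole fragment as a single string means the document is touched once per change instead of once per character.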

This new approach worked well. With a few optimizations and judicious caching, we were able to achieve reasonable speed when drawing the cursor, selections, and other highlights such as spellcheck errors.

Directional marks

However, our task was not done. In order to allow translators to work with bi-directional text efficiently, a text editor has to provide convenient means to adapt the bi-directional flow of the text to the specific needs of the author. We’ve included an option to insert two so-called directional marks.

We chose the Right-to-Left mark (RLM) and the Left-to-Right mark (LRM) as the simplest and arguably most versatile to use. These marks are regular characters but have no size, i.e. they are invisible. They act as generic RTL and LTR characters, respectively.

In order to understand how these characters are used and why writers need to adapt the order of characters in bi-directional texts, we have to explain in more detail what happens with certain groups of characters, known as neutral and weak. We’ve already illustrated what happens around neutral characters such as “space” appearing in bi-directional text.

What happens with these characters, i.e. where they are displayed, is controlled by a number of rules, which we can greatly simplify into one rule: characters with ambiguous directionality inherit direction from their context. A context is, e.g., a document, a paragraph or characters with strong directionality surrounding them. That’s where directional marks come into play.

When an LTR phrase such as “64 bit” in RTL context is rendered on screen, e.g., “ARABIC1 64 bit ARABIC2”, the order of those two “English” words is flipped and the user sees:

2CIBARA bit 64 1CIBARA

This is, however, unnatural and, above all, wrong. The phrase “64 bit” should be rendered as LTR, but the space in the phrase is disambiguated as an RTL character, because it’s surrounded by strong RTL characters. The Unicode bi-directional algorithm has no way of guessing that “64 bit” should be displayed in the same order as its logical order and instead uses a few rules to find the most reasonable way to render the whole sentence.

The disambiguation process can be simplistically represented as a list of character types RNWWNLLLR, where both Arabic words are represented by a single R to save space, numbers are weak characters W, spaces are neutral characters N, and LTR characters are L. The weak characters don’t contribute to the disambiguation and thus both spaces are treated as RTL, i.e. RRWWRLLLR. For comparison, the phrase “a 64 bit” would be rendered correctly, because those two spaces are surrounded by LTR context, i.e. the letters “a” and “b”.

In ambiguous cases, the user has to manually correct the flow of the characters by placing invisible characters with special meaning. In the above phrase, the user can insert the LRM directional mark in front of “64”. It acts as a generic LTR character and encloses the number and the space in “64 bit” between LTR characters, just like the letter “a” we used above. The fully disambiguated symbolic representation of such a sentence would then be RRLLLLLLR, i.e. all the weak and neutral characters inside the phrase are treated as LTR characters and the sentence is displayed as expected.
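The symbolic resolution can be replayed with a small function that works directly on strings of type letters. It is a simplification loosely modelled on rules W7, N1 and N2 of the Unicode bidirectional algorithm, assuming a single paragraph without explicit embeddings and only the four letters used in this article; the function name is hypothetical.

```python
def resolve(types: str, base: str = "R") -> str:
    """Resolve W (number) and N (neutral) in a string over the letters L, R, W, N."""
    out = list(types)

    # Numbers that follow a strong L (searching backwards) become L.
    last_strong = base
    for i, t in enumerate(out):
        if t in "LR":
            last_strong = t
        elif t == "W" and last_strong == "L":
            out[i] = "L"

    def side(t):
        # For resolving neutrals, a number that stayed weak counts as R.
        return "R" if t == "W" else t

    # Neutrals take the direction both sides agree on, otherwise the base direction.
    i, n = 0, len(out)
    while i < n:
        if out[i] != "N":
            i += 1
            continue
        j = i
        while j < n and out[j] == "N":
            j += 1
        before = side(out[i - 1]) if i > 0 else base
        after = side(out[j]) if j < n else base
        direction = before if before == after else base
        out[i:j] = direction * (j - i)
        i = j
    return "".join(out)


if __name__ == "__main__":
    print(resolve("RNWWNLLLR"))   # "ARABIC1 64 bit ARABIC2"
    print(resolve("RNLWWNLLLR"))  # the same with an LRM (type L) in front of "64"
    print(resolve("LNWWNLLL"))    # "a 64 bit"
```

Without the mark, both spaces resolve to R and the numbers stay weak. With the mark counted as one extra L, the numbers and the second space all resolve to L, so the phrase forms a single LTR run.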

Non-printable characters

The web editor supports displaying a number of non-printable characters such as new lines, spaces, non-breaking spaces, etc. The invisible directional marks were the next obvious candidates to ease working with bi-directional text.

As mentioned above, the web editor needs to calculate coordinates of each character to be able to display cursor, selections, and highlights. When rendering invisible, i.e. zero-width, characters such as LRM or RLM, we need to replace the character with the visible symbol we are using for display. This works great in uni-directional text, but needs to be handled specifically in text which mixes scripts of two opposing directions.

Using our simplified notation, we need to replace the invisible LRM character in R ^64 bit R, where ^ stands for LRM, with the visible symbol and obtain R ↱64 bit R. The symbol is, however, not a strong LTR character, and simple replacement would cause incorrect rendering in the browser. As we explained earlier, we wrap each character in a <span> to be able to obtain its exact coordinates, which means that we need the whole element <span>↱</span> to behave as LRM, i.e. as an LTR character. This is accomplished by inserting the real LRM character in front of the span, like so: ^<span>↱</span>.
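A minimal sketch of this substitution, again in Python for brevity. It only handles the directional marks and leaves other characters unwrapped. The symbol ↱ for LRM comes from the text above; the symbol ↰ for RLM and the function name show_marks are assumptions made for the example.

```python
import unicodedata
from html import escape

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Visible stand-ins; the RLM symbol is a guess for illustration only.
SYMBOLS = {LRM: "\u21b1", RLM: "\u21b0"}


def show_marks(text: str) -> str:
    """Render directional marks as visible symbols without losing their effect."""
    parts = []
    for ch in text:
        if ch in SYMBOLS:
            # Keep the real mark in front of the span so the strong direction survives.
            parts.append(f"{ch}<span>{SYMBOLS[ch]}</span>")
        else:
            parts.append(escape(ch))
    return "".join(parts)


if __name__ == "__main__":
    # The mark is a strong character, the arrow symbol is not.
    print(unicodedata.bidirectional(LRM), unicodedata.bidirectional(SYMBOLS[LRM]))
    print(show_marks("R " + LRM + "64 bit R"))
```

The first line of output shows why the symbol alone is not enough: the mark has the strong class L, while the arrow is a neutral character.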

Conclusion

It doesn’t cease to amaze me what interesting challenges working with rich text in browsers offers, even more so when one has to deal with scripts other than Latin, such as the scripts of Semitic languages or the languages of the Far East, typically grouped under the term CJK. But more on that in a later post…

Links
