Bi-directional Text For Beginners
You see an input field with some Arabic text in it…
Naturally, you think to yourself: I should type something between those words. You do so, and something strange happens…
The Arabic words have switched places. If that makes sense to you — please stop reading now.
The Arabic word for man (rajul) is spelt ر (raa’), ج (jeem), ل (laam). Arabic, however, is written from right to left, so you would write it as رج ل … except you’d actually write it as رجل because Arabic letters change shape depending on where they are in a word. That, however, is a whole different topic.
How it is written, though, does not affect how it is stored. Characters in a string are stored sequentially in memory, but memory is not organised from left-to-right (LTR) or right-to-left (RTL). Since you are reading this in a left-to-right blog post, it is most intuitive to show items in memory organised from left-to-right, e.g.
But it’s just as correct (and more intuitive for some) to show items from right-to-left, e.g.
To print RTL text, we start from the right side of the container, and print each character in sequence. No magic required, e.g.
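As a rough illustration of the last few paragraphs, here is a minimal Python sketch. It lists the code points of رجل in the order they are stored, then places them into a row of cells starting from the rightmost cell. The helper name `place_right_to_left` and the five-cell row are assumptions made up for this example.

```python
# The Arabic word for man, written with escapes so the source reads the same
# in any editor: raa' (U+0631), jeem (U+062C), laam (U+0644).
word = "\u0631\u062c\u0644"

# Storage order is simply index order; it has no left or right.
stored = [f"U+{ord(ch):04X}" for ch in word]
print(stored)


def place_right_to_left(chars, width):
    """Hypothetical helper: fill a row of `width` cells from the right edge."""
    if len(chars) > width:
        raise ValueError("row is too narrow for the text")
    row = [None] * width
    for index, ch in enumerate(chars):
        # The first stored character lands in the rightmost cell.
        row[width - 1 - index] = ch
    return row


print(place_right_to_left(stored, 5))
```

The first stored character (raa’) ends up in the rightmost cell and the last (laam) furthest to the left, which is all that "printing from the right" means here.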
But what if we want to print the English word “man”, followed by the Arabic word for man? In memory that string will look like:
If we print all of those characters from left-to-right, the Arabic word will be wrong. If we print them from right-to-left, the English word will be wrong.
Unicode solves this problem by allowing individual characters to specify their directionality, e.g.
Both English (i.e. Latin) and Arabic letters are classified as strong characters because they specify a direction.
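You can inspect this property yourself. The sketch below uses Python's standard `unicodedata.bidirectional` function, which returns the bidirectional class assigned to a character: `L` for strong left-to-right characters, and `R` or `AL` for strong right-to-left ones (Arabic letters are `AL`). The helper `strong_direction` is a name invented for this example.

```python
import unicodedata


def strong_direction(ch):
    """Hypothetical helper: map a character's bidi class to a direction.

    Returns "LTR" or "RTL" for strong characters and None for anything else.
    """
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return None


# "man" followed by the Arabic word for man (raa', jeem, laam).
for ch in "man\u0631\u062c\u0644":
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), strong_direction(ch))
```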
To print such a string, the Bi-directional (BiDi) algorithm is used to determine the correct ordering of characters. The first step is to choose a base direction. Then we iterate through each character: if the character has the same direction as the base direction, we print it. If it has the opposite direction, we put it aside in a buffer. When we reach a character that has the base direction (or the end of the string), we reverse what is in the buffer and print it.
If we want to print our example string in an LTR container, then our base direction will be LTR. m, a and n will be printed as they match the base direction. ر doesn’t match the base direction, so we start buffering until we reach the end of the string, where we reverse and print the buffered characters. The Arabic letters will become an embedded segment of RTL in an otherwise LTR string:
If the base direction is RTL, then m, a and n will be buffered. When we reach ر, we reverse that buffer (i.e. n, a, m) and print it. The remaining characters match the base direction and so are printed in order. The English letters will have become an embedded segment of LTR in an otherwise RTL string:
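The buffering procedure described above can be sketched in a few lines of Python. This is a toy version of the simplified description in this post, not the full Unicode algorithm: it assumes the string contains only strong characters (anything else is simply treated as having the base direction), and the function names are invented for the example.

```python
import unicodedata


def direction(ch, base):
    """Direction of a strong character; anything else falls back to the base."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return base  # simplification: neutral characters are not handled here


def print_sequence(text, base):
    """Return the characters in the order they get printed.

    Printing starts at the left edge for an LTR base and at the right edge
    for an RTL base.
    """
    printed, buffer = [], []
    for ch in text:
        if direction(ch, base) == base:
            # Back in the base direction: flush the buffered run, reversed.
            printed.extend(reversed(buffer))
            buffer.clear()
            printed.append(ch)
        else:
            buffer.append(ch)
    printed.extend(reversed(buffer))  # end of string: flush what is left
    return printed


def left_to_right_cells(text, base):
    """The same result, always read from the left edge of the container."""
    printed = print_sequence(text, base)
    return printed if base == "LTR" else printed[::-1]


example = "man\u0631\u062c\u0644"  # "man" + raa', jeem, laam
print(print_sequence(example, "LTR"))       # m, a, n, laam, jeem, raa'
print(print_sequence(example, "RTL"))       # n, a, m, raa', jeem, laam
print(left_to_right_cells(example, "RTL"))  # laam, jeem, raa', m, a, n
```

In both cases "man" reads left to right and رجل reads right to left; only the position of the two segments relative to each other changes with the base direction.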
Characters like spaces are classified as neutral characters because they are used in both LTR and RTL text. Their direction is determined by the characters on either side of them.
The Arabic for ‘pencil’ is the word قلم (qalam) followed by the word رصاص (ressawss). Consider the sentence: “Pencil is قلم رصاص”.
The first space is surrounded on both sides by LTR characters, so it’s treated as LTR. The second space has an LTR character on one side and an RTL character on the other, so it takes the base direction. The third space is surrounded by RTL characters, so it’s treated as RTL.
When we print this string from left-to-right, the algorithm preserves the ordering of the entire RTL segment:
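Here is a hedged Python sketch of that rule for the pencil sentence. It resolves each neutral character from the nearest strong character on either side, then reverses every RTL run for display in an LTR container. It is a simplification of what the real algorithm does, and the function names are made up for this example.

```python
import unicodedata
from itertools import groupby


def strong_direction(ch):
    """Return "LTR" or "RTL" for strong characters, None for the rest."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return None


def resolve_directions(text, base):
    """Give every character a direction, resolving neutrals from neighbours."""
    strong = [strong_direction(ch) for ch in text]
    resolved = list(strong)
    for i, d in enumerate(strong):
        if d is None:
            # Nearest strong character on each side; the edges of the
            # string count as the base direction (an assumption).
            before = next((x for x in reversed(strong[:i]) if x), base)
            after = next((x for x in strong[i + 1:] if x), base)
            resolved[i] = before if before == after else base
    return resolved


def visual_order_ltr(text):
    """Left-to-right display order in an LTR container."""
    dirs = resolve_directions(text, "LTR")
    cells = []
    for d, group in groupby(zip(dirs, text), key=lambda pair: pair[0]):
        run = [ch for _, ch in group]
        cells.extend(reversed(run) if d == "RTL" else run)
    return "".join(cells)


# "Pencil is " followed by qalam, a space, and the second Arabic word.
sentence = "Pencil is \u0642\u0644\u0645 \u0631\u0635\u0627\u0635"
dirs = resolve_directions(sentence, "LTR")
print([dirs[i] for i, ch in enumerate(sentence) if ch == " "])
print([f"U+{ord(ch):04X}" for ch in visual_order_ltr(sentence)])
```

The three spaces resolve to LTR, LTR (the base direction) and RTL, so the two Arabic words and the space between them form a single RTL run that is reversed as a whole.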
The system and digits we use for writing numbers in English are based on the Hindu–Arabic numeral system. Thus numbers can be written in Arabic in the same way that they’re written in English. For the purposes of printing, they can be considered LTR, but if they were strong LTR characters, then they would break up the RTL text surrounding them. For example, the Arabic for “he brought 123 men” is…
The BiDi algorithm will print the sequence of digits “123” from left-to-right, but will not break the right-to-left directionality of the surrounding Arabic:
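Unicode reflects this by giving digits their own class rather than the strong class `L`: `unicodedata.bidirectional("1")` returns `EN` (European Number). The sketch below shows the visible effect on an RTL segment displayed in an LTR container: the segment is reversed, but each run of digits keeps its own left-to-right order. The sample string is an arbitrary pairing of two Arabic words from this post around "123", not the sentence above, and the function name is invented for this example.

```python
import re
import unicodedata

# Digits are not strong LTR characters.
print(unicodedata.bidirectional("1"))  # EN
print(unicodedata.bidirectional("m"))  # L


def visual_order_of_rtl_segment(segment):
    """Hypothetical helper: display order of an RTL segment, read from the left.

    The whole segment is reversed, then every run of ASCII digits is flipped
    back so that the number still reads left to right.
    """
    reversed_segment = segment[::-1]
    return re.sub(r"[0-9]+", lambda m: m.group()[::-1], reversed_segment)


# qalam, "123", rajul: an illustrative string, not a meaningful sentence.
segment = "\u0642\u0644\u0645 123 \u0631\u062c\u0644"
print([f"U+{ord(ch):04X}" for ch in visual_order_of_rtl_segment(segment)])
```

The first Arabic word still ends up on the right and the second on the left, so the digits have not split the RTL segment in two.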
Unfortunately the BiDi algorithm can’t always know how a string is intended to be printed. Consider the following example where the intention is to print the Arabic characters individually in sequence:
The BiDi algorithm sees the Arabic letters and their contained spaces as an RTL segment and prints them in reverse:
Not what we intended. The solution is to insert characters which explicitly set the direction. In this case we can use the LRO (Left-to-Right Override) and PDF (Pop Directional Formatting) characters to override the right-to-left directionality of the Arabic characters, e.g.
Which prints as:
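In code, the override amounts to wrapping the text in the two control characters: LRO is U+202D and PDF is U+202C. The Python sketch below builds such a string and lists its code points. The sample text (the three letters of رجل separated by spaces) and the helper name `force_left_to_right` are assumptions for illustration.

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def force_left_to_right(text):
    """Hypothetical helper: wrap text so it is laid out strictly left to right."""
    return f"{LRO}{text}{PDF}"


# raa', jeem, laam as individual letters separated by spaces.
letters = "\u0631 \u062c \u0644"
overridden = force_left_to_right(letters)

for ch in overridden:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Both control characters are invisible, so the stored string is two characters longer than what the reader sees. How the result looks on screen depends on the renderer supporting the BiDi control characters.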
Let’s return to the example that we started with. The initial string is made up of two Arabic words:
When that is printed from left-to-right, both words are treated as a single segment of RTL text, and the order of the words is preserved. The BiDi algorithm will buffer everything up to the end of the string, and reverse it:
However, if we insert a strong LTR character in the middle of the string:
Then the BiDi algorithm will treat the Arabic words as two separate segments of RTL text, and their ordering is not preserved:
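The split can be made visible with a small Python sketch that groups a string into directional runs for an LTR container. It uses the same simplified neutral rule as earlier in the post (a space takes the direction of its neighbours if they agree, otherwise the base direction). The two Arabic words are borrowed from the pencil example and the inserted letter "x" is arbitrary; the function names are invented here.

```python
import unicodedata
from itertools import groupby


def strong_direction(ch):
    """Return "LTR" or "RTL" for strong characters, None for the rest."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return None


def runs_ltr(text):
    """Split text into (direction, substring) runs for an LTR container."""
    strong = [strong_direction(ch) for ch in text]
    dirs = []
    for i, d in enumerate(strong):
        if d is None:
            # Neutral: compare the nearest strong neighbours on each side.
            before = next((x for x in reversed(strong[:i]) if x), "LTR")
            after = next((x for x in strong[i + 1:] if x), "LTR")
            d = before if before == after else "LTR"
        dirs.append(d)
    return [
        (d, "".join(ch for _, ch in group))
        for d, group in groupby(zip(dirs, text), key=lambda pair: pair[0])
    ]


two_words = "\u0642\u0644\u0645 \u0631\u0635\u0627\u0635"
with_latin = "\u0642\u0644\u0645 x \u0631\u0635\u0627\u0635"

print(len(runs_ltr(two_words)))                  # 1
print([d for d, _ in runs_ltr(with_latin)])      # ['RTL', 'LTR', 'RTL']
```

With one run, the whole string is reversed and the first word appears on the right. With three runs, the runs are laid out left to right and only the letters inside each RTL run are reversed, so the first word now appears on the left.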