Some trivial knowledge about Unicode

13 min read · Jul 15, 2022

Recently, because of a product requirement, I took a closer look at Unicode. Here are a few small things I learned that are worth writing down.

Before reading on, you may want to think about a few questions:

Does an emoji have to be a single character?
How are the different skin tone versions of an emoji encoded?
Can I filter out all format characters when filtering text entered by the user?
Are all format characters invisible?
If a username contains styled characters like 𝔸, how can I convert them to A without writing a simple letter mapping?

After reading this post, you should have the answers.

I recommend reading my earlier posts first:

Introduction to character encoding

Almost every programmer will encounter the garbled characters except for programmers in English-speaking countries. We…


Introduction to Unicode equivalence and normalization

This article mainly introduces the concepts related to Unicode equivalence and Unicode normalization.


(Depending on the fonts on your device, some characters in this article may not display properly. I found that Chrome on Mac, Safari on Mac, and Safari on iPhone all render slightly differently. Because Medium has poor support for tables, some tables in this article are screenshots of tables in Yuque, taken in Chrome on Mac.)

Variant form

A variant form is a different glyph for a character, encoded in Unicode through the mechanism of variation sequences: sequences in Unicode that consist of a base character followed by a variation selector character.

Emoji variation sequences

Emoji-style & text presentation
The most common variant form we see is the emoji-style variation.

Some emojis have two presentations: text presentation and emoji presentation.

The desired presentation is specified by following the base emoji with either U+FE0E (VARIATION SELECTOR-15, hereinafter referred to as VS-15) for text style or U+FE0F (VARIATION SELECTOR-16, hereinafter referred to as VS-16) for emoji style.

A base emoji character on its own is rendered in either text or emoji style, depending on your device. If the requested variant is not available, the style of the base character is displayed. The following table lists several emoji as examples.

Note that results may differ across devices.

Generally, newer emojis do not have a text style, although the fonts on your device may still provide one.
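To make the mechanism concrete, here is a minimal Python sketch that builds the text-style and emoji-style sequences for U+2764. The helper name with_presentation is my own, not part of any standard API, and how each string is drawn still depends on the fonts on your device.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: requests emoji presentation


def with_presentation(base: str, emoji_style: bool) -> str:
    """Append VS-16 or VS-15 to a base character (hypothetical helper)."""
    return base + (VS16 if emoji_style else VS15)


heart = "\u2764"  # HEAVY BLACK HEART
text_heart = with_presentation(heart, emoji_style=False)
emoji_heart = with_presentation(heart, emoji_style=True)

# Print each string together with the code points it is made of.
for s in (heart, text_heart, emoji_heart):
    print(s, [f"U+{ord(c):04X}" for c in s])

print(unicodedata.name(VS15))
print(unicodedata.name(VS16))
```

Both variants are two code points long, which is why an emoji is not necessarily a single character.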

Emoji modifiers
Another common kind of variation on emojis is skin tone.

For example, U+1F385 (🎅) Santa Claus has five other versions of different skin tones in addition to the basic yellow version: 🎅🏻🎅🏼🎅🏽🎅🏾🎅🏿.

According to the Fitzpatrick scale, human skin color can be divided into six types from light to dark. Unicode defines five corresponding emoji modifiers, officially named EMOJI MODIFIER FITZPATRICK TYPE-1-2 through EMOJI MODIFIER FITZPATRICK TYPE-6 (U+1F3FB..U+1F3FF; types 1 and 2 share a single modifier), as shown in the following table:

So emojis of different skin tones are actually basic emoji characters followed by an EMOJI MODIFIER FITZPATRICK TYPE character.

Because font support differs between devices, you may instead see a base emoji followed by a separate skin-color swatch. This is normal: it means that your font does not support the skin tone variants of that emoji.
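As a small illustration, the following Python sketch lists the five modifiers by their official names and appends each one to Santa Claus. The function strip_skin_tone is a hypothetical helper of mine that shows one way to get back to the base emoji.

```python
import unicodedata

SANTA = "\U0001F385"  # the Santa Claus emoji from the example above

# The five Fitzpatrick modifiers occupy U+1F3FB..U+1F3FF.
MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]

for m in MODIFIERS:
    # Base emoji + modifier = one skin tone variant (if the font supports it).
    print(f"U+{ord(m):04X}", unicodedata.name(m), SANTA + m)


def strip_skin_tone(text: str) -> str:
    """Remove Fitzpatrick modifiers, leaving the base emoji (hypothetical helper)."""
    return "".join(c for c in text if c not in MODIFIERS)
```

For example, strip_skin_tone(SANTA + MODIFIERS[2]) gives back the plain yellow Santa.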

East Asian punctuation positional variants

Fullwidth East Asian punctuation has two variant forms: a corner-justified form and a centered form. A base character followed by U+FE00 (VARIATION SELECTOR-1, referred to as VS-1) represents the corner-justified form; a base character followed by U+FE01 (VARIATION SELECTOR-2, referred to as VS-2) represents the centered form.

It is called corner-justified because the symbol sits on the left when the text is typeset horizontally and on the right when the text is typeset vertically.

Since my computer can’t render these variants well, I won’t give an example here.

Mongolian variant forms

A Mongolian character has several variants. Which one appears depends on where the character is located in a word: isolate, initial, medial, or final.

Unicode defines four free variation selectors (FVS) to select among the variants:

U+180B (MONGOLIAN FREE VARIATION SELECTOR ONE, referred to as FVS1)

U+180C (MONGOLIAN FREE VARIATION SELECTOR TWO, referred to as FVS2)

U+180D (MONGOLIAN FREE VARIATION SELECTOR THREE, referred to as FVS3)

U+180F (MONGOLIAN FREE VARIATION SELECTOR FOUR, referred to as FVS4)

The FVS characters are required for Mongolian. If you filter users’ Mongolian input, you must not filter them out.

Emoji Sequences

The emoji variation sequences introduced above are one type of emoji sequence. Here are two more.

Keycaps

Any one of the characters 0 to 9, # or *, followed by U+20E3 (COMBINING ENCLOSING KEYCAP), forms a keycap emoji. In practice VS-16 (U+FE0F) sits between the two, so U+0039 (9), U+FE0F, and U+20E3 together give 9️⃣.
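A minimal Python sketch of building a keycap sequence follows. The function name keycap is my own, and inserting VS-16 between the base character and U+20E3 is an assumption on my part about the form most fonts expect.

```python
KEYCAP_BASES = "0123456789#*"
VS16 = "\uFE0F"                       # VARIATION SELECTOR-16
COMBINING_ENCLOSING_KEYCAP = "\u20E3"


def keycap(base: str) -> str:
    """Build a keycap emoji sequence from one base character (hypothetical helper)."""
    if len(base) != 1 or base not in KEYCAP_BASES:
        raise ValueError(f"not a keycap base: {base!r}")
    return base + VS16 + COMBINING_ENCLOSING_KEYCAP


nine = keycap("9")
print(nine, [f"U+{ord(c):04X}" for c in nine])
```

The result is three code points long even though it is displayed as a single emoji.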

Flags

Emojis include the flags of many countries and regions, such as the Singapore flag 🇸🇬. This emoji is actually a pair of regional indicator symbols from the Enclosed Alphanumeric Supplement block of Unicode. These symbols correspond one-to-one to the 26 English letters; for example, U+1F1E6 (🇦) corresponds to U+0041 (A).

According to ISO 3166-1 alpha-2, the country code of Singapore is SG. Put the regional indicator symbols U+1F1F8 (🇸, corresponding to S) and U+1F1EC (🇬, corresponding to G) next to each other and they become 🇸🇬.
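The letter-to-symbol correspondence can be sketched in a few lines of Python. The function name flag is hypothetical, and the sketch does not check that the code is a real ISO 3166-1 entry; whether a given pair renders as a flag depends on the font.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Map a two-letter code to its pair of regional indicator symbols."""
    if (len(country_code) != 2
            or not country_code.isascii()
            or not country_code.isalpha()):
        raise ValueError(f"expected two ASCII letters, got {country_code!r}")
    code = country_code.upper()
    # Each letter A..Z maps to U+1F1E6 plus its offset from 'A'.
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


sg = flag("SG")
print(sg, [f"U+{ord(c):04X}" for c in sg])
```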

Zero-width joiner

U+200D (Zero-width joiner, ZWJ) is a non-printable format character. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.

The exact behavior of the ZWJ varies depending on whether the use of a conjunct consonant or ligature is expected by default. When a ZWJ is placed between two emoji characters (or interspersed between multiple), it can result in a single glyph being shown, such as the family emoji.

Emoji ZWJ Sequences

The emoji ZWJ sequence is also a type of emoji sequence. Two or more emojis are connected through ZWJ and finally displayed as an emoji that combines their respective characteristics.

For example, take this sequence:

👩 (U+1F469)

ZWJ (U+200D)

❤ (U+2764)

VS-16 (U+FE0F). Note that this is a variation selector: combined with the preceding ❤ it becomes ❤️, and no ZWJ is needed between them.

ZWJ (U+200D)

💋 (U+1F48B)

ZWJ (U+200D)

👨 (U+1F468)

The final display is 👩‍❤️‍💋‍👨. (Different devices may render it slightly differently; for example, the positions of the man and the woman may be swapped. I’ve also recently found that Chrome renders this emoji a little strangely.)

Another example:

👩 (U+1F469)

🏿 (U+1F3FF). Note that this is an emoji modifier for skin tone: combined with the preceding 👩 it becomes 👩🏿, and no ZWJ is needed between them.

ZWJ (U+200D)

🦳 (U+1F9B3)

The final display is 👩🏿‍🦳 .

Note that such sequences cannot be combined arbitrarily: the sequence for each emoji is fixed.
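Both examples above can be reproduced in Python by joining the parts with ZWJ. This is only a sketch: the variable names are mine, and whether the result is drawn as one glyph depends on your font.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
VS16 = "\uFE0F"  # VARIATION SELECTOR-16

# woman, heart + VS-16, kiss mark, man
kiss = ZWJ.join(["\U0001F469", "\u2764" + VS16, "\U0001F48B", "\U0001F468"])

# woman + dark skin tone modifier, white hair component
woman_white_hair = ZWJ.join(["\U0001F469" + "\U0001F3FF", "\U0001F9B3"])


def code_points(s: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


print(kiss, len(kiss), code_points(kiss))
print(woman_white_hair, len(woman_white_hair), code_points(woman_white_hair))
```

Note that len() counts code points here, so the first emoji has length 8 and the second has length 4.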

Are the base character and its variant equivalent?

I introduced Unicode equivalence in Introduction to Unicode equivalence and normalization. Many of the sequences above are also a string of Unicode code points combined into one glyph, that is, a variation sequence. Is a variation sequence compatibility equivalent to anything? For example, are emoji variants compatibility equivalent to the base emoji? No. The concept of equivalence does not apply to them.

An emoji variant is just a change of glyph. There is no separate code point for it, so there is nothing for it to be equivalent to.

Equivalence mainly matters for text search and text processing. The simplest way to search for variation sequences is to ignore the variation selectors.
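Here is a minimal sketch of "ignore the variation selectors when searching" in Python. The helper names are my own. The pattern covers U+FE00..U+FE0F plus the Variation Selectors Supplement (U+E0100..U+E01EF), which this article does not discuss and which I include as an assumption; the Mongolian FVS characters are deliberately left out.

```python
import re
import unicodedata

# VS-1..VS-16 and the supplementary variation selectors (an added assumption).
VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")


def strip_variation_selectors(text: str) -> str:
    """Return text with variation selectors removed, for use as a search key."""
    return VARIATION_SELECTORS.sub("", text)


def contains_ignoring_variants(haystack: str, needle: str) -> bool:
    """Substring search that treats a base character and its variants alike."""
    return strip_variation_selectors(needle) in strip_variation_selectors(haystack)


emoji_heart = "\u2764\uFE0F"
# Normalization leaves the variation sequence untouched...
print(unicodedata.normalize("NFKD", emoji_heart) == emoji_heart)
# ...so the search has to drop the selector itself.
print(contains_ignoring_variants("I " + emoji_heart + " Unicode", "\u2764"))
```

The stripped string is meant only as a comparison key; the text that is stored and displayed should keep its selectors.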

Zero-width non-joiner

U+200C (Zero-width non-joiner, ZWNJ) is also a non-printable format character. It does the opposite of ZWJ: inserted between characters, it prevents them from joining and so avoids ligatures.

For languages such as Persian, Malay, German, and Nepali, ZWNJ is necessary. Omitting it can change the meaning or violate the grammar.

For example, Nepali श्रीमान्‌को has a ZWNJ (श्रीमान् [ZWNJ] को), and if you remove it, the text becomes श्रीमान्को. Notice how the two texts look different. If you ask Google to translate, the meaning is completely different.

Mongolian vowel separator

As mentioned earlier, a Mongolian character has multiple variants, and which one appears depends on its position. To avoid ambiguity, Unicode also has U+180E (MONGOLIAN VOWEL SEPARATOR, MVS for short) to separate vowels.

MVS is very important for displaying Mongolian correctly. If you need to filter users’ Mongolian content, this format character must not be filtered out.

Visible format character

The format characters seen above are invisible themselves and only affect the glyphs of other characters. In fact, a format character can also be visible, such as the Syriac Abbreviation Mark U+070F (SAM). When SAM is used, it displays a horizontal line above the text.
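To see why a blanket "remove all format characters" filter is risky, this Python sketch prints the general category of the characters discussed so far, using the standard unicodedata module. The function strip_format_characters is a deliberately naive, hypothetical filter, not a recommendation.

```python
import unicodedata

SAMPLES = {
    "ZWJ": "\u200D",
    "ZWNJ": "\u200C",
    "MVS": "\u180E",
    "SAM": "\u070F",
    "VS-16": "\uFE0F",
    "FVS1": "\u180B",
}

for label, ch in SAMPLES.items():
    # "Cf" is the general category for format characters, "Mn" for nonspacing marks.
    print(label, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "?"))


def strip_format_characters(text: str) -> str:
    """Naive filter: drop every character whose general category is Cf."""
    return "".join(c for c in text if unicodedata.category(c) != "Cf")


woman_white_hair = "\U0001F469\u200D\U0001F9B3"
# The ZWJ is removed, so the sequence falls apart into two separate emojis.
print(strip_format_characters(woman_white_hair))
```

The naive filter removes ZWJ, ZWNJ, MVS, and the visible SAM alike, while the variation selectors are not format characters by category at all and would pass through it.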

Private Use Areas

There are three Private Use Areas (PUA) in Unicode:

Private Use Area (U+E000..U+F8FF)

Supplementary Private Use Area-A (U+F0000..U+FFFFF)

Supplementary Private Use Area-B (U+100000..U+10FFFF)

Because of Supplementary Private Use Area-A and Supplementary Private Use Area-B, the High Private Use Surrogates block (U+DB80..U+DBFF) is tied to private use as well: under the UTF-16 surrogate rules, these are the high surrogates used to encode code points in those two areas. (See my article Introduction to character encoding.)

Unicode does not assign characters in these areas, but third parties can assign characters there themselves. Despite the name, third parties can make their private allocation schemes public.

Because the assignments are made privately by third parties, characters encoded in these areas are generally only displayed correctly on the platform you are currently using. If you need other people to see the same text as you, they need to use the same font as you.
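The relationship between the supplementary areas and the High Private Use Surrogates can be checked with a short Python sketch. The helper names are hypothetical; the ranges are simply the three areas listed above.

```python
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area
    (0xF0000, 0xFFFFF),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFF),  # Supplementary Private Use Area-B
]


def is_private_use(ch: str) -> bool:
    """True if the code point falls inside one of the three areas."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def utf16_units(ch: str) -> list[int]:
    """Return the UTF-16 code units (big-endian order) that encode ch."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print(is_private_use("\uF8FF"))
# The first and last code points of the two supplementary areas:
print([hex(u) for u in utf16_units("\U000F0000")])
print([hex(u) for u in utf16_units("\U0010FFFF")])
```

The high surrogates of the two supplementary areas run from 0xDB80 to 0xDBFF, which is exactly the High Private Use Surrogates block.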

For example, Apple assigns U+F8FF to Apple’s logo (). This character can only be seen when accessing this page with an Apple device. Non-Apple devices generally cannot render this character, or render it as other private characters.

Bidirectional text

(This section is mainly from Wikipedia)

Most languages have a fixed typesetting direction. Chinese and English, for example, are typeset from left to right (LTR), while Arabic and Hebrew are typeset from right to left (RTL). If all the characters in a text share one direction, typesetting is unproblematic. Strange problems can arise when text with different directions is mixed. Such text is called bidirectional text.

The Unicode standard defines a rather complicated Unicode Bidirectional Algorithm (UBA) to govern the typesetting of bidirectional text.

But the UBA also gets things wrong, and occasionally developers need to give it some hints so that it lays the text out correctly.

The Unicode standard calls for characters to be ordered ‘logically’, i.e. in the sequence they are intended to be interpreted, as opposed to ‘visually’, the sequence they appear. For this purpose, the Unicode encoding standard divides all its characters into one of four types: ‘strong’, ‘weak’, ‘neutral’, and ‘explicit formatting’.

Strong characters

Strong characters are those with a definite direction. Examples of this type of character include most alphabetic characters, syllabic characters, Han ideographs, non-European or non-Arabic digits, and punctuation characters that are specific to only those scripts.

Weak characters

Weak characters are those with vague direction. Examples of this type of character include European digits, Eastern Arabic-Indic digits, arithmetic symbols, and currency symbols.

Neutral characters

Neutral characters have direction indeterminable without context. Examples include paragraph separators, tabs, and most other whitespace characters. Punctuation symbols that are common to many scripts, such as the colon, comma, full-stop, and the no-break-space also fall within this category.
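Python exposes each character's bidirectional class through unicodedata.bidirectional, which makes it easy to check which type a character belongs to. The grouping below is my reading of the character type table in UAX #9 and the function bidi_group is a hypothetical helper; individual characters may land in a different group than an informal description suggests, so it is worth checking rather than guessing.

```python
import unicodedata

# Bidirectional classes grouped by type (assumed from the UAX #9 type table).
GROUPS = {
    "strong": {"L", "R", "AL"},
    "weak": {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"},
    "neutral": {"B", "S", "WS", "ON"},
    "explicit": {"LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"},
}


def bidi_group(ch: str) -> str:
    """Classify one character as strong, weak, neutral, or explicit."""
    bidi_class = unicodedata.bidirectional(ch)
    for group, classes in GROUPS.items():
        if bidi_class in classes:
            return group
    return "unknown"  # e.g. unassigned code points have an empty class


# Latin letter, Hebrew letter, Arabic letter, two kinds of digits, whitespace.
for ch in ["A", "\u05D0", "\u0627", "1", "\u0661", "+", "$", " ", "\t"]:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), bidi_group(ch))
```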

Explicit formatting

Explicit formatting characters, also referred to as “directional formatting characters”, are special Unicode sequences that direct the algorithm to modify its default behavior. These characters are subdivided into “marks”, “embeddings”, “isolates”, and “overrides”. Their effects continue until the occurrence of either a paragraph separator, or a “pop” character.

Marks
If a “weak” character is followed by another “weak” character, the UBA looks at the nearest neighbouring “strong” character. Sometimes this leads to unintentional display errors, which can be corrected or prevented with “pseudo-strong” characters. These Unicode control characters are called marks. A U+200E (LEFT-TO-RIGHT MARK, LRM) or U+200F (RIGHT-TO-LEFT MARK, RLM) is inserted at a location to make an adjacent weak character inherit its writing direction.

For example, in an Arabic (RTL) passage, we need to correctly display the U+2122 (™, TRADE MARK SIGN) for an English name brand (LTR) . For example, “قرأ Wikipedia™‎ طوال اليوم.” . If a LRM mark is not added after the trademark symbol, the weak character ™ will be neighbored by a strong LTR character and a strong RTL character. Hence, in an RTL context, it will be considered to be RTL, and displayed in an incorrect order (e.g. “قرأ ™Wikipedia طوال اليوم.”). (Here is a deliberate mistake to simulate viewing in an Arabic environment. Error text without LRM is displayed correctly in English/Chinese environment, only incorrect in Arabic environment)

+-----+--------------------+--------+
| LRM | LEFT-TO-RIGHT MARK | U+200E |
+-----+--------------------+--------+
| RLM | RIGHT-TO-LEFT MARK | U+200F |
+-----+--------------------+--------+

Embeddings
The “embedding” directional formatting characters are the classical Unicode method of explicit formatting, and as of Unicode 6.3, are being discouraged in favor of “isolates”. An “embedding” signals that a piece of text is to be treated as directionally distinct. The text within the scope of the embedding formatting characters is not independent of the surrounding text. Also, characters within an embedding can affect the ordering of characters outside. Unicode 6.3 recognized that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use.

+-----+-------------------------+--------+
| LRE | LEFT-TO-RIGHT EMBEDDING | U+202A |
+-----+-------------------------+--------+
| RLE | RIGHT-TO-LEFT EMBEDDING | U+202B |
+-----+-------------------------+--------+

Isolates
The “isolate” directional formatting characters signal that a piece of text is to be treated as directionally isolated from its surroundings. As of Unicode 6.3, these are the formatting characters that are being encouraged in new documents — once target platforms are known to support them. These formatting characters were introduced after it became apparent that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use. Unlike the legacy ‘embedding’ directional formatting characters, ‘isolate’ characters have no effect on the ordering of the text outside their scope. Isolates can be nested, and may be placed within embeddings and overrides.

+-----+-----------------------+--------+
| LRI | LEFT-TO-RIGHT ISOLATE | U+2066 |
+-----+-----------------------+--------+
| RLI | RIGHT-TO-LEFT ISOLATE | U+2067 |
+-----+-----------------------+--------+
| FSI | FIRST-STRONG ISOLATE  | U+2068 |
+-----+-----------------------+--------+

Overrides
The “override” directional formatting characters allow for special cases, such as for part numbers (e.g. to force a part number made of mixed English, digits and Hebrew letters to be written from right to left), and are recommended to be avoided wherever possible. As is true of the other directional formatting characters, “overrides” can be nested one inside another, and in embeddings and isolates.

+-----+------------------------+--------+
| LRO | LEFT-TO-RIGHT OVERRIDE | U+202D |
+-----+------------------------+--------+
| RLO | RIGHT-TO-LEFT OVERRIDE | U+202E |
+-----+------------------------+--------+

Pops
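The “pop” directional formatting characters terminate the scope of the most recent “embedding”, “override”, or “isolate”. PDF closes embeddings and overrides; PDI closes isolates.

+-----+----------------------------+--------+
| PDF | POP DIRECTIONAL FORMATTING | U+202C |
+-----+----------------------------+--------+
| PDI | POP DIRECTIONAL ISOLATE    | U+2069 |
+-----+----------------------------+--------+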

Runs

In the UBA, each sequence of concatenated strong characters is called a “run”. A “weak” character that is located between two “strong” characters with the same orientation will inherit their orientation. A “weak” character that is located between two “strong” characters with a different writing direction will inherit the main context’s writing direction (in an LTR document the character will become LTR, in an RTL document, it will become RTL).

How to handle explicit formatting characters when filtering user input

The directional formatting characters above can affect the surrounding text, so we need to be especially careful with text entered by users, such as usernames.

Users can choose their own usernames. If a username takes part in typesetting, especially if it is inserted into a piece of text, a user can insert a control character such as RLO to make the text containing their name render abnormally.

If you want to filter directional formatting characters, I personally think that everything except marks should be filtered out, so that they cannot interfere with the surrounding layout.

If you are familiar with the languages of your app’s target markets, for example the Chinese and English markets, it is safest to filter characters with a whitelist. If you serve many languages, you can only use a blacklist: find one problem case, ban one character. After all, we are not Unicode experts and don’t know which format characters are required for which languages.
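The "filter everything except marks" suggestion could look like the following Python sketch. The function name is hypothetical, and the ranges are the embedding, override, isolate, and pop characters: U+202A..U+202E and U+2066..U+2069.

```python
import re

# U+202A..U+202E: LRE, RLE, PDF, LRO, RLO
# U+2066..U+2069: LRI, RLI, FSI, PDI
# LRM (U+200E) and RLM (U+200F) are intentionally not in the pattern.
BIDI_CONTROLS = re.compile("[\u202A-\u202E\u2066-\u2069]")


def strip_bidi_controls(username: str) -> str:
    """Remove embeddings, overrides, isolates, and pops; keep the marks."""
    return BIDI_CONTROLS.sub("", username)


# A username that tries to reverse the text that follows it with RLO.
suspicious = "alice\u202E"
print(repr(strip_bidi_controls(suspicious)))
print(repr(strip_bidi_controls("Wikipedia\u2122\u200E")))  # LRM survives
```

An alternative worth considering is to wrap the username in an isolate at display time instead of deleting characters from it, but that depends on the rendering platform supporting isolates.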

How to remove styles from styled text

The styles here refer specifically to characters in the Mathematical Alphanumeric Symbols block, such as U+1D538 (𝔸), and to some fullwidth characters in the Halfwidth and Fullwidth Forms block, such as U+FF21 (Ａ). If you need to convert a U+1D538 (𝔸) or U+FF21 (Ａ) entered by the user into U+0041 (A) and don’t want to hand-write a pile of mappings in code, how can you do it?

First, we need to pin down the scope: only convert Mathematical Alphanumeric Symbols and the fullwidth characters we know about.

As described in Introduction to Unicode equivalence and normalization, these characters are mostly compatibility equivalent to their basic counterparts. So apply NFKD to these characters, keep only the base characters from the result, and replace the original characters with them.

Note that this operation must not be applied to all text. If U+00C3 (Ã) were processed this way, NFKD would decompose it into U+0041 (A) plus U+0303 (COMBINING TILDE), and keeping only the base character would leave U+0041 (A). That goes beyond removing the style from styled text.
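A minimal Python sketch of this scoped conversion follows. The function names are mine, and the two ranges are my assumption about a reasonable scope: the Mathematical Alphanumeric Symbols block (U+1D400..U+1D7FF) and the fullwidth ASCII variants (U+FF01..U+FF5E). Styled letters that live in other blocks are not covered.

```python
import unicodedata


def in_scope(ch: str) -> bool:
    """True for characters we have decided to un-style (assumed scope)."""
    cp = ord(ch)
    return 0x1D400 <= cp <= 0x1D7FF or 0xFF01 <= cp <= 0xFF5E


def remove_style(text: str) -> str:
    """Apply NFKD only to in-scope characters and leave everything else alone."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if in_scope(ch) else ch
        for ch in text
    )


styled = "\U0001D538\uFF21\u00C3"  # double-struck A, fullwidth A, A with tilde
plain = remove_style(styled)
print(plain, [f"U+{ord(c):04X}" for c in plain])
```

Because the character with the tilde is outside the scope, it passes through unchanged, while both styled letters become a plain A.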

Chinese version:

一点Unicode冷知识

Recently, because of a business requirement, I took a look at the Unicode character set. Here are a few small points worth writing down.


Written by Wan Xiao