The technology behind beautiful text. Part 1 of 3.

Andrew Asadchy
12 min read · May 9, 2023


We communicate through screens every day, and text remains the main medium of on-screen communication. The goal of this article is to describe how text appears on screens. I wanted to build a clear picture of the whole process, highlight its nuances, and provide useful takeaways. There are many great articles on the topic, but each covers only part of the subject. This is my attempt to create a consistent, structured description of how all the technologies behind text work together. I hope it will be useful for a wide range of readers.

The publication consists of 3 chapters:
1. Unicode and Font Fallback (decoding the data, selecting a font to display)
2. OpenType (selecting glyphs, applying features, rendering)
3. Layout and Reading (how font choice and layout affect readability)

1. Unicode and Font Fallback

Unicode and Encoding

The idea of Unicode is ingeniously simple. Unicode creates a global standard that defines a unique address for every possible character in all possible languages.

For example: “Latin Small Letter A” (a) occupies position 97 in the Universal Character Set, and “Hiragana Letter Small A” (ぁ) occupies position 12 353.

This way, any device can effectively “communicate” with any other device as long as both support Unicode, no matter what operating system or language is set.

Actually, sending the character’s number is not enough. Although humans have agreed to write these Unicode addresses as “U+” followed by the number in hex format, machines still operate with bytes.

Let’s say we want to send “ぁ”, with the address 12 353 (00110000 01000001). As it takes two bytes, we need to know whether it is one “ぁ” symbol or two different characters: “0” (00110000) and “A” (01000001). We need to establish additional rules about how to encode and decode these numbers. These rules are called Encoding formats.
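To make this concrete, here is the ambiguity reproduced in a few lines of Python:

```python
# The code point of "ぁ" is 12 353 (hex 0x3041), i.e. the raw bytes 0x30 0x41.
cp = ord("ぁ")
print(cp)                    # 12353
raw = cp.to_bytes(2, "big")  # the two bytes 0x30, 0x41
# Without an agreed encoding, the very same bytes also read as ASCII "0A":
print(raw.decode("ascii"))   # 0A
```

The bytes themselves carry no hint about whether they are one two-byte character or two one-byte characters; only the encoding rules decide.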

UTF-8

There are several of them for Unicode. UTF-32 uses the simplest rule: it encodes every code point using 32 bits (4 bytes). Every 4 bytes is a new character, simple as that. It’s fast to work with, but it consumes the maximum amount of memory. UTF-16 and UTF-8 are variable-length encodings. UTF-8 (Unicode Transformation Format, 8-bit) can use from one to four bytes per code point. Besides the number of the character, UTF-8 uses particular bits in each byte to indicate whether that byte stands on its own or is part of a longer sequence. UTF-8 is the most space-efficient but also the slowest, because it takes additional time to determine the right number of bytes per character.
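Here is a small Python sketch showing how UTF-8 stretches from one to four bytes, and how the marker bits work (continuation bytes always start with “10”):

```python
# UTF-8 uses 1–4 bytes per code point, depending on the character:
for ch in "aя€🚀":
    b = ch.encode("utf-8")
    print(ch, len(b), " ".join(f"{byte:08b}" for byte in b))
# "a" fits in 1 byte, Cyrillic "я" takes 2, "€" takes 3, "🚀" takes 4.
```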

So your browser probably uses UTF-16 or UTF-32 when performing calculations, but communicates with the Internet via UTF-8.

Another big advantage of UTF-8 is backward compatibility with ASCII, which some people still refer to as “Plain Text”. The thing is, there is no such thing as “Plain Text”: every text has an encoding. ASCII was the previous American standard; it could encode only 128 code points and therefore couldn’t be international. But it still exists, together with a bunch of other formats.

So we can still face a situation that was very common at the dawn of the Internet, when text encoded with one format was decoded with another and turned into something called Mojibake.
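Mojibake is easy to produce on purpose. A Python sketch: encode text in one format, then decode it in another:

```python
text = "Cześć"
# Encode as UTF-8, then (wrongly) decode as Windows-1252:
garbled = text.encode("utf-8").decode("windows-1252")
print(garbled)  # CzeÅ›Ä‡ — classic mojibake
```

Every multi-byte UTF-8 sequence falls apart into two or three unrelated Windows-1252 characters, which is exactly the pattern you see on badly decoded pages.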

The crazy thing is that a webpage can still use any encoding, and the information about that encoding is written inside the same, already encoded page. So browsers need to be able to decode at least this line of text correctly:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Things would be slightly more complicated if browsers saw Mojibake instead of the “charset=utf-8” string. Fortunately, all encodings used on the Internet are ASCII-compatible and use the same rules to encode the Basic Latin section.

Stop. What happens if a browser starts reading a page as UTF-8 and then finds out that the encoding is Windows-1250? Nothing special: it starts reading the page again from the beginning, using the correct encoding. That’s why it is not recommended to place this tag at the end of the page ;)

But let’s get back to Unicode.

The Size of Unicode. Unicode allows for 1 112 064 code points; as of May 2023, about 150 000 of them are assigned. And the size of the Basic Latin section is 128 code points.

Unicode doesn’t consist of glyphs. Every code point represents just a theoretical concept: “Latin Small Letter A” doesn’t define a grapheme; it’s an abstract ideal, something we can call the letter “a”. The way it actually looks on the screen depends entirely (well, almost) on the font.

Unicode has information on every character of every human language, including ancient ones such as Egyptian hieroglyphs or ancient Middle Eastern cuneiform. It also contains a lot of special symbols and, of course, emoji. Pretty much everything related to writing and symbols, except for commercial logos and national flags.

It has more than 10 variants of spaces (including a zero-width one, which is visually nothing at all). Even such a small character as the hyphen has several variants: standard, non-breaking, and soft. The soft hyphen is invisible and appears only when a word breaks at the end of a line.
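A quick way to make these invisible characters visible is to print their code points and official names, for example with Python’s unicodedata module:

```python
import unicodedata

# A few space and hyphen variants and their official Unicode names:
for cp in (0x0020, 0x00A0, 0x200B, 0x2010, 0x2011, 0x00AD):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
# U+0020 SPACE
# U+00A0 NO-BREAK SPACE
# U+200B ZERO WIDTH SPACE
# U+2010 HYPHEN
# U+2011 NON-BREAKING HYPHEN
# U+00AD SOFT HYPHEN
```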

There is a huge set of symbols related to math. Be careful with them, because you might accidentally summon the evil dead:

Yes, the mathematical section contains several stylized Latin alphabets of its own, because you might want to use Fraktur (𝕶) letters to denote some characteristics of the continuum, or double-struck (ℍ) letters to denote number sets. Or you might want to use one of the online services to make your name look groovy.

But doesn’t the existence of several Latin alphabets create problems? It does, and it’s not the only source of problems.

Here, the first character is the Ohm sign (the unit of resistance), and the second one is Greek Omega.

Here, the first ä is one character, and the second one is a combination of two code points: a small Latin “a” and the Combining Diaeresis mark “◌̈”. So, the problem with the previous examples is that some characters look absolutely identical and have the same meaning for users, but not for the code. Fortunately, Unicode has special Normalization algorithms to cope with this. In short, Canonical normalization says that Ω = Ω, ä = a + ̈, and so on. And Compatibility normalization says that ℍ = H, ① = 1, ¼ = 1/4, and so on.
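These algorithms are available in many standard libraries; in Python, for example, the unicodedata module exposes the normalization forms directly:

```python
import unicodedata

precomposed = "\u00E4"  # "ä" as a single code point
combined = "a\u0308"    # "a" + Combining Diaeresis
print(precomposed == combined)                                # False
# Canonical normalization (NFC/NFD) makes them interchangeable:
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
print(unicodedata.normalize("NFC", "\u2126") == "\u03A9")     # True: Ohm sign → Greek Omega
# Compatibility normalization (NFKC) folds stylistic variants:
print(unicodedata.normalize("NFKC", "ℍ"))                     # H
print(unicodedata.normalize("NFKC", "①"))                     # 1
```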

Unicode allows adding as many diacritical marks to one letter as you like, and all of them will be automatically spaced, creating a Zalgo effect in your text. Ideally, like this:

This is something we already saw in the era of typewriters, when Backspace meant exactly what it was named for. In those days, Backspace moved the carriage back over the already printed character, and the next character could be typed on top of it. Retyping the same character created a “bold” effect, and typing “~” on top of a previously typed “a” created “ã”.

Later, when Teletypes were invented, ASCII appeared to operate with them and Backspace changed its function.

It’s interesting to know that ASCII used special characters to control teletypes. Such characters didn’t convey any graphical information but told the teletype to do something — to ring a bell, for example. “Hey, I sent you a text, did it ring a bell for you?”
But what is more interesting: all those Control characters still exist at the very beginning of Unicode (Unicode inherited the ASCII structure for backward compatibility), and in Unicode they do nothing, they just waste some space, like junk DNA with pseudogenes.

Unicode has its own Control characters. The most exciting of them are the Bidirectional text control characters. If you place the “U+202E Right-to-Left Override” character in your text, then everything you type after it will be displayed from right to left. For example, “CoolOwl[U+202E]gpj.exe” will be displayed as “CoolOwlexe.jpg”. Beware! Sometimes the owls are not what they seem.
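You can see the trick at the level of code points. A Python sketch (how the string is displayed depends on how bidi-aware your terminal is):

```python
name = "CoolOwl\u202Egpj.exe"
# Bidi-aware UIs render this as "CoolOwlexe.jpg",
# but the underlying code points still end in ".exe":
print(name.endswith(".exe"))  # True
print("\u202E" in name)       # True: the override character is still there
```

The OS sees an executable; the user sees an innocent image file. That is exactly why many file managers and messengers now strip or highlight this character.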

Another interesting Control character is the Zero Width Joiner (U+200D). It helps combine several emoji into a new one. For example, you can convert a regular bear into a polar one by adding a Snowflake symbol to it. Of course, you need a font that supports the polar bear image; otherwise, you’ll see just a bear face and a snowflake, or even Tofu symbols, instead.
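The polar bear is just such a ZWJ sequence. A Python sketch:

```python
bear = "\U0001F43B"                        # 🐻
polar_bear = bear + "\u200D\u2744\uFE0F"   # bear + ZWJ + snowflake + variation selector
print(polar_bear)       # 🐻‍❄️ in fonts that support this ZWJ sequence
print(len(polar_bear))  # 4 code points behind one visible glyph
```

The same mechanism builds family emoji, skin-tone professions, and the rainbow flag: several existing code points glued together by U+200D.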

Speaking of Tofu, let’s start moving toward Fonts.

Unicode & Fonts

As mentioned before, Unicode does not contain any graphical information about characters, but fonts do. To display a character, we always need a font, any font, at least the system one. But no font contains all Unicode characters; even if it were technically possible, it would make no practical sense. Imagine: in order to publish a new font, a type designer would have to create all 150 000 characters… and the next day a Unicode update is released.

In fact, a font’s structure doesn’t even mirror Unicode. A font can have one glyph for one Unicode character, several glyphs for another, no glyph for a third, and some special non-Unicode glyphs, like .notdef (Tofu).

Tofu

Tofu is the nickname of the .notdef glyph inside a font. It is a graphic representation of the fact that the font cannot display the required character. Usually it looks like a rectangle or a crossed rectangle (or whatever the type designer wants).

When a system asks a font to provide a visual of the specific Unicode character, the font algorithm maps the request to its structure, initiates some additional algorithms (we will talk about them later), and decides what to send back.

So, what happens when the font replies to the request with .notdef? It depends on the application. A professional tool like Photoshop honestly displays Tofu. Other apps display nothing at all, so as not to confuse you. And your browser uses font fallback technology to silently switch to a font that supports the symbol. But let’s finish with the Tofu first.

Tofu ≠ Replacement Character. The important thing to understand about Tofu is: even if you see the Tofu symbol instead of the required one, “ぁ” for example, the code of the string remains the same, it’s still U+3041. So, if you choose another font to display it or copy the string and send it to your friend, you still operate with “ぁ”, not tofu. In fact, you cannot copy the Tofu symbol, even if you want to.

Unicode has a lot of different symbols, and some of them may look like Tofu, for example this one: “White Vertical Rectangle” (U+25AF). As you might suspect, if the font does not support this symbol, it can be replaced with Tofu and accidentally look right. In this case, you might think that you are able to copy/paste Tofu, but you still operate with U+25AF.

In fact, you cannot copy and paste any font glyph, not only Tofu, you copy and paste only its Unicode address. You can think of it this way: you copy and paste “containers” for symbols with fixed Unicode addresses, and what is inside these containers can vary depending on the font currently in charge. As long as these containers keep their order, the code remains the same and the sense remains the same. With one exception.

Another Unicode symbol that can look like Tofu is the special “Replacement Character” (U+FFFD). It looks like � and is used to replace a sequence of incoming bytes whose value is unknown or unrepresentable in Unicode. For example, if part of a stream marked as UTF-8 does not follow the encoding rules, it will be replaced with this character. This means that your string is not valid Unicode and you have a problem: some data has been replaced with the Replacement Character code, so it’s not the original sequence of bytes anymore. But if your intention is just to paste the Replacement Character itself, you can do it freely, because it has a valid Unicode address.
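You can watch invalid bytes turn into Replacement Characters, for example in Python:

```python
data = b"\xE3\x81"  # a truncated UTF-8 sequence for "ぁ" (should be E3 81 81)
text = data.decode("utf-8", errors="replace")
print(text)         # the broken bytes become "�" (U+FFFD); the originals are lost
print("\uFFFD" in text)  # True
```

Note the one-way nature of the operation: once the decoder has substituted U+FFFD, there is no way to recover the original byte sequence from the resulting string.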

Yes, sometimes it’s impossible to recognize which symbol you see (the simplest example is Latin O, Cyrillic О, and Greek Ο). The only way to find out is to inspect the actual code points of your string. One of the online code-point-inspection services can help with this task.
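If you have Python at hand, a few lines do the same job as those services:

```python
import unicodedata

# Three visually identical letters: Latin, Cyrillic, Greek.
for ch in "OОΟ":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+004F LATIN CAPITAL LETTER O
# U+041E CYRILLIC CAPITAL LETTER O
# U+039F GREEK CAPITAL LETTER OMICRON
```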

Font Fallback

Now let’s take a closer look at the Font Fallback. The idea is simple: it’s better to display the requested symbol using another font instead of showing Tofu. Every OS has its own set of such “backup” fonts: some of them cover exotic languages, others deal with emojis, etc.

It is easier to understand with an example. Say I want to create a simple webpage with the word “hello” in English, Polish, and Georgian, plus a rocket emoji: Hello, Cześć, გამარჯობა! 🚀 To make the example more explicit, let’s use a somewhat exotic font, say Nanum Gothic.

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Test</title>
    <style>
      @import url('https://fonts.googleapis.com/css2?family=Nanum+Gothic&display=swap');
      * {
        font-family: 'Nanum Gothic', 'Inter';
        font-size: 6vw;
      }
    </style>
  </head>
  <body>
    <p>Hello, Cześć, გამარჯობა! 🚀</p>
  </body>
</html>

With the line “font-family: 'Nanum Gothic', 'Inter';” we actually created our own fallback list. The browser will try to display the text with Nanum Gothic and then, if some glyphs are not available, will switch to Inter, right? Well, only if Inter is available on the user’s device (note that we didn’t import it in the code). OK, so if Inter is not available and we didn’t specify any other option — end of story?

Not really: the operating system will step in and use its own list to cover all the gaps. On Windows, it will use Arial to display “ść” (it deliberately picks a sans-serif font to better match the original sans-serif), Sylfaen for the Georgian word, and Segoe UI Emoji for the rocket.

You might have noticed that despite the effort to match the font, the word Cześć doesn’t look great. Unfortunately, we have no additional control to solve this issue, but at least we can read the text.

And what happens if the system is not able to find the required symbol even in its huge fallback list? Well, hello Tofu, my old friend.

OK, we have chosen all the required fonts for the text; now let’s take a closer look at the glyph selection process.

2. OpenType (selecting glyphs, applying features, rendering)
