Computers Are Hard: representing alphabets with Bianca Berning
Provided you’re storing and processing text in Unicode, it just works.
If you look deep enough, all data on our computers is binary code. Strings of 0s and 1s are the only language that processors understand. But since humans are not nearly as adept at reading binary, data goes through multiple layers of translation before it’s presented to us in a legible way. We rarely think about it, but that’s a herculean task.
Unlike computers, we don’t have a unified way of communicating. We speak thousands of languages, written in hundreds of scripts. We write equations and use emoji. Many engineers spent years trying to untangle the messiness of language so it could be represented in software, and then spent some more years trying to agree on a common approach.
To find out more about what goes into representing languages and alphabets in software, I reached out to Bianca Berning, an engineer, designer, and Creative Director at the type design studio Dalton Maag. We talked a little about the history of encoding standards, about how fonts come to be, and about how much work it takes to create one. We also touched on tofu, but not the edible kind.
What’s character encoding and why are there different standards for that? What’s the difference between Unicode and ASCII, for example?
Character encodings assign numeric values to characters. They’re used because digital data is inherently numeric, so anything that isn’t a number needs to be mapped to one. There have been many competing standards for encoding characters over the past hundred years, but ASCII and Unicode are the most well known.
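To make that concrete: in Python, ord() gives a character’s numeric code point and .encode() turns it into bytes, which is all an encoding really is.

```python
# Every character maps to a number (its code point); an encoding such as
# UTF-8 then maps that number to one or more bytes for storage.
for ch in "Aé🦛":
    print(ch, hex(ord(ch)), ch.encode("utf-8"))
# A  0x41     b'A'
# é  0xe9     b'\xc3\xa9'
# 🦛 0x1f99b  b'\xf0\x9f\xa6\x9b'
```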
ASCII, from around 1960, used seven bits to represent characters, meaning it could encode only 128 different characters. This is just about acceptable for English, but struggles to faithfully represent the alphabets of other languages. Straightforward extensions of ASCII, using eight bits for 256 different characters, appeared rapidly, but they were many and incompatible. Each language, or at best each group of languages, needed its own encoding, as 256 characters is still not enough.
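That incompatibility is easy to see with Python’s built-in codecs: ASCII can’t encode an accented letter at all, and the very same byte means a different letter in different 8-bit extensions.

```python
# ASCII has only 128 code points, so anything beyond basic English fails.
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\xe9' ...

# The same byte, 0xE9, is a different letter in different 8-bit encodings.
print(b"\xe9".decode("latin-1"))    # é  (Western European)
print(b"\xe9".decode("cp1251"))     # й  (Cyrillic, Windows-1251)
print(b"\xe9".decode("iso8859-7"))  # ι  (Greek)
```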
There were standardization efforts which led to some rationalization. For example, there is a range of ISO standards for 8-bit encodings which group languages together, such as Latin-1 (formally known as ISO/IEC 8859-1, first published in the 1980s) for the languages of Western Europe.
It was Latin-1 which was used as the basis for Unicode in 1988. By growing that 8-bit encoding to a 31-bit encoding, there were finally enough code points for every character from every language, while retaining some easy compatibility with one of the most common 8-bit encodings.
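That compatibility is still visible today: the first 256 Unicode code points line up with Latin-1, so decoding a Latin-1 byte gives a character whose code point equals the byte value.

```python
# Each Latin-1 byte decodes to the Unicode character with the same number.
for byte_value in (0x41, 0xE9, 0xFC):
    ch = bytes([byte_value]).decode("latin-1")
    print(hex(byte_value), ch, hex(ord(ch)))
# 0x41 A 0x41
# 0xe9 é 0xe9
# 0xfc ü 0xfc
```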
UNICODE
Unicode is the globally adopted standard for character encoding, maintained by the Unicode Consortium, which has some of the largest tech companies as its members. Unicode supports most writing systems, current and ancient, as well as emoji 🦛. That’s why every platform has the same set of emoji, even though they look different on iOS/Mac, Android, and Windows.
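Emoji are ordinary code points like any other character; each platform simply draws its own artwork for them. For example:

```python
import unicodedata

# The hippo emoji is just code point U+1F99B with a standard Unicode name.
print(hex(ord("🦛")), unicodedata.name("🦛"))
# 0x1f99b HIPPOPOTAMUS
```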
What are control and format characters? What purpose do they serve?
Control characters control the behavior of the device which is displaying the text. Many are rooted in the physical nature of early teletype devices, such as backspace, which moved the carriage back one character (it didn’t delete), and bell, which rang a little bell on the machine.
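You can see both kinds the question mentions in Python: control characters carry the Unicode category Cc, while format characters, which affect how text is laid out rather than driving a device, carry the category Cf. The zero-width joiner used in emoji sequences is one example of the latter.

```python
import unicodedata

# Control characters (category Cc): backspace and bell survive from teletypes.
for ch in ("\b", "\a"):
    print(repr(ch), unicodedata.category(ch))
# '\x08' Cc   (backspace)
# '\x07' Cc   (bell)

# Format characters (category Cf) carry layout information instead.
zwj = "\u200d"
print(unicodedata.name(zwj), unicodedata.category(zwj))
# ZERO WIDTH JOINER Cf
```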
Say you’re building a chat service. You’d obviously want to support as many alphabets and writing systems as possible. Is this something that software development tools provide by default, or does it require extra work?
I’m probably not the right person to talk about implementation, but in general modern operating systems and application development environments are fully Unicode aware and compliant.
What happens if someone tries to paste unsupported text into your app?
Provided you’re storing and processing text in Unicode, it just works. If you’re not, you’ll get a lot of missing-glyph tofu characters.
TOFU
Each font should include a .notdef glyph. It’s the glyph that appears when a website or an app is trying to display an unsupported character. Usually, the .notdef glyph is a white square (like this: □), which is why it’s called tofu.
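One way to poke at this yourself is with the fontTools Python library; here is a small sketch, where the font path is only a placeholder.

```python
from fontTools.ttLib import TTFont

font = TTFont("SomeFont-Regular.ttf")  # placeholder path

# Glyph index 0 is the .notdef glyph by convention -- the tofu box.
print(font.getGlyphOrder()[0])  # .notdef

# The cmap table maps code points to glyph names; any code point missing
# here falls back to .notdef when the text is displayed.
cmap = font["cmap"].getBestCmap()
print(0x0041 in cmap)  # 'A' -- almost certainly True
print(0x0915 in cmap)  # DEVANAGARI LETTER KA -- False unless the font covers it
```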
What writing systems are the most complicated to support and why?
There are writing systems, such as Arabic and Devanagari, in which letter shapes vary depending on the context in which they appear. While their Unicode characters are as straightforward as those of any other writing system, they require an additional stage of processing, known as shaping, to get from a sequence of characters to correctly formatted glyphs.
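The standard library makes the idea visible: Arabic BEH is a single character, but it has four legacy “presentation form” code points corresponding to the shapes it takes at different positions in a word. Modern fonts don’t rely on those code points; a shaping engine picks the right glyph instead.

```python
import unicodedata

beh = "\u0628"
print(unicodedata.name(beh))  # ARABIC LETTER BEH

# The legacy presentation forms all normalize back to the same character.
for form in ("\uFE8F", "\uFE90", "\uFE91", "\uFE92"):
    print(unicodedata.name(form), unicodedata.normalize("NFKC", form) == beh)
# ARABIC LETTER BEH ISOLATED FORM True
# ARABIC LETTER BEH FINAL FORM True
# ARABIC LETTER BEH INITIAL FORM True
# ARABIC LETTER BEH MEDIAL FORM True
```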
Most physical keyboards are based on the Latin alphabet. How do you make typing in a vastly different writing system possible with this interface?
It depends on the writing system and the language. As an example, standard Japanese keyboards have both the English QWERTY layout and hiragana indicated on their keys.
Many minority scripts don’t have a history of physical keyboards, but digital-only keyboards can often be installed to allow input in languages that use those scripts. It’s far from perfect and far from complete, but people have adapted.
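For the Japanese example above, typing on the QWERTY side usually means romaji input: the Latin letters you type are converted to kana (and then, optionally, to kanji) by the input method. Here is a toy sketch of that first step, with a deliberately tiny syllable table; real input methods have far larger tables and many more rules.

```python
# Toy romaji-to-hiragana converter (no sokuon, "nn", or kanji conversion).
ROMAJI_TO_HIRAGANA = {
    "ni": "に", "ho": "ほ", "n": "ん",
    "sa": "さ", "ku": "く", "ra": "ら",
}

def romaji_to_kana(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        # Longest match first, so "ni" wins over "n" + "i".
        for length in (3, 2, 1):
            chunk = text[i:i + length]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # pass through anything unrecognized
            i += 1
    return "".join(out)

print(romaji_to_kana("nihon"))   # にほん
print(romaji_to_kana("sakura"))  # さくら
```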
Is it possible to create your own, custom font for your project? If so, how do you go about this? How do you make sure it’s properly rendered across all platforms?
Creating a custom font can be easy or hard depending on the scope and ambition of the project. Each font is a collection of graphic representations of characters, known as glyphs. The more glyphs there are, the more complex the behaviour, and the more diverse the script systems being supported, the more specialist knowledge and skill will be required.
To guarantee cross-platform compatibility, we’re relying, again, on industry and formal standards. The most common file format for fonts, OpenType, is an ISO standard (ISO/IEC 14496-22), and the most common encoding for accessing the glyphs is Unicode.
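Under the hood, an OpenType font is a collection of named tables defined by that standard. A quick look with fontTools (again, the file name is just a placeholder):

```python
from fontTools.ttLib import TTFont

font = TTFont("MyCustomFont.ttf")  # placeholder path
print(sorted(font.keys()))
# Typically something like:
# ['GDEF', 'GPOS', 'GSUB', 'cmap', 'glyf', 'head', 'hhea', 'hmtx',
#  'loca', 'maxp', 'name', 'post', ...]
# 'cmap' maps Unicode code points to glyphs; 'GSUB'/'GPOS' hold layout rules.
```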
When we look at fonts, we only see the aesthetic side. But what are the technical steps to creating a new font?
We refer to the technical steps by the umbrella term “engineering”. It includes everything from rules for getting from characters to glyphs, to adding extra instructions to help a glyph make best use of the available pixels when it is displayed on screen.
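Many of those character-to-glyph rules end up in a font’s GSUB table as OpenType features. A hedged sketch of how you might list them with fontTools; the file name is a placeholder, and not every font carries a GSUB table.

```python
from fontTools.ttLib import TTFont

font = TTFont("SomeArabicFont.ttf")  # placeholder path
if "GSUB" in font:
    feature_records = font["GSUB"].table.FeatureList.FeatureRecord
    print(sorted({rec.FeatureTag for rec in feature_records}))
    # e.g. ['calt', 'ccmp', 'fina', 'init', 'liga', 'medi', 'rlig']
    # 'init'/'medi'/'fina' drive the contextual Arabic forms mentioned earlier.
```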
If you want to use an existing font, how do you know what alphabets it supports? Do you need to test it manually?
There are some existing tools, such as Wakamai Fondue, that can provide you with a list of languages that are likely supported by a font. Most of them are based on Unicode’s CLDR, which is the largest and most extensive repository of locale data available, but it should by no means be considered complete or absolute.
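The underlying check is conceptually simple: compare the code points in the font’s cmap with the characters a language needs. Here is a rough sketch using a tiny, hand-written sample rather than real CLDR data, and a placeholder font path.

```python
from fontTools.ttLib import TTFont

# Illustrative sample only -- a real check would pull exemplar characters
# for each language from CLDR (or a tool built on it).
SAMPLE_VIETNAMESE_LETTERS = set("ăâđêôơưĂÂĐÊÔƠƯ")

font = TTFont("CandidateFont.ttf")  # placeholder path
covered = set(font["cmap"].getBestCmap())

missing = sorted(ch for ch in SAMPLE_VIETNAMESE_LETTERS if ord(ch) not in covered)
print("Missing:", missing or "none (for this small sample)")
```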
And if you want to use a font but it doesn’t work with all alphabets you want to support, can you just edit and expand it?
You will need to check the license agreement you agreed to when you downloaded that font to find out what you can and can’t do with it — terms vary from supplier to supplier. If you chose an open source font for your project, the same applies. The terms often allow expansion of the font but there might be restrictions on, or requirements for, how you can distribute the result.
Computers Are Hard
- Introduction
- Bugs and incidents with Charity Majors
- Networking with Rita Kozlov
- Hardware with Greg Kroah-Hartman
- Security and cryptography with Anastasiia Voitova
- App performance with Jeff Fritz
- Accessibility with Sina Bahram
- Representing alphabets with Bianca Berning
- Building software with David Heinemeier Hansson