Source: Emojipedia

Internationalization is fascinating: the ebook case

If you’ve been paying attention lately, you know I’m in charge of Readium CSS, a project funded and driven by EDRLab whose goal is to create a reference CSS for EPUB ebooks.

We’ve done a lot of research and development since July 2017, and do care a lot about the (digital) reading experience: we’ve spent countless hours fine-tuning typography to provide users with the best experience possible and tried to leverage all the awesome tools the web has to offer — yes, the web culture can be that awesome. And now is the time to rinse and repeat for internationalization.

Believe it or not, we even created our own tools to check the typefaces we can recommend.

As a web developer or designer, you may know that German words can be awfully long (Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz, for example, was the title of a law regulating the testing of beef), that some European languages use diacritics (“Ğ”, “Ș”, “é”, and so forth), that user interfaces must be laid out right to left for the Arabic and Hebrew scripts, and that some languages like Japanese can be written vertically, from right to left. Actually, there’s a lot more, and it’s fascinating how web technologies can impact entire cultures we may not be familiar with.

To be honest, speaking as a Frenchman who has done some research, individual and collective efforts cannot be overemphasized: a lot of people are trying to make those cultures shine by designing typefaces, documenting requirements for each language, and shipping products in the real world.

Mukta, a Unicode compliant, contemporary, mono-linear font family available in seven weights, supporting Devanagari, Gujarati, Gurumukhi, Tamil and Latin scripts. By EkType.

Now, I must kickstart internationalization in Readium CSS, and here are some issues, docs and insights which may be interesting to you, should you do internationalization someday.

It’s just about typefaces and writing-modes, right?

I wish it were true, but one month of research (which only means preparing development; we’re not even talking about CSS and documentation yet) makes it clear this is not the case. At all.

Let’s start with Indic, which is written left to right and top to bottom. That should be easy, right?

Well… Did you know India officially recognizes 22 languages and 12 scripts?

Sure, we could start with the Devanagari script, but it is used for Hindi, Sanskrit, Nepali, Marathi, etc., which can have their own rules. And then you have to take the Gujarati, Kannada, Gurmukhi, Tamil, Telugu, etc. scripts into account…

An example of Malayalam text. Left is badly-rendered and isn’t the result intended by the typeface designer since the font used is not the original (source: University of Chicago). Right is the correct rendering. Thanks to Santhosh Thottingal for pointing it out and providing the correct version.

Now, India’s population is 1.324 billion, people use mobile phones, and Malayalam, for example, is spoken by more people as a native language (38 million) than Swedish (9.2 million), for which we can offer good support in Readium CSS.

In other words, the situation becomes really tricky super fast if we take a look at raw data, because we obviously want to provide the best support we can, and we don’t necessarily realize that support for those languages may impact so many people.

For instance, can you tell whether the search library you’re using or developing can manage CJK, Arabic, and Indic as expected? Unicode normalization, zero-width non-joiner and zero-width joiner (non-printing) characters, punctuation marks and signs, and right-to-left direction are all items you might want to test sooner rather than later if you’re planning to go international at some point.

Let’s take a look at a more complex language: Japanese. For Japanese, we must support vertical writing. But that’s the easiest part of the story…

An example of Japanese text laid out vertical-rl. (Source: Japanese Text Layout Requirements)
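As a minimal sketch of that “easiest part”, here is how vertical writing could be switched on in CSS. The class names `.vertical-text` and `.tcy` are hypothetical and only serve the illustration; they are not taken from Readium CSS.

```css
/* Hypothetical wrapper for content that should be typeset vertically. */
.vertical-text {
  -webkit-writing-mode: vertical-rl; /* prefixed form for older WebKit/Blink */
  writing-mode: vertical-rl;         /* lines run top to bottom, stacked right to left */
  text-orientation: mixed;           /* CJK stays upright, Latin runs are rotated sideways */
}

/* Hypothetical class for tate-chu-yoko: short runs (e.g. two digits)
   set horizontally within a vertical line. */
.vertical-text .tcy {
  text-combine-upright: all;
}
```

Everything beyond this (line adjustment, punctuation, annotations) is where the real work starts.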

Did you know Japanese publishers have their own typography rules, which represent their brand, and that line adjustment therefore becomes a super complex issue?

Speaking of which, Arabic-Persian scripts have a concept called “Kashida elongation” (کشیده), which is a type of justification that web browsers don’t support yet.

Example of Kashida elongation. (Source: Arabic Script Text Layout Requirements)

In the end, we’ll even have to disable some user settings like letter- and word-spacing because they can’t apply to some languages! And then there is accessibility, which happens to be one of EDRLab’s core missions and for which we have very little info. All I know at the moment is that we might want to disable ligatures in the Arabic script to make it more accessible to people with reading issues.
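A rough sketch of what this could look like, assuming user settings are exposed as CSS custom properties. The property names, the language list and the `.no-ligatures` class are all hypothetical, chosen only for illustration:

```css
/* Hypothetical user settings, set by the reading app. */
:root {
  --user-letter-spacing: 0.05em;
  --user-word-spacing: 0.1em;
}

/* Apply the user settings by default. */
body {
  letter-spacing: var(--user-letter-spacing, normal);
  word-spacing: var(--user-word-spacing, normal);
}

/* Ignore letter-spacing where it would pull joined letters apart.
   The language list is illustrative, not exhaustive. */
:lang(ar),
:lang(fa) {
  letter-spacing: normal;
}

/* Possible accessibility opt-in: turn off optional ligatures.
   Required ligatures and joining forms are not affected by this property. */
.no-ligatures:lang(ar) {
  font-variant-ligatures: none;
}
```

Whether such overrides actually help readers is exactly the kind of question that needs input from people who read those scripts.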

Typography is awesome

I’ve learnt so much reading Text Layout Requirements for non-Latin languages that it’s mind-blowing.

Here’s the list of text layout requirements I’ve read so far:

It really makes you feel humble, to be honest. All the issues you’ve encountered in Latin languages are, quite frankly, nothing compared to the issues some languages have to deal with when it comes to their essential parts.

For instance, orthographic syllable boundaries are the reference browsers should use in Indic. And sometimes they don’t. How would you react if a browser couldn’t wrap or hyphenate words well? Or if :first-letter couldn’t select the correct group of characters for your script?

Examples of drop caps in Indic. (Source: Indic Text Layout Requirements)
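For reference, a drop cap is usually built on top of that selector. The sketch below assumes a hypothetical `.chapter-opening` class on the first paragraph and a left-to-right script; what the “first letter” actually covers (ideally the whole orthographic syllable rather than a single code point) is decided by the browser, not by the stylesheet.

```css
/* Hypothetical class for the first paragraph of a chapter. */
.chapter-opening::first-letter {
  float: left;         /* let the following lines wrap around the initial */
  font-size: 3em;      /* roughly spans a couple of lines of body text */
  line-height: 1;
  margin-right: 0.1em; /* small gap between the initial and the text */
}
```

If the browser picks the wrong group of characters, no amount of styling on the author’s side can fix the result.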

There’s also hanging punctuation, which is purely æsthetic in some western languages, but can make or break the harmony in some others — the Japanese Kihon-hanmen (基本版面) relies on gravitational forces for instance.

As a matter of fact, if you think Latin typography is hard, chances are you’ve never encountered ruby annotation in CJK. Ruby (ルビ) is such a complex system that I can’t even understand all its details yet.

In Chinese, you can have two interlinear annotations at the same time. (Source: Chinese Text Layout Requirements)

Nor can I yet understand all the idiosyncrasies of joining forms in Arabic, which are not ligatures. And I’m just starting to get familiar with the typographic adjustments the writing modes (horizontal-tb/vertical-rl) require in CJK.

As for special typographic features like Warichu (割注), a type of inline note where the text runs on two lines, I’ve been wondering how one could achieve that with the modern specs CSS provides. So internationalization definitely keeps my brain firing, and offers some unexpected room to flex my CSS muscles.

Example of Japanese Warichu. (Source: Japanese Text Layout Requirements)

This is a tough challenge, but I love this. I may not be able to read, write and speak those languages in my lifetime, but typography is at least a common language we can share. And discovering those new concepts is awesome, as I can now see my native language’s typography from a new and exciting perspective.

If typesetting has become a mere set of rules and you’ve been doing things on autopilot lately, I can only advise you to read the requirements I listed above. Chances are you’ll get a new canvas on which you can experiment with new approaches by breaking the rules you’ve been following so far.

Fragmentation is a problem the web has not solved (yet)

In ebook-land, people — both authors and users — expect paged views. If you take a look at EPUB/Kindle files, you can clearly tell authors design their CSS with the assumption contents will be paginated. And if you don’t ship pagination in your app, a lot of users might simply not consider even testing it. Don’t forget ebooks are not longform content but super mega longform content so some users prefer it to be fragmented into pages.

Moreover, there’s a technical constraint we often forget about: eInk. Its super slow updates make scrolling a real pain, so fragmenting contents into screens (pages) is our only viable option right now. Unfortunately, eInk is something people tend to overlook when discussing how we could revolutionize electronic books, although it has been an important cornerstone of this industry; it at least deserves a mention.

Alas, internationalization brings a lot of issues there. And what’s interesting is to take a look at existing Reading Systems’ solutions to those issues.

Since CSS gives us nothing to paginate with in practice, we use multi-columns to fragment contents. Currently, CSS multicol is indeed the only cross-platform spec using fragmentation in web browsers. At some point, I even discovered that CSS Regions could be considered a superset of columns in Blink/WebKit.

The problem is that the column axis depends on the writing mode the document uses. In vertical-rl for instance, columns will be automatically laid out on top of one another (y-axis), not next to one another (x-axis).

If you’re trying to paginate using multi-columns in vertical-rl, here’s what you get.
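To make the problem concrete, here is a minimal sketch of multicol-based pagination. The `.paginated` and `.vertical` class names and the viewport-sized container are assumptions made for the example, not the actual Readium CSS implementation.

```css
/* Horizontal writing: each column is one viewport wide, so overflow
   columns line up along the x-axis and can be shown one page at a time. */
.paginated {
  width: 100vw;
  height: 100vh;
  column-width: 100vw;
  column-gap: 0;
  column-fill: auto; /* fill each column completely before creating the next */
}

/* Vertical writing: the inline axis is now vertical, so column-width
   measures a height and overflow columns stack along the y-axis. */
.paginated.vertical {
  writing-mode: vertical-rl;
  column-width: 100vh;
}
```

Nothing in the standard multicol properties lets the second rule ask for columns placed side by side instead.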

Believe it or not, Apple solved this issue by extending the multicol spec a little bit in Safari 7, in order to create a pagination API on iOS. To sum things up, you can force the column axis by using -webkit-column-axis in Safari (here is a demo you must open in Safari to see it in action).
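A sketch of how the property could be used follows. The `.paginated-vertical` class is hypothetical, and the `horizontal` keyword is my understanding of the WebKit-only syntax rather than something a standard defines, so treat it as an assumption to check against Safari itself.

```css
/* Hypothetical paginated container for vertical writing. */
.paginated-vertical {
  writing-mode: vertical-rl;
  width: 100vw;
  height: 100vh;
  column-width: 100vh;
  column-gap: 0;
  column-fill: auto;

  /* Non-standard, Safari only: lay columns out along the x-axis
     even though the writing mode is vertical. */
  -webkit-column-axis: horizontal;
}
```

Other engines simply ignore the unknown declaration and fall back to stacking columns along the y-axis.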

Unfortunately, Chromium removed it 4 years ago. To be fair, CSS Regions were still a thing at the time, and paged overflow was meant to solve such issues. But then Blink removed CSS Regions, on which Opera’s implementation of paged overflow relied heavily… Which leaves us with nothing today. Funny story: if you try overflow: -webkit-paged-x in Chromium (build 251715), in which the non-standard -webkit-column-axis property was still supported, you obtain the desired result.

Here’s what you get when setting an explicit -webkit-column-axis in Safari. We wish we could get that in all the other browsers.

As a consequence, we need to cheat a little bit, by embracing the y-axis and adapting the page-progression-direction accordingly: when you swipe/tap right or left, pages move upward or downward. This means we can’t do spreads for vertical writing, unless we use the non-standard -webkit-column-axis CSS property on iOS/Safari, CSS Regions in MS Edge on Windows, and a rendering engine in JavaScript for other platforms.

Needless to say, CSS Houdini could be the missing piece the ebook community needs to solve the fragmentation issue for its use cases. But having a standard column-axis property implemented could probably help solve 90% of those use cases in the nearer term.

The best it can be

Obviously, in the case of ebooks, CSS is just one piece of the internationalization puzzle. The entire framework is impacted (metadata handling, rendition, page progression, User Interface, apps’ features, etc.).

We can implement a baseline, but we’ll need expertise to make internationalization the best it can be. We’ll have to pick the best fonts we can find for each language, we’ll need reviews on documentation and stylesheets, we’ll have to make informed decisions when we hit edge cases, and so forth.

It takes some expertise to pick typefaces with well-supported joining forms in the Arabic script… Can you help?

If you’re fluent or do web development/design in those languages, you probably know a lot of details we can’t currently grasp, and CSS tricks to achieve special typographic features. Your knowledge is invaluable and could definitely help us create Reading Systems all users will enjoy, from Paris and Moscow (Москва) to Tokyo (東京) and Kathmandu (काठमाडौं).

If you want to help, please feel free to weigh in on our localization issue or the comments of this post, or just ping me on Twitter.

The web is beautiful; let’s make this ebook CSS reference equally awesome for users all around the world.