WTF-8: Limitations of character encodings, character sets, and code

I read Ashley Blewer’s “Artist_Exhibition-copy (FINAL)(2).mov: Preserving diacritics in filenames as significant properties in media conservation”, which led me to re-read “Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files” by Elvia Arroyo-Ramírez. That, plus a comment on Blewer’s blog post (“Would love to do more research on this from a conservation perspective”), prompted this attempt to open up the conversation by echoing Ashley’s breakdown with an artwork that highlights the cultural biases of computer science and challenges the assumptions made in programming.

Source ? Developer : Conservator

In 2013, computer scientist and artist Ramsey Nasser programmed قلب (Qlb) during his residency at Eyebeam. It is a functional programming language written entirely in Arabic, conceived as a conceptual art piece.

قلب is a “Lisp-like programming language with minimal Scheme-like parenthesized syntax.” It uses a REPL, an editor, and PEG.js (a parser generator), and is largely based on the implementation of Lispy (https://github.com/nasser/---).

While the Arabic front-end interface users see doesn’t use the traditional UTF-8 encoding, the JavaScript in the back end is all UTF-16 and written in English. Nasser also doesn’t mention which Arabic dialect قلب is written in (a distinction that speakers and coders would make today).
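To make the UTF-8/UTF-16 distinction concrete, here is a small sketch (in Python rather than قلب’s JavaScript, purely for illustration) comparing how the two encodings represent the word قلب:

```python
word = "قلب"  # qalb, "heart": three Arabic letters

utf8 = word.encode("utf-8")       # UTF-8: each of these letters takes 2 bytes
utf16 = word.encode("utf-16-le")  # UTF-16 (little-endian, no BOM): 2 bytes per code unit

print(len(word))   # 3 characters
print(len(utf8))   # 6 bytes
print(len(utf16))  # 6 bytes
```

For Arabic the two encodings happen to take the same space here; for plain ASCII text, UTF-8 is half the size of UTF-16, which is part of why it won on the web.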

Digital imperialism: teach yourself how to code! (…but you’ll need to learn English first)

Do non-English speakers code? Yes.

Are they fluent in English? Not necessarily.

Do you need to know English in order to code?

Scott Hanselman posed this question back in 2008, and I’d honestly like to think no, but it’s hard to see the inclusion or progress when I can’t use the tilde, or even the accent, in my own name, Ramírez-López, without getting character encoding errors.
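The kind of error I mean is easy to reproduce. A quick sketch in Python: any tool that assumes ASCII simply cannot represent the í or ó in my name, while UTF-8 handles them fine.

```python
name = "Ramírez-López"

try:
    name.encode("ascii")  # ASCII has no í (U+00ED) or ó (U+00F3)
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\xed' ...

print(name.encode("utf-8"))  # UTF-8 encodes í as the two bytes C3 AD
```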

“All modern programming tools are based on the ASCII character set, which encodes Latin Characters and was originally based on the English Language.”

1950s: IBM starts its process towards ASCII with Binary Coded Decimal (BCD), a four-bit encoding that stored decimal numbers in a binary form.

“Instead of numbers running from 0000 (0) to 1111 (15), they ran from 0000 (0) to 1001 (9) — each four bits representing a single digit.” (https://www.whoishostingthis.com/resources/ascii/).
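The quoted scheme can be sketched in a few lines of Python (`to_bcd` is a hypothetical helper name for illustration, not IBM code):

```python
def to_bcd(n: int) -> str:
    """Encode a non-negative decimal number as BCD: four bits per digit."""
    return " ".join(format(int(digit), "04b") for digit in str(n))

print(to_bcd(42))    # 0100 0010  (a 4, then a 2)
print(to_bcd(1959))  # 0001 1001 0101 1001
```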

1963: IBM created the Extended Binary Coded Decimal Interchange Code (EBCDIC), an 8-bit encoding for all “standard printable characters.”
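Python still ships EBCDIC codecs (cp037 is the US variant), so you can see how differently EBCDIC and ASCII lay out the same characters. A small sketch:

```python
for ch in "Aa0":
    ebcdic = ch.encode("cp037")[0]  # EBCDIC, US variant
    ascii_ = ch.encode("ascii")[0]
    print(ch, hex(ebcdic), hex(ascii_))
# A 0xc1 0x41
# a 0x81 0x61
# 0 0xf0 0x30
```

Note that EBCDIC and ASCII disagree on every one of these byte values, which is why data moving between the two worlds had to be transcoded.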

1991/1992: Ken Thompson, with the help of Rob Pike, invented UTF-8: http://doc.cat-v.org/bell_labs/utf-8_history
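Thompson’s design is compact enough to sketch by hand. This hypothetical Python function (an illustration, not the stdlib implementation) shows the scheme for code points up to U+10FFFF: the leading byte’s high bits announce the sequence length, and every continuation byte starts with 10:

```python
def utf8_encode(cp: int) -> bytes:
    """Sketch of Thompson's UTF-8 scheme for a single code point."""
    if cp < 0x80:                     # 1 byte: 0xxxxxxx (plain ASCII, unchanged)
        return bytes([cp])
    if cp < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

print(utf8_encode(ord("ق")))  # b'\xd9\x82', matching "ق".encode("utf-8")
```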

2007: ASCII gets replaced by UTF-8 as the most common encoding on the web; UTF-8’s first 128 characters are identical to ASCII’s, but it can also display Chinese, Japanese, and Arabic.
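That backward compatibility is checkable directly; a quick sketch in Python:

```python
ascii_text = "hello"
# For the first 128 characters, the UTF-8 bytes ARE the ASCII bytes:
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# But UTF-8 also reaches scripts ASCII never could:
for sample in ["中文", "日本語", "عربي"]:
    print(sample, sample.encode("utf-8"))
```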

The algorithms we have developed to map and translate characters are all based on a Western culture and an English-language, Latin character set. This wouldn’t be an issue if the products weren’t pushed to other countries to adapt and work in their local environments, only to break when they enter another environment, usually through an acquisition.

In interviews, Nasser expressed how “If [non-English languages, especially non-Latin character encodings] could exist, it would make people’s lives so much easier, but rewriting the last 50 years of software engineering isn’t on the table” (MIC 2015). Nasser teaches programming to students from around the world, many of whom are not native English speakers, but they are still taught JavaScript and C#, languages that have English words built into them: ‘if,’ ‘while,’ ‘do’ (from “Arabic Programming Language at Eyebeam”).

To parallel this to the field of archives and libraries, one of the main issues is that “despite the democratizing promise of technology… the digital tools we build and provide are likely to reflect and perpetuate stereotypes, biases, and inequalities.” — Chris Bourg.

At !!Con 2018, Ahmed Abdallah presented Noor, an Algol-based, imperative Arabic programming language he had built. He posed an interesting question during a Hanselminutes interview (“Do you need to speak English to Code? Noor — an Arabic programming language with Ahmed Abdallah”, May 24, 2018):

Is a programming language supposed to be this generic tool, like a core document that you should follow with strict guidelines, or does the programming language reflect the aesthetics of the creators behind it?

As developers, we like to provide access, and one of those quick fixes is translating a bunch of English documentation and terms into different alphabets. But the issue remains that everything is still deeply rooted in this English, Latin character set, so those translated terms have no inherent meaning.

To end with another Bourg quote from her “Never neutral: Libraries, technology, and inclusion” talk at the 2015 Ontario Library Association Conference (which Elvia also quotes because, honestly, Bourg is way more eloquent than I’ll ever be):

“…without active intervention we end up… classifying and arranging our content in ways that further marginalizes works by and about people of color, queer people, indigenous peoples, and others who don’t fit neatly into a classification system that sets the default as western, white, straight, and male….”
