The Chaos that is Character Encodings

Zwork101
11 min read · Mar 1, 2020


Photo by Tanner Mardis on Unsplash

Hello! This story was my submission to National History Day, and because of that, its format may be a bit odd (for instance, I bring up “barriers” a lot; that was the theme). While the subject is technical, I’ve tried to keep it simple enough that the average person (like the judges) could understand it. With that said, there’s a lot of cool information here, and I hope you enjoy it!

Imagine for a moment that you need to communicate with someone 200 feet away, and all you have is a flashlight. How do you do it? If you know Morse code, you might use that; otherwise, you might have sent instructions beforehand explaining what each flash of the flashlight means. Either way, you’re using a character encoding. Character encodings aren’t just for flashlights: they act as the backbone that lets computers show you information. Instead of flashes of light, computers have spurts of electrical energy, which they transform into letters, numbers, and emojis. However, it’s not always that simple, and that is the basis of this paper.

Communication has been a problem throughout human history. Different languages lead to confusion and act as a barrier to cooperation. Character encodings are essential for humans to communicate over electrical forms of communication, which today may take up more of our time than speaking aloud. We take for granted the dispute between character encodings that happened right under our noses. If we all used different encodings, a well-written paragraph for one person might be a j̸̜̐a̴̖̒͠r̸̹͇͊b̴̰̱͂ḷ̶̌̊é̸͙͑d̴̗͌ ̷͕̓͌m̶̠̓e̸̤͚̎s̴̨̺̆s̴͉̱͗ to the next. ASCII (ask-ee) was the key that started this character encoding war, a war ultimately won by Unicode, a standard accessible to all languages, used and unused.

The best place to start is at the beginning. Before computers, there was a lone wire that spanned a great distance. This wire would carry pulses of electricity that triggered an electromagnet, creating a tapping sound. The telegraph, invented by Samuel F. B. Morse, was not a computer. Yet “Like the binary system used in modern computers, it [Morse code] is based on combinations of two possible values — in the case of Morse code, a dot or a dash” (Searle, Brief).

Source: http://tronweb.super-nova.co.jp/tronwebimages/charcodehisfig2.gif

Morse code was very successful and is well known even today. It would not work, however, for the teleprinter, an electric printer invented by Jean-Maurice-Émile Baudot in 1874. So Baudot created “the 5-bit Baudot code, which was also the world’s first binary character code for processing textual data”. The Baudot code would be used by many computers before ASCII, but not all.

Source: the same TRON Web page as the previous figure
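To get a feel for how small a 5-bit code is, here is a toy Python sketch. The letter-to-bits mapping below is illustrative, not the real Baudot/ITA2 letter assignments (though the two shift codes shown do happen to match ITA2); the point is the arithmetic: five bits only give 32 combinations, so teleprinter codes reserved special “shift” codes to switch between a letters table and a figures table, roughly doubling the usable set.

```python
# Five bits can only distinguish 2**5 = 32 values -- not enough for
# letters, digits, and punctuation all at once.
LETTERS = {0b00001: "A", 0b00010: "B", 0b00011: "C"}   # illustrative only
FIGURES = {0b00001: "1", 0b00010: "2", 0b00011: "3"}   # illustrative only
FIGS, LTRS = 0b11011, 0b11111  # shift codes (these two match real ITA2)

def decode(codes):
    """Decode a stream of 5-bit codes, honoring the current shift state."""
    table, out = LETTERS, []
    for code in codes:
        if code == FIGS:
            table = FIGURES
        elif code == LTRS:
            table = LETTERS
        else:
            out.append(table[code])
    return "".join(out)

# The same two codes mean "AB" or "12" depending on the shift state:
print(decode([0b00001, 0b00010, FIGS, 0b00001, 0b00010]))  # AB12
```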

These encodings supported English and marginally supported some other Western European languages, but few others. Sending a message in Japanese, for example, was impossible.

Morse code and the Baudot code were part of a communication revolution; the computer era put us in a technological one. With computers communicating over large distances, at first mostly for military but also commercial use, new code systems were necessary. For the military, FIELDATA was invented. This new encoding was “developed by the United States Army” for military use (Mackenzie). It consisted of basic characters within its possible 128-character set; however, the encoding left many code positions undefined, so they could be used later for “more complex kinds of functions and necessary for interconnection and control of data transmission”. The encoding was successful, and before ASCII was created, FIELDATA “became a U.S. Military Standard in 1960”.

Around the same time, IBM was building a computer called “Project Stretch”, one of the first computers with an 8-bit architecture. This meant that Project Stretch could support a character encoding with a maximum of 256 characters, compared to the 64 characters most other computers had. Eventually, IBM decided on a “120-character set” that, “apart from its size (most computer character sets of that day were 48-character sets)”, included characters other encodings lacked: additional mathematical characters for programming languages, uppercase and lowercase letters, symbols, and punctuation. As 8-bit architectures became more common, this style of character encoding became more common with them.
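All the character counts in that story come from one formula: n bits give 2^n distinct codes. A quick Python check ties the numbers together (the encoding names in the comments are just the examples from this article):

```python
# Each extra bit doubles the number of characters a code can name.
for bits in (5, 6, 7, 8):
    print(f"{bits}-bit code: {2 ** bits:3d} possible characters")
# 5-bit code:  32 possible characters  (Baudot)
# 6-bit code:  64 possible characters  (many early machines)
# 7-bit code: 128 possible characters  (FIELDATA, ASCII)
# 8-bit code: 256 possible characters  (Project Stretch, Latin-1)
```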

Some years later, we enter an age where consumers started to have access to these amazing new computers. While the average consumer was satisfied with a computer’s default encoding scheme, companies knew it wouldn’t stay that way. This encoding problem “was becoming increasingly evident as companies like IBM began networking computers” (Awolkoff). A file written on one computer couldn’t be read on another, simply because the receiver didn’t have that file’s encoding. To solve this problem, “In May 1961, an IBM engineer, Bob Bemer, sent a proposal to the American National Standards Institute (ANSI) to develop a single code for computer communication”. ANSI created the X3.4 committee, which included most of the country’s computer makers of the day. When the dust settled, what was left was ASCII, a 7-bit encoding that could be used by all computers.

This encoding was so successful that “In 1968, President Lyndon B. Johnson signed a memorandum adopting ASCII as the standard communication language for federal computers”. President Johnson viewed the new standard as a “major step toward minimizing costly incompatibility among” the government’s “vast Federal computer and telecommunications data systems” (United States). After becoming a federal standard, the encoding became well known to programmers across the United States. Standards like e-mail and HTTP (websites) would all be ASCII compatible, and still are to this day.

ASCII was not without its flaws, though, and it showed. By default, ASCII did not support the accents and special characters of Western European languages, and the standard that had such a massive influence on the internet had no support at all for non-Latin-based languages. To solve the Western European problem, the International Organization for Standardization (ISO) created Latin-1. This standard added accents, special characters, and more by exploiting the fact that ASCII was 7 bits long, not 8: by using the extra bit, Latin-1 extended the number of characters it could support by 128. Even so, it was not considered perfect by any means: “Three French characters are not part of” the original draft, “œ, Œ and Ÿ” (André)! And languages that were not Western European were simply excluded in this era, creating a communication barrier between computers.
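Here is a small Python sketch of what that extra eighth bit buys you. ASCII only defines values 0 through 127, so an accented character has nowhere to live; Latin-1 places it in the 128–255 range:

```python
# 'é' has no 7-bit ASCII code point, so encoding it as ASCII fails...
try:
    "é".encode("ascii")
except UnicodeEncodeError as err:
    print("ASCII can't represent it:", err)

# ...but Latin-1 stores it in the 128-255 range the eighth bit unlocks.
byte = "é".encode("latin-1")
print(byte, byte[0])  # b'\xe9' 233 -- and 233 > 127, so the top bit is set
```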

Most American computer companies enjoyed the simplicity of ASCII, but large companies like IBM and Microsoft were unsatisfied. They had markets in places like Japan and China, and ASCII made it harder to sell computers there. There was a huge boom in encodings after ASCII, because ASCII put the topic of character encodings on the table, which had not previously been considered a problem. Many encodings were created, but two have the most significance for computers today. The two key encodings spawned from ASCII were TRON and Unicode. They would fight for years until finally one collapsed and the other became the universal victor.

After ASCII, computers got faster. The average computer could compute equations faster than ever, which made variable-length encodings practical. ASCII and Latin-1 are both single-byte encodings: a computer reads one byte (8 bits) and uses the character encoding to turn that byte into a character. Some encodings that supported East Asian languages used a similar scheme but read 2 bytes (16 bits) for every character. In a variable-length encoding, by contrast, the size depends on the character; the first byte usually indicates how many more bytes are needed to display it. For example, UTF-8 needs a single byte to display the letter “A”, but 3 bytes to display “犬” (there is a short sketch of this after this section). This improvement made it possible to fit every language in the world into a single character encoding, but TRON and Unicode had different views on how to go about it.

Unicode has a theoretical limit to how many characters it can support, though that number lies over a million. To save space for future characters, Unicode decided to attempt “Han Unification”. “In order to support Chinese, Japanese, Korean, and Vietnamese, encoding for many thousands of ideographs must be provided” (Topping). Some characters in these four writing systems are the same, however; this shared set “is referred to as Han, because it originated in China during the Han dynasty”. To prevent redundancy, Unicode is “assigning single code points to the Han characters”, drastically reducing the number of code points needed. This change was met with great criticism, especially from the TRON project, which supports each language’s unique characters. TRON claims that “Unicode was aimed at ‘the unification of markets,’ which is why its creators were oblivious for so long to the fact that they were simultaneously unifying elements of Chinese, Japanese, and Korean culture without the participation of the governments in question or their national standards bodies” (Searle, Unicode). This, along with a general unhappiness that Unicode might be “merging cultures”, is where it faced the most resistance.

Unicode, however, had the backing of large companies such as IBM and Microsoft, which used it by default in their operating systems. That fact would come into play when Microsoft went after the Japanese market. The TRON project did more than make encodings; it created operating systems like BTRON as well. “When the Japanese government announced it would install BTRON PC in Japanese schools, the U.S. government objected” and threatened sanctions (Krikke). The Japanese market could not afford this loss and “quickly dropped the plan”. The U.S.A. would later withdraw the threat, but by then “Nearly all Japanese companies involved in TRON-related activities had canceled their projects”. Not only was this a huge loss for TRON’s OS, it was also a huge opening for Microsoft: by getting their computers into Japan, they got Unicode into Japan as well. Coincidentally, “Tom Robertson, Microsoft’s Tokyo-based director for government affairs in Asia, is a former official of the United States Trade Representative office that issued the threats against the Japanese government”. This was a blow the TRON project could not withstand, and over time it became clear who had won the character encoding war. Unicode would be the encoding of choice, breaking down the communication barrier and making computers, and the internet, capable of using any language, used and unused.
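As promised above, here is a minimal Python sketch of UTF-8’s variable-length scheme. The leading bits of the first byte announce how many bytes the character occupies, and thanks to Han Unification the character “犬” (“dog”) maps to a single code point whether the surrounding text is Chinese or Japanese:

```python
for ch in ("A", "é", "犬"):
    data = ch.encode("utf-8")
    # hex(ord(ch)) is the Unicode code point; len(data) is the byte count.
    print(ch, hex(ord(ch)), len(data), "byte(s):", data.hex(" "))
# A 0x41   1 byte(s): 41
# é 0xe9   2 byte(s): c3 a9
# 犬 0x72ac 3 byte(s): e7 8a ac

# The first byte's leading bits signal the sequence length:
# 0xxxxxxx = 1 byte, 110xxxxx = 2 bytes, 1110xxxx = 3 bytes.
print(format("犬".encode("utf-8")[0], "08b"))  # 11100111 -> a 3-byte sequence
```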

It’s easy to assume that now that we have a standard, everyone is happy. That is not the case, however, and many people believed Unicode’s goal to be impossible. To understand the objection to Han Unification, imagine that you’re “suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered ‘similar’ (such as ‘M’ and ’N’ sounding and looking so much like each other) and too “complex” (“Q” and “X” — why, they are the nothing more a fancier ‘C’ and an ‘Z’)” (Goundry). The problem is still fairly bad today, and it’s not unheard of to be unable to spell Japanese names correctly (imagine having to write ‘Kathy’ while your name is really ‘Cathy’). Still, while there was resentment toward the encoding, many companies embraced it. Having multiple encodings had meant that “Transferring of text from one machine to another one often causes some loss of information” (IBM). Not only did Unicode solve the problem of sending information between machines, it was also easy to adopt thanks to its “interoperability with both ASCII and ISO-8859–1 [Latin-1], the most widely used character sets”. These benefits proved incredibly helpful on the internet: if Unicode hadn’t been compatible with ASCII, programmers would have had to go through each file and change its encoding; instead, they just told the computer to use Unicode, and everything worked as usual. A Google survey found that “Unicode has experienced an 800 percent increase in “market share” since 2006” (Davis). To put that into perspective, it concluded that “nearly 80 percent of web documents are in Unicode (UTF-8)”.
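That backward compatibility is easy to demonstrate: UTF-8 was designed so that any pure-ASCII text is already valid UTF-8, byte for byte. A quick Python check:

```python
text = "Hello, world!"
# Pure ASCII text produces identical bytes under both encodings, so old
# ASCII files and protocols keep working under UTF-8 without conversion.
print(text.encode("ascii") == text.encode("utf-8"))  # True

# Every byte of an ASCII file is also a valid one-byte UTF-8 character.
print(b"GET / HTTP/1.1".decode("utf-8"))  # GET / HTTP/1.1
```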

Source: https://4.bp.blogspot.com/-O4jXmTm7WWI/Tyw1As8jt7I/AAAAAAAAI9E/nxxi1T21IH4/s500/unicode.png

Unicode may be a bit too complex an encoding to use with a flashlight, but for computers, it’s fairly easy. For the longest time, most languages on Earth could not be written on computers, let alone used to send messages between them. This communication barrier was an economic burden for companies, a social pain for the people using the computers, and, as we saw in the memorandum, a political issue. It sparked inventors around the world to create the best encoding standard, creating an encoding mess.

Source: https://imgs.xkcd.com/comics/standards.png

Out of this mess, Unicode would rise, for better or for worse. So the next time you use a computer, any computer at all, say hi to Unicode, which is being used every time you touch your keyboard.

Bibliography

André, Jacques. “ISO-Latin-1, norme de codage des caractères européens ? Trois caractères français en sont absents !” Cahiers GUTenberg, 1996, http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf.

Awolkoff. “ASCII.” Engineering and Technology History Wiki, 25 Jan. 2019, https://ethw.org/ASCII.

Bruchanov, Martin. “Spectroscope View of RTTY (Baudot).” BruXy: Radio Teletype Communication, 10 Oct. 2005, http://bruxy.regnet.cz/web/hamradio/EN/radio-teletype-communication/.

Davis, Mark. “Unicode over 60 Percent of the Web.” Official Google Blog, Google, 3 Feb. 2012, https://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html.

Elias, Alexandre. “Encodings of Japanese.” sci.lang.japan Frequently Asked Questions, https://www.sljfaq.org/afaq/encodings.html#encodings-Goal-of-this-document.

Goundry, Norman. “Why Unicode Won’t Work on the Internet: Linguistic, Political, and Technical Limitations.” Hastings Research Inc., 1 June 2001, http://www.hastingsresearch.com/net/04-unicode-limitations.shtml.

IBM. “How Unicode Relates to Prior Standards Such as ASCII and EBCDIC.” IBM Knowledge Center, Apr. 2016, https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/nls/rbagsunicodeandprior.htm.

John, Nicholas A. “The Construction of the Multilingual Internet: Unicode, Hebrew, and Globalization.” Journal of Computer-Mediated Communication, vol. 18, no. 3, Apr. 2013, pp. 321–338, doi:10.1111/jcc4.12015.

Krikke, Jan. “The Most Popular Operating System in the World.” LinuxInsider, 15 Oct. 2003, https://www.linuxinsider.com/story/31855.html.

Mackenzie, Charles E. Coded Character Sets, History and Development. Addison-Wesley, 1980.

Searle, Steven J. “A Brief History of Character Codes in North America, Europe, and East Asia.” TRON Web, 6 Aug. 2004, http://tronweb.super-nova.co.jp/characcodehist.html.

Searle, Steven J. “Unicode Revisited.” TRON Web, http://tronweb.super-nova.co.jp/unicoderevisited.html.

Topping, Suzanne. “The Secret Life of Unicode.” IBM developerWorks, 1 May 2001, http://www.btetrud.com/Lima/The%20Secret%20Life%20of%20Unicode.pdf.

United States, Executive Office of the President. “Memorandum Approving the Adoption by the Federal Government of a Standard Code for Information Interchange.” Signed by Lyndon B. Johnson, 11 Mar. 1968, https://www.presidency.ucsb.edu/node/237376.

Zentgraf, David C. “What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text.” 27 Apr. 2015, http://kunststube.net/encoding/.
