UTF?

Logan Emerson
5 min read · Sep 5, 2017

When learning to code, one gets used to things not making sense at first glance: “&&”, “thing.select { |thing| thing.name == ‘cool thing’ }”, and so on. I’m glad for the daily practice in humility, and I’m becoming comfortable with it all. I struggle much more when my tools display foreign (and important-looking) information; I’m interested in understanding code at a professional level, so I should know a lot about my text editor. Similarly, if I point at a buzzing knife-laser-combination-looking thing that’s approaching my mouth and ask my dental hygienist, “Hey, what’s this do?”, and they reply with a shrug, my confidence in their ability to not kill me in the following moments will be diminished.

When I started using a text editor (i.e. Atom) that displayed “UTF-8” in the bottom right of the window, I immediately looked up what it meant. I quickly read that it was a Unicode standard, then proceeded to forget this and look it up again many times over the last few months. There are surely better resources for a complete understanding of UTF-8, but this post will explain it in layman’s terms and, with any luck, solidify my knowledge such that I won’t need to look up the term again.

ASCII

In 1963, ASCII (the American Standard Code for Information Interchange) established the first standard for encoding Latin characters in binary. While the first thing to pop into one’s head upon hearing “ASCII” is something very stupid, like…

    [ASCII art]

…the system that was standardized was very cleverly implemented. ASCII encodes 128 specified characters as seven-bit integers. Ninety-five of the encoded characters are printable: the digits 0 to 9, lowercase letters a to z, uppercase letters A to Z, and punctuation symbols. The remaining 33 are control codes: non-printing characters originally intended for the Teletype machines of the era, now obsolete.

The low-hanging-fruit solution would have been to assign “A” the binary value “0000001”, work up to “Z” at “0011010”, and then assign the lowercase “a” to “0011011” (i.e. the very next value after “Z”). This doesn’t seem like a terrible idea, but the actual solution was much better:

  • A = 65 (100 0001)
  • a = 97 (110 0001)

With this convention, the binary is fairly legible: the low five bits give a letter’s position in the alphabet (both A and a end in 0 0001), and switching between uppercase and lowercase is just a matter of flipping a single bit.
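Here’s that single-bit case flip as a quick Ruby sketch (the same trick works in any language with bitwise operators):

    # 'A' is 100 0001 (65) and 'a' is 110 0001 (97); they differ only in the bit
    # worth 010 0000 (32), so XORing with that bit toggles a letter's case.
    ('A'.ord ^ 0b0100000).chr  # => "a"
    ('z'.ord ^ 0b0100000).chr  # => "Z"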

Encode Wars

While ASCII was widely used among English-speaking businesses, there were still many other encodings in use around the world. At that point in time (i.e. the 1960s), this was not especially problematic, because fax machines were still the business communication technology of choice.

As computers became popular and incompatibilities between encodings became more and more problematic, a trend toward 8-bit encodings (i.e. eight binary bits per character) further impeded the path toward a global standard. Even this wasn’t such a big deal, because it was still rare for a company to send encoded information to a company using a different encoding standard.

Japan, however, did have problems, as it had multiple encodings all to itself. When a floppy disk was walked across the street from one business to another, there was a real possibility that the information would be completely incomprehensible to the receiving business’s system. Japanese even has a word for this: mojibake, or garbled characters.
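A minimal sketch of the phenomenon in Ruby: the same bytes, read under the wrong encoding, come out as nonsense. (The sample text and the pair of encodings below are just illustrative choices.)

    # Text written on a system that speaks Shift_JIS...
    sjis_bytes = "文字化け".encode("Shift_JIS")
    # ...handed to a system that assumes the bytes are EUC-JP:
    misread = sjis_bytes.force_encoding("EUC-JP")
    misread.encode("UTF-8", invalid: :replace, undef: :replace)
    # => garbled characters and replacement marks instead of 文字化け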

Then the World Wide Web launched, and businesses had a serious problem: the greatest tool ever invented was now available, but international businesses couldn’t take advantage of it, because nobody’s computers could speak with anybody else’s. In essence, everyone was now suffering from Japan’s encoding woes.

Enter the Unicode Consortium

Starting in the late 1980s, a standard gradually took shape. The goal was simple: every script’s characters encoded to their own numbers. That is, every character from Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian Aboriginal Syllabics, Khmer, Mongolian, Han (Japanese, Chinese, and Korean ideographs), Hiragana, Katakana, and Yi, each associated with its own number.
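In Unicode terms, these numbers are called code points, and Ruby will show you a character’s number directly:

    "A".ord   # => 65      (Latin; the same number ASCII assigned)
    "Ж".ord   # => 1046    (Cyrillic)
    "あ".ord   # => 12354   (Hiragana)
    "字".ord   # => 23383   (Han)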

One way to do this would have been to account for a huge number of characters by going with a fixed 32-bit (i.e. four bytes per character) system. This would have been the low-hanging-fruit solution. Problems with this convention would have been:

  • ASCII-incompatible: ASCII was the leading standard in the business world, so compatibility with it was highly desirable
  • Inefficient: with “A” stored as 00000000000000000000000001000001, every Latin-script document would be hugely wasteful, taking four times the storage of its 8-bit equivalent (the sketch after this list shows the blow-up)
  • Broken: many computer systems interpret a byte of 00000000 (i.e. eight zeros in a row) as the end of a string, so a 32-bit text format, padded with zero bytes, would break existing software beyond repair
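Ruby can make the first two pain points concrete. This sketch compares the same ASCII-only string stored as UTF-8 and as UTF-32 (big-endian), which is essentially the fixed 32-bit scheme described above:

    text = "Hello, world!"
    text.encode("UTF-8").bytesize      # => 13 bytes
    text.encode("UTF-32BE").bytesize   # => 52 bytes, four times the storage
    text.encode("UTF-32BE").bytes.count(0)
    # => 39 zero bytes, any one of which could be mistaken for "end of string"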

Very smart people worked UTF-8 into a standard that accounts for all of these problems.

They began by matching ASCII: if a character can be encoded in seven bits, UTF-8 stores it in a single byte that leads with a 0 and follows with the exact ASCII bit pattern.
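You can check this in Ruby: an ASCII character occupies one byte in UTF-8, and that byte is a leading 0 followed by the character’s 7-bit ASCII code.

    "A".encode("UTF-8").bytes  # => [65] - one byte, identical to ASCII
    "A".unpack1("B8")          # => "01000001" - a leading 0, then 7-bit ASCII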

To avoid the inefficiency of a fixed 32-bit encoding, UTF-8 is dynamic in its character storage size: a character takes one, two, three, or four bytes, depending on how large its number is. A leading byte starting with 110, 1110, or 11110 announces a two-, three-, or four-byte sequence, every continuation byte starts with 10, and ASCII bytes (which start with 0) can never be mistaken for part of a longer character.
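For example, a quick look at the byte counts in Ruby:

    "A".bytes.length   # => 1  (ASCII range: a single byte)
    "é".bytes.length   # => 2
    "あ".bytes.length   # => 3
    "😀".bytes.length   # => 4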

This standard, which works for nearly all applications (you’ll find that rare or antiquated characters can still cause corruption in some instances), is the 8-bit Unicode Transformation Format, or UTF-8.
