On The Road To Professional Web Development | Character Encodings

Alexander S. Augenstein
4 min read · Jun 11, 2020


*The objective of this article is to prepare for a technical interview. To succeed (1) we must be technically proficient, and (2) we must be capable of communicating our proficiency effectively.

In this post, we’ll first become conversationally fluent on the topic. Next we’ll study W3C-provided resources and get some hands-on experience. Finally we’ll finish the discussion with a curated list of (reportedly) common interview questions.

This post is part of a series on becoming a professional web developer — click here to navigate to the table of contents*

Part I: Character Encodings

How Difficult Is It To Master?

This topic is particularly straightforward. Functionally speaking, once we learn enough to understand the phrase “always use UTF-8”, our work is basically done. The fact that this is so easy for us is only possible thanks to many other people laying the groundwork over many years — so for completeness, we’ll also take the time to understand the historical context surrounding UTF and the value it adds.

Character Encodings: What Are They?

A ‘character encoding’ is a mapping between binary data and characters. For example, the binary data 01000001 might be used to represent the letter A, and other binary patterns could represent other characters. Computers only speak in binary, so if we humans want to use computers to write in languages we understand, we need to represent characters using patterns of binary data. Even more importantly, we need to mutually agree on a standard for consistently mapping binary patterns to characters. We can pick whatever standard we want, as long as we stick to our convention. It’s convenient to document these one-to-one mappings in a table. For instance, the mapping from 01000001 to the letter A comes from the ASCII table. ASCII is a particularly dated character encoding: it was first published in 1963 and supports only 128 characters (basically the English alphabet, digits, punctuation, and some control codes). There are many other encodings, many of which emerged to accommodate the needs of various languages.
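To make the table lookup concrete, here is a minimal Python sketch of the mapping between the letter A and the pattern 01000001. The variable names are purely illustrative, and the non-English character used to probe ASCII's limits is just one arbitrary example.

```python
# Look up the number assigned to a character and show it as 8 binary digits.
letter = "A"
code_point = ord(letter)
bits = format(code_point, "08b")

# Going the other way: interpret a binary pattern as an ASCII character.
decoded = bytes([int("01000001", 2)]).decode("ascii")

# ASCII's repertoire is small, so characters outside it cannot be encoded.
try:
    "\u00e9".encode("ascii")  # U+00E9 is e with an acute accent
    ascii_can_encode_accent = True
except UnicodeEncodeError:
    ascii_can_encode_accent = False

print(letter, code_point, bits, decoded, ascii_can_encode_accent)
```

The last check is the practical reason other encodings emerged: ASCII simply has no entry in its table for most of the world's characters.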

Unicode is a character set designed to contain all the characters and symbols of every language covered by any other character encoding. Unicode Transformation Format (UTF) is the family of encodings for Unicode, and it comes in three flavors (UTF-8, UTF-16, and UTF-32). Each flavor supports all the same characters, but the binary representation is different. This flexibility is why there’s essentially no reason to use non-UTF character encodings today. UTF-8 is by far the most common choice on the web, which helps ease the burden of internationalization and mitigate the risk of incompatibilities.
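As a quick illustration of "same characters, different binary representation", the sketch below encodes one short string in all three flavors. The sample string is an arbitrary choice, and the big-endian codec names are used only so that Python does not prepend a byte order mark to the output.

```python
text = "A\u20ac"  # 'A' is U+0041, the euro sign is U+20AC

# Big-endian variants keep the output free of a leading byte order mark.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    # Show each encoded form as space-separated hexadecimal bytes.
    print(f"{name:10} {data.hex(' ')}")

# Every flavor decodes back to exactly the same characters.
round_trips = all(data.decode(name) == text for name, data in encoded.items())
print(round_trips)
```

The three byte sequences differ in length and content, yet each one decodes to the same two characters.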

As a demonstration of this, Medium (and in particular the post you’re currently reading) uses UTF-8. Don’t believe me? In most desktop browsers you can press F12 to open the developer tools and inspect the page’s HTML, or right-click and choose the option to view the page source. Whichever works for you, look at the very top of the HTML document that defines this page. In the first tag inside the head element, you’ll see a meta tag that specifies charset=utf-8.

This screenshot is from the HTML of the page you’re currently reading — note the highlighted text on the far right

What Terminology Comes Up When Discussing Character Encodings?

  • Coded Character Set (CCS): a mapping where each character corresponds to a unique number. “This term is often used loosely as a synonym for character encoding”
  • Character Encoding Form (CEF): a mapping from code points to code units. “CEFs are part of the Unicode encoding format”
  • Character Encoding Scheme (CES): the mapping of code units to a sequence of octets for efficient storage. “UTF-8 is the most popular character encoding scheme”
  • Code Point: the number representing a given character in a character encoding. “A in ASCII has code point 65, which is 01000001 in binary”
  • Character Repertoire: characters that can be represented by an encoding. “Chinese symbols are not contained in ASCII’s repertoire”
  • Code Space: the range of integers available for use as code points. “Unicode’s code space runs from 0 to 10FFFF in hexadecimal”
  • Code Unit: the fixed-size bit sequence an encoding form uses as its building block (8 bits in UTF-8, 16 in UTF-16, 32 in UTF-32). “One code unit encodes one code point in fixed-width encoding schemes, but multiple code units may be combined to encode a single code point in variable-width encoding schemes”
  • Variable-Width Encodings: using varying numbers of bits to represent characters in the character set. “UTF-8 and UTF-32 represent the same character set thanks to UTF-8’s ability to represent larger code points by combining smaller code units”
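The difference between code points and code units is easier to see with a few concrete characters. The following is a small sketch, where code_units is a hypothetical helper written for this post rather than a standard library function. It counts how many code units one character needs in each UTF flavor.

```python
def code_units(char: str, encoding: str, unit_bytes: int) -> int:
    """Count how many code units one character needs in the given encoding."""
    return len(char.encode(encoding)) // unit_bytes

# A, e with acute accent, euro sign, and an emoji outside the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for char in samples:
    print(
        f"U+{ord(char):04X}",
        code_units(char, "utf-8", 1),      # 8-bit code units
        code_units(char, "utf-16-be", 2),  # 16-bit code units
        code_units(char, "utf-32-be", 4),  # 32-bit code units
    )
```

UTF-32 always needs exactly one code unit per code point (fixed width), while UTF-8 and UTF-16 combine more code units as the code point grows (variable width).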

Part II: Understanding Character Encodings

Part II focuses on understanding character encodings. Applying this knowledge is relatively straightforward, so here we’ll focus on the context surrounding Unicode character encodings.

  1. Read this intro to character encodings, including some of the history of ASCII
  2. Read the W3C’s official statement on why we should care
  3. Read the W3C’s detailed documentation on essential concepts
  4. Learn how to specify UTF-8 as our default encoding in HTML documents
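For the last item, here is a minimal sketch of what the declaration looks like and how a program might read it back. The page below is a made-up example, and CharsetFinder is a hypothetical helper built on Python's standard html.parser module. It is not how browsers actually detect encodings.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Record the value of the first <meta charset="..."> tag encountered."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "meta" and self.charset is None:
            self.charset = dict(attrs).get("charset")


# A minimal, hypothetical page that declares its encoding early in <head>.
page = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Example</title>
  </head>
  <body><p>Hello</p></body>
</html>"""

finder = CharsetFinder()
finder.feed(page)
print(finder.charset)
```

The important part is the meta tag itself: placing it at the top of the head element tells the browser how to interpret the bytes that follow.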

Part III: Character Encoding Interview Questions

There’s not much to this (which is incredibly uncommon). Knowing some of the terminology discussed above may be valuable for our own edification, but the below are the only recurring interview questions I’ve found across the web.

  • discuss Unicode
  • list the number of bits required to represent ASCII / UTF-8 / UTF-16 characters
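For the second question, one way to check our answer is to measure it. The sketch below is only an illustration: bits_needed is a hypothetical helper, and the sample characters were picked to hit each possible width. It reports how many bits a single character occupies in UTF-8 and UTF-16.

```python
def bits_needed(char: str, encoding: str) -> int:
    """Number of bits the encoded form of a single character occupies."""
    return 8 * len(char.encode(encoding))


# A, e with acute accent, euro sign, and an emoji outside the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_bits = sorted({bits_needed(c, "utf-8") for c in samples})
utf16_bits = sorted({bits_needed(c, "utf-16-be") for c in samples})

# ASCII defines 128 characters, so 7 bits are enough to number them all
# (in practice each one is usually stored in a full 8-bit byte).
ascii_bits = (128 - 1).bit_length()

print("ASCII: ", ascii_bits)
print("UTF-8: ", utf8_bits)
print("UTF-16:", utf16_bits)
```

In interview terms: ASCII needs 7 bits, UTF-8 uses 8, 16, 24, or 32 bits depending on the character, and UTF-16 uses either 16 or 32 bits.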

Closing Thoughts

I hope you’ve enjoyed this post as much as I enjoyed writing it. If you have thoughts you’d like to share, your editorial suggestions are always welcome. This post is part of a series on becoming a professional web developer. If you’d like to see more content like this, please click here to navigate to the table of contents.
