A Practical Guide to Character Sets and Encodings

or: What’s all this about ASCII, Unicode and UTF-8?

This guide was originally developed in collaboration with Cari Davidson while we were working at CustomMade, Inc. It explains the basics of character sets and encodings to Python and JavaScript coders who have had little introduction to the subject beyond what one picks up about the ASCII character set while working with it.

I’ve rewritten it to fit better on Medium, and I present the slide deck at the end in case that form is more usable.

A basic understanding of ASCII, hexadecimal and arrays is assumed.


Two Concepts

Character Sets: a collection of characters associated with numeric values. These pairings are called “code points”.

Encoding: how a sequence of code-points is represented as an array of bytes.

ASCII: An Ancient Character Set

American Standard Code for Information Interchange

US-ASCII is a character set (and an encoding) with some notable features:

Values are between 0–127 (x00–x7F)

ASCII code-point 32 (decimal) represents a SPACE

ASCII code-point 65 represents the uppercase letter A

The string “Foo Bar” is represented by the following 8-bit bytes:

Hexadecimal Values: 46 6F 6F 20 42 61 72
ASCII Characters:   F  o  o  sp B  a  r
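The byte values above can be reproduced in Python. This is only a minimal sketch; the helper name `hex_bytes` is made up for illustration and is not part of any library.

```python
def hex_bytes(text, encoding):
    """Return the encoded bytes of `text` as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))

print(hex_bytes("Foo Bar", "ascii"))  # 46 6F 6F 20 42 61 72
```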

Note: there is no ASCII value for the copyright symbol: ©. To compensate for this, people restricted to ASCII characters in a document would represent the copyright symbol with three characters “(C)” (the letter-C surrounded by parentheses).
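To see this limitation in practice, here is a small Python sketch. It assumes you want the "(C)" fallback described above; the function name `to_ascii` is hypothetical.

```python
def to_ascii(text):
    """Encode text as ASCII, swapping the copyright symbol for the "(C)" fallback."""
    return text.replace("©", "(C)").encode("ascii")

try:
    "© 2015".encode("ascii")
except UnicodeEncodeError as err:
    # ASCII has no code point for ©, so Python refuses to encode it.
    print("cannot encode:", err)

print(to_ascii("© 2015"))  # b'(C) 2015'
```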

Why are we here?

We are here because the English-centric nature of computer science led to a character set (ASCII) that was reasonable for its time and for English, but not sufficient to support all languages.

ASCII CODES

The following chart is familiar to Unix users who have ever typed man ascii at a shell prompt (which is actually where I scraped the information). A more in-depth explanation is available here: https://en.wikipedia.org/wiki/ASCII

The following table shows all ASCII characters and their hexadecimal values:

00 nul 01 soh 02 stx 03 etx 04 eot 05 enq 06 ack 07 bel
08 bs  09 ht  0A nl  0B vt  0C np  0D cr  0E so  0F si
10 dle 11 dc1 12 dc2 13 dc3 14 dc4 15 nak 16 syn 17 etb
18 can 19 em  1A sub 1B esc 1C fs  1D gs  1E rs  1F us
20 sp  21 !   22 "   23 #   24 $   25 %   26 &   27 '
28 (   29 )   2A *   2B +   2C ,   2D -   2E .   2F /
30 0   31 1   32 2   33 3   34 4   35 5   36 6   37 7
38 8   39 9   3A :   3B ;   3C <   3D =   3E >   3F ?
40 @   41 A   42 B   43 C   44 D   45 E   46 F   47 G
48 H   49 I   4A J   4B K   4C L   4D M   4E N   4F O
50 P   51 Q   52 R   53 S   54 T   55 U   56 V   57 W
58 X   59 Y   5A Z   5B [   5C \   5D ]   5E ^   5F _
60 `   61 a   62 b   63 c   64 d   65 e   66 f   67 g
68 h   69 i   6A j   6B k   6C l   6D m   6E n   6F o
70 p   71 q   72 r   73 s   74 t   75 u   76 v   77 w
78 x   79 y   7A z   7B {   7C |   7D }   7E ~   7F del

The first 32 values (and the last value) deserve a note: they are known as control characters. Some character sets give them a printable form, but they are mainly used to control how text is displayed and laid out. For instance, x08 (BS, or backspace) moves the cursor one position to the left.

Please check ISO 2047 for definitions of specific control characters in the table above.

Unicode is a character set

Unicode is a superset of ASCII with character values between x0–x10FFFF (1,114,112 possibilities). Unicode characters whose values are between x0–x7F are exactly the same characters as those in the ASCII chart above.

Unicode, as a character set, doesn’t specify an encoding, though. It is just a pairing of characters and numbers.

Note: the Unicode code-point for the copyright symbol © is represented by the number 169 (xA9)
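In Python, the built-in `ord()` and `chr()` functions convert between a character and its code point, which makes it easy to check values like the one above. The variable name below is just for illustration.

```python
# ord() gives the code point of a character; chr() goes the other way.
copyright_code_point = ord("©")

print(copyright_code_point)        # 169
print(hex(copyright_code_point))   # 0xa9
print(chr(0xA9))                   # ©
print(ord("A"), hex(ord("A")))     # 65 0x41
```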

http://unicodelookup.com

Encodings

An encoding translates a sequence of code points to a sequence of bytes and is defined by its meta data, byte ordering and compression.

Meta data are the well-understood parts of the data that describe the overall format of the data.

Example: suppose I told you that any series of characters that starts with the lowercase letter x should be interpreted as a hexadecimal number, and that any other series of digits should be interpreted as a decimal number. We might describe the “x” as meta data that describes the radix of the number.
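The radix example can be sketched in a few lines of Python. The function name `parse_number` is hypothetical; it simply treats a leading "x" as the meta data and the rest as the digits.

```python
def parse_number(token):
    """Interpret a leading lowercase 'x' as meta data meaning 'hexadecimal'."""
    if token.startswith("x"):
        return int(token[1:], 16)  # strip the meta data, parse the rest as hex
    return int(token, 10)          # no marker: plain decimal digits

print(parse_number("x7F"))  # 127
print(parse_number("127"))  # 127
```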

Byte ordering, or endianness, describes the order in which the bytes of a multi-byte value are arranged in a sequence. For all practical purposes there are two ways to order them:

  1. Most significant byte first, called big endian, would be like writing a date as year, month, day: 2015-11-21. If you were writing the number ten thousand two hundred and forty-seven in big endian, you would write: 10247
  2. Least significant byte first, called little endian, would be like writing a date as day, month, then year: 21-11-2015. If you were writing the number ten thousand two hundred and forty-seven in little endian, you would write: 74201
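The two byte orders listed above can be seen directly with Python's `int.to_bytes` and `int.from_bytes`. This is a small sketch using an arbitrary two-byte value (x1234) chosen for illustration.

```python
value = 0x1234

# Same number, two different byte sequences.
big = value.to_bytes(2, byteorder="big")        # most significant byte first
little = value.to_bytes(2, byteorder="little")  # least significant byte first

print(big.hex(" "))     # 12 34
print(little.hex(" "))  # 34 12

# Reading the bytes back requires knowing which order was used.
print(int.from_bytes(little, byteorder="little") == value)  # True
```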

There are good reasons to pick either of the formats, although they may not be obvious. Big endian is a little more natural: it is how we write numbers, with the most significant digit on the left.

Little endian, however, offers some interesting advantages for hardware multi-byte addition routines. Take the two numbers 10247 and 97499, written as arrays of digits in big endian format, that we would like to add together:

   10247
+  97499
========
  107746

A third grader would add these two numbers from right to left using carry addition, starting at the end of each array.

But it is much simpler (for hardware) to add the numbers together if the digits are presented in reverse order:

index:  00 01 02 03 04 05
        -----------------
         7  4  2  0  1
      +  9  9  4  7  9
        =================
         6  4  7  7  0  1

This works because any additional carry digit can be appended to the end of the array instead of being inserted into memory at the beginning.
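The reversed-digit addition can be sketched in Python as follows. This is an illustration of the idea using decimal digits, not how any particular hardware implements it; the function name `add_little_endian` is made up.

```python
def add_little_endian(a, b):
    """Add two numbers stored as lists of decimal digits, least significant first."""
    result = []
    carry = 0
    for i in range(max(len(a), len(b))):
        total = carry
        if i < len(a):
            total += a[i]
        if i < len(b):
            total += b[i]
        result.append(total % 10)
        carry = total // 10
    if carry:
        result.append(carry)  # the final carry is simply appended at the end
    return result

print(add_little_endian([7, 4, 2, 0, 1], [9, 9, 4, 7, 9]))  # [6, 4, 7, 7, 0, 1]
```

Reading the result backwards gives 107746, the same sum as in the big endian example.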

Compression: methods for using the fewest bytes to transmit the most common code-point values.

In some encodings, ASCII characters are supported in a compatible way: every character value between x00 and x7F is represented by a single 8-bit byte with its high (8th) bit not set, and any value larger than x7F is represented by two or more bytes.

See the next section for more information about this.
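The high-bit rule described above is easy to test with a bit mask. The sketch below assumes UTF-8 as the ASCII-compatible encoding; `is_single_byte` is a hypothetical helper name.

```python
def is_single_byte(byte):
    """True when the high (8th) bit is clear, i.e. the byte is a plain ASCII value."""
    return (byte & 0x80) == 0

encoded = "Foo ©".encode("utf-8")
print([is_single_byte(b) for b in encoded])
# [True, True, True, True, False, False]
```

The four ASCII characters each take one byte with the high bit clear, while both bytes of the copyright symbol have it set.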

Popular Encodings

ASCII: 7-bits

Latin-1: an 8-bit extension of ASCII, formally named ISO-8859-1 (a different standard than the ones we’ve talked about).

UTF-8: backwards compatible with ASCII, as it uses single-byte values for ASCII characters and multi-byte values for non-ASCII characters.

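The following table describes how code-points are encoded in UTF-8 (an x marks a data bit; the other bits are fixed control bits):

Code-point range    Byte 1    Byte 2    Byte 3    Byte 4
x0000–x007F         0xxxxxxx
x0080–x07FF         110xxxxx  10xxxxxx
x0800–xFFFF         1110xxxx  10xxxxxx  10xxxxxx
x10000–x10FFFF      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx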

The string “Foo Bar” is represented by the following bytes:

Hexadecimal Values: 46 6F 6F 20 42 61 72
ASCII Characters:   F  o  o  sp B  a  r

The string “Foo © Bar” is represented by the following bytes:

Hexadecimal Values: 46 6F 6F 20 C2 A9 20 42 61 72
Characters:         F  o  o  sp ©     sp B  a  r

Notice that C2 A9 represents the copyright symbol. Why? Because we are trying to represent xA9, which in binary is 10101001 and can’t be represented as a single byte because its high bit is set. Instead it is represented as two bytes:

Binary:       1100 0010 1010 1001
Control bits: xxx       xx
Data bits:       x xxxx   xx xxxx
Hexadecimal:  C    2    A    9
Unicode:      ©
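The bit layout above can be reproduced by hand with shifts and masks. The following Python sketch only handles the two-byte case (code points x80 through x7FF); `encode_two_byte` is a made-up name, and in real code you would simply call `str.encode("utf-8")`.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range x80-x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point needs a different number of bytes")
    first = 0b11000000 | (code_point >> 6)           # control bits 110 + top 5 data bits
    second = 0b10000000 | (code_point & 0b00111111)  # control bits 10 + low 6 data bits
    return bytes([first, second])

print(encode_two_byte(0xA9).hex(" "))              # c2 a9
print(encode_two_byte(0xA9) == "©".encode("utf-8"))  # True
```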

UTF-16: an encoding standard that uses at least two bytes per code point (four bytes for code points above xFFFF) to encode the 1,112,064 valid Unicode code points.

UTF-32 (or UCS-4): a fixed-length encoding that uses four bytes for every code point.

It may be useful for you to review how various encodings compare to each other.
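One rough way to compare encodings is to count how many bytes each needs for the same string. The sketch below uses Python's built-in codecs; the big endian variants of UTF-16 and UTF-32 are chosen here so that no byte order mark is added to the output. The function name `encoded_lengths` is hypothetical.

```python
def encoded_lengths(text, encodings=("latin-1", "utf-8", "utf-16-be", "utf-32-be")):
    """Map each encoding name to the number of bytes it needs for `text`."""
    return {name: len(text.encode(name)) for name in encodings}

for name, size in encoded_lengths("Foo © Bar").items():
    print(f"{name:10} {size:2} bytes")
```

For this nine-character string, Latin-1 needs 9 bytes, UTF-8 needs 10, UTF-16 needs 18 and UTF-32 needs 36.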

Tips and tricks when writing code

Specify your encoding: every communication should announce its encoding so the receiver can interpret the encoded data correctly. Some examples:

HTML

Content-Type: text/html; charset=ISO-8859-1
Content-Type: text/html; charset=UTF-8

XML

<?xml version="1.0" encoding="UTF-8"?>

SQL

DEFAULT CHARSET=utf8

CSS

@charset "utf-8";
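The same advice applies when reading and writing files in code. The Python sketch below writes a string as UTF-8 to a temporary file (the file name is arbitrary) and then reads it back twice: once with the right encoding, and once pretending it is Latin-1, to show what a receiver sees when the encoding is guessed wrong.

```python
import os
import tempfile

text = "Foo © Bar"
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Always state the encoding explicitly instead of relying on a platform default.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

with open(path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()

# Wrong guess: the two UTF-8 bytes C2 A9 are read as two Latin-1 characters.
with open(path, "r", encoding="latin-1") as handle:
    misread = handle.read()

print(round_trip)  # Foo © Bar
print(misread)     # Foo Â© Bar
```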

Why do I see little boxes?

When a web page calls for a character that the browser can’t display, the browser substitutes a little box, a question mark or some other symbol. The computer may be unable to display the character for several reasons:

  • the operating system is obsolete
  • the browser is obsolete
  • no font with that particular character has been installed on the computer
  • the system has not been configured to display Asian fonts

Some Unix Commands

Just so you understand my environment: I’m running on a Mac. Here is the output of the uname command:

laptop$ uname -a
Darwin El-Guapo.local 15.0.0 Darwin Kernel Version 15.0.0: Sat Sep 19 15:53:46 PDT 2015; root:xnu-3247.10.11~1/RELEASE_X86_64 x86_64
laptop$

Use ‘od’ to view the output

laptop$ echo -n "Foo Bar" | od -t x1 -t a
0000000    46  6f  6f  20  42  61  72
           F   o   o  sp   B   a   r
0000007
laptop$ echo -n "Foo © Bar" | od -t x1 -t a
0000000    46  6f  6f  20  c2  a9  20  42  61  72
           F   o   o  sp  c2  a9  sp   B   a   r
0000012

Environment variables:

laptop$ env | grep '\(LC_.*\|LANG\)'
LC_ALL=en_US.UTF-8
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8

ssh config: ~/.ssh/config

Host *
SendEnv LC_* LANG

A command to convert between character sets

iconv -f ISO-8859-1 -t UTF-8 < infile > outfile
iconv --list
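The same kind of conversion can be done in Python by decoding with the source encoding and re-encoding with the target. This is a minimal sketch rather than a replacement for iconv; the function name `convert` and its defaults are assumptions made for the example.

```python
def convert(data, source="iso-8859-1", target="utf-8"):
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)

latin1_bytes = b"Foo \xa9 Bar"  # "Foo © Bar" as Latin-1: © is the single byte A9
print(convert(latin1_bytes).hex(" "))  # 46 6f 6f 20 c2 a9 20 42 61 72
```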

The Take Away

Character sets are character/value pairs named code-points. The code point for capital letter A in ASCII is 65 (in decimal, x41 in hexadecimal). Unicode is an expanded set used to represent non-English characters (and symbols like copyright ©).

Encodings define how code-point values are represented in memory (or in a data stream). Many encodings exist; concentrate on UTF-8, which is extremely popular because it is both a single-byte encoding that is backwards compatible with ASCII and a multi-byte encoding for other Unicode characters.

Declare your encoding in whatever programming language or document format you use so that programmatic readers can correctly interpret its content.