Charset, encoding, encryption, same thing?

Joffrey Bion
May 30, 2017

A lot of people I've met were a bit confused about encodings and charsets (and I was the first of them!). Surprisingly, some also mixed them up with encryption.

After reading a bunch about these topics, and in particular Joel Spolsky's excellent article about encodings, character sets, and their history, I decided to take a shot at a more condensed article.

Encoding VS Encryption

Let’s get that out of the way first, as these are quite different.

Encoding, in the context of programming, is the process of representing characters using binary digits (0s and 1s). This is necessary for storage, communication, or simple manipulation on a computer. The mapping between characters and bytes (or groups of bytes) is defined by an encoding scheme or character encoding (or simply encoding).

Encoding = characters -> binary
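
For example, here is a minimal Java sketch (the class name EncodingDemo is just illustrative) showing characters being turned into bytes and back:

    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            // Encoding: characters -> bytes, according to a chosen character encoding
            byte[] utf8Bytes = "héllo".getBytes(StandardCharsets.UTF_8); // 'é' becomes 2 bytes: 0xC3 0xA9

            // Decoding: bytes -> characters, using the same encoding
            String text = new String(utf8Bytes, StandardCharsets.UTF_8);
            System.out.println(text); // prints "héllo"
        }
    }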

Encryption (or ciphering) is the process of converting a binary representation into another binary representation in order to protect data from outsiders. It does not need any character encoding because it already operates on binary data, so there is no concept of characters involved at all. It usually requires one or more secret encryption keys.

Encryption = understandable binary -> unintelligible binary

The source representation is binary data that anyone could understand. If this binary data represents text, for instance, one only needs to know (or guess) the encoding to read it. The target representation is binary data that makes no sense unless it is properly decrypted.

Encryption frameworks sometimes perform both the encoding and the encryption, which is why people often confuse the two.
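
To make the difference concrete, here is a rough Java sketch (AES/GCM is used purely as an example algorithm, and the class name EncodeThenEncrypt is made up for illustration). The encoding step needs no secret, while the encryption step only sees bytes and a key:

    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;

    public class EncodeThenEncrypt {
        public static void main(String[] args) throws Exception {
            // Step 1, encoding: characters -> bytes (no secret involved)
            byte[] plainBytes = "Hello €".getBytes(StandardCharsets.UTF_8);

            // Step 2, encryption: readable bytes -> unintelligible bytes (requires a key)
            SecretKey key = KeyGenerator.getInstance("AES").generateKey();
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] cipherBytes = cipher.doFinal(plainBytes);

            // plainBytes can be read by anyone who knows (or guesses) the encoding;
            // cipherBytes is meaningless without the key.
            System.out.println(cipherBytes.length);
        }
    }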

Charset VS Encoding

Now this is what I really wanted to talk about, as there is some confusion between these two terms.

Before the invention of Unicode, there was no real difference between a charset and an encoding: both terms referred to a way to represent letters in binary. ASCII, Latin1, Cp1252, etc. can therefore be considered character sets and encodings at the same time, hence the confusion.

Once Unicode came to the party, a clear distinction appeared, because representing characters as bytes became a 2-step process:

  1. associate a character/letter concept with a number called a code point
  2. encode this "code point" number using bits

So, nowadays, we can probably define these 2 terms this way:

A charset (character set) is a set of characters mapped to theoretical, abstract numbers called code points. Unicode is an example of a character set, and it contains almost every character used in the world.

Charset = letter -> code point

An encoding scheme, such as UTF-8 or UTF-16BE, describes the way these code points are represented as bytes.

Encoding scheme = code point -> bytes

Here is an example with the euro symbol “€”:
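
    Character:                          €  (euro sign)
    Code point (charset step, Unicode): U+20AC
    UTF-8 bytes (encoding step):        0xE2 0x82 0xAC  (3 bytes)
    UTF-16BE bytes (encoding step):     0x20 0xAC       (2 bytes)

The charset step maps the symbol to the single code point U+20AC; the encoding step then turns that number into different byte sequences depending on the scheme. A quick Java check (the class name EuroExample is just for illustration) reproduces these values:

    import java.nio.charset.StandardCharsets;

    public class EuroExample {
        public static void main(String[] args) {
            String euro = "€";
            // Charset step: character -> code point
            System.out.printf("U+%04X%n", euro.codePointAt(0));                  // U+20AC
            // Encoding scheme step: code point -> bytes (scheme-dependent)
            System.out.println(euro.getBytes(StandardCharsets.UTF_8).length);    // 3 bytes: E2 82 AC
            System.out.println(euro.getBytes(StandardCharsets.UTF_16BE).length); // 2 bytes: 20 AC
        }
    }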

In practice, we still see "charset" used where we should in fact talk about an "encoding". The HTTP header Content-Type: text/html; charset=UTF-8 is a famous example of this, and so is the Java API for methods related to encoding.
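
For example, the Java type you pass around is called Charset, even though what it actually selects is an encoding scheme. A tiny sketch (the class name CharsetNaming is just illustrative):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CharsetNaming {
        public static void main(String[] args) {
            // The Java type is called "Charset"...
            Charset cs = Charset.forName("UTF-8");
            // ...but what it really describes is how code points become bytes, i.e. an encoding scheme.
            System.out.println(cs.equals(StandardCharsets.UTF_8)); // true
        }
    }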

I hope this clarifies your idea of encodings and charsets. If you haven't read the article I mentioned at the beginning, I recommend you do, as it is a very instructive and funny read. Spread the knowledge!
