What’s the deal with Base64?

For the past few years, Base64 has been all around me. I used it to embed images in stylesheets, to safely (not securely!) transfer parameters over HTTP, and more recently to make our GraphQL node IDs more opaque to API consumers. While I’ve always been familiar with its output, I never stopped to ask myself what was going on under the hood. Turns out, I should have.

At its core, Base64 takes binary data and turns it into an ASCII string. It doesn’t care about the original data, whether it was a UTF-8 string or a PNG file; if it has bits, we’re good to go.

The goal of Base64 is to be able to safely transfer information from one system to another without losing any data. It does so by using a set of 64 characters common to most encodings. While the actual set may differ in each implementation, the default Base64.encode64 Ruby implementation uses A-Z, a-z, 0-9 as well as + and /. For URL or path encoding, you can also use Base64.urlsafe_encode64, which replaces + and / with - and _. The set of characters doesn’t really matter, as long as both systems agree on a single set.
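To make the difference between the two alphabets concrete, here is a small sketch. The two input bytes are my own pick (they are not part of any example in this post), chosen because their bits land on indexes 62 and 63. It also assumes the base64 library can be required, which recent Ruby versions ship as a bundled gem.

```ruby
require "base64"

# Two bytes picked for illustration: their bits land on indexes 62 and 63,
# the only two positions where the alphabets differ.
bytes = "\xFB\xFF".b

standard = Base64.strict_encode64(bytes)   # standard alphabet, no line breaks
url_safe = Base64.urlsafe_encode64(bytes)  # same bits, "-" and "_" instead

puts standard  # +/8=
puts url_safe  # -_8=

# Decoding with the matching method gives the original bytes back.
puts Base64.urlsafe_decode64(url_safe) == bytes  # true
```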

Each Base64 digit represents exactly 6 bits of information, since 6 bits can encode 2^6 = 64 distinct values.
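A quick way to check the 6-bit figure in Ruby; nothing here is specific to any Base64 library, and the variable name is just for illustration:

```ruby
# Number of distinct values that fit in 6 bits.
values_in_six_bits = 2**6
puts values_in_six_bits  # 64

# Largest index in a 64-character table, and how many bits it needs.
puts 63.to_s(2)          # 111111
puts 63.bit_length       # 6
```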

The fact that we only use 64 characters is what makes Base64 so interesting. We’re able to transform any set of potentially problematic characters (brackets, parentheses, emojis, …) into a readable, printable string that most systems understand, avoiding any potential delimiter collision.

Following this Wikipedia example, let’s try to convert a regular string into its Base64 format. First step, we need its binary representation.

$ binary = "Man".unpack('B*').first
> "010011010110000101101110"

Each character is encoded in its binary notation, for a total of 24 bits. Now, remember: Base64 only requires 6 bits per character. So instead of 3 groups of 8 bits, let’s consider this as 4 groups of 6 bits.

$ base64 = binary.chars.each_slice(6).map(&:join)
> ["010011", "010110", "000101", "101110"]

Same bits, arranged differently. Now as stated at the beginning, Base64 takes binary data and turns it into an ASCII string. It does so by using a table in which each index maps to an ASCII character. A is at index 0, B at 1, all the way up to / at index 63. You can find the full table on the Wikipedia page. All we have to do now is to convert those bit groups back into their decimal representation, and find their corresponding character.

$ base64.map { |bits| bits.to_i(2) }
> [19, 22, 5, 46]

Mapping those integers with the table, we get T, W, F, and u. Guess what?

$ require 'base64'
> true
$ Base64.strict_encode64("Man") # encode64 would also append a trailing "\n"
> "TWFu"
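The lookup table itself can be rebuilt in a couple of lines. The constant name ALPHABET is my own, and the character order is assumed to follow the standard alphabet (A-Z, a-z, 0-9, then + and /):

```ruby
# Standard Base64 alphabet: index 0 is "A", index 63 is "/".
ALPHABET = [*"A".."Z", *"a".."z", *"0".."9", "+", "/"].freeze

indexes = [19, 22, 5, 46]  # the four 6-bit groups from "Man"
encoded = indexes.map { |i| ALPHABET[i] }.join

puts ALPHABET.size  # 64
puts encoded        # TWFu
```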

That was easy, since 24 divides evenly by 6. But Base64 also works when the input length isn’t a multiple of three bytes: the last block is padded with zero bits up to 24 bits, and each 6-bit group made up entirely of padding is written as the special = character. The number of = then indicates the number of missing bytes in those last 24 bits. Let’s see another example.

$ binary = "M".unpack('B*').first
> "01001101"

Obviously, since the string contains only one character, the “last” 24-bit block contains only 8 bits. It needs to be padded with 0.

$ binary = binary.ljust(24, "0")
> "010011010000000000000000"

The rest goes as usual, as we’re now able to build groups of 6 bits.

$ base64 = binary.chars.each_slice(6).map(&:join)
> ["010011", "010000", "000000", "000000"]
$ base64.map { |bits| bits.to_i(2) }
> [19, 16, 0, 0]

Mapping this against the table, we get T and Q. The last two groups contain nothing but padding, so they become two = rather than A (index 0).

$ Base64.strict_encode64('M') # encode64 would also append a trailing "\n"
> "TQ=="
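Putting the steps together, a minimal encoder might look like the sketch below. The names ALPHABET and encode_base64 are mine, not part of Ruby's library, and the code favours readability over speed by working on strings of "0" and "1" just like the walkthrough. One small shortcut: instead of padding to a full 24 bits, it only fills the last 6-bit group with zeros and then appends the = characters, which gives the same result.

```ruby
ALPHABET = [*"A".."Z", *"a".."z", *"0".."9", "+", "/"].freeze

# Hypothetical helper mirroring the manual steps; not how the standard
# library implements it.
def encode_base64(data)
  bits = data.unpack1("B*")                      # bytes -> string of bits
  groups = bits.chars.each_slice(6).map(&:join)  # 6-bit groups
  # The last group may be short: fill it with zeros on the right.
  chars = groups.map { |group| ALPHABET[group.ljust(6, "0").to_i(2)] }
  # Pad with "=" so the output length is a multiple of 4.
  chars.join.ljust((chars.size + 3) / 4 * 4, "=")
end

puts encode_base64("Man")  # TWFu
puts encode_base64("Ma")   # TWE=
puts encode_base64("M")    # TQ==
```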

Decoding works the same way, going through the same steps in reverse order. We drop the padding, map characters to their index in the table, grab the six relevant digits of their binary representation, join them together and voilà, we have our original data back. Remember: Base64 neither encrypts nor hashes your data!
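The reverse direction can be sketched the same way. Again, ALPHABET and decode_base64 are names I made up for illustration; the helper assumes well-formed input in the standard alphabet and does no validation.

```ruby
ALPHABET = [*"A".."Z", *"a".."z", *"0".."9", "+", "/"].freeze

# Hypothetical helper: reverses the encoding steps for well-formed input.
# It does no validation, unlike Base64.strict_decode64.
def decode_base64(text)
  chars = text.delete("=").chars  # padding carries no data
  bits = chars.map { |c| ALPHABET.index(c).to_s(2).rjust(6, "0") }.join
  # Keep whole bytes only; leftover bits are the zeros added while encoding.
  usable = bits[0, bits.size / 8 * 8]
  [usable].pack("B*")
end

puts decode_base64("TWFu")  # Man
puts decode_base64("TQ==")  # M
```

Anyone can run these few lines on an encoded string, which is exactly why Base64 offers no secrecy.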

While first introduced to fix communication problems, Base64 is now widely used in HTML and CSS to avoid network calls, embedding images and fonts directly within the body of the page. Since the encoded string only represents a set of bytes, a header with the MIME type is usually attached to the content to let browsers know how to interpret the data.

.logo { /* hypothetical selector */
  background: url(data:image/gif;base64,R0lGODlhEA==);
}
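Building such a string by hand is mostly concatenation. Here is a rough sketch, where data_uri is a hypothetical helper of mine and the sample bytes are simply what the truncated string above decodes to (a GIF signature plus one byte, not a complete image):

```ruby
# Hypothetical helper: builds a data URI from raw bytes and a MIME type.
def data_uri(bytes, mime_type)
  "data:#{mime_type};base64,#{[bytes].pack('m0')}"  # "m0" = Base64, no newlines
end

# The seven bytes behind the truncated sample above (not a complete GIF).
sample = "GIF89a\x10".b

puts data_uri(sample, "image/gif")
# data:image/gif;base64,R0lGODlhEA==
```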

JavaScript also allows encoding and decoding of Base64 data.

const encoded = btoa("hello"); // binary to ASCII: "aGVsbG8="
const decoded = atob(encoded); // ASCII to binary: "hello"

Of course, Base64 isn’t perfect. Being able to represent everything through a finite set of 64 characters means that everything takes more space. As we just saw, it encodes each set of three bytes into four characters, roughly a 33% increase, in addition to the padding and the optional header. But as long as we have potential delimiter collisions, Base64 will likely stay around.
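To put a number on that overhead, here is a back-of-the-envelope sketch. The helper encoded_length is hypothetical, and it assumes padded output with no line breaks and no header:

```ruby
# Encoded length with padding: every started group of 3 bytes costs 4 characters.
def encoded_length(byte_count)
  (byte_count + 2) / 3 * 4
end

[1, 3, 100, 1_000_000].each do |n|
  puts "#{n} bytes -> #{encoded_length(n)} characters"
end
# 1 bytes -> 4 characters
# 3 bytes -> 4 characters
# 100 bytes -> 136 characters
# 1000000 bytes -> 1333336 characters
```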

Thanks for reading!