Powering the Internet with Base64
Diving into the mechanics of base64 encoding
Base64 is ubiquitous on the Internet right now. Sometimes it seems like every request, every URL and every file is being encoded in base64 format! As a programmer, base64 certainly feels like a daily fact of life.
Every time I use base64, I can’t help but wonder how it works. What is the dark magic hiding underneath the hood when you type base64_encode()? Today, we’ll dive into the specifics of base64 and find out.
What is base64 encoding?
Base64 is a binary to ASCII encoding scheme. It is designed as a way to transfer binary data reliably across channels that have limited support for different content types.
A base64 encoded string looks like this:
V2hhdCBoYXBwZW5zIHdoZW4geW91IGJhc2U2NCgpPw==
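To see what that string actually contains, here is a minimal sketch using Python's standard `base64` module (Python is my choice for the examples here; the article itself is language-neutral):

```python
import base64

encoded = "V2hhdCBoYXBwZW5zIHdoZW4geW91IGJhc2U2NCgpPw=="

# b64decode returns raw bytes; this sample happens to be ASCII text.
decoded = base64.b64decode(encoded).decode("ascii")
print(decoded)  # What happens when you base64()?
```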
Base64 output uses only 64 characters that are present in most character sets. They are:
- Upper case alphabet characters A-Z.
- Lower case alphabet characters a-z.
- Number characters 0–9.
- And finally, characters + and /.
- The = character is used for padding.
These characters are present in most character sets and are rarely used as control characters in Internet protocols. So when you encode content with base64, you can be fairly confident that your data is going to arrive uncorrupted.
By contrast, when you transfer your data in its original, “bits and bytes” state, the data might get corrupted when a protocol misinterprets certain byte values as special characters.
What is it used for?
The original use case for base64 was simply as a safe way to transmit data across machines. Over time, base64 has been integrated into the implementation of certain core Internet technologies such as encryption and file embedding.
Data transmission: Base64 can simply be used as a way to transfer and store data without the risk of data corruption. It is often used to transmit JSON data and cookie information for a user.
File embedding: Base64 can be used to embed files within scripts and webpages, so as to avoid depending on external files. Email attachments are also often sent this way.
Data obfuscation: Base64 can be used to obfuscate data since the resulting text is not human readable. However, this should not be used as a security mechanism as the encoding is easily reversible.
Data hashing: Hash functions such as SHA-256 and MD5 produce raw bytes that are not readable or easily transmitted as text. Digests are therefore usually encoded as hex or base64 so that they can be easily displayed and used for file integrity checks.
Cryptography: Similarly, encrypted data often contains sequences of bytes that are not easily transmitted or stored. When encrypted data needs to be stored in a database or sent over the Internet, base64 is often used. In addition to ciphertext, public key certificates and other encryption keys are also commonly stored in base64 format.
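Two of these use cases are easy to try out. The sketch below is only an illustration: the sample payload (the first eight bytes of a PNG file) and the `image/png` media type are my own choices.

```python
import base64
import hashlib

# Sample binary data: the first eight bytes of a PNG file.
payload = b"\x89PNG\r\n\x1a\n"

# File embedding: a data URI carries the bytes inline as base64 text.
data_uri = "data:image/png;base64," + base64.b64encode(payload).decode("ascii")
print(data_uri)  # data:image/png;base64,iVBORw0KGgo=

# Hashing: a SHA-256 digest is 32 raw bytes; hex and base64 are two printable forms.
digest = hashlib.sha256(payload).digest()
hex_form = digest.hex()
b64_form = base64.b64encode(digest).decode("ascii")
print(len(digest), len(hex_form), len(b64_form))  # 32 64 44
```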
Other transfer-safe encoding schemes
But why is base64 so widely used? Aren’t there other transfer-safe encoding schemes, like Hex and Decimal? The answer is that base64 is more compact than these alternatives.
For example, Hex encoding encodes each byte as a pair of Hex characters. This means that each byte of data would become two bytes after encoding.
Decimal is even less efficient: each byte is written as up to three decimal digits (0 to 255), so a fixed-width encoding needs three bytes of encoded data for each byte of original data.
Base64 maps every three bytes of data into four bytes of encoded data. This means that the data only grows to about 4/3 of its original size (roughly 33% larger) once it’s base64 encoded.
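A quick way to check these ratios is to encode the same bytes three ways and compare lengths. In this sketch the sample data and the fixed-width, three-digit decimal format are assumptions made for illustration:

```python
import base64

data = bytes(range(256)) * 3  # 768 bytes of sample data

hex_len = len(data.hex())                         # two characters per byte
dec_len = len("".join(f"{b:03d}" for b in data))  # three digits per byte
b64_len = len(base64.b64encode(data))             # four characters per three bytes

print(len(data), hex_len, dec_len, b64_len)  # 768 1536 2304 1024
```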
How does base64 work?
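Base64 encoding converts every three bytes of data (3*8=24 bits) into four base64 characters by splitting the 24 bits into four six-bit sequences.
Each six-bit sequence has a value from 0 to 63 and is uniquely mapped to one of the 64 characters used: values 0–25 map to A–Z, 26–51 to a–z, 52–61 to 0–9, 62 to + and 63 to /.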
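The standard mapping can be written down in a couple of lines. This is a sketch of the lookup table, not any particular library's internals, and the name `ALPHABET` is mine:

```python
import string

# Index in this string = six-bit value; character at that index = its base64 character.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Print a few rows of the table: value, six-bit pattern, character.
for value in (0, 25, 26, 51, 52, 61, 62, 63):
    print(f"{value:2d}  {value:06b}  {ALPHABET[value]}")
```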
For example, the text Hi! has the binary representation of
01001000 01101001 00100001
This makes up a total of three bytes (24 bits). Base64 encoding divides the binary data into six-bit chunks and maps each chunk to a base64 character according to the mapping above.
010010 | 000110 | 100100 | 100001
S G k h
Therefore, the base64 encoding of Hi! is SGkh.
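The same steps can be sketched in code for a single three-byte group. The function name `encode_group` is hypothetical, and real libraries are free to implement this differently:

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_group(three_bytes: bytes) -> str:
    """Encode exactly three bytes as four base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly three bytes")
    # Pack the three bytes into one 24-bit integer.
    n = int.from_bytes(three_bytes, "big")
    # Read four six-bit values, most significant first, and look each one up.
    return "".join(ALPHABET[(n >> shift) & 0b111111] for shift in (18, 12, 6, 0))

print(encode_group(b"Hi!"))  # SGkh
```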
Padding
When the length of the data is not a multiple of three bytes, its bits do not divide evenly into six-bit sequences, so zeros are used to complete the last six-bit sequence.
For example, the text Hi has the binary representation of:
01001000 01101001
The binary representation only contains 16 bits, which is not divisible by six.
010010 | 000110 | 1001
To encode the text properly, base64 will add zeros to the end of the bit sequence.
010010 | 000110 | 100100 |
S G k =
So the base64 representation of Hi is SGk=. (The = padding character is added so that the last encoded block will have four base64 characters.)
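Putting the grouping and padding rules together gives a small encoder. Treat this as a teaching sketch under the rules described above (the name `b64encode_sketch` is mine); in real code you would call your language's built-in encoder instead.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64encode_sketch(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 bytes short of a full group
        # Zero-fill the group to 24 bits so it can be split into four sextets.
        n = int.from_bytes(chunk + b"\x00" * missing, "big")
        chars = [ALPHABET[(n >> shift) & 0b111111] for shift in (18, 12, 6, 0)]
        # Trailing characters that hold only fill bits are replaced with "=".
        if missing:
            chars[-missing:] = "=" * missing
        out.extend(chars)
    return "".join(out)

print(b64encode_sketch(b"Hi"))   # SGk=
print(b64encode_sketch(b"Hi!"))  # SGkh
```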
Decoding base64
To decode base64, you simply have to reverse the above operation:
- First, you remove any padding characters from the end of the encoded string.
- Then, you translate each base64 character back to its six-bit binary representation.
- Finally, you divide the bits into byte-sized (eight-bit) chunks, discard any leftover zero bits that were added during encoding, and translate the data back to its original format.
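These steps translate fairly directly into code. The sketch below assumes well-formed input in the standard alphabet and does no validation; `b64decode_sketch` and `VALUES` are names I made up for illustration.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
VALUES = {ch: i for i, ch in enumerate(ALPHABET)}  # character -> six-bit value

def b64decode_sketch(text: str) -> bytes:
    # Step 1: remove the padding characters.
    stripped = text.rstrip("=")
    # Step 2: turn every character back into its six bits.
    bits = "".join(f"{VALUES[ch]:06b}" for ch in stripped)
    # Step 3: cut into bytes, dropping leftover fill bits that do not make a full byte.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

print(b64decode_sketch("SGk="))  # b'Hi'
print(b64decode_sketch("SGkh"))  # b'Hi!'
```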
Other base64 implementations
Apart from the standard base64 encoding scheme described above, there are many variants of base64 for specific use cases. For example:
- Base64 for filenames uses “-” in place of “/”. This is to work around the fact that Unix and Windows filenames cannot contain the character “/” since it’s used in file paths.
- Base64 for URLs uses “-” and “_” in place of “+” and “/”, and often omits the “=” padding. This is because +, / and = have special meaning in URLs and would otherwise have to be percent-encoded as %2B, %2F and %3D, which makes the encoded string unnecessarily long.
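Python's standard library exposes the URL variant directly, so the difference is easy to see. Note that `urlsafe_b64encode` keeps the padding, so removing and restoring it below is done by hand; the sample bytes are chosen only because they produce + and / in the standard alphabet.

```python
import base64

data = bytes([0xFB, 0xFF, 0xFE])

standard = base64.b64encode(data).decode("ascii")
url_safe = base64.urlsafe_b64encode(data).decode("ascii")
print(standard, url_safe)  # +//+ -__-

# Dropping the padding for use in a URL, then restoring it before decoding.
token = base64.urlsafe_b64encode(b"Hi").decode("ascii").rstrip("=")
print(token)  # SGk
restored = token + "=" * (-len(token) % 4)
print(base64.urlsafe_b64decode(restored))  # b'Hi'
```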
And that’s the basics of how base64 encoding works! Base64 is a simple yet powerful way of encoding binary data, and it powers a lot of our Internet. Thanks for reading!