Hi everyone! In this first installment of the Whoa, that’s fantastic! series, we’re gonna talk about UTF-8 and its mysterious, mind-bending ways.
But let’s take this from the beginning.
You see, back in my day, if you wanted to have a voice chat with your BFF you needed to put coins into these huge-ass smartphones they had bolted onto the sidewalk, then turn a dial based on some random number assigned to your friend, and then that’d make a slightly smaller smartphone plugged into a wall at your friend’s house make noises, and it meant someone wanted to have a voice chat.
My point is, I’m old. And as an old person, I grew up using ASCII.
And ASCII is simple:
One byte per character. One character per byte. Beautiful.
Then I took a break from computering for a few years, and when I came back there was this whole UTF-8 thing.
So I learned about runes and whatever (that’s Golang if you’re wondering), and for a while, functionally at least, everything was fine.
But y’know, I’m a curious person, and there’s nothing I don’t get curious about eventually.
Which led me to…
I mean, of course, right? UTF-8 needs more than one byte per character, else how would it encode a million-plus different characters?
So those funny characters take more than one byte (and in UTF-8 a character may take up to four).
But… how does it work?
1. Do accent bytes connect with ASCII bytes to form the new characters?
2. Are there separator bytes in between character bytes, meaning anything in between separators gets lumped together?
3. Is this black magic of some sort?
No and no. Well gosh darn it.
(And about #3, my otherworldly consultants said it isn’t.)
At this point I spent ages trying to solve this on my own, and then I finally RTFM’d. Here’s the important bit, adapted from Wikipedia:
- If your byte starts with a zero, that’s a normal ASCII character.
- If your byte starts with a 110, you got a two-byte character.
- If your byte starts with a 1110, you got a three-byte character.
- 11110 means four.
- And all 10* bytes are “connectors,” so to speak.
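Those rules translate almost one-to-one into code. Here’s a little Go sketch (`seqLen` is my own throwaway helper, not anything standard) that reads the sequence length right off a leading byte:

```go
package main

import "fmt"

// seqLen reports how many bytes the UTF-8 sequence starting with b occupies,
// or 0 if b is a continuation ("connector") byte.
func seqLen(b byte) int {
	switch {
	case b&0b1000_0000 == 0: // 0xxxxxxx: plain ASCII
		return 1
	case b&0b1110_0000 == 0b1100_0000: // 110xxxxx: two-byte character
		return 2
	case b&0b1111_0000 == 0b1110_0000: // 1110xxxx: three-byte character
		return 3
	case b&0b1111_1000 == 0b1111_0000: // 11110xxx: four-byte character
		return 4
	default: // 10xxxxxx: connector
		return 0
	}
}

func main() {
	for _, b := range []byte{0x61, 0xC3, 0xE9, 0xF0, 0xA3} {
		fmt.Printf("%08b → %d\n", b, seqLen(b))
	}
}
```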
To see if this works:
Let’s ponder this before I show you the output. We have four characters there, and their code points are: U+0061, U+00E3, U+9999, and U+1F914. They should be one, two, three, and four bytes long respectively.
Meaning we should have:
- For our first character: one byte starting with a zero.
- Second character: one byte starting with 110, followed by another starting with a 10.
- Third character: first byte starts with 1110, next two start with 10.
- Fourth character: first byte starts with 11110, the next three start with 10.
Reshuffled a bit for clarity:
1. 01100001
2. 11000011 10100011
3. 11101001 10100110 10011001
4. 11110000 10011111 10100100 10010100
Now let’s see if the remaining bits, apart from all that signaling, actually form the numbers we’re looking for:
- 11000011 10100011 → 00011 100011
- 11101001 10100110 10011001 → 1001 100110 011001
- 11110000 10011111 10100100 10010100 → 000 011111 100100 010100
Which we can use to reconstruct the code points:
The output here is e3 9999 1f914. In other words:
- 00011100011 → 0xE3 (ã)
- 1001100110011001 → 0x9999 (香)
- 000011111100100010100 → 0x1F914 (🤔)
So there we go.
All is good in the world now.
We know how UTF-8 does its magic.
Or do we?
Do you ever wonder what happens when a four-byte character…
…meets a special four-byte character, and then the two of them…
…engage in twenty-seven bytes’ worth of naughtiness and…
…they end up a big, happy, twenty-five-byte-long family?
You do? Awesome! Stay tuned for the next post!
Thank you for reading! If you enjoyed this article please share it, and make sure to subscribe to dEffective Go!