String encoding One-o’-One

Théo Carrive
Published in Cheerz Engineering
7 min read · Sep 24, 2020

Hands-on introduction to string encoding with Ruby 💎

Strings are a key part of every programming language and of every interaction between systems. Learning how strings work in detail is far from impossible, but many developers avoid it so thoroughly that when they stumble upon an encoding problem, they let it ruin their life for hours, if not days.

So, let’s take you on a journey to discover the joy of how encoding works, so that when the day comes, you’ll be ready for it.

Step 1. Disassemble the string

Let’s start small. To analyze a string, a good place to start is to split it into an array of its individual characters. After all, a string is nothing more than a concatenated array of characters, right?

Alright. Let’s start with an easy one:

"Hello"

Unsurprisingly, you use the “chars” method:

"Hello".chars
=> ["H", "e", "l", "l", "o"]

That was easy. 🙃

Let’s skip to a more challenging part and analyze the data that actually represents our string:

"Hello".bytes
=> [72, 101, 108, 108, 111]

Alright, now we’re talking.

What we see here is an array of bytes, and what is really interesting with this method is that it exposes the real data behind the characters you see when you print your string.

I won’t spend much time explaining how binary numeration works; people have done it far better elsewhere on the internet. The only important part to know here is that a byte is itself an array of 8 bits, so a byte can take 2⁸ = 256 different values. In other words, a byte represents a value from 0 to 255. This is what you see in this tiny array.
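If you want to see those 8 bits for yourself, here is a minimal sketch using only Integer#to_s and String#to_i with base 2 (the variable name bits is just for illustration):

```ruby
# Each byte of "Hello", written as 8 bits (zero-padded on the left).
bits = "Hello".bytes.map { |byte| byte.to_s(2).rjust(8, "0") }
# => ["01001000", "01100101", "01101100", "01101100", "01101111"]

# Going back from the bit string to the integer value.
bits.first.to_i(2)
# => 72
```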

Step 2. Understand what encoding is

But the Ruby string contains something other than just those bytes: it also contains the encoding that is used to interpret those bytes.

Let’s check the name of the encoding our string actually uses:

"Hello".encoding
=> #<Encoding:US-ASCII>

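You can see that the current encoding for this string is US-ASCII. (Depending on your setup, you may see UTF-8 instead: it is the default for string literals in source files since Ruby 2.0, and it maps the byte values 0 to 127 exactly like ASCII.) OK, but what is this encoding for?

Well, you’ve got your string broken apart into bytes, and each of those bytes is actually mapped to a character. An encoding is just the way you map bytes to characters.

Here, our encoding is ASCII, and the main parts of the ASCII encoding are the following mappings: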

  • The byte value 48 maps to the digit “0”. Thus, 49 maps to “1”, 50 to “2”, etc., until 57 that maps to “9”.
  • The byte value 65 maps to the uppercase letter “A” (so 90 = “Z”)
  • The byte value 97 maps to the lowercase letter “a” (so 122 = “z”)
  • Other byte values correspond to other characters (like a space, “%”, a carriage return, etc.)

So if we look again at our array:

"Hello".bytes
=> [72, 101, 108, 108, 111]

We can guess that in ASCII, 72 maps to “H”, 101 to “e”, etc.
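No need to guess, actually. As a small sketch, Integer#chr and String#ord let you walk the ASCII table in both directions (digits and hello are illustrative names):

```ruby
# Integer#chr turns a byte value into the matching one-character string.
72.chr   # => "H"
digits = (48..57).map(&:chr)
# => ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

# String#ord goes the other way.
"A".ord  # => 65
"z".ord  # => 122

# Rebuilding "Hello" from its byte values.
hello = [72, 101, 108, 108, 111].map(&:chr).join
# => "Hello"
```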

Great.

But… Wait a minute! If we only have 256 values, is that really enough to contain the alphabet twice (lowercase and uppercase), special characters, digits, accents, Chinese characters, Japanese characters, and even worse… emojis 🥺🥺🥺🥺?

No.

They just can’t fit in 256 values.

Actually, ASCII stands for American Standard Code for Information Interchange, and it didn’t take all those non-American characters into account at the time. And by the way, for historical reasons, ASCII doesn’t even contain 256 characters, but only 128, so it fits in 7 bits. It means that in ASCII, Chinese characters, emojis, Arabic characters, etc. simply don’t exist.

This is why we had to come up with encodings other than ASCII, like UTF-8.

So, how is UTF-8 different from ASCII?

Well, in UTF-8, if you read an array of bytes and one byte has a value of 192 or more (194 or more in valid UTF-8), you know that you need to read the next byte to know which character those bytes represent.

And actually, if the value of the first byte is 224 or more, you have to read the next two bytes, and if it is 240 or more, the next three, since UTF-8 can store a character on up to 4 bytes.
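Here is a rough sketch of that rule, using the first-byte ranges of valid UTF-8 (194 to 223 for two bytes, 224 to 239 for three, 240 to 244 for four). The helper utf8_sequence_length is hypothetical, written for this article; it is not a Ruby built-in, and Ruby does all of this for you internally:

```ruby
# Hypothetical helper (not a Ruby built-in): how many bytes a UTF-8
# character occupies, judging only from the value of its first byte.
def utf8_sequence_length(first_byte)
  case first_byte
  when 0..127   then 1  # plain ASCII
  when 194..223 then 2  # e.g. "à"
  when 224..239 then 3  # e.g. "€"
  when 240..244 then 4  # e.g. emojis
  else nil              # continuation byte (128..191) or never valid as a first byte
  end
end

utf8_sequence_length("H".bytes.first)   # => 1
utf8_sequence_length("à".bytes.first)   # => 2
utf8_sequence_length("€".bytes.first)   # => 3
utf8_sequence_length("🥕".bytes.first)  # => 4
```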

Wait, would it mean that the number of bytes in a string is no longer equal to the number of chars?

YES.

Step 3. Multi-byte characters

You understood me well: we now get to play with nothing less than multi-byte characters! 🎉🎉

"Voilà".chars
=> ["V", "o", "i", "l", "à"]

That was easy. Now, let’s look at the bytes:

"Voilà".bytes
=> [86, 111, 105, 108, 195, 160]

We have one more byte than the number of chars.

Unsurprisingly, you can see that I didn’t lie that much: the first byte of the “à” is 195, and 195 is > 192. This is how your machine knows, when it reads the 195, that it also has to read the next byte (the 160).

Another way to see more easily how the chars are decomposed:

"Voilà".chars.map{ |c| c.bytes }
=> [[86], [111], [105], [108], [195, 160]]

There you can see that each “American” (ASCII) character is represented by one byte, and that the “à” is actually represented by two bytes.

If you wanted to quickly analyze a very long string, you could even push it just a bit further and take advantage of the nice Ruby syntax:
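A quick way to spot this gap without listing every byte, sketched with the standard String#length, String#bytesize and String#ascii_only? methods (word is an illustrative name):

```ruby
word = "Voilà"

word.length          # => 5 (characters)
word.bytesize        # => 6 (bytes)

# True only when every byte is below 128, i.e. one byte per character.
word.ascii_only?     # => false
"Hello".ascii_only?  # => true
```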

"Voilà".chars.map{ |c| [c, c.bytes] }.to_h
=> {"V"=>[86], "o"=>[111], "i"=>[105], "l"=>[108], "à"=>[195, 160]}

Great. But since we are converting a string to its actual byte values, could we do it the other way around?

Absolutely. This is what the Ruby “pack” method is for.

Step 4. Pack your bytes, back to a string

With the pack method, you can take an array of values, tell Ruby to consider each value as a byte, and convert it to a String:

[86, 111, 105, 108, 195, 160].pack('C*')
=> "Voil\xC3\xA0"

Mmmh. This is not great. What happened to our last letter “à”?

Actually, in a Ruby string, when your machine displays “\xC3”, it’s not 4 chars, but one. It’s the Ruby way of telling you that at this character position, the byte value is “C3” (in hexadecimal), but that it didn’t manage to map it to a character.

Let’s take a look at the integer values of the hexadecimal “C3” and “A0”:

"C3".to_i(16)
=> 195
"A0".to_i(16)
=> 160

Ah. Those are our original bytes.

If Ruby had used the proper UTF-8 encoding, we wouldn’t see this problem. But pack('C*') returns a string labeled as raw binary (ASCII-8BIT), so let’s just tell Ruby which encoding to use when reading the string:

[86, 111, 105, 108, 195, 160].pack('C*').force_encoding('UTF-8')
=> "Voilà"

Great. That’s it. When you know how to decode a string, you can display it properly.

In a web API response, for instance, the remote server can give you a hint by specifying the encoding in the “Content-Type” header of the response:
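To convince yourself that force_encoding only changes the label and never the bytes, here is a small sketch (packed and utf8 are illustrative names):

```ruby
packed = [86, 111, 105, 108, 195, 160].pack("C*")

# pack("C*") labels its result as raw binary data.
packed.encoding == Encoding::BINARY  # => true
packed.length                        # => 6, every byte counts as one character

# force_encoding changes only the label; the bytes are untouched.
utf8 = packed.dup.force_encoding("UTF-8")
utf8.bytes == packed.bytes           # => true
utf8.length                          # => 5
utf8.valid_encoding?                 # => true
```

This is also why force_encoding cannot repair bytes that were never valid in the target encoding: it reinterprets them, it does not convert them.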

Content-Type: text/html; charset=utf-8

This tells the client that receives the bytes of the response that it should use UTF-8 to decode it.
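On the client side, that could look like the sketch below. The helper charset_from is hypothetical and deliberately naive, and falling back to UTF-8 when the header names no charset is an assumption of this example, not a rule. Real HTTP clients often handle this for you.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type value,
# falling back to UTF-8 when none is given (an assumption, not a rule).
def charset_from(content_type, default: "UTF-8")
  match = content_type.match(/charset=["']?([^;"'\s]+)/i)
  match ? match[1] : default
end

header   = "text/html; charset=utf-8"
raw_body = [86, 111, 105, 108, 195, 160].pack("C*")  # bytes as received

charset = charset_from(header)           # => "utf-8"
body    = raw_body.force_encoding(charset)
# => "Voilà"
body.valid_encoding?                     # => true
```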

Step 5. Bytes vs codepoints

One last thing before you stop reading: Let’s talk about codepoints, and Unicode.

"Voilà".codepoints
=> [86, 111, 105, 108, 224]

The “codepoints” method returns an array. So far, it looks like an array of bytes. But let’s apply it to a different string, with some funnier stuff inside:

"Voilà🥕".codepoints
=> [86, 111, 105, 108, 224, 129365]

Et voilà! You can see that there is a value way above 255: it means that those numbers are not byte values, but good old integers!

Indeed, codepoints are just integers that represent each character. The mapping is defined by a standard called Unicode, which means that each character in the world has a number associated with it. But this number is not an encoding: it says nothing about how systems should read bytes, it’s just a standard numbering.
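To make the difference concrete, here is a small sketch showing one codepoint stored as different bytes depending on the encoding (a_grave is an illustrative name):

```ruby
a_grave = "à"

a_grave.ord           # => 224
a_grave.ord.to_s(16)  # => "e0", usually written U+00E0
a_grave == "\u00E0"   # => true

# Same codepoint, different bytes depending on the encoding.
a_grave.encode("UTF-8").bytes       # => [195, 160]
a_grave.encode("UTF-16BE").bytes    # => [0, 224]
a_grave.encode("ISO-8859-1").bytes  # => [224]

# The codepoint stays the same whatever the encoding.
a_grave.encode("UTF-16BE").codepoints  # => [224]
```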

Recently, while working with the API of a provider, we saw something confusing. The response string of the API was something like:

"Cl\xE9ment"

The response was supposed to be in UTF-8, but we got this decoding error. We tried to change the encoding, but it didn’t help:

str = "M\xE9chant"
str.force_encoding('UTF-8')
=> "M\xE9chant"

We played a bit with the string:

"M\xE9chant".chars
=> ["M", "\xE9", "c", "h", "a", "n", "t"]
"M\xE9chant".chars.map{|c| c.valid_encoding? }
=> [true, false, true, true, true, true, true]

We clearly see that there is an encoding problem.

Let’s try to brute-force the encoding (the entire list is returned by Encoding.name_list):

encoding_formats = %w[ASCII UTF-7 UTF-8 UTF-16 UTF-32 BIG5]
str = "M\xE9chant"
encoding_formats.map do |encoding|
  # dup, so that force_encoding does not relabel str itself
  { encoding => str.dup.force_encoding(encoding).valid_encoding? }
end
=> [{"ASCII"=>false}, {"UTF-7"=>true}, {"UTF-8"=>false}, {"UTF-16"=>false}, {"UTF-32"=>false}, {"BIG5"=>true}]

Meh. 😕

But a quick search on the internet tells us that “E9” (or 233 in decimal representation) is actually the codepoint of “é”.

The right solution is to understand why you received codepoints instead of UTF-8 encoded bytes in the first place, but a quick fix for this problem would be the following:

bytes = "M\xE9chant".bytes
=> [77, 233, 99, 104, 97, 110, 116]
bytes.pack('U*')
=> "Méchant"

Victory. 🍾

Yes: we take the bytes, which should be read not as UTF-8 bytes but as Unicode codepoints, and we pack them back, specifying that they are Unicode codepoints.
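This trick works because ISO-8859-1 (Latin-1) stores each of its 256 characters in a single byte whose value equals the character's Unicode codepoint. If the provider was in fact sending Latin-1 (an assumption; we don't know where those bytes came from), then transcoding states the intent more explicitly. A minimal sketch, with raw and fixed as illustrative names:

```ruby
# Assumption: the provider actually sent ISO-8859-1 (Latin-1) bytes.
raw = "M\xE9chant".b  # binary copy of the received bytes

# Label the bytes as Latin-1, then convert them to real UTF-8.
fixed = raw.force_encoding("ISO-8859-1").encode("UTF-8")
# => "Méchant"
fixed.valid_encoding?  # => true
fixed.bytes            # => [77, 195, 169, 99, 104, 97, 110, 116]

# Same result as packing the byte values as codepoints.
fixed == raw.bytes.pack("U*")  # => true
```

Unlike pack('U*'), this version stays correct if the source turns out to be another single-byte encoding: you only have to change the name passed to force_encoding.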

Let’s wrap it up

Useful methods to break strings apart

"Hi 🧀".chars
=> ["H", "i", " ", "🧀"]
"Hi 🧀".bytes
=> [72, 105, 32, 240, 159, 167, 128]
"Hi 🧀".chars.map{ |c| [c, c.bytes] }.to_h
=> {"H"=>[72], "i"=>[105], " "=>[32], "🧀"=>[240, 159, 167, 128]}
"Hi 🧀".codepoints
=> [72, 105, 32, 129472]

Useful methods to change encoding

"Hi".encode('ASCII')
=> "Hi"
"Hi 🧀".encode('ASCII')
Traceback (most recent call last):
Encoding::UndefinedConversionError (U+1F9C0 from UTF-8 to US-ASCII)
"M\xC3\xA9chant".force_encoding('UTF-8')
=> "Méchant"

Useful methods to reassemble strings

[72, 105, 32, 129472].pack('U*')
=> "Hi 🧀"
[72, 105, 32, 240, 159, 167, 128].pack('C*').force_encoding('UTF-8')
=> "Hi 🧀"

That’s it.

Now you know how to analyze a string, how to change its encoding, how to convert it to bytes or codepoints, and how to convert bytes back to a string. Suddenly, your understanding of the universe is a bit better, and the sun shines somewhere on the planet.

Thanks for reading! 😘

(2023 update: Great related reading here https://tonsky.me/blog/unicode/)
