Image by giedre vaitekune on Shutterstock

How Long Is a String?

A Rust Brain Teaser

Herbert Wolverson
Mar 23, 2022

https://pragprog.com/newsletter/
Guess the output of the following program:
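The listing did not survive in this copy of the article. A minimal reconstruction that is consistent with the output shown further down might look like the sketch below; the helper name `describe` is an assumption made for illustration and is not the author's original code.

```rust
// Reconstruction for illustration only: `describe` is a hypothetical helper,
// not part of the original listing.
fn describe(text: &str) -> String {
    // `len()` is the call this teaser revolves around.
    format!("{} is {} characters long.", text, text.len())
}

fn main() {
    const HELLO_WORLD: &str = "Halló heimur";
    println!("{}", describe(HELLO_WORLD));
}
```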

Try to guess what the output is before moving to the next section.

The program will display the following output:

Halló heimur is 13 characters long.

Discussion

Your eyes aren’t deceiving you — “Halló heimur” contains 12 characters (including the space). Let’s step back and take a look at how Rust’s String type works. The internal struct definition of a String is quite straightforward:
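The struct listing is missing from this copy. In the standard library source, `String` is a thin wrapper around a single `Vec<u8>` field. The sketch below mirrors that shape with a stand-in type so that it compiles on its own; the name `MyString` and its two methods are assumptions for illustration.

```rust
// Hypothetical stand-in that mirrors the shape of std's String:
// one field holding a growable vector of bytes.
pub struct MyString {
    vec: Vec<u8>,
}

impl MyString {
    // Copy the UTF-8 bytes of a string slice into the backing vector.
    pub fn from_text(text: &str) -> Self {
        MyString { vec: text.as_bytes().to_vec() }
    }

    // Like String::len, this reports bytes, not characters.
    pub fn len(&self) -> usize {
        self.vec.len()
    }
}

fn main() {
    let s = MyString::from_text("Halló heimur");
    println!("backing vector holds {} bytes", s.len());
}
```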

Strings are just a vector of bytes (u8), representing Unicode characters in an encoding named UTF-8. Rust source files are themselves UTF-8, so the text you type in a string literal is stored in that encoding without any extra work on your part. The encoding looks like this:
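The encoding figure is not reproduced in this copy. One way to inspect the bytes yourself is to print each one in hexadecimal; `hex_bytes` is a helper name invented for this sketch.

```rust
// Hypothetical helper: render each UTF-8 byte of `text` as two hex digits.
fn hex_bytes(text: &str) -> Vec<String> {
    text.bytes().map(|b| format!("{:02X}", b)).collect()
}

fn main() {
    // Prints: 48 61 6C 6C C3 B3 20 68 65 69 6D 75 72
    println!("{}", hex_bytes("Halló heimur").join(" "));
}
```

Every letter maps to one byte except ó, which appears as the pair C3 B3.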

Your original string, “Halló heimur”, consists of 11 ASCII characters (including the space) and one “Latin-1 Supplement” character: the ó. ASCII characters require 1 byte to encode; Latin-1 Supplement characters require 2 bytes.

Rust’s string encoding is smart enough to not store extra zeroes for each Unicode character. If it did, String would be a vector of char types. Rust’s char is exactly 4 bytes long — the maximum size of a single Unicode character. Char variables don’t represent a single ASCII character; instead, they represent a Unicode scalar value. The scalar value can represent a single glyph or modification to another glyph.
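The size difference between a char and its UTF-8 encoding can be checked with `std::mem::size_of` and `char::len_utf8`. This is a small sketch; the helper `padding_bytes` is a name made up here to show how many of a char's 4 bytes the UTF-8 form does not need.

```rust
use std::mem::size_of;

// Hypothetical helper: bytes a `char` occupies beyond its UTF-8 encoding.
fn padding_bytes(c: char) -> usize {
    size_of::<char>() - c.len_utf8()
}

fn main() {
    println!("size of char: {}", size_of::<char>()); // 4
    println!("'a' padding: {}", padding_bytes('a')); // 3
    println!("'ó' padding: {}", padding_bytes('ó')); // 2
}
```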

String Length

String::len() counts the number of bytes in the string’s backing vector. If a String were storing every character as a char, you’d expect Halló heimur to occupy 48 bytes of memory (12 characters at 4 bytes each). Rust’s String isn’t storing characters; it’s storing a byte array representing just the bytes needed to output the stored text.

Not every character needs 4 bytes in UTF-8. For example, a space requires only 1 byte (0x20), while characters in the Latin-1 Supplement block use 2 bytes. The first byte (0xC3) marks the start of a 2-byte sequence and carries the upper bits of the code point, placing it in the range U+00C0 to U+00FF; the second byte (0xB3 for ó) carries the remaining bits and identifies the character as U+00F3.
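The bit arithmetic behind a 2-byte sequence can be sketched by hand. The function `decode_two_bytes` below is a hypothetical helper written for this illustration; real code should rely on `std::str::from_utf8`, and this sketch does not reject overlong forms such as 0xC0 0x80.

```rust
// Hypothetical helper: decode a 2-byte UTF-8 sequence by hand.
// Layout: 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy.
fn decode_two_bytes(first: u8, second: u8) -> Option<char> {
    // Reject bytes that lack the 2-byte lead or continuation markers.
    if first & 0b1110_0000 != 0b1100_0000 || second & 0b1100_0000 != 0b1000_0000 {
        return None;
    }
    let high = u32::from(first & 0b0001_1111); // 5 payload bits
    let low = u32::from(second & 0b0011_1111); // 6 payload bits
    char::from_u32((high << 6) | low)
}

fn main() {
    // 0xC3 0xB3 decodes to U+00F3.
    println!("{:?}", decode_two_bytes(0xC3, 0xB3)); // Some('ó')
}
```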

The 11 ASCII characters in Halló heimur each use 1 byte of memory, for 11 bytes in total. Add 2 bytes for the “ó” and your string occupies 13 bytes of memory.

Counting Characters

You can correctly count the characters in Halló heimur with the following code:
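The listing is missing from this copy. A minimal sketch of the idea, using the `my_str` name from the discussion that follows, could look like this; the helper `char_count` is an assumption for illustration rather than the author's code.

```rust
// Hypothetical helper for illustration.
fn char_count(my_str: &str) -> usize {
    // Count Unicode scalar values rather than bytes.
    my_str.chars().count()
}

fn main() {
    let my_str = "Halló heimur";
    // Prints: Halló heimur is 12 characters long.
    println!("{} is {} characters long.", my_str, char_count(my_str));
}
```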

When you call my_str.chars(), you’re requesting an iterator that returns each element of the string represented as a char. Rust correctly deduces that 12 Unicode scalar values (here, one per glyph) make up the string. The iterator passes each of them to your consumer as a 4-byte char. Even if a glyph only requires 1 or 2 bytes of memory, Rust will allocate all 4 bytes for the char type. Traversing the iterator uses very little extra memory. If you call collect() on the iterator to create a vector of char data, the vector’s contents will consume 48 bytes of memory (12 chars at 4 bytes each).
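The memory cost of collecting can be checked with simple arithmetic: the element data of a `Vec<char>` is the number of chars multiplied by `size_of::<char>()`. The helper name `collected_size` is invented for this sketch, and it ignores the small fixed overhead of the vector itself.

```rust
use std::mem::size_of;

// Hypothetical helper: bytes of element data in a Vec<char> built from `text`.
fn collected_size(text: &str) -> usize {
    let chars: Vec<char> = text.chars().collect();
    chars.len() * size_of::<char>()
}

fn main() {
    let my_str = "Halló heimur";
    println!("String data: {} bytes", my_str.len());               // 13
    println!("Vec<char> data: {} bytes", collected_size(my_str)); // 48
}
```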

Use my_str.chars() to access individual characters in a String. It’s an iterator, so you can use nth, for_each and other iterator functions to find what you’re looking for. For example, my_str.chars().nth(4) returns the 5th character in a string, because nth counts from zero. The result is an Option<char>, which is None when the string is too short.
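A short sketch of positional access follows. Note that `nth` is zero-based and returns an `Option<char>`; the wrapper `char_at` is a hypothetical name used only for this example.

```rust
// Hypothetical helper: fetch the character at a zero-based position.
fn char_at(my_str: &str, index: usize) -> Option<char> {
    my_str.chars().nth(index)
}

fn main() {
    let my_str = "Halló heimur";
    println!("{:?}", char_at(my_str, 4));  // Some('ó'), the 5th character
    println!("{:?}", char_at(my_str, 12)); // None, past the end
}
```

Because the iterator has to walk the string from the start, each lookup takes time proportional to the index.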

Impact of UTF-8 Sizing

Unicode string sizing can be confusing at times, which can lead to surprising results in your code. You need to be aware of the distinction between characters and bytes:

  • When you’re validating string length, know what counts and what doesn’t. For example, if you only accept usernames that are 10 characters or fewer, you need to decide if you mean glyphs or bytes.
  • When storing strings in databases, you need to remember to allocate enough space for strings that contain non-ASCII characters.
  • When transmitting or receiving information to/from a remote API, you need to agree on a length standard for encoding strings in transit.
  • If you’re writing a program for a memory-constrained system, parsing a Unicode string character by character can consume a lot more memory than you expected. The string love: ❤️ displays as 7 glyphs. Assuming the heart is stored as U+2764 followed by the variation selector U+FE0F, the string contains 8 Unicode scalar values, requires 12 bytes of storage in a String, and takes 32 bytes of memory when processed as individual chars. This may seem like a small amount of memory, but if your reader enters the entirety of War and Peace into your program’s input box, per-character parsing may require more resources than you expected.
  • When accessing individual characters in a string, it’s much safer to use chars as opposed to directly accessing the byte array. Characters are aware of Unicode boundaries — bytes are not. Printing the first 6 bytes of “Können” will only print “Könne”. Printing the first 6 characters will output the entire word.
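The byte-versus-character distinction in the points above can be sketched as follows. The helper `first_chars` is a hypothetical name for this illustration, and the heart is assumed to be stored as U+2764 followed by the variation selector U+FE0F, which is why `chars().count()` reports 8 even though 7 glyphs are displayed.

```rust
// Hypothetical helper: keep at most `limit` characters, never splitting one.
fn first_chars(text: &str, limit: usize) -> &str {
    match text.char_indices().nth(limit) {
        Some((byte_index, _)) => &text[..byte_index],
        None => text, // `limit` or fewer characters: keep everything
    }
}

fn main() {
    let word = "Können";
    println!("{}", &word[..6]);           // Könne (6 bytes)
    println!("{}", first_chars(word, 6)); // Können (6 characters)
    // `get` returns None instead of panicking when a byte index
    // falls inside a multi-byte character such as ö.
    println!("{:?}", word.get(..2));      // None

    // Assumption: heart written as U+2764 plus variation selector U+FE0F.
    let love = "love: \u{2764}\u{FE0F}";
    // Prints: 12 bytes, 8 chars
    println!("{} bytes, {} chars", love.len(), love.chars().count());
}
```

Slicing with `&word[..2]` would panic at runtime for the same reason that `get(..2)` returns None, so the checked form is the safer choice when the index comes from user input.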

Further Reading

You’ll find this puzzle and more in Rust Brain Teasers from The Pragmatic Bookshelf. You can save 35 percent on the ebook versions of Herbert’s books with promo code rust_2022 now through April 30, 2022. Promo codes are not valid on prior purchases.
