How Long Is a String?
A Rust Brain Teaser
Guess the output of the following program:
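The program listing itself is not reproduced here. A minimal reconstruction that is consistent with the output shown later in this article might look like the following; the constant name HELLO_WORLD and the exact formatting are assumptions, not the original listing.

```rust
// Reconstruction: the constant name and the format string are assumptions,
// chosen to match the output quoted in the article.
const HELLO_WORLD: &str = "Halló heimur";

fn main() {
    // len() is the call the teaser is built around.
    println!("{} is {} characters long.", HELLO_WORLD, HELLO_WORLD.len());
}
```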
Try to guess what the output is before moving to the next section.
The program will display the following output:
Halló heimur is 13 characters long.
Discussion
Your eyes aren’t deceiving you: “Halló heimur” contains 12 characters (including the space). Let’s step back and take a look at how Rust’s String type works. The internal struct definition of a String is quite straightforward:
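The standard library's String is a thin wrapper around a single private Vec<u8> field. As a rough sketch of that shape, here is a hypothetical stand-in type (MyString and from_text are illustrative names, not part of the standard library):

```rust
// Hypothetical stand-in with the same shape as the standard library's String,
// whose only field is a private Vec<u8>.
struct MyString {
    vec: Vec<u8>,
}

impl MyString {
    // Copy the UTF-8 bytes of a string slice into the backing vector.
    fn from_text(text: &str) -> Self {
        MyString { vec: text.as_bytes().to_vec() }
    }

    // Like String::len, this reports bytes, not characters.
    fn len(&self) -> usize {
        self.vec.len()
    }
}

fn main() {
    let s = MyString::from_text("Halló heimur");
    println!("backing vector holds {} bytes", s.len());
}
```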
A String is just a vector of bytes (u8) that represents Unicode characters in an encoding named UTF-8. Rust source files are UTF-8, so your string literal is stored in that encoding without any extra work on your part. The encoding looks like this:
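One way to inspect the encoding yourself is to print the UTF-8 bytes of each character. This is a small sketch; utf8_bytes is a hypothetical helper name introduced only for this example.

```rust
// Return the UTF-8 bytes that encode a single character.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // a char never needs more than 4 bytes in UTF-8
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // Print each character next to its encoded bytes in hexadecimal.
    for c in "Halló heimur".chars() {
        println!("{:?} -> {:02X?}", c, utf8_bytes(c));
    }
}
```

Every ASCII character shows up as a single byte, while ó shows up as the pair C3 B3.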
Your original string, “Halló heimur”, consists of 11 ASCII characters (including the space) and one “Latin-1 Supplement” character: the ó. ASCII characters require 1 byte to encode; Latin-1 Supplement characters require 2 bytes.
Rust’s string encoding is smart enough not to store extra zeroes for each Unicode character. If it did, String would be a vector of char values. Rust’s char is exactly 4 bytes long, the maximum size of a single Unicode character. A char doesn’t represent a single ASCII character; instead, it represents a Unicode scalar value. A scalar value can represent a single glyph or a modification to another glyph.
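Both claims can be checked directly. The sketch below prints the size of char and shows a glyph that is built from two scalar values; the example string (an e followed by the combining acute accent U+0301) and the helper name scalar_count are choices made for this illustration.

```rust
use std::mem::size_of;

// Number of Unicode scalar values (chars) in a string slice.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Every char occupies 4 bytes, regardless of which character it holds.
    println!("size of char: {} bytes", size_of::<char>());

    // 'e' followed by U+0301 (combining acute accent) renders as one glyph
    // but consists of two scalar values.
    println!("scalar values in e + U+0301: {}", scalar_count("e\u{301}"));
}
```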
String Length
String::len() counts the number of bytes in the string’s backing vector. If a String stored every character as a char, you’d expect Halló heimur to occupy 48 bytes of memory (12 characters at 4 bytes each). Rust’s String isn’t storing chars; it’s storing a byte array holding just the bytes needed to represent the stored text.
Not all characters require all 4 bytes in UTF-8. For example, a space requires only 1 byte (0x20), while Latin-1 Supplement characters such as ó use 2 bytes. The first byte (0xC3) marks the start of a two-byte sequence and carries the high bits of the code point, and the second byte (0xB3 for ó) is a continuation byte that carries the low six bits.
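To make the two-byte layout concrete, here is a hand-rolled decoder for exactly this case. It is a simplified sketch: decode_two_byte is a hypothetical helper, and unlike the standard library's validation it does not reject overlong encodings.

```rust
// Decode a two-byte UTF-8 sequence by hand.
// Lead byte:         110xxxxx (5 payload bits)
// Continuation byte: 10xxxxxx (6 payload bits)
// Simplification: overlong encodings (lead bytes 0xC0 and 0xC1) are not rejected.
fn decode_two_byte(lead: u8, cont: u8) -> Option<char> {
    if lead & 0b1110_0000 != 0b1100_0000 || cont & 0b1100_0000 != 0b1000_0000 {
        return None; // not shaped like a two-byte sequence
    }
    // Strip the marker bits and glue the payload bits together.
    let code_point = (((lead & 0b0001_1111) as u32) << 6) | (cont & 0b0011_1111) as u32;
    char::from_u32(code_point)
}

fn main() {
    // 0xC3 0xB3 carries the payload bits 00011 and 110011, i.e. U+00F3.
    println!("{:?}", decode_two_byte(0xC3, 0xB3));
}
```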
The string Halló heimur contains 11 ASCII characters, each using 1 byte of memory, for a total of 11 bytes. Add two bytes for the “ó” and your string occupies 13 bytes of memory.
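The same arithmetic can be reproduced in code. In this sketch, byte_breakdown is a hypothetical helper that returns the number of ASCII characters and the number of bytes used by everything else.

```rust
// Split a string's size into (ASCII character count, bytes used by non-ASCII characters).
fn byte_breakdown(s: &str) -> (usize, usize) {
    let ascii = s.chars().filter(|c| c.is_ascii()).count();
    let non_ascii_bytes: usize = s
        .chars()
        .filter(|c| !c.is_ascii())
        .map(|c| c.len_utf8()) // bytes this character needs in UTF-8
        .sum();
    (ascii, non_ascii_bytes)
}

fn main() {
    let text = "Halló heimur";
    let (ascii, other) = byte_breakdown(text);
    // The two parts always add up to len(), because len() counts bytes.
    println!("{} + {} = {} bytes", ascii, other, text.len());
}
```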
Counting Characters
You can correctly count the characters in Halló heimur with the following code:
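The listing is not reproduced here. Based on the discussion of my_str.chars() that follows, a minimal version might look like this; the helper name char_count is an assumption added so the count can be reused.

```rust
// Count Unicode scalar values rather than bytes.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let my_str = "Halló heimur";
    println!("{} is {} characters long.", my_str, char_count(my_str));
}
```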
When you call my_str.chars(), you’re requesting an iterator that returns each element of the string represented as a char. Rust correctly reports a total of 12 Unicode scalar values (here, one per glyph) making up the string. The iterator passes each of them to your consumer as a 4-byte char. Even if a character only requires 1 or 2 bytes in UTF-8, a char always occupies 4 bytes. Traversing the iterator uses very little extra memory. If you call collect() on the iterator to create a vector of char data, the vector’s elements will consume 48 bytes of memory (12 chars at 4 bytes each).
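A short sketch of that cost: collecting the iterator into a Vec<char> and measuring the element storage. The helper name collected_size is hypothetical, and the figure ignores the vector's own bookkeeping and any spare capacity.

```rust
use std::mem::size_of;

// Bytes occupied by the elements of a Vec<char> built from the string.
// This ignores the Vec's own header and any unused capacity.
fn collected_size(s: &str) -> usize {
    let chars: Vec<char> = s.chars().collect();
    chars.len() * size_of::<char>()
}

fn main() {
    let my_str = "Halló heimur";
    println!("as String: {} bytes", my_str.len());
    println!("as Vec<char>: {} bytes", collected_size(my_str));
}
```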
Use my_str.chars() to access individual characters in a String. It’s an iterator, so you can use nth, for_each, and other iterator functions to find what you’re looking for. For example, you can access the 4th character in a string with my_str.chars().nth(3), because nth counts from zero.
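Because nth is zero-based and returns an Option, a small wrapper can make "the Nth character" less error-prone. This is only a sketch; nth_char and its 1-based position parameter are conventions invented for this example.

```rust
// Return the character at a 1-based position, or None if the position is 0
// or past the end of the string.
fn nth_char(s: &str, position: usize) -> Option<char> {
    // Iterator::nth is 0-based, so shift the position down by one.
    position.checked_sub(1).and_then(|i| s.chars().nth(i))
}

fn main() {
    let my_str = "Halló heimur";
    println!("{:?}", nth_char(my_str, 4));
    println!("{:?}", nth_char(my_str, 5));
    println!("{:?}", nth_char(my_str, 13));
}
```

Note that nth walks the string from the start each time, so repeated lookups on a long string are better served by iterating once.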
Impact of UTF-8 Sizing
Unicode string sizing can be confusing at times, which can lead to surprising results in your code. You need to be aware of the distinction between characters and bytes:
- When you’re validating string length, know what counts and what doesn’t. For example, if you only accept usernames that are 10 characters or less, you need to decide if you mean glyphs or bytes.
- When storing strings in databases, you need to remember to allocate enough space for non-English character set strings.
- When transmitting or receiving information to/from a remote API, you need to agree on a length standard for encoding strings in transit.
- If you’re writing a program for a memory-constrained system, parsing a Unicode string character by character can consume a lot more memory than you expected. The string love: ❤ looks 7 characters long, but when the heart is written as U+2764 followed by the variation selector U+FE0F it holds 8 Unicode scalar values: it requires 12 bytes of storage in a String and 32 bytes of memory when processed as individual chars. This may seem like a small amount of memory, but if your reader enters the entirety of War and Peace into your program’s input box, per-character parsing may require more resources than you expected.
- When accessing individual characters in a string, it’s much safer to use chars as opposed to directly accessing the byte array. Characters are aware of Unicode boundaries — bytes are not. Printing the first 6 bytes of “Können” will only print “Könne”. Printing the first 6 characters will output the entire word.
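The last two points can be tried out with a few helpers. This is a sketch under stated assumptions: measure, first_chars, and first_bytes are hypothetical names, and the heart is spelled out as U+2764 followed by U+FE0F, which is the form that matches the 12-byte and 32-byte figures above.

```rust
use std::mem::size_of;

// (bytes in UTF-8, scalar values, bytes if every scalar value is stored as a char)
fn measure(s: &str) -> (usize, usize, usize) {
    let scalars = s.chars().count();
    (s.len(), scalars, scalars * size_of::<char>())
}

// Keep at most `n` characters; this never splits a multi-byte character.
fn first_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

// Keep the first `n` bytes. Returns None if `n` is out of range or falls in
// the middle of a character (plain `&s[..n]` would panic in those cases).
fn first_bytes(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let word = "Können";
    println!("{:?}", first_bytes(word, 6)); // one character short of the word
    println!("{:?}", first_bytes(word, 2)); // would cut the ö in half
    println!("{}", first_chars(word, 6));   // the whole word

    let heart = "love: \u{2764}\u{FE0F}";
    println!("{:?}", measure(heart));
}
```

The same distinction applies to validation: a "10 characters or fewer" rule checked with len() rejects names that a check based on chars().count() would accept.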
Further Reading
You’ll find this puzzle and more in Rust Brain Teasers from The Pragmatic Bookshelf. You can save 35 percent on the ebook versions of Herbert’s books with promo code rust_2022 now through April 30, 2022. Promo codes are not valid on prior purchases.