Nitty gritty details of char iterators

George Shuklin
Published in journey to rust
Sep 21, 2022

I found my own way to learn Rust. After the initial ‘impossible’ wall, there is another kind of problem: you get stuck in the middle of an expression and can’t make a value come out as a suitable type.

This is one such case, and I’ll rehearse what I’ve learned.

The task

There is an input string, and a function needs to produce an output string where every . (dot) is replaced with [.] (a dot in square brackets). The code must be an iterator chain over the input string with .collect() at the end producing the output string. No String.replace or other cheating!

Signature:

fn myfunc(input: String) -> String;

assert_eq!("abc".to_string(), myfunc("abc".to_string()));
assert_eq!("ab[.]c".to_string(), myfunc("ab.c".to_string()));
assert_eq!("ab[.][.]c".to_string(), myfunc("ab..c".to_string()));
assert_eq!("[.]".to_string(), myfunc(".".to_string()));

The skeleton of the function:

input.chars().map??(|c| {
    match c {
        '.' => {?something with "[.]"},
        c => {?just a value of c}
    }
}).collect()

We don’t know what the ‘map??’ function is, and {?} is a total mystery.

Impossible things

Both arms of a match must return the same type (or the bottom type, which is not suitable here). Here is the first problem: c is a char, and “something with [.]” cannot be a char; it’s a sequence of chars. There are no ‘generics’ here.

The problem with returning “[.]” is fundamental, because it’s a sequence no matter how you twist it. So, to make the second arm (with ‘c’) match in type, we need to make it a sequence too.

Convention: I got tired of typing [.] and c, so I’ll call them ‘dot-arm’ and ‘c-arm’. Bear with me.

Can we use std::iter::once? It creates an iterator that yields a single value… If we do so, we need to convert the dot-arm to some kind of iterator too.

Okay, let’s do it:

'.' => "[.]".chars(),
c => std::iter::once(c)

But of course it won’t work. The chars() method returns the type Chars, and once() returns Once. Even though both are iterators and implement the Iterator trait, they are clearly different types, and Rust does not allow different arms of a match to have different types. Even if we put some additional iterator adapter at the end (like .take(3)), the return types of the arms will still differ (Take<Chars> versus Take<Once<char>>).
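To see the mismatch with your own eyes, here is a minimal sketch that asks the compiler for the names of both arm types. The helper type_name_of and the function arm_type_names are hypothetical names made up for this illustration, and the exact strings returned by std::any::type_name are not guaranteed to be stable between compiler versions.

```rust
use std::any::type_name;

// Hypothetical helper (not from the article): returns the compiler's name
// for the static type of the value behind the reference.
fn type_name_of<T>(_value: &T) -> &'static str {
    type_name::<T>()
}

// Builds the two arms from the attempt above and reports their type names.
fn arm_type_names() -> (&'static str, &'static str) {
    let dot_arm = "[.]".chars(); // a std::str::Chars iterator
    let c_arm = std::iter::once('a'); // a std::iter::Once<char> iterator
    (type_name_of(&dot_arm), type_name_of(&c_arm))
}
```

Printing the pair returned by arm_type_names() shows two unrelated concrete types, which is exactly what the match refuses to unify.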

Oh, also, if we use iterators as arms, we need to use flat_map function instead of just map.

“Maybe possible” things

Technically, we can construct strings in both arms and return them. This will work, but we’ll do a crazy amount of allocations. Each character (c-arm) will be wrapped into a String, which means a new allocation for every character, and then everything gets collapsed back into a single string at the end. Nope, no micro-strings, please.
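For the record, a minimal sketch of that micro-strings variant could look like this (the name myfunc_strings is made up for this illustration). It assumes plain map is enough here, because String can be collected directly from an iterator of Strings.

```rust
// Working but wasteful: every input character becomes its own heap-allocated
// String before everything is concatenated by collect().
fn myfunc_strings(input: String) -> String {
    input
        .chars()
        .map(|c| match c {
            '.' => "[.]".to_string(), // dot-arm: a three-character String
            c => c.to_string(),       // c-arm: a one-character String
        })
        .collect()
}
```

Both arms now have the type String, so the match is happy; the price is one allocation per input character.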

How about iterators?

input.chars().flat_map(|c| {
    match c {
        '.' => ['[', '.', ']'].iter(),
        c => [c].iter()
    }
}).collect()

Sounds reasonable, right? Each arm builds an array and returns an iterator over it. The problem is that the array built from the character c is a temporary that is not owned by anything, so it is dropped at the end of the arm, and we can’t return an iterator that borrows from it. Rust won’t allow that.

What if we return the arrays ‘as is’? (Forget about the collect thing.) We can’t, because [char; 1] and [char; 3] are different types.

Thinking harder

We want to return one or more characters. We can shovel them into a Vec, but a Vec is almost as inefficient as a String for that. Tons of allocations.

But maybe we can use a slice. How do we make a slice from a str? … A slice of what?

This is the point to stop and think. A str is not an array of chars. What type should we use? It looks like Chars, because we iterate over chars (the .chars() iterator!). Wait. Why do we use chars? Can we iterate over a str by taking a str with a single character at a time?

I looked around, and there is no dedicated function for such an iterator. I asked on Reddit and got the answer:
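A minimal sketch of the Vec variant, for comparison (the name myfunc_vec is made up for this illustration). It should pass the asserts from the task, but it builds a fresh Vec for every input character.

```rust
// Both arms return Vec<char>, so the types match; flat_map then flattens
// the per-character vectors into one stream of chars for collect().
fn myfunc_vec(input: String) -> String {
    input
        .chars()
        .flat_map(|c| match c {
            '.' => vec!['[', '.', ']'], // dot-arm: three chars
            c => vec![c],               // c-arm: one char, still a heap allocation
        })
        .collect()
}
```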

str.split_inclusive(|_| true)

which yields slices of the str, where every slice holds exactly one character. It’s an unexpected twist, but it works.
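Here is a small sketch of what that iterator produces (single_char_slices is a hypothetical helper name, not something from the standard library). The closure receives each char and always answers "yes, split here", and split_inclusive keeps the matched character at the end of each piece.

```rust
// Splits a str into one-character &str slices that borrow from the input,
// so nothing is allocated per character.
fn single_char_slices(input: &str) -> Vec<&str> {
    input.split_inclusive(|_| true).collect()
}
```

For "ab.c" this gives the pieces "a", "b", "." and "c". Since the pattern works on chars rather than bytes, multi-byte characters such as "é" stay in one piece.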

With this iterator we can write the first working solution:

fn myfunc(input: String) -> String {
    input
        .split_inclusive(|_| true)
        .map(|c| match c {
            "." => "[.]",
            c => c,
        })
        .collect()
}

We are yielding &str’s of different lengths, and they are collected into a String.

Other way

People on the Discourse pointed to other possible solutions, so let’s look at them:

pub fn myfunc(input: &str) -> String {
    let bytes: Vec<u8> = input
        .as_bytes()
        .into_iter()
        .flat_map(|c| match c {
            b'.' => b"[.]",
            c => std::slice::from_ref(c),
        })
        .copied()
        .collect();
    String::from_utf8(bytes).unwrap()
}

The string is deconstructed into bytes and iterated over. Each byte becomes either the byte string b"[.]" or the result of a function I had never heard about: std::slice::from_ref

Which I believe is very useful, as it allows converting a reference to any value into a one-element slice over it.
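A minimal sketch of std::slice::from_ref on its own, pulled out of the iterator chain (escape_byte is a name made up for this illustration). In this sketch the dot-arm is sliced explicitly with [..] so that both arms are plainly &[u8].

```rust
// Maps one byte to the slice of bytes that should replace it.
// No allocation happens in either arm: the dot-arm points at a static
// byte string, and the other arm reinterprets the borrowed byte as a
// one-element slice.
fn escape_byte(byte: &u8) -> &[u8] {
    match byte {
        b'.' => &b"[.]"[..],
        other => std::slice::from_ref(other),
    }
}
```

The returned slice borrows from the input reference, which is why this works where the temporary [c] array did not.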

Additional notes:

  1. I need to pay more attention to the slice type. I think it’s more important than it looks.
  2. I missed the b-notation. It allows constructing either a u8 value (x = b'.') or a reference to an array of u8 (b"hello" is the same as &[b'h', b'e', b'l', b'l', b'o']).
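A tiny sketch of the b-notation from the second note (byte_literal_demo is a hypothetical name used only here):

```rust
// b'.' is a single u8; b"hello" is a reference to a fixed-size byte array
// with static lifetime.
fn byte_literal_demo() -> (u8, &'static [u8; 5]) {
    let dot: u8 = b'.';
    let hello: &'static [u8; 5] = b"hello";
    (dot, hello)
}
```

The dot comes out as 46, its ASCII code, and the byte string compares equal to the array of five separate byte literals.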

There was also a terrific note about the unicode-segmentation library, which pointed to this article: https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/… I found that Unicode is even more twisted than I thought… (0xFDFD, omg)…

Conclusion

  1. Use String.replace and do not cause stirring.

split_inclusive is something I wasn’t expecting to use, but it is very interesting, and it can be used for many kinds of lossless splitting. std::slice::from_ref is a really convenient way to shovel things into iterators. The b-notation is great!

More importantly, I got rather solid ground under ‘no type magic in match’. You need both arms to have the same type, and there is no way around it. Also, collect::<String>() with flat_map over slices is an interesting way to process strings…

I work at Servers.com, most of my stories are about Ansible, Ceph, Python, Openstack and Linux. My hobby is Rust.