Reference Iterators in Rust

One thing you’ve likely noticed when writing Rust is how its ownership system makes heap allocations fairly explicit. One way to avoid these when iterating is to yield references to heap-allocated objects, rather than copies. While simple in theory, this concept can get complex rather quickly in practice.

To illustrate some of the challenges involved, we’ll start with a simple example. Let’s build an iterator that accepts a string and yields its non-whitespace tokens:

input: "some string data"
output: ["some", "string", "data"]

Since the tokens are just slices of the original string, we want to avoid unnecessary allocations by yielding &str items. Here’s the type definition:

struct TokenIterator {
data: String,
token_start: usize,
token_end: usize,
}

We’ll need to implement the Iterator trait for this type, too:

impl Iterator for TokenIterator {
type Item = &str;

fn next(&mut self) -> Option<Self::Item> {
...
}
}

I’ve omitted the implementation details, since they’re just a distraction for now (you can find those here if you’re interested). Believe it or not, this won’t compile. If you tried, you’d see an error like this:

error[E0106]: missing lifetime specifier
--> src/main.rs:18:17
|
18 | type Item = &str;
| ^ expected lifetime parameter

We’re telling Rust that this iterator is going to yield references to string slices, but we’ve not specified where the original data will live. How long can the yielded&str stick around for? We need to associate these references to a source so that Rust can ensure it will be around at least as long as the reference.

Given that, let’s add a lifetime parameter to our TokenIterator type and use that for our references:

struct TokenIterator<'a> {
data: String,
token_start: usize,
token_end: usize,
}
impl<'a> Iterator for TokenIterator<'a> {
type Item = &'a str;

...
}

Seems logical, right? This won’t compile, either, but for a different reason:

error[E0392]: parameter `'a` is never used
--> src/main.rs:1:22
|
1 | struct TokenIterator<'a> {
| ^^ unused type parameter
|

We’re defining a lifetime for the TokenIterator type, but we’re not using it in the struct (the reference to it in the iterator impl block isn’t enough). There are two ways to solve this problem:

  1. We can leverage Rust’s PhantomData type to add a zero-sized field to the struct that uses the lifetime, strictly to satisfy the compiler. Unfortunately, its type signature is generic over a type, and our TokenIterator isn’t; it’s not a great fit here.
  2. We can introduce a separate type to hold the data, and borrow that data for the iterator. This is the generally-accepted best practice in this scenario for reasons we’ll soon see. Let’s go that route:
struct TokenSet {
data: String
}
impl TokenSet {
pub fn iter<'a>(&'a self) -> TokenIterator<'a> {
TokenIterator::new(&self.data)
}
}
struct TokenIterator<'a> {
data: &'a str,
token_start: usize,
token_end: usize,
}

Now that we’re using the 'a lifetime, the TokenIterator type definition is valid and the program compiles. There are still a few drawbacks:

  1. The intermediary type is exposed to the consumer. If you’re returning this type as part of an API, users will need to know to call .iter() on the returned type to actually get the iterator. You can alleviate some of this pain by implementing the IntoIterator trait on the intermediary type, which will allow for loops to work seamlessly, at least.
  2. The yielded reference cannot be mutable, at least not without resorting to an unsafe block. Adding mutability to the reference prevents the example from compiling:
error[E0495]: cannot infer an appropriate lifetime for lifetime parameter in function call due to conflicting requirements
--> src/main.rs:57:23
|
57 | Some(&mut self.data[self.token_start..self.token_end])
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
help: consider using an explicit lifetime parameter as shown:
fn next(&'a mut self) -> Option<Self::Item>
--> src/main.rs:34:5
|
34 | fn next(&mut self) -> Option<Self::Item> {
| ^

Adding the recommended explicit lifetime doesn’t help, as it violates the Iterator trait’s next signature. This is intentional: the iterator design in Rust prevents linking the lifetime of yielded items to the iterator itself. In simpler terms, the iterator shouldn’t own the data it’s yielding. It’s meant to be a transient type that sits outside of the collection, providing a convenient means of accessing the data.

If iterators were able to own the data they provide references to, that data would go out of scope with the iterator. We’d have to explicitly keep the iterator around, even after we were finished with it. While entirely possible, that sort of design isn’t very idiomatic. The iterator in many cases is intended to be a transient type instantiated as part of a chained method call (see how iter() goes out of scope in the next example).

As for the lack of mutability, it’s helpful to think about a real world example. A notable feature of iterators is that they can be collected into a Vec:

let token_set = TokenSet::new(String::from("data"));
let tokens: Vec<&str> = token_set.iter().collect();

For this to work, all yielded items must be accessible at the same time. You can’t assume they’ll only be in scope for the current iteration; that Vec<&str> needs to hold everything at once. The only way to accomplish that without introducing multiple mutable references to the same data is to ensure the data references don’t overlap. Rust’s ownership system isn’t quite granular enough to express that, so that implementation would need to live in an unsafe block. You can see examples of this in the Rust standard library (look for collections that implement iter_mut).

Hopefully this sheds some light on reference-yielding iterators, and incidentally, Rust’s ownership system. Studying their design can provide a more complete understanding and appreciation of its underlying principles.