Enumerate All the Things

Philip Fulgham
CarMax Engineering Blog
Jul 31, 2024

I’ve been writing .NET applications in C# for well over a decade, during which the language has evolved and advanced considerably. Recently, I’ve had cause to deep-dive into a fundamental API that’s been around for even longer: the IEnumerable<T> interface, as well as its much-newer cousin, the IAsyncEnumerable<T> interface. These interfaces are commonly used as one-size-fits-all abstractions even in very simple code, but they have some impressive capabilities you can unlock when you understand the principles they follow. In this post, I’m going to walk through several different ways you can use these interfaces to keep your code flexible and lightweight.

IEnumerable<T> for Collections

I don’t have data on this, but if you asked me to guess the most common use of IEnumerable<T>, I’d say “an abstraction of a collection.” IEnumerable<T> lacks certain common APIs you’d often want in a collection, most notably a stored count of the items. But it’s very flexible when all you need is to iterate over a sequence of items.

Thus, it’s very useful for the API design practice of not requiring an input to be a more derived type than you actually need. An IEnumerable<T> parameter in a public method can accept a T[], a List<T>, a HashSet<T>, an ICollection<T>, or many other collection types in .NET (to say nothing of custom types), without being unnecessarily opinionated about the API consumer’s collection handling. It doesn’t even technically require the parameter to be a stored collection at all, as we’ll soon see.
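As a minimal sketch of that idea (the GetTotalPrice method and its values are hypothetical, just for illustration), the same IEnumerable<T> parameter happily accepts an array, a list, or a set:

public decimal GetTotalPrice(IEnumerable<decimal> prices)
{
    decimal total = 0;
    foreach (decimal price in prices)
    {
        total += price;
    }
    return total;
}

// Any of these calls compile against the same method:
decimal fromArray = GetTotalPrice(new[] { 19.99m, 4.50m });
decimal fromList = GetTotalPrice(new List<decimal> { 19.99m, 4.50m });
decimal fromSet = GetTotalPrice(new HashSet<decimal> { 19.99m, 4.50m });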

IEnumerable<T> for LINQ

You’re probably aware that the System.Linq namespace includes a ton of extension methods for IEnumerable<T> that allow you to filter, project, aggregate, sort, and otherwise manipulate the enumerated results without changing the original enumerable. Sometimes the behavior of these methods is referred to as “deferred execution,” or “lazy,” because the operation isn’t actually performed until you enumerate the enumerable.

Terms like “deferred” or “lazy” sometimes mean that code won’t be executed until you need it, but then once it’s been executed, the result will be persisted in some fashion to optimize subsequent access. For example, a singleton may not be constructed until the first time it’s needed, but then it may remain available for the lifetime of the application without being constructed again. This is a key principle for setting up dependency injection in ASP.NET Core. However, it isn’t an accurate description of how the LINQ extension methods work.
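To illustrate that “run once, then persist” flavor of laziness, here’s a minimal sketch using Lazy<T> (the HttpClient factory is just a stand-in for any expensive construction):

Lazy<HttpClient> lazyClient = new(() =>
{
    Console.WriteLine("Constructing the client...");
    return new HttpClient();
});

HttpClient first = lazyClient.Value;  // factory runs; message prints once
HttpClient second = lazyClient.Value; // stored instance is reused; nothing prints

The LINQ extension methods do not behave this way, as the next example shows.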

When you use a LINQ extension method like IEnumerable<T>.Select(), the operation is performed on the items at the time they are enumerated. But, actually, it’s performed on the items every time they are enumerated. This can be proven very easily:

IEnumerable<string> stuffs = ["foo", "bar", "baz"];

IEnumerable<string> uppercaseStuffs = stuffs.Select(s =>
{
    Console.WriteLine("Stop shouting!");
    return s.ToUpper();
});

uppercaseStuffs.ToList();
uppercaseStuffs.ToList();

Here, we’re using IEnumerable<T>.Select() (a projection method) to uppercase our strings, but we’ve also snuck in a side effect in the form of a console log (this isn’t production code, I’m just illustrating a point here). When we run this code, that console log prints not three times, but six times: three for the first call of IEnumerable<T>.ToList() (which enumerates the enumerable), and three more for the second call (which enumerates it again). So, the results of that projection aren’t actually being stored anywhere. Every time you enumerate, the projection will be performed again on each item that you enumerate.

If the operation is trivial performance-wise, this might not matter to you. But imagine that your LINQ lambda (or named method you’re passing in) is doing something expensive. Or imagine that you have a very large number of items with less-expensive but still non-trivial logic, and your code is enumerating over and over again as a result of multiple foreach loops, IEnumerable<T>.FirstOrDefault() calls, and/or other operations. In such situations, we should consider first enumerating into a concrete type like List<T> so that the items aren’t being recalculated every time we iterate over the collection. Those types generally implement IEnumerable<T>, so it’s still possible to work with the results using that interface; it just changes the underlying structure to a stored collection instead of on-demand calculations for each item.
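Continuing the earlier example, a sketch of that approach might look like this: materialize once, then reuse the stored list as many times as you like.

// Run the projection (and its console log) exactly once.
List<string> storedStuffs = uppercaseStuffs.ToList();

// These enumerations reuse the stored items; the lambda never runs again.
foreach (string s in storedStuffs) { /* ... */ }
bool anyLong = storedStuffs.Any(s => s.Length > 10);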

IEnumerable<T> for Methods

The LINQ extension methods aren’t the only way to enumerate results in a “lazy” fashion; we can also write our own methods that do the same. If you’ve read enough C# code, you may have come across methods that assemble an IEnumerable<T> to return using logic similar to this:

public IEnumerable<string> GetThingNames(
    IEnumerable<string> foodThings,
    IEnumerable<string> widgetThings,
    IEnumerable<string> thingamabobThings)
{
    List<string> thingNames = [];
    foreach (string thingName in foodThings)
    {
        thingNames.Add($"{thingName} Food Thing");
    }
    foreach (string thingName in widgetThings)
    {
        thingNames.Add($"{thingName} Widget Thing");
    }
    foreach (string thingName in thingamabobThings)
    {
        thingNames.Add($"{thingName} Thingamabob Thing");
    }
    return thingNames;
}

Of course, there are ways to refactor this code using List<T>.AddRange(), IEnumerable<T>.Select(), and/or the .. spread operator. But they’ll all work essentially the same way, iterating over the three inputs, computing new results, adding those results to a List<string>, and then returning that list.

For some use-cases, this is fine. But consider a scenario where the main use-case for this method is to find, say, whether any item in the resulting set is more than 50 characters long. That doesn’t necessarily require enumerating all of the items; once you find one that meets the condition, you can stop enumerating. Imagine the set is very large, or the logic required for each item is more than the simple string interpolation in the example above. Enumerating the entire set of items isn’t a very efficient approach to this problem if the first or second item already meets your criteria. In fact, methods like IEnumerable<T>.Any() are optimized specifically to avoid that behavior when possible, so canceling that out by pre-enumerating the entire set isn’t ideal.

IEnumerable<T> provides a way to write a method that will enumerate results individually just like those LINQ extension methods. In this case, it also happens to avoid the allocation for a List<string> that we don’t really need. The key is the yield return statement:

public IEnumerable<string> GetThingNames(
    IEnumerable<string> foodThings,
    IEnumerable<string> widgetThings,
    IEnumerable<string> thingamabobThings)
{
    foreach (string thingName in foodThings)
    {
        yield return $"{thingName} Food Thing";
    }
    foreach (string thingName in widgetThings)
    {
        yield return $"{thingName} Widget Thing";
    }
    foreach (string thingName in thingamabobThings)
    {
        yield return $"{thingName} Thingamabob Thing";
    }
}

This turns our method into an iterator method. It’s important to understand that yield return effectively pauses the execution of the method (in a procedural sense, not a threading sense) and hands control back to the caller; it does not run the entire block of code, calculate all the items, and then return them as a stored collection the way the original method did. So if the caller stops iterating for any reason, whether a break statement or an optimized LINQ extension method like IEnumerable<T>.Any() that stops enumerating once it finds what it’s looking for, this method will never calculate the remaining items.
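For example, a caller that only needs to know whether any name is more than 50 characters long (the scenario from above) stops the iterator at the first match; this sketch assumes foodThings, widgetThings, and thingamabobThings are whatever collections you already have on hand:

bool hasLongName = GetThingNames(foodThings, widgetThings, thingamabobThings)
    .Any(thingName => thingName.Length > 50);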

This is obviously a highly simplified example, but it shouldn’t be too hard to imagine more complex scenarios where this could substantially improve performance versus the approach of eagerly calculating all possible items for an enumerable. Of course, as I suggested above, in some cases the eager approach may be better; the concepts in the previous section still apply, and which approach is better will vary based on how you intend to use the results of the method.

The “lazy” approach favors situations where being able to enumerate only some leading portion of the items will save effort. The “eager” approach favors situations where you want to iterate over the items multiple times. You can also use hybrid approaches, such as writing an iterator method to allow for an initial, efficient partial enumeration, but then calling a method like IEnumerable<T>.ToList() on those partially enumerated results to store them for further logic that will involve multiple passes over the collection.
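As a rough sketch of that hybrid approach (again reusing the hypothetical inputs from earlier): let the iterator do only the work needed for the partial enumeration, then store just those results for the multi-pass logic.

// Lazy, partial enumeration: only the first ten names are ever computed.
List<string> firstTenNames = GetThingNames(foodThings, widgetThings, thingamabobThings)
    .Take(10)
    .ToList();

// Multiple passes over the stored subset don't re-run the iterator.
bool anyLong = firstTenNames.Any(thingName => thingName.Length > 50);
int totalLength = firstTenNames.Sum(thingName => thingName.Length);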

There are also scenarios where the performance impact is negligible either way, but using yield return is simpler and cleaner than allocating a List<T> and calling List<T>.Add() a bunch of times.

One final note for this section: yield return also works in property get accessor bodies when the property type is IEnumerable<T>.
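For example, a minimal sketch of such a property:

public IEnumerable<string> SupportedThingKinds
{
    get
    {
        yield return "Food";
        yield return "Widget";
        yield return "Thingamabob";
    }
}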

IAsyncEnumerable<T> for Async Methods

To be honest, when I first encountered IAsyncEnumerable<T> popping up in Microsoft SDKs, it annoyed me. I used the ToListAsync method from System.Linq.Async a lot because I didn’t want to deal with this funky new type. But I was only thinking of it as an abstraction of a collection; I wasn’t yet thinking of it in terms of the other concepts we’ve covered. When you realize that IAsyncEnumerable<T> effectively lets you do all of those same things but with async operations involved, it suddenly becomes awesomely powerful, particularly for batch processing with async operations at both ends.

Let’s revisit the example from before, but this time assume that we need to call an async method to get our final strings, perhaps with some dependency call under the hood:

public async IAsyncEnumerable<string> GetThingNamesAsync(
    IEnumerable<string> foodThings,
    IEnumerable<string> widgetThings,
    IEnumerable<string> thingamabobThings)
{
    foreach (string thingName in foodThings)
    {
        yield return await GetAsync(thingName, "Food");
    }
    foreach (string thingName in widgetThings)
    {
        yield return await GetAsync(thingName, "Widget");
    }
    foreach (string thingName in thingamabobThings)
    {
        yield return await GetAsync(thingName, "Thingamabob");
    }
}

(Note: You don’t necessarily have to yield return and await on the same line; we’re just doing so here for simplicity.)

Even without changing our assumptions about the use-cases for the method, it should now be clear how much we don’t want to process every single item if we don’t actually need them all. All sorts of expensive operations could be inside those GetAsync calls. If we were just using Task<IEnumerable<string>> as the return type for this method, we wouldn’t be able to use yield return, and we’d always have to build each and every possible value, no matter what.

But let’s consider a different use-case. Let’s say that the results of this method need to be fed into another dependency call: a remote API that requires an HTTP POST for each item we’re adding. With the Task<IEnumerable<string>> approach, we’d build all of the items, accumulate them in memory, and then send them all to the remote API either sequentially or in parallel. Depending on the specifics, that might be fine, or it might create problems. Maybe making the remote API wait for the first item until you’ve spent 30 seconds building all of the items isn’t desirable for this application. It might also rule out partial successes you’d prefer to accept: if the 24th GetAsync call throws an exception, none of the preceding 23 items will reach the remote API without additional error-handling code. And if the items are really large and/or numerous, it might impose an unnecessarily high memory requirement on the application.

But with the IAsyncEnumerable<string> return type, we can call the method and send its results to our remote API like this:

public async Task ProcessThingsAsync()
{
    await foreach (string thingName in GetThingNamesAsync(/* ... */))
    {
        await PostAsync(thingName);
    }
}

This method will process each “thing” end-to-end in sequence, instead of having to do all the “get” operations in one batch followed by all the “post” operations in another batch. With this approach, you can create stacks of methods that stream data sequentially as fast as it can be processed, always keeping it moving, without having to stop and store it for each iteration along the way.

Do note that, once again, the same “deferred execution” rules apply. If you’re going to be enumerating multiple times, there’s a good chance you don’t want to be making all those async calls every time you do. You probably want to use some combination of stored collections, async enumerables, and regular enumerables to avoid repeating expensive operations.
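For instance, if you know you’ll need the names more than once, you might materialize them a single time with ToListAsync() from System.Linq.Async (a sketch, assuming that package is referenced):

// Every GetAsync call happens once, here, and the results are stored.
List<string> thingNames = await GetThingNamesAsync(/* ... */).ToListAsync();

// Later passes work against the stored list, with no further dependency calls.
bool anyLong = thingNames.Any(thingName => thingName.Length > 50);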

Parallelized Enumeration

You may already have noticed that, in some situations, the last solution above won’t scale well. When working with a very large set of items, you might not want to load the whole thing into memory at once, but processing items only one at a time might be agonizingly slow. This is where one of my favorite modern .NET APIs comes in, introduced a few years ago in .NET 6: Parallel.ForEachAsync().

This API makes it easy to parallelize an iteration over an enumerable. If you already have your set of thing names for your remote API as a regular IEnumerable<T>, you can handle them a few at a time:

public async Task ProcessThingsAsync()
{
    IEnumerable<string> thingNames = GetThingNames(/* ... */);
    await Parallel.ForEachAsync(thingNames,
        new ParallelOptions() { MaxDegreeOfParallelism = 4 },
        async (thingName, cancellationToken) =>
        {
            await PostAsync(thingName);
        });
}

The API also has an overload that accepts IAsyncEnumerable<T> instead of IEnumerable<T>, and that’s where it gets even more useful. That overload will drain multiple items from the async enumerable at a time, up to the degree you specify. It effectively allows the async enumerable to proceed (to a point) with enumerating more items, even if some previously yielded items are still being processed by the caller. Of course, it works this way with regular enumerables too, but the advantage may be less noticeable unless the synchronous operations for each item are really expensive and have variable performance; it really shines with async enumerables.
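Putting that together with the earlier iterator method, a sketch of the async-enumerable overload in use looks almost identical; only the source changes:

public async Task ProcessThingsAsync()
{
    IAsyncEnumerable<string> thingNames = GetThingNamesAsync(/* ... */);
    await Parallel.ForEachAsync(thingNames,
        new ParallelOptions() { MaxDegreeOfParallelism = 4 },
        async (thingName, cancellationToken) =>
        {
            await PostAsync(thingName);
        });
}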

Parallel.ForEachAsync() is not a “batch” or “chunk” API. It doesn’t split your set into groups of length N = MaxDegreeOfParallelism. It behaves more like a “throttling” API. If you allow a maximum of four operations at a time, and one of the first four operations takes way longer than the others, a chunking API would wait until that slowest operation has finished before starting the next chunk of four. Parallel.ForEachAsync() will start new operations to replace the three that have finished, even if the fourth is still in progress. It doesn’t “pause” processing new items every time one hangs for a few seconds the way a chunking API would.

Conclusion

IEnumerable<T> is a very widely used .NET API that is much more powerful than its most simple use-cases reveal. IAsyncEnumerable<T> is rising in popularity and provides a way to leverage the principles of enumerables with async operations, a valuable proposition in an ecosystem increasingly filled with async code. Built-in APIs like LINQ extension methods and Parallel.ForEachAsync(), along with libraries like System.Linq.Async, offer tools that will let you optimize how you use enumerables, provided you understand the rules behind how they work. A key factor in finding the right approach is determining whether your use-case’s performance and logic will benefit more from optimizing for partial iterations (lazy) or optimizing for multiple iterations (eager).

Philip Fulgham
CarMax Engineering Blog

Philip is a Principal Software Engineer at CarMax, working on backend search capabilities with .NET and Azure.