Iterators, Generators, and Uncertainty

Andrei Lapets
Python Supply
Feb 18, 2020

Suppose you are working on a Python API that provides access to a real-time data stream (perhaps from an array of sensors or from a web service that handles user requests). You would like to deliver to the consumers of your API a simple but flexible abstraction that allows them to operate on new items from the stream when they choose to do so. Furthermore, you would like the API to allow users to do the following three things:

  • specify fall-back or default data streams (e.g., if their first choice of stream is exhausted);
  • interleave items coming from multiple streams (presenting them as a single, new stream); and
  • process the items from a stream in parallel using multiprocessing.

What abstraction should you use? How much of it must be custom-built and how much can be done using native Python features? When working with data streams, state spaces, and other abstractions that represent large or unbounded structures, it can be tempting to custom-build solutions that may become increasingly complex and difficult to maintain. Understanding a range of features that are already available in a language or its built-in libraries can help mitigate this while saving significant time and effort (both your own and that of others who build upon your work).

Iterators and generators are powerful tools in the Python language that have compelling applications in a number of contexts. This article reviews how they are defined, how they can be used, how they are related to one another, and how they can help you work in an elegant and flexible way with data structures and data streams of an unknown or infinite size.

Iterables, Iterators, and Generators

When discussing Python, the terms iterable, iterator, and generator often appear in similar contexts or are even used interchangeably. These language features also solve similar problems. This can lead to some confusion, but there is a reason that on occasion these are conflated.

One way to understand the term iterator is that it refers to any Python data structure that has an interface that supports iteration over objects one at a time. A data structure is iterable if there is at least one way to construct an iterator that traverses it in some way. On the other hand, a generator is a particular kind of data structure, defined in a specific way within Python, that maintains an internal state and constructs or retrieves zero or more objects or values one at a time.

Thus, a generator by virtue of its characteristics can have an interface that allows it to qualify as an iterator, which consequently also makes it an iterable. In fact, all generators are iterators and iterable. However, not all iterators or iterable data structures are generators because there exist other approaches for implementing a Python object possessing the kind of interface an iterator or iterable is expected to have.
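These relationships can be checked directly using the abstract base classes in collections.abc. The sketch below uses a minimal, illustrative generator function (the name ones is not from any library):

```python
import collections.abc

# A minimal, illustrative generator function.
def ones():
    while True:
        yield 1

g = ones()

# Every generator is an iterator, and every iterator is an iterable.
isinstance(g, collections.abc.Iterator)  # True
isinstance(g, collections.abc.Iterable)  # True
```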

Iterators

If you want to implement an iterator data structure directly, you need to include a special method __next__ in the class definition, which will be invoked whenever the built-in function next is applied to an instance of that data structure. The skips data structure below can emit every other positive integer via its definition of a __next__ method.

class skips:
    def __init__(self):
        self.integer = 0

    def __next__(self):
        self.integer += 2
        return self.integer

Now it is possible to use the built-in next function to retrieve each item one at a time from an instance of the skips data structure.

>>> ns = skips()
>>> [next(ns), next(ns), next(ns)]
[2, 4, 6]

The number of items over which the data structure will iterate can be limited by raising the StopIteration exception when more items cannot (or should not) be returned.

class skips:
    def __init__(self, start, end):
        self.integer = start - 2
        self.end = end

    def __next__(self):
        self.integer += 2
        if self.integer > self.end:
            raise StopIteration
        return self.integer

It is then the responsibility of any code that uses an instance of this iterator to catch this exception and handle it appropriately. It is worth acknowledging that this is a somewhat unusual use of a language feature normally associated with catching errors (because an iterator being exhausted is not always an error condition).

>>> ns = skips(0, 10)
>>> while True:
...     try:
...         print(next(ns))
...     except StopIteration:
...         break
...
0
2
4
6
8
10

Iterables

In Python, there is a distinction between an iterator and an iterable data structure. This distinction is useful to maintain for a variety of reasons, including the ones below.

  • You may not want to clutter a data structure (as it may represent a spreadsheet, a database table, a large graph, and so on) with the state necessary to keep track of an iteration process.
  • You may want the data structure to support multiple iterators, either semantically (e.g., iteration over rows versus over columns) or in terms of implementation (e.g., breadth-first search versus depth-first search).
  • You may want to make it easy to reset iteration without fiddling with the internal state of a data structure instance (i.e., resetting a traversal of the data structure instance could involve simply creating a fresh iterator).

As an example, consider a data structure interval that represents all positive integers in some range. Users might be allowed to obtain two different kinds of iterators for an instance of this data structure: those that iterate over only the even integers and those that iterate over only the odd integers.

class interval:
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def evens(self):
        return skips(
            self.lower + (0 if (self.lower % 2) == 0 else 1),
            self.upper
        )

    def odds(self):
        return skips(
            self.lower + (0 if (self.lower % 2) == 1 else 1),
            self.upper
        )

The example below illustrates how an iterator returned by one of the methods in the definition of interval can be used.

>>> ns = interval(0, 10).odds()
>>> while True: # Keep iterating and printing until exhaustion.
...     try:
...         print(next(ns))
...     except StopIteration:
...         break
...
1
3
5
7
9

So far in this article, the distinction between iterators and iterable data structures has been explicit for clarity. However, the convention that is supported (and sometimes expected) throughout Python is that an iterable data structure has a single iterator that can be used to iterate over it. This iterator is returned by a special method __iter__ that is included in the class definition. In the example below, the interval class supports the creation of an iterator that visits every integer in the interval.

class every:
    def __init__(self, start, end):
        self.integer = start - 1
        self.end = end

    def __next__(self):
        self.integer += 1
        if self.integer > self.end:
            raise StopIteration
        return self.integer

class interval:
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def __iter__(self):
        return every(self.lower, self.upper)

Python’s built-in iter function can be used to invoke __iter__ for an instance of this data structure.

>>> ns = iter(interval(1, 3))
>>> while True: # Keep iterating and printing until exhaustion.
...     try:
...         print(next(ns))
...     except StopIteration:
...         break
...
1
2
3

Including a definition for an __iter__ method also makes it possible to use many of Python's built-in functions and language constructs that expect an iterable data structure. This includes functions such as list and set, which use iter to obtain an iterator for their inputs.

>>> list(interval(0, 10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> set(interval(0, 10))
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

This also includes comprehensions and for loops.

>>> for n in interval(1, 4):
...     print([k for k in interval(1, n)])
...
[1]
[1, 2]
[1, 2, 3]
[1, 2, 3, 4]

There is nothing stopping you from making the iterator itself an iterable by having its __iter__ method return the instance itself, as in the variant below.

class every:
    def __init__(self, start, end):
        self.integer = start - 1
        self.end = end

    def __next__(self):
        self.integer += 1
        if self.integer > self.end:
            raise StopIteration
        return self.integer

    def __iter__(self):
        return self

This approach ensures that there is no ambiguity (from a programmer’s perspective) about what will happen when built-in functions such as list are applied to an instance of the data structure.

>>> list(every(0, 10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

This practice is common and is the cause of some of the confusion and conflation that occurs between iterators and iterables. In addition to the potential for confusion, users of such a data structure must be careful to use the iterator as an iterable only once (or, alternatively, the object must reset its internal state every time __iter__ is invoked).

>>> ns = every(0, 10)
>>> list(ns), list(ns) # Only returns contents the first time.
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [])

Nevertheless, this can also be a useful practice. Going back to the example with evens and odds, ensuring the iterators returned by these methods are also iterable means they can be fed directly into contexts that expect an iterable.

class skips:
    def __init__(self, start, end):
        self.integer = start - 2
        self.end = end

    def __next__(self):
        self.integer += 2
        if self.integer > self.end:
            raise StopIteration
        return self.integer

    def __iter__(self):
        return self

class interval:
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def evens(self):
        return skips(
            self.lower + (0 if (self.lower % 2) == 0 else 1),
            self.upper
        )

    def odds(self):
        return skips(
            self.lower + (0 if (self.lower % 2) == 1 else 1),
            self.upper
        )

The example below illustrates how this kind of interface can be used.

>>> i = interval(0, 10)
>>> list(i.evens()), set(i.odds())
([0, 2, 4, 6, 8, 10], {1, 3, 5, 7, 9})

Generators

Generators are data structures defined either using a function that contains the yield statement or using comprehension-like notation (also known as a generator expression). The example below defines a generator skips using both approaches; in both cases, the range of integers excludes end.

def skips(start, end):
    integer = start
    while integer < end:
        yield integer
        integer += 2

def skips(start, end):
    return (
        integer
        for integer in range(start, end)
        if (integer - start) % 2 == 0
    )

When a generator function is invoked, it returns an iterator (more precisely called a generator iterator). For example, as with any iterator, next can be applied directly to instances of this data structure.

>>> ns = skips(0, 10)
>>> next(ns), next(ns), next(ns)
(0, 2, 4)

As with any iterator, exhaustion can be detected by catching the StopIteration exception.

>>> ns = skips(0, 2)
>>> try:
...     next(ns), next(ns), next(ns)
... except StopIteration:
...     print("Exhausted generator iterator.")
...
Exhausted generator iterator.

Finally, an instance of the data structure can be used in any context that expects an iterable.

>>> list(skips(0, 10))
[0, 2, 4, 6, 8]

It is possible to confirm that the result of evaluating skips is indeed a generator by checking its type.

>>> import types
>>> isinstance(skips(0, 10), types.GeneratorType)
True

It is also possible to inspect its type to confirm that skips indeed evaluates to an iterator.

>>> import collections.abc
>>> isinstance(skips(0, 10), collections.abc.Iterator)
True

Data Structures of Infinite or Unknown Size

Among the use cases that demonstrate how iterators/generators serve as a powerful language feature are scenarios involving data structures whose size is unknown or unbounded/infinite (such as streams, very large files, databases, and so on). You have already seen that you can define an iterable that can produce new objects or values indefinitely, so iterables are an effective way to represent and encapsulate such structures.
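As a minimal sketch of such a structure, a generator with a non-terminating loop can represent an unbounded stream; a consumer can still draw only as many items as it needs (the name stream below is illustrative, not part of any library):

```python
from itertools import islice

def stream():
    # A hypothetical unbounded source; a real one might poll a
    # sensor array or a web service instead of counting.
    n = 0
    while True:
        yield n
        n += 1

# The consumer decides how many items to draw from the stream.
list(islice(stream(), 5))  # [0, 1, 2, 3, 4]
```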

Returning to the example described at the beginning of the article, recall that you are faced with creating a Python API for working with data streams that might (or might not) run out of items that can be drawn from them. The advantages of leveraging iterables and generators should be evident at this point, so suppose you move ahead with this option and implement an iterable to represent a data stream. How can you address the three specific requirements (i.e., default/fall-back streams, interleaving, and splitting for parallelism) using these features?

To satisfy the first requirement, you must allow a user to exhaust one iterable and then switch to another one. This is straightforward to do by constructing a generator that concatenates two iterables.

def concatenate(xs, ys):
    for x in xs:
        yield x
    for y in ys:
        yield y

Concatenating two instances of an iterable data structure is now straightforward.

>>> list(concatenate(skips(0,5), skips(6,11)))
[0, 2, 4, 6, 8, 10]

Notice that if the first iterable is never exhausted, the second one will never be used.
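This can be confirmed by concatenating an unbounded stream (the illustrative count below) with a finite fall-back and drawing only a few items:

```python
from itertools import islice

def concatenate(xs, ys):
    for x in xs:
        yield x
    for y in ys:
        yield y

def count():
    # A stream that is never exhausted.
    n = 0
    while True:
        yield n
        n += 1

# The fall-back stream [100, 200] is never reached.
list(islice(concatenate(count(), [100, 200]), 5))  # [0, 1, 2, 3, 4]
```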

To address the second requirement, first consider a simpler scenario. What if you would like to “line up” or “pair up” entries in two or more iterables? You can use the built-in zip function.

>>> list(zip(skips(0,5), skips(6,11)))
[(0, 6), (2, 8), (4, 10)]

Notice that the result of evaluating zip is indeed an iterator.

>>> import collections.abc
>>> isinstance(
...     zip(skips(0,5), skips(6,11)),
...     collections.abc.Iterator
... )
True

Combining zip with comprehension syntax, you can now define a generator that interleaves two iterables (switching back and forth between emitting an item from one and then the other).

def interleave(xs, ys):
    return (
        z
        for (x, y) in zip(xs, ys)
        for z in (x, y)
    )

As with concatenation, interleaving is now concise and straightforward.

>>> list(interleave(skips(0,5), skips(6,11)))
[0, 6, 2, 8, 4, 10]
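One caveat worth noting: because zip stops as soon as the shorter of its inputs is exhausted, this version of interleave silently drops any trailing items from the longer stream. A self-contained sketch illustrating this behavior:

```python
def skips(start, end):
    # Every other integer in the half-open range [start, end).
    return (i for i in range(start, end) if (i - start) % 2 == 0)

def interleave(xs, ys):
    return (z for (x, y) in zip(xs, ys) for z in (x, y))

# skips(0, 11) yields six items but skips(6, 9) yields only two,
# so the trailing items 4, 6, 8, 10 of the first stream are dropped.
list(interleave(skips(0, 11), skips(6, 9)))  # [0, 6, 2, 8]
```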

Finally, how can you help users process items from a stream in parallel? Because you are already using iterables, users have some options available to them from the built-in itertools library.

One option is islice, which behaves in a similar manner to Python slice notation (such as xs[0:10] to extract the first ten entries from a list xs). Users can use this function to extract items in batches and (1) pass each item in a batch to its own separate thread or (2) pass batches of items to separate threads. A basic batching method is presented below.

def batch(xs, size):
    from itertools import islice
    ys = list(islice(xs, 0, size))
    while len(ys) > 0:
        yield ys
        ys = list(islice(xs, 0, size))

Notice that this method inherits the graceful behavior of slice notation when the boundaries of the slices do not line up exactly with the number of entries in the data structure instance.

>>> list(batch(skips(0,21), 3))
[[0, 2, 4], [6, 8, 10], [12, 14, 16], [18, 20]]

Can you define a generator that returns batches of batches (e.g., at most n batches each of size at most k)?
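One possible answer is sketched below; the name batches and its signature are illustrative, and the approach simply applies the same slicing idea one level up:

```python
from itertools import islice

def batch(xs, size):
    ys = list(islice(xs, 0, size))
    while len(ys) > 0:
        yield ys
        ys = list(islice(xs, 0, size))

def batches(xs, n, k):
    # Yield groups containing at most n batches, each of size at most k.
    bs = batch(xs, k)
    group = list(islice(bs, 0, n))
    while len(group) > 0:
        yield group
        group = list(islice(bs, 0, n))

def skips(start, end):
    return (i for i in range(start, end) if (i - start) % 2 == 0)

list(batches(skips(0, 21), 2, 3))
# [[[0, 2, 4], [6, 8, 10]], [[12, 14, 16], [18, 20]]]
```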

Another option is to use the tee function, which can duplicate a single iterable into multiple iterables. However, this function is really only simulating this effect by storing a large amount of auxiliary information from one of the iterables. Thus, it may use a significant amount of memory and is not safe to use with multiprocessing. It is best suited for situations in which the iterables are known to have a small number of items, as in the example below.

>>> from itertools import tee
>>> (a, b) = tee(skips(0,11), 2)
>>> (list(a), list(b))
([0, 2, 4, 6, 8, 10], [0, 2, 4, 6, 8, 10])

The example above is arguably implemented in a more clear and familiar way by simply wrapping the iterables using list.

>>> ns = list(skips(0,11))
>>> (ns, ns)
([0, 2, 4, 6, 8, 10], [0, 2, 4, 6, 8, 10])

Further Reading

This article presents the relationships and distinctions between a number of related Python language constructs, and illustrates via the use case of data streams how these constructs can be leveraged. You can visit the Python Wiki if you are looking for additional discussion and examples of both iterators and generators. The Python documentation also contains a Functional Programming HOWTO that discusses how iterators and generators offer new kinds of modularity and composability. Furthermore, the Python documentation entry for the built-in itertools library contains many built-in functions and recommended patterns that are specialized to particular scenarios.

This article is also available as a Jupyter Notebook and in HTML form on GitHub.
