What is “yield” or “generator” in Python?
A lot of new Python learners have a hard time wrapping their heads around PEP 380. I am usually asked:
- What does the “yield” keyword do?
- What are the situations where “yield from” is useful?
- What is the classic use case?
- Why is it compared to micro-threads?
To understand what yield does, you must understand what generators are. And before you can understand generators, you must understand iterables.
An iterable is an object that has an __iter__ method which returns an iterator, or which defines a __getitem__ method that can take sequential indexes starting from zero (and raises an IndexError when the indexes are no longer valid). So an iterable is an object that you can get an iterator from. In other words, an iterable is:
- anything that can be looped over (i.e. you can loop over a string or file), or
- anything that can appear on the right-hand side of a for-loop: for x in iterable: ..., or
- anything you can call iter() on that will return an ITERATOR:
>>> mylist = [1, 2, 3]
>>> # mylist = [x*x for x in range(3)] # or a list comprehension
>>> for i in mylist:
...     print(i)
1
2
3
Generators are iterators, a kind of iterable you can only iterate over once. Generators do not store all the values in memory, they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print(i)
0
1
4
It is just the same except you used () instead of []. BUT, you cannot perform for i in mygenerator a second time since generators can only be used once: they calculate 0, then forget about it and calculate 1, and end after calculating 4, one by one.
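The one-shot behaviour is easiest to see by stepping the generator by hand with next(); a minimal sketch:

```python
mygenerator = (x * x for x in range(3))

# Each next() computes exactly one value, then forgets it.
assert next(mygenerator) == 0
assert next(mygenerator) == 1
assert next(mygenerator) == 4

# A fourth request finds the generator empty:
try:
    next(mygenerator)
except StopIteration:
    print('exhausted')

# Any further iteration over the same generator yields nothing:
print(list(mygenerator))  # []
```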
yield is a keyword that is used like return, except the function will return a generator.
>>> def func(): # a function with yield
...     yield 'I am' # is still a function
...     yield 'a generator!'
>>> gen = func()
>>> type(gen) # but it returns a generator
<class 'generator'>
>>> hasattr(gen, '__iter__') # that's an iterable
True
>>> hasattr(gen, '__next__') # and with .__next__ (.next in Python 2), it
True # implements the iterator protocol.
Here it’s a useless example, but it’s handy when you know your function will return a huge set of values that you will only need to read once.
To master yield, you must understand that when you call the function, the code you have written in the function body does not run. The function only returns the generator object; this is a bit tricky :-)
The first time the for calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it'll return the first value of the loop. Then, each subsequent call will run another iteration of the loop you have written in the function and return the next value. This will continue until the generator is considered empty, which happens when the function runs without hitting yield. That can be because the loop has come to an end, or because you no longer satisfy an if/else.
>>> list(gen)
['I am', 'a generator!']
You'll have to make another generator if you want to use its functionality again:
>>> list(func())
['I am', 'a generator!']
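The pause-and-resume behaviour described above can be sketched by putting prints around each yield (the function name is mine, for illustration):

```python
def pausing():
    print('start')            # runs only when the first value is requested
    yield 1
    print('between yields')   # runs when the generator is resumed
    yield 2
    print('end')              # runs on the final resume, before StopIteration

p = pausing()                 # nothing printed yet: the body has not run
print(next(p))                # prints 'start', then 1
print(next(p))                # prints 'between yields', then 2
```

Calling pausing() prints nothing at all; the body only starts executing on the first next(), and each yield freezes it in place until the next request.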
- A function with yield, when called, returns a generator.
- Generators are iterators because they implement the iterator protocol, so you can iterate over them.
- yield is only legal inside of a function definition, and the inclusion of yield in a function definition makes it return a generator.
yield provides an easy way of implementing the iterator protocol, defined by the following two methods: __iter__ and next (Python 2) or __next__ (Python 3). Both of those methods make an object an iterator that you could type-check with the Iterator Abstract Base Class from the collections.abc module (collections in Python 2).
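To make that concrete, here is a minimal sketch (class and function names are mine) contrasting a hand-written iterator with the yield shortcut; both pass the Iterator ABC check:

```python
from collections.abc import Iterator

class CountUpTo:
    """Hand-written iterator: implements __iter__ and __next__ itself."""
    def __init__(self, limit):
        self.current = 0
        self.limit = limit

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= self.limit:
            raise StopIteration
        self.current += 1
        return self.current

def count_up_to(limit):
    """Same behaviour; yield does the protocol bookkeeping for us."""
    current = 0
    while current < limit:
        current += 1
        yield current

assert isinstance(CountUpTo(3), Iterator)
assert isinstance(count_up_to(3), Iterator)
print(list(CountUpTo(3)))    # [1, 2, 3]
print(list(count_up_to(3)))  # [1, 2, 3]
```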
The magic of the yield keyword is reduced to two simple facts:
- If the compiler detects the yield keyword anywhere inside a function, that function no longer returns via the return statement. Instead, it immediately returns a lazy "pending list" object called a generator.
- A generator is iterable. What is an iterable? It's anything like a list, set, range or dict-view, with a built-in protocol for visiting each element in a certain order.
In a nutshell: a generator is a lazy, incrementally-pending list, and yield statements allow you to use function notation to program the list values the generator should incrementally spit out.
yield forms an expression that allows data to be sent into the generator:
def bank_account(deposited, interest_rate):
    while True:
        calculated_interest = interest_rate * deposited
        received = yield calculated_interest
        deposited += received

my_account = bank_account(1000, .05)
first_year_interest = next(my_account)
next_year_interest = my_account.send(first_year_interest + 1000)
Cooperative Delegation to Sub-Coroutine
def money_manager(expected_rate):
    under_management = yield # must receive deposited value
    while True:
        additional_investment = yield expected_rate * under_management
        if additional_investment:
            under_management += additional_investment

def investment_account(deposited, manager):
    '''very simple model of an investment account
    that delegates to a manager'''
    next(manager) # must queue up manager
    manager.send(deposited)
    yield from manager
    return manager.close()

my_manager = money_manager(.06)
my_account = investment_account(1000, my_manager)
first_year_return = next(my_account)
next_year_return = my_account.send(first_year_return + 1000)
The concept that yield from generator is equivalent to for value in generator: yield value does not even begin to do justice to what yield from is all about. Because, let's face it, if all yield from does is expand the for loop, then it does not warrant adding yield from to the language and precluding a whole bunch of new features from being implemented in Python 2.x.

What yield from does is establish a transparent bidirectional connection between the caller and the sub-generator:
- The connection is “transparent” in the sense that it will propagate everything correctly too, not just the elements being generated (e.g. exceptions are propagated).
- The connection is “bidirectional” in the sense that data can be both sent from and to a generator.
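Both properties can be sketched in a few lines (all function names here are mine): values passed to .send() and exceptions passed to .throw() reach the inner generator straight through the yield from in the middle, with no plumbing in the delegating generator:

```python
def inner():
    received = yield 'ready'
    yield 'inner got %s' % received

def outer():
    yield from inner()   # transparently forwards next(), send(), throw()

gen = outer()
print(next(gen))            # 'ready' - yielded by inner, surfaced by outer
print(gen.send('hello'))    # 'inner got hello' - the sent value reached inner

def fragile():
    try:
        yield 1
    except ValueError:
        yield 'inner caught it'

def wrapper():
    yield from fragile()

g = wrapper()
next(g)
print(g.throw(ValueError))  # 'inner caught it' - the exception propagated inward
```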
Why is it compared to micro-threads?
I think what this section in the PEP is talking about is that every generator does have its own isolated execution context. Together with the fact that execution is switched between the generator-iterator and the caller using yield and __next__(), respectively, this is similar to threads, where the operating system switches the executing thread from time to time, along with the execution context (stack, registers, ...).
The effect of this is also comparable: Both the generator-iterator and the caller progress in their execution state at the same time, their executions are interleaved. For example, if the generator does some kind of computation and the caller prints out the results, you’ll see the results as soon as they’re available. This is a form of concurrency.
That analogy isn't anything specific to yield from, though; it's rather a general property of generators in Python.
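The interleaving can be sketched by having both sides append to one shared log (names are mine, for illustration); the consumer runs between each of the producer's computations:

```python
# Both sides append to one log, so the interleaving is visible afterwards.
events = []

def producer(n):
    for i in range(n):
        events.append('produce %d' % i)
        yield i * i   # pause here; control returns to the consumer

for value in producer(2):
    events.append('consume %d' % value)   # runs between computations

print(events)  # ['produce 0', 'consume 0', 'produce 1', 'consume 1']
```

Each result is consumed as soon as it is produced, before the next computation starts, which is exactly the cooperative, interleaved execution the micro-thread comparison is pointing at.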
A Bug in Python “yield”:
Note: this was a bug in CPython's handling of yield in comprehensions and generator expressions, fixed in Python 3.8, with a deprecation warning in Python 3.7. See the Python bug report and the What's New entries for Python 3.7 and Python 3.8.
>>> [(yield i) for i in range(3)]
<generator object <listcomp> at 0x0245C148>
>>> list([(yield i) for i in range(3)])
[0, 1, 2]
>>> list((yield i) for i in range(3))
[0, None, 1, None, 2, None]
It seems odd:
- that a list comprehension returns a generator and not a list
- and that the generator expression converted to a list and the corresponding list comprehension contain different values.
Generator expressions, and set and dict comprehensions are compiled to (generator) function objects. In Python 3, list comprehensions get the same treatment; they are all, in essence, a new nested scope.
You can see this if you try to disassemble a generator expression:
>>> import dis
>>> dis.dis(compile("(i for i in range(3))", '', 'exec'))
  1           0 LOAD_CONST               0 (<code object <genexpr> at 0x10f7530c0, file "", line 1>)
              3 LOAD_CONST               1 ('<genexpr>')
              6 MAKE_FUNCTION            0
              9 LOAD_NAME                0 (range)
             12 LOAD_CONST               2 (3)
             15 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             18 GET_ITER
             19 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             22 POP_TOP
             23 LOAD_CONST               3 (None)
             26 RETURN_VALUE
>>> dis.dis(compile("(i for i in range(3))", '', 'exec').co_consts[0])
  1           0 LOAD_FAST                0 (.0)
        >>    3 FOR_ITER                11 (to 17)
              6 STORE_FAST               1 (i)
              9 LOAD_FAST                1 (i)
             12 YIELD_VALUE
             13 POP_TOP
             14 JUMP_ABSOLUTE            3
        >>   17 LOAD_CONST               0 (None)
             20 RETURN_VALUE
The above shows that a generator expression is compiled to a code object, loaded as a function (MAKE_FUNCTION creates the function object from the code object). The .co_consts reference lets us see the code object generated for the expression, and it uses YIELD_VALUE just like a generator function would.
As such, the yield expression works in that context, as the compiler sees these as functions-in-disguise.

This is a bug; yield has no place in these expressions. The Python grammar before Python 3.7 allows it (which is why the code is compilable), but the yield expression specification shows that using yield here should not actually work:
The yield expression is only used when defining a generator function and thus can only be used in the body of a function definition.
This has been confirmed to be a bug in issue 10544. The resolution of the bug is that using yield (or yield from) in a comprehension or generator expression will raise a SyntaxError in Python 3.8; in Python 3.7 it raises a DeprecationWarning to ensure code stops using this construct. You'll see the same warning in Python 2.7.15 and up if you use the -3 command line switch enabling Python 3 compatibility warnings.
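On Python 3.8+ both spellings above are simply a SyntaxError. If what you wanted was a sequence of yielded values, a plain generator function (or a comprehension without yield) expresses it legally; a minimal sketch:

```python
def gen():
    # the legal spelling of "yield each i" in a loop
    for i in range(3):
        yield i

print(list(gen()))            # [0, 1, 2]
print([i for i in range(3)])  # [0, 1, 2] - no yield needed in a comprehension
```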
Lazy Method for Reading Big File in Python
To write a lazy function, just use yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)  # process_data stands in for your own handler
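Another option for the same lazy read is the two-argument iter(callable, sentinel) built-in: iteration stops as soon as the callable returns the sentinel (b'' for a file opened in binary mode). A sketch, using io.BytesIO as a stand-in for a real file on disk:

```python
import io
from functools import partial

# io.BytesIO stands in for open('really_big_file.dat', 'rb')
f = io.BytesIO(b'x' * 2500)

# Calls f.read(1024) repeatedly until it returns b''.
chunks = list(iter(partial(f.read, 1024), b''))
print([len(c) for c in chunks])  # [1024, 1024, 452]
```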
What can you use a Generator function for?
Generators give you lazy evaluation. You use them by iterating over them, either explicitly with ‘for’ or implicitly by passing it to any function or construct that iterates.
Generators are good for calculating large sets of results (in particular calculations involving loops themselves) where you don’t know if you are going to need all results, or where you don’t want to allocate the memory for all results at the same time.
Another use for generators (that is really the same) is to replace callbacks with iteration. In some situations you want a function to do a lot of work and occasionally report back to the caller. Traditionally you’d use a callback function for this. You pass this callback to the work-function and it would periodically call this callback. The generator approach is that the work-function (now a generator) knows nothing about the callback, and merely yields whenever it wants to report something. The caller, instead of writing a separate callback and passing that to the work-function, does all the reporting work in a little ‘for’ loop around the generator.
For example, say you wrote a ‘filesystem search’ program. You could perform the search in its entirety, collect the results and then display them one at a time. All of the results would have to be collected before you showed the first, and all of the results would be in memory at the same time. Or you could display the results while you find them, which would be more memory efficient and much friendlier towards the user. The latter could be done by passing the result-printing function to the filesystem-search function, or it could be done by just making the search function a generator and iterating over the result.
If you want to see an example of the latter two approaches, see os.path.walk() (the old filesystem-walking function with callback) and os.walk() (the new filesystem-walking generator.) Of course, if you really wanted to collect all results in a list, the generator approach is trivial to convert to the big-list approach:
big_list = list(the_generator)
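The callback-versus-generator contrast above can be sketched in a few lines (function names are mine): the callback version pushes results into a function it was handed, while the generator version just yields and lets the caller do the reporting in its own for loop:

```python
# Callback style: the work-function must know about the reporting hook.
def find_evens_callback(numbers, report):
    for n in numbers:
        if n % 2 == 0:
            report(n)

found = []
find_evens_callback(range(10), found.append)
print(found)  # [0, 2, 4, 6, 8]

# Generator style: the work-function knows nothing about reporting.
def find_evens(numbers):
    for n in numbers:
        if n % 2 == 0:
            yield n

for n in find_evens(range(10)):
    print(n)  # reported as soon as each one is found
```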
When a Generator function should not be used
Use a list instead of a generator when:
You need to access the data multiple times (i.e. cache the results instead of recomputing them):
for i in outer:           # used once, okay to be a generator
    for j in inner:       # used multiple times, reuse a list
        ...
You need random access (or any access other than forward sequential order):
for i in reversed(data): ...   # generators aren't reversible
s[i], s[j] = s[j], s[i]        # generators aren't indexable
You need to join strings (which requires two passes over the data):
s = ''.join(data) # lists are faster than generators in this use case
You are using PyPy which sometimes can’t optimize generator code as much as it can with normal function calls and list manipulations.