Python generators

A guide for the data practitioner

Seydou Dia
6 min read · Apr 27, 2018

The goal of this post is to demystify Python's generators. I will also provide several examples of how they can be leveraged to solve real-world data engineering challenges. Over my career I've found generators to be a very useful and powerful abstraction when dealing with streams of data; in this post I will share my experience, and hopefully it will be useful to you as well!

tl;dr

  • Use a generator when you deal with a stream of data, possibly an infinite one. On the other hand, if you need to iterate over and over on a set of values, use a list.
  • Know the difference between a generator expression and a generator function
  • How to write a generator function: see E0
  • How to pipe many streams: see E1
  • How to iterate over all the records of all files in all folders for a given path: see E2
  • How to dump a database to S3 using generators: see E3
  • How to consume a Kinesis stream: see E4

Generator or list?

Let's look at the following code snippet:
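A minimal sketch of such a pair could look like this (the variable names and values are illustrative assumptions):

```python
# Same expression, different delimiters.
squares_list = [n * n for n in range(5)]   # brackets: list comprehension
squares_gen = (n * n for n in range(5))    # parentheses: generator expression

print(type(squares_list).__name__)  # list
print(type(squares_gen).__name__)   # generator
```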

The difference between the two expressions seems insignificant: one uses brackets and the other uses parentheses. However, the difference is major: the first is a list comprehension and the second is a generator expression, and these evaluate to two distinct kinds of object in Python.

Let's start with the bracket version. Once evaluated, a list comprehension is simply a list: it is fully available in memory and any list operation can be performed on it.

A generator expression, on the other hand, evaluates to a generator, and that is a different animal from a list. The only thing a generator and a list have in common is that both can be iterated over. The major difference is that a generator expression is lazy: its elements are not held in memory; they are computed on demand, when requested.

To put it another way, let's first review the types of operation available on a Python list, then compare with generator objects. I group the operations that can be performed on a list into 3 categories: (a) index operations, (b) mutation operations and (c) iteration operations. We will see that for a generator only the iteration operation is available.

Index operations on list

The main operations are accessing list elements by index and slicing the list. Note that the len function is put in this category because computing the length of a list is equivalent to returning its maximum index plus one. These are mainly read-only operations performed on the list object.
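As a small illustration (the list contents are made up for this sketch), index operations on a list look like this, and the same indexing attempt on a generator fails:

```python
letters = ["a", "b", "c", "d"]

first = letters[0]       # access by index
last = letters[-1]       # negative indices count from the end
middle = letters[1:3]    # slicing returns a new list
size = len(letters)      # number of elements

# A generator has no index: subscripting it raises TypeError.
gen = (c for c in letters)
try:
    gen[0]
    indexable = True
except TypeError:
    indexable = False
```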

Mutation operations on list

These are operations that change a list. Note that we are talking about in-place modification, as opposed to making a new copy of the list.
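A few in-place mutations, sketched on an illustrative list (the comments show the list after each step):

```python
numbers = [3, 1, 2]

numbers.append(4)      # [3, 1, 2, 4]
numbers[0] = 10        # [10, 1, 2, 4]
numbers.sort()         # [1, 2, 4, 10]
del numbers[1]         # [1, 4, 10]
numbers.extend([7])    # [1, 4, 10, 7]
```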

Iteration on list

Iterating over the elements of a list is probably the most common operation in Python. The main point to note here is that a list can be iterated over again and again.
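For instance, with an illustrative list, two successive loops both see every element:

```python
colors = ["red", "green", "blue"]

first_pass = []
for color in colors:
    first_pass.append(color.upper())

# The list is still fully there for a second loop.
second_pass = []
for color in colors:
    second_pass.append(len(color))
```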

So…, what is a generator?

We have just seen 3 types of operation allowed on a list, and it is important to note that 2 of them (indexing and mutation) are possible only because list elements are available in memory. Since a generator is lazy, meaning its elements are computed on demand, indexing and mutation cannot be performed. Only iteration can be performed on a generator object, and even then only once.
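The "only once" part is easy to check with a tiny sketch (values are illustrative):

```python
gen = (n * 2 for n in range(3))

first_pass = list(gen)    # [0, 2, 4]
second_pass = list(gen)   # []  (the generator is already exhausted)
```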

So, you might ask, what is the point of using a generator over a list if the only operation allowed is iteration?

The answer is that it depends on the use case. I like to think of a generator as an abstraction for a stream. I use generators every time I deal with a stream of data, possibly an infinite one. If you think about it, it makes sense: elements in a stream do not need to be indexed, and it does not make sense to mutate a stream… unless you mean on-the-fly operations, i.e. operations applied on demand. To conclude this section:

Use a generator when you deal with a stream of data, possibly an infinite one. On the other hand, if you need to iterate over and over on a set of values, use a list.

In the following sections we will demonstrate some use cases for generators.

Generator examples

E0. Introducing Yield

You are already familiar with generator expressions:
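As a reminder, a generator expression might look like this (the name gen_expr and the squared values are assumptions for this sketch):

```python
gen_expr = (n * n for n in range(5))

# Values are computed one at a time as the loop asks for them.
values = []
for value in gen_expr:
    values.append(value)
```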

Here is another way to write it, using a generator function:
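A minimal sketch of a generator function named gen_func; the body (squaring the numbers below a limit) is an assumption chosen to mirror the expression (n * n for n in range(limit)):

```python
def gen_func(limit):
    """Yield the squares of 0, 1, ..., limit - 1, one at a time."""
    for n in range(limit):
        yield n * n


# Calling the function returns a generator; iterating fetches the values.
values = []
for value in gen_func(5):
    values.append(value)
```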

Note the use of the keyword yield in gen_func; the yield keyword is what makes a function a generator function. A yield statement is similar to a return statement in the sense that both hand a computed value back to the caller.

The difference is that after a return the function exits, and that's it. After a yield, on the other hand, the function does not exit: it pauses at the yield and waits there until someone asks for the next value, at which point it resumes and runs until the next yield. In other words, the function yields computed values on demand. And the way to ask a generator function for its values is to iterate over the generator it returns.
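The pause-and-resume behaviour can be observed by stepping through a generator by hand with next(). This is only a sketch; the countdown function and the events log are made up for the demonstration:

```python
events = []


def countdown():
    events.append("started")
    yield 3
    events.append("resumed")
    yield 2
    events.append("finishing")


gen = countdown()             # nothing has run yet
events_before = list(events)  # still empty at this point

first = next(gen)        # runs up to the first yield
second = next(gen)       # resumes, runs up to the second yield
remaining = list(gen)    # runs to the end; no more values to yield
```

A for loop does exactly this under the hood: it calls next() until the generator is exhausted.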

E1. An infinite stream of numbers

Here is a simple example of an infinite number generator:
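A minimal sketch, assuming the stream is simply the integers counted up from zero (the name gen_numbers is hypothetical). itertools.islice is used here only to peek at the first few values without looping forever:

```python
from itertools import islice


def gen_numbers():
    """Yield 0, 1, 2, ... without ever stopping."""
    n = 0
    while True:
        yield n
        n += 1


# Take only the first five values of the infinite stream.
first_five = list(islice(gen_numbers(), 5))
```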

Now let's pipe our infinite stream into different streams, each one performing a specific operation, e.g. filtering, maths, etc. For the purpose of the demonstration we've added a break condition in the previous while loop; otherwise the script below would run indefinitely… remember, it's an infinite number generator!
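One possible version of such a pipeline, built from generator expressions. The gen_numbers function, its limit parameter and the three transformations are assumptions made for this sketch:

```python
def gen_numbers(limit=10):
    """Count up from 0, with a break condition so the demo terminates."""
    n = 0
    while True:
        if n >= limit:
            break
        yield n
        n += 1


numbers = gen_numbers()
evens = (n for n in numbers if n % 2 == 0)   # filtering
squares = (n * n for n in evens)             # maths
shifted = (n + 1 for n in squares)           # more maths

# Nothing has been computed so far; this loop pulls values through the pipe.
results = []
for value in shifted:
    results.append(value)
```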

Note how easy it is to modify our initial stream; note also that none of the operations is evaluated until the last for loop, when we actually fetch elements from the final stream.

There are several ways to write the script above; it is a matter of taste. Let's rewrite it using generator functions only:
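A sketch of a pipeline made of generator functions only; all function names and the chosen operations are illustrative assumptions:

```python
def gen_numbers(limit=10):
    """Count up from 0, stopping at limit so the demo terminates."""
    n = 0
    while n < limit:
        yield n
        n += 1


def gen_even(stream):
    """Keep only the even numbers of the incoming stream."""
    for n in stream:
        if n % 2 == 0:
            yield n


def gen_square(stream):
    """Square each number of the incoming stream."""
    for n in stream:
        yield n * n


def gen_plus_one(stream):
    """Add one to each number of the incoming stream."""
    for n in stream:
        yield n + 1


pipeline = gen_plus_one(gen_square(gen_even(gen_numbers())))

results = []
for value in pipeline:
    results.append(value)
```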

Note how each generator builds upon the previous one. As always no computation is performed until the last for loop when the elements are fetched.

E2. Iterate over all files in a folder hierarchy

Using our knowledge from the previous example, let's build a script that iterates over all the rows in all the files in all the directories of a given folder hierarchy. We are only interested in a given file format (let's say gzipped JSON) and record format. This is a typical data engineering problem that can be solved using generators. Let's see how!
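A minimal sketch of such a script. It assumes files end in ".json.gz" and hold one JSON object per line; gen_files and its suffix parameter are hypothetical names:

```python
import gzip
import json
import os


def gen_files(root, suffix=".json.gz"):
    """Yield the path of every matching file under root, at any depth."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)


def gen_records(paths):
    """Yield one parsed record per non-empty line of each gzipped file."""
    for path in paths:
        # The file is opened lazily and closed once its lines are consumed.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)
```

Usage would be a single loop such as `for record in gen_records(gen_files("/some/path")): ...`, with only one line held in memory at a time.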

Hopefully nothing new here. Note how, in the gen_records function, the with context manager plays nicely with the generator: setup and clean-up are handled for us. Finally, note that thanks to the lazy evaluation of generators, the previous script is very memory efficient. Not knowing beforehand the total number of files and their sizes is not a big deal from a memory perspective.

E3. Dump a database to S3

In this section, using what we have learned so far, we will implement a program to dump the content of a database table to S3 as batches of gzipped JSON files.
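Here is one way this could be sketched. Several things are assumptions: sqlite3 stands in for the real database, the key naming scheme is made up, and upload is a hypothetical callable taking (key, body). With boto3 it could wrap something like s3_client.put_object(Bucket=..., Key=key, Body=body).

```python
import gzip
import json
import sqlite3


def gen_rows(connection, query, chunk_size=1000):
    """Yield rows as dicts, fetching them from the database chunk by chunk."""
    cursor = connection.execute(query)
    columns = [col[0] for col in cursor.description]
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        for row in chunk:
            yield dict(zip(columns, row))


def gen_batches(rows, batch_size):
    """Group a stream of rows into lists of at most batch_size rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def gen_gzipped(batches):
    """Turn each batch into a (key, gzipped JSON-lines bytes) pair."""
    for index, batch in enumerate(batches):
        lines = "\n".join(json.dumps(row) for row in batch)
        yield f"part-{index:05d}.json.gz", gzip.compress(lines.encode("utf-8"))


def dump_table(connection, query, upload, batch_size=1000):
    """Pull rows through the pipeline and hand each file to upload()."""
    count = 0
    rows = gen_rows(connection, query)
    for key, body in gen_gzipped(gen_batches(rows, batch_size)):
        upload(key, body)
        count += 1
    return count
```

Only one batch is held in memory at any time, whatever the size of the table.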

E4. Consume a Kinesis stream

In this example we show how to consume a real-world stream: an AWS Kinesis stream. One of the challenges here is that the reading position in the stream needs to be persisted; otherwise you end up reading the entire stream again every time. Also, unless your goal is to design a long-running program that consumes the stream, you need a way to interrupt your script with a stop condition. We leave it as an exercise for the reader :-), it's not as easy as it looks.
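A rough sketch of a single-shard consumer. It assumes client is a boto3 Kinesis client (for example boto3.client("kinesis")) and uses its get_shard_iterator and get_records calls. The load_position and save_position callables are hypothetical hooks for persisting the last sequence number, and the stop condition is still left out:

```python
import time


def gen_kinesis_records(client, stream_name, shard_id,
                        load_position, save_position, poll_delay=1.0):
    """Yield records from one shard, resuming after the saved position."""
    last_seq = load_position()
    if last_seq is None:
        # No saved position: start from the oldest available record.
        response = client.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )
    else:
        response = client.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=last_seq,
        )
    iterator = response["ShardIterator"]

    while iterator is not None:
        response = client.get_records(ShardIterator=iterator, Limit=100)
        records = response["Records"]
        for record in records:
            yield record
            # Runs only when the consumer asks for the next record,
            # i.e. once it is done with this one.
            save_position(record["SequenceNumber"])
        # A missing or None iterator means the shard has been closed.
        iterator = response.get("NextShardIterator")
        if iterator is not None and not records:
            time.sleep(poll_delay)  # nothing new yet: avoid hammering the API
```

Because the position is saved only after the consumer comes back for more, a record being processed when the script stops will be delivered again on the next run.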

Okay… this ended up being a bit more complex than I thought, but hopefully you can see the power of generators in action!

That's all folks!

Thanks for reading this. In a future post I will show more applications of generators, e.g. how to combine generators and multi-threading/multi-processing to make parallel/concurrent programs. Maybe, if I am in a good mood, I will show how a generator can help in writing a parser, and compilers in general. Take care!
