Python Generators: Memory-efficient programming tool

This article aims to give a comprehensive understanding of the basics of generators, which could otherwise take days to piece together from numerous online resources.

What are Generators?

Generators are a memory-efficient way of processing huge datasets. They process the data incrementally and do not allocate memory for all the results at once. They really come in handy when implementing data science pipelines for huge datasets in a resource-constrained environment (in terms of RAM). Thus, generators are an effective tool for improving the scalability of a program and making it more responsive to user requests.

Why would we need Generators?

Before we lay our hands on generator functions, it’s best to start by looking at iterators. Iterators are objects that can be looped over. In more Pythonic terms, they are Python objects that follow the iterator protocol: to qualify as an iterator, the object’s class must define two methods, __iter__() and __next__().

Iterators are well suited to a style of programming called Lazy Implementation (more commonly known as lazy evaluation). Does that ring a bell for the Scala programmers? Yes, iterators are effective programming entities for dealing with big datasets. Lazy Implementation simply means that values are computed, and memory for them is used, only when they are requested, not when the iterator is created. So, when you define an iterator (in this tutorial, either with a class definition or with a generator function) and then create an instance of it, none of the computation defined in its methods or body takes place. The work happens one value at a time, as each value is requested, for example by a for loop or by a call to next(). Python offers a handful of tools for Lazy Implementation, including generators and built-ins such as range(), map(), filter() and zip().
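A minimal sketch of this laziness, using a hypothetical generator function called noisy_numbers (the name and the print message are illustrative assumptions). Creating the generator runs none of its body; the work starts only when the first value is requested.

```python
def noisy_numbers(limit):
    """Hypothetical generator that announces when its body actually starts."""
    print("generator body started")
    for n in range(limit):
        yield n


gen = noisy_numbers(3)   # nothing is printed here: the body has not run yet
first = next(gen)        # "generator body started" is printed now, then 0 is produced
rest = list(gen)         # the remaining values are computed on demand
print(first, rest)       # 0 [1, 2]
```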

Unfortunately, when we define custom iterators, we are forced to write fairly verbose code, as shown below. In the example below, we create a simple class called Counter that acts as a countdown machine.
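A minimal sketch of such a class, assuming the countdown runs from a start value down to 1. Only the class name Counter comes from the text; the attribute name and the stopping point are illustrative assumptions.

```python
class Counter:
    """Countdown iterator: produces start, start - 1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__().
        return self

    def __next__(self):
        if self.current <= 0:
            # Tells the caller (for example a for loop) that iteration is over.
            raise StopIteration
        value = self.current
        self.current -= 1
        return value
```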

Implementing iterators using classes

Once we have defined this class, we have to instantiate it and iterate through it using a for loop. (Doesn’t it look horribly cumbersome?)
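A sketch of the instantiation and loop, assuming a countdown from 5. The illustrative Counter class is included in compact form so that the snippet runs on its own.

```python
class Counter:
    """Countdown iterator (illustrative): produces start, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


counter = Counter(5)
for number in counter:
    print(number)            # prints 5, 4, 3, 2, 1 on separate lines

# The instance is now exhausted, so iterating over it again produces nothing.
leftover = list(counter)
print(leftover)              # []
```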

Creating Counter class instance and looping it

Now, let us do the same thing using a generator function. A generator function is simply a function that produces values one at a time: instead of returning a value, it yields it and pauses until the next value is requested. Any function that contains a yield statement is a generator function.
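A minimal sketch of the same countdown written as a generator function. The name counter and its parameter are assumptions made for illustration. The two checks at the end show that the generator object already provides __iter__() and __next__() without our having to write them.

```python
def counter(start):
    """Countdown generator: yields start, start - 1, ..., 1."""
    current = start
    while current > 0:
        yield current      # hand one value to the caller and pause here
        current -= 1


gen = counter(3)
# The generator object implements the iterator protocol out of the box.
has_protocol = hasattr(gen, "__iter__") and hasattr(gen, "__next__")
print(has_protocol)        # True
print(iter(gen) is gen)    # True
```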

Implementing iterators using generators

Once again, we create an instance of this generator and iterate through it using a for loop. (This looks so much better)
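A sketch of creating the generator and looping over it, again assuming a countdown from 5. The illustrative counter function is repeated so that the snippet is self-contained.

```python
def counter(start):
    """Countdown generator (illustrative): yields start, ..., 1."""
    while start > 0:
        yield start
        start -= 1


countdown = counter(5)       # creates the generator; no counting has happened yet
collected = []
for number in countdown:
    print(number)            # prints 5, 4, 3, 2, 1 on separate lines
    collected.append(number)
```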

Creating Counter generator instance and looping it

So, we have seen how easy it is to create iterators using generators. In short, Generators are easy and efficient ways of creating custom-made iterators.

How to implement simple generator functions?

When we want to implement simple tasks using generators, we can use one-line definitions, called generator expressions, that look like list comprehensions. The only difference is that these one-liners are enclosed in round brackets rather than the square brackets used in list comprehensions.

In the example below, we create a generator expression that produces the squares of the even numbers from 0 to 8.
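A minimal sketch of such a one-liner, shown next to the equivalent list comprehension for comparison. The variable names are illustrative.

```python
# Generator expression: round brackets, values are produced on demand.
even_squares = (n ** 2 for n in range(9) if n % 2 == 0)

# List comprehension: square brackets, all values are built immediately.
even_squares_list = [n ** 2 for n in range(9) if n % 2 == 0]

print(type(even_squares).__name__)   # generator
result = list(even_squares)
print(result)                        # [0, 4, 16, 36, 64]
```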

Simple generator implementation, similar to list comprehension

Many people argue that, instead of the simple generator expressions shown above, we could use the built-in filter() function available in Python 3. However, filter() only selects items: it suits applications in which we want to choose a particular set of rows from a table or a particular set of numbers from a range. It therefore works only with a condition that evaluates to True or False for each item. For example, to produce the squares of the even numbers within a given range of whole numbers, filter() can pick out the even numbers, but the squaring has to be done in a separate step. In the following code example, we display the squares of the even numbers in the range [0, 100). Like a generator, filter() is lazy: it returns an iterator (a filter object) that produces matching items on demand, although strictly speaking it is not a generator.
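A sketch of this approach, assuming that the even numbers are taken from range(100) and then squared. filter() does the selection, and a generator expression does the squaring. The isinstance check shows that the object returned by filter() is a lazy iterator of its own type rather than a generator.

```python
import types

# filter() keeps only the items for which the predicate returns True.
evens = filter(lambda n: n % 2 == 0, range(100))
print(type(evens).__name__)                      # filter

is_generator = isinstance(evens, types.GeneratorType)
print(is_generator)                              # False

# The squaring is a separate, equally lazy step.
squares = (n ** 2 for n in evens)
first_five = [next(squares) for _ in range(5)]
print(first_five)                                # [0, 4, 16, 36, 64]
```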

filter() returns a lazy iterator

We’ve seen how generators can be created, but what can we do with them?

The most important function to use with generators is next(). It is used to iterate manually, as opposed to using a for loop. In the following example, we iterate through the same generator function twice, once using next() and once using a for loop. The simple generator function triples each number in a given range.
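A minimal sketch, assuming a hypothetical generator function called triple that triples each number in range(limit). It is consumed first with next() and then with a for loop.

```python
def triple(limit):
    """Yield 3 * n for every n in range(limit)."""
    for n in range(limit):
        yield 3 * n


manual = triple(3)
a = next(manual)             # 0
b = next(manual)             # 3
c = next(manual)             # 6
# One more plain next(manual) would raise StopIteration.
# Passing a default value returns that default instead.
d = next(manual, None)       # None

looped = []
for value in triple(3):      # a fresh generator, consumed by a for loop
    looped.append(value)
print(a, b, c, d, looped)    # 0 3 6 None [0, 3, 6]
```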

next() in action

The other important method is send(). It resumes the generator, that is, it wakes the generator up and hands it a new value, which could be an integer, a float, a list or a string. This value becomes the result of the yield expression at which the generator was paused and is used in the processing up to the next yield. For example, we could use send() to pass an integer to the generator function so that it produces the square of this integer. When we pass None, send() simply acts like next().

A very important thing to keep in mind is that, before we can use send() to pass a value, we must first call next() on the generator or call send(None). A newly created generator has not yet started running, so it is not paused at a yield expression that could receive the value; sending a non-None value at this point raises a TypeError.

In the following example, we demonstrate a generator function that simply prints the value passed to it with send(). When we pass a string containing multiple words, the string is split into its constituent words; when we pass a list, it is split into its elements. This approach could be used to build a dictionary of the words found in a document or text file (useful in NLP applications).
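A minimal sketch of such a generator, under a few assumptions: the name splitter is hypothetical, and instead of printing the pieces it yields them back to the caller so that the result of each send() call can be inspected.

```python
def splitter():
    """Hypothetical generator: receives a value via send() and yields its parts."""
    parts = None
    while True:
        received = yield parts          # pauses here; send() resumes with a value
        if isinstance(received, str):
            parts = received.split()    # a string is split into words
        elif isinstance(received, list):
            parts = list(received)      # a list is split into its elements
        else:
            parts = [received]


gen = splitter()
next(gen)                                   # prime: run up to the first yield
words = gen.send("generators are lazy")     # ['generators', 'are', 'lazy']
items = gen.send([1, 2, 3])                 # [1, 2, 3]
single = gen.send(7)                        # [7]

# Sending a real value to a generator that has not been primed fails.
early_error = None
fresh = splitter()
try:
    fresh.send("too early")
except TypeError as error:
    early_error = error
print(words, items, single, type(early_error).__name__)
```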

send() in action

Additionally, send() can be used to obtain values indefinitely without raising a StopIteration exception. A StopIteration exception is raised when the generator function has yielded all of its values and has nothing further to yield. To yield indefinitely, we must place the yield expression inside an infinite loop; otherwise StopIteration is raised as soon as the function body finishes. Another observation is that we need to assign the result of the yield expression to a variable and then use that variable to perform the task we are interested in. The following example demonstrates how arguments passed directly to the generator function are handled differently from values passed using the send() method.
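A sketch of the distinction, using a hypothetical generator called scaler. Its argument factor is bound once, when the generator is created, whereas each value delivered by send() arrives as the result of the yield expression inside the infinite loop.

```python
def scaler(factor):
    """Hypothetical generator: multiplies every sent value by a fixed factor."""
    result = None
    while True:                  # infinite loop: the generator never runs out of yields
        value = yield result     # the sent value becomes the result of this expression
        if value is not None:
            result = factor * value


gen = scaler(10)     # argument: fixed for the lifetime of this generator
next(gen)            # prime the generator so it is paused at the yield
r1 = gen.send(2)     # 20
r2 = gen.send(5)     # 50
r3 = next(gen)       # 50: next() sends None, so the last result is yielded again
print(r1, r2, r3)    # 20 50 50
```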

Differentiating arguments passed to the generator function from values passed through send()

We also have the throw() and close() methods associated with generators. The throw() method raises a specified exception inside the generator, at the point where it is paused. If the generator catches the exception, iteration can carry on to the end from the point at which throw() was called; if it does not, the exception propagates to the caller.
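A minimal sketch of throw(), assuming a hypothetical generator called resilient_counter that catches ValueError and carries on. Any exception it does not catch reaches the caller and ends the generator.

```python
def resilient_counter(limit):
    """Hypothetical generator that survives a ValueError thrown into it."""
    n = 0
    while n < limit:
        try:
            yield n
        except ValueError:
            # Handled inside the generator, so iteration can go on.
            print("ValueError handled while paused at", n)
        n += 1


gen = resilient_counter(4)
first = next(gen)                          # 0
resumed = gen.throw(ValueError("skip"))    # handled; the generator moves on and yields 1
remaining = list(gen)                      # [2, 3]

# An exception the generator does not catch propagates to the caller.
propagated = False
other = resilient_counter(2)
next(other)
try:
    other.throw(KeyError("unhandled"))
except KeyError:
    propagated = True
print(first, resumed, remaining, propagated)   # 0 1 [2, 3] True
```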

throw() in action

The close() method is simply used to end the generator prematurely as demonstrated by the following example.
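A minimal sketch of close(), assuming a hypothetical countdown generator with a finally block. The module-level list cleanup_log is an illustrative device for observing that the clean-up code runs when the generator is closed early.

```python
cleanup_log = []


def countdown(start):
    """Hypothetical countdown generator with clean-up code."""
    try:
        while start > 0:
            yield start
            start -= 1
    finally:
        # Runs when the generator finishes normally or is closed early.
        cleanup_log.append("closed")


gen = countdown(5)
seen = [next(gen), next(gen)]      # [5, 4]
gen.close()                        # stops the generator at the paused yield

# A closed generator is exhausted: next() with a default returns the default.
after_close = next(gen, "exhausted")
print(seen, cleanup_log, after_close)   # [5, 4] ['closed'] exhausted
```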

close() in action

Are there efficient ways of iterating through a group of generators?

The answer is yes. The yield from statement lets us chain several generators together and implement a single generator that delegates to all of them. This can be very useful if we want to execute several tasks on different parts of a dataset using a single universal generator function. In the following example, we have generator1 and generator2 integrated into generator3. The generator1 function removes the ‘.’ character from a sentence, and the generator2 function converts all letters to lowercase.
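A sketch of this arrangement, assuming that each generator takes a list of sentences and that generator3 applies generator1 to one part of the data and generator2 to another. The function names come from the text; the parameters and sample sentences are illustrative.

```python
def generator1(sentences):
    """Yield each sentence with the '.' character removed."""
    for sentence in sentences:
        yield sentence.replace(".", "")


def generator2(sentences):
    """Yield each sentence converted to lowercase."""
    for sentence in sentences:
        yield sentence.lower()


def generator3(first_part, second_part):
    """Delegate to generator1, then to generator2, as one combined stream."""
    yield from generator1(first_part)
    yield from generator2(second_part)


combined = list(generator3(["Hello World.", "It works."], ["LOUD TEXT", "Mixed Case"]))
print(combined)   # ['Hello World', 'It works', 'loud text', 'mixed case']
```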

yield from statements help in combining generators

Now, for the exciting part. We use yield to define a context that reports the execution time of a predefined set of statements.

In this example, we time a context labelled ‘counting’, which counts down from a particular number (in this case 10). At the end, we print the total time taken for its execution.
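The text does not say which mechanism defines the context, so the following is a sketch under the assumption that the standard library’s contextlib.contextmanager decorator is used. The name timer and the timings dictionary are hypothetical; the dictionary is there only so that the measured time can be inspected afterwards.

```python
import time
from contextlib import contextmanager

timings = {}   # illustrative store for the measured durations


@contextmanager
def timer(label):
    """Time the statements executed inside the with-block."""
    start = time.perf_counter()
    try:
        yield                      # the body of the with-block runs here
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = elapsed
        print(f"{label}: {elapsed:.6f} s")


with timer("counting"):
    n = 10
    while n > 0:
        n -= 1
```

Code before the yield runs on entering the with-block and code after it runs on leaving, which is what makes a generator function a convenient way to wrap a measurement around arbitrary statements.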

Time monitoring application

This is a very useful feature that could be used to time different modules in our program and find the least efficient portions of our code.

In summary, generators make our lives easier by giving us a simpler, faster and more efficient way of creating lazy iterators.


Ramya Balasubramaniam
Learning better ways of interpreting and using data

I am starting a new chapter in this beautiful journey from control engineering to data science.