Using Python generators to write readable parsers — part I

Text parsing is a recurring task that is often hard to do properly. Common implementations frequently suffer from errors or lack certain features, such as lazy evaluation.

In traditional programming languages, such as Java or C, one has to choose between two approaches:

  • The most straightforward option is to write normally structured code that reads the full input and produces the complete output.
  • If one needs the ability to “pause” the parser, one has to resort to writing a state machine. This forces the code to be structured in a completely different way, and the result is much less readable.

Fortunately, in languages like Python, there is a middle way. One does not have to sacrifice readability to achieve laziness and performance. In this article we will show how to do that.

Case study

Let’s start with a trivial parser, so we can focus on the syntax and architecture, rather than functionality.

Suppose we are reading a book and we want to process individual sentences. For the sake of this example, let’s define a sentence as a string of any characters (except full stops), terminated by a full stop.

It is trivial to write a parser doing just that:

def sentence_parser1(article):
    sentences = []
    while True:
        index = article.find(".")
        if index >= 0:
            sentences.append(article[:index + 1])
            article = article[index + 1:]
        else:
            return sentences

for sentence in sentence_parser1("It works. Python is the best. But"):
    print(sentence)

# Prints:
# It works.
# Python is the best.

Our parser tries to find the first full stop in the article it is given as input. If it finds one, it uses its index to cut out the first sentence and append it to a list of outputs. If there are no more full stops in the article, the find method returns a negative index, which is a signal for us to stop and return the results.

The test code proves that our parser works. It prints two sentences, each on its own line. Note that there was an unfinished sentence in the input, which we skipped. That’s correct; we are only interested in complete, valid data.

And apart from that, Python is the best, no buts.

The code that we have written works and is quite readable, but it has some drawbacks:

  • The whole article must be loaded in memory before the parsing starts.
  • The whole article must be parsed before the first sentence can be printed.

Python provides means that can help us with both problems, but let’s focus on the latter first.

Introducing generators

In addition to normal functions, Python provides us with so-called generator functions. At first sight, they look the same. But generator functions have the ability to pause their execution and resume later, using the yield keyword.

The Fibonacci sequence is a classic textbook example of a generator (I have added logging to show when the code is executed):

def fibonacci():
    print("Entering fibonacci")
    a = 1
    b = 1
    print("We have the first result")
    yield a
    print("More than one result requested")
    while True:
        print("We have another result")
        yield b
        print("More results requested")
        c = a + b
        a = b
        b = c

print("We are about to call the fibonacci")
fib = fibonacci()
print("First three")
print(next(fib))
print(next(fib))
print(next(fib))
print("Other three")
print(next(fib))
print(next(fib))
print(next(fib))

In the fibonacci generator function, we can see an infinite loop. But our program does not hang when we call the function. Actually, when we call the function, not much happens at all. We are only given a generator object, on which we can request values.

The actual generated values are retrieved using the global function next(), which takes the generator as its argument. Let’s run the example to observe when the various pieces of code are executed:

We are about to call the fibonacci
First three
Entering fibonacci
We have the first result
1
More than one result requested
We have another result
1
More results requested
We have another result
2
Other three
More results requested
We have another result
3
More results requested
We have another result
5
More results requested
We have another result
8

We can see that the generator actually starts running only when the first next() is called. Each time a yield is encountered in the generator function, its value is passed back as the result of next() and the generator is “frozen”. This means that all local variables are kept in memory, along with the position of the yield where it stopped. Whenever next() is called again, the generator resumes from where it left off as if nothing had happened.
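We can even observe the freezing from the outside: the standard inspect module can report a generator’s current state. A small sketch, reusing our fibonacci (the log lines will print too, of course):

import inspect

gen = fibonacci()
print(inspect.getgeneratorstate(gen))   # GEN_CREATED: no code has run yet

next(gen)
print(inspect.getgeneratorstate(gen))   # GEN_SUSPENDED: frozen at a yield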

Things to remember:

  • Calling the generator function does not execute our code.
  • Our code first runs when next() is called, and it runs until a yield is reached.
  • The argument of the yield becomes the return value of next().
  • A repeated next() resumes our code until the next yield.
  • If no more next()’s are called, our generator, including its frozen data, is eventually garbage collected.

Finite and infinite generators

The Fibonacci sequence is infinite and so our fibonacci generator could theoretically run indefinitely and still yield more and more results. We could try to iterate over them using a familiar Python for loop:

for i in fibonacci():   # Warning: May run for a long time
    print(i)

On the other hand, there can be generators that just yield some values and then stop:

def abc():
    yield "a"
    yield "b"
    yield "c"

for x in abc():
    print(x)

There are no surprises here: the code prints the letters from the beginning of the alphabet and ends peacefully.

How can we determine if a generator is finite or infinite? No, looking at its code does not count.

There is no general way to determine it, but what we can do is try to fetch the next value:

gen = abc()
next(gen)  # "a"
next(gen)  # "b"
next(gen)  # "c"
next(gen)  # raises StopIteration

next() does not tell us how many values are left, but we will be notified if we go past the end. In that case, a StopIteration exception is raised.

How come we did not have to deal with an exception in the for loop before?

Python does that for us. At the beginning of each iteration of the for loop, just before our variable is assigned its new value, next() is called. If it encounters StopIteration, the loop simply ends.
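Roughly speaking (this is a sketch of the semantics, not the exact implementation), the for loop over abc() above expands to:

gen = iter(abc())
while True:
    try:
        x = next(gen)
    except StopIteration:
        break   # the for loop ends here, silently
    print(x)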

Simple generators in parsing

Now that we have defined what Python generators can do, how can we leverage that in our parser?

Our code does not change much:

def sentence_parser2(article):
    while True:
        index = article.find(".")
        if index >= 0:
            yield article[:index + 1]
            article = article[index + 1:]
        else:
            return

for sentence in sentence_parser2("It works. Python is the best. But"):
    print(sentence)

# Prints:
# It works.
# Python is the best.

We have only got rid of the list that held the sentences. Where we would have appended to the list, we now yield. And because there is nothing left to return, we use a bare return, without a value, to break the loop.

What would happen if we did return an actual value, you may ask? It would be available on the StopIteration exception, on its value attribute, for us to process.
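A quick sketch with a made-up generator (this works since Python 3.3, where a generator’s return value started travelling on the exception):

def two_values():
    yield 1
    yield 2
    return "the end"   # becomes StopIteration.value

gen = two_values()
next(gen)   # 1
next(gen)   # 2
try:
    next(gen)
except StopIteration as stop:
    print(stop.value)   # prints: the end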

The best part is that the surrounding code has not changed at all. We can still use the good old for loop, as if we were iterating over a list.

The difference is that now there is no list of sentences present in memory at any time. Each sentence is printed as soon as it is parsed.
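We can check the laziness by hand: pull a single sentence with next() and stop. The second sentence has not been produced at that point:

gen = sentence_parser2("It works. Python is the best. But")
print(next(gen))   # prints "It works."; nothing else has been parsed yet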

But still, the whole article has to be loaded upfront.

This problem will be solved in a minute.

Generators that receive data

In the last example, we have successfully mitigated the problem of having to store the parser results in memory before we could process them further. But we still have to load the whole input into memory before we can start parsing it.

This may be a serious problem. Suppose we are not parsing a short article, but a whole book.

Commonly, the data to parse comes in chunks. The chunks can have a fixed or variable size (this does not matter to the parser), but they are definitely not aligned with our outputs.

Think of the chunks as lines in a book. Some lines may contain exactly one sentence and nothing else, although that would be rather rare. More frequently, we see lines that finish a sentence from the previous line, start another one that continues on the next line, and may have more complete sentences in between. Some writers even like to write their sentences so long that they span multiple lines!

Up to this point, our parsers consumed only the data they were given upon creation. From now on, we need to invent a way to continuously keep passing in more and more data.

Fortunately, Python offers a mechanism that does exactly that. Our yields can actually not only return data, but also accept it!

Let’s see how this new discovery affects our parsing generator and then we’ll work out the details.

def sentence_parser3():
    data = yield None
    while True:
        index = data.find(".")
        if index >= 0:
            result = data[:index + 1]
            data = data[index + 1:]
        else:
            result = None

        moredata = yield result

        if moredata:
            data = data + moredata

First off, our third-generation parser does not accept any parameters. We do not need to have any data upfront.

How does it get the first data? It asks for it using yield None. (This is exactly the same as a bare yield, but I like to be explicit here.) This is the parser’s way of saying “I have None results for you, feed me more data.”

The yield is actually an expression, whose result we can store in a variable. In fact, all yields in recent versions of Python are expressions, even the ones that we have seen so far in this article. The previous ones, however, only ever evaluated to None, which did not matter because their value was not used.

Then again, we search for the full stop in the input, and if we find it, we split the data into the resulting sentence and the rest of the data. If no full stop is found, no worries. It may come in the next chunk, so we just say that so far we don’t have a result.

Now comes the cornerstone of our design. This yield is responsible for at least half of the magic that is happening here.

There may or may not be a result. If there is, we yield it. If there is not, we yield None, which we have already explained as the “need more data” signal.

Regardless of the result, our caller may or may not have more data available. If they do, they may send it to us and we append it to our internal buffer of unprocessed data. If they do not, they may just send None. That’s their way of saying “Give me more results.”

Before we go further towards the caller code, let’s demonstrate the communication protocol we have come up with so far with an example:

  1. The caller sends “Python is the best. Period. But ”
  2. The generator finds a full stop and yields the first sentence “Python is the best.”
  3. The caller expects there may be more, therefore sends None. (“Give me more results” signal)
  4. There indeed is another full stop, so the generator yields “ Period.”
  5. The caller tries again and sends None.
  6. There is more data, but no full stop, so the generator yields None. (“Give me more data.”)
  7. The caller knows that if they kept requesting more results, they would only get more Nones from the generator. Therefore, they need to gather more data to send. Eventually, they come up with another chunk “Python with ”.
  8. The generator’s internal buffer now contains “ But Python with ”, which sounds promising, but is not a complete sentence. Therefore, None is yielded.
  9. The caller just got a second None in a row, meaning that they need to dig up even more data somewhere. Finally, they send “generators is even better.” to the generator.
  10. That, combined with the data stored internally, is a complete sentence, so the generator happily yields “ But Python with generators is even better.”
  11. The caller receives the sentence, but for all they know, there may be more. So they send None to request more results.
  12. The generator’s buffer is now empty, so it just responds with None.
  13. The caller is satisfied and releases the generator for garbage collection.
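The next chapter introduces the send() method that makes this exchange possible. As a preview, here is the same thirteen-step conversation written as code (the value returned by each call is shown in the comment):

gen = sentence_parser3()
next(gen)   # start the parser; it runs to its first yield and waits

gen.send("Python is the best. Period. But ")   # -> "Python is the best."
gen.send(None)                                 # -> " Period."
gen.send(None)                                 # -> None ("give me more data")
gen.send("Python with ")                       # -> None (still no full stop)
gen.send("generators is even better.")         # -> " But Python with generators is even better."
gen.send(None)                                 # -> None (buffer is empty)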

Feeding generators with data

So far we have learned how yields can return data, but where do these data come from?

There is an alternative to next(), called send(data). send(data) is a newer addition to the language and is more powerful than next(). More precisely, next(gen) is exactly the same as writing gen.send(None). That is also the reason why all our yields so far evaluated to None.

The API is somewhat inconsistent at this point: next() is a global function that takes the generator as an argument, while send(data) is a method of the generator. Therefore, we write next(gen), but gen.send(123).

The send() method does, in essence, four things:

  1. The data passed as an argument to send() becomes the value of the yield at which the generator is currently paused.
  2. The generator is resumed and runs until the next yield.
  3. The generator is then paused again.
  4. The argument of that yield becomes the return value of the send().

Although this makes complete sense, it may be difficult to grasp at first: the data passed to send() is received by the yield that produced the previous result. So, from the generator’s point of view, the value of a yield expression only arrives with the next next() or send() call.

There is another limitation to send(): when the generator is first created, only send(None) or next() can be called on it. It is not possible to pass any meaningful value. It all makes sense when we think about it. When the generator is created, no code has run yet. It is waiting for the first next(). But it is not paused on a yield; it is paused at the beginning of the function. So if we passed some data in, where would it go?
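A tiny, made-up echo generator (my example, not part of the parser) shows both rules in action: the priming next() and the round trip of a sent value:

def echo():
    received = None
    while True:
        # Whatever the caller passes to send() arrives here,
        # and is yielded right back on the next pass through the loop.
        received = yield received

gen = echo()
next(gen)                  # prime: run to the first yield; nothing can be sent yet
print(gen.send("hello"))   # prints "hello"
print(gen.send("world"))   # prints "world"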

Caller routine

Now we have all the pieces needed to interact with our parser. But first, let’s review our part of the contract:

  1. The caller send()s the next data chunk to the parser.
  2. As long as the parser returns results, the caller keeps requesting them by sending None.
  3. Once the parser responds with None, the caller stops (and may come back later, after gathering more data chunks).

It turns out this logic can elegantly be implemented using… tadaaa… another generator!

def send_and_receive(generator, data):
    result = generator.send(data)
    while result is not None:
        yield result
        result = generator.send(None)

Putting it all together

Let’s again review the final code we have come up with:

def sentence_parser3():
    data = yield None
    while True:
        index = data.find(".")
        if index >= 0:
            result = data[:index + 1]
            data = data[index + 1:]
        else:
            result = None

        moredata = yield result

        if moredata:
            data = data + moredata

def send_and_receive(generator, data):
    result = generator.send(data)
    while result is not None:
        yield result
        result = generator.send(None)

And the scenario from the previous chapter:

gen = sentence_parser3()
next(gen)  # Needed to start the parser up

for res in send_and_receive(gen, "Python is the best. Period. But"):
    print("After 1st chunk: " + res)
for res in send_and_receive(gen, " Python with"):
    print("After 2nd chunk: " + res)
for res in send_and_receive(gen, " generators is even better."):
    print("After 3rd chunk: " + res)

# Prints:
# After 1st chunk: Python is the best.
# After 1st chunk:  Period.
# After 3rd chunk:  But Python with generators is even better.

We can see that it works as expected. The first chunk of data contains two complete sentences, the second one has none and the third one finishes the long final sentence.
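As a final sketch of where this is heading, the same two functions can consume a real stream, for example a file read line by line (book.txt is a hypothetical input file here):

gen = sentence_parser3()
next(gen)  # start the parser up

with open("book.txt") as book:   # hypothetical input file
    for line in book:            # each line is one chunk
        for sentence in send_and_receive(gen, line):
            print(sentence)

At no point does the whole book, or a complete list of sentences, have to sit in memory.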

Conclusion

In this article we have explored how to move from readable but eagerly evaluated parsers, which needed both all input and all output data to be in memory, to lazily evaluated ones. We have managed to do that using Python generators, without sacrificing readability and without making fundamental changes to the code.

The sentence parser was used to demonstrate the patterns and to illustrate that the code really does not change much. We don’t expect it to be used in practice very often.

However, the function presented in the last chapter, send_and_receive(), can be considered production-ready and will see heavy use next time.

In this part of the series we have concentrated on the concepts of the proposed architecture. In the next part we will present how to build complex parsers out of basic parts, propose a set of building blocks, and finally we will write a parser that actually does something useful.