A closer look at \Doctrine\ORM\Query::toIterable when processing large results

Victor Todoran
5 min read · Apr 8, 2023



Let’s say you want to iterate a large database result set in a PHP application using Doctrine.

There are many ways to solve this problem, some more complex than others. Queues and consumers immediately come to mind.

That being said, in the wild, sometimes you can’t or don’t have the time to comfortably implement the best solution.

The quickest solution (from the development perspective) is to iterate over the whole result, in the same process.

If the result set is big, say a couple million, and the table has a large number of columns, say over fifty, loading everything into memory at once is not really an option.

If you turn to your favorite search engine and type the words doctrine batch iterate, you might stumble across a page in the Doctrine docs entitled Batch Processing.

If you scroll towards the end of the page, at the time that I’m writing this, you would see this:

In hindsight, looking at this page now, there are some important red flags, which I’ll share towards the end of this article. But when I experienced this situation in the wild, this section was all I focused on or cared about, so this section is what I’m showing you for now.

Let us investigate a little further what results we get when we call \Doctrine\ORM\Query::toIterable on a one million row result set.

We are going to use the Symfony Profiler to profile and compare our code, in a tiny Symfony app that has:

  • A Contract Entity with ten string properties.
  • A Fixture that populates 1 million Contracts in the database.
  • A simple Controller with the logic presented in the docs

Let’s take a look at the Controller:


class TestController extends AbstractController
{
    public function __construct(
        private readonly ContractRepository $contractRepository,
        private readonly EntityManagerInterface $entityManager
    ) {
    }

    #[Route('/', name: 'test')]
    public function execute(): Response
    {
        set_time_limit(40);

        $contractQuery = $this->contractRepository->createQueryBuilder('c')->getQuery();
        foreach ($contractQuery->toIterable() as $contract) {
            $this->entityManager->clear();
        }

        return $this->render('base.html.twig');
    }
}

After executing this, the Profiler is showing us this:

The whole thing took about 31 seconds, with a peak memory usage of 1073 MB, and a single query was made to the database, which took about 14.5 seconds.

The fact that the whole one million Contracts are brought within one single query is important, and it was not entirely obvious to me from the beginning.

One day I might update this article with a closer look at how those 1073 MB of memory are actually used, but for now let’s say it has to do with the whole data set being loaded into memory at once, and with how Doctrine keeps some things in memory that are only released after \Doctrine\ORM\Query::toIterable() finishes executing.

The only batching done here is the actual hydration of the Contract entities, which happens in batches of 1 since we call EntityManagerInterface::clear() after each iteration.

I want to actually batch the trips to the database and hopefully use less memory while doing this.

I could not find anything readily available in the docs for this, so I decided to implement my own batching.

There are multiple ways of extracting a subset of rows from a large data set. One is offset/limit, which becomes very wasteful towards the end of a large data set. This is decently explained here.
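To get a feel for why offset/limit is wasteful, here is a small, self-contained simulation (my own illustration, not from the article or the linked explanation): with OFFSET the engine still has to walk past all the skipped rows for every page, so the total number of rows touched grows roughly quadratically, while a keyset (primary key) seek touches each row only once.

```php
<?php
// Hypothetical cost model: rows the engine must walk for offset/limit
// pagination versus keyset pagination over the same data set.
function offsetScanCost(int $total, int $batch): int
{
    $scanned = 0;
    for ($offset = 0; $offset < $total; $offset += $batch) {
        // OFFSET makes the engine walk past $offset rows before
        // returning the next $batch rows.
        $scanned += $offset + min($batch, $total - $offset);
    }

    return $scanned;
}

function keysetScanCost(int $total, int $batch): int
{
    // An index seek on the primary key jumps straight to each batch,
    // so every row is touched exactly once.
    return $total;
}

printf("offset: %d rows scanned\n", offsetScanCost(1000, 100)); // 5500
printf("keyset: %d rows scanned\n", keysetScanCost(1000, 100)); // 1000
```

Even at this toy scale, offset pagination touches more than five times as many rows; at a million rows the gap is far worse.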

Another option is leveraging a primary key and limit. Like with many other cool things, there are some limitations, but they don’t impact my use case.

private function iterateContracts(int $batchSize): \Generator
{
    $leftBoundary = 0;
    $queryBuilder = $this->contractRepository->createQueryBuilder('c');

    do {
        $qb = clone $queryBuilder;
        $qb->andWhere('c.id > :leftBoundary')
            ->setParameter('leftBoundary', $leftBoundary)
            ->orderBy('c.id', 'ASC')
            ->setMaxResults($batchSize)
        ;

        $lastReturnedContract = null;
        foreach ($qb->getQuery()->toIterable() as $lastReturnedContract) {
            yield $lastReturnedContract;
        }

        if ($lastReturnedContract) {
            $leftBoundary = $lastReturnedContract->getId();
        }
    } while (null !== $lastReturnedContract);
}

What we do in this method is a lot less complicated than it looks.

We select chunks of Contracts, starting from left to right, by leveraging the ordered property of the primary key.

Initially the left boundary is 0, and subsequently the left boundary becomes the id of the last returned Contract in a batch.

With each batch we load the Contracts with an id greater than the left boundary with a limit equal to the batch size.

Assuming that no Contracts were ever deleted the batches and subsequent left boundaries (in red) might look something like this:

Visual representation of possible batches and the left boundaries across the process.
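The same batching logic can be modeled without a database at all. Here is a pure-PHP sketch of it (hypothetical data, my own illustration) where each “query” simply filters an in-memory array, exactly like the WHERE/ORDER BY/LIMIT clauses above:

```php
<?php
// Pure-PHP model of keyset batching: each "query" returns ids greater
// than the left boundary, in ascending order, limited to the batch size.
function iterateIds(array $ids, int $batchSize): \Generator
{
    sort($ids);
    $leftBoundary = 0;
    do {
        // Emulates: WHERE id > :leftBoundary ORDER BY id ASC LIMIT :batchSize
        $batch = array_slice(
            array_values(array_filter($ids, fn ($id) => $id > $leftBoundary)),
            0,
            $batchSize
        );

        yield from $batch;

        if ($batch !== []) {
            // The last id of the batch becomes the next left boundary.
            $leftBoundary = end($batch);
        }
    } while ($batch !== []);
}

// Gaps in the id sequence (e.g. deleted rows) are handled naturally.
print_r(iterator_to_array(iterateIds([1, 2, 4, 7, 8, 9], 2), false));
```

Note that gaps in the id sequence do not break anything: the boundary is the last id actually seen, not a computed offset.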

I called this method using a 10k batch size:

public function execute(): Response
{
    set_time_limit(40);

    foreach ($this->iterateContracts(10000) as $contract) {
        $this->entityManager->clear();
    }

    return $this->render('base.html.twig');
}

Now, let’s see how it does in terms of performance:

We can see it went a little slower, 32.5 seconds, and that 101 actual queries were made to the database (one cold beer for whoever knows why the 101st query was necessary, given that we have 1 million records and used a 10k batch size).

We can see that we have an undeniable (>10x) improvement in peak memory usage.

Clearly this method has some limitations. An obvious one is the ordering: the results must be ordered by the primary key, so using a different order won’t work.

Another limitation is reusability. It’s a healthy sized function, so copy-pasting it around won’t cut it; extracting it into a reusable method is a possibility, but I’m having trouble seeing how you could make it type safe.

The name of the primary key could be a parameter, but what about getId()? How do you constrain that whatever is coming out of toIterable() has a getId() method, or otherwise how do you know what method to call instead?
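One possible answer (a sketch of my own, not something from the Doctrine docs) is to stop hard-coding getId() and instead accept two callables: one that fetches a batch given the current boundary, and one that extracts the id from an entity. With @template annotations, static analysers like PHPStan or Psalm can then check the pairing for you:

```php
<?php
/**
 * Generic keyset iteration, decoupled from any entity or ORM.
 *
 * @template T
 * @param callable(int): iterable<T> $fetchBatch  returns rows with id > boundary,
 *                                                ordered by id, limited to batch size
 * @param callable(T): int           $extractId   maps a row to its id
 * @return \Generator<T>
 */
function iterateByKeyset(callable $fetchBatch, callable $extractId): \Generator
{
    $leftBoundary = 0;
    do {
        $last = null;
        foreach ($fetchBatch($leftBoundary) as $last) {
            yield $last;
        }

        if ($last !== null) {
            $leftBoundary = $extractId($last);
        }
    } while ($last !== null);
}

// Usage with plain arrays standing in for query results:
$rows = array_map(fn ($i) => ['id' => $i], [1, 3, 5, 6]);
$fetch = fn (int $boundary) => array_slice(
    array_values(array_filter($rows, fn ($r) => $r['id'] > $boundary)),
    0,
    2
);
print_r(iterator_to_array(iterateByKeyset($fetch, fn ($r) => $r['id']), false));
```

In the Doctrine case, $fetchBatch would wrap the clone-and-constrain QueryBuilder logic from iterateContracts, and $extractId would simply be fn (Contract $c) => $c->getId(). It is not bulletproof type safety, but it pushes the getId() assumption out to the caller, where the concrete type is known.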

Before I end this article, I promised some gems from the docs:

That is all. Thanks for reading :)

Disclaimer: Consider this to be a living document, which means it’s subject to undocumented changes and it might even die in the future.


Victor Todoran

I write, read and think about software and its users for a living