Iterating over billions of objects in Doctrine

Having millions or billions of objects to iterate over in PHP using Doctrine? How to do this with a limited amount of memory. Without any hacks or internal ORM knowledge.

.com software
3 min readDec 29, 2021
Photo by Eran Menashri on Unsplash

Did you ever traverse millions or billions of objects using Doctrine? It’s not as if you could simply ->findAll() because it’s going to try to load the entire set to the memory, which will fail spectacularly as soon as the memory limit is hit.

How would you approach tackling such an issue? Increase memory limit? Please don’t do that. Embrace pagination & generators, the features of now getting old PHP 7.

A signature of an imaginary user repository could look like as follows:

<?php

interface UserRepository
{
/**
* @return iterable<array-key, User>
*/
public function findAll(): iterable;
}

With a solid implementation in PHP 7.4:

<?php

declare(strict_types=1);

final class DoctrineUserRepository implements UserRepository
{
private EntityManagerInterface $doctrine;

public function __construct(EntityManagerInterface $doctrine)
{
$this->doctrine = $doctrine;
}

/** {@inheritDoc} */
public function findAll(): iterable
{
$r = $this->doctrine->getRepository(User::class);

// how many objects at the time to load to memory
$batchCount = 100;

// current offset to navigate over the entire set
$offset = 0;

do {
/** @var User[] $users */
$users = $r->findBy([], null, $batchCount, $offset);

foreach ($users as $user) {
yield $user;
}

// increase the offset
$offset += $batchCount;

// important thing to remember
// release objects from the ORM's internal memory
$this->doctrine->clear(User::class);
} while (\count($users) > 0);
}
}

The usage, eg. in a console command, would be dead simple:

<?php

public function execute(InputInterface $input, OutputInterface $output): int
{
foreach ($this->users->findAll() as $user) {
// iterate over every user
// act on behalf of the user
$user->foo();
}

return 0;
}

Problem solved! We can apply the same pattern to the objects inside Doctrine Collections using the “lazy” mapping and the “slice” method to only fetch a subset of records from the underlying storage. Let’s optimize the code further to be re-usable across the code-base. We could convert the code to a trait:

<?php

trait DoctrineEntityIteratingTrait
{
/**
* @template T of object
*
* @param class-string<T> $className Entity class to fetch
* @param positive-int $batchCount Batch cound to load at once
*
* @return iterable<array-key, T>
*/
private function iterateOverEntities(
EntityManagerInterface $doctrine,
string $className,
int $batchCount = 100
): iterable {
$r = $doctrine->getRepository($className);

// current offset to navigate over the entire set
$offset = 0;

do {
/** @var T[] $entities */
$entities = $r->findBy([], null, $batchCount, $offset);

foreach ($entities as $entity) {
yield $entity;
}

// increase the offset
$offset += $batchCount;

// important thing to remember
// release objects from the ORM's internal memory
$doctrine->clear($className);

} while (\count($entities) > 0);
}
}

The improved repository would look as follows:

<?php

final class DoctrineUserRepository implements UserRepository
{
use DoctrineEntityIteratingTrait;

private EntityManagerInterface $doctrine;

public function __construct(EntityManagerInterface $doctrine)
{
$this->doctrine = $doctrine;
}

/** {@inheritDoc} */
public function findAll(): iterable
{
return $this->iterateOverEntities($this->doctrine, User::class);
}
}

With this approach, one can modify the implementation details of the repository as you wish. As long as the contract of the implemented interface does not change, your application won’t break as well.

--

--

.com software

Father • PHP developer • entrepreneur • working for a €1bn unicorn startup as a backend engineer >>> https://bit.ly/dotcom-software