Doctrine Criterias and Lazy Collections magic

Scary words for an easy and powerful feature that most developers never use.

Nicolas Valverde
The SensioLabs Tech Blog
11 min readJul 2, 2024

--

A man with an orange sweat shirt riding a blue elephant with books on the floor and a barrel with the name “ORM” written on it. A few speech bubbles with emojis are flying on top of the man and the elephant

Doctrine has many powerful features, but let’s see just one that can reveal the full power of an ORM.

TL;DR: Because our Collections are loaded lazily, Doctrine is able to use our Criterias to filter a Collection on the database side, by effectively tweaking the query to filter the resulting Collection.

Working code examples used below are available on my GitHub here.

Whether this TL;DR already rings a bell for you or not, there’s a high chance that you’ve never used the said feature in the way shown below, but if you already know what are Lazy Collections and Doctrine Criterias, you can jump directly to the second part.

NB: This article is made with Doctrine ORM V3. Everything showcased here works with V2 as well, but some signatures differ, refer to Doctrine documentation in such case.

1. Required definitions

— Doctrine Lazy Collections

Lazy Collections means you can retrieve relationships of your entities lazily. That ultimately means Doctrine will issue the actual database query to retrieve a Collection content only when you first access the Collection in your code.

This is something you have already done in the past, because this is the default behavior of Doctrine, which makes it super powerful out of the box.

Basically with Lazy Collections, the relationship will not be loaded using a JOIN in the first SELECT, but instead delayed to another SELECT happening later only when you really need it, if that ever happens. So at the end, both requests are expected to be lightweight — and most importantly fast — unless you effectively have no other choice than loading entire big Collections in one shot, which should be quite rare.

This is the behavior you can configure with the fetch: keyword when you define the entity relationship. Acceptable values are lazy , extra-lazy and eager . Be aware that the below will not work for a relation set to fetch: eager because this tells Doctrine to load the entire collection right away. But I said the default is fetch: lazy .

— Doctrine Criterias

Criterias are special Doctrine objects that holds a set of Doctrine Expressions. Doctrine Expressions are objects that are usually created through the ExpressionBuilder, which is something you have probably already used too, through the QueryBuilder itself.

Criterias can be used in various places, which makes them a sweet choice to have a reusable piece of query.

The first place where you can use them is the QueryBuilder, with the method addCriteria() . This is one good point for choosing the QueryBuilder over DQL.

Actually, any Doctrine method that accepts an array $criterias usually accepts them as well, like the well known findBy() method.

This is already pretty cool, but there’s another hidden and super powerful one, which is our main topic for this article.

2. See Collection::matching()

Alright, suspense over, we can use our shiny Criterias directly on a Collection. You might have the feeling that this is loading the Collection and filtering afterwards on the PHP side, which is something we definitely want to avoid, but it is actually the opposite happening.

Because our Collections are loaded lazily, Doctrine is able to use our Criterias to filter a Collection on the database side, by effectively tweaking the query to filter the resulting Collection.

3. Some code worth more than a thousand words

Let’s consider those 2 simple entities, abbreviated with relevant code only, for the sake of clarity. The full working example is available on my GitHub here.

<?php

#[ORM\Entity()]
class Book
{
#[ORM\ManyToMany(targetEntity: Author::class, inversedBy: 'books')]
private Collection $authors;

#[ORM\Column(nullable: true)]
private ?float $rating = null;

public function __construct()
{
$this->authors = new ArrayCollection()
}

/**
* @return Collection<Author>
*/
public function getAuthors(): Collection
{
return $this->authors;
}

// Other accessors omitted
}
#[ORM\Entity()]
class Author
{
#[ORM\ManyToMany(targetEntity: Book::class, mappedBy: 'authors')]
private Collection $books;

#[ORM\Column(unique: true)]
private ?string $name = null;

public function __construct()
{
$this->books = new ArrayCollection()
}

/**
* @return Collection<Book>
*/
public function getBooks(): Collection
{
return $this->books;
}

public function getName(): ?string
{
return $this->name;
}

public function setName(string $name): self
{
$this->name = $name;

return $this;
}

// Other accessors omitted
}

Now let’s say we want to be able to retrieve the best-sellers book for a given author.

— A straight forward way

<?php

// Real repository signature and constructor omitted for clarity
class BookRepository
{
public function findBestSellers(Author $author, int $limit = 3): array
{
// We could use DQL here instead, that would be the same.
// But that would prevent us from using the Criteria afterward.
return $this->createQueryBuilder('b')
->select()
->join('b.author', 'a', 'WITH', 'a.name = :author')
->where('b.rating <> null')
->orderBy('b.rating', 'DESC')
->setMaxResults($limit)
->setParameter('author', $author->getName())
->getQuery()
->getResult()
;
}
}

This works well, but has quite a few drawbacks.

  • The method signature requires an Author, that means the client code using this method must already be using an Author object. While we could accept just a string for the Author name in this simple example, this helps me to highlight that we have already loaded an Author from the database at this point. For example, imagine a page dedicated to each Author, this page must load the Author from database to show all its information, but only needs to show the best-sellers books.
  • Future developers need to be aware this method exists and lives in the BookRepository. Otherwise, because our Author has a getBooks() method, they could be tempted to just get the Collection from that, and manipulate the Collection like an array on PHP side, which would be bad. An integrator working with the Author object in the view could be tempted to filter it there as well, with the template engine of your choice, like Twig.
  • Our entities are anemic. The business logic lives in the repository and only there. The condition used to determine what makes a book a best-seller cannot be reused and is never reflected in our domain model.
  • The best-sellers list has to be retrieved separately and passed down to the view, in addition to the already loaded Author. So each new feature added later on this Author page will require a new method in the repository, with a new variable to pass down to the view. Or, in real world, you could also end up with a complex query to gather all the data in different sub-selects and aliases to pass down a big array to the view, not really better.
  • This would have the same drawbacks if we were to write that in the AuthorRepository instead of BookRepository. That would just change the way the query is written.
  • We are using only half of our ORM capabilities. It creates a situation where the ORM can quickly become a drawback, issuing too many sub-optimal queries at the end, or one big — hopefully optimized — one. Eventually hitting our performance.

— The full ORM way

Now, what do you think about a method on our Author, let’s say Author::getBestSellers() ? That sounds sweet and straight forward, doesn’t it? It fixes all the above points, even removing the need to pass down additional variables to the view, and advertises our business domain right into the model.

The real question is, how to implement such method in our entity. We don’t have access to Dependency Injection in there. And just in case, let me say it again, we don’t want to load the entire Collection to filter it afterwards with PHP like a simple array. That could hurt performance to retrieve only 3 books.

This is where our Criteria will finally shine. Let’s see that.

We need to first create this Criteria, let’s do it in our BookRepository.

<?php

class BookRepository
{
public const DEFAULT_BEST_SELLERS_LIMIT = 3;

// This lives in the BookRepository because this Criteria applies to a Collection of Book.
// Since everything is static here, it could live anywhere,
// but it is ultimately a piece of query, so that's a reasonable choice
public static function createBestSellersCriteria(int $limit = self::DEFAULT_BEST_SELLERS_LIMIT): Criteria
{
return Criteria::create()
->andWhere(Criteria::expr()->neq('rating', null))
->orderBy(['rating' => Order::Descending])
->setMaxResults($limit)
// We could have more business conditions, here would be a good place to enforce them,
// because it is highly likely that those conditions will only live here forever since it is now reusable.
// E.g we could have a max allowed limit for what we consider best-sellers (do we still talk about best sellers with a limit of 100?)
// Or check that the book was published at least some days/weeks ago so that we can trust the rating score
// etc...
;
}

// We can refactor our repository method to use our Criteria,
// so that it can still be used for the case when we don't have an Author loaded,
// note we now accept a string **only**,
// hinting client code that this method is useful when working with a scalar instead of already loaded Author.
// E.g This argument could come from a user input when no entities are loaded yet.
public function findBestSellers(string $author, int $limit = self::DEFAULT_BEST_SELLERS_LIMIT): array
{
return $this->createQueryBuilder('b')
->select()
->join('b.author', 'a', 'WITH', 'a.name = :author')
->addCriteria(self::createBestSellersCriteria($limit))
->setParameter('author', $author)
->getQuery()
->getResult()
;
}
}

Next we can add our getBestSellers() method right into the Author entity, re-using the Criteria from the above BookRepository .

#[ORM\Entity()]
class Author
{
// Previous code omitted

/**
* @return Collection<Book>
*/
public function getBestSellers(int $limit = BookRepository::DEFAULT_BEST_SELLERS_LIMIT): Collection
{
return $this->books
->matching(BookRepository::createBestSellersCriteria($limit))
;
}
}

Now with this, we can just call our Author::getBestSellers() , to retrieve a Lazy Collection filtered on the database side. So if we already have an Author loaded from database, we don’t need anymore to inject an additional repository in our client code to retrieve the best-sellers books efficiently.

We can also reuse our Criteria in another query or another Collection in the future, or use it directly from the client code, if that suits a use case.

One can argue that this implementation comes with additional coupling. But the thing is, Doctrine entities are already coupled to their repositories, and coupled to their relations, so at the end, it’s not really a new issue.

And there’s a bonus in the view. I said we don’t need to pass an additional variable anymore, let’s see the last piece of magic made available by our simple implementation.

{# author.html.twig #}

{% for book in author.bestSellers %}
...
{% endfor %}

Thanks to PropertyAccess, this will effectively use Author::getBestSellers(), and so ultimately load the Lazy Collection with another SELECT filtered by our additional conditions. I am pretty sure you can’t do anything simpler than that.

Depending on the relation type, this second SELECT statement will either be a straight one, or include a simple JOIN ON PRIMARY KEY to the pivot table. This is super efficient for most cases. We can’t really see a notable performance improvement with our simple example because our entities are super tiny, and we don’t have a big dataset to stress out the database.

But one sure thing is, it is not worst than querying twice the same table in our use case, so it definitely worth using it where that makes sense, according to your own use cases and model.

Because this just relies on Doctrine ORM behavior on Collections,
this is not the same as explicitly issuing a DB query from the view, which would be very bad.

Here the view is not aware that the Collection is filtered on the database side, it is only aware of the Author object and its public API, nicely reflecting our business domain and intention on the road.

In fact, if we later turn our relationship to fetch: eager for some reason, this would not filter from the database side anymore, but the view knows nothing about that, and everything would still perfectly work.

And last but not least, remember this retrieves another Collection, so if you instead turn this relation to fetch: extra-lazy , you can use all the additional methods allowed by Doctrine extra-laziness without triggering the actual database content loading. Refer to Doctrine ORM documentation for extra-lazy association behavior as this is out of scope for this article.

4. Queries under the hood

— Querying the repository with a string $author (e.g. a search form, no entity loaded)

Doing only one query can be the best solution in a lot of situations, but this can’t be done if we have already loaded our entity. Also when this query and/or the model goes bigger, chances are that you will have a hard time optimizing it, it could require adding several indexes.

And we’re not fully using our ORM here (which sometimes is a valid choice, please don’t get me wrong).

Note this uses a INNER JOIN, while I did not ask for that.

— Querying the repository while an Author is already loaded

On this one we see the sub-optimal querying taking place, effectively querying twice the author table. As I said, timing metrics can’t really be interpreted reliably with this simple example. Plus I am using SQLite here. So don’t try to interpret those numbers without using a real use case.

Note this also use a INNER JOIN, while I still did not ask for that.

— Querying through the Lazy Collection with our Criteria

Now we’re talking! See how we are now issuing only a JOIN to the pivot table on PRIMARY KEY, completely dropping the additional JOIN back to the author table. With a big model and/or dataset, maybe missing some indexes, this should be the most simple and efficient queries taking place here.

Note this now finally uses a LEFT join, which is definitely what I expect. Because the JOIN can now be done only on PRIMARY KEY to the pivot table, Doctrine can safely turn the INNER JOIN to a LEFT one for better performance. In all previous cases, it would have need me to explicitly ask for a LEFT JOIN for Doctrine to stop using the INNER one, because it can’t safely assume that this fit my model needs, even if it does. ORM is smart.

5. Conclusion

Doctrine is powerful, and the ORM behavior is not always intuitive if you don’t keep in mind how things work under the hood. But when you do, everything makes sense at the end, and you would probably have a hard time achieving better performance if you choose to not use it, instead of using its full power, additionally leaving opened the possibility for other developers to miss your super optimized manual implementation, ruining all your efforts at the end.

Hope this helps to solve some performance issues, or at the very least to get a better understanding of Doctrine laziness. Cheers!

--

--