Multi-table filters in SQLAlchemy

Bartosz Leper
ShopRunner
Published in
6 min readNov 3, 2017

Note: This post was originally published on November 3, 2017 on Spring Engineering. It’s now re-published here after ShopRunner acquired Spring, Inc.

For my team at Spring, one of the benefits of moving from Go to Python was being able to use a mature, field-tested ORM library such as SQLAlchemy. When the system you’re putting together grows over fifty tables, you learn to appreciate this powerful tool.

The thing is, even with the most powerful tools, you’re never allowed to forget about the basics. You always need to keep in mind how your ORM engine of choice works under the hood and how the database interprets your intentions.

The mysterious case of Jane Doe

I opened my inbox one day to learn about an innocent-looking bug in our internal order management system. It affected the order search tool used by our customer service team. In some rare cases, the system just didn’t return all the data. The example query was very simple: “give me all orders of customer Jane Doe”.

The customer service team knew that Jane Doe placed at least two orders in our system. However, on that particular day, only one of them showed up.

So I fired the request to the API to see what’s under the hood, and here’s what I saw:

{
"count": 2,
"orders": [
{
"id": "C101",
"line_items": [
"Pink Dress",
"Blue Blouse",
"Yellow Hat",
"Red Pants"
]
}
]
}

Well, the API actually found both orders. But only one was returned. What the heck?!

The number “2” referred to all orders on all result pages, suggesting that somehow, some part of the system “knew” about the missing records. I called the API once again, this time asking it explicitly to deliver the second page of the results. Here’s what I got:

{
"count": 2,
"orders": [
{
"id": "C101",
"line_items": [
"Pink Dress",
"Blue Blouse",
"Yellow Hat",
"Red Pants"
]
},
{
"id": "C102",
"line_items": [
"Black Shoes"
]
}
]
}

We have order C101 present both on the first and the second page. The second page, however, also includes the missing order C102.

At this point, it’s the high time to introduce two important details. First, let’s say hello to the domain model. It’s grossly simplified (as well as the example API responses), but it should be enough for us to demonstrate the concept:

We have an order that is composed of multiple line items. When some part of the order gets shipped, we attach a fulfillment to it. We use the term “fulfillment” instead of “shipment”, because we believe that using difficult words makes us sound really smart.

The second important part is the code that is used to construct the query:

This code selects records from the Order entity, joining them with LineItems and Fulfillments. It then applies filters for various fields, where applicable. To some of you, it may be already obvious what’s wrong, but the bug unveils itself in its full glory only after turning on query logging in SQLAlchemy. You can do it using the following statement:

logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

When we ask for the first page of Jane’s orders, we get a single object. But now that we turned on SQL statement logging, we know that after performing parameter substitutions, the query that gets executed looks like this:

SELECT orders.id AS orders_id, orders.customer_name AS orders_customer_name 
FROM orders JOIN line_items ON orders.id = line_items.order_id LEFT OUTER JOIN fulfillments ON orders.id = fulfillments.order_id
WHERE orders.customer_name = 'Jane Doe'
LIMIT 5 OFFSET 0

Let’s execute it manually, and see the results. Here’s the first page:

+-------------+------------------------+
| orders_id | orders_customer_name |
|-------------+------------------------|
| C101 | Jane Doe |
| C101 | Jane Doe |
| C101 | Jane Doe |
| C101 | Jane Doe |
| C101 | Jane Doe |
+-------------+------------------------+

And here’s the second one (OFFSET 5):

+-------------+------------------------+
| orders_id | orders_customer_name |
|-------------+------------------------|
| C101 | Jane Doe |
| C101 | Jane Doe |
| C101 | Jane Doe |
| C102 | Jane Doe |
+-------------+------------------------+

Now we can easily explain one mystery: how the SQLAlchemy ORM engine bit us. It doesn’t allow a query to return more than one object with the same identity. Under the hood, it uses an identity map that is tied to the session. If the database returns more than one row with the same primary key, they simply get deduplicated. Normally, this feature is very useful; however, it can easily cover a lurking bug or a performance problem.

In our case, it was both of them.

But there’s one more element of this puzzle that needs to be demystified: how we ended up with the duplicated rows in the first place. The answer is “joins”, of course. To make it more obvious, let’s modify our query to return more data and get rid of paging:

SELECT orders.id, customer_name AS customer, product_name, tracking_number
FROM orders
JOIN line_items ON orders.id = line_items.order_id
LEFT OUTER JOIN fulfillments ON orders.id = fulfillments.order_id
WHERE orders.customer_name = 'Jane Doe';

+------+------------+----------------+-------------------+
| id | customer | product_name | tracking_number |
|------+------------+----------------+-------------------|
| C101 | Jane Doe | Red Pants | 12345 |
| C101 | Jane Doe | Yellow Hat | 12345 |
| C101 | Jane Doe | Blue Blouse | 12345 |
| C101 | Jane Doe | Pink Dress | 12345 |
| C101 | Jane Doe | Red Pants | 54321 |
| C101 | Jane Doe | Yellow Hat | 54321 |
| C101 | Jane Doe | Blue Blouse | 54321 |
| C101 | Jane Doe | Pink Dress | 54321 |
| C102 | Jane Doe | Black Shoes | <null> |
+------+------------+----------------+-------------------+

You see, regardless of what data you actually return in the SELECT statement, the database performs a join first, and it can result in a combinatorial explosion of variants. In our case, it’s enough to have four products shipped in two packages to overflow a result page with a single order.

The engineer’s guide to solving relationship problems

Despite using difficult words from the e-commerce dictionary, we no longer look smart, so let’s get back to the code and try to fix this mess.

One may think it’s enough to be more picky when applying joins: only perform a join with table t if you apply a filter to t. Unfortunately, this solves the problem only partially — or, speaking more accurately, buries it deeper. That’s because this solution goes wrong again as soon as you apply a filter that accepts more than one value:

SELECT orders.id, customer_name AS customer, product_name
FROM orders JOIN line_items ON orders.id = line_items.order_id
WHERE product_name in ('Red Pants', 'Blue Blouse');

+------+------------+----------------+
| id | customer | product_name |
|------+------------+----------------|
| C101 | Jane Doe | Blue Blouse |
| C101 | Jane Doe | Red Pants |
+------+------------+----------------+

Oh, dear. We’re gonna need a bigger boat.

What we need right now are subqueries. The idea is to transform our joins into SQL constructs that look more or less like this:

SELECT orders.id AS orders_id, orders.customer_name AS orders_customer_name 
FROM orders
WHERE EXISTS (
SELECT *
FROM line_items
WHERE line_items.order_id = orders.id
AND line_items.product_name IN ('Blue Blouse', 'Red Pants')
)
LIMIT 5 OFFSET 0

This way we can apply a filter on the “many” side of the relationship, without putting these data into the result set in the first place. There are many ways to achieve this in SQLAlchemy. Here’s one option:

But boy, this is ugly. Isn’t there a better way? Yes, there is! SQLAlchemy provides a tool that produces an analogous query, but is much easier to use: the any() method of a collection field. Using it assumes that your relationships are correctly defined in your model classes (but they are anyway, right?). Just take a look:

Now it’s perfectly clear what we want to achieve — even more clear than in raw SQL.

Epilogue (and a cliffhanger, too)

The benefit of the described approach is not only correctness: performance is also at stake. In a more real-life case, we detected that some of our queries resulted in more than one hundred records (!) returned per order. As a result of this modification, not only we fixed a bug, we also reduced the average time required for this query from 345ms to 193ms.

Still, the performance of the cart search feature was less than satisfying. Why? Well, that’s a topic for another story.

This article was written using PostgreSQL 9.3 and SQLAlchemy 1.1.4.

--

--

Bartosz Leper
ShopRunner

Now I am become Engineer, the creator of worlds.