Debugging background tasks inside loops and transactions

I picked up a bug fix the other day that I was able to solve fairly quickly based on a hunch, but even after I’d solved it, I didn’t totally understand why. So I went hunting (because c’mon, the why is the fun part!)

The crux of the problem was this: We were doing an operation on a list of objects, the whole loop wrapped in a database transaction (more detail on defining your own transactions and using it as a context manager here), and then calling a background task for each of them. Users were reporting that the action that was supposed to be performed in the background task wasn’t happening for all of the users in the loop — only some of them. To make this even more fun, I was able to replicate the issue fairly consistently on production, but not at all locally. Oh, concurrency (jk, I actually do 💙 distributed systems).

Here’s a pared down version of the problem code:

from django.db import transaction
with transaction.atomic():  
for
user in users:
...
user.save()
my_background_task.delay(user_id=user.id)

The issue here lies in the transaction. If the task is enqueued and picked up before the transaction closes, it’s possible that the database object on which it’s operating doesn’t actually exist yet. This issue is documented here, and can be resolved as the docs point out — by adding the function to a list of operations to be run only once the transaction closes. Like so:

from django.db import transaction, connection
with transaction.atomic():
for user in users:
...
user.save()
connection.on_commit(lambda: my_background_task.delay(user_id=user.id)

This solved the original problem by guaranteeing that a database record existed and could be modified in the background task, but introduced a new bug, which manifested itself in a nearly identical way — the background task was only being run for exactly one user: the last one in the loop. The good news: this new bug was 100% consistent, while the previous bug had been only almost totally consistent. The plot thickens.

Another benefit of this new bug is that I was able to reproduce it locally. So I stuck some debuggers in the code, one of them right after my connection.on_commit(...) call. A connection has an attribute called run_on_commit, which is just a list of tasks that will be added to the queue of background tasks only after it closes. When I looked at this attribute after each iteration of the loop, I saw one function for each iteration (as expected). The interesting bit was in looking at the func_closure on each of the functions in the list: it was the same for all of them — the object from my most recent iteration, not the one that had been passed to the function. To the docs we go!

Python uses what are called late-binding closures: “Python’s closures are late binding. This means that the values of variables used in closures are looked up at the time the inner function is called.”

Here’s a simpler example to demonstrate the implications of this:

functions = []
for i in range(5):
functions.append(lambda: i)
>>> [func() for func in functions]
[4, 4, 4, 4]

This means that the problem has nothing to do with the transaction anymore, and everything to do with the loop. Because the object assigned to theuser variable was changing with each iteration, by the time I finished my loop, all of the enqueued tasks were called on a single user. My solution was to keep track of the users I’d modified, and remove the enqueuing of the tasks to outside the scope of the context manager — and more importantly, the loop:

from django.db import transaction, connection
...
updated_user_ids = []
with transaction.atomic():
for user in users:
...
user.save()
updated_user_ids.append(user.id)
for user_id in updated_user_ids:
connection.on_commit(lambda: my_background_task.delay(user_id=user_id)

It’s worth noting that calling this within connection.on_commit is not strictly necessary, now that I’m no longer within my transaction. However, since any utility function might be called somewhere else that is within a transaction, and adding it doesn’t hurt, I would advocate for background tasks always being enqueued usingon_commit.

Happy queuing!