The Tale of a PR Review and the Evolution of a Database Query
One of my favorite things about my company’s workflow is Pull Request reviews. Before anyone’s work is in, it is reviewed by at least one other engineer, and often more. And I don’t just mean “does it work?” reviewed, but actually, “is it good?” reviewed. This is a tale of a basic PR review, and some query optimization lessons while we’re at it.
The short version of what I was trying to accomplish: the backend for a new front-end feature that required annotating each instance of an object (a
PublishedCourse) in a QuerySet with either “Registered” or “Complete” for a given user before passing it into the template.
- A filter applied to each published course as the template looped through the queryset. The feedback: filter: bad. It calls the database twice for each
PublishedCourse: bad. (I’m paraphrasing, obviously. My team is kinder than that).
- I moved the logic into the view and sent each published course into the template via the context already annotated. But my logic didn’t change much, so it wasn’t actually any better — I hit the database the same number of times. The feedback: Do a single database query to pull the relevant data (actually 2 database queries, since there are two relevant sets of objects that need to be collected for the user, but that’s better than two queries per
PublishedCourse), and organize it into dictionaries to be queried.
- Pull one set of a data into a dict and the other into a list, per the above feedback, and query it from there. The feedback: Since I was querying the dataset in a list for existence of an object, use a
setinstead. I also had some weird variable naming going on — my team was rightfully confused and called me out on it.
- At this point I was basically there. The final piece of feedback I received was this: “When you do
published_course.course.id, the ORM lazily does a SELECT on the Course. When you do
published_course.course_id, it doesn't need to do this lookup, because it has already cached the course_id for each
PublishedCourseobject. Main point: ORM eagerly downloads an object’s attributes (including related object’s IDs), but lazily downloads object’s related objects.”
I said this at the beginning, but let me reiterate — these code reviews are thorough. I got feedback at the level of not only general optimizations to reduce my number of database calls, but replacing
published_course.course_id because of the increase in efficiency it would provide. Now, that change in one place in the code might not make a noticeable difference today, but as I take these tiny lessons and am able to expand on them and apply them over the course of my career? What a difference. I don’t love making value judgements, but I feel pretty good about saying this: you should do PR reviews too.