
How forgotten legacy code made invoice generation a 3-hour process — a case study

Tomasz Wójcik · Published in Docplanner Tech · 7 min read · Aug 12, 2020


Recently my teammate (thank you, by the way!) spotted performance problems during the creation of an invoice for one of our biggest customers. He found out that the generation process was slow for a specific reason, pinpointed in our error monitoring tool — NewRelic. The problematic query was simple-looking, yet it was executed many times during the generation process and became a serious bottleneck. It looked roughly as follows:
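A minimal reconstruction of what such a query could look like, based on the description below — the table and column names (agreement, product_a, product_b, customer_id) are assumptions, and the real query most likely carried additional conditions that this sketch omits:

```sql
-- Reconstruction for illustration only; names are assumed, not taken from the
-- actual schema.
SELECT 1
FROM agreement a
LEFT JOIN product_a pa
       ON pa.agreement_id = a.id
      AND pa.status IN ('active', 'expiring')
LEFT JOIN product_b pb
       ON pb.agreement_id = a.id
      AND pb.status IN ('active', 'expiring')
WHERE a.customer_id = 123   -- example customer id
GROUP BY a.id
LIMIT 1;
```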

Query explanation

This query was about finding out whether a customer is “active”. In this particular context we defined an active customer as one that had at least 1 active product, out of the 2 possible, attached to their agreement. Back in February 2018 we had a similar definition, but the query covered only 1 product type and had only 1 join, which was an INNER join back then:
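Again as a sketch with assumed names, that earlier version might have looked something like this:

```sql
-- Sketch of the February 2018 version: a single INNER join on one product
-- type, no GROUP BY. Names are assumed.
SELECT 1
FROM agreement a
INNER JOIN product_a pa
        ON pa.agreement_id = a.id
       AND pa.status IN ('active', 'expiring')
WHERE a.customer_id = 123   -- example customer id
LIMIT 1;
```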

As time went by, new products were developed, this query became obsolete and had to be rewritten to cover the case where the customer could also have a second type of product that was still active (we've skipped the other products, but they're not relevant here). So we rewrote the query, splitting the 1 INNER join into 2 LEFT joins. LEFT was used because a customer could have product A and not product B, or the other way around, of course; 2 INNER joins would have required having both at the same time. For some unknown reason we also decided to group the results by the agreement.id column, which we hadn't done previously.

That turned out to be the game changer. Comparing the performance of the two queries, I saw that the latter took 5 seconds instead of the 130–200 ms (still too long, I know) of the previous one.

We obviously had a query performance problem, so what can you do at the very beginning?

The analysis

First things first — EXPLAIN:

Old query:

New query:
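Both plans come from simply prefixing the respective query with EXPLAIN — here is what that looks like for the reconstructed new query from above:

```sql
-- EXPLAIN reports the chosen indexes, join order and estimated row counts
-- without executing the statement.
EXPLAIN
SELECT 1
FROM agreement a
LEFT JOIN product_a pa
       ON pa.agreement_id = a.id
      AND pa.status IN ('active', 'expiring')
LEFT JOIN product_b pb
       ON pb.agreement_id = a.id
      AND pb.status IN ('active', 'expiring')
WHERE a.customer_id = 123
GROUP BY a.id
LIMIT 1;
```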

Hmm, at first I thought we hadn't introduced that much complexity, yet a still simple-looking query now took an average of 5 seconds to complete. Furthermore, this query was executed many times during invoice generation — 2200 times, as it turned out. Let's do the math, shall we:

2200 * 5 / 3600 = ~3 hours of running time!

But, but … the EXPLAIN isn’t that bad, right? After all, the numbers in the rows column of every line are quite small. And there’s a LIMIT 1 that should do the job, right?

Looking at this EXPLAIN I was almost convinced of that too — that there was nothing wrong with the query… but then I profiled it:

The manual for SHOW PROFILE can be found in the MySQL documentation.

The profile for this query looked like this:
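Setting the exact numbers aside, this is roughly how such a profile is captured in MySQL (SHOW PROFILE is deprecated in recent versions, but still available):

```sql
-- Enable profiling for the current session, run the query, then inspect the
-- per-stage breakdown (including the "Sending data" stage discussed below).
SET profiling = 1;

-- ... run the problematic query here ...

SHOW PROFILES;               -- lists recently executed queries with their ids
SHOW PROFILE FOR QUERY 1;    -- stage-by-stage durations for query id 1
```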

We can immediately see that the “Sending data” part took 99.96% of the whole query time. We can also spot huge numbers in this row compared to the numbers in the other rows. “Sending data” roughly means that tables and/or indexes are being scanned to compute the query result, so the problem is probably with the query execution plan and not with the DB’s overall performance.

The cause

According to the EXPLAIN, the query used the indexes and had relatively small numbers in the rows column. I changed the INNER join to a LEFT join in the original query (the one from Feb 2018) but, as expected, that didn't change anything — the execution plan was still the same. This led me to the GROUP BY clause that had also been introduced.

Eureka! Removing the GROUP BY reduced the query time by a mile! But wait, was the query still correct? Hmm, not anymore — but let's forget about correctness for now; I wanted to know why the query was suddenly so fast!

Well, one of the reasons was actually a bad index. The “key” and “Extra” columns of the EXPLAIN also hinted at it quite strongly, but I didn't notice it at first glance. Did you? Think about it for a while…

Bad index

The JOIN used agreement_id as the key to fetch rows when computing the resulting combinations. The status field wasn't part of the key, so the DB had to hit the table data to check the second condition of the JOIN ... ON clause. This led me to check how many rows had to be examined to compute the join result, and it turned out to be quite a few:

The query couldn’t fully rely on the index, so for every agreement entry (fortunately just 1 for this customer) the DB had to examine 77924 rows to filter out the matching ones (the AND status IN ('active', 'expiring') part) and produce the result of the join. Then, for every combination, there was another LEFT JOIN that had to examine another 1486 rows, filter out the matching ones and produce its result. The situation was getting worse over time, because we kept getting more and more entries in the joined tables. That was the reason we hadn’t noticed the problem for a long time — we simply didn’t have that much data for this specific customer back then.
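For illustration, one way to let such a join filter on status without touching the table data would be a composite index covering both join conditions — a sketch only, with assumed table and index names, and not necessarily what we ended up doing:

```sql
-- Hypothetical composite indexes: with (agreement_id, status) the status
-- filter can be resolved inside the index instead of reading full table rows.
ALTER TABLE product_a ADD INDEX idx_agreement_status (agreement_id, status);
ALTER TABLE product_b ADD INDEX idx_agreement_status (agreement_id, status);
```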

But hey, we still have a LIMIT 1. Shouldn't that optimise the whole thing?!

Yes, I thought so too, but not in this case. To fully understand the problem, some knowledge about the order of operations is required. LIMIT n ends the query as soon as n rows have been produced as the query result, and we relied on that. But we didn’t think about how GROUP BY would dramatically change this behaviour.

The villain — GROUP BY

GROUP BY creates groups of data in order to compute the resulting rows. Without it, the query would exit on the first matching row found, but GROUP BY caused the rows to be assembled into a group first, out of which we only told it to SELECT 1. In effect, the result was the same as finding the first row without using groups at all. So… the GROUP BY clause obviously didn't make sense, but because of it the LIMIT clause was applied after the rows had been grouped. As a consequence, in order to produce a simple 1 from the group, all of the 77924 and 1486 rows had to be examined to compute the resulting rows — rows which were then simply thrown away, because we only wanted to SELECT 1. Arghh! Not very smart of us, to say the least. 😜

Conclusions

Looking only at the result of EXPLAIN gave us false confidence that the query was OK, both in Feb 2018 and now. Don’t get me wrong, describing a query with EXPLAIN is usually the first thing a developer does to see why a query performs badly, and that’s a good thing. It can tell you a lot… assuming you can understand the output, of course. In this case, however, it obviously wasn’t enough and we had to dig deeper. Some knowledge of how the SQL engine executes a query step by step was actually mandatory to fully understand the issue.

Nowadays, when we already have 8 (and more to come) different product types, doing 8 joins would be just… oh, answer this one yourself ;) And apart from that, let’s imagine the overall complexity of the query (and thus the time needed to compute the result) if we had those 77924 rows in every join 3 years from now… that math just hurts my brain.

What to do then, if we’d like to keep doing queries? Well, instead of a single, ever more complex query, we can run several smaller, better optimised queries. We can “early exit” in the application code as soon as the first query produces a 1, or, if we want to reduce round trips, we can combine several queries into 1 bigger query using the UNION operator — although this will always compute every single query in the union in order to produce the union result set.
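As a sketch of that UNION variant, with the same assumed table names as before:

```sql
-- Each branch is a cheap, index-friendly existence check; any returned row
-- means the customer has at least one active product. Names are assumed.
(SELECT 1 FROM product_a
  WHERE agreement_id = 42 AND status IN ('active', 'expiring') LIMIT 1)
UNION
(SELECT 1 FROM product_b
  WHERE agreement_id = 42 AND status IN ('active', 'expiring') LIMIT 1)
LIMIT 1;
```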

Summary

In the end, we completely rewrote this mechanism, because running the exact same query 2200 times was excessive. This has surely taught us a lesson about how unsupervised legacy code — the kind we’ve always treated as “if it works, don’t touch it” — can come back and literally kick our butts. Don’t underestimate your legacy code, and don’t take for granted that because it did the job a few years ago, it will still do it in a year or two. And, of course, measure your performance — it’s the key to actually knowing what is happening. Without it you’ll be unaware of potential problems and risks. Fortunately for us, this problem didn’t have any negative impact beyond performance. But if money had been lost because of it, that would have been a very different situation, for sure.

All in all, with just a few tricks we managed to cut the time needed to generate invoices for our customers from 3 hours to a little over 12 minutes.

Cheers!

