Understanding SQL Queries Performance Using SQL EXPLAIN and ANALYZE

and how this understanding helps us in optimizing slow queries in our products

Nyoman Abiwinanda
Xendit Engineering
10 min read · Nov 15, 2021


Let’s admit this

Database engines are complex these days

I mean, just how many of you actually understand how your queries run? Probably not many. If you think about database engines nowadays, such as PostgreSQL, Oracle, and Microsoft SQL Server, all of these engines have become so advanced that people don't really think about the underlying low-level logic these engines are executing.

People seeing how databases work

To some extent, there is a sense of magic in how SQL queries work, so you don't really think about them too much anymore. Perhaps some of you might not even care how your queries run. You just expect the database to return the data you ask for.

However, due to the growing volume of data my team handles, we noticed database timeouts occurring at a rate we found troubling. We believed it was worth investing a little of our time to understand how the database executes our queries. Along the way, you will see a couple of SQL commands we use that slowly turned the database from something that initially seemed obscure into something more comprehensible, and that helped us take a "smart guess" and successfully optimize our queries for the sake of future scalability.

Why Should We Care About Database Performance?

First of all, let’s ask ourselves this question

Why should developers have concerns about database performance?

If you look at the APIs you have made so far, you will realise that almost all of them involve database CRUD operations. Sometimes what customers end up seeing on a webpage comes straight from the database. If you are an experienced software developer, you might have experienced, at some point in your career, a single query in your application that grinds everything to a halt when it runs. In fact, one interesting thing about it is that

A slow query can be the difference between a system that is super snappy and one that is literally unusable, and sometimes the difference between these two extremes is only an index apart.

This makes slow queries one of the most common causes of poor application performance, hence the more relevant question we should ask ourselves is

Do developers know how to design efficient queries?

Now, there are many books and resources on the internet that explain this topic, so a single blog post is definitely not enough to cover every possible design process. Instead, I will take a couple of samples of slow queries from our product and show what my team and I did to optimize them.

Slow Query 1

The first sample of a slow query we have is this

Simple query yet important for our monitoring

We run this query every 15 minutes to monitor pending virtual accounts in our product. For those of you who do not know, a virtual account is basically a bank account that has no physical existence but transacts on behalf of a real, physical account. Virtual accounts make it easy for our customers to receive payments via transfer from multiple banks without having to open an account at each bank. This payment method is important to support because almost 80% of payment transactions in Indonesia happen via bank transfer. With a virtual account, you get a notification when your customer has made a payment without having to check manually.

Back to the query: it is very simple. It selects virtual accounts created at least one hour ago whose status is neither ACTIVE nor INACTIVE; in other words, it is specifically trying to find PENDING virtual accounts.
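Since the query itself appears only in the screenshot, here is a minimal sketch of its shape based on that description (the table name virtual_accounts is my assumption; created and status are the columns the rest of this section discusses):

```sql
-- Illustrative sketch: find virtual accounts created at least one hour ago
-- that are neither ACTIVE nor INACTIVE (i.e. still PENDING)
SELECT *
FROM virtual_accounts
WHERE created < NOW() - INTERVAL '1 hour'
  AND status NOT IN ('ACTIVE', 'INACTIVE');
```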

The reason we decided to optimize this query is that ever since the product launched, virtual account data has grown so much that one day we started getting repeated timeout errors every 15 minutes from this query. These timeouts occasionally left us without our virtual account monitoring.

To understand why this query runs slow, there are a couple of handy keywords in SQL that allow us to gain insight into it: EXPLAIN and ANALYZE. EXPLAIN prints out the execution plan of your SQL query, and adding ANALYZE actually runs the query and shows how much time each step of that plan takes to execute.

Once we prepended the sample query with EXPLAIN ANALYZE, it turned out the query was running for 4 seconds, and the reason it runs slow is that it performs a sequential scan over the whole table!
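Using these keywords is as simple as prepending them to the query. Sketched against the same hypothetical virtual_accounts table (the plan below only illustrates the general shape of the output, not our actual numbers):

```sql
EXPLAIN ANALYZE
SELECT *
FROM virtual_accounts
WHERE created < NOW() - INTERVAL '1 hour'
  AND status NOT IN ('ACTIVE', 'INACTIVE');

-- PostgreSQL responds with a plan tree annotated with both the planner's
-- estimates (cost, rows) and what actually happened (actual time, rows):
--
--   Seq Scan on virtual_accounts  (cost=... rows=...) (actual time=... rows=...)
--     Filter: ...
--   Planning Time: ...
--   Execution Time: ...
```

The "Seq Scan" node is the tell-tale sign that the database is reading the whole table instead of using an index.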

Execution plan and time of the query given by EXPLAIN and ANALYZE respectively

Now you might be asking: why don't we add indexes to the created and status columns, since the query filters on those two columns? Well, indeed, but we already had an index on both columns, so the question we were asking at the time was: why does the database still perform a sequential scan? It turns out the problem is in the NOT IN clause of the query.

status is a column that stores an arbitrary string, so when we declaratively state that we want virtual accounts whose status is not ACTIVE nor INACTIVE, the database has no way of knowing that we specifically intend to find PENDING virtual accounts. It therefore ignores the existing indexes and scans the table sequentially instead.

Small refactor, but you will see later how meaningful it is

By refactoring the query slightly and making it a little more specific, as shown above, we made the query almost 900 times faster! That is a lot of improvement, especially from a slight modification that took very little effort.
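In code, the refactor amounts to replacing the negative condition with the status we actually mean (the virtual_accounts table name is again my illustrative assumption):

```sql
-- Before: the negative NOT IN condition gives the planner nothing to
-- match against the index, so it falls back to a sequential scan
SELECT * FROM virtual_accounts
WHERE created < NOW() - INTERVAL '1 hour'
  AND status NOT IN ('ACTIVE', 'INACTIVE');

-- After: state the intent directly so the index on status becomes usable
SELECT * FROM virtual_accounts
WHERE created < NOW() - INTERVAL '1 hour'
  AND status = 'PENDING';
```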

900x improvement with a little refactor that only takes like seconds

But there is still one peculiar thing if you pay attention to the execution plan: the database uses a composite index that consists of the status and expiration_date columns. This shows there was missing or improper indexing to support the query in the first place, so what we decided to do to improve the query performance further was to create a composite index over both the status and created columns. Pretty simple, right? Yeah (at least that is what I thought at first)… but there is one thing about composite indexes that many developers like to forget, which is

Columns order matters in Index

So we can do one of two things: create an index on (created, status) or an index on (status, created). The question is, which one is better? You could answer this by creating both indexes and seeing which one is faster with EXPLAIN and ANALYZE, but for this one we can go a little bit analytical.
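For concreteness, the two candidates look like this (index and table names are illustrative):

```sql
-- Candidate 1: entries sorted by created first, then by status
-- within each group of equal created values
CREATE INDEX idx_va_created_status ON virtual_accounts (created, status);

-- Candidate 2: entries sorted by status first, then by created
-- within each status
CREATE INDEX idx_va_status_created ON virtual_accounts (status, created);
```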

Let's think of the index (created, status) as a table that is first sorted by the created column, and then, within each group of equal created values, sorted by status. If the database first filters using the created column, what the query really asks for is every virtual account created from one hour ago all the way back to when the product first existed. That is pretty much the entire table! Even after this first filter on created, the second filter on status will still have to work through almost the entire table. If we instead filter first by the status column, the database already discards most of the data, and the subsequent filter on created only has to work through a small subset.

Analytically, it is clear that an index on (status, created) is more helpful to the database than an index on (created, status). Once we created that index, we improved the query further to just 0.030 ms, which is a ~150k times net improvement over the initial performance!

A net improvement of ~150k times after optimisation!

Slow Query 2

The previous query was rather simple. Next, we have a much bigger query that bothers us a lot because it is customer-facing and gives us a lot of timeouts, even on weekends.

This query looks intimidating but bear with me…

The query shown above is used in our product dashboard to fetch a merchant's transactions, such as the ones shown below

This page is powered by the query shown before

The reason this query is slow is that when we search transactions using the search field, it searches through many columns (which is why there are a lot of ORs in the query). In fact, with ANALYZE we could see that a sample of the query really does run extremely slowly
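The full query lives in the screenshot above, but purely to illustrate its shape, it looks roughly like this (every column and json key name here is hypothetical; only the table name product_transactions and the product_data column come from the actual schema):

```sql
-- Illustrative shape of the dashboard search query: one search term
-- is matched against many columns and jsonb keys via OR branches
SELECT *
FROM product_transactions
WHERE merchant_id = $1
  AND (product_data->>'reference_id'  = $2
    OR product_data->>'customer_name' = $2
    OR product_data->>'description'   = $2
    -- ... many more OR branches over other columns and keys
  )
ORDER BY created DESC
LIMIT 10;
```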

One of the sample queries runs for ~28s; the execution plan is intentionally not shown because your head would explode from reading it

A dedicated search engine such as Elasticsearch would be more suitable for this use case, but due to time constraints we decided to understand the query further and do whatever we could to optimize it.

From the execution plan given by EXPLAIN, we noticed that the query runs slow because, just like the previous query, it scans the data sequentially. Scanning the product_transactions table sequentially is not a good idea because it is one of the biggest tables in our database.

The fact that the database scans the table sequentially gives us a clue that an index might be missing, and in our case it is! We had not indexed the product_data column (at least not properly). To fix that we could, of course, add an index on the product_data column, but since product_data is a jsonb column, we need to think about which kind of index to use. There is actually a general rule for this:

if you know exactly which keys your query searches by in the json column, then a BTREE index is most likely better

however

if you don't, and you search through the json column using arbitrary keys, then a GIN index is better

In web applications, queries are mostly predetermined, so we know exactly which keys we are searching by. In this case, the default BTREE index makes sense, but one thing to note about BTREE, especially for a jsonb column, is that you need to index the keys inside the column, not the entire product_data column.
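Concretely, the two options look something like this in PostgreSQL (the key name reference_id is an assumption for illustration):

```sql
-- BTREE on one specific key: note the extra parentheses around the
-- expression; we index the extracted text value of the key, not the
-- whole jsonb column
CREATE INDEX idx_pt_reference_id
  ON product_transactions ((product_data->>'reference_id'));

-- GIN on the whole column: the alternative when queries filter by
-- arbitrary keys, e.g. with containment operators such as
-- product_data @> '{"some_key": "some_value"}'
CREATE INDEX idx_pt_product_data_gin
  ON product_transactions USING GIN (product_data);
```

A BTREE expression index like the first one only helps queries that use the exact same expression (product_data->>'reference_id') in their WHERE clause, which is why it fits the "known keys" case.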

Notice just how an index could save this query

Once we applied BTREE indexes to the keys inside the product_data column that are used in the query, the results were very satisfying. Initially we got timeouts every day, and now we have flattened the graph completely. No more timeouts disturbing the peacefulness of our weekends since then :)

No more timeouts bothering us after the optimisation, especially during weekends

Conclusions

After the couple of query optimizations shown above, you might think it was our deep understanding of the database that enabled us to figure out what to do, but honestly, it was the support of EXPLAIN and ANALYZE. They give us useful feedback that helps us figure out what to do next and whether an optimization attempt makes our queries better or worse. If you are wondering, we use PostgreSQL as our database. If you use another SQL database, chances are it plans or executes a given query differently, so trying to work out how your query runs purely analytically could be pointless; it is more time-efficient to ask the database directly how it executes the query.

One thing is for sure: you don't need a Ph.D. in databases, nor do you have to be a certified database expert, to optimize database queries. I think a more sensible approach is to just start looking at your applications, find queries you think are slow, take a sample of those queries, and use EXPLAIN ANALYZE to see how they are performing. You can use the execution plan shown by EXPLAIN as a hint for improvement, and the execution time shown by ANALYZE as a benchmark for whether your optimization attempts improve your query or make it worse.
