Moneybird relies on PostgreSQL indexes to keep queries fast. With an ever-growing dataset we have to come up with new indexes from time to time to keep things running smoothly. Usually it's a matter of analyzing the slow query by prepending it with EXPLAIN ANALYZE, which shows us something like this:
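The original query isn't reproduced here, so the following is an illustrative sketch using the placeholder names from the text ('some_table', 'some_variable'); the conditions, constants and LIMIT value are invented:

```sql
EXPLAIN ANALYZE
SELECT *
FROM some_table
WHERE some_variable = 42
  AND some_other_variable = 'pending'
ORDER BY some_table.id ASC
LIMIT 1000;
```

A plan along these lines would come back (numbers elided), showing the planner walking the primary key index and filtering rows as it goes:

```
 Limit  (cost=… rows=1000 …) (actual time=… loops=1)
   ->  Index Scan using some_table_pkey on some_table
         Filter: ((some_variable = 42) AND (some_other_variable = 'pending'))
         Rows Removed by Filter: …
 Execution time: … ms
```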
(note: run the EXPLAIN ANALYZE query twice in quick succession to make sure your results aren't biased by cold caches; you will see a significant drop in execution time the second time around, once the pages are cached)
Adding an index
An index with matching conditions can be added by executing:
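For example, assuming the conditions cover two columns (the index and column names follow the placeholders used in the text):

```sql
CREATE INDEX CONCURRENTLY some_table_some_index
  ON some_table (some_variable, some_other_variable);
```

CONCURRENTLY is optional, but it avoids blocking writes while the index is built on a large table.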
Let's see if there’s any improvement by analyzing the results:
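With the illustrative placeholders (constants invented), that means rerunning:

```sql
EXPLAIN ANALYZE
SELECT *
FROM some_table
WHERE some_variable = 42
  AND some_other_variable = 'pending'
ORDER BY some_table.id ASC
LIMIT 1000;
```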
The results are identical to the first EXPLAIN ANALYZE: the query plan still scans the primary key index of 'some_table'. This is very slow because that index covers every record in 'some_table', millions of rows in this case. Only a few dozen of those millions match the conditions, so PostgreSQL has to walk almost the entire index before it hits the limit.
Perhaps it's a matter of running ANALYZE on the table, to explicitly refresh the query planner's statistics about the data composition:
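Using the placeholder table name:

```sql
ANALYZE some_table;
```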
(note: the autovacuum daemon runs ANALYZE by default; calling it manually is just a way to make sure the statistics are up to date)
Rerunning EXPLAIN ANALYZE yields the same results. The query planner still chooses the slow primary key index over 'some_table_some_index'. Apparently it expects the data to be distributed in such a way that scanning the primary key index consumes less time. For simple, single-column conditions the pg_statistic system catalog keeps a good record. With multiple conditions, however, Postgres starts making assumptions about the data composition that aren't correct. We can prove that LIMIT is to blame by running EXPLAIN ANALYZE without the LIMIT clause:
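Illustratively, the same sketch query without its LIMIT:

```sql
EXPLAIN ANALYZE
SELECT *
FROM some_table
WHERE some_variable = 42
  AND some_other_variable = 'pending'
ORDER BY some_table.id ASC;
```

With the limit gone, the planner is free to pick 'some_table_some_index' (typically via a bitmap index scan) and sort the handful of matching rows afterwards.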
A significant improvement, achieved merely by dropping the LIMIT clause from the query. For our data, raising the LIMIT above roughly 30,000 yields similar results: the planner includes the index in its plans once the limit crosses that threshold.
There are two workarounds; both yield comparable results.
Common Table Expression
The issue can be tackled with a Common Table Expression (CTE), which forces the planner to use 'some_table_some_index' because the LIMIT is applied at a later stage.
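A sketch of that shape, with the LIMIT moved outside the CTE (placeholder names as before, constants invented):

```sql
WITH matching AS (
  SELECT *
  FROM some_table
  WHERE some_variable = 42
    AND some_other_variable = 'pending'
  ORDER BY some_table.id ASC
)
SELECT * FROM matching
LIMIT 1000;
```

Note that on PostgreSQL versions before 12 a CTE acts as an optimization fence: the inner query is planned without knowledge of the outer LIMIT, which is exactly what makes this trick work.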
Unused ORDER clause
The query planner can also be tricked into planning an index scan by adding an unused order column. In this case we add "some_table"."some_variable" (which happens to be in 'some_table_some_index') to the ORDER BY clause. PostgreSQL will now use the index because it thinks it will have to sort on some_variable, even though ordering on the unique "some_table"."id" column alone already fixes the row order.
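Sketched with the same placeholders and invented constants:

```sql
SELECT *
FROM some_table
WHERE some_variable = 42
  AND some_other_variable = 'pending'
ORDER BY some_table.id ASC, some_table.some_variable ASC
LIMIT 1000;
```

Because id is unique, the extra sort key can never change the row order; it only changes the planner's cost estimate.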
The aforementioned query is the result of calling find_each on an ActiveRecord model, so we implemented the unused order clause workaround by patching ActiveRecord::Batches. A gist of our patch: https://gist.github.com/eliank/f6b30f7ad20567573f86d33738f99ac7