5 Things I wish my grandfather told me about ActiveRecord and Postgres
It's been more than a year since we ran a huge TV campaign that made some of our tables grow to 30M rows. Since then we have run a book club focused on Postgres performance and analysed several real-world scenarios. I'm going to condense here what we learned and the changes we made in the code to improve our queries.
When I started at carwow 2.5 years ago we had roughly 77 thousand users; now we have about 2.6 million.
Explain analyze, it’s your BFF
In order to understand why a query performs badly, it helps to know how to read the output of explain analyze. One year ago I had no idea; today it's the best tool I have at hand to optimise queries.
When you run explain <query>
Postgres shows you the plan it thinks is optimal for executing the query, e.g. explain select count(*) from users where email = 'test@example.com'
produces:
QUERY PLAN
------------------------------------------------------------------
 Aggregate  (cost=4.09..4.09 rows=1 width=8)
   ->  Index Only Scan using index_users_on_email on users  (cost=0.09..4.09 rows=1 width=0)
         Index Cond: (email = 'test@example.com'::text)
In human words this means:
I'll execute an index only scan using the index index_users_on_email on the users table; the estimated 'cost' to return the first row is 0.09, and by 4.09 I'll have fetched the last row.
I'm expecting to find 1 row, with a width of 0 bytes (I'm only reading the index, not fetching the actual data)
The condition on the index is (email = 'test@example.com'::text)
I will then run an aggregate to count the number of rows; the estimated cost to return the first row of the result is 4.09, and the last arrives at 4.09 as well. I'll return only one row with the count.
There are a few things you can notice here:
- The first line of the plan shows the last node that will be executed; effectively what Postgres shows you here is a tree, where the root node (in this case the aggregate) calculates the final result.
- There are 2 different node types in this plan. Postgres has many more, but I won't go into the details of each of them here.
- The aggregate has a 'cost' of x..y where x and y are the same; the reason is that it needs all the results from the index only scan before it can return the count. Another node type with the same behaviour is the sort node.
The explain command only returns the planner's estimates. To get an idea of the real execution cost you can use explain analyze:
explain analyze select count(*) from users where email = 'test@example.com';
QUERY PLAN
------------------------------------------------------------------
 Aggregate  (cost=4.09..4.09 rows=1 width=8) (actual time=10.202..10.202 rows=1 loops=1)
   ->  Index Only Scan using index_users_on_email on users  (cost=0.09..4.09 rows=1 width=0) (actual time=10.191..10.195 rows=1 loops=1)
         Index Cond: (email = 'test@example.com'::text)
         Heap Fetches: 1
 Planning time: 0.065 ms
Execution time: 10.271 ms
(6 rows)
The result is similar, but next to the planner's estimated cost you also get the actual time it took to execute each node.
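You don't have to drop into psql to look at plans: ActiveRecord relations have an explain method that prints the plan for the SQL they would generate (note it runs plain explain, not explain analyze). A minimal sketch from the Rails console:
# Print the plan Postgres picked for the SQL this relation would generate.
puts User.where(email: "test@example.com").explain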
How does Postgres know which plan to use?
- First, the planner/optimizer generates different possible plans for the query
- Then, based on statistics about the tables and its cost-estimate settings, it picks the plan with the lowest cost.
Keep this in mind, it will help us later.
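Those table statistics come from ANALYZE, which autovacuum normally runs for you in the background; if they go stale, the row estimates (and therefore the chosen plan) can be badly off. A minimal sketch of refreshing them by hand from a Rails console, assuming a users table:
# Refresh the planner's statistics for the users table. Autovacuum usually
# takes care of this, but running it manually can help when the row
# estimates in explain output look wrong.
ActiveRecord::Base.connection.execute("ANALYZE users;")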
1st: Use statement timeouts
We're using Heroku and Postgres. Heroku returns a timeout error page to the user if the request takes more than 30 seconds to load, but even though an error page is shown, the database will keep executing the query until a result is found.
Queries running in this way use resources on the DB, causing other, possibly simpler, queries to run for longer and affecting other users.
In your database.yml you can set a timeout for queries: if anything takes longer than expected, the database will interrupt it and ActiveRecord will throw an exception. The query is cancelled and yes, you're showing an error to the user, but now you can catch the exception, get more information about the request and fix the problem at the source. The code to set it up looks like this:
development:
  adapter: postgresql
  encoding: unicode
  database: carwow_quotes_site_development
  pool: 10
  prepared_statements: false
  host: localhost
  variables:
    statement_timeout: <%= ENV['PG_STATEMENT_TIMEOUT'] || 1000 %>
Here we’re setting a default of 1000 ms. It can be overridden on background jobs using the ENV variable.
Adding this setting will not only help you find slow queries that are currently affecting your site, it will also help you catch performance issues caused by the growth of your user base.
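As a minimal sketch of both sides of this (the controller is made up, and the exact exception class depends on your Rails version: newer versions raise ActiveRecord::QueryCanceled, older ones the generic ActiveRecord::StatementInvalid):
# Hypothetical controller: when the timeout fires, ActiveRecord raises an
# exception we can rescue, log with some context and turn into an error page.
class SearchesController < ApplicationController
  def index
    @users = User.where(active: true).order(:created_at).load
    render :index
  rescue ActiveRecord::StatementInvalid => e
    Rails.logger.error("Query cancelled by statement_timeout: #{e.message}")
    render :error, status: :service_unavailable
  end
end

# A background job that legitimately needs longer can raise the limit for its
# own connection (the value is in milliseconds, like in database.yml).
ActiveRecord::Base.connection.execute("SET statement_timeout = 60000")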
2nd: Counts are as expensive as the full query
We were using the kaminari gem; if you are not familiar with it, it's a pagination gem. It shows page links at the bottom of the list to help navigation:
« First ‹ Prev … 2 3 4 5 6 7 8 9 10 … Next › Last »
When our tables started growing we noticed those pages had performance issues.
But why is it expensive?
With pagination the query to fetch the data would look like this:
select * from users order by created_at limit 100;
and the corresponding query plan looks like:
Limit (cost=0.09..18.49 rows=100 width=935) (actual time=0.014..0.125 rows=100 loops=1)
-> Index Scan using index_users_on_created_at on users (cost=0.09..462735.29 rows=2514107 width=935) (actual time=0.013..0.109 rows=100 loops=1)
Planning time: 0.111 ms
Execution time: 0.162 ms
You can see the cost of retrieving the first and the last row is minimal; this is because the index scan returns a 'stream of rows' and the limit simply cuts the execution as soon as the 100th row is fetched.
But digging into the issue we noticed the gem was also triggering a count query, which looked like this:
select count(*) from users;
And that brings up the following plan:
Aggregate (cost=234895.37..234895.38 rows=1 width=8) (actual time=814.283..814.283 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..233638.32 rows=2514107 width=0) (actual time=0.043..611.391 rows=2514239 loops=1)
Planning time: 0.101 ms
Execution time: 814.316 ms
(4 rows)
The aggregate function has to wait for all rows returned by the sequential scan before returning the correct value.
How to deal with them
- Use a different style of pagination e.g. ‘load more’ or infinite scrolling
- Do not issue counts you don't need: kaminari has a without_count method you can use for large datasets (see the sketch after this list)
- Use the Rails counter_cache option for has_many and belongs_to relations
- If you have to use a count multiple times on a page, store the value or use the length method instead
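A minimal sketch of the first two options (the Review model and reviews_count column are illustrative, not our actual schema):
# Prev/next pagination without the expensive select count(*):
# kaminari's without_count skips the total-pages calculation entirely.
@users = User.order(:created_at).page(params[:page]).without_count

# counter_cache: Rails keeps a reviews_count integer column on users up to
# date as reviews are created and destroyed, so user.reviews_count becomes a
# plain column read instead of a count(*) query.
class Review < ApplicationRecord
  belongs_to :user, counter_cache: true
end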
3rd: Sorts
How does sorting work?
In the generic scenario, to sort a record set the database takes all the records returned by some operation and sorts them in memory. A generic sort would look like the following:
EXPLAIN SELECT * FROM models ORDER BY created_at LIMIT 100;
And the corresponding query plan is the following:
QUERY PLAN
------------------------------------------------------------------
 Limit  (cost=104.04..104.09 rows=100 width=1044)
-> Sort (cost=104.04..104.51 rows=943 width=1044)
Sort Key: created_at
-> Seq Scan on models (cost=0.00..96.83 rows=943 width=1044)
As you can see, to sort the data the DB has to scan the full models table and wait for all the rows before it can return the sorted set and hand back the first 100 rows.
But btree indexes in Postgres are stored in order, so if the planner decides to use an index, it can also use it to fetch the data already sorted. For instance, a similar query on an indexed column:
explain select * from models order by make_id desc limit 100;
Uses the following plan:
QUERY PLAN
------------------------------------------------------------------
 Limit  (cost=0.06..38.46 rows=200 width=951)
-> Index Scan Backward using index_models_on_make_id on models
(cost=0.06..178.62 rows=930 width=951)
As you can see, it takes less time to return the first row, because the result of the index scan is already sorted.
How can it help me?
Sorting data doesn’t come for free unless an index is used.
A real-world example: if you want to show the most recent messages a user has received, without an index the DB has to go through all their messages and sort them.
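A minimal sketch of that example (the messages table, the Message model and the migration name are hypothetical):
# An index on (user_id, created_at) lets Postgres find a user's messages and
# read them already ordered by recency, so no in-memory sort is needed.
class AddUserIdAndCreatedAtIndexToMessages < ActiveRecord::Migration[5.2]
  def change
    add_index :messages, [:user_id, :created_at]
  end
end

# The "recent messages" query can then be served straight from the index order:
user = User.first
Message.where(user_id: user.id).order(created_at: :desc).limit(20)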
4th: Limits and offsets
I always thought the cost of a limit, and of a limit + offset, was driven by the number of results the query returns. Is that the case?
It’s not.
Limits are executed by Postgres like normal queries; the difference is that the requested slice is taken out of the result of the query. So if you are fetching from the 10000th result to the 11000th, Postgres still has to produce every row up to that point and throw away everything before the offset.
This concept is especially useful if you need to do big updates on millions of rows in a large table. We hit this problem for the first time with a query that used the find_in_batches method provided by ActiveRecord. find_in_batches lets you load chunks of rows from a table instead of loading the full result set of the query into memory. For example, if you want to update most of the records in the users table you would run:
User.where(active: true).find_in_batches { |batch| batch.each(&:update_status) }
ActiveRecord will only load 1000 users at a time and run the block for each batch, solving the memory problem.
The problem is that find_in_batches paginates with limit + offset, and once you reach a big offset the query takes longer and longer to execute.
A solution is to batch on an indexed column instead: rather than limit + offset, you use the index to get each limited result set.
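A minimal sketch of that idea, batching on the primary key (update_status is the hypothetical method from the example above):
# Every batch is a cheap "WHERE id > last_id ORDER BY id LIMIT n" query served
# by the primary key index, so late batches cost the same as early ones.
last_id = 0
batch_size = 1000

loop do
  batch = User.where(active: true)
              .where("id > ?", last_id)
              .order(:id)
              .limit(batch_size)
              .to_a
  break if batch.empty?

  batch.each(&:update_status)
  last_id = batch.last.id
end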
5th: Adding indexes, always good?
Adding indexes to tables is sometimes seen as a catch-all solution: they speed up queries and they are necessary for the database to filter efficiently on certain fields. But are they always necessary? How does the database use them, and what are the drawbacks?
First of all, not all indexes are created equal: there are indexes the database will never use, and indexes that are not used when you would expect them to be.
Let me give you an example. Say you have a book: there is the table of contents at the beginning, and the word index at the end, listing all the terms used in the book. Let's say the author decided to include every single word in that index, even 'the' and 'a'.
Now say you want to find the first usage of the word 'the' in the fifth chapter. Which index would you use? Would you go through every usage of 'the' listed in the word index, or simply go to the table of contents, open the first page of chapter 5 and scan through it until you find the first occurrence of 'the'?
You know 'the' is pretty common, so I hope you would pick the second option.
The DB does the same: knowing what you are looking for, it selects the index that filters out as much data as possible and then loops through the results until it finds what it needs.
This means that if you have, for instance, a boolean field flagging archived content, an index on it, and half of the data in your table archived, that index will never be used.
This can be somewhat counter-intuitive; also keep in mind the planner might change the plan depending on the volume of data it expects to find.
So if you have a 'status' column on the users table containing 'active', 'inactive' and 'dead to us' values, the database may not use an index on it for the very common 'active' value, but only for the rare 'dead to us' one.
Unused indexes can also make your insert and update statements take longer to execute, possibly causing bottlenecks. We found that some unused, very heavy indexes on one of our tables were blocking users from being created.
How can we make it better?
- Measure it! explain analyze the queries you expect to hit those indexes
- Most DBs keep index usage statistics; you can use them to find and remove unused indexes (see the sketch after this list)
- Keep in mind that databases treat NULLs in indexes differently: Oracle's btree indexes don't include rows where all the indexed columns are NULL, while Postgres and MySQL do index them. If you mostly query for NULL (or NOT NULL) values of a column, a conditional (partial) index covering just those rows can be much smaller and faster.
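A minimal sketch of those last two points (the migration and the 'dead to us' index are illustrative):
# Find indexes that are rarely or never scanned, using Postgres'
# built-in statistics view pg_stat_user_indexes.
index_stats = ActiveRecord::Base.connection.select_all(<<-SQL)
  SELECT relname AS table_name, indexrelname AS index_name, idx_scan
  FROM pg_stat_user_indexes
  ORDER BY idx_scan ASC
  LIMIT 20
SQL

# A conditional (partial) index only covers the rare rows we actually query
# for, instead of indexing every value in the column.
class AddPartialIndexToUsersOnStatus < ActiveRecord::Migration[5.2]
  def change
    add_index :users, :status, where: "status = 'dead to us'"
  end
end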
Remember
Assume the DB won't use your new index until you have proven otherwise.
Conclusions
We have only scratched the surface of what that black box of a database does for us and how it makes its decisions:
- Measure, and tackle the slowest bit first; sometimes the database is not even the bottleneck, and with good measurements you'll find what to focus on next
- Get used to using explain and explain analyze
- Use statement timeouts
And remember: things that work now might not work in one year's time.
Interested in making an Impact? Join the carwow-team!
Feeling social? Connect with us on Twitter and LinkedIn :-)