How Partial Indexes Affect UPDATE Performance in Postgres
Based on a true story
A partial index is a good way to save disk space and improve record lookup performance. The common rule is "use it when you can afford it": in other words, if all your queries involve some filter, it is usually considered a good idea to include that filter in the index definition, to reduce the index's size and improve its performance (the smaller your index is, the faster the corresponding IndexScan or IndexOnlyScan will be executed).
But do partial indexes always give you better performance? Well, let’s see.
Consider a simple example:
create table asset(
id bigserial primary key,
owner_id int8 not null,
name text not null,
price int8
);
Let’s assume that in this table we store information about some assets, and let’s fill it with some sample data:
insert into asset(
owner_id,
name,
price
)
select
i,
round(random()*100000)::text,
round(random() * 10000)
from generate_series(1, 600000) _(i);
And let’s add more sample data, this time with “unknown” price:
insert into asset(
owner_id,
name,
price
)
select
i,
round(random()*100000)::text,
null -- price is unknown (yet)
from generate_series(600001, 1000000) _(i);

vacuum analyze asset;
Now we have 1 million records, 40% of them with a yet-unknown (null) price.
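A quick sanity check of what we just loaded (a small sketch; count(price) ignores NULLs, so it gives the number of rows with a known price):
select
count(*) as total,
count(price) as with_price,
count(*) - count(price) as price_unknown
from asset;
-- expected: 1000000, 600000, 400000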
Prices change from time to time, so any record might be UPDATEd to have a new, corrected price value.
Next, consider a situation where we need to quickly find all assets belonging to a specific user, but we are never interested in records with a yet-unknown price:
select *
from asset
where
owner_id = :owner_id
and price is not null;
What index do we need to use? It is quite natural to think that this partial index will serve our needs perfectly:
create index i_asset_price_partial
on asset
using btree(owner_id)
where price is not null;
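Before benchmarking, we can quickly confirm that the planner is able to use this index for our query; a small sketch (the owner_id value here is just an example, and the exact plan may differ on your data):
explain
select *
from asset
where
owner_id = 1000
and price is not null;
-- we expect an Index Scan (or Bitmap Index Scan) using i_asset_price_partial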
But does it really give us the best performance?
Well, it depends.
First of all, such an index will be much smaller than a regular full-table index:
create index i_asset_price_full
on asset
using btree(owner_id);
Let’s compare their sizes:
test=# \x
Expanded display is on.
test=# \di+ i_asset_price_partial
-[ RECORD 1 ]----------------------
Schema | public
Name | i_asset_price_partial
Type | index
Owner | nikolay
Table | asset
Size | 13 MB
Description |

test=# \di+ i_asset_price_full
-[ RECORD 1 ]--------------
Schema | public
Name | i_asset_price_full
Type | index
Owner | nikolay
Table | asset
Size | 21 MB
Description |
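By the way, the same sizes can be obtained with plain SQL instead of \di+; a small sketch using pg_relation_size():
select
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where relname = 'asset';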
Well, as expected, the number for the full-table index is higher: 21 MB vs 13 MB, so it is ~60% bigger than the partial index. This is the point at which you might decide: well, I'd better go with the partial index (as I did in one of the Postgres instances I optimized). But don't hurry, wait a little bit.
What about SELECT performance? Let's quickly check it with pgbench, the native benchmarking tool for Postgres:
echo "\set owner_id random(1, 1 * 1000000)" > selects.bench
echo "select from asset where owner_id = :owner_id and price is not null;" >> selects.benchpgbench -n -T 30 -j 4 -c 12 -M prepared -f selects.bench -r test
Without indexes, on my laptop I got only ~13 TPS (transactions per second):
$ pgbench -n -T 30 -j 4 -c 12 -M prepared -r test -f selects.bench
transaction type: selects.bench
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 4
duration: 30 s
number of transactions actually processed: 397
latency average = 915.767 ms
tps = 13.103765 (including connections establishing)
tps = 13.113909 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.003 \set owner_id random(1, 1 * 1000000)
908.755 select from asset where owner_id = :owner_id and price is not null;
Sequential scans are not good for searching needles in a haystack. No surprises here.
The full-table index (defined alone, without the partial one):
$ pgbench -n -T 30 -j 4 -c 12 -M prepared -r test -f selects.bench
transaction type: selects.bench
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 4
duration: 30 s
number of transactions actually processed: 779801
latency average = 0.462 ms
tps = 25963.230818 (including connections establishing)
tps = 25972.470987 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 \set owner_id random(1, 1 * 1000000)
0.460 select from asset where owner_id = :owner_id and price is not null;
And the partial index (again, defined without the full-table one):
$ pgbench -n -T 30 -j 4 -c 12 -M prepared -r test -f selects.bench
transaction type: selects.bench
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 4
duration: 30 s
number of transactions actually processed: 817490
latency average = 0.440 ms
tps = 27242.705122 (including connections establishing)
tps = 27253.100588 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
0.001 \set owner_id random(1, 1 * 1000000)
0.439 select from asset where owner_id = :owner_id and price is not null;
As we can see, SELECTs working with the partial index are a little bit (~5%) faster than those that work with the full-table index. This is what we expected.
Now let’s see the difference in UPDATE performance. We will touch only those records that already have a defined price, keeping the share of NULLs (40%) the same in case we want to do more experiments later:
echo "\set id random(1, 1 * 600000)" > updates.bench
echo "update asset set price = price + 10 where id = :id;" >> updates.benchpsql test -c 'vacuum full analyze asset;' && \
psql test -c 'select pg_stat_reset();' >> /dev/null && \
pgbench -n -T 30 -j 4 -c 12 -M prepared -r test -f updates.bench
When no indexes except the primary key are present: 14553 tps.
With the full-table index: 14371 tps.
With the partial index: 12198 tps.
This might look surprising — the partial index significantly (~14%) slows down UPDATEs!
Why so? To answer this question, let’s look at Postgres internal statistics, namely pg_stat_user_tables. Notice that we called the pg_stat_reset() function before running pgbench for UPDATEs? That was on purpose: it resets the statistics for the current database, so we collect fresh data. Now, if after each experiment we check the result of this query:
select
n_tup_upd,
n_tup_hot_upd,
round(100 * n_tup_hot_upd::numeric / n_tup_upd, 2) as hot_ratio
from pg_stat_user_tables
where relname = 'asset';
…we’ll see something like this:
-- no indexes except PK
 n_tup_upd | n_tup_hot_upd | hot_ratio
-----------+---------------+-----------
    436808 |        409220 |     93.68

-- full-table index
 n_tup_upd | n_tup_hot_upd | hot_ratio
-----------+---------------+-----------
    431473 |        430807 |     99.85

-- partial index
 n_tup_upd | n_tup_hot_upd | hot_ratio
-----------+---------------+-----------
    366089 |             0 |      0.00
These numbers make it clear: in the case of the partial index, we lose HOT updates completely. HOT stands for Heap Only Tuples; it is an internal technique PostgreSQL uses to perform UPDATEs of tuples. It is not well covered in the main PostgreSQL documentation, but you can find well-explained details in README.HOT in the Postgres source tree (or read this article: “Increase the speed of UPDATE query using HOT UPDATE”).
In short, if during an UPDATE the old and new tuples (the versions of the row you’re updating) are located in the same heap page, and the UPDATE does not modify any indexed column, a special kind of optimization occurs that allows Postgres to save time, first of all because it does not have to change any index entries.
But what exactly does “any indexed column” mean here? Well, README.HOT explains it:
…HOT solves this problem for a restricted but useful special case: where a tuple is repeatedly updated in ways that do not change its indexed columns. (Here, “indexed column” means any column referenced at all in an index definition, including for example columns that are tested in a partial-index predicate but are not stored in the index.)
In our case, the index depends on the price column, which appears in our partial-index predicate, and once we change it, Postgres cannot use HOT, so our UPDATEs become slower in general.
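To see this effect at the level of a single statement, here is a minimal sketch (assuming the table and indexes defined above; pg_stat_xact_user_tables shows counters for the current transaction, so we don’t have to wait for the statistics collector; also note that HOT requires free space in the same heap page, so the first case depends on that too):
-- Case 1: only the full-table index exists; price is not referenced by any index.
drop index if exists i_asset_price_partial;
create index if not exists i_asset_price_full on asset using btree(owner_id);

begin;
update asset set price = price + 10 where id = 1;
select n_tup_upd, n_tup_hot_upd
from pg_stat_xact_user_tables
where relname = 'asset'; -- expected: n_tup_hot_upd = 1 (HOT, if the page has free space)
commit;

-- Case 2: only the partial index exists; price appears in its predicate.
drop index if exists i_asset_price_full;
create index if not exists i_asset_price_partial
on asset
using btree(owner_id)
where price is not null;

begin;
update asset set price = price + 10 where id = 1;
select n_tup_upd, n_tup_hot_upd
from pg_stat_xact_user_tables
where relname = 'asset'; -- expected: n_tup_hot_upd = 0 (not HOT)
commit;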
I believe this could be optimized internally in Postgres itself: if price is not null in both the old and the new version of the tuple, there is no reason to touch the index. But there is no such optimization in Postgres (as of March 2018, when the newest version is Postgres 10), so developers have to keep this in mind while optimizing database schema and queries.
I hope you will not conclude that partial indexes are bad for UPDATE performance. They are not bad at all, but you need to pay attention to the set of columns you use in an index definition and keep in mind this tradeoff between the gain in SELECT performance and the possible loss in UPDATE performance.
This is just one of many examples of how tricky the database optimization process can look to a non-expert’s eye; I often hear that a Postgres DBA’s work is “black magic”. It is, if you have to deal with hundreds or thousands of tables, indexes, and other DB objects while playing a Whac-A-Mole game, guessing which changes will help the most. It is not uncommon that, while optimizing one SQL query, you harm the performance of some other queries.
I believe that new automated database administration tools will appear in the near future to help DBAs and developers with performance optimization.
Meanwhile, I suggest trying my new project called postgres_dba, a collection of SQL reports bundled into a handy toolset that runs in psql in interactive mode. It includes a special report called “3 — Load Profile”, which allows you to see how many tuples were SELECTed, INSERTed, UPDATEd, and DELETEd in each table, and what the ratio of HOT updates is among all updates.
This is a very basic way to keep an eye on UPDATE performance for your tables. If you see that the HOT updates ratio is low for a table, then to understand why, check all the index definitions for that table: do they allow HOT updates in the first place, or do your UPDATE queries involve columns that participate in the definitions of some indexes?
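If you prefer plain SQL over the toolset, a minimal sketch of the same checks might look like this (the hot_ratio column mirrors the query we used above, and pg_indexes shows the definitions you would need to review):
-- HOT update ratio per table (nullif avoids division by zero for tables never updated):
select
schemaname,
relname,
n_tup_upd,
n_tup_hot_upd,
round(100 * n_tup_hot_upd::numeric / nullif(n_tup_upd, 0), 2) as hot_ratio
from pg_stat_user_tables
order by n_tup_upd desc;

-- ...and the index definitions that may prevent HOT updates for a particular table:
select indexname, indexdef
from pg_indexes
where tablename = 'asset';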
Nikolay Samokhvalov
Follow me on Twitter! New and cool Postgres stories are published daily: @postgresmen.
And give postgres_dba a try: https://github.com/NikolayS/postgres_dba