PostgreSQL 10 features: Hash indexes
PostgreSQL 10 it’s coming soon, so I will write some small posts about new features in this amazing release of the most advanced open source database.
This are not a deep dive into the features, just a general overview of the new features coming in the new version.
PostgreSQL provides several index types: B-tree, Hash, GiST, SP-GiST, GIN and BRIN. In this post, I would like to talk about Hash indexes in PostgreSQL, this is not an entirely new feature, in fact PostgreSQL support Hash indexes since very ancient versions of PostgreSQL, from what I can tell, they can be older than PostgreSQL 7.2 (February 2002), so what’s the “new” with this?
In older versions and even in the current PostgreSQL 9.6, Hash indexes have not received much love, in fact they have a big fat warning in the docs:
From PostgreSQL 7.2:
Note: Because of the limited utility of hash indexes, a B-tree index should generally be preferred over a hash index. We do not have sufficient evidence that hash indexes are actually faster than B-trees even for = comparisons. Moreover, hash indexes require coarser locks.
From PostgreSQL 7.3:
Note: Testing has shown PostgreSQL’s hash indexes to be similar or slower than B-tree indexes, and the index size and build time for hash indexes is much worse. Hash indexes also suffer poor performance under high concurrency. For these reasons, hash index use is discouraged.
From PostgreSQL 8.1 and the first sign of hope and a more alarming warning that are not WAL-logged (not crash-safe):
Note: Testing has shown PostgreSQL’s hash indexes to perform no better than B-tree indexes, and the index size and build time for hash indexes is much worse. Furthermore, hash index operations are not presently WAL-logged, so hash indexes may need to be rebuilt with REINDEX after a database crash. For these reasons, hash index use is presently discouraged.
While the problems with hash indexes may be fixed eventually…
From PostgreSQL 9.0 this changed from a Note to a Caution, since they are not WAL-logged, they cannot be replicated, which means data loss in queries:
Caution: Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash if there were unwritten changes. Also, changes to hash indexes are not replicated over streaming or file-based replication after the initial base backup, so they give wrong answers to queries that subsequently use them. For these reasons, hash index use is presently discouraged.
Well since PostgreSQL 10, there is no more warning!
But please note: if you are currently using Hash indexes, even with all this warning signs, you must rebuild hash indexes when you upgrade (pg_upgrade specifically) from any previous major PostgreSQL version. This is required since in PostgreSQL 10 major improvements necessitated this requirement.
Now in PostgreSQL 10 Hash indexes are WAL-logged, this means that are crash-safe and can be replicated.
The performance has been improved too, require less locks, and meta-information is cached for faster lookups.
Wait, wait, you may ask, what so special about hash index? They essentially works only for = (equals) comparison, and B-Tree indexes already do that and a lot more.
The primary and most important improvement against B-Tree indexes is SPACE, the structure used for a hash is flat compared to a tree, so the gain in space can be significant, and they do pretty fast lookups.
If your data is pretty singular, where you need fast = lookups with less space, then Hash indexes can provide a good alternative to B-Tree.
Let’s suppose that you have a system, and you want to integrate Troy Hunt password SHA1 hashes to check that a user don’t use a pwned password, then you download the massive file of 306 million SHA1 pwned passwords list and import it to PostgreSQL (yes, you use PostgreSQL right?) then how do you query a record? you convert the user input into SHA1 and search the row in PostgreSQL massive table (or little table depending who are you speaking). You query this table with a =, for every change of password the user wish to do, if the password is pwned, you reject that password change and force the user to choose another one.
Please note: It’s 2017, you should not, I repeat, NOT store a plain SHA1 user password (and if you are still using MD5, you seriously have problems), it’s not secure, please use something stronger like bcrypt or PBKDF2.
Here is where Hash indexes start to shine, a simple table with 2 columns a serial and a text column, and 319,894,674 records, the table size is 23 GB, the serial column is the primary key (no good reason, just added to it) and the size of the PK 6852 MB.
A query without index in the sha1 column, the execution time is 4 minutes (thanks to the parallel workers).
The size of the B-Tree index: 20 GB. The size of the Hash index: 8192 MB (8 GB) more than a half than B-Tree :-) and the execution time is roughly the same as b-tree.
Another advantage of smaller indexes are that they can fit best in memory and less reads of disk, “Buffers: shared hit=2” vs “Buffers: shared hit=6”.
Here basically ends my first post of new features of PostgreSQL 10, next I will write about the improvements in Parallel queries.