Solving A Postgres Performance Mystery

At Flatiron, we use Postgres as our database solution for a wide variety of use cases. On the whole it has served us very well, but as with any database system, we have occasionally run into instances where it does something unexpected. For example, a system which had been running smoothly for over a year recently slowed down dramatically.

This occurred with our Postgres database which catalogues the unstructured documents representing a patient’s journey. As part of the process of populating this database, each night we ingest new documents for the millions of patients in the Flatiron network. Most of the documents we ingest require one form of processing or another. Therefore, we query our documents database to see which documents for each patient require additional action but have not been handled yet. When this system suddenly slowed down, it hampered our ability to tackle each night’s new documents in a timely manner. Our system health monitors alerted us that we were not finished processing the previous day’s documents by the start of the workday.

Digging through our logs, we discovered that one query in particular was the source of the issue:

SELECT
    dm.id
FROM
    document_metadatas dm
JOIN
    document_contents dc ON dc.id = dm.content_id
WHERE
    dm.patient_id IN ({List of Flatiron patient identifiers…}) AND
    dm.updated_at > now() - INTERVAL '7 DAYS' AND
    dc.requires_processing AND
    dc.processing_version_id IS NULL;

^ A slightly simplified version of the query which started causing performance issues.

This query finds all documents updated in the past week which require processing but are still awaiting it, as evidenced by the lack of a processing system version ID. In the past this query had taken mere milliseconds to run, but now it was taking two or three seconds. What caused this query to slow down by three orders of magnitude?

When investigating a mystery of this nature, looking at the query plan is always the best place to start. We restored a backup of our documents database from a few days prior and compared what the Postgres query planner was doing with our query before and after the slowdown. Sure enough, as the two plans below show, the query planner had made an important change.
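
Each plan, along with the actual row counts and timings it produced, can be captured by running the query under EXPLAIN ANALYZE against the relevant copy of the database. Here is a minimal sketch using the simplified query from above:

-- Capture the chosen plan plus actual execution statistics.
-- Run against both the restored backup and the current database to compare.
EXPLAIN ANALYZE
SELECT
    dm.id
FROM
    document_metadatas dm
JOIN
    document_contents dc ON dc.id = dm.content_id
WHERE
    dm.patient_id IN ({List of Flatiron patient identifiers…}) AND
    dm.updated_at > now() - INTERVAL '7 DAYS' AND
    dc.requires_processing AND
    dc.processing_version_id IS NULL;

^ Capturing the query plan, with actual timings, for comparison.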

Nested Loop (cost=0.28..3786.34 rows=32 width=4) (actual time=7.328..7.328 rows=0 loops=1)
  -> Index Scan using document_metadatas_unique_patient_to_document_id_uniq on document_metadatas (cost=0.16..3646.99 rows=60 width=8) (actual time=0.031..1.445 rows=2153 loops=1)
       Index Cond: ((patient_id)::text = '{Flatiron patient identifier}'::text)
       Filter: (updated_at >= '2017-10-16 00:00:00'::timestamp without time zone)
       Rows Removed by Filter: 51
  -> Index Scan using document_contents_pkey on document_contents (cost=0.11..2.32 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=2153)
       Index Cond: (id = document_metadatas.content_id)
       Filter: (requires_processing AND processing_version_id IS NULL)
       Rows Removed by Filter: 1
Planning time: 0.247 ms
Execution time: 7.354 ms

^ The query plan we had been using.

Nested Loop (cost=0.23..1822644.45 rows=32 width=4) (actual time=2600.674..2600.674 rows=0 loops=1)
  -> Index Scan using document_metadatas_updated_at_idx on document_metadatas (cost=0.11..1822505.10 rows=60 width=8) (actual time=112.500..2594.888 rows=2153 loops=1)
       Index Cond: (updated_at >= '2017-10-16 00:00:00'::timestamp without time zone)
       Filter: ((patient_id)::text = '{Flatiron patient identifier}'::text)
       Rows Removed by Filter: 6764524
  -> Index Scan using document_contents_pkey on document_contents (cost=0.11..2.32 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=2153)
       Index Cond: (id = document_metadatas.content_id)
       Filter: (requires_processing AND processing_version_id IS NULL)
       Rows Removed by Filter: 1
Planning time: 0.439 ms
Execution time: 2600.708 ms

^ The query plan we suddenly started using, resulting in degraded performance.

These query plans show that the planner switched which index it used for the scan of document_metadatas. Previously we had been scanning by patient, filtering first to all documents for the specified patients and then down to the recently updated ones. Now we were scanning by document update timestamp instead, filtering first to all recently updated documents across every patient in our network and only then down to the patients we were interested in.

This change of index is clearly the wrong choice. Even though only a small fraction of the patients in the Flatiron network receive new documents in any given week, the total number of documents updated across the whole network in a week vastly exceeds the number of documents any single patient accumulates over the entire course of their treatment. Since we run this query for only a small set of patients at a time, starting by filtering to all documents added or updated in the past week forces the database to do far more work than first filtering to the documents for the given patients.

Most likely, a recent automatic ANALYZE had shifted the table statistics just enough to make this plan look viable, and a fresh ANALYZE on the table might well have nudged the planner back to the original index. But we wanted to prevent this from ever happening again without requiring manual intervention, and we didn't want to add a new index on both patient and update timestamp just for this one query. Instead, we capitalized on the fact that Postgres will not use an index when you apply a function to the indexed column (unless you explicitly create an index on that expression).
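
For reference, the composite index we decided against would have looked something like this (the index name is ours, purely for illustration):

-- Hypothetical composite index on patient and update timestamp;
-- we decided against adding it for the sake of a single query.
CREATE INDEX document_metadatas_patient_id_updated_at_idx
    ON document_metadatas (patient_id, updated_at);

^ The composite index we opted not to create.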

SELECT
    dm.id
FROM
    document_metadatas dm
JOIN
    document_contents dc ON dc.id = dm.content_id
WHERE
    dm.patient_id IN ({List of Flatiron patient identifiers…}) AND
    dm.updated_at + INTERVAL '0 SECONDS' > now() - INTERVAL '7 DAYS' AND
    dc.requires_processing AND
    dc.processing_version_id IS NULL;

^ A slightly simplified version of the query we moved to using.

Since we are now transforming the update timestamp in our query (adding 0 seconds, which is effectively an identity function), Postgres no longer considers the index on that timestamp viable. This prevents this type of performance degradation from cropping up unexpectedly in the future and ensures that our unstructured documents remain useful. The pattern can be applied to any query where the planner sometimes chooses the wrong index: applying an identity function to a column (appending the empty string to a string, adding 0 to a number, etc.) will stop the planner from using a standard index on that column.
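
For example, if the planner keeps choosing an unhelpful index on a text column, the same trick looks like this (the table and column names here are hypothetical, purely for illustration):

-- Hypothetical example: appending the empty string to customer_id is an
-- identity transformation, but it prevents the planner from using a plain
-- index on that column, so it must pick a different access path.
SELECT *
FROM orders o
WHERE o.customer_id || '' = 'CUST-123'
  AND o.created_at > now() - INTERVAL '30 DAYS';

^ The same identity-function trick applied to a text column.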