I was asked to speak at the Postgresql User Group in Paris recently, and I chose to talk about materialized view (aka MatView), as they saved our production a few months ago.
I’ll present the materialized view usage with a problem we had at JobTeaser.
Our use case
My team and I are responsible for (out of many) providing an API to display stats on our back office reporting tools about job offer performance :
- number of views,
- number of unique views
- number of candidacies
These reports are used all day long by companies posting job offers.
Our workflow looks pretty much like this :
We had two problems:
1. we receive near real-time data, but the database is refreshed on a daily basis, this can be deceptive for our users.
2. we also faced performance issues on these API endpoints (>1sec response time). Although this response time is normal considering the table structure below, we do not want to make our users unhappy:
With this table (~30millions rows), retrieving the unique view count, means computing a distinct on student_id per job_offer_id. This table is also very generic, as it wants to fit all our needs about job offers’ performance: for example, shool_id and created_at are useless in the current use case, but useful for another one.
Our first (naive) idea was to build an agregation table, storing performance data for each job offer (target :<1 million rows):
We thought about refreshing this table by deleting rows that have been updated, and inserting new versions each 2 hours. Bad idea.
The aggregate table is used by the API, which means that there are read access locks on this table. If lock, then wait for insertion… and in this case, wait for a while. Plus this lock accumulation generates some performance issues.
Being stuck there, we dig in the whole internet and in the PG documentation, and found those materialized view
What’s a materialized view ?
Creation of Materialized View is an extension, available since Postgresql 9.3.
MatViews are widely available in other RDBMS such as Oracle, or SQL Server since longtime.
A MatView is in between a view and a table. Basically it’s built with a query refering to one or more tables, and the results are stored physically, making it acting like a cache.
This is the main difference with a simple view, which queries it’s source each time you call it. This means that you may wait for a while before getting your result.
A MatView can be used like a regular table, for example, you can add indexes or primary key on it, it supports
ANALYZEcommands (useful when you refresh it often), you can even create it without data, just with the definition query.
Bonus point: you can mess up your source table, your end-user won’t notice it before the refresh.
Our MatView-oriented solution:
Creating the materialized view
A materialized view creation looks like the creation of a view or a
CREATE TABLE AS instruction:
Piece of cake.
Refreshing the materialized view
Now, if the source table
views is updated and you want your materialized view to take those updates in account, you’ll must refresh it manually:
REFRESH MATERIALIZE VIEW job_offer_views_mv;
Well, that’s really sad ! Other RDBMS can do this automatically, and with simple view, the results would have been up to date (but if your query is heavy, you don’t want a simple view).
So it’s either you accept some decrepencies in your data or, you can use a trigger to refresh your MatView when the source is updated (not detailed here, maybe in a future post).
We were happy, this works well in staging, but …
The trick: refresh materialized view CONCURRENTLY
There’s not a lot of users in our staging environment, because of that, we missed two things:
- users generates read locks, so refreshing the MatView can take a while,
- refreshing the MatView locks any new read from users, making our API pretty slow
Actually, the ‘basic’ refresh is useful, and fast if the table is not used often (like several times a day). Otherwise, use refresh concurrently.
But beware! To use the refresh concurrently, you must define at least one unique index on your materialized view.
This is obvious regarding the way the refresh concurrently works. It refreshes the rows without locking concurrent select, so it needs to identify ‘free’ rows. It may take longer than the simple refresh, but it won’t bother your users.
That’s it ! Production is now safe, refresh every 2hours and these endpoints are pretty fast (< 100 ms).
MatViews are great. Use them if you :
- want to cache results of a heavy query
- don’t have ‘real’ real time (<200 ms) constraints
Don’t forget to use the concurrently option during your refresh.
And you dear reader, have you worked with MatView ? Any experience or best practice to share?