Why you should use Postgres over MySQL for analytics purpose
(this post originally published on the Holistics Blog)
Being an analytics/reporting product company, we’ve worked with a fair amount of customers who run Postgres/Redshift for their analytics, and the same fair amount who uses MySQL.
And from our experience with both, I’d prefer our new customers to use Postgres all the time. For reporting/analytics/data-warehouse purpose, Postgres wins over MySQL hands-down.
In this post I lay down a few practical reasons why working with Postgres is so much better than MySQL from a data analyst perspective.
1- MySQL doesn’t support CTE.
Unlike production application where your queries should be as simple and straight-forward as possible, queries for analytics are usually quite complex. And very often you need to build intermediate resultset to be used multiple times across your queries.
The query below in Postgres will get the employed people together with their supervisor.
with employees as (
first_name || ' ' || last_name as full_name,
where person_type = 'employed'
E2.id as supervisor_id
E2.full_name as supervisor_full_name
from employees E1
left join employees E2 ON E1.supervisor_id = E2.id
With MySQL, the query will need to be more verbose and difficult to maintain, because you have to repeat employees into subquery. See the difference.
2- MySQL doesn’t support window functions
I can’t stress enough how useful this is when it comes to doing analytics. Window functions bring Postgres’ capabilities to a whole new level, allowing a lot of complex, sophisticated requests to be done. There have been countless times that I need to use window functions to do a certain report.
3- MySQL doesn’t support schema (namespacing)
Your analytics database don’t usually contain just data from one source. You want to pull multiple data sources into one place. Things like click-stream events (in Hadoop), 3rd-party data sources like Adwords or Google Analytics, multiple production databases (when your apps are doing service-oriented architecture).
When these happen, namespacing your data from different data sources become a must-do. And MySQL doesn’t have an elegant way to do it! I’ve seen people doing it the prefix way
It’s like having one folder with over 100 files, and no subfolders. It’s difficult to manage, it’s overwhelming, and there’s no native way to work with all tables from the same data source.
Postgres saves you all these pains with schemas, so that you can put data in different data sources into one dedicated schema.
Update: Edwin pointed out in the comment that my MySQL I can do cross-database joins, thus getting the same functionality of Postgres’ schemas. While this is posible, it doesn’t feel right to be to maintain different databases just for the sake of categorizing them.
4- Postgres has better interface when dealing with datetimes
Dealing with dates/datetimes is a major topic when doing analysis. With my experience so far, Postgres has a far more consistent interface over MySQL.
Take a simple example of converting a timestamp field to day, week (first day of week), and month (first date of month).
- DATE_TRUNC(ts, ‘DAY’)
- DATE_TRUNC(ts, ‘WEEK’)
- DATE_TRUNC(ts, ‘MONTH’)
- DATE_ADD(ts, INTERVAL (1-DAYOFWEEK(ts) DAY))
5- Other smaller things:
I try to make this post sort, so below are some smaller things in Postgres is better over MySQL, that I find immensely useful.
MySQL doesn’t support full outer join: I know this might not be a big deal, but it is a big deal when you need it and it isn’t there. Learn more why here.
Wrong group by doesn’t throw error in MySQL: This query will give an error in Postgres, but not in MySQL (it’ll give strange results). I’d rather get an error thrown at me, than getting a wrong result.
# error in Postgres, and give wrong results in MySQL!
SELECT listing_type, count(1) FROM listings
Postgres supports VALUES for manual values list: In Postgres to generate a manual table, I can use VALUES, with MySQL, I have to use UNION ALL
('2015-01-01', 100, 200),
('2015-01-02', 200, 400),
('2015-01-03', 300, 600)
select '2015-01-01', 100, 200 union all
select '2015-01-02', 200, 400 union all
select 2015-01-03', 300, 600
Postgres has generate_series This has helped me countless time from generating number series, sample data, and date ranges!
Postgres has materialized views, to help you with pre-calculating data (complex queries) so that querying them will be quicker later on.
MySQL is indeed the most popular database, and is a fine and battle-tested one. But over the past few years, Postgres has taken over and become my favourite database, both for building production applications, and doing data analytics. I do hope MySQL will catch up with the missing features, so that it’ll make some (if not a lot) of our lives easier.
The above list is nowhere near exhaustive, so feel free to share in the comments if you have anything to add.
— — — — — — — — — — — —
Having problems finding a simple and affordable data reporting system for your startups? Check us out at holistics.io.