ClickHouse was released in June 2016; we had already gone with BigQuery by then, so we did not consider it.
I don’t know much about ClickHouse and could not find any performance benchmarks comparing it with BigQuery. Do you think it’s comparable? Why do you think it would require less pre-computing?
This page explains the different ways of loading data into BigQuery.
If I’m not mistaken, you have 3 “free” options that use load jobs:
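For illustration, here is a minimal sketch of one such option — a batch load job from Cloud Storage using the Python client. The bucket, dataset, and table names are hypothetical, not taken from the article:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Load jobs themselves are free of charge; you only pay for storage afterwards.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the files
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/2018-06-01/*.json",  # hypothetical bucket/path
    "my_dataset.raw_events",                    # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # blocks until the load job completes
```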
The Analytics Service (see the schema at the end of the article) is a rather central piece: it holds some business logic (that would be too complex to model as SQL) and also aggregates other data sources (not only BigQuery) to enrich the data with information that is not present in the logs. It lets us generate Excel…
The 10 billion tracking events a day the article focuses on are handled by a single n1-highcpu-16 Dataflow instance.
Those events are all stored in a single table (4-month retention) that contains about 1 trillion events, for a size close to 250TB. That’s about 2TB a day.
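If it helps to picture the setup: a retention policy like this is typically implemented with a day-partitioned table whose old partitions expire automatically. A minimal sketch with the Python client, assuming hypothetical project, dataset, and table names (not our actual ones):

```python
from google.cloud import bigquery

client = bigquery.Client()

# A day-partitioned table whose partitions expire after ~120 days,
# giving roughly the 4-month retention mentioned above.
table = bigquery.Table("my-project.my_dataset.raw_events")  # hypothetical names
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    expiration_ms=120 * 24 * 60 * 60 * 1000,  # ~4 months, in milliseconds
)
client.create_table(table)
```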
Sure, but we deliberately chose to keep everything on BigQuery; it has some benefits too.
Unlike traditional ETL processes, where data is Transformed before it is Loaded, we chose to store it first (ELT), in a raw format.
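Concretely, the “T” then happens inside BigQuery, as SQL over the raw table. A hypothetical aggregation to give the idea (table and column names are made up, not our actual schema):

```python
from google.cloud import bigquery

client = bigquery.Client()

# The raw events were loaded untouched; transformations run later as SQL
# and can be materialized into a derived table.
query = """
    SELECT
      DATE(event_time) AS day,
      event_type,
      COUNT(*) AS events
    FROM `my_dataset.raw_events`
    GROUP BY day, event_type
"""
config = bigquery.QueryJobConfig(write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
config.destination = bigquery.TableReference.from_string("my-project.my_dataset.daily_counts")
client.query(query, job_config=config).result()  # waits for the query to finish
```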
Using managed infrastructure such as BigQuery comes with some constraints but saves us a lot of time and effort too. Managing a Druid cluster at this scale would probably come close to a full-time job.
I clearly think BigQuery is superior to Redshift for our use case (but again, my experience with Redshift involves 1000x less data). BigQuery is the only reason we moved part of our infrastructure to GCP. We don’t really use the most powerful features of Dataflow, and you’re right, all the rest is extra cost.
Druid was our second choice. The main reasons we chose BigQuery were: