Benchmarking Google BigQuery at Scale
By Robert Barton
King is currently attending the #GoogleNext18 conference at ExCeL London, and below is an article from one of our Senior Big Data Platform Engineers about the migration of our Data Warehouse to BigQuery on Google Cloud.
At King, data helps us understand the full picture of our games, from prototype to fully live product, and how our hundreds of millions of monthly active users are engaging with our games. On any given day, more than a hundred data scientists run thousands of queries, working across many areas of the business to bring valuable insights that help drive rapid decisions.
It is the Core Data Services team that provides this data to our stakeholders across the business. Historically, we have served data using a petabyte-scale on-premise Data Warehouse built on top of the Hadoop ecosystem. This came with a number of challenges, ranging from operational overheads and the scalability of applications and services running on the platform, all the way through to users having to choose between multiple SQL frameworks to access data depending on their particular use case, rather than having a convenient “single pane of glass” solution.
These challenges, combined with the opportunity to utilise other Google Cloud based services such as ML Engine, prompted us to carry out an extensive investigation into whether or not BigQuery could act as a suitable replacement for our Data Warehouse. In this article, I will give an overview of our historical on-premise systems, our experience benchmarking them against BigQuery, and the results that led us to commit to migrating our Data Warehouse to BigQuery on Google Cloud, a project which, at the time of writing, is ongoing.
On-Prem Data Warehouse Architecture
We run a classic Hadoop data warehouse architecture, using mainly Hive and Impala for running SQL queries. This Hadoop cluster runs in our own data centre and consists of approximately 400 nodes, each with 150 GB of RAM and 20 CPU cores. It is complemented by an MPP database, which holds a recent subset of our data and provides users with extremely fast ad-hoc and ETL query times. This cluster also runs in our own data centre and consists of 64 nodes in total, each with 757 GB of RAM and 48 CPU cores.
In this batch architecture, logs are ingested from a log server into HDFS, in their raw TSV format. From there, logs are processed into Hive tables, which are sorted by event type. We then have a daily ETL which runs overnight to process these events and organise them further into Hive and our MPP database to form our core data model. This model is then queried by Analysts and Data Scientists and is also the basis for our daily KPI emails, which enable our senior management team to make data-driven decisions.
As an ad-hoc SQL engine, we run Impala on our Hadoop cluster, allowing faster analysis for our users when they are exploring the data that resides on the cluster. We reserve a small portion of the cluster’s resources for Impala and cap it at ten concurrent queries to ensure a consistent user experience. No ETL processes run on Impala; those are reserved for Hive, due to its resilience and concurrency, and for our MPP database, due to its speed and reasonable concurrency.
During the day, Hive executes anywhere between 10,000 and 15,000 queries, while our MPP database executes between 15,000 and 25,000, all of which are a mix of scheduled and ad-hoc jobs.
Choosing Queries and Datasets to Benchmark
The first thing we needed to decide when planning for the benchmark tests was what queries and datasets we should test with. It was essential that these accurately reflect the data that is accessed often and how our users currently work with that data.
To do this we started by looking at the queries that run on each of our on-premise systems. As a starting point, we extracted a list of queries and datasets from each system based on a set of criteria: most frequently executed, longest running, fastest, most syntactically complex, largest number of joins, and widest date ranges (which for us means the largest number of partitions in Hive and Impala). This gave us a good batch of queries that represented a snapshot of how our systems are actually used.
We were able to extract the queries from our monitoring systems, where we track what SQL was run, when and where, so pulling them out was a matter of running relatively simple SQL queries. From there, we categorised the queries into twelve different groups, which allowed us to keep our benchmarks simple, consistent and balanced. The categories were as follows:
Of the twelve groups, only ten were actually populated, as there were very few or no queries that fell into the medium and large “scan only with multiple joins” categories. Query size was categorised by the number of rows scanned, which we calculated using a Python script that took a row count based on the tables and partitions specified in each query. This categorisation was done for each system, which meant that, after tweaking syntax to ensure the majority of these queries ran successfully across all three on-premise systems, we could benchmark the queries against each other too, not just against BigQuery.
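As an illustration, the categorisation logic can be sketched roughly as follows. The row-count thresholds, group names and function names here are invented for the example; the real script derived its row counts from table and partition metadata.

```python
from bisect import bisect_right

# Invented thresholds: the real cut-offs were tuned to our own data.
SIZE_BOUNDS = [10_000_000, 1_000_000_000]  # small < 10M rows <= medium < 1B rows <= large
SIZE_NAMES = ["small", "medium", "large"]

def size_category(rows_scanned):
    """Bucket a query by the number of rows it scans."""
    return SIZE_NAMES[bisect_right(SIZE_BOUNDS, rows_scanned)]

def categorise(rows_scanned, join_count):
    """Combine scan size with join complexity into a benchmark group name."""
    if join_count == 0:
        shape = "scan only"
    elif join_count == 1:
        shape = "single join"
    else:
        shape = "multiple joins"
    return f"{size_category(rows_scanned)} {shape}"
```

Categorising every extracted query this way makes it trivial to balance the final benchmark set across the groups.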
Now that we had a big set of queries based on actual usage, we could analyse what tables they were using and select a reasonable amount of these to load into GCS and BigQuery. This was again done with another Python script that parsed all of the queries to extract the table names, fetch their size in HDFS and return a count for how many of our test queries use each table. With this data, we were quickly and easily able to select a good variety of tables (big and small, partitioned and non-partitioned, fact and dimension, etc.), whilst controlling the amount of data that we required to be loaded into BigQuery.
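The table-extraction step can be sketched like this. It is a deliberately crude regex approach for illustration only; a real script would also need to handle CTE names, subquery aliases and quoted identifiers, and would go on to look up each table's size in HDFS.

```python
import re
from collections import Counter

# Matches the identifier after FROM or JOIN; good enough for ranking table
# usage across a benchmark sample, but it will miscount CTEs and aliases.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)

def tables_in(sql):
    """Return the set of table names referenced by one query."""
    return set(TABLE_RE.findall(sql))

def table_usage(queries):
    """Count how many of the test queries reference each table."""
    counts = Counter()
    for sql in queries:
        counts.update(tables_in(sql))
    return counts
```

Sorting the resulting counts then gives a natural priority order for which tables to load first.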
We ended up with approximately 80 TB of data to play with, which is the compressed size achieved using the Parquet, Text and RC file formats with gzip compression (in our use case, this expands roughly sevenfold when loaded into BigQuery). This 80 TB represented one trillion rows across 50 tables. Test queries that referenced tables outside of this list were dropped, leaving us with an overall set of 50 test queries that ran across all three on-premise systems and that, after some syntax tweaks, we could also run on BigQuery.
Getting the Test Data into BigQuery
At the time the benchmarks were conducted in 2017, Google BigQuery supported three main file types for loads (Avro, CSV and JSON), which meant that if your data was not already in one of those formats it would need to be converted (BigQuery load now, of course, also supports the Parquet file format). Luckily, there were a couple of ways to do this: either write your data out to newly formatted Hive tables, or use Spark and the Databricks Avro package; both methods can be run either on your own cluster or in GCP by spinning up a Dataproc cluster.
We decided to write a simple Spark job to run on our own cluster that reads either a Parquet file or a Hive table into a Spark DataFrame and then writes that DataFrame into HDFS in Avro format with a deflate compression codec, honouring the partitioning directory structure if there was any. We ran this Spark job across all of our Benchmark data so we ended up with an Avro copy of it all that we could then copy over to GCS.
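A minimal sketch of such a conversion job, assuming the Databricks spark-avro package (com.databricks:spark-avro) is on the classpath, might look like the following. The path-based partition-column helper is our own illustrative addition, as are all the names here.

```python
def partition_cols_from_path(path):
    """Infer partition column names from Hive-style directory names,
    e.g. '/warehouse/events/dt=2017-01-01/hour=12' -> ['dt', 'hour']."""
    return [seg.split("=", 1)[0] for seg in path.split("/") if "=" in seg]

def convert_to_avro(src_path, dst_path, partition_cols=()):
    """Read a Parquet dataset into a DataFrame and rewrite it as
    deflate-compressed Avro, preserving any partition directory structure.

    Requires pyspark plus the spark-avro package, e.g.
    spark-submit --packages com.databricks:spark-avro_2.11:3.2.0.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-to-avro").getOrCreate()
    spark.conf.set("spark.sql.avro.compression.codec", "deflate")

    df = spark.read.parquet(src_path)
    writer = df.write.format("com.databricks.spark.avro").mode("overwrite")
    if partition_cols:
        # Re-emit the same dt=.../hour=... directory layout on the way out.
        writer = writer.partitionBy(*partition_cols)
    writer.save(dst_path)
```

Running this over each benchmark table produces the Avro copy that is then shipped to GCS.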
With the data now converted to Avro format, we needed to then copy this over to GCS, which we did quite simply using a bash script that called the Hadoop distcp framework and the GCS connector provided by Google.
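In outline, the copy step amounts to little more than building one distcp invocation per dataset; a hypothetical Python wrapper (the mapper count and bucket name are made up for the example) could look like:

```python
import subprocess

def distcp_to_gcs(hdfs_path, bucket, mappers=64, execute=False):
    """Build (and optionally run) a Hadoop distcp command copying an HDFS
    directory to GCS. Assumes the GCS connector is on the Hadoop classpath
    so that the gs:// scheme resolves."""
    cmd = ["hadoop", "distcp", "-m", str(mappers),
           hdfs_path, f"gs://{bucket}{hdfs_path}"]
    if execute:
        subprocess.run(cmd, check=True)
    return cmd
```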
To load the data into BigQuery from GCS, we needed to put together a tool. In this instance, we used a bash script that called the gsutil and bq command line clients to create each BigQuery table (partitioned or non-partitioned), load the Avro data into BigQuery and then mark the table or partition as copied. One observation from this process at the time was that the bq load utility could only load one partition at a time, which led to extremely slow ingestion for tables with many partitions. We ended up loading approximately 20 partitions in parallel (one table at a time), which seemed to be the sweet spot performance-wise, given how BigQuery balances this load traffic. Non-partitioned tables, on the other hand, loaded very swiftly, with around 8 TB handled in approximately 30 minutes.
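The partition-at-a-time behaviour can be sketched as follows: one bq load command per partition, addressed via the table$YYYYMMDD partition decorator, grouped into batches of roughly 20 to run in parallel. The dataset, bucket and GCS path layout here are invented for the example.

```python
from itertools import islice

def bq_load_cmds(dataset, table, partitions, bucket):
    """Yield one 'bq load' command per partition; the pseudo _PARTITIONTIME
    era required loading a single partition at a time via the $ decorator."""
    for p in partitions:  # p like '20170101'
        yield [
            "bq", "load", "--source_format=AVRO",
            f"{dataset}.{table}${p}",
            f"gs://{bucket}/{table}/dt={p}/*.avro",
        ]

def batches(cmds, size=20):
    """Group commands into batches of ~20 to run concurrently, which was
    the parallelism sweet spot we observed."""
    it = iter(cmds)
    while batch := list(islice(it, size)):
        yield batch
```

Each batch can then be dispatched with subprocess or GNU parallel, waiting for one batch to finish before starting the next.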
These drawbacks no longer apply when using the (now not so new) DAY partitioning feature (rather than the historic pseudo _PARTITIONTIME method), which does support loads into multiple partitions.
Automated Benchmarks and the Results
Now that we had our desired test data and knew what queries we wanted to run, we could start running the benchmarks themselves. To do this, we used a Test Harness written in Java that allowed us to configure and run these test queries at scale and record the concurrency and query time of each execution. The test harness could connect to Hive, Impala, our MPP database and BigQuery, and orchestrated the configured tests one system at a time. Once the test harness finished running, it produced a CSV file, the contents of which we were then able to visualise in Looker. The parallelism was configurable, so we were able to run a variety of different tests.
It’s worth mentioning at this point that for this set of benchmarks we had available to us a reserved pool of 2,000 slots in BigQuery with the potential to burst up to 4,000.
Initial Individual Query Validation Tests
The first set of tests we ran were simple validation tests with low concurrency, to see how the queries performed on a live system (our on-premise systems) or an idle one (BigQuery). Each system had all of the queries run against it once at low concurrency. The idea was to verify that the queries would definitely work on BigQuery, to give us the opportunity to correct any syntactical issues, and to give us an initial view of general query speeds across the systems.
The biggest change we had to make syntactically was converting our partition columns from the column name into “_PARTITIONTIME AS column_name”, along with a few minor function syntax changes between systems, which is pretty normal when migrating SQL. We were actually able to do the majority of the _PARTITIONTIME changes using a Python regex script, so this fairly tedious step took less time than expected. As mentioned above, this is no longer a concern in BigQuery with the newer DAY partitioning method, which does not rely on the pseudo _PARTITIONTIME column.
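The predicate half of that conversion can be sketched with a small regex along these lines. The partition column name and date-literal format are assumptions for the example, and the select-list “_PARTITIONTIME AS column_name” rewrite needs similar but separate handling.

```python
import re

def to_partitiontime(sql, col="event_date"):
    """Rewrite date-literal predicates on a partition column to use
    BigQuery's legacy _PARTITIONTIME pseudo column, e.g.
    WHERE event_date = '2017-01-01'
      -> WHERE _PARTITIONTIME = TIMESTAMP('2017-01-01')."""
    pattern = rf"\b{col}\s*(=|<=|>=|<|>)\s*'(\d{{4}}-\d{{2}}-\d{{2}})'"
    return re.sub(pattern, r"_PARTITIONTIME \1 TIMESTAMP('\2')", sql)
```

Run over the whole query set, a script like this handles the bulk of the mechanical edits, leaving only the odd hand fix.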
These initial tests were run over 10 iterations and results were presented as an average. The expectation was that Hive would be the slowest by some margin, then Impala and BigQuery would be close and our MPP database would be fastest. The results for the first test are as follows:
For the vast majority of the queries, one system was by far the slowest, with a few exceptions. These exceptions were mostly pure scans, with the others being queries that were not fully optimised for the offending system. It is worth noting that all queries were taken exactly as they were originally executed on our systems and were not tuned for any system, so any large and unexpected peaks and troughs can most likely be tuned away. For example, in one extreme case, a poorly performing BigQuery query from the results was re-run independently with its join order rearranged so that the largest table was on the left; the number of rows processed dropped, and the run time went from 201 minutes to two minutes, while still producing the same result.
From the above graph, we can see that our hypothesis was correct in most instances.
Benchmarking with Controlled Concurrency
From the previous tests, we were able to confidently identify the slowest performing system of the four. Because of this, and because the majority of our production work runs on Hive 24/7, we deemed it safe to exclude that system from our concurrency tests.
The next step was to test the systems with differing levels of concurrency. We ran incrementally increasing tests at concurrencies of 5, 10, 15, 30, 50 and (on BigQuery only) 100, to really put each system under pressure and record how it performed. The test harness was configured with the desired degree of parallelism, and each query in the test set ran back to back over a given time period, which was 90 minutes.
Each execution of each query was recorded in the result set, so we were able to extract an average execution time per query and per degree of parallelism. The query sets were randomly selected from the set above to give a fair and realistic representation of our real usage, and the same queries were used for each system. Due to the scale of our Impala implementation, we were only able to ramp its concurrency up to 15, as our cluster simply doesn't have the resources to go any higher. The concurrency tests on Impala and our MPP database were run during a company conference, when fewer than three other queries were running on each system and we could control the parallelism; this wasn't achievable with Hive.
What we observed was that at a concurrency of five, our MPP database performed on average five times faster than BigQuery and four times faster than Impala. The results for this test are below.
In the concurrency-of-ten tests, Impala and BigQuery performed very similarly on average, with our MPP database approximately four times faster than both. At a concurrency of 15, Impala really drops off, more than doubling its execution time; again, this is due to the amount of resource available to it. We then continued to ramp the degree of parallelism up to 50 for both BigQuery and our MPP database, and once more to 100 for BigQuery alone, as our MPP database starts to encounter issues at that level of concurrency. The full concurrency results are below.
BigQuery shows the flattest concurrency degradation line of all the systems. Our on-premise systems, Impala and our MPP database, perform well within their scope and provide impressive query times, but those times degrade as concurrency increases.
*It should be noted that BigQuery has since seen a number of improvements, such as clustering and a roughly 5x performance improvement on the 10 TB TPC-DS benchmark, so these tests would look different today.
Results-wise, BigQuery sits exactly where we expected it to. Its advantages over our current Data Warehouse are that it brings all the desirable benefits of the public cloud, along with the chance to unify our SQL analytics offering.
It is this consistent line of degradation for BigQuery on the graph that demonstrates the scale at which BigQuery is able to operate. As we ramp up usage and concurrency, we would expect this line to hold its shape (depending on the size of your flat-rate slot reservation), given the elasticity that comes from BigQuery being a managed service running in the cloud. It is these findings and experiences, coupled with the desirable benefits of the cloud and the chance to unify our SQL analytics offering, that led us to the decision to migrate King's Data Warehouse to Google Cloud Platform.