BigQuery: Tell me your region, I will tell you your speed!
Google Cloud, like any cloud provider, keeps expanding and opening new regions (see the history on Wikipedia). Today, 28 regions are up and running, and 6 more are coming.
A new datacenter means installing and configuring new hardware. That hardware then lives on for years, until it becomes obsolete and is replaced.
And here comes my question:
The new regions are deployed with up-to-date hardware, while the older regions run older hardware.
So, is the performance equal in all regions?
Especially when you don't choose the underlying hardware, as with serverless products. Let's run a test with BigQuery!
Separation of storage and processing
BigQuery's design separates storage from processing. The processing unit is called a "slot", a slice of CPU and memory.
The goal here is to test slot performance, not storage throughput.
For that, I chose to query less than 10 MB of data (1 million rows containing only numbers) and to perform compute-intensive mathematical operations (cos, sin, tan, log, square root, exp, …) on every number of every row.
The test deploys the same data into a different dataset, one per region, and runs the same query in each. At the end, the query duration for each region is logged.
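The protocol above can be sketched as follows. The dataset names, the query, and the `run_query` callback are illustrative assumptions, not the exact code from the repository:

```python
import time

# Illustrative regional dataset names (assumptions, not the real ones).
REGIONAL_DATASETS = ["bench_us_west4", "bench_europe_west2", "bench_asia_northeast1"]

def benchmark_query(dataset: str) -> str:
    """Build a compute-heavy query: math functions applied to every
    number of every row, so slot CPU (not storage throughput)
    dominates the runtime."""
    return f"""
        SELECT SUM(COS(val) + SIN(val) + TAN(val)
                   + LOG(ABS(val) + 1) + SQRT(ABS(val)))
        FROM `{dataset}.numbers`
    """

def time_region(dataset: str, run_query) -> float:
    """Run the same query against one regional dataset and return
    the elapsed duration in milliseconds. `run_query` would be a
    BigQuery client call in the real benchmark."""
    start = time.perf_counter()
    run_query(benchmark_query(dataset))
    return (time.perf_counter() - start) * 1000
```

In the real benchmark, the BigQuery job statistics can also report the duration directly, which avoids including client-side latency in the measurement.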
You can find and reproduce the benchmark yourself with the code in my GitHub repository.
I ran the query 3 times on the same data and got the results that you can see below.
I added the average processing time in the last column and sorted the results from the fastest to the slowest region. (The durations are in ms.)
As you can see, some regions are 80% slower than the fastest ones!
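The slowdown figure is simply the relative difference between average durations. The numbers below are illustrative, not my measured values:

```python
def slowdown_pct(duration_ms: float, fastest_ms: float) -> float:
    """How much slower a region is, relative to the fastest one."""
    return (duration_ms / fastest_ms - 1) * 100

# Illustrative durations: a region averaging 9000 ms against a
# fastest region at 5000 ms is 80% slower.
print(slowdown_pct(9000, 5000))  # → 80.0
```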
Performance and region age
My assumption was that the older regions run older hardware (several CPU generations behind the most recent ones). Let's add the region opening dates.
This table shows the opening date, when known, and the age from now (Q3 2021) in quarters, sorted by age.
Regions for which I couldn't find an opening date are marked as Legacy.
Clearly, there is a correlation between the datacenter age and the query performance.
The older, the slower!
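This correlation can be quantified with a Pearson coefficient. The ages and durations below are made-up placeholders for illustration, not my measured values:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (no external libraries)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: region age in quarters vs. average duration in ms.
ages = [2, 6, 10, 14, 18]
durations = [5100, 5600, 6800, 7900, 8800]
print(round(pearson(ages, durations), 2))  # close to 1.0: older → slower
```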
There are 2 exceptions to this "rule":
- europe-west2 (London), opened in Q2 2017, yet with the top result of the performance test.
This good performance can be explained by the fact that BigQuery was only added to this region in Q4 2018.
- europe-west4 (Netherlands), not that old (3 and a half years), yet among the 3 worst performers.
I haven't found any plausible explanation. Suggestions welcome!
Additional insights for a wise choice
I sorted the information by cost per GB processed (on-demand analysis) and by geographical area.
There is no case where performance is poor and the cost high. Performance is, more or less, correlated with the cost.
However, with this new table, some valuable choices can be made:
- In the US, us-west4 (Las Vegas) is clearly the fastest and the cheapest.
If green energy matters to you, northamerica-northeast1 (Montreal) is for you, even if it's 5% more expensive.
- In Asia, there are no green regions.
In the north, all the northeast regions offer a good performance/cost ratio.
In the south, asia-southeast2 (Jakarta) is very attractive.
- In Europe, europe-north1 (Finland) is clearly the best balance between cost, performance, and CO2 impact.
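One way to weigh cost against performance for time-sensitive workloads is an effective score: the on-demand price scaled by each region's relative runtime. The region names, prices, and ratios below are placeholders, not Google's actual pricing:

```python
# Hypothetical regions: (name, on-demand price unit, runtime relative
# to the fastest region). All values are illustrative only.
regions = [
    ("region-a", 5.00, 1.00),
    ("region-b", 5.00, 1.50),
    ("region-c", 6.00, 1.10),
]

# Weight the price by the relative runtime to get a single
# comparable score (lower is better for time-sensitive queries).
scored = sorted(regions, key=lambda r: r[1] * r[2])
for name, price, ratio in scored:
    print(name, round(price * ratio, 2))
```

With these made-up numbers, a cheap but slow region can score worse than a slightly more expensive fast one, which is the kind of trade-off the table above lets you evaluate.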
Make the right choice
Making the right choice is always difficult. Many different parameters have to be taken into account:
- Availability (and the multi-region requirement)
- Data residency
- Service latency (being closest to the user) and the inter-region traffic needed to move data into the fastest region for analysis
- Cost (inter-region traffic, and the premium some regions charge for better performance)
- Low CO2 emission (Green IT)
- Performance and time-sensitive queries.
However, time is money and, in some situations, if you have queries that run for 1 hour in a "slow" region, you could save up to 20 minutes simply by switching regions!
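The savings figure is simple arithmetic: if a region is, say, 50% slower than the fastest one, a query taking 60 minutes there drops to 40 minutes after the move. The 50% figure here is an example; plug in your own number from the benchmark table:

```python
def minutes_after_move(slow_minutes: float, slowdown_pct: float) -> float:
    """Runtime in the fast region, given the slow region's runtime
    and how much slower it is (e.g. 50 for '50% slower')."""
    return slow_minutes / (1 + slowdown_pct / 100)

# A 1-hour query in a region that is 50% slower than the fastest:
print(minutes_after_move(60, 50))  # → 40.0 minutes, i.e. 20 minutes saved
```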
The right choice depends on you and your context, but it’s an additional insight that you can use.
Limits of the analysis
I can't claim that this analysis is absolutely correct. I performed it with my personal account and, maybe, corporate accounts have different resource provisioning.
In addition, I only performed an "on-demand analysis" test. I haven't reserved slots, nor used BigQuery ML to perform a similar analysis.
Anyway, this approach is interesting if your queries are performance-sensitive.