Google BigQuery is an important partner for your enterprise data warehouse workloads — serverless, highly scalable, and cost-effective. Understanding BigQuery performance in terms of slot usage is critical for smarter business decisions. You may have already seen a guide on how to perform slot usage analysis through system tables, Stackdriver, and audit logs.
In this post, you will learn how to combine insights from system tables (made available by INFORMATION_SCHEMA views) with the power of Python so you can deep dive into job execution and take control of performance. Specifically, I will walk you through an investigation of slot resource contention, a common cause of slower BigQuery jobs.
Step 1: Gathering basic slowness information
It’s Monday morning and I notice that a recurring query job appears to be taking more time than usual. As one of the BigQuery administrators in my organization, I am able to locate and note the id of the problematic job and the administration project id. First, I want to gather basic slowness information. I will query, save, and search the JOBS_BY_ORGANIZATION system table as a Pandas dataframe.
Note: authentication and import statements are omitted in the following code snippets for brevity and are included in the supplemental syzz-1 Colab notebook.
In the resulting table, I can view the most important metrics for the problematic job. In particular, I will take note of the project id, reservation id, job duration, and slot utilization.
Step 2: Verifying whether the job is performing out of normal bounds
With the job’s summary information on hand, I want to verify whether the job is executing slower than usual. Since I am investigating a recurring job, I want to identify past jobs by matching query text. I will use the JOBS_BY_PROJECT system table with a new BigQuery client in order to retrieve data from the problematic job’s project, which is different from the administration project I used for the organizational view. Then, I can visualize job durations with a histogram.
Figure 2 (above) shows me the distribution of job durations for jobs similar to the problematic job, not including the problematic job itself. The Y-axis is count of jobs and the X-axis is job duration in seconds. For example, I see that only one similar job had a runtime of 3200 seconds. Keep in mind that we defined similar jobs as those that have similar queries to the problematic job’s query.
Step 3: Checking for slot resource contention
I recall that the job_duration_seconds from the problematic job’s summary above was 3515 seconds. Comparing this value to the histogram, it is clear to me that the job in question is running slower than expected, so I will loop back and dig deeper into the JOBS_BY_ORGANIZATION system table. I can manipulate the data frame directly and create a bar chart of concurrent queries during the problematic job’s execution time. In this figure, the Y-axis is count of jobs and the X-axis is the minute value from the timestamp.
Looking at the plot, I see that there are many concurrent queries running from a different project over the same time period as the problematic job’s execution time. I also know from the console that the projects are assigned to the same reservation. My conclusion is that jobs are fighting for slots and causing slower run times.
I have several options to mitigate this contention: decrease the number of concurrent queries, increase slot capacities, or load balance queries among different reservations. After implementing one of these workarounds, I will monitor job performance next week to ensure that the incident is resolved.
Conclusion & Next Steps
The system tables, exposed through INFORMATION_SCHEMA, contain valuable information about BigQuery performance. In the case above, they helped identify slot resource contention with other projects as the root cause of job slowdowns. Other common root causes include slot contention within the project, slot contention on idle slots, SQL anti-patterns, and waiting for quota.
You have just taken your first steps into debugging BigQuery performance with system tables and Python. All of the code used in this post is documented in my public syzz-1 Colab notebook. Copy, tailor, and extend the functionality for your environment at will.
For more information about similar projects, please explore the following: