Managing a BigQuery data warehouse at scale
In previous articles, we described how our teams are ingesting data and how it is funneled to our end users. In this article, we’ll give more details on how we manage our BigQuery data warehouse and especially how we monitor and manage three important topics:
- Slow queries
- Slots usage
- Table and field sizes
Let’s start with a quick reminder about BigQuery. BigQuery is a highly scalable data warehouse that leverages a heavily distributed parallel architecture. Each parallel unit of processing is called a “
slot”. A slot is a virtual CPU used by BigQuery to execute SQL queries. Depending on query size and complexity BigQuery automatically defines how many slots are required by each query.
We will later refer to pricing models. With BigQuery we have a choice of using an on-demand pricing model or a flat-rate pricing model. When we use the flat-rate pricing model we have to explicitly choose how many slots to reserve. Past a certain workload, flat-rate pricing is much more efficient than on-demand analysis pricing. That’s why we use the flat-rate pricing at Teads.
If you’re familiar with regular databases, you’re likely already knowledgeable about slow queries. With BigQuery, we have to care both about actual query run time and query slot time consumed. As indicated earlier, we are on the flat-rate pricing at Teads and we have to monitor our slots to avoid starving concurrent jobs.
The audit logs are the best input to monitor BigQuery. You can have a look at how to set up a sink into BigQuery if you haven’t activated them already. We can query the exported audit tables to obtain our slot usage:
Here is the SQL query:
The reported used slot time is often decorrelated to the actual query time. It’s especially true when you request huge datasets where
usedSlotsTimeInMinutes will be much higher compared to small CPU-intensive queries for instance.
From a single-user perspective, you’ll likely look to optimize query time to reduce related job durations or API responsiveness. However, from the team’s perspective, reducing the slot time usage on queries is a higher priority. Once all your slots are consumed, BigQuery will stack all your requests as concurrent requests until your quota is filled. Any additional requests will fail until previous requests are processed so BigQuery can queue more.
Google provides a lot of tips to improve your queries in its documentation. Here is a quick selection of the most relevant from our own experience:
- Filter as much and as early as possible your inputs
- Be careful on your join conditions and try to make them as efficient as possible
- If you reuse a common expression table multiple times, you should likely extract it in another query, especially if it’s most of the workload
- Be careful when working with arrays of raw data as you’ll likely end up with skewed data and you should pre-aggregate it or limit it with an ordering when possible
BqVisualiser from Stephan Meyn is a great tool to explore your jobs. If you don’t want to log in you can use the
jobs.get API method to get the job details, store them in a file to be uploaded to the tool. Though this is not the purpose of the article, this tool will also help to pinpoint the most expensive steps for your queries. Then you can easily look at the average vs maximum time spent in a stage, for instance, if your data is skewed.
If you’ve been working with BigQuery at scale for a while, you likely moved to flat pricing. Therefore, slots become the main resource and we need to monitor and control their usage to ensure consistent performance for our production pipelines. At Teads, we rely on multiple GCP projects to prioritize slot usage throughout our workloads and our users.
Our reservation hierarchy looks like the following:
All reservations have a dedicated amount of slots that are used whenever required, but those pools can tap in the other available slots from other pools.
Most of our slots are allocated in priority towards our customer-facing services to provide a responsive experience. Then BigQuery-based ETL pipelines will have priority over the remaining slots. Finally, if there are still some slots left, they will be available for exploration or analysis.
We use Grafana to monitor our slot usage per reservation.
On this graph, we’re using Google Cloud’s operations suite (formerly Stackdriver) as a data source for Grafana and then look for the Max slot assigned metric and compare it to slots used by project, reservation and job type (per reservation). Monitoring is directly possible on the Operations suite, but using Grafana we are able to combine and display this information in more advanced ways.
As you can see on our previous graph, we do have temporary spikes and in these situations, we need to increase our overall reservation to match our needs. This is done through flex slots.
Flex slots are a great way to deal with temporary workload increases and really make sense when your average slot usage is much lower than your current usage. Of course, flex slots are more expensive and should be used wisely.
As you can see on the previous chart, the flex slots are 71% more expensive than the annual commitments. Practically, if you spend more than 14 hours a day with flex slots, it’s worth looking for an annual commitment.
On top of the Operations Suite, Google recently released BigQuery Admin Resource Charts which is a convenient way to monitor your slot usage. This tool is simpler but also more accessible if you try to evaluate how efficiently you’re using your slot reservations.
Slots are a costly line on our bills at Teads but storage isn’t coming cheap either with several petabytes stored. This leads to the following question: how to monitor your storage usage?
Looking for the biggest tables can be the first approach to pinpoint where your storage costs are the highest.
Table size is available directly as a query on dataset tables’ metadata but it requires listing all your datasets manually. Here is a convenient query to list your tables by size:
Tables’ metadata doesn’t provide the split between the short-term and long-term storage in your table so the cost will be an approximation. It’s a convenient way to identify outliers and discuss with the table owners to work on optimizing their size.
There are few ways to help with table sizes:
- Add or tighten partition/table expirations
- Prefix exploration results table with the owner name to hold their creators accountable
- Sample raw data whenever possible. Sometimes we can even sample data at the producer level on specific columns, for example, sample URL addresses used in analytics log data.
- Rollup your raw data
Practically at Teads, we have some data flows on which we sample the output to BigQuery at 1/16 and then set partition expiration at 30 days. We then join that data with other inputs to create specific rollups on which we set partition expiration at 1 year. 30 days of data is usually enough for troubleshooting but of course, in some cases, we keep raw data much longer.
Although we are able to retrieve some information in the table metadata about fields and their types, Google doesn’t provide metadata for table field sizes. Indeed, it would likely be closer to partition metadata but it’s not accessible directly.
To work around that limitation, I built a tool that uses dry run queries to export field sizes. Dry run queries provide the estimated billed bytes size which is the amount of data contained in the column.
Using that tool, you can output the data on BigQuery and the resulting table looks like the following schema:
Thanks to that schema, we can analyze the size changes over multiple table partitions:
The resulting output will allow you to obtain the following chart through a data pivot on the field value:
In this chart, using a log-based scale axis on GBs, we can see that a field is way above the others, weighing up to nine times the second highest. Most fields are stable but thanks to that data we can see that the 3rd biggest field is increasing by close to 50% (69 GB to 101 GB) over a few days. It could be a business change, an environmental change, or even an issue/regression. In that example, it’s linked to a change in one of our algorithms that resulted in more outputs.
Finally, the last day appears to be lower because the daily data was partial when the snapshot happened. That’s one of the reasons why we recommend running this job only once you’ve finished filling your table for that day (using a proper DAG for instance).
Relative numbers might be more relevant if you’re trying to compare fields weights and visualize their distribution within your table:
Using the same approach on the query output we obtain the following chart:
Here we can see that 70% of the table’s weight is due to the biggest field, even managing a 15% reduction on that field would allow saving more than removing the second biggest one. Using this visualization we are able to quickly identify which data drives the cost of our tables.
With BigQuery, Google is taking care of abstracting all the infrastructure. The resources and their related pricing are relatively easy to work with. But all that convenience doesn’t come for free and costs can quickly rise if not monitored.
We hope that the monitoring tips and optimizations described in this article will help you to contain these costs. This is obviously a time-consuming effort and we have found alerts on slots usage and storage volume to be really useful.
In another article, we’ll give details on how we manage rights and access for BigQuery through Terraform so that developers can be mostly autonomous for production deployments.