BigQuery 101, how to tame the beast? — Part 3
Great! Having understood everything thanks to the first two parts (Part 1 and Part 2), we can now use it in production! Oh, but wait! Do you think BigQuery is automagic? With the information I have shared with you, you could be under the impression that the tool will solve all of our problems. That is both true and false at the same time. In this third part of the article, I would like to present the main problems I encountered on client missions when using BigQuery.
“BigQuery, it’s simple”
Many think this is a simple solution.
In reality, it is a solution that is really easy to get started with. As we saw in the demo, all you have to do is connect to the console to add data or query existing data. BigQuery follows the ANSI SQL:2011 standard, so we don't lose our bearings.
As far as data organisation is concerned, it is also very simple. A table or a view is attached to a dataset (logical concept) which itself is associated with a project, which makes it possible to link BigQuery to our billing account.
However, it's important to know what BigQuery is built on. We have seen the different components that make up the solution. These components make it possible to process a large volume of data in a minimum of time. To do so, BigQuery builds an execution tree to optimise our query and its processing.
Some concepts that are a little more advanced can also be a little more difficult to grasp. For example, Authorized Views allow you to share your data with a few users or groups of users without giving them access to the tables underneath the view. It is a very powerful and practical tool, but it takes a little time to master fully.
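To make this more concrete, here is a minimal sketch (project, dataset and table names are hypothetical): the view itself is created in SQL, while the authorization step is configured in the dataset's sharing settings (console or API), not in SQL.

```sql
-- Hypothetical project/dataset/table names.
-- 1. Create a view in a dataset the target users CAN access,
--    exposing only the columns and rows we want to share.
CREATE VIEW `my_project.shared_dataset.active_customers` AS
SELECT customer_id, country
FROM `my_project.private_dataset.customers`
WHERE status = 'active';

-- 2. In the BigQuery console (or via the API), open the
--    sharing settings of private_dataset and add
--    shared_dataset.active_customers as an Authorized View.
--    Users can then query the view without any access to the
--    underlying customers table.
```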
“It’s just another SQL database”
BigQuery is not just another database; that view would be far too simplistic. It is a two-headed solution: storage and processing. We are not talking about resources (CPU, RAM, whatever) but about slots, and only slots, for processing. It is a serverless solution that is completely managed by Google.
Be careful though: there are, for example, quotas limiting the use of these slots. To avoid unpleasant surprises, it is important to monitor your usage as soon as you adopt BigQuery. What is the consumption of each query? Who is the biggest consumer? Answering these questions will give you better control of its daily use.
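As a sketch of this kind of monitoring, the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view exposes per-job slot and byte consumption. The region qualifier below is an example; use the region where your data lives.

```sql
-- Top consumers over the last 7 days, by slot time and bytes billed.
-- `region-eu` is an example; replace with your dataset region.
SELECT
  user_email,
  COUNT(*) AS job_count,
  SUM(total_slot_ms) / 1000 AS total_slot_seconds,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY user_email
ORDER BY total_slot_seconds DESC;
```

A query like this is a good candidate for a scheduled dashboard, so you can decide later whether on-demand or flat-rate pricing suits you better.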
Another important point: BigQuery is not relational in nature. It will often be interesting to modify your schemas to move from a normalized model (as we know it in relational databases), where we link tables by IDs, to a denormalized model. How does it work?
With a normalized approach, each table carries one type of information. In the example below, we have two tables, Person and Address, each carrying one type of information. We are used to making the link between these pieces of information thanks to an intermediate table. In our case, if we want to find my different addresses, I will do a SELECT with my ID “123” to retrieve the IDs of my addresses, and with a JOIN I will have all the information.
The problem with this approach is that the more data we have, the more resource-intensive it becomes. In this context, we will then use denormalized models. These models consume more storage but avoid unnecessary computational load. Considering the previous use case of finding the addresses, I only have to search for row 123: all the information is already there. I just have to select the columns of interest. And there it is! The explanation is a bit short; if you want more details, do not hesitate to ask and I will write an article on this subject.
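In BigQuery, denormalization usually relies on nested and repeated fields (STRUCT and ARRAY). A minimal sketch of the Person/Address use case, with hypothetical names:

```sql
-- Denormalized Person table: addresses are embedded as a
-- repeated STRUCT instead of living in a separate table.
CREATE TABLE `my_project.demo.person` (
  person_id INT64,
  name STRING,
  addresses ARRAY<STRUCT<street STRING, city STRING, zip STRING>>
);

-- Finding all addresses for person 123: a single row is read,
-- no JOIN needed. UNNEST flattens the repeated field.
SELECT a.street, a.city, a.zip
FROM `my_project.demo.person`, UNNEST(addresses) AS a
WHERE person_id = 123;
```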
“BigQuery has no indexes on tables”
Indeed, BigQuery is a fully managed service. As such, traditional DBA operations are no longer the responsibility of our teams but are part of the product package. We leave this burden to the Google teams, who take care, in a completely transparent way, of the maintenance and updating of BigQuery.
But, even with the goodwill of Google engineers, BigQuery can’t optimise everything! To take full advantage of the product, there are two very important concepts to take into account: partitioning and clustering.
⤴ but, consider the performance
Partitioning is a way to segment our table at the storage level. To simplify, BigQuery will create one file per partition. There are three types of partitions available: by ingestion date, by a date/time type column or by an integer type column. Why use partitioning? To avoid doing a full scan of our data. If I want to access the table of my customers’ purchases, it would be wasteful to read all the data when I only want to analyse the previous week. With partitioning, we will be able to select only the dates of interest. We will therefore only run our processing on the data that is useful to us.
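A minimal sketch of the purchases use case (hypothetical names):

```sql
-- Purchases table partitioned on the purchase date.
CREATE TABLE `my_project.demo.purchases` (
  purchase_id INT64,
  customer_id INT64,
  amount NUMERIC,
  purchase_date DATE
)
PARTITION BY purchase_date;

-- Only the partitions of the previous week are scanned;
-- all other partitions are pruned before processing.
SELECT customer_id, SUM(amount) AS total
FROM `my_project.demo.purchases`
WHERE purchase_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY customer_id;
```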
Clustering is a way to optimise storage. It is possible to specify up to four clustering columns. Once activated, and thanks to columnar storage, BigQuery will be able to colocate identical data. This is very interesting in certain use cases. For example, I want the sales of stores in Lille for the previous week. If I only use partitioning, I will retrieve all sales from all stores: Lille, Bordeaux, Paris, Lyon, etc. With clustering activated on the “City” column, BigQuery will be able to quickly find the rows relating to Lille. Clustering is exploited during processing, so it is a real lever for accelerating your queries.
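The Lille example can be sketched like this (hypothetical names):

```sql
-- Sales table partitioned by date and clustered by city.
CREATE TABLE `my_project.demo.sales` (
  sale_id INT64,
  city STRING,
  amount NUMERIC,
  sale_date DATE
)
PARTITION BY sale_date
CLUSTER BY city;

-- Partition pruning keeps only last week's partitions, then
-- clustering lets BigQuery skip blocks without 'Lille' rows.
SELECT SUM(amount) AS lille_total
FROM `my_project.demo.sales`
WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  AND city = 'Lille';
```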
Last aspect to consider in terms of performance: the execution plan. BigQuery will split your query into a series of steps to perform, called stages, in order to build an execution tree. The execution plan is the best way to understand how BigQuery will execute your query. For each step, you will be able to analyse the time spent. The figure below is an example of what will be visible in the console.
“BigQuery is for Big Data!”
In my opinion, we must start by defining what Big Data is! How huge is it? But that will surely be an excuse for another article.
More seriously, yes, BigQuery is designed to be integrated into the Big Data ecosystem. Remember what we saw earlier: “Serverless, highly scalable, and cost-effective multicloud data warehouse designed for business agility”.
👎 but, size doesn’t matter
Now, we have to put things into perspective. You can also use small tables, with a few rows. The experience will not be degraded, just like when you buy a 500HP car to go to the bakery on Sunday! (Be careful all the same with this kind of car, which might get dirty if you really only use it for that 🙏 )
Another thing well done with BigQuery is that you can also connect it to other storage systems and only use the Dremel query engine. For example, we can query: Google Sheets, Google Cloud Storage, relational databases, etc.
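As a sketch, an external table over files in Cloud Storage (hypothetical bucket and names): the data stays in GCS, and only the query engine is used at query time.

```sql
-- External table over CSV files in Cloud Storage.
-- Storage remains in GCS; BigQuery only provides compute.
CREATE EXTERNAL TABLE `my_project.demo.ext_events` (
  event_id INT64,
  event_name STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/events/*.csv'],
  skip_leading_rows = 1
);

SELECT COUNT(*) AS nb_events
FROM `my_project.demo.ext_events`;
```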
“BigQuery, it’s expensive”
This is probably what I hear most often. Admittedly, the costs can seem enormous when you start. But, it is possible to work on them.
BigQuery pricing is separated into two categories: Storage and Processing.
Storage also comes in two modes:
- Active mode is the default. The pricing is comparable to Google Cloud Storage: you pay for the volume stored, prorated to the second. It is quite simple to quantify what this will represent in an organisation.
- Long-term mode: automatically, once a table or partition has not been modified for 90 consecutive days, its storage price is reduced by 50%. This reduction can be attractive in the long term.
There are also two modes for processing:
- The default mode is on-demand. It lets you benefit from a maximum of 2,000 shared slots. This is ideal for small organisations or those just getting started. This is where slot monitoring comes into its own: depending on the metrics reported by your dashboard, you may or may not switch to the other system.
- Flat-rate lets you commit to Google to reserve slots, in increments of 500 slots for $10,000 per month. In this case, the slots are dedicated to you. When you need consistent and predictable invoices, this is the ideal mode.
To avoid disappointments with pricing, there are some tips I would like to give you:
- Never, ever do SELECT *. Imagine your query hits a 100 TB table but you only need two columns: you will be billed for every column scanned, far beyond the data that is actually useful for the use case.
- When you run a query with BigQuery, the result is stored in a temporary table, which serves as a cache. So, if you run the exact same query and the underlying data hasn’t changed, BigQuery will reuse the previous result to answer your request. The advantage: it’s fast (0 seconds of processing) and it’s free!
- Pay attention to LIMIT. It only limits the rows returned at the end of execution; it does not reduce the data processed (or billed).
- Ask yourself the right questions when you write your query. What do WHERE clauses do? With BigQuery, it is interesting to apply these clauses as early and as often as possible, to avoid data shuffling between slots. Similarly, in a JOIN, placing the largest table first (on the left) is often preferred. Finally, the ORDER BY clause should come at the very end of the query, to sort only the output data.
- As we have seen, partitioning and clustering are also good tools to use to limit the volumes of processed data.
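Putting several of these tips together on a hypothetical sales table, assuming it is partitioned on `sale_date`:

```sql
-- Bad: scans every column of every partition.
-- SELECT * FROM `my_project.demo.sales`;

-- Better: only the needed columns, a partition filter applied
-- as early as possible, ORDER BY on the final output only.
-- Note: the LIMIT reduces what is returned, not what is billed.
SELECT city, SUM(amount) AS total
FROM `my_project.demo.sales`
WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY city
ORDER BY total DESC
LIMIT 10;
```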
“There’s no limit with BigQuery!”
👍 virtually unlimited 👎 but
The main limit is the color of your bank card. Everything is based on the slots and the number that Google assigns you when executing your queries.
The second limit is related to the global organisation of GCP. Each product comes with default quotas and limits. These are often adjustable, but they ensure an even experience between users of the service in the same region.
For BigQuery, for example, there are limits on:
- DML (Data Manipulation Language) where you can only apply two simultaneous mutation operations.
- Streaming ingestion, with a capacity of 100,000 rows ingested per second per table.
- Partitioned tables which can only contain a maximum of 4,000 partitions.
In short, none of this is really a blocker for everyday tasks, but it is important to know these limits exist.
You are now ready to face the harsh reality! The most curious of you will surely wonder what follow-up is to be given to this article.
First of all, this overview only covers a small part of BigQuery. I deliberately omitted to mention:
- Data Transfer Service, which allows you to import data automatically (or on demand) from various Google services (YouTube, Ads, Cloud Storage) but also from other providers (AWS S3, Amazon Redshift).
- BI Engine, which offers an in-memory caching layer on top of BigQuery to speed up certain processing.
Finally, from my point of view, it will be interesting to look at the parallel functions of BigQuery, such as:
- BigQuery ML, which, as its name suggests, offers the ability to train and use ML models from our datasets, all expressed in SQL.
- BigQuery Omni, which lets you look to the future and think multicloud. Through BigQuery, you will be able to run queries on data hosted on AWS (S3) or Azure (Azure Blob Storage). Omni remains serverless, cost-effective and efficient!
Another solution allows you to execute SQL in an optimised way, it’s Dataflow SQL. It uses a syntax close to that of BigQuery to interact with data present on BigQuery, Cloud Storage but also on Pub/Sub, the GCP messaging solution.
These points are only tips to follow along your own path. I also advise you to browse the official documentation, which is full of tips, and to follow a Google Cloud training to go into detail and discuss with a trainer (me, for example :p)
✌️ Feel free to ask questions or suggest other topics