BigQuery 101, how to tame the beast? — Part 2
In the first part of this series, we introduced Google BigQuery and demonstrated the power of the tool with some queries.
In this part, we’ll see how BigQuery is designed under the hood to understand how to avoid pitfalls when using the solution.
A third part will focus on best practices and advanced concepts.
It works! But how?
Google BigQuery is a three-layer solution: a storage layer with strong data replication, a compute layer that executes our queries, and a fiber network that lets the two work together.
Let's look at each of these components individually.
Colossus is Google’s global storage system. Whether you use YouTube, Gmail or BigQuery, at some point your data will be on Colossus. Colossus offers storage services such as:
- security on the lower layer storage
- the possibility of scaling globally (in Europe or America)
- a durability of 99.99999999999% (eleven 9s after the decimal point). This metric matters because it means that out of 1 billion objects stored for 100 years, we should not expect to lose a single one
- optimal performance, the solution only manages storage
- cross-datacenter replication
- finally, maintenance and updates are transparent
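The durability figure above can be sanity-checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the quoted percentage is an annual per-object survival probability, which is a simplification of whatever model Google actually uses.

```python
# Back-of-the-envelope check of the durability claim.
# Assumption: durability is an annual per-object survival probability.
durability = 0.9999999999999               # eleven 9s after the decimal point
annual_loss_probability = 1 - durability   # ~1e-13 per object per year

objects = 1_000_000_000                    # 1 billion objects
years = 100

expected_losses = objects * years * annual_loss_probability
print(f"Expected objects lost: {expected_losses:.2f}")  # ~0.01
```

With an expected loss of roughly 0.01 objects over the whole century, "we should not lose any of them" is indeed the right reading.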
Colossus is therefore a high-performance, low-cost storage solution (without the downsides often associated with that term).
BigQuery will therefore rely on Colossus to store our tables. To take full advantage of Colossus, BigQuery uses a specific file format: Capacitor.
Capacitor stores our information in a “columnar” format: simply put, each file represents a column of a table rather than a set of rows. Because the values within a column are similar, they compress well, which lowers the cost of storage.
Capacitor will also take care of operations on metadata, cache management, buffering and encryption.
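To see why a columnar layout compresses so well, here is a minimal sketch in plain Python (not the actual Capacitor format, whose internals are far more sophisticated): storing a table column by column groups similar values together, which makes even a simple technique like run-length encoding very effective.

```python
from itertools import groupby

# A tiny table stored row by row (how a classic row-oriented engine keeps it).
rows = [
    ("alice", "FR"), ("bob", "FR"), ("carol", "FR"),
    ("dave", "US"), ("erin", "US"),
]

# The same table stored column by column (the idea behind Capacitor).
names = [r[0] for r in rows]
countries = [r[1] for r in rows]

def run_length_encode(values):
    """Compress a sequence into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# A low-cardinality column collapses to a handful of pairs.
print(run_length_encode(countries))  # [('FR', 3), ('US', 2)]
```

Scanning only the columns a query needs, plus compression like this, is what makes columnar storage both fast and cheap for analytics.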
Once our information is securely stored and compressed, it’s time to interact with it. For this, we use Dremel, BigQuery’s query engine. It uses the metadata stored in Colossus to build dynamic execution plans based on the available resources, then optimises our SQL queries to take full advantage of our data, in the shortest possible time and at the lowest possible cost.
What is important to remember is that BigQuery charges us for an amount of resources that is not expressed in hardware terms: CPU, RAM and the like are absent from BigQuery's vocabulary.
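For on-demand queries, the unit BigQuery bills is the number of bytes a query scans, not hardware time. A minimal cost estimate might look like the sketch below; the $5-per-TiB rate is an assumption based on the historical on-demand price, so check the current pricing page before relying on it.

```python
# Hypothetical on-demand cost estimate: billing is per byte scanned,
# not per CPU-hour or GB of RAM. The rate below is an assumed example value.
PRICE_PER_TIB = 5.00  # USD per TiB scanned (assumption, not current pricing)
TIB = 2 ** 40         # bytes in one tebibyte

def estimate_query_cost(bytes_scanned: int) -> float:
    """Return the estimated USD cost of a query scanning `bytes_scanned` bytes."""
    return bytes_scanned / TIB * PRICE_PER_TIB

# Scanning a 500 GiB table end to end:
print(f"${estimate_query_cost(500 * 2**30):.2f}")  # $2.44
```

This billing model is also why the columnar format matters so much: a query that reads 2 columns out of 50 only pays for those 2 columns.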
If we summarise what we have just seen, we have a storage system (Colossus) and a query engine (Dremel). How do these two services interact with each other? Simply via Jupiter. Jupiter is Google’s network, and it is surely the feature that most sets Google Cloud apart from its competitors. It is a proprietary inter-datacenter network whose theoretical bisection bandwidth is 1 Pbit/s, allowing 100,000 machines to talk to each other at 10 Gbit/s each. With Jupiter, it is no longer necessary to keep the data next to the processing system (as HDFS does): we can load terabytes (and more) of data in a snap of the fingers.
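The bisection figure is internally consistent, as a two-line check shows: 100,000 machines exchanging data at 10 Gbit/s each adds up to exactly 1 Pbit/s.

```python
# Sanity check of the Jupiter bisection-bandwidth figure.
machines = 100_000
per_machine_gbit = 10                       # 10 Gbit/s per machine

total_gbit = machines * per_machine_gbit    # 1,000,000 Gbit/s
total_pbit = total_gbit / 1_000_000         # 1 Pbit = 10^6 Gbit
print(total_pbit)  # 1.0
```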
Now that we have all the components in mind, let’s recap:
- Colossus is the base of the storage
- Dremel takes care of the processing
- To optimise these two solutions, BigQuery will use the Capacitor file format
- This storage/compute separation is ensured by Jupiter, the Google network
As you can see, the architecture is quite simple to understand but relies on very advanced concepts and solutions. Fortunately, it is a service fully managed by Google.
✌️ See you next week for the last article in this series.