BigQuery 101: how to tame the beast?
Google BigQuery often seems to be Google’s magic tool, letting you browse huge amounts of data via SQL without worrying about the infrastructure. One small caveat: magic does not exist! A certain number of levers allow you to avoid performance and cost disappointments.
This series of articles will present the architecture of BigQuery and the best practices to put in place to work with it efficiently on a daily basis, explaining how a column-oriented database works and how the parallel processing of your queries is managed.
BigQuery, what’s this?
You may have landed here by chance, wondering what BigQuery is.
Google defines BigQuery as:
“A serverless, highly scalable, and cost-effective multicloud data warehouse designed for business agility”
BigQuery is a product launched by Google in 2012 to address processing challenges in the Big Data ecosystem. It is a serverless solution that scales automatically, makes it easy to address multi-cloud use cases, and does so at a relatively controlled cost.
Having said that, we have not said much! From my point of view, what is important here is:
- First, the fact that the solution is serverless. That doesn’t mean it’s magic (although a layer of magic does appear sometimes, and we’ll get to that later), but that we don’t have to worry about the resources behind this service.
- Then, it’s important to understand that this is not a classic database: BigQuery is a Data Warehousing solution. What does that mean? Data warehousing addresses use cases such as analytics or machine learning that require large amounts of data. Unlike its relational cousin, it favors reads over writes.
- Finally, it is a solution designed for the future, with multi-cloud in mind! Today, BigQuery can work hand in hand with other public clouds (hello AWS and Azure).
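A concrete consequence of this read-oriented, column-oriented design: BigQuery reads (and bills) only the columns your query actually references. As a minimal sketch, assuming a hypothetical `my_project.sales.orders` table:

```sql
-- Scans only the `amount` column, however wide the table is.
SELECT SUM(amount) AS total_sales
FROM `my_project.sales.orders`;

-- By contrast, SELECT * forces a scan of every column: avoid it.
```

This is why, in a columnar warehouse, listing only the columns you need is not just style advice but directly reduces the bytes processed.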
BigQuery and its ecosystem
Simply put, BigQuery is a central solution in data architectures built on Google Cloud. To feed BigQuery, we can use different types of sources: batch sources, loading data from files stored on Cloud Storage, or streaming sources, consuming data in real time from Kafka or Pub/Sub.
It will sometimes be necessary to add a “transform” step between these sources and BigQuery, using tools such as Apache Spark or Apache Beam.
Once the data is persisted in BigQuery, we can interact with it from our dataviz solutions (Data Studio, Looker, Tableau, etc.), but also train our machine learning models on our historical data.
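On the machine learning side, BigQuery ML even lets you train a model directly in SQL. A sketch, with hypothetical project, dataset, column, and model names:

```sql
-- Train a logistic regression on historical data, without leaving BigQuery.
CREATE OR REPLACE MODEL `my_project.sales.churn_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  nb_orders,
  total_spent,
  has_churned AS label  -- by default, BigQuery ML uses `label` as the target
FROM `my_project.sales.customer_features`;
```

The model can then be evaluated with `ML.EVALUATE` and applied with `ML.PREDICT`, all in SQL.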
BigQuery will allow you to free yourself completely from software and hardware concerns and focus on the data and its use. Thus, building a solution that gives you a 360° view of your customers requires almost no special skills, just a few SQL queries. Are you a Data Scientist in need of data to train your models? Nothing could be easier: explore BigQuery and run your queries directly in your notebooks.
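To give an idea of what “just a few SQL queries” looks like, here is the general shape of a customer-360 style analytical query in BigQuery Standard SQL (table and column names are hypothetical):

```sql
-- Per-customer activity over the last year: order count, spend, recency.
SELECT
  customer_id,
  COUNT(*)        AS nb_orders,
  SUM(amount)     AS total_spent,
  MAX(order_date) AS last_order
FROM `my_project.sales.orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 100;
```

A query like this runs the same way whether the table holds a few megabytes or several terabytes; only the processing time and bytes billed change.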
BigQuery in action
I invite you to go to the Github page below to find and perhaps run the demonstration that I have prepared for you.
GitHub - allienna/bigquery-101: Google BigQuery demonstration used in talks and stories
If you prefer to watch the demo on video, you can see it live (but in French 🇫🇷). To avoid spoilers, pause the video after the demo and come back to continue reading.
As you may have seen in the demo, we just queried a 5 TB dataset in SQL, with a more than acceptable response time. If you’re curious, meet me next week to see how BigQuery works under the hood.