Introduction to BigQuery for data analysts and data scientists

How to use BigQuery in data management, data exploration, and visualization

Carmen Adriana Martínez Barbosa, PhD.
plotwise
9 min readOct 21, 2021

--

Photo by Jan Antonin Kolar on Unsplash

BigQuery is one of the current cloud data warehouses in the market. This tool is unique in how low-maintenace it is. Additionally, the architecture of BigQuery — which is known as server-less infrastructure — allows users to focus more on the analysis of their data.

The data stored in BigQuery can be accessed through SQL queries which can be written either in the cloud console or via client libraries. There are client libraries for: C#, Go, Java, Node.js, PHP, Python, and Ruby. The ease of use BigQuery together with its excellent documentation makes it a great tool for data analysts and data scientists. Currently, more than 3000 companies use BigQuery, including The New York Times, HSBC bank, Spotify among others.

Here at Plotwise, we use BigQuery to store the large amounts of daily data generated by our planning engine. We take advantage of the integrated built-in features like data studio (to create compelling dashboards) and BigQuery Geo Viz (to do geospatial visualizations). This post aims to give you an introduction to BigQuery. We will explore each of the aforementioned features by showing a real business case that involves a sample of Open Street Map (OSM) data.

Before diving into BigQuery and its functionalities, let’s briefly talk about the Google Cloud Platform (GCP).

GCP is a set of physical computers that are located in different Google’s data centers around the globe. These computers or resources can be accessed through services. GCP services are features that combined, give you the infrastructure needed to create any application. Examples of GCP services are Cloud Storage, Compute Engine, and the subject of this post: BigQuery.

To start using GCP and all its services, you need to create or select an existing Google Cloud project. Think of a Google Cloud project as an entity where you can manage the settings, permissions, and all the GCP services used to build your application. This means that you can work with many GCP services within a project. You can find detailed information on how to create a project at this link.

Resource hierarchy in GCP. Credit: https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy

Enough decription of GCP. Let’s start exploring BigQuery!

Working with BigQuery’s cloud console

Once you have created a Google Cloud project, you can start using BigQuery’s cloud console by selecting BigQuery in the navigation menu of the GCP console. You will see the following image:

BigQuery’s cloud console. Image made by the author

BigQuery’s cloud console has four main sections: The navigation menu, the explorer panel, the details panel, and the SQL workspace.

· Navigation menu: This panel contains a list of BigQuery functionalities such as transferring data and managing scheduled queries.

· Explorer panel: It contains a list of your current projects and other pinned projects. In BigQuery a project can contain several datasets which in turn, contain several schemas (i.e. tables) and/or views. We’ll see this structure later.

· Details panel: When you select a dataset, table, or view in the explorer panel, the console opens a tab in the details panel which shows information about the specific object. In the details panel, you can create new tables or modify existing schemas.

· SQL workspace: You can make this tab visible by selecting SQL workspace in the navigation menu. Here you can run queries to obtain data that can be directly analyzed and visualized via Sheets or with Data Studio.

Data hierarchy

In BigQuery, the data is organized by datasets and tables. Datasets are containers within a project. Datasets organize and control the access to the tables and views. A table or view must belong to a dataset; therefore, you need to create a dataset before loading data into BigQuery. The data hierarchy of BigQuery is shown in the following figure:

Hierarchy in BigQuery’s data. Every table is defined by a schema. Image made by the author.

BigQuery’s cloud console eases the creation of both datasets and tables. The creation of a new dataset is made by selecting the option create dataset, located in the view actions of the BigQuery’s project (If you go to the explorer panel, the view actions are marked as the three vertical dots next to the project’s name):

Creation of a dataset via the BigQuery’s cloud console.

To create a dataset, you just need a name and defining its location (see the figure above).

The same procedure can be carried out to create a table within a dataset:

Creation of a table via BigQuery’s cloud console.

In this case, there are several options to upload or create tables in a BigQuery’s dataset:

· Empty table: In this option, you must add fields manually to create the table’s schema.

· Google cloud storage: In this case, you must specify the location of the file and its format. BigQuery provides an option to detect the table’s schema automatically.

· Upload: Select this option if your table is stored locally.

· Drive: Same as above but for tables saved in Google drive.

· Google Cloud Bigtable: This manual provides the information to upload tables to BigQuery.

The creation of datasets and tables in BigQuery can be also carried out via the client libraries. In this blog we do not cover this topic; however, excellent documentation can be found in the following links:

Functionalities of BigQuery

We can now explore some of the utilities that make BigQuery a special tool for data analysts and data scientists. For the demonstration, I use OSM data, which is a collaborative project to create an open, free, and editable geographic database of the world. OSM data contains all kinds of geospatial information such as amenities or points of interest (i.e. hospitals, universities, bars); Building footprints; Elevation data; Roadmaps and urban networks; Maximum speed and type of road networks. At Plotwise we work extensively with these last two features to calibrate the Routing Kit micro-service, an internal library that provides advanced route planning functionalities for our continuous planning platform.

The maximum speed and type of road networks come in the form of two tables: the ways table — which contains the roads divided by segments — and the nodes table — which holds information of the start and endpoints of a road segment — . The ways table has the values of the max speed and type of the road segments. The nodes table contains the location of the start and end nodes, as well as their degree (the degree of a node is the number of connections with neighboring nodes, see figure below).

Top: graph representation of a road network. Bottom: Ways table and Nodes table.

Both, the ways and the nodes tables were uploaded to BigQuery via the cloud console. A quick inspection of these tables in Data Studio reveals that the max speed is missing in ~91% of the Dutch road networks. This percentage is also high in other countries such as Belgium (92% ) and Luxembourg (92%).

Percentage of missing max speeds per country in the sample of OSM data. Image made in Google Data Studio.

To satisfy your curiosity, here’s a short video revealing how the plot of above was made. The preprocessing to obtain the number of missing max speeds per each country’s road networks is not shown.

Plotting missing max speed of OSM road networks with Google Data Studio.

Geographical visualizations with BQ geo viz

One of the features that make BigQuery special is how easy it is to plot geographical data. You can create beautiful interactive maps with BQ geo viz. You just need to run a SQL query of your data for BQ geo viz to show you a map that you can easily explore.

First, go to the tool’s website. To start using BQ Geo Viz, you need to authorize it to use your Gmail account. Once you have done that, you can visualize your BigQuery table. First, select your Google Cloud project and then write the SQL query in the box below. Note: your table needs a geographical column, otherwise BQ Geo Viz will raise an error. Fortunately, you can use BigQuery’s geography functions to create such a column. In the next figure, you can see the max speed for a sample of Dutch road networks plotted with BQ geo viz.

Max speed of a sample of dutch road networks. Red represents a max speed of 120 km/h. Image made by author.

In the second step or data, you can add some styles to your lines and in step 3 or Style, you can customize the visualization. Finally, you can share your plot with some colleagues.

If you want to deep dive into BQ geo viz and more GIS functionalities of Bigquery, you can review the following tutorials:

Use case: Prediction of the missing maximum speed

As mentioned before, the max speed is fundamental to calibrating Plotwise’s routing kit; however, in this blog, we saw how large the missingness of this quantity is in the majority of road networks. That is why at Plotwise, we developed an imputation model to predict the max speed. This model uses as features the other variables of the OSM data; i.e. type of road; start and end positions of the nodes; length of the road segment; nodes degree and country. The algorithm used is an XGBoost Regressor whose hyper-parameters are optimized by using Scikit-optimize. The Python code is fully integrated with a Postgres database, where the OSM data resides, and with BigQuery, where the error metrics of the model are stored. To avoid model drift as a result of regular updates made to the OSM data, we implemented a mechanism to automatically retrain the model. This occurs when the average RMSE of the validation set is larger than the Average RMSE + std RMSE of the model trained with previous OSM data in the same region. When the model drift occurs, a new model is retrained with the updated OSM data. The optimization of parameters is also carried out during this process.

Monitor the error metrics of the model with Data Studio

Having an integration of the max speed model with BigQuery has been very handy to monitor the error metrics in a fast and easy way. We took advantage of Data Studio to create a dashboard to monitor the RMSE of the max speed model over time:

Dashboard made in Data Studio to monitor error metrics in the model of imputation of the max speed.

The first plot in the dashboard shows the average RMSE of the validation dataset after cross-validation. The minimum and maximum RMSE can also be observed. This plot is important to track the model drift.

The dashboard also contains a plot showing the RMSE in the train and test sets. This plot is important to check whether a model is overfitting or not. Other useful information is shown such as the percentage of the data used to create the train and test sets, the number of runs, and the number of folds to do cross-validation.

We use Python to create the model that imputes the max speed in the OSM data; however, BigQuery can be used to achieve the same goal through a tool called BigQuery ML. BigQuery ML allows the creation and execution of machine learning models in BigQuery by using SQL queries. This tool empowers data analysts who have limited machine learning knowledge and limited programming experience. In a future post, we will explain the usage of BigQuery ML and we will apply it to our sample of OSM data. Stay tuned!

Take home messages

BigQuery is an excellent tool to do data management, data analysis, and visualization. Due to its ease of use and all the functionalities that it brings, BigQuery should be added to the toolkit set of every data analyst and data scientist.

Hopefully, this post will provide you with useful information for your future analysis and modeling. Thank you for reading!

Plotwise is a startup located in Delft, The Netherlands. Its mission is to improve the last-mile delivery sector through an AI-powered continuous planning engine.

We are hiring! Check our career site!

--

--

Carmen Adriana Martínez Barbosa, PhD.
plotwise

Data Scientist | Improving society through AI | Sharing new algorithms useful to the DS community. LinkedIn: https://www.linkedin.com/in/camartinezbarbosa/