Hands-on Regression Analysis with BigQuery
A step-by-step guide on how to perform linear regression analysis with BiqQuery
We live in the age of big data. It is difficult to give an exact definition of big data. However, as a good rule of thumb, if the dataset can’t fit in memory or disk on a single system, it can be called big data. Big data is used in areas such as healthcare, biology, bioinformatics, and education.
Classic computers were sufficient to analyze small and medium-sized data. But big data has fundamentally changed data analytics. These computers failed to analyze big data because big data didn’t fit in memory.
Cloud computing has changed the game. Now tools in Cloud computing platforms can analyze big data. In this post, I’ll show you how to perform linear regression analysis with BigQuery and cover the following topics:
- What is BiqQuery?
- What is BigQuery ML?
- BigQuery pricing
- Linear regression analysis with BigQuery
Let’s dive in!
What is BigQuery?
BigQuery is a fully-managed, serverless data warehouse that allows you to manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence. You can query terabytes in seconds and petabytes in minutes with BigQuery.
What is BigQuery ML?
You can use BigQuery both to query big data and to build machine learning models using standard SQL queries. The BigQuery ML lets you discover, implement, and manage data tools to make critical business decisions.
As you know, building a machine learning model on a big dataset requires extensive programming and knowledge of ML frameworks such as TensorFlow and PyTorch. BigQuery ML helps SQL practitioners to build and evaluate machine learning models using existing SQL tools and skills.
Here are some advantages of BigQuery ML:
- BigQuery ML democratizes the use of machine learning, you just need to know SQL.
- It provides a variety of pre-built models that can be used out-of-the-box.
- You can quickly build machine learning models without exporting data from the data warehouse.
- It is easy to use and can be integrated into existing workflows.
- It is cost-effective and can be used to train models on a pay-as-you-go basis.
You can work with BigQuery ML by using:
- The Google Cloud console
- The BigQuery REST API
- An external tool such as a Jupyter notebook or business intelligence platform
Bigquery is not free and charges you for data storage and running queries. But as part of the Google Cloud Free Tier, BigQuery provides some resources free of charge up to a specific limit. These free usage limits are available during and after the free trial period. For more information about the free usage tier, you can look here.
Regression Analysis with BigQuery
Linear regression is a statistical technique that is used to predict the future value of a dependent variable, based on a given independent variable. The dependent variable is the variable that is being predicted, while the independent variable is the variable that is being used to make the prediction.
Here are some advantages of linear regression:
- Linear regression is a simple and easy-to-use technique.
- It is a powerful technique that can be used to predict future values.
- It is a reliable technique that can be used to obtain accurate predictions.
- It is a flexible technique that can be used to predict values for different independent variables.
- It is a widely used technique that is well-known and understood by many statisticians.
Let’s look at how to perform a linear regression analysis with BigQuery in 6 steps.
Step 1. Create your dataset
Before building the machine learning model, you need to create a dataset. To do this, let’s go to the bigquery page first. After that, click on the three dots next to the project name and then press Create dataset.
On the Create dataset page, enter a name let’s say reg_analysis, keeping all the other options with default values, and then click CREATE DATASET.
Step 2. Exploring the dataset
The dataset I’m going to use is the new_york_citibike dataset. This dataset exists in the bigquery-public-data GCP project that hosts all the BigQuery public datasets. Let’s take a look at this dataset. Open the bigquery-public-data GCP project and find the new_york_citibike dataset. After that click on the citibike_trips table.
A new window will open showing information about the dataset.
In the schema tab, the structure of the citibike_trips table displays all the fields that can be used as labels and features. Here the tripduration field indicates the duration expressed in seconds of each bike rental. This column will be our target variable. All the other fields in the table are potential features since these fields are directly related to the distance to go through and affect the duration of the trip.
Now let’s take a look at the row count of the tripduration column with a SQL query.
As you can see, there are more than five million records where the tripduration field is not properly valued.
Step 3. Creating the Tables
Before training the model, the dataset is split into three sets in machine learning projects. These are the training, validation, and test sets. The model is built with the training set, the model is evaluated with the validation set, and the values in the test set are predicted with the model.
Let’s create the training set first:
Let’s go over these SQL comments. You can use the
CREATE OR REPLACE TABLE statement to create a new table. The result of the query is stored in the new
nyc_bike.training_table. You can write the columns you want to use in the
WHERE statement is used to filter data. In this statement, I used
EXTRACTmethod to get the year and the month from the
starttime field, which indicates the exact moment when the bike trip started. To exclude outliers in the tripduration column, I’m only taking into consideration the rows with a minimum rental time of 3 minutes and a maximum of 3 hours. I can also add a filter to the year of birth of our customers, by filtering all the records with an empty birth_year and ignoring all the customers born after 2007.
After running this query, you’ll see a table as below.
Let’s create the evaluation table.
After that let’s create the prediction table.
Nice! we created the tables. The next step is to create a regression model using the training table
Step 4. Building the Model
Now let’s build a linear regression model using the training table with BigQuery. We can use the following standard SQL query to build the model we use to predict the trip duration. I’m going to give the model a name, let’s say
You can use the
CREATE MODELclause to create and train the model named
I want to add an
is_weekend. This field is a boolean value that calculates and extracts the day of the week in which the bike rental happened. It is generated by an IF statement that returns true if the day is Sunday or Saturday, represented by the values 1 and 7; otherwise, it’s false. The
tripduration column is our input label column.
After clicking the run button, the query takes about 30 seconds and then your model appears in the navigation panel. Note that since we used a
CREATE MODEL statement to create a table, we do not see query results.
Click reg_model to see model training statistics. Then hit the Education tab and then click Table. The results should look like this:
You can see the mean square error metric by default.
Step 5. Evaluating the model
After creating your model, you can use
ML.EVALUATE function to evaluate the model. Let’s use the following query to evaluate our ML model on the evaluation table:
SELECT statement retrieves the columns from your model. I set the
ML.EVALUATE function in
FROM clause to evaluate the model. When you run this query, the results look like below
Step 6. Using the model to predict outcomes
After evaluating the model, the next step is to use it to predict an outcome. Let’s predict the labels using the data from the prediction table:
I set the
ML.PREDICT function in
FROM clause to predict the data in the prediction table. In the following screenshot, you can see the predicted labels.
That’s it. In this tutorial, we first created a linear regression model and then used this model to predict data that we did not use when building the model.
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can either delete the project you created or save the project and delete the dataset.
In this post, I showed step-by-step how to perform linear regression analysis with BigQuery. You can obtain better output by fine-tuning the hyperparameter. You can find the BigQuery queries I used in this tutorial here. Here is also the youtube video showing the analysis I discussed in this post.
Thanks for reading. I hope you enjoy it. Don’t forget to follow us on YouTube | Medium | Twitter | GitHub | Linkedin | Kaggle 😎
If this post was helpful, please click the clap 👏 button below a few times to show me your support 👇