Geek Culture
Published in

Geek Culture

Hands-on Regression Analysis with BigQuery

A step-by-step guide on how to perform linear regression analysis with BiqQuery

We live in the age of big data. It is difficult to give an exact definition of big data. However, as a good rule of thumb, if the dataset can’t fit in memory or disk on a single system, it can be called big data. Big data is used in areas such as healthcare, biology, bioinformatics, and education.

Classic computers were sufficient to analyze small and medium-sized data. But big data has fundamentally changed data analytics. These computers failed to analyze big data because big data didn’t fit in memory.

Cloud computing has changed the game. Now tools in Cloud computing platforms can analyze big data. In this post, I’ll show you how to perform linear regression analysis with BigQuery and cover the following topics:

  • What is BiqQuery?
  • What is BigQuery ML?
  • BigQuery pricing
  • Linear regression analysis with BigQuery

Let’s dive in!

What is BigQuery?

BigQuery is a fully-managed, serverless data warehouse that allows you to manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence. You can query terabytes in seconds and petabytes in minutes with BigQuery.

You can access BigQuery in the Google Cloud console interface and the BigQuery command-line tool. You can also work with BigQuery using client libraries in Python, Java, JavaScript, and Go, as well as BigQuery’s REST API and RPC API to transform and manage data.

What is BigQuery ML?

You can use BigQuery both to query big data and to build machine learning models using standard SQL queries. The BigQuery ML lets you discover, implement, and manage data tools to make critical business decisions.

As you know, building a machine learning model on a big dataset requires extensive programming and knowledge of ML frameworks such as TensorFlow and PyTorch. BigQuery ML helps SQL practitioners to build and evaluate machine learning models using existing SQL tools and skills.

Here are some advantages of BigQuery ML:

  • BigQuery ML democratizes the use of machine learning, you just need to know SQL.
  • It provides a variety of pre-built models that can be used out-of-the-box.
  • You can quickly build machine learning models without exporting data from the data warehouse.
  • It is easy to use and can be integrated into existing workflows.
  • It is cost-effective and can be used to train models on a pay-as-you-go basis.

You can work with BigQuery ML by using:

  • The Google Cloud console
  • The bq command-line tool
  • The BigQuery REST API
  • An external tool such as a Jupyter notebook or business intelligence platform

BigQuery Pricing

Bigquery is not free and charges you for data storage and running queries. But as part of the Google Cloud Free Tier, BigQuery provides some resources free of charge up to a specific limit. These free usage limits are available during and after the free trial period. For more information about the free usage tier, you can look here.

Regression Analysis with BigQuery

Linear regression is a statistical technique that is used to predict the future value of a dependent variable, based on a given independent variable. The dependent variable is the variable that is being predicted, while the independent variable is the variable that is being used to make the prediction.

Here are some advantages of linear regression:

  1. Linear regression is a simple and easy-to-use technique.
  2. It is a powerful technique that can be used to predict future values.
  3. It is a reliable technique that can be used to obtain accurate predictions.
  4. It is a flexible technique that can be used to predict values for different independent variables.
  5. It is a widely used technique that is well-known and understood by many statisticians.

Let’s look at how to perform a linear regression analysis with BigQuery in 6 steps.

Step 1. Create your dataset

Before building the machine learning model, you need to create a dataset. To do this, let’s go to the bigquery page first. After that, click on the three dots next to the project name and then press Create dataset.

On the Create dataset page, enter a name let’s say reg_analysis, keeping all the other options with default values, and then click CREATE DATASET.

Step 2. Exploring the dataset

The dataset I’m going to use is the new_york_citibike dataset. This dataset exists in the bigquery-public-data GCP project that hosts all the BigQuery public datasets. Let’s take a look at this dataset. Open the bigquery-public-data GCP project and find the new_york_citibike dataset. After that click on the citibike_trips table.

A new window will open showing information about the dataset.

In the schema tab, the structure of the citibike_trips table displays all the fields that can be used as labels and features. Here the tripduration field indicates the duration expressed in seconds of each bike rental. This column will be our target variable. All the other fields in the table are potential features since these fields are directly related to the distance to go through and affect the duration of the trip.

Now let’s take a look at the row count of the tripduration column with a SQL query.

As you can see, there are more than five million records where the tripduration field is not properly valued.

Step 3. Creating the Tables

Before training the model, the dataset is split into three sets in machine learning projects. These are the training, validation, and test sets. The model is built with the training set, the model is evaluated with the validation set, and the values in the test set are predicted with the model.

Let’s create the training set first:

Let’s go over these SQL comments. You can use the CREATE OR REPLACE TABLE statement to create a new table. The result of the query is stored in the new nyc_bike.training_table. You can write the columns you want to use in the SELECT statement.

The WHERE statement is used to filter data. In this statement, I used EXTRACTmethod to get the year and the month from the starttime field, which indicates the exact moment when the bike trip started. To exclude outliers in the tripduration column, I’m only taking into consideration the rows with a minimum rental time of 3 minutes and a maximum of 3 hours. I can also add a filter to the year of birth of our customers, by filtering all the records with an empty birth_year and ignoring all the customers born after 2007.

After running this query, you’ll see a table as below.

Let’s create the evaluation table.

After that let’s create the prediction table.

Nice! we created the tables. The next step is to create a regression model using the training table

Step 4. Building the Model

Now let’s build a linear regression model using the training table with BigQuery. We can use the following standard SQL query to build the model we use to predict the trip duration. I’m going to give the model a name, let’s say reg_model.

Query details

You can use the CREATE MODELclause to create and train the model named nyc_bike.reg_model with OPTIONS(model_type='linear_reg') clause.

I want to add an is_weekend. This field is a boolean value that calculates and extracts the day of the week in which the bike rental happened. It is generated by an IF statement that returns true if the day is Sunday or Saturday, represented by the values 1 and 7; otherwise, it’s false. The tripduration column is our input label column.

After clicking the run button, the query takes about 30 seconds and then your model appears in the navigation panel. Note that since we used a CREATE MODEL statement to create a table, we do not see query results.

Click reg_model to see model training statistics. Then hit the Education tab and then click Table. The results should look like this:

You can see the mean square error metric by default.

Step 5. Evaluating the model

After creating your model, you can use ML.EVALUATE function to evaluate the model. Let’s use the following query to evaluate our ML model on the evaluation table:

Query details

The top-most SELECT statement retrieves the columns from your model. I set the ML.EVALUATE function in FROM clause to evaluate the model. When you run this query, the results look like below

Step 6. Using the model to predict outcomes

After evaluating the model, the next step is to use it to predict an outcome. Let’s predict the labels using the data from the prediction table:

Query details

I set the ML.PREDICT function in FROM clause to predict the data in the prediction table. In the following screenshot, you can see the predicted labels.

That’s it. In this tutorial, we first created a linear regression model and then used this model to predict data that we did not use when building the model.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can either delete the project you created or save the project and delete the dataset.

Conclusion

In this post, I showed step-by-step how to perform linear regression analysis with BigQuery. You can obtain better output by fine-tuning the hyperparameter. You can find the BigQuery queries I used in this tutorial here. Here is also the youtube video showing the analysis I discussed in this post.

Thanks for reading. I hope you enjoy it. Don’t forget to follow us on YouTube | Medium | Twitter | GitHub | Linkedin | Kaggle 😎

Hands-on Machine Learning with Python

19 stories

Resources

If this post was helpful, please click the clap 👏 button below a few times to show me your support 👇

--

--

A new tech publication by Start it up (https://medium.com/swlh).

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store