[Big Query] Creating repeatable Splits Dataset

abehsu

Published in

ㄚ北的所見所聞

4 min readSep 30, 2019

Environment Preparation

Step1. Go to GCP Platform Website

Step2. Check AI platform and choose Notebooks

Step3. Create a new instance

there are some options you can choose, this is so amazing!

Step4. Open JupyterLab, click “open jupyerlab”, you will go to the website

Step5. Click the terminal icon, launch a terminal and clone the repository , after that you will see the directory show up.

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Start to try something

Step1. Go to BigQuery Website, understand the dataset

Step2. We want to predict the arrival delay of an airline based on the departure delay

Step3. Calculate the alpha, and to find out our RMSE on training and evaluate dataset

Here you can see, if we use rand() function to split the dataset, each-time we execute the same sql, we will get the different result,
so it is hard to let us repeatable experiment.

Step4. How about we calculate alpha and calculate RMSE inside one sql? It also can not provide the consistent on each time we execute.

Step5. This is why we can use hash function, to make sure the same value go to same bucket, and we can choose different bucket to separate our dataset.

select  date,
        MOD( ABS(FARM_FINGERPRINT(date)), 10) as bucket1,
        airline,
        airline_code,
        departure_airport,
        departure_state,
        departure_lat,
        departure_lon,
        arrival_airport, 
        arrival_state,
        arrival_lat,
        arrival_lon,
        departure_schedule,
        departure_actual,
        departure_delay,
        arrival_schedule,
        arrival_actual,
        arrival_delay
from `bigquery-samples.airline_ontime_data.flights`

Let’s see how the data look like, when bucket1 = 0, in the below picture, we can find that same date value will have the same bucket value, it is consistent.

select  date,
        MOD( ABS(FARM_FINGERPRINT(date)), 10) as bucket1,
        airline,
        airline_code,
        departure_airport,
        departure_state,
        departure_lat,
        departure_lon,
        arrival_airport, 
        arrival_state,
        arrival_lat,
        arrival_lon,
        departure_schedule,
        departure_actual,
        departure_delay,
        arrival_schedule,
        arrival_actual,
        arrival_delay
from `bigquery-samples.airline_ontime_data.flights`
where MOD( ABS(FARM_FINGERPRINT(date)), 10) = 0

Notice:1. be sure to split your data into those valid sets based on a column that you can afford to lose.2.The column which you choose, need to consider it's distributed and noisy, otherwise you can not have a good split.

sample code

hsuyuming/big_query_practice

You can't perform that action at this time. You signed in with another tab or window. You signed out in another tab or…

github.com

Reference

Launching into Machine Learning | Coursera

Learn Launching into Machine Learning from Google Cloud. Starting from a history of machine learning, we discuss why…

www.coursera.org

GoogleCloudPlatform/training-data-analyst

Labs and demos for courses for GCP Training (http://cloud.google.com/training). …