[Big Query] Creating repeatable Splits Dataset

abehsu
ㄚ北的所見所聞
4 min readSep 30, 2019

Environment Preparation

Step1. Go to GCP Platform Website

Step2. Check AI platform and choose Notebooks

Step3. Create a new instance

there are some options you can choose, this is so amazing!

Step4. Open JupyterLab, click “open jupyerlab”, you will go to the website

Step5. Click the terminal icon, launch a terminal and clone the repository , after that you will see the directory show up.

git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Start to try something

Step1. Go to BigQuery Website, understand the dataset

Step2. We want to predict the arrival delay of an airline based on the departure delay

Step3. Calculate the alpha, and to find out our RMSE on training and evaluate dataset

Here you can see, if we use rand() function to split the dataset, each-time we execute the same sql, we will get the different result,
so it is hard to let us repeatable experiment.

Step4. How about we calculate alpha and calculate RMSE inside one sql? It also can not provide the consistent on each time we execute.

Step5. This is why we can use hash function, to make sure the same value go to same bucket, and we can choose different bucket to separate our dataset.

select  date,
MOD( ABS(FARM_FINGERPRINT(date)), 10) as bucket1,
airline,
airline_code,
departure_airport,
departure_state,
departure_lat,
departure_lon,
arrival_airport,
arrival_state,
arrival_lat,
arrival_lon,
departure_schedule,
departure_actual,
departure_delay,
arrival_schedule,
arrival_actual,
arrival_delay
from `bigquery-samples.airline_ontime_data.flights`

Let’s see how the data look like, when bucket1 = 0, in the below picture, we can find that same date value will have the same bucket value, it is consistent.

select  date,
MOD( ABS(FARM_FINGERPRINT(date)), 10) as bucket1,
airline,
airline_code,
departure_airport,
departure_state,
departure_lat,
departure_lon,
arrival_airport,
arrival_state,
arrival_lat,
arrival_lon,
departure_schedule,
departure_actual,
departure_delay,
arrival_schedule,
arrival_actual,
arrival_delay
from `bigquery-samples.airline_ontime_data.flights`
where MOD( ABS(FARM_FINGERPRINT(date)), 10) = 0
Notice:1. be sure to split your data into those valid sets based on a column that you can afford to lose.2.The column which you choose, need to consider it's distributed and noisy, otherwise you can not have a good split.

sample code

Reference

--

--

abehsu
ㄚ北的所見所聞

喜歡探索不同領域的人事物,喜歡將自己學習到的也分享給大家,並產生共同的討論,從中得到回饋。 Linkedln(https://www.linkedin.com/in/hsuyuming/)