Step 2. Open AI Platform and choose Notebooks
Step 3. Create a new instance
There are several options you can choose from here (machine type, framework, and so on).
Step 4. Open JupyterLab: click "Open JupyterLab" and you will be taken to the JupyterLab site
Step 5. Click the Terminal icon to launch a terminal and clone the repository; after that you will see the directory show up.
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
Start trying it out
Step 1. Go to the BigQuery console and get familiar with the dataset
Step 2. We want to predict the arrival delay of a flight based on its departure delay
Step 3. Calculate alpha, and find the RMSE on the training and evaluation datasets
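The model here is a simple one-parameter fit, arrival_delay ≈ alpha × departure_delay. A minimal sketch of the two queries (the exact queries in the lab notebook may differ; the RAND()-based split shown is the naive, non-repeatable one this step uses):

```sql
-- alpha via zero-intercept least squares: alpha = SUM(x*y) / SUM(x*x),
-- computed on a random ~80% "training" slice
SELECT SAFE_DIVIDE(SUM(arrival_delay * departure_delay),
                   SUM(departure_delay * departure_delay)) AS alpha
FROM `bigquery-samples.airline_ontime_data.flights`
WHERE RAND() < 0.8;

-- RMSE of the fitted model on a held-out random slice
-- (0.97 is a placeholder: plug in the alpha from the previous query)
SELECT SQRT(AVG(POW(departure_delay * 0.97 - arrival_delay, 2))) AS rmse
FROM `bigquery-samples.airline_ontime_data.flights`
WHERE RAND() < 0.2;
```

Because RAND() draws new values on every run, the rows in each slice change every time these queries are executed.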
Here you can see that if we use the RAND() function to split the dataset, we get a different result each time we execute the same SQL, so the experiment is not repeatable.
Step 4. What if we calculate alpha and the RMSE inside a single SQL statement? That still does not give consistent results across executions.
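A sketch of what that single statement might look like (the CTE names are mine, not the lab's). Since BigQuery does not materialize a WITH clause, each reference to the flights CTE can re-evaluate RAND(), so even this combined form does not pin down one split:

```sql
-- One statement: fit alpha on ~80% of rows, score RMSE on the rest.
-- Still not repeatable: RAND() in the CTE may be re-drawn per reference.
WITH flights AS (
  SELECT departure_delay, arrival_delay, RAND() AS split_field
  FROM `bigquery-samples.airline_ontime_data.flights`
),
training AS (
  SELECT SAFE_DIVIDE(SUM(arrival_delay * departure_delay),
                     SUM(departure_delay * departure_delay)) AS alpha
  FROM flights
  WHERE split_field < 0.8
)
SELECT SQRT(AVG(POW(departure_delay * alpha - arrival_delay, 2))) AS rmse
FROM flights, training
WHERE split_field >= 0.8;
```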
Step 5. This is why we use a hash function: the same value always lands in the same bucket, so we can assign different buckets to different splits of our dataset.
SELECT date,
  MOD(ABS(FARM_FINGERPRINT(date)), 10) AS bucket1,
  airline,
  airline_code,
  departure_airport,
  departure_state,
  departure_lat,
  departure_lon,
  arrival_airport,
  arrival_state,
  arrival_lat,
  arrival_lon,
  departure_schedule,
  departure_actual,
  departure_delay,
  arrival_schedule,
  arrival_actual,
  arrival_delay
FROM `bigquery-samples.airline_ontime_data.flights`
Let’s see what the data looks like when bucket1 = 0. In the picture below, you can see that the same date value always gets the same bucket value: it is consistent.
SELECT date,
  MOD(ABS(FARM_FINGERPRINT(date)), 10) AS bucket1,
  airline,
  airline_code,
  departure_airport,
  departure_state,
  departure_lat,
  departure_lon,
  arrival_airport,
  arrival_state,
  arrival_lat,
  arrival_lon,
  departure_schedule,
  departure_actual,
  departure_delay,
  arrival_schedule,
  arrival_actual,
  arrival_delay
FROM `bigquery-samples.airline_ontime_data.flights`
WHERE MOD(ABS(FARM_FINGERPRINT(date)), 10) = 0
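With a hash bucket computed per row, a deterministic train/evaluation/test split falls out naturally. A sketch of an 80/10/10 assignment (which buckets map to which split is an arbitrary choice; proportions are approximate because different dates have different row counts):

```sql
-- Training set: buckets 0-7 (~80% of rows); every row from the same
-- date always lands in the same split, run after run.
SELECT departure_delay, arrival_delay
FROM `bigquery-samples.airline_ontime_data.flights`
WHERE MOD(ABS(FARM_FINGERPRINT(date)), 10) < 8;

-- Evaluation set: same query with
--   MOD(ABS(FARM_FINGERPRINT(date)), 10) = 8   (~10%)
-- Test set: same query with
--   MOD(ABS(FARM_FINGERPRINT(date)), 10) = 9   (~10%)
```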
Notice:
1. Be sure to split your data into valid sets based on a column that you can afford to lose: once you hash on it, you should not also use it as a feature.
2. Consider how the column you choose is distributed and how noisy it is; otherwise you cannot get a good split.
Sample code
Reference