Google Cloud Platform (GCP) for Data Scientists and an Introduction to BigQuery, Part 1

Google Cloud Platform, aka GCP, is a suite of cloud computing services. It provides IaaS, PaaS, and serverless computing environments. As ML and data science become more integral to business today, companies are adopting cloud-based environments for their scalability, performance, and ease of use, with solutions ranging from AWS to GCP to Microsoft Azure, and many services within each of these platforms.

As I continue to learn and grow my skills as a data scientist, I find it essential to explore GCP BigQuery, a data warehouse engine that makes life easier thanks to its usability and scalability. As data scientists we work with data every day, mostly doing data analytics or running complex ML models on data we access from databases or warehouses. We end up doing far more read activity than write activity. BigQuery is designed to be used as a write-once, read-many database. Before we go deeper into it, let's have a look at the other solutions on the GCP platform.

GCP products and their comparison to other products

BigQuery Introduction and Key Ideas

Projects and Datasets

GCP Projects define the parent container inside which all data resides in BigQuery. Different projects can have different BigQuery instances. Access is controlled individually or through project-level controls. Next comes the BigQuery dataset.

A BigQuery dataset is similar to the concept of a database in other RDBMSs. Datasets govern the geographical location of the data they contain; supported locations include US, Europe, and Asia. Tables created within a dataset in one location cannot be joined with tables in other locations. Datasets also provide a logical grouping of information, similar to databases.

We can create different datasets for different applications like inventory, finance, etc., or for different purposes like development, QA, and production. Datasets contain tables, which in turn store all the data.

All tables used in a query must come from datasets within the same location. This limitation should be kept in mind while designing datasets and tables.
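As a minimal sketch of how a dataset's location is fixed at creation time (assuming the google-cloud-bigquery Python client and the hypothetical PatientData dataset name used later in this post), something like the following can be used:

from google.cloud import bigquery

client = bigquery.Client()  # uses the project from your default credentials

# "PatientData" and the US location are illustrative values for this post.
dataset = bigquery.Dataset(f"{client.project}.PatientData")
dataset.location = "US"  # every table in this dataset will live in the US region
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Created dataset {dataset.full_dataset_id} in {dataset.location}")

Once a dataset is created, its location cannot be changed, which is why the cross-location join restriction matters at design time.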

Loading into BigQuery from GCP Storage

Loading can be done using the various BigQuery interfaces, which include:

  • WebUI
  • Google Cloud Shell
  • REST API
  • Programming SDKs in C#, Go, Java, Node.js, Python, Ruby, and PHP, which allow you to insert, update, delete, and run queries through the REST API calls.
Web UI for BigQuery

Loading can be done using the file upload feature from your local system or from other cloud sources.

Data can also be loaded directly into a table using the GCP Cloud Shell with the command below.

bq --location=US load --source_format=CSV --autodetect PatientData.normal_ranges gs://exercise-dj/normal_ranges.csv
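The same load can be driven from Python instead of the shell. This is a sketch assuming the google-cloud-bigquery client and the same bucket and table names as the bq command above:

from google.cloud import bigquery

client = bigquery.Client()

# Mirrors the bq command above: CSV load from Cloud Storage with schema autodetection.
table_id = f"{client.project}.PatientData.normal_ranges"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://exercise-dj/normal_ranges.csv", table_id, job_config=job_config
)
load_job.result()  # wait for the load job to finish
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")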
Compose Query opens the Query building section

One can use queries such as the ones below in the New Query section to get the data.

select name, substr(name, strpos(name, " ")) as last_name
from `PatientData.patient_details`
where STARTS_WITH(name, "Ben")

select name, regexp_extract(name, r" .*") as lastname, regexp_replace(name, r" ", "_") as replacedname
from `PatientData.patient_details`
where starts_with(name, "Ben")

SELECT p1.patient_id, p1.name, p2.blood_sugar, p2.CHOLESTEROL,
       p2.Blood_Pressure, p2.heart_rate
FROM `PatientData.patient_details` p1
INNER JOIN `PatientData.test_results` p2 ON p1.patient_id = p2.patient_id
WHERE p1.exercise_days = 0

update `PatientData.normal_ranges`
set min = 75
where measure = 'BLOOD_SUGAR'

delete from `PatientData.normal_ranges`
where measure = 'HEART_RATE'

Note: Some of the features in BigQuery are only available with Standard SQL, not Legacy SQL. Thus, make sure to uncheck the Legacy SQL option when running a query in order to avoid errors.
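When querying from code rather than the Web UI, the same choice can be made explicit. A minimal sketch, assuming the google-cloud-bigquery client and the PatientData tables used above:

from google.cloud import bigquery

client = bigquery.Client()

# Standard SQL is the client library's default; setting it explicitly documents the intent.
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)

query = """
    SELECT name, SUBSTR(name, STRPOS(name, " ")) AS last_name
    FROM `PatientData.patient_details`
    WHERE STARTS_WITH(name, "Ben")
"""
for row in client.query(query, job_config=job_config).result():
    print(row.name, row.last_name)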

Partitioning Tables

BigQuery allows data to be partitioned based on either ingestion time or an explicit date or timestamp column. When a table is partitioned by ingestion time, BigQuery creates two pseudo columns called _PARTITIONDATE and _PARTITIONTIME. These columns can be used in queries.

When partitioning is done on an explicit date or timestamp column that exists in the table, no pseudo columns are created. The partitions are based on the column's data, and this improves query performance because BigQuery scans a smaller number of rows. It is recommended to use partitioning for transaction-based tables.

select * from `PatientData.test_results_partitioned` where TEST_DATE="2018-06-18"
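A column-partitioned table can also be created from code. This is a sketch assuming the google-cloud-bigquery client and a simplified, hypothetical schema for the test results table:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical, trimmed-down schema; partitioning is on the TEST_DATE column.
table = bigquery.Table(
    f"{client.project}.PatientData.test_results_partitioned",
    schema=[
        bigquery.SchemaField("patient_id", "STRING"),
        bigquery.SchemaField("TEST_DATE", "DATE"),
        bigquery.SchemaField("blood_sugar", "FLOAT"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="TEST_DATE",  # omit `field` to partition by ingestion time instead
)
table = client.create_table(table, exists_ok=True)
print(f"Created {table.table_id}, partitioned on {table.time_partitioning.field}")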

External Storage

Google BigQuery allows users to query data directly from other Google Cloud data sources, like Cloud Storage or Bigtable. You can create a table that references an external source; no data is stored inside BigQuery. Rather, every time the data is queried, the query is executed against the external data source and the results are returned. This saves on storage costs but has an impact on the performance of these queries.
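A table backed by external storage can be defined along these lines; this is a sketch assuming the google-cloud-bigquery client, with the table name and the CSV file in the bucket as illustrative values:

from google.cloud import bigquery

client = bigquery.Client()

# Nothing is copied into BigQuery; queries against this table read the CSV directly.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://exercise-dj/normal_ranges.csv"]
external_config.autodetect = True

table = bigquery.Table(f"{client.project}.PatientData.normal_ranges_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)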

Saving as View

We can also save the result of a query as a view using the Save View button on the query page.

select p1.patient_id, p1.gender, p1.age, p1.drink, p1.smoke, p2.blood_sugar, p2.cholesterol
from [PatientData.patient_details] p1
inner join [PatientData.test_results] p2 on p1.patient_id = p2.patient_id
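The same view can be created programmatically. A sketch assuming the google-cloud-bigquery client and a hypothetical view name; note that views created through the API use standard SQL by default, so the query is written with backticks rather than the legacy bracket syntax:

from google.cloud import bigquery

client = bigquery.Client()

# "patient_test_view" is an illustrative name for the saved view.
view = bigquery.Table(f"{client.project}.PatientData.patient_test_view")
view.view_query = f"""
    SELECT p1.patient_id, p1.gender, p1.age, p1.drink, p1.smoke,
           p2.blood_sugar, p2.cholesterol
    FROM `{client.project}.PatientData.patient_details` p1
    INNER JOIN `{client.project}.PatientData.test_results` p2
      ON p1.patient_id = p2.patient_id
"""
client.create_table(view, exists_ok=True)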

Labels In BigQuery

When a BigQuery repository is used by multiple individuals and teams, and they end up creating many datasets and tables, it becomes difficult to track why they were created and for what purpose. Sometimes it is important to group these resources for billing or other purposes. You can do this by creating labels. Labels are name-value pairs that can be used to attach a piece of metadata to a resource like a dataset, a table, or a view.

To add or update a label, run the command below in the GCP Cloud Shell:

bq update --set_label department:hospital PatientData
Dataset 'broken-window-XXXXX:PatientData' successfully updated.

To see the labels, type the command below in the GCP Cloud Shell:

bq show --format=pretty broken-window-XXXXX:PatientData

To find all datasets in the project with a particular label

bq ls --filter "labels.department:hospital" --project_id broken-window-XXXXX
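The same labels can also be managed from Python. A minimal sketch assuming the google-cloud-bigquery client and the PatientData dataset from above:

from google.cloud import bigquery

client = bigquery.Client()

# Attach the department:hospital label to the dataset.
dataset = client.get_dataset(f"{client.project}.PatientData")
dataset.labels = {"department": "hospital"}
dataset = client.update_dataset(dataset, ["labels"])
print(dataset.labels)

# Scan the project's datasets and report those carrying the label.
for ds in client.list_datasets():
    if ds.labels.get("department") == "hospital":
        print(ds.dataset_id, ds.labels)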

BigQuery with Python

The google-cloud-bigquery package in Python can be used to connect to the BigQuery instance on Google Cloud, and data can then be accessed using standard SQL.

# Import the Google Cloud client library
from google.cloud import bigquery

# Create an instance of the client
client = bigquery.Client()

# Create a query job
query_job = client.query("""
    SELECT
      CONCAT(
        'https://stackoverflow.com/questions/',
        CAST(id AS STRING)) AS url,
      view_count
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE tags LIKE '%google-bigquery%'
    ORDER BY view_count DESC
    LIMIT 10""")

# Fetch the query results
results = query_job.result()  # Waits for the job to complete.
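Continuing the snippet above, the result object can then be consumed row by row, or, if pandas is installed, converted to a DataFrame (the to_dataframe() call assumes the pandas dependency is available):

# Rows behave like lightweight records; fields are accessible by name.
for row in results:
    print(row.url, row.view_count)

# With pandas installed, the same result can be pulled into a DataFrame.
df = query_job.to_dataframe()
print(df.head())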

Best Practices in BigQuery

It is essential to have the right table design, as design dictates the cost of using GCP as well as the performance of the database and queries. Some of the best practices in BigQuery are listed below.

  • Denormalization: This makes queries run faster. Since the data is expected to be read many times and written rarely, read performance is key, so denormalization should be used as much as possible.
  • Partitioning: Partition using the transaction date or data load date. This improves performance and lets you optimize queries, reducing compute and the cost of using GCP.
  • Expiring Tables: Remove stray or unused tables. Isolate temporary tables into separate datasets and set expiration times on them; partitions can also be expired (see the sketch after this list).
  • Compare costs to store: Know which product works best for your use case, and choose between native and external tables based on actual usage and performance requirements. If you have data you analyze only once a year, store it in Cloud Storage and, when it needs to be analyzed, create an external table in BigQuery rather than storing it as a native table; remove it once the analysis is done.
  • Speeding up Queries: Denormalize frequently joined tables, avoid self joins and cartesian joins, minimize rows using filtering, and use partitions. Native BigQuery tables are also faster than external tables, so use native tables for frequently queried data. Use the correct data types and avoid casting and conversion.
  • Monitoring and Logging: Use Stackdriver in GCP to monitor query performance and storage. Cloud Audit Logs also provide an audit trail of access and usage, along with filter and query features on the audit logs.
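As a small illustration of expiring tables (assuming the google-cloud-bigquery client, a hypothetical PatientData.temp_results table, and a one-week retention period):

import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Expire a single temporary table one week from now.
table = client.get_table(f"{client.project}.PatientData.temp_results")
table.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=7)
client.update_table(table, ["expires"])

# Alternatively, give a scratch dataset a default expiration for all of its tables.
dataset = client.get_dataset(f"{client.project}.PatientData")
dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000  # seven days
client.update_dataset(dataset, ["default_table_expiration_ms"])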

Hope this blog helps and introduces you to BigQuery. Please comment or let me know if you have any questions. I will be working on EDA using Python and BigQuery in Part 2, and once it is completed I will link that blog here.


Sources :

You can use the free trial with $300 of credit to create a free GCP account with your Gmail. Once that is done, access BigQuery from the Big Data section.

https://cloud.google.com/bigquery/docs/

https://www.linkedin.com/learning/data-science-on-google-cloud-platform-designing-data-warehouses/why-data-warehouses-are-important (this is the training I did to learn the BigQuery basics).