Google Cloud Professional Data Engineer Notes Part 1

Zaha
5 min read · May 13, 2024


Preparing for the Google Professional Data Engineer certification exam? Need quick notes before you take it? Then you have come to the right place. In this blog, you will find GCP Data Engineer notes aligned with the updated syllabus (as of 2024).

In this blog, I’ll start by covering the topics that matter most, i.e. the ones that, according to the majority of people who have taken the certification exam, had the most questions around them.

Google Cloud Platform (source: Project Pro)

Please note that this blog doesn’t cover all the topics in detail; if you want to study further, please visit the official documentation on the Google Cloud website.

Big Query:

  • Serverless, Highly Scalable, Cost-Effective Data Warehouse
  • Can query terabytes (TB) of data in seconds and petabytes (PB) of data in minutes
  • Supports real-time and historical data analysis, as well as OLAP analytical workloads
  • Provides long-term storage
  • Partitioned tables can be used for cost optimization, since queries scan only the relevant partitions, without affecting query performance
  • Clustering is used to further optimize queries by co-locating related rows (see the sketch after this list)
  • BI engine is an in-memory analytics service
  • The query execution plan provides details about how a query was executed
  • Supports legacy SQL and standard SQL (GoogleSQL), including complex queries
  • ACID compliant
  • Data storage is in columnar format
  • Automatic storage replication across multiple locations
  • Centralized management of data and compute resources
  • Can integrate with Data Catalog for data discovery and metadata management
  • Provides Dataset and project-level security and governance
  • Supports data separation by region to satisfy the Sovereignty requirements
  • Provides built-in data governance through Framework, Process, and Tools & Technology
  • Supports multi-cloud through Big Query Omni, which can query data stored in other clouds
  • HIPAA compliant; supports ML predictions via BigQuery ML
  • Built on top of Dremel (query execution engine)
  • Provides reservations to switch between on-demand and capacity-based pricing
  • Supports streaming data updates
  • Provides Federated queries to read data from external sources
  • Ingesting data can be done in real-time or in batches and it supports structured and semi-structured data
  • Data storage is at petabyte scale, organized into datasets and tables
  • Analyze the data through SQL queries, aggregates, and transforms
  • Visualize the data via Looker and spreadsheets
  • Materialized views are precomputed views that periodically cache query results to speed up queries and provide persistent results
  • Avoids data duplication through the federated data model (data is queried in place)
  • Big Query Data Transfer Service automates scheduled data migration into Big Query from sources such as GCS
  • Supports IAM roles
  • Pricing is based on storage (active/long-term), queries (on-demand/capacity-based), and ingestion & exports (batch/streaming/Storage Read & Write APIs)
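
To make the partitioning and clustering bullets concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table, and column names are placeholders, not a real setup.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses Application Default Credentials

# Hypothetical fully-qualified table id (project, dataset, and table names are placeholders).
table_id = "my-project.sales_ds.orders"

schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
    bigquery.SchemaField("order_date", "DATE"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by the DATE column so queries that filter on order_date scan fewer bytes.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="order_date"
)
# Cluster within each partition to co-locate rows with the same customer_id.
table.clustering_fields = ["customer_id"]
client.create_table(table, exists_ok=True)

# The WHERE clause on the partitioning column lets BigQuery prune partitions,
# so you are billed only for the bytes actually scanned.
query = """
    SELECT customer_id, SUM(amount) AS total
    FROM `my-project.sales_ds.orders`
    WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY customer_id
"""
for row in client.query(query).result():
    print(row.customer_id, row.total)
```

Because the filter is on the partitioning column, only January’s partitions are scanned; clustering on customer_id further reduces the data read within each partition.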

Cloud Storage:

  • Object storage for unstructured data, where data is stored as objects across multiple storage classes and regional/multi-region locations
  • It is a Blob storage service
  • Used mainly for data archiving, backup, and recovery
  • An object is an immutable piece of data
  • Object versioning retains a noncurrent version of an object whenever it is overwritten or deleted
  • Bucket Lock locks a retention policy on a bucket so objects cannot be deleted before the retention period expires
  • IAM controls access to buckets and objects in terms of update/create/delete permissions
  • Lifecycle rules can be set up to automatically delete (or downgrade the storage class of) objects after the required retention period to optimize costs (see the sketch after this list)
  • Storing data in a Cloud Storage bucket is the most cost-effective option for long-term storage on the Google Cloud Platform (GCP)
  • There are 4 Google Cloud Storage classes, ordered by access frequency and minimum storage duration — Standard (frequent access), Nearline (30 days), Coldline (90 days), Archive (365 days)
  • Interacting with GCS (Google Cloud Storage) can be done in many ways — Terraform, Cloud Storage FUSE (mount buckets as a local file system), the console, the Google Cloud CLI, APIs, and client libraries
  • Objects are stored in buckets and have managed folders for access management
  • To record all the operations/requests on a bucket, we can turn on Data Access audit logging
  • We must follow the bucket naming conventions stated in the official documentation (e.g., 3–63 characters, no IP addresses, etc.)
  • We can specify the bucket encryption type while creating a bucket, e.g., Google-managed keys or customer-managed encryption keys (CMEK)
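
Here is a minimal sketch of the versioning, storage-class, and lifecycle points above, using the google-cloud-storage Python client; the bucket name, location, age thresholds, and file paths are illustrative assumptions.

```python
from google.cloud import storage

client = storage.Client()

# Placeholder bucket name; names must be globally unique, 3-63 characters, no IP addresses, etc.
bucket = client.bucket("my-archive-bucket-12345")
bucket.storage_class = "STANDARD"   # default class for newly written objects
bucket.versioning_enabled = True    # keep noncurrent versions when objects are overwritten or deleted

# Lifecycle rules: move objects to colder classes as they age, then delete them.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)

client.create_bucket(bucket, location="US")  # multi-region location

# Objects are immutable; re-uploading the same name creates a new generation
# (retained as a noncurrent version because versioning is enabled).
blob = bucket.blob("backups/2024-05-01.tar.gz")
blob.upload_from_filename("local-backup.tar.gz")
```

The age thresholds loosely mirror the minimum storage durations of the colder classes, so rarely accessed objects drift into cheaper storage automatically.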

Dataflow

  • Create & design end-to-end pipelines and data processing using Dataflow
  • It supports both batch data and streaming data
  • Fully managed GCP service through a pool of VMs
  • It’s built on Apache Beam (open source); pipelines are executed by a runner, such as the Dataflow Runner
  • It supports languages like Java, Python, Go
  • Supports auto-scaling and parallel data processing via multiple VMs
  • It does per-second billing
  • Provides exactly-once data processing and storage to avoid duplicates
  • It’s flexible, observable, and portable, and uses a data pipeline model (data moves through a series of steps)
  • Uses a streaming engine to separate compute and storage to improve performance and decrease resource utilization
  • When stopping or updating a streaming job, Drain lets in-flight data finish processing instead of being discarded
  • ParDo — the core parallel processing operation in the Beam SDK; DoFn — user-defined function (UDF) containing the custom data-processing code a ParDo applies; PCollection — an abstraction of a distributed, multi-element dataset (see the pipeline sketch after this list)
  • Watermark — the system’s notion of when all the data in a certain window is expected to have arrived
  • Triggers — determine when to emit aggregated results as data arrives; by default, at the end of the window once the watermark has passed. There are 3 types of triggers — composite, data-driven, and time-based
  • Windowing — divides a PCollection based on timestamps so that each window holds a finite set of elements. There are 4 types of windowing — single global window, fixed-time/tumbling window, sliding-time/hopping window, and session window
  • PTransform — the general abstraction for a pipeline step, used to build complex data transformations like data type checks, data validation, etc.
  • Pre-built Dataflow templates are available in GCP.
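
The Beam concepts above (PCollection, ParDo/DoFn, windowing) fit together as in this minimal Apache Beam Python sketch; the bucket paths, CSV layout, and runner choice are placeholder assumptions, not a production pipeline.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows, TimestampedValue

class ParseEvent(beam.DoFn):
    # DoFn: user-defined processing logic that a ParDo applies to each element in parallel.
    def process(self, line):
        user_id, amount, epoch_seconds = line.split(",")
        # Attach the event time so windowing groups elements by when they happened.
        yield TimestampedValue((user_id, float(amount)), float(epoch_seconds))

# DirectRunner runs locally; switching to runner="DataflowRunner" (plus project,
# region, and temp_location options) runs the same pipeline as a managed Dataflow job.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/events/*.csv")  # -> PCollection of lines
        | "Parse" >> beam.ParDo(ParseEvent())                            # ParDo applies the DoFn
        | "Window" >> beam.WindowInto(FixedWindows(60))                  # fixed/tumbling 60-second windows
        | "SumPerUser" >> beam.CombinePerKey(sum)                        # aggregate per key within each window
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/totals")
    )
```

Each `>>` step is a PTransform; the windowing step means the per-user sums are emitted once per 60-second window rather than over the whole dataset.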

Big Query administration

  • BQ Administration is used to manage various components in Big Query as listed below
  • Manage resources — folders, projects, datasets, tables
  • Secure resources — restricted access
  • Manage workloads — jobs, queries, compute
  • Monitor resources — quotas, jobs, compute (see the sketch after this list)
  • Optimize workloads — cost control and best performance
  • Troubleshoot — errors, billing issues, quotas
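
As a small example of the monitoring and cost-control items above, here is a sketch using the google-cloud-bigquery Python client; the example table name is a placeholder.

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()  # project comes from the environment / credentials

# Monitor workloads: list the project's jobs (queries, loads, exports) from the last day.
since = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1)
for job in client.list_jobs(min_creation_time=since, max_results=20, all_users=True):
    print(job.job_id, job.job_type, job.state)

# Cost control: dry-run a query to see how many bytes it would scan before actually running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT * FROM `my-project.sales_ds.orders`", job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes")
```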

Big Query Omni

  • Big Query analytics on data stored in other cloud platforms like AWS and Azure, using BigLake tables
  • The compute clusters reside in AWS/Azure, and billing is on demand for the queries you run

Okay, so let’s pause here. Until I’m back with part 2 of these notes, please refer to the other resources already available, such as the GCP sketchnotes, the official Google Cloud documentation and videos, and blogs on Medium authored by many others who have successfully passed the exam.

Thank you for reading :)
