Google Cloud Professional Data Engineer Notes Part 1

Zaha
5 min read · May 13, 2024


Preparing for the Google Professional Data Engineer certification exam? Need quick notes before you take it? Then you have come to the right place. In this blog, you will find GCP Data Engineer notes aligned with the updated syllabus (as of 2024).

In this blog, I’ll start by covering the topics that matter most, i.e. the ones that, according to the majority of people who have taken the certification exam, had the most questions around them.

Google Cloud Platform (source: Project Pro)

Please note that this blog doesn’t cover all the topics in detail; if you want to study further, please visit the official documentation on the Google Cloud website.

Big Query:

  • Serverless, Highly Scalable, Cost-Effective Data Warehouse
  • Can query terabytes (TB) of data in seconds and petabytes (PB) of data in minutes
  • Supports real-time and historical data analysis, as well as OLAP analytical workloads
  • Provides long-term storage
  • Partitioned tables can be used for cost optimization, since queries scan only the relevant partitions, without affecting query performance
  • Clustering is used to further optimize queries by co-locating related rows (see the sketch after this list)
  • BI engine is an in-memory analytics service
  • The query execution plan provides details about how a query was executed
  • Supports legacy SQL and standard SQL (GoogleSQL), including complex queries
  • ACID compliant
  • Data storage is in columnar format
  • Automatic storage replication across multiple locations
  • Centralized management of data and compute resources
  • Can integrate with Data Catalog for data discovery and metadata management
  • Provides Dataset and project-level security and governance
  • Supports data separation by region to satisfy the Sovereignty requirements
  • Provides built-in data governance through Framework, Process, and Tools & Technology
  • Supports multi-cloud through Big Query Omni, which can query data stored in other clouds
  • HIPAA compliant; supports ML predictions via BigQuery ML
  • Built on top of Dremel (query execution engine)
  • Provides reservations to switch between on-demand and capacity-based pricing
  • Supports streaming data updates
  • Provides Federated queries to read data from external sources
  • Ingesting data can be done in real-time or in batches and it supports structured and semi-structured data
  • Data storage is at petabyte scale, organized into datasets and tables
  • Analyze the data through SQL queries, aggregates, and transforms
  • Visualize the data via Looker and spreadsheets
  • Materialized views are precomputed views that periodically cache query results to speed up queries and provide persistent results
  • Avoids data duplication through the federated data model (data is queried in place)
  • Big Query Data Transfer Service automates scheduled data migration into Big Query from sources such as GCS
  • Supports IAM roles
  • Pricing is based on storage (active/long-term), queries (on-demand/capacity-based), and ingestion & exports (batch/streaming/Storage Read & Write APIs)
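
To make the partitioning and clustering bullets concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table, and column names are placeholders, not a real setup.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses Application Default Credentials

# Hypothetical fully-qualified table id (project, dataset, and table names are placeholders).
table_id = "my-project.sales_ds.orders"

schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
    bigquery.SchemaField("order_date", "DATE"),
]

table = bigquery.Table(table_id, schema=schema)
# Partition by the DATE column so queries that filter on order_date scan fewer bytes.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="order_date"
)
# Cluster within each partition to co-locate rows with the same customer_id.
table.clustering_fields = ["customer_id"]
client.create_table(table, exists_ok=True)

# The WHERE clause on the partitioning column lets BigQuery prune partitions,
# so you are billed only for the bytes actually scanned.
query = """
    SELECT customer_id, SUM(amount) AS total
    FROM `my-project.sales_ds.orders`
    WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY customer_id
"""
for row in client.query(query).result():
    print(row.customer_id, row.total)
```

Because the filter is on the partitioning column, only January’s partitions are scanned; clustering on customer_id further reduces the data read within each partition.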

Cloud Storage:

  • Object storage for unstructured data, where data is stored as objects across multiple storage classes and regional/multi-region locations
  • It is a Blob storage service
  • Used mainly for data archiving, backup, and recovery
  • An object is an immutable piece of data
  • Object versioning retains a noncurrent version of an object whenever it is overwritten or deleted
  • Bucket Lock locks a retention policy on a bucket so objects cannot be deleted before the retention period expires
  • IAM controls access to buckets and objects in terms of update/create/delete permissions
  • Lifecycle rules can be set up to automatically delete (or downgrade the storage class of) objects after the required retention period to optimize costs (see the sketch after this list)
  • Storing data in a Cloud Storage bucket is the most cost-effective option for long-term storage on the Google Cloud Platform (GCP)
  • There are 4 Google Cloud Storage classes, ordered by access frequency and minimum storage duration — Standard (frequent access), Nearline (30 days), Coldline (90 days), Archive (365 days)
  • Interacting with GCS (Google Cloud Storage) can be done in many ways — Terraform, Cloud Storage FUSE (mount buckets as a local file system), the console, the Google Cloud CLI, APIs, and client libraries
  • Objects are stored in buckets and have managed folders for access management
  • To record all the operations/requests on a bucket, we can turn on Data Access audit logging
  • We must follow the bucket naming conventions stated in the official documentation (e.g., 3–63 characters, no IP addresses, etc.)
  • We can specify the bucket encryption type while creating a bucket, e.g., Google-managed keys or customer-managed encryption keys (CMEK)
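
Here is a minimal sketch of the versioning, storage-class, and lifecycle points above, using the google-cloud-storage Python client; the bucket name, location, age thresholds, and file paths are illustrative assumptions.

```python
from google.cloud import storage

client = storage.Client()

# Placeholder bucket name; names must be globally unique, 3-63 characters, no IP addresses, etc.
bucket = client.bucket("my-archive-bucket-12345")
bucket.storage_class = "STANDARD"   # default class for newly written objects
bucket.versioning_enabled = True    # keep noncurrent versions when objects are overwritten or deleted

# Lifecycle rules: move objects to colder classes as they age, then delete them.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)

client.create_bucket(bucket, location="US")  # multi-region location

# Objects are immutable; re-uploading the same name creates a new generation
# (retained as a noncurrent version because versioning is enabled).
blob = bucket.blob("backups/2024-05-01.tar.gz")
blob.upload_from_filename("local-backup.tar.gz")
```

The age thresholds loosely mirror the minimum storage durations of the colder classes, so rarely accessed objects drift into cheaper storage automatically.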

Dataflow

  • Create & design end-to-end pipelines and data processing using Dataflow
  • It supports both batch data and streaming data
  • Fully managed GCP service through a pool of VMs
  • It’s built on Apache Beam (open source); pipelines are executed by a runner, such as the Dataflow Runner
  • It supports languages like Java, Python, Go
  • Supports auto-scaling and parallel data processing via multiple VMs
  • It does per-second billing
  • Provides exactly-once data processing and storage to avoid duplicates
  • It’s flexible, observable, and portable, and uses a data pipeline model (data moves through a series of steps)
  • Uses a streaming engine to separate compute and storage to improve performance and decrease resource utilization
  • When stopping or updating a streaming job, Drain lets in-flight data finish processing instead of being discarded
  • ParDo — the core parallel processing operation in the Beam SDK; DoFn — user-defined function (UDF) containing the custom data-processing code a ParDo applies; PCollection — an abstraction of a distributed, multi-element dataset (see the pipeline sketch after this list)
  • Watermark — the system’s notion of when all the data in a certain window is expected to have arrived
  • Triggers — determine when to emit aggregated results as data arrives; by default, at the end of the window once the watermark has passed. There are 3 types of triggers — composite, data-driven, and time-based
  • Windowing — divides a PCollection based on timestamps so that each window holds a finite set of elements. There are 4 types of windowing — single global window, fixed-time/tumbling window, sliding-time/hopping window, and session window
  • PTransform — the general abstraction for a pipeline step, used to build complex data transformations like data type checks, data validation, etc.
  • Pre-built Dataflow templates are available in GCP.
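
The Beam concepts above (PCollection, ParDo/DoFn, windowing) fit together as in this minimal Apache Beam Python sketch; the bucket paths, CSV layout, and runner choice are placeholder assumptions, not a production pipeline.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows, TimestampedValue

class ParseEvent(beam.DoFn):
    # DoFn: user-defined processing logic that a ParDo applies to each element in parallel.
    def process(self, line):
        user_id, amount, epoch_seconds = line.split(",")
        # Attach the event time so windowing groups elements by when they happened.
        yield TimestampedValue((user_id, float(amount)), float(epoch_seconds))

# DirectRunner runs locally; switching to runner="DataflowRunner" (plus project,
# region, and temp_location options) runs the same pipeline as a managed Dataflow job.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/events/*.csv")  # -> PCollection of lines
        | "Parse" >> beam.ParDo(ParseEvent())                            # ParDo applies the DoFn
        | "Window" >> beam.WindowInto(FixedWindows(60))                  # fixed/tumbling 60-second windows
        | "SumPerUser" >> beam.CombinePerKey(sum)                        # aggregate per key within each window
        | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/totals")
    )
```

Each `>>` step is a PTransform; the windowing step means the per-user sums are emitted once per 60-second window rather than over the whole dataset.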

Big Query administration

  • BQ Administration is used to manage various components in Big Query as listed below
  • Manage resources — folders, projects, datasets, tables
  • Secure resources — restricted access
  • Manage workloads — jobs, queries, compute
  • Monitor resources — quotas, jobs, compute (see the sketch after this list)
  • Optimize workloads — cost control and best performance
  • Troubleshoot — errors, billing issues, quotas
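
As a small example of the monitoring and cost-control items above, here is a sketch using the google-cloud-bigquery Python client; the example table name is a placeholder.

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()  # project comes from the environment / credentials

# Monitor workloads: list the project's jobs (queries, loads, exports) from the last day.
since = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1)
for job in client.list_jobs(min_creation_time=since, max_results=20, all_users=True):
    print(job.job_id, job.job_type, job.state)

# Cost control: dry-run a query to see how many bytes it would scan before actually running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT * FROM `my-project.sales_ds.orders`", job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes")
```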

Big Query Omni

  • Big Query analytics on data stored in other cloud platforms like AWS and Azure, using BigLake tables
  • The compute clusters reside in AWS/Azure, and billing is on demand for the queries you run

Okay, so let’s pause here. Until I’m back with part 2 of these notes, please refer to the other resources already available, such as the GCP sketchnotes, the official Google Cloud documentation and videos, and blogs on Medium authored by many others who have successfully passed the exam.

Thank you for reading :)
