Google Cloud Professional Data Engineer Notes Part 2

Zaha
4 min read · Jun 15, 2024

--

In the previous blog, Google Cloud Professional Data Engineer Notes Part 1, we covered five topics from the GCP Data Engineer syllabus: BigQuery, Cloud Storage, Dataflow, BigQuery Administration, and BigQuery Omni. Now let us continue with some other topics that are part of the GCP DE study path.

GCP Data Engineering (source: Connecting Dots)

Please note that this blog doesn’t cover every topic in detail; if you want to study further, please visit the official documentation on the Google Cloud website.

Bigtable

  • NoSQL big data DB Service (non-relational)
  • Terabyte to Petabyte scale of data
  • Low latency — stores large amounts of single-keyed data and supports high read & write throughput
  • Data is row-indexed, i.e., each row is identified by a row key, and each individual value should be ≤ 10 MB
  • Scalable | Simple administration | Auto-replication
  • Ideal for Map Reduce operations, Streaming Analytics, Machine Learning Applications
  • No downtime while the cluster resizing
  • One row key can have multiple column families (e.g., RK1 — CF1(cv1, cv2…), CF2(cv11, cv22…): here RowKey1 has 2 column families, and each column family holds its own values); see the write/read sketch after this list
  • Supports both structured and unstructured data
  • Tablets are sharded blocks of contiguous rows used to balance the query workload, with the Colossus file system as the underlying storage
  • IAM can be configured at various levels — Project, Instance, and Table
  • Does not support SQL queries, joins, or multi-row transactions
  • Supports compaction, i.e., periodic rewrites that reorganize data and maintain efficient read-write (RW) operations
  • Supports mutations, i.e., updates & deletes; these temporarily increase storage because both the original and the new values are stored sequentially until compaction
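
To make the row-key and column-family model concrete, here is a minimal sketch using the google-cloud-bigtable Python client that writes one cell to a row and reads it back. The project, instance, table, and column-family names are placeholders, and the table is assumed to already exist.

```python
# Minimal Bigtable write/read sketch (pip install google-cloud-bigtable).
# Assumes the table "my-table" with column family "cf1" already exists.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

# Write: one row key ("rk1") with a value in column family "cf1".
row = table.direct_row(b"rk1")
row.set_cell("cf1", b"cv1", b"value1")
row.commit()

# Read the row back by its key and print the latest cell value.
read = table.read_row(b"rk1")
print(read.cells["cf1"][b"cv1"][0].value)  # b'value1'
```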

Dataproc

  • It’s a managed Apache Hadoop and Apache Spark service
  • Supports — Batch Processing | Querying | Streaming | Machine Learning | Integrations
  • It can process large datasets and transformations, and it is simple and familiar to use
  • Dataproc automation makes it easy to create, manage, and turn off clusters
  • Low cost, as it offers preemptible (short-lived) instances and per-second billing
  • Super fast in terms of cluster management
  • Supports Hadoop, Spark, Hive, and Pig open-source components
  • Can access through REST API, Cloud SDK, Dataproc UI, Cloud client libraries
  • Auto-scaling can be done for individual jobs
  • Ways to migrate on-prem Hadoop to the cloud: (i) incremental migration — start with low-risk jobs first; (ii) Cloud Storage over HDFS — better durability and lower cost; (iii) ephemeral Hadoop clusters — can be shut down when not in use (see the job-submission sketch after this list)
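
As a sketch of how jobs are submitted programmatically, the snippet below uses the google-cloud-dataproc Python client to submit a PySpark job to an existing cluster. The project, region, cluster name, and GCS path to the job file are placeholders.

```python
# Minimal Dataproc job-submission sketch (pip install google-cloud-dataproc).
# Assumes a cluster named "my-cluster" already exists in us-central1.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

# The job client must point at the regional Dataproc endpoint.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},
}

# Submit the job and block until it finishes.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Job finished with state: {response.status.state.name}")
```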

Dataproc Metastore

  • Managed Apache Hive metastore service
  • Helps in managing technical metadata
  • Highly available | Auto-healing | Serverless
  • Centralized metadata repository among Dataproc clusters
  • Provides a unified view of data
  • Auto Backup | Performance Monitoring
  • Can integrate with Dataproc, BigQuery, Dataplex, Data Catalog, Cloud Logging and Monitoring, and IAM
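
Since the service is essentially a shared Hive metastore, any Spark session with Hive support enabled can read the table definitions it stores. Below is a minimal PySpark sketch; the thrift URI and the sales.orders table are placeholders, and on a Dataproc cluster attached to a metastore service this configuration is normally applied for you.

```python
# Minimal PySpark sketch that reads tables registered in a shared Hive metastore.
# The metastore thrift URI and the "sales.orders" table are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("metastore-demo")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# List databases and query a table whose schema lives in the central metastore.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM sales.orders LIMIT 10").show()
```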

Cloud SQL

  • Relational Database Management Service
  • OLTP | Transactional SQL support | No downtime for maintenance | Cost-effective | Easy migration | High availability | Change data capture & replication
  • Data protection with a backup retention period of up to 35 days and the ability to restore an instance from a backup (in the Enterprise Plus edition)
  • Compliance with PCI DSS, SSAE 16, ISO 27001, and HIPAA
  • Compatible with MySQL, PostgreSQL, and SQL Server (AlloyDB is a separate, PostgreSQL-compatible service)
  • Cloud SQL insights can provide AI/ML-driven insights
  • Data caches are available for improved read-write latency
  • Provides per-second billing and pricing varies with editions (enterprise and enterprise plus), engine, and metric level settings(CPU, memory, disk)
  • Read replicas can be added to distribute the read load and improve performance
  • Can integrate with Compute Engine, Cloud Run, Google Kubernetes Engine, IAM, BigQuery (federated queries), and Datastream (low-latency DB replication); a connection sketch follows this list
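
As a sketch of how an application connects, the snippet below uses the Cloud SQL Python Connector against a MySQL instance. The instance connection name, user, password, and database are placeholders.

```python
# Minimal Cloud SQL connection sketch
# (pip install "cloud-sql-python-connector[pymysql]").
from google.cloud.sql.connector import Connector

connector = Connector()

# Connect using the instance connection name "project:region:instance" (placeholder).
conn = connector.connect(
    "my-project:us-central1:my-instance",
    "pymysql",
    user="app-user",
    password="app-password",
    db="app-db",
)

with conn.cursor() as cur:
    cur.execute("SELECT NOW()")
    print(cur.fetchone())

conn.close()
connector.close()
```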

Cloud Composer

  • It is a managed workflow orchestration service built on Apache Airflow, which is open-source software
  • To run workflows, you first need to create a Composer environment
  • Helps in Creating, Scheduling, and Monitoring Data pipelines
  • Data pipelines are configured as DAGs (Directed Acyclic Graphs) in Python, where each DAG has a set of tasks (a minimal DAG sketch follows this list)
  • Supports Hybrid and multi-cloud | Reliability | Network and Security
  • Consumption-based Pricing — vCPU/hour, GB/month, etc.
  • We can use ephemeral clusters (i.e., clusters that run a job quickly and then release their resources for other work) for parallel job processing
  • Integrates with BigQuery, Dataflow, Dataproc, Datastore, GCS, Pub/Sub, and AI Platform
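
For illustration, here is a minimal Airflow DAG of the kind you would place in the environment's DAGs folder; the DAG id, schedule, and task commands are placeholders.

```python
# Minimal Composer/Airflow DAG sketch with two dependent tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_load_example",      # hypothetical DAG id
    start_date=datetime(2024, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract data'")
    load = BashOperator(task_id="load", bash_command="echo 'load to BigQuery'")

    extract >> load  # task dependency: extract runs before load
```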

Okay, let us take a pause here with the ten topics from both blogs. Until I’m back with the final part of these notes, please refer to the other resources already available, such as the GCP Sketchnotes, the official Google Cloud documentation and videos, and blogs on Medium authored by many others who have successfully passed the exam.

Thank you and Happy reading!!!
