Google Cloud Professional Data Engineer Notes Part 2

Zaha
4 min read · Jun 15, 2024

--

In the previous blog, Google Cloud Professional Data Engineer Notes Part 1, we covered five topics from the GCP Data Engineer syllabus: BigQuery, Cloud Storage, Dataflow, BigQuery Administration, and BigQuery Omni. Now let us continue with some other topics that are part of the GCP DE study path.

GCP Data Engineering (source: Connecting Dots)

Please note that this blog doesn’t cover every topic in detail; if you want to study further, please visit the official documentation on the Google Cloud website.

Bigtable

  • NoSQL big data DB Service (non-relational)
  • Terabyte to Petabyte scale of data
  • Low latency — stores large amounts of single-keyed data and supports high read & write throughput
  • Data is row-indexed, i.e., each row is identified by a row key, and each individual value should be ≤ 10 MB
  • Scalable | Simple administration | Auto-replication
  • Ideal for Map Reduce operations, Streaming Analytics, Machine Learning Applications
  • No downtime while the cluster resizing
  • One row key can have multiple column families (e.g., RK1 — CF1(cv1, cv2…), CF2(cv11, cv22…): here RowKey1 has 2 column families, and each column family holds its own values); see the write/read sketch after this list
  • Supports both structured and unstructured data
  • Tablets are sharded blocks of contiguous rows used to balance the query workload, with the Colossus file system as the underlying storage
  • IAM can be configured at various levels — Project, Instance, and Table
  • Does not support SQL queries, joins, or multi-row transactions
  • Supports compaction, i.e., periodic rewrites that reorganize data and maintain efficient read-write (RW) operations
  • Supports mutations, i.e., updates & deletes; these temporarily increase storage because both the original and the new values are stored sequentially until compaction
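
To make the row-key and column-family model concrete, here is a minimal sketch using the google-cloud-bigtable Python client that writes one cell to a row and reads it back. The project, instance, table, and column-family names are placeholders, and the table is assumed to already exist.

```python
# Minimal Bigtable write/read sketch (pip install google-cloud-bigtable).
# Assumes the table "my-table" with column family "cf1" already exists.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

# Write: one row key ("rk1") with a value in column family "cf1".
row = table.direct_row(b"rk1")
row.set_cell("cf1", b"cv1", b"value1")
row.commit()

# Read the row back by its key and print the latest cell value.
read = table.read_row(b"rk1")
print(read.cells["cf1"][b"cv1"][0].value)  # b'value1'
```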

Dataproc

  • It’s a managed Apache Hadoop and Apache Spark service
  • Supports — Batch Processing | Querying | Streaming | Machine Learning | Integrations
  • It can process large datasets and transformations, and it is simple and familiar to use
  • Dataproc automation makes it easy to create, manage, and turn off clusters
  • Low cost, as it offers preemptible (short-lived) instances and per-second billing
  • Super fast in terms of cluster management
  • Supports Hadoop, Spark, Hive, and Pig open-source components
  • Can access through REST API, Cloud SDK, Dataproc UI, Cloud client libraries
  • Auto-scaling can be done for individual jobs
  • Ways to migrate on-prem Hadoop to the cloud: (i) incremental migration — start with low-risk jobs first; (ii) Cloud Storage over HDFS — better durability and lower cost; (iii) ephemeral Hadoop clusters — can be shut down when not in use (see the job-submission sketch after this list)
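
As a sketch of how jobs are submitted programmatically, the snippet below uses the google-cloud-dataproc Python client to submit a PySpark job to an existing cluster. The project, region, cluster name, and GCS path to the job file are placeholders.

```python
# Minimal Dataproc job-submission sketch (pip install google-cloud-dataproc).
# Assumes a cluster named "my-cluster" already exists in us-central1.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

# The job client must point at the regional Dataproc endpoint.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},
}

# Submit the job and block until it finishes.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Job finished with state: {response.status.state.name}")
```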

Dataproc Metastore

  • Managed Apache Hive metastore service
  • Helps in managing technical metadata
  • Highly available | Auto-healing | Serverless
  • Centralized metadata repository among Dataproc clusters
  • Provides a unified view of data
  • Auto Backup | Performance Monitoring
  • Can integrate with Dataproc, BigQuery, Dataplex, Data Catalog, Cloud Logging and Monitoring, and IAM
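
Since the service is essentially a shared Hive metastore, any Spark session with Hive support enabled can read the table definitions it stores. Below is a minimal PySpark sketch; the thrift URI and the sales.orders table are placeholders, and on a Dataproc cluster attached to a metastore service this configuration is normally applied for you.

```python
# Minimal PySpark sketch that reads tables registered in a shared Hive metastore.
# The metastore thrift URI and the "sales.orders" table are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("metastore-demo")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# List databases and query a table whose schema lives in the central metastore.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM sales.orders LIMIT 10").show()
```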

Cloud SQL

  • Relational Database Management Service
  • OLTP | Transactional SQL support | No downtime for maintenance | Cost-effective | Easy migration | High availability | Change data capture & replication
  • Data protection with a backup retention period of up to 35 days and the ability to restore an instance from a backup (in the Enterprise Plus edition)
  • Compliance with PCI DSS, SSAE 16, ISO 27001, and HIPAA
  • Compatible with MySQL, PostgreSQL, and SQL Server (AlloyDB is a separate, PostgreSQL-compatible service)
  • Cloud SQL insights can provide AI/ML-driven insights
  • Data caches are available for improved read-write latency
  • Provides per-second billing and pricing varies with editions (enterprise and enterprise plus), engine, and metric level settings(CPU, memory, disk)
  • Read replicas can be added to distribute the read load and improve performance
  • Can integrate with Compute Engine, Cloud Run, Google Kubernetes Engine, IAM, BigQuery (federated queries), and Datastream (low-latency DB replication); a connection sketch follows this list
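
As a sketch of how an application connects, the snippet below uses the Cloud SQL Python Connector against a MySQL instance. The instance connection name, user, password, and database are placeholders.

```python
# Minimal Cloud SQL connection sketch
# (pip install "cloud-sql-python-connector[pymysql]").
from google.cloud.sql.connector import Connector

connector = Connector()

# Connect using the instance connection name "project:region:instance" (placeholder).
conn = connector.connect(
    "my-project:us-central1:my-instance",
    "pymysql",
    user="app-user",
    password="app-password",
    db="app-db",
)

with conn.cursor() as cur:
    cur.execute("SELECT NOW()")
    print(cur.fetchone())

conn.close()
connector.close()
```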

Cloud Composer

  • It is a managed workflow orchestration service built on Apache Airflow, which is open-source software
  • To run workflows, you first need to create a Composer environment
  • Helps in Creating, Scheduling, and Monitoring Data pipelines
  • Data pipelines are configured as DAGs (Directed Acyclic Graphs) in Python, where each DAG has a set of tasks (a minimal DAG sketch follows this list)
  • Supports Hybrid and multi-cloud | Reliability | Network and Security
  • Consumption-based Pricing — vCPU/hour, GB/month, etc.
  • We can use ephemeral clusters (i.e., clusters that run a job quickly and then release their resources for other work) for parallel job processing
  • Integrates with BigQuery, Dataflow, Dataproc, Datastore, GCS, Pub/Sub, and AI Platform
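
For illustration, here is a minimal Airflow DAG of the kind you would place in the environment's DAGs folder; the DAG id, schedule, and task commands are placeholders.

```python
# Minimal Composer/Airflow DAG sketch with two dependent tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_load_example",      # hypothetical DAG id
    start_date=datetime(2024, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract data'")
    load = BashOperator(task_id="load", bash_command="echo 'load to BigQuery'")

    extract >> load  # task dependency: extract runs before load
```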

Okay, let us take a pause here with the ten topics from both blogs. Until I’m back with the final part of these notes, please refer to the other resources already available, such as the GCP Sketchnotes, the official Google Cloud documentation and videos, and blogs on Medium authored by many others who have successfully passed the exam.

Thank you and Happy reading!!!
