‘Google Cloud Architecture Framework: System Design’ Summary and My Notes

Sharing learnings from my cloud journey…

Devi Priya Karuppiah
13 min read · Nov 21, 2023

The Google Cloud Architecture Framework is organized into six categories (also known as pillars), as shown in the following diagram:

screen capture from Google Cloud’s documentation

System design is the foundational piece of the framework. Sharing quick notes on it below. Consider this blog as a condensed version of Google Cloud’s system design documentation. The goal is to use it for revision as you work on GCP services or take exams. You can find the entire documentation here — https://cloud.google.com/architecture/framework

Here is another cool find, the link to Google Cloud’s developer cheat sheet that can also be used to prep for any GCP certification — https://googlecloudcheatsheet.withgoogle.com/

Core principles of system design:

  1. Document everything
  2. Simplify your design and use fully managed services
  3. Decouple your architecture (e.g., move from a monolith to microservices)
  4. Use a stateless architecture. Stateless applications can perform tasks without local dependencies by using shared storage or caching services, which lets your apps scale up quickly with minimal boot dependencies (see the sketch after this list).
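
To make the stateless idea concrete, here is a minimal sketch (not from the framework itself) of a web tier that keeps nothing in process memory and writes all state to a shared Redis cache, so any replica can serve any request. The Flask app, host variable, and key names are hypothetical.

```python
# Minimal sketch of a stateless web tier, assuming a shared Redis
# (e.g., Memorystore) instance reachable at REDIS_HOST. All state lives
# in the cache, so any instance of this app can handle any request.
import os

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
cache = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)

@app.route("/cart/<user_id>", methods=["POST"])
def add_to_cart(user_id):
    item = request.json["item"]            # no per-process session objects
    cache.rpush(f"cart:{user_id}", item)   # state goes to the shared cache
    return jsonify(items=[i.decode() for i in cache.lrange(f"cart:{user_id}", 0, -1)])

if __name__ == "__main__":
    app.run(port=8080)
```

Because no request depends on what a particular instance remembers, instances can be added or removed freely behind a load balancer.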

Deployment Archetype

A deployment archetype is an abstract model that you use as the foundation to build application-specific deployment architectures that meet your business and technical requirements.

screen capture from https://cloud.google.com/architecture/framework/system-design/archetypes

These archetypes can be:

  • Zonal
  • Regional
  • Multi-regional
  • Global
  • Hybrid
  • Multicloud

Use best practices to deploy your system based on geographic requirements

Managing cloud resources

Google Cloud’s hierarchy lets you manage common aspects of your resources like access control, configuration settings, and policies.

pic credit — https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy

  • Use tags and labels from the outset of your project (see the labeling sketch after this list)
  • Set up organization policies such as naming conventions, audits, etc.
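
As a small illustration, labels can be attached programmatically so that cost reports and policies can filter by team or environment. A sketch using the Cloud Storage Python client; the bucket name and label values are hypothetical.

```python
# Sketch: attach labels to a Cloud Storage bucket so billing exports and
# policies can be filtered by team/environment. Names are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-analytics-bucket")

bucket.labels = {"team": "data-platform", "env": "prod", "cost-center": "1234"}
bucket.patch()  # persist the metadata change

print(bucket.labels)
```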

Evaluate Google Cloud Compute options

  • Virtual machines (VM) with cloud-specific benefits like live migration.
  • Bin-packing of containers on cluster machines that can share CPUs.
  • Functions and serverless approaches, where your use of CPU time can be metered to the work performed during a single HTTP request.
screen grab from https://cloud.google.com/architecture/framework/system-design/compute

Choose a compute migration approach

screen grab from https://cloud.google.com/architecture/framework/system-design/compute

Best practices for designing workloads

  • Evaluate serverless options for simple logic
  • Decouple your applications to be stateless
  • Use caching logic when you decouple architectures
  • Use live migrations to facilitate upgrades

To support your system

  • Design scaling workloads — e.g., use startup and shutdown scripts for stateful applications
  • Manage operations to support your system — e.g., use snapshots for instance backups
  • Manage capacity, reservations, and isolation — e.g., use committed-use discounts to reduce costs

Design network infrastructure

  • A well-designed network helps you optimize for performance and secure application communications with internal and external services
  • Google’s private network connects regional locations to more than 100 global network points of presence
  • Google Cloud Virtual Private Cloud (VPC) provides networking functionality to Compute Engine virtual machine (VM) instances, Google Kubernetes Engine (GKE) containers, and serverless workloads.
  • Google ensures content is delivered with high throughput by using technologies like Bottleneck Bandwidth and Round-trip propagation time (BBR) congestion-control intelligence.

Best practices for designing workload VPC architectures to support your system:

  • Consider VPC network design early
  • Start with a single VPC network
  • Keep VPC network topology simple to ensure a manageable, reliable, and well-understood architecture
  • Use VPC networks in custom mode. To ensure that Google Cloud networking integrates seamlessly with your existing networking systems, use custom mode when you create VPC networks. Custom mode helps you integrate Google Cloud networking into existing IP address management schemes and lets you control which cloud regions are included in the VPC (see the sketch after this list).
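
As a rough sketch of what custom mode looks like in code (project, network name, region, and CIDR range are hypothetical), the Compute Engine Python client can create the network with auto subnet creation disabled and then add subnets only in the regions you want:

```python
# Sketch: create a custom-mode VPC (no auto subnets), then add one subnet
# in a single region. Project, network, and CIDR values are hypothetical.
from google.cloud import compute_v1

project = "my-project"

network = compute_v1.Network(
    name="prod-vpc",
    auto_create_subnetworks=False,  # custom mode: you define every subnet
)
compute_v1.NetworksClient().insert(project=project, network_resource=network).result()

subnet = compute_v1.Subnetwork(
    name="prod-us-central1",
    ip_cidr_range="10.10.0.0/20",
    network=f"projects/{project}/global/networks/prod-vpc",
)
compute_v1.SubnetworksClient().insert(
    project=project, region="us-central1", subnetwork_resource=subnet
).result()
```

Limiting subnets to the regions you actually use keeps the address plan compatible with on-premises IP management.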

Best practices for designing inter-VPC connectivity to support your system

  • Choose a VPC connection method. To implement multiple VPC networks, you need to connect those networks. VPC networks are isolated tenant spaces within Google’s Andromeda software-defined network (SDN). Choose how you connect your network based on your bandwidth, latency, and service level agreement (SLA) requirements.
  • Use Shared VPC to administer multiple working groups
  • Use simple naming conventions to understand the purpose of each resource, where it’s located, and how it’s differentiated
  • Use connectivity tests to verify network security
  • Use Private Service Connect to create private endpoints
  • Secure and limit external connectivity
  • Use Network Intelligence Center to monitor your cloud networks

Storage Strategy Implementation

Cloud Storage provides reliable, secure object storage services

screen grab from https://cloud.google.com/architecture/framework/system-design/storage

Best practices for choosing a storage type to support your system

screen grab illustrating storage strategy options from https://cloud.google.com/architecture/framework/system-design/storage
  • Choose active or archival storage based on storage access needs

A storage class is metadata used by every object. For data that is served at a high rate with high availability, use the Standard storage class. For data that is infrequently accessed, use the Nearline, Coldline, or Archive storage class (see the lifecycle sketch after this list).

  • Evaluate storage location and data protection needs for Cloud Storage

For a Cloud Storage bucket located in a single region, the data it contains is automatically replicated across zones within that region. For dual-region and multi-region buckets, data is also replicated across multiple, geographically separate data centers.

  • Use Cloud CDN to improve static object delivery

Cloud CDN uses the Cloud Load Balancing external Application Load Balancer to provide routing, health checking, and anycast IP address support.

  • Select an optimal storage access pattern and workload type, and use Persistent Disk to support high-performance storage access
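
One common way to act on the active-versus-archival choice is an Object Lifecycle Management rule that moves objects to colder classes as they age. A minimal sketch with the Python client; the bucket name and age thresholds are hypothetical.

```python
# Sketch: move objects to colder storage classes as they age, then expire
# them. Bucket name and age thresholds are hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-analytics-bucket")

bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)   # objects older than 30 days
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)   # objects older than 90 days
bucket.add_lifecycle_delete_rule(age=365)                         # delete after one year
bucket.patch()

for rule in bucket.lifecycle_rules:
    print(rule)
```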

Storage management best practices

Database optimization

Google Cloud offers a multitude of database services, as listed in the table below.

screen grab of key GC database services from https://cloud.google.com/architecture/framework/system-design/databases

Best practices for choosing a database to support your system

  • Consider using a managed database service and evaluate Google Cloud managed database services before you install your own database or database cluster. Installing your own database involves maintenance overhead including installing patches and updates, and managing daily operational activities like monitoring and performing backups.

To migrate databases, use one of the products described in the following table:

screen grab of key GC migration services from https://cloud.google.com/architecture/framework/system-design/databases
  • Choose an appropriate migration strategy
screen grab of migration options from https://cloud.google.com/architecture/framework/system-design/databases
  • Use Memorystore to support your caching database layer. Memorystore is a fully managed Redis and Memcached service that supports sub-millisecond latency and is fully compatible with open source Redis and Memcached (see the cache sketch after this list).
  • Use Bare Metal Solution servers to run an Oracle database. This approach fits a lift-and-shift migration strategy.
  • Use migration as an opportunity to modernize your database and prepare it to support future business needs.
  • Use fixed databases with off-the-shelf applications. Commercial off-the-shelf (COTS) applications often require a lift-and-shift migration approach.
  • Verify your team’s database migration skill set. Use Google Cloud Partner Advantage to find a partner to support you throughout your migration journey.
  • Design your databases to meet high availability (HA) and disaster recovery (DR) requirements, and evaluate the tradeoffs between reliability and cost
  • Specify cloud regions to support data residency (i.e. where your data physically resides at rest)
  • Include disaster recovery in data residency design — refer to 100% reliability is the wrong target and Disaster recovery planning guide.
  • Make your database Google Cloud-compliant
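
To make the caching-layer point concrete, here is a hedged sketch of the cache-aside pattern in front of a Memorystore Redis instance. The host variable, key scheme, TTL, and the fetch_profile_from_db helper are all hypothetical placeholders.

```python
# Sketch of the cache-aside pattern backed by a Memorystore Redis instance.
# REDIS_HOST, the key scheme, the TTL, and fetch_profile_from_db() are
# hypothetical placeholders for your own environment.
import json
import os

import redis

cache = redis.Redis(host=os.environ.get("REDIS_HOST", "localhost"), port=6379)

def fetch_profile_from_db(user_id: str) -> dict:
    """Placeholder for a read against Cloud SQL, Spanner, etc."""
    return {"id": user_id, "plan": "standard"}

def get_profile(user_id: str) -> dict:
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:                       # cache hit: skip the database
        return json.loads(cached)
    profile = fetch_profile_from_db(user_id)     # cache miss: read the source of truth
    cache.set(key, json.dumps(profile), ex=300)  # populate cache with a 5-minute TTL
    return profile

print(get_profile("user-123"))
```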

Encryption

Database design and scaling

Networking and access

  • Run databases inside a private network

Run your databases inside your private network and grant access only to the clients that need to interact with the database. You can create Cloud SQL instances inside a VPC. Google Cloud also provides VPC Service Controls for Cloud SQL, Spanner, and Bigtable databases to restrict access further.

  • Grant minimum privileges to users

Identity and Access Management (IAM) controls access to Google Cloud services, including database services.

Automation and right-sizing

  • Define database instances as code, which lets you apply a consistent and repeatable approach to creating and updating your databases.
  • Use Liquibase to version control your database. Google database services like Cloud SQL and Cloud Spanner support Liquibase. Liquibase helps you to track your database schema changes, roll back schema changes, and perform repeatable migrations.
  • Test and tune your database to support scaling
  • Choose the right database for your scaling requirements
screen grab of GC db scaling options from https://cloud.google.com/architecture/framework/system-design/databases

Operations — Use Cloud Monitoring to monitor and set up alerts for your database

Licensing — Select between on-demand licenses and existing licenses

Analyze your data

  • Google Cloud provides you with various services that help you through the entire data lifecycle, from data ingestion through reports and visualization.
  • Most of these services are fully managed, and some are serverless. You can also build and manage a data-analytics environment on Compute Engine VMs, for example to self-host Apache Hadoop or Apache Beam.
screen grab of GC cloud analytics services from https://cloud.google.com/architecture/framework/system-design/data-analytics

Data Lifecycle

As part of your system design, you can group the Google Cloud data analytics services around the data lifecycle:

The following stages and services run across the entire data lifecycle:

  • Data integration includes services such as Data Fusion.
  • Metadata management and governance includes services such as Data Catalog.
  • Workflow management includes services such as Cloud Composer.

Data Ingestion best practices

  • Determine the data source for ingestion. Data typically comes from another cloud provider or service (use Cloud Data Fusion, Storage Transfer Service, or BigQuery Data Transfer Service), or from an on-premises location (use Cloud Data Fusion; for large volumes of data, use Transfer Appliance or Storage Transfer Service).
  • Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.
  • Identify streaming or batch data sources. For example, if you run a global streaming service that has low-latency requirements, you can use Pub/Sub. If you need your data for analytics and reporting, you can stream data into BigQuery. If you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment, use the Kafka to BigQuery Dataflow template (see the Pub/Sub sketch after this list).
  • Ingest data with automated tools. For example, Cloud Data Fusion provides connectors and plugins to bring in data from external sources with a drag-and-drop GUI. If your teams want to write code, Dataflow or BigQuery can help automate data ingestion, and Pub/Sub supports both low-code and code-first approaches. To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB; for amounts larger than 1 TB, use Storage Transfer Service.
  • Use migration tools to ingest from another data warehouse. If you need to migrate from another data warehouse system, such as Teradata, Netezza, or Redshift, you can use the BigQuery Data Transfer Service migration assistance.
  • Estimate your data ingestion needs. The volume of data that you need to ingest helps you to determine which service to use in your system design. For streaming ingestion of data, Pub/Sub scales to tens of gigabytes per second.
  • Use appropriate tools to regularly ingest data on a schedule. Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs.
  • Review FTP/SFTP server data ingest needs. If you need a code-free environment to ingest data from an FTP/SFTP server, you can use the FTP copy plugins.
  • Use Apache Kafka connectors to ingest data. If you use Pub/Sub, Dataflow, or BigQuery, you can ingest data using one of the Apache Kafka connectors
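
As a minimal illustration of streaming ingestion, publishing an event to Pub/Sub with the Python client looks like this; the project ID, topic name, and payload are hypothetical.

```python
# Sketch: publish a JSON event to a Pub/Sub topic for streaming ingestion.
# Project ID, topic name, and payload are hypothetical.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream-events")

event = {"user_id": "user-123", "action": "page_view", "page": "/pricing"}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))

print(f"Published message ID: {future.result()}")  # blocks until the publish completes
```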

Data storage

Apply the following data storage best practices to your own environment.

screen grab illustrating data storage usecase from https://cloud.google.com/architecture/framework/system-design/data-analytics

Data Processing and Data Transformation

  • Explore the open source software you can use in Google Cloud. Dataproc is a Hadoop-compatible managed service that lets you host open source software, with little operational burden. Dataproc includes support for Spark, Hive, Pig, Presto, and Zookeeper.
  • Determine your ETL or ELT data-processing needs. Google Cloud lets you use either traditional ETL or more modern ELT data-processing systems.
  • Use the appropriate framework for your data use case. For a batch data processing system, you can process and transform data in BigQuery with a familiar SQL interface. If you have an existing pipeline that runs on Apache Hadoop or Spark on-premises or in another public cloud, you can use Dataproc. If you have analytics and SQL-focused teams and capabilities, you can also stream data into BigQuery. For real-time use cases, use Dataflow.
  • Retain future control over your execution engine. To minimize vendor lock-in and keep the option of moving to a different platform later, use the Apache Beam programming model with Dataflow as a managed, serverless runner (see the pipeline sketch after this list).
  • Use Dataflow to ingest data from multiple sources, such as Pub/Sub, Cloud Storage, HDFS, S3, or Kafka.
  • Discover, identify, and protect sensitive data. Use Sensitive Data Protection to perform actions such as to scan BigQuery data or de-identify and re-identify PII in large-scale datasets.
  • Modernize your data transformation processes. Use Dataform to write data transformations as code and to start to use version control by default.
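
A hedged sketch of what such a Beam pipeline can look like in Python; the topic, table, and project names are hypothetical, and the BigQuery table is assumed to already exist.

```python
# Sketch: a streaming Apache Beam pipeline that reads JSON events from
# Pub/Sub and appends them to an existing BigQuery table. Resource names
# are hypothetical. Runs on the DirectRunner by default; add
# runner="DataflowRunner" and a temp_location to run on Dataflow.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True, project="my-project", region="us-central1")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/clickstream-events"
        )
        | "ParseJson" >> beam.Map(lambda message: json.loads(message.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

Because the pipeline is expressed in Beam, the same code can later run on another supported runner, which is the lock-in point the bullet above makes.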

Data analytics and warehouses best practices

  • Review your data storage needs. Data lakes and data warehouses aren’t mutually exclusive. Data lakes are useful for unstructured and semi-structured data storage and processing. Data warehouses are best for analytics and BI.
  • Identify opportunities to migrate from a traditional data warehouse to BigQuery. For more information and example scenarios, see Migrating data warehouses to BigQuery.
  • Plan for federated access to data. Identify your data federation needs (a federated query layer reads data from a range of external sources and presents it through a common interface), and create an appropriate system design. For example, BigQuery lets you define external tables that read data from other sources, such as Bigtable, Cloud SQL, Cloud Storage, or Google Drive (see the external-table sketch after this list).
  • Use BigQuery flex slots to provide on-demand burst capacity. These flex slots help you when there’s a period of high demand or when you want to complete an important analysis.
  • Understand schema differences if you migrate to BigQuery. BigQuery supports both star and snowflake schemas, but by default it uses nested and repeated fields.
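
A minimal sketch of defining such an external table with the BigQuery Python client; the project, dataset, table, and bucket names are hypothetical.

```python
# Sketch: define a BigQuery external table whose data stays in Cloud
# Storage and is read in place at query time. Resource names are
# hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table("my-project.analytics.raw_orders_external")

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-ingest-bucket/orders/*.csv"]
external_config.autodetect = True  # infer the schema from the files
table.external_data_configuration = external_config

client.create_table(table)

# The external table can now be queried like any other table:
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-project.analytics.raw_orders_external`"
).result()
print(list(rows)[0].n)
```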

Reports and visualization

  • Use BigQuery BI Engine to visualize your data
  • Modernize your BI processes with Looker. Looker is a modern, enterprise platform for BI, data applications, and embedded analytics. If you have existing BI processes and tools, we recommend that you modernize and use a central platform such as Looker.
  • To manage and maintain end-to-end data pipelines, use appropriate workflow management tools. Cloud Composer is a fully managed workflow management tool based on the open source Apache Airflow project (see the DAG sketch after this list).
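
Since Cloud Composer runs standard Apache Airflow, pipelines are defined as Python DAGs. A minimal sketch; the DAG ID, schedule, and commands are hypothetical placeholders.

```python
# Sketch: a minimal Airflow DAG of the kind Cloud Composer schedules.
# The DAG ID, schedule, and commands are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_reporting_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extract step'")
    transform = BashOperator(task_id="transform", bash_command="echo 'transform step'")
    load = BashOperator(task_id="load", bash_command="echo 'load step'")

    extract >> transform >> load  # run the steps in order
```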

Implement machine learning

screen grab illustrating GC’s AI and ML services from https://cloud.google.com/architecture/framework/system-design/ai-ml

Data processing best practices

  • Ensure that your data meets ML requirements such as accurately labeled data for training
  • Store tabular data in BigQuery. If you use tabular data, consider storing all of it in BigQuery and using the BigQuery Storage API to read it (see the sketch after this list).
  • Ensure you have enough data to develop an ML model. To predict a category, the recommended number of examples for each category is 10 times the number of features. The more categories you want to predict, the more data you need.
  • Prepare data for consumption. When you configure your data pipeline, make sure that it can process both batch and stream data so that you get consistent results from both types of data.
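
A hedged sketch of reading training data that way with the Python client; the table name is hypothetical, and the google-cloud-bigquery-storage and pandas packages are assumed to be installed.

```python
# Sketch: pull tabular training data from BigQuery into a pandas DataFrame,
# letting the client use the BigQuery Storage API for the fast read path.
# The table name is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT *
    FROM `my-project.analytics.training_features`
    WHERE split = 'train'
"""
df = client.query(query).result().to_dataframe(create_bqstorage_client=True)

print(df.shape)
print(df.head())
```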

Model development and training best practices

  • Choose managed or custom-trained model development. When you build your model, consider the highest level of abstraction possible. Use AutoML when possible so that the development and training tasks are handled for you (see the Vertex AI sketch after this list). Consider the Vertex AI training service instead of self-managed training on Compute Engine VMs or Deep Learning VM containers. For a JupyterLab environment, consider Vertex AI Workbench, which provides both managed and user-managed JupyterLab environments.
  • Use pre-built or custom containers for custom-trained models. Pre-built containers are available for Python training applications that are created for specific TensorFlow, scikit-learn, PyTorch, and XGBoost versions.
  • Consider distributed training requirements. Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines. These frameworks automatically coordinate division of work based on environment variables that are set on each machine.
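
As a rough, hedged sketch of the managed (AutoML) path with the Vertex AI Python SDK (google-cloud-aiplatform); the project, region, BigQuery source, target column, and training budget are all hypothetical.

```python
# Sketch: managed model training on Vertex AI with AutoML for tabular data.
# Project, region, BigQuery source, target column, and training budget are
# hypothetical placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.TabularDataset.create(
    display_name="churn-training-data",
    bq_source="bq://my-project.analytics.training_features",
)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl-job",
    optimization_prediction_type="classification",
)

model = job.run(
    dataset=dataset,
    target_column="churned",
    budget_milli_node_hours=1000,  # one node-hour of training budget
)

print(model.resource_name)
```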

Design for environmental sustainability

  • Understand your carbon footprint by visiting the Carbon Footprint dashboard
  • One simple and effective way to reduce carbon emissions is to choose cloud regions with lower carbon emissions — use the Google Cloud Region Picker to balance lowering emissions with other requirements, such as pricing and network latency
  • Migrate workloads to managed services. Also consider that many workloads don’t require VMs. Often you can utilize a serverless offering instead. These managed services can optimize cloud resource usage, often automatically, which simultaneously reduces cloud costs and carbon footprint.
  • Identify idle or overprovisioned resources and either delete them or rightsize them.
  • Reduce emissions for batch workloads — for more info see Reduce emissions for batch workloads

The details of the five pillars that sit on top of the system design layer can be found here: https://cloud.google.com/architecture/framework

Follow me on my journey to complete Google’s Professional Cloud Architect certificate. More blogs to follow in this corner of the digital space.

❤️❤️❤️ Never stop learning ❤️❤️❤️

