Google Cloud Production Guideline

In the last few years, I have found myself advising many companies on Google Cloud about the basics to consider when going to production. Based on production-readiness good practices and commonly overlooked steps, I put together a list of actions to go through before production. Even though the list specifically mentions Google Cloud, most of it is still useful and applicable outside of Google Cloud.

Design and Development

  • Have reproducible builds. Your build shouldn’t require access to external services and shouldn’t be affected by an outage of an external system.
  • Define and set SLOs for your service at design time.
  • Document the availability expectations of external services you depend on.
  • Avoid single points of failure by not depending on a single global resource. Have the resource replicated or have a proper fallback (e.g. a hardcoded value) when the resource is not available.
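
When documenting the availability of your dependencies, it helps to write down the arithmetic. The sketch below is only an illustration and makes two assumptions that are mine, not a rule: every dependency is a hard dependency (no fallback) and failures are independent, so availabilities multiply. The function names and the example numbers are hypothetical.

```python
def serial_availability(dependency_availabilities):
    """Upper bound on your availability when every dependency is required.

    Assumes independent failures and no fallbacks, so the individual
    availabilities multiply.
    """
    result = 1.0
    for availability in dependency_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= availability
    return result


def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1.0 - slo) * window_days * 24 * 60


if __name__ == "__main__":
    # Hypothetical example: three hard dependencies.
    bound = serial_availability([0.999, 0.9995, 0.9999])
    print(f"best achievable availability: {bound:.4%}")
    print(f"error budget at a 99.9% SLO: {error_budget_minutes(0.999):.1f} min per 30 days")
```

If the computed bound is below the SLO you want to set, that is a signal to add replication or a fallback for one of the dependencies.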

Configuration Management

  • Static, small and non-secret configuration can be command-line flags. Use a configuration delivery service for everything else.
  • Dynamic configuration should have a reasonable fallback in the case of unavailability of the configuration system.
  • Development environment configuration shouldn’t inherit from production configuration. This may lead to access to production services from development and can cause privacy issues and data leaks.
  • Document what can be configured dynamically and explain the fallback behavior if the configuration delivery system is not available.
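
The fallback behavior for dynamic configuration can be sketched as a small wrapper. This is a minimal sketch, not a real client for any configuration service: `DynamicValue`, the `fetch` callable and the TTL are all hypothetical names I made up for illustration. It serves the last known good value when the configuration system is unavailable, and a hardcoded default if nothing was ever fetched.

```python
import time


class DynamicValue:
    """Caches one dynamically configured value with a safe fallback."""

    def __init__(self, fetch, default, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callable that reads the config system
        self._value = default        # hardcoded fallback until a fetch succeeds
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = float("-inf")  # forces a fetch on first use

    def get(self):
        now = self._clock()
        if now < self._expires_at:
            return self._value
        try:
            self._value = self._fetch()
        except Exception:
            # Config system unavailable: keep the last known good value
            # (or the default) instead of failing the caller.
            pass
        # Wait a full TTL before the next attempt, even after a failure,
        # so an outage does not turn into a retry storm.
        self._expires_at = now + self._ttl
        return self._value
```

Whatever you choose as the default is the behavior you get during an outage of the configuration system, so it is worth documenting next to the setting itself.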

Release Management

  • Document all details about your release process. Document how releases affect SLOs (e.g. temporary higher latency due to cache misses).
  • Document your canary release process.
  • Have a canary analysis plan and set up mechanisms to automatically revert canaries if possible.
  • Ensure rollbacks can use the same process that rollouts use.
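
To make the canary analysis item concrete, here is a deliberately simple sketch of an automated verdict that compares the canary's error rate against the baseline. The thresholds, the `Sample` type and the three verdict strings are assumptions of mine for illustration. A real analysis would likely look at more signals (latency, saturation) and use a proper statistical test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline, canary, min_requests=1000,
                   max_ratio=1.5, max_absolute_increase=0.001):
    """Return 'wait', 'promote', or 'revert' (hypothetical thresholds)."""
    if baseline.requests < min_requests or canary.requests < min_requests:
        return "wait"  # not enough traffic to judge either side
    increase = canary.error_rate - baseline.error_rate
    # Require both a relative and an absolute regression so that tiny
    # differences on a very low baseline do not trigger a revert.
    if (increase > max_absolute_increase
            and canary.error_rate > baseline.error_rate * max_ratio):
        return "revert"
    return "promote"


if __name__ == "__main__":
    baseline = Sample(requests=100_000, errors=100)
    print(canary_verdict(baseline, Sample(requests=5_000, errors=25)))
```

A "revert" verdict should kick off the same rollout process in reverse, which is why rollbacks and rollouts sharing one process matters.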

Observability

  • Ensure that the metrics required by your SLOs are collected and exported from your binaries.
  • Make sure client-side and server-side observability data can be differentiated. This is important to debug issues in production.
  • Tune alerts to reduce toil. For example, remove alerts triggered by routine events.
  • If you are using Stackdriver, include GCP platform metrics in your dashboards. Set up alerting for your GCP dependencies.
  • Always propagate the incoming trace context. See the grpc-trace-bin metadata key for gRPC and the X-Cloud-Trace-Context header for HTTP requests. Even if you are not participating in the trace, this will allow GCP to debug production issues.
  • Enable Stackdriver Profiler to optimize your CPU usage.
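
Propagating the trace context does not require a tracing library. A minimal sketch, assuming request headers (or gRPC metadata) are available as a plain mapping of names to values: copy the trace keys from the incoming request onto every outgoing request. The function name and the mapping shape are assumptions for illustration, and your HTTP or gRPC framework may already offer an interceptor that does this.

```python
# Keys named in the checklist above, lowercased for case-insensitive matching.
TRACE_KEYS = ("x-cloud-trace-context", "grpc-trace-bin")


def propagate_trace_context(incoming_headers, outgoing_headers=None):
    """Return outgoing headers with the incoming trace context copied over.

    The value is passed through untouched; this service does not need to
    parse it or participate in the trace.
    """
    out = dict(outgoing_headers or {})
    for name, value in incoming_headers.items():
        if name.lower() in TRACE_KEYS:
            out[name] = value
    return out


if __name__ == "__main__":
    incoming = {
        "X-Cloud-Trace-Context": "105445aa7843bc8bf206b12000100000/1;o=1",  # illustrative value
        "Cookie": "session=abc",
    }
    print(propagate_trace_context(incoming, {"Accept": "application/json"}))
```

Only the trace keys are copied; other incoming headers such as cookies are intentionally not forwarded.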

Security and Protection

  • Make sure all external requests are encrypted.
  • Make sure your production projects have proper IAM configuration.
  • Use projects to fully isolate resources.
  • Use networks within projects to isolate groups of VM instances.
  • Use Cloud VPN to securely connect remote networks.
  • Document and monitor user data access. Ensure that all user data access is logged and audited.
  • Ensure debugging endpoints are limited by ACL.
  • Sanitize user input. Have payload size restrictions for user input.
  • Ensure your service can block incoming traffic selectively per user. This allows you to block abuse cases without impacting other users.
  • Avoid external endpoints that trigger a large number of internal fan-outs.
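
Payload size limits and selective per-user blocking can live in one small admission check in front of your handlers. The sketch below is hypothetical: `RequestGate`, its in-memory blocklist and the 1 MB default are names and values I picked for illustration. In production the blocklist would more likely come from dynamic configuration so it can be changed without a release.

```python
class RequestGate:
    """Admission check: per-user blocking plus a payload size limit."""

    def __init__(self, max_payload_bytes=1_000_000):
        self._max_payload_bytes = max_payload_bytes
        self._blocked_users = set()

    def block(self, user_id):
        self._blocked_users.add(user_id)

    def unblock(self, user_id):
        self._blocked_users.discard(user_id)

    def check(self, user_id, payload):
        """Return (allowed, reason) for a request from user_id."""
        if user_id in self._blocked_users:
            return False, "user blocked"
        if len(payload) > self._max_payload_bytes:
            return False, "payload too large"
        return True, "ok"


if __name__ == "__main__":
    gate = RequestGate(max_payload_bytes=16)
    gate.block("abusive-user")
    print(gate.check("abusive-user", b"hello"))
    print(gate.check("regular-user", b"hello"))
    print(gate.check("regular-user", b"x" * 100))
```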

Capacity Planning

  • Document how your service scales. Examples: number of users, size of incoming payload, number of incoming messages.
  • Document resource requirements for your service. Examples: number of dedicated VM instances, number of Spanner instances, specialized hardware such as GPUs or TPUs.
  • Document resource constraints: resource type, region, etc.
  • Document quota restrictions on creating new resources. For example, document the rate limit of the GCE API if you are creating new instances via the API.
  • Consider having load tests for performance regressions where possible.
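
When documenting how your service scales, a back-of-the-envelope calculation is often enough to start with. The sketch below assumes the service scales linearly with requests per second, which you should verify with a load test. The function name, the 60% target utilization and the example numbers are hypothetical.

```python
import math


def instances_needed(peak_qps, qps_per_instance,
                     target_utilization=0.6, min_instances=2):
    """Estimate instance count for a service that scales with QPS.

    target_utilization leaves headroom for spikes and for losing an
    instance; min_instances avoids running a single point of failure.
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    raw = peak_qps / (qps_per_instance * target_utilization)
    return max(min_instances, math.ceil(raw))


if __name__ == "__main__":
    # Hypothetical numbers: 1200 QPS at peak, 150 QPS per instance in load tests.
    print(instances_needed(peak_qps=1200, qps_per_instance=150))
```

Compare the result against your quota and the resource constraints you documented for the region before you need the capacity.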