Google Cloud Production Guideline

Published in Google Cloud - Community, Jan 12, 2019

Over the last few years, I have found myself advising many companies on Google Cloud about the basics to consider when going to production. Based on production-readiness good practices and commonly overlooked steps, I put together a list of actions to go through before production. Even though the list specifically mentions Google Cloud, it is still useful and applicable outside of Google Cloud.

Design and Development

Have reproducible builds. Your build shouldn’t require access to external services and shouldn’t be affected by an outage of an external system.
Define and set SLOs for your service at design time.
Document the availability expectations of external services you depend on.
Avoid single points of failure by not depending on a single global resource. Have the resource replicated or have a proper fallback (e.g. a hardcoded value) when the resource is not available.
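
To make the SLO and dependency items above concrete, here is a small arithmetic sketch. It assumes that every dependency is required to serve a request and that dependencies fail independently, which is a simplification. The dependency figures and the 99.9% target are hypothetical, not recommendations.

```python
from math import prod

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def serial_availability(availabilities):
    """Upper bound on availability when every dependency is required.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(availabilities)


def allowed_downtime_minutes(slo, window_minutes=MINUTES_PER_30_DAYS):
    """Error budget, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_minutes


# Hypothetical dependencies: a database at 99.95% and an auth service at 99.9%.
deps = [0.9995, 0.999]
ceiling = serial_availability(deps)     # roughly 0.9985
budget = allowed_downtime_minutes(0.999)  # roughly 43.2 minutes per 30 days

print(f"availability ceiling from dependencies: {ceiling:.4%}")
print(f"downtime budget for a 99.9% SLO: {budget:.1f} min / 30 days")
```

In this example the dependencies alone cap availability below 99.9%, so a 99.9% SLO would only be realistic with replication or a fallback for at least one of them.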

Configuration Management

Static, small and non-secret configuration can be command-line flags. Use a configuration delivery service for everything else.
Dynamic configuration should have a reasonable fallback in the case of unavailability of the configuration system.
Development environment configuration shouldn’t inherit from production configuration. Inheritance can give development environments access to production services, which can cause privacy issues and data leaks.
Document what can be configured dynamically and explain the fallback behavior if the configuration delivery system is not available.
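
The fallback behavior described above might look like the following sketch. The `DynamicConfig` class, the `fetch` callable and the keys in `FALLBACK_CONFIG` are all hypothetical; a real configuration delivery service will have its own client API.

```python
import logging

# Hypothetical defaults shipped with the binary; used when the
# configuration delivery system cannot be reached.
FALLBACK_CONFIG = {"max_batch_size": 100, "feature_x_enabled": False}


class DynamicConfig:
    """Holds dynamic values and keeps the last known good ones on failure."""

    def __init__(self, fetch, fallback):
        self._fetch = fetch              # callable returning a dict; may raise
        self._current = dict(fallback)   # start from the documented fallback

    def refresh(self):
        """Try to load new values; return True if the fetch succeeded."""
        try:
            update = self._fetch()
        except Exception as exc:
            logging.warning("config fetch failed, keeping current values: %s", exc)
            return False
        self._current.update(update)
        return True

    def get(self, key):
        return self._current[key]


def unavailable():
    """Stand-in for a configuration system that is down."""
    raise ConnectionError("config service unreachable")


config = DynamicConfig(unavailable, FALLBACK_CONFIG)
config.refresh()
print(config.get("max_batch_size"))  # still served from the fallback
```

Starting from the fallback and only overwriting on a successful fetch means the service can boot and keep serving while the configuration system is unavailable.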

Release Management

Document all details about your release process. Document how releases affect SLOs (e.g. temporary higher latency due to cache misses).
Document your canary release process.
Have a canary analysis plan and set up mechanisms to automatically revert canaries if possible.
Ensure rollbacks can use the same process that rollouts use.
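
As an illustration of automated canary analysis, the sketch below compares the canary's error rate against the baseline and returns a verdict that a rollout system could act on. The thresholds and the names `Stats` and `canary_verdict` are assumptions made for this example; real canary analysis usually looks at more signals than error rate.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline, canary, min_requests=500, max_ratio=1.5, min_delta=0.001):
    """Return "wait", "promote", or "rollback".

    The canary is rolled back only if its error rate is worse than the
    baseline both relatively (max_ratio) and absolutely (min_delta), so
    that tiny differences between very low error rates are ignored.
    """
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    worse_relative = canary.error_rate > baseline.error_rate * max_ratio
    worse_absolute = canary.error_rate - baseline.error_rate > min_delta
    return "rollback" if worse_relative and worse_absolute else "promote"


print(canary_verdict(Stats(10000, 10), Stats(1000, 20)))
```

A "rollback" verdict would then trigger the same process that rollouts use, as recommended above.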

Observability

Ensure the metrics required by your SLOs are collected and exported from your binaries.
Make sure client-side and server-side observability data can be differentiated. This is important to debug issues in production.
Tune alerts to reduce toil, for example by removing alerts triggered by routine events.
If you are using Stackdriver, include GCP platform metrics in your dashboards. Set up alerting for your GCP dependencies.
Always propagate the incoming trace context. See the grpc-trace-bin metadata key for gRPC and the X-Cloud-Trace-Context header for HTTP requests. Even if you are not participating in the trace, this will allow GCP to debug production issues.
Enable Stackdriver Profiler to optimize your CPU usage.
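
Propagating the trace context for HTTP can be as simple as copying one header from the incoming request to every outgoing request made on its behalf. The sketch below uses plain dictionaries for headers; the helper name `propagate_trace_context` is hypothetical, and in practice a tracing library or framework middleware would usually do this for you. The same idea applies to the grpc-trace-bin metadata key for gRPC.

```python
# Header name from the text above, lowercased because HTTP header
# names are case-insensitive.
TRACE_HEADERS = ("x-cloud-trace-context",)


def propagate_trace_context(incoming_headers, outgoing_headers=None):
    """Return outgoing headers with the incoming trace context copied over.

    Other incoming headers (cookies, auth, etc.) are deliberately not
    forwarded.
    """
    outgoing = dict(outgoing_headers or {})
    for name, value in incoming_headers.items():
        if name.lower() in TRACE_HEADERS:
            outgoing[name] = value
    return outgoing


incoming = {
    "X-Cloud-Trace-Context": "105445aa7843bc8bf206b12000100000/1;o=1",
    "Cookie": "session=abc",
}
print(propagate_trace_context(incoming, {"Accept": "application/json"}))
```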

Security and Protection

Make sure all external requests are encrypted.
Make sure your production projects have proper IAM configuration.
Use projects to fully isolate resources.
Use networks within projects to isolate groups of VM instances.
Use Cloud VPN to securely connect remote networks.
Document and monitor user data access. Ensure that all user data access is logged and audited.
Ensure debugging endpoints are limited by ACL.
Sanitize user input. Have payload size restrictions for user input.
Ensure your service can block incoming traffic selectively per user. This allows you to block abuse cases without impacting other users.
Avoid external endpoints that trigger a large number of internal fan-outs.
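
Payload size restrictions and selective per-user blocking can both be enforced in a small admission check that runs before any real work is done. This is a minimal in-memory sketch: the `Admission` class and the 1 MiB limit are assumptions for illustration, and a production service would typically keep the block list in shared, dynamically updated configuration.

```python
MAX_PAYLOAD_BYTES = 1 * 1024 * 1024  # hypothetical 1 MiB limit


class Admission:
    """Rejects blocked users and oversized payloads before processing."""

    def __init__(self, max_payload_bytes=MAX_PAYLOAD_BYTES):
        self._max = max_payload_bytes
        self._blocked = set()

    def block(self, user_id):
        self._blocked.add(user_id)

    def unblock(self, user_id):
        self._blocked.discard(user_id)

    def check(self, user_id, payload):
        """Return an HTTP-style status code: 200, 403, or 413."""
        if user_id in self._blocked:
            return 403  # this user is blocked; other users are unaffected
        if len(payload) > self._max:
            return 413  # payload too large
        return 200


admission = Admission(max_payload_bytes=16)
admission.block("abusive-user")
print(admission.check("abusive-user", b"hi"))   # blocked user
print(admission.check("regular-user", b"hi"))   # allowed
print(admission.check("regular-user", b"x" * 64))  # too large
```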

Capacity Planning

Document how your service scales. Examples: number of users, size of incoming payload, number of incoming messages.
Document resource requirements for your service. Examples: number of dedicated VM instances, number of Spanner instances, specialized hardware such as GPUs or TPUs.
Document resource constraints: resource type, region, etc.
Document quota restrictions on creating new resources. For example, document the rate limit of the GCE API if you are creating new instances via the API.
Consider having load tests for performance regressions where possible.
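
One way to turn load test results into a regression check is to compare a latency percentile of the candidate build against a stored baseline. The sketch below uses the nearest-rank percentile; the 95th percentile and the 10% tolerance are assumptions you would tune for your own service.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def has_regressed(baseline_ms, candidate_ms, pct=95, tolerance=0.10):
    """True if the candidate's percentile latency exceeds the baseline's
    by more than the given tolerance (0.10 means 10%)."""
    limit = percentile(baseline_ms, pct) * (1 + tolerance)
    return percentile(candidate_ms, pct) > limit


# Hypothetical latency samples in milliseconds.
baseline = list(range(1, 101))
candidate = [ms + 20 for ms in baseline]
print(has_regressed(baseline, candidate))
```

Running a check like this in continuous integration makes a performance regression fail the build instead of being discovered in production.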