Google Cloud Production Guideline

Jaana Dogan
Jan 12, 2019 · 3 min read

Over the last few years, I have found myself advising many companies on Google Cloud about the basics to consider when going to production. Based on production-readiness good practices and commonly overlooked steps, I put together a list of actions to go through before production. Even though the list specifically mentions Google Cloud, it is still useful and applicable outside of Google Cloud.

Design and Development

  • Have reproducible builds. Your build shouldn’t require access to external services and shouldn’t be affected by an outage of an external system.
  • Define and set SLOs for your service at design time.
  • Document the availability expectations of external services you depend on.
  • Avoid single points of failure by not depending on a single global resource. Have the resource replicated or have a proper fallback (e.g. a hardcoded value) when the resource is not available.
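
The fallback idea in the last item can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: `read_with_fallback`, `fetch_rate_limit_from_global_store` and `DEFAULT_RATE_LIMIT` are hypothetical names, and the set of exceptions treated as "resource not available" is an assumption.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical hardcoded value used when the global resource is unreachable.
DEFAULT_RATE_LIMIT = 100


def read_with_fallback(fetch: Callable[[], T], fallback: T) -> T:
    """Return fetch(), or the hardcoded fallback if the resource is unavailable."""
    try:
        return fetch()
    except (ConnectionError, TimeoutError):
        # Assumption: these two exceptions mean "resource not available".
        return fallback


def fetch_rate_limit_from_global_store() -> int:
    # Hypothetical stand-in for a call to a single global resource that is down.
    raise ConnectionError("global store unreachable")


rate_limit = read_with_fallback(fetch_rate_limit_from_global_store, DEFAULT_RATE_LIMIT)
```

Here `rate_limit` ends up as the hardcoded default, so the service keeps running with a conservative value instead of failing along with the global resource.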

Configuration Management

  • Static, small and non-secret configuration can be command-line flags. Use a configuration delivery service for everything else.
  • Dynamic configuration should have a reasonable fallback in the case of unavailability of the configuration system.
  • Development environment configuration shouldn’t inherit from production configuration. Inheriting may lead to access to production services from development and can cause privacy issues and data leaks.
  • Document what can be configured dynamically and explain the fallback behavior if the configuration delivery system is not available.
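
As a rough sketch of the split described above, static settings can be parsed from command-line flags while a dynamic setting keeps its last known good value when the configuration system is unreachable. The flag names, the `DynamicSetting` class and the `fetch` callable are all hypothetical; a real configuration delivery service would have its own client API.

```python
import argparse
from typing import Callable, Sequence


def parse_static_flags(argv: Sequence[str]) -> argparse.Namespace:
    """Static, small, non-secret configuration passed as command-line flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8080)
    parser.add_argument("--log-level", default="info")
    return parser.parse_args(argv)


class DynamicSetting:
    """A dynamically delivered value with a documented fallback.

    `fetch` is a hypothetical client call into the configuration system.
    """

    def __init__(self, key: str, default: int, fetch: Callable[[str], int]):
        self._key = key
        self._fetch = fetch
        self._value = default  # used until the first successful fetch

    def get(self) -> int:
        try:
            self._value = self._fetch(self._key)
        except ConnectionError:
            # Configuration system unavailable: keep the last known good value.
            pass
        return self._value
```

The fallback order here (last known good value, then the built-in default) is one reasonable choice; whichever order you pick is the behavior worth documenting.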

Release Management

  • Document all details about your release process. Document how releases affect SLOs (e.g. temporary higher latency due to cache misses).
  • Document your canary release process.
  • Have a canary analysis plan and set up mechanisms to automatically revert canaries if possible.
  • Ensure rollbacks can use the same process that rollouts use.
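
A canary analysis plan can start as a simple comparison between the canary and the baseline. The sketch below is one possible rule, not a recommended policy: the `Sample` type, the `canary_verdict` function, and the `max_increase` and `min_requests` thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Sample,
    canary: Sample,
    max_increase: float = 0.01,  # assumed tolerated absolute error-rate increase
    min_requests: int = 100,     # assumed minimum traffic before deciding
) -> str:
    """Return "wait", "rollback" or "promote" for a canary release."""
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate > baseline.error_rate + max_increase:
        return "rollback"
    return "promote"
```

An automated pipeline could call `canary_verdict` on a schedule and trigger the regular rollout process in reverse when it returns "rollback", which keeps rollbacks on the same path as rollouts.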

Observability

  • Ensure the metrics required by your SLOs are collected and exported from your binaries.
  • Make sure client-side and server-side observability data can be differentiated. This is important to debug issues in production.
  • Tune alerts to reduce toil. For example, remove alerts triggered by routine events.
  • If you are using Stackdriver, include GCP platform metrics in your dashboards. Set up alerting for your GCP dependencies.
  • Always propagate the incoming trace context. See the grpc-trace-bin metadata key for gRPC and the X-Cloud-Trace-Context header for HTTP requests. Even if you are not participating in the trace, this will allow GCP to debug production issues.
  • Enable Stackdriver Profiler to optimize your CPU usage.
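
For HTTP, propagating the trace context can be as simple as copying the X-Cloud-Trace-Context header from the incoming request onto every outgoing request. The sketch below works on plain dictionaries; `propagate_trace_context` is a hypothetical helper, and in practice a tracing library or middleware would usually handle this for you.

```python
from typing import Dict

TRACE_HEADER = "X-Cloud-Trace-Context"


def propagate_trace_context(
    incoming: Dict[str, str], outgoing: Dict[str, str]
) -> Dict[str, str]:
    """Return a copy of the outgoing headers carrying the incoming trace context.

    HTTP header names are case-insensitive, so the lookup is too.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    result = dict(outgoing)
    value = lowered.get(TRACE_HEADER.lower())
    if value is not None:
        result[TRACE_HEADER] = value
    return result
```

The header value is passed through untouched, which is all that is needed when the service does not take part in the trace itself.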

Security and Protection

  • Make sure all external requests are encrypted.
  • Make sure your production projects have proper IAM configuration.
  • Use projects to fully isolate resources.
  • Use networks within projects to isolate groups of VM instances.
  • Use Cloud VPN to securely connect remote networks.
  • Document and monitor user data access. Ensure that all user data access is logged and audited.
  • Ensure debugging endpoints are limited by ACL.
  • Sanitize user input. Have payload size restrictions for user input.
  • Ensure your service can block incoming traffic selectively per user. This allows you to block abuse cases without impacting other users.
  • Avoid external endpoints that trigger a large number of internal fan-outs.
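
Two of the items above, payload size restrictions and selective per-user blocking, can be combined into a small admission check that runs before any real work is done. This is a sketch under assumptions: `check_request`, `RequestRejected` and the 1 MiB limit are hypothetical, and a production service would typically load the blocklist from dynamic configuration.

```python
from typing import Set

MAX_PAYLOAD_BYTES = 1024 * 1024  # assumed limit of 1 MiB


class RequestRejected(Exception):
    def __init__(self, status: int, reason: str):
        super().__init__(reason)
        self.status = status
        self.reason = reason


def check_request(
    user_id: str,
    payload: bytes,
    blocked_users: Set[str],
    max_bytes: int = MAX_PAYLOAD_BYTES,
) -> None:
    """Raise RequestRejected if the request should not be processed."""
    if user_id in blocked_users:
        # Block only the abusive user; everyone else is unaffected.
        raise RequestRejected(403, "user blocked")
    if len(payload) > max_bytes:
        raise RequestRejected(413, "payload too large")
```

Checking the blocklist first means an abusive user is rejected cheaply, before the payload is inspected at all.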

Capacity Planning

  • Document how your service scales. Examples: number of users, size of incoming payload, number of incoming messages.
  • Document resource requirements for your service. Examples: number of dedicated VM instances, number of Spanner instances, specialized hardware such as GPUs or TPUs.
  • Document resource constraints: resource type, region, etc.
  • Document quota restrictions to create new resources. For example, document the rate limit of the GCE API if you are creating new instances via the API.
  • Consider having load tests for performance regressions where possible.
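
Documenting how a service scales is easier with a small back-of-the-envelope calculation next to the prose. The sketch below is illustrative only: the function names, the 30% headroom, and every number in the example are made-up assumptions, not GCE quotas or measured capacities.

```python
import math


def instances_needed(
    peak_qps: float, qps_per_instance: float, headroom: float = 0.3
) -> int:
    """Instances required to serve peak traffic with spare capacity."""
    return math.ceil(peak_qps * (1 + headroom) / qps_per_instance)


def seconds_to_provision(instances: int, create_calls_per_second: float) -> float:
    """Lower bound on provisioning time when instance creation is rate limited."""
    return instances / create_calls_per_second


# Hypothetical numbers: 5000 QPS at peak, 400 QPS per instance,
# and an API rate limit that allows 2 instance creations per second.
needed = instances_needed(peak_qps=5000, qps_per_instance=400)
provision_time = seconds_to_provision(needed, create_calls_per_second=2)
```

With these assumed inputs, `needed` is 17 instances and `provision_time` is 8.5 seconds, which shows how an API rate limit puts a floor on how quickly the service can scale out.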

Google Cloud - Community

Google Cloud community articles and blogs

Written by Jaana Dogan

See rakyll.org for more.

Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

