Over the last few years, I have found myself advising many companies on Google Cloud about the basics to consider when going to production. Based on production-readiness good practices and commonly overlooked steps, I put together a list of actions to go through before production. Even though the list specifically mentions Google Cloud, it is still useful and applicable outside of Google Cloud.
Design and Development
- Have reproducible builds. Your build shouldn’t require access to external services and shouldn’t be affected by an outage of an external system.
- Define and set SLOs for your service at design time.
- Document the availability expectations of external services you depend on.
- Avoid single points of failure by not depending on a single global resource. Have the resource replicated or have a proper fallback (e.g. a hardcoded value) when the resource is not available.
- Static, small and non-secret configuration can be command-line flags. Use a configuration delivery service for everything else.
- Dynamic configuration should have a reasonable fallback in case the configuration system is unavailable.
- Development environment configuration shouldn’t inherit from production configuration. This may lead to access to production services from development and can cause privacy issues and data leaks.
- Document what can be configured dynamically and explain the fallback behavior if the configuration delivery system is not available.
- Document all details about your release process. Document how releases affect SLOs (e.g. temporary higher latency due to cache misses).
- Document your canary release process.
- Have a canary analysis plan and set up mechanisms to automatically revert canaries if possible.
- Ensure rollbacks can use the same process that rollouts use.
- Ensure the metrics required by your SLOs are collected and exported from your binaries.
- Make sure client-side and server-side observability data can be differentiated. This is important for debugging issues in production.
- Tune alerts to reduce toil. For example, remove alerts triggered by routine events.
- If you are using Stackdriver, include GCP platform metrics in your dashboards. Set up alerting for your GCP dependencies.
- Always propagate the incoming trace context. See the grpc-trace-bin metadata key for gRPC and the X-Cloud-Trace-Context header for HTTP requests. Even if you are not participating in the trace, this will allow GCP to debug production issues.
- Enable Stackdriver Profiler to optimize your CPU usage.
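As a rough illustration of the trace-propagation item in the list above, here is a minimal Python sketch that picks the trace-related entries out of an incoming request so they can be attached to any outgoing call. The function name propagate_trace_context and the TRACE_KEYS tuple are hypothetical helpers invented for this example, and the sample header value is made up. A real service would more likely rely on its tracing library to do this.

```python
from typing import Dict, Mapping, Union

# Keys named in the checklist: the HTTP header and the gRPC metadata key.
# Stored lowercase so the lookup below can be case-insensitive.
TRACE_KEYS = ("x-cloud-trace-context", "grpc-trace-bin")

# gRPC "-bin" metadata values are bytes; HTTP header values are strings.
HeaderValue = Union[str, bytes]


def propagate_trace_context(
    incoming: Mapping[str, HeaderValue],
) -> Dict[str, HeaderValue]:
    """Return only the trace-context entries of an incoming request.

    This is a hypothetical helper: the result is meant to be merged into
    the headers (or metadata) of every outgoing call made while handling
    the request, even if the service itself records no spans.
    """
    outgoing: Dict[str, HeaderValue] = {}
    for name, value in incoming.items():
        if name.lower() in TRACE_KEYS:
            outgoing[name.lower()] = value
    return outgoing


if __name__ == "__main__":
    # Made-up incoming headers for demonstration purposes.
    request_headers = {
        "Host": "example.internal",
        "X-Cloud-Trace-Context": "105445aa7843bc8bf206b12000100000/1;o=1",
    }
    print(propagate_trace_context(request_headers))
```

The helper deliberately forwards the value untouched instead of parsing it, since a service that does not participate in the trace only needs to pass the context along.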
Security and Protection
- Make sure all external requests are encrypted.
- Make sure your production projects have proper IAM configuration.
- Use projects to fully isolate resources.
- Use networks within projects to isolate groups of VM instances.
- Use Cloud VPN to securely connect remote networks.
- Document and monitor user data access. Ensure that all user data access is logged and audited.
- Ensure debugging endpoints are limited by ACL.
- Sanitize user input. Have payload size restrictions for user input.
- Ensure your service can block incoming traffic selectively per user. This allows you to block abuse cases without impacting other users.
- Avoid external endpoints that trigger a large number of internal fan-outs.
- Document how your service scales. Examples: number of users, size of incoming payload, number of incoming messages.
- Document resource requirements for your service. Examples: number of dedicated VM instances, number of Spanner instances, specialized hardware such as GPUs or TPUs.
- Document resource constraints: resource type, region, etc.
- Document quota restrictions on creating new resources. For example, document the rate limit of the GCE API if you are creating new instances via the API.
- Consider having load tests for performance regressions where possible.
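To make the payload-size and per-user blocking items in the list above more concrete, here is a minimal Python sketch of a request gate that rejects oversized payloads and traffic from blocked users while leaving everyone else unaffected. The RequestGate class, its method names, and the 1 MiB default limit are all assumptions made for this example. In production this kind of check often lives in a load balancer or API gateway instead of application code.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class RequestGate:
    """Hypothetical admission check run before a request is processed."""

    # Assumed default limit; pick a value that fits your service.
    max_payload_bytes: int = 1024 * 1024
    blocked_users: Set[str] = field(default_factory=set)

    def block(self, user_id: str) -> None:
        """Start rejecting traffic from a single user."""
        self.blocked_users.add(user_id)

    def unblock(self, user_id: str) -> None:
        """Resume accepting traffic from a user; no-op if not blocked."""
        self.blocked_users.discard(user_id)

    def check(self, user_id: str, payload: bytes) -> Tuple[bool, str]:
        """Return (allowed, reason) for one incoming request."""
        if user_id in self.blocked_users:
            return False, "user blocked"
        if len(payload) > self.max_payload_bytes:
            return False, "payload too large"
        return True, "ok"


if __name__ == "__main__":
    gate = RequestGate(max_payload_bytes=16)
    gate.block("abusive-user")
    print(gate.check("abusive-user", b"hello"))
    print(gate.check("regular-user", b"hello"))
    print(gate.check("regular-user", b"x" * 100))
```

Keeping the block list keyed by user is what lets an operator shut off one abusive caller during an incident without degrading service for the rest.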