Production Reliability Principles [from fowler16]:

Steve McGhee
5 min read · Apr 19, 2018


This is a summary of Susan Fowler’s book. Go buy it!

“A Service is…

Stable

  • Stable dev cycle — trusted QA cycle, mocks, unit/integration tests, dev environments emulating prod. eg: CI/CD pipeline
  • Stable deployment process — clear indicators of regressions, quick/automated rollbacks of canaries (a canary-verdict sketch follows this list)
  • Stable introduction/deprecation process — eg: EOL of libraries and features are planned and executed
  • Utilize immutable infrastructure
  • * This should provide rapid turnup and teardown of entire sets of the infrastructure.
  • * Also the ability to have ephemeral clusters for testing
  • * No need to continually track changes; just declare what needs to exist.
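
To make the “clear indicators of regressions” idea concrete, here is a minimal sketch (my own, not from the book) of an automated canary verdict in Go; the struct, margin, and counts are illustrative.

```go
// A hedged sketch (not from the book): an automated canary verdict that
// compares the canary's error rate against the baseline's. Struct names,
// the margin, and the counts are illustrative assumptions.
package main

import "fmt"

// Stats holds request totals for one deployment over the evaluation window.
type Stats struct {
	Requests int
	Errors   int
}

func errorRate(s Stats) float64 {
	if s.Requests == 0 {
		return 0
	}
	return float64(s.Errors) / float64(s.Requests)
}

// shouldRollback flags the canary when its error rate exceeds the baseline's
// by more than the allowed margin: the "clear indicator of a regression".
func shouldRollback(baseline, canary Stats, margin float64) bool {
	return errorRate(canary) > errorRate(baseline)+margin
}

func main() {
	baseline := Stats{Requests: 100000, Errors: 120} // 0.12% errors
	canary := Stats{Requests: 5000, Errors: 40}      // 0.80% errors
	if shouldRollback(baseline, canary, 0.002) {
		fmt.Println("canary regressed: trigger automated rollback")
	} else {
		fmt.Println("canary healthy: continue rollout")
	}
}
```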

Reliable

  • Reliable deployment process — release, canary, validate, fast rollbacks
  • Planning, mitigation and protection from dependency failure — do not require/expect perfection from your backends (a timeout-and-fallback sketch follows this list)
  • Reliable routing and service discovery — accurate and fast health checking, decentralized service map, fast service map update / propagation
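
Here is a minimal sketch of not expecting perfection from a backend: bound the dependency call with a deadline and serve a degraded default when it fails. fetchRecommendations and the fallback list are hypothetical.

```go
// A hedged sketch of planning for dependency failure: bound the backend call
// with a timeout and serve a degraded default instead of failing the request.
// fetchRecommendations and the fallback list are hypothetical.
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchRecommendations stands in for a call to a backend dependency;
// here it is deliberately slower than the caller's deadline.
func fetchRecommendations(ctx context.Context, userID string) ([]string, error) {
	select {
	case <-time.After(300 * time.Millisecond): // simulated slow backend
		return []string{"item-1", "item-2"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// recommendationsOrDefault protects the caller: a slow or failing backend
// degrades the response rather than breaking it.
func recommendationsOrDefault(userID string) []string {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	recs, err := fetchRecommendations(ctx, userID)
	if err != nil {
		return []string{"popular-item"} // cached/static fallback
	}
	return recs
}

func main() {
	fmt.Println(recommendationsOrDefault("user-42")) // prints the fallback
}
```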

Scalable

  • Growth scaling understood (quantitative, qualitative) — does this service grow with customer acquisition? time? languages-supported? what is the ratio of “eyeballs” to rps?
  • Identify resource bottlenecks and requirements — shared resources: CPU/RAM/ports/FDs/quotas/DB-connections/… — favor horizontal scaling (more computers) over vertical scaling (bigger computers).
  • Capacity planning — repeatable tests of “real” traffic. establish “sustainable limit” vs “breaking point” and track over time, finding regressions.
  • Scalable traffic handling (batch/burst) — planning for surges of traffic: dropping less-valuable traffic, using queues effectively, fail-fast, log/monitor traffic levels. Utilize traffic troughs for batch compute.
  • Scale to zero: zero traffic = zero cost, no idle VMs.
  • Autoscaling based on load. No need to “see new traffic surges coming”; instead, respond immediately to demand. Scale LB pools as well as the corresponding compute sinks (the sizing math is sketched after this list).
  • Scaling dependencies: planning and execution — introducing new frontends can require more backends. a process for determining and communicating this is needed.
  • Scalable data storage — consider upfront: consistency needs, geographical replication needs. also: side effects of schema changes, reads/write ratio, parallelism and replica choices.
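
The autoscaling bullet above boils down to simple sizing math. A minimal sketch, assuming requests per second is the signal and the per-replica target comes from capacity tests:

```go
// A hedged sketch of load-based autoscaling math, assuming requests per second
// is the scaling signal. The per-replica target would come from capacity tests;
// all numbers here are illustrative.
package main

import (
	"fmt"
	"math"
)

// desiredReplicas sizes the pool so each replica stays near its sustainable
// limit, clamped to configured bounds. A real autoscaler would also smooth
// the signal over time to avoid flapping.
func desiredReplicas(currentRPS, targetRPSPerReplica float64, minReplicas, maxReplicas int) int {
	n := int(math.Ceil(currentRPS / targetRPSPerReplica))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// 12,000 rps observed; capacity tests say ~500 rps per replica is sustainable.
	fmt.Println(desiredReplicas(12000, 500, 2, 100)) // -> 24
}
```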

Fault Tolerant / Prepared for Catastrophe

  • Identify and plan for failure scenarios — practice regular and creative disaster exercises
  • Single points of failure identified and resolved — beware solitary message queues
  • Failure detection and remediation strategies in place — plan for: hardware failures, network failures, configuration management failure, monitoring/logging failure. Practice these failure modes during regular disaster testing
  • Testing: code/unit testing, integration testing, end-to-end testing, load testing, chaos testing, ongoing chaos “fuzzing”.
  • Traffic management abilities, in preparation for failure. We should have the knobs available to drop traffic, to sacrifice nodes/clusters, and to totally isolate services (a load-shedding sketch follows this list).
  • Incidents and outages handled productively:
  • * Assessment: an automated indicator of a problem, with a clear call to action. Not “eyeballs on a graph” but a monitoring system with alert thresholds. Calls to action include immediately determining experts and service-owning teams. Services must have oncall rotations pre-populated and maintained.
  • * Coordinate: communicating about the incident in real time, allowing “lurking” without interference, clearly recording events and actions in a log, and allowing other (dependent) failing services to know about and “attach” to a given incident. Develop a single point of contact for record keeping and information about a given incident.
  • * Mitigate: reduce impact to end users immediately, including at the cost of rolling back an awaited release. Mitigation is not always the long-term fix, but a reduction or elimination of harm. Once mitigation is complete, the TTM (time-to-mitigation) clock stops and the service’s SLA is no longer threatened.
  • * Resolve: If mitigation included short-term fixes, here you fix them properly. This stage is not time-dependent, as the service’s SLA is no longer impacted. This stage can be done without being rushed.
  • * Follow-up: blameless postmortems must be written, including concrete steps around: detection, mitigation, prevention. These next steps should have team and/or individual ownership, as well as expected deadlines. These should be reviewed for completion. Bugs that are related to a production incident should be treated with higher priority than non-exercised theoretical failure modes.
  • Datastores must have backups that are exercised via regular data-restoration tests. This includes single-datum restores, point-in-time restores, and full-scale restores.
  • These tests ensure recovery not only from disastrous failure of data availability, but also detection and correction of slower data corruption / decay.
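
One of the traffic-management knobs mentioned above, sketched as Go HTTP middleware that drops a configurable fraction of low-priority requests; the header name and percentage are assumptions.

```go
// A hedged sketch of one traffic-management knob: HTTP middleware that sheds a
// configurable fraction of low-priority requests during an incident. The
// priority header name and the shed percentage are assumptions.
package main

import (
	"log"
	"math/rand"
	"net/http"
	"sync/atomic"
)

// shedPercent is the percentage (0-100) of low-priority traffic to drop;
// an operator can flip it at runtime during an incident.
var shedPercent atomic.Int64

func shedding(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		lowPriority := r.Header.Get("X-Request-Priority") == "low"
		if lowPriority && rand.Int63n(100) < shedPercent.Load() {
			// Fail fast with a retryable status instead of queueing work we cannot serve.
			http.Error(w, "shedding load", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	shedPercent.Store(50) // operator knob: drop half of low-priority traffic

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", shedding(mux)))
}
```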

Performant

  • SLAs/SLOs for availability established and monitored (the error-budget math is sketched after this list)
  • Efficient utilization of resources (CPU/RAM)
  • Region-aware storage and compute. Serving a customer’s request from a “nearby” node results in a better experience for the given user.
  • * This often implies colocating the user’s data to a nearby geography, as well as the associated compute stack.
  • * Allow for the “migration” of user data across geographies. (eg: Texan user visits Taipei)
  • * Consider Hot/Warm/Cold storage options.
  • * Fully in-region for roughly 80% of workloads and data.
  • * 3 basic regions: APAC, EMEA, NORAM (the Americas).
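
A minimal sketch of the SLO and error-budget arithmetic behind “established and monitored”, assuming a request-based availability SLO; the 99.9% target and the counts are made up.

```go
// A hedged sketch of the SLO / error-budget arithmetic behind "established and
// monitored", assuming a request-based availability SLO. The 99.9% target and
// the request counts are illustrative.
package main

import "fmt"

func main() {
	const sloTarget = 0.999 // 99.9% of requests must succeed over the window

	totalRequests := 10_000_000.0
	failedRequests := 4_200.0

	availability := 1 - failedRequests/totalRequests
	budget := (1 - sloTarget) * totalRequests // failures the SLO allows: 10,000
	budgetConsumed := failedRequests / budget // fraction of the budget spent

	fmt.Printf("availability: %.4f%%\n", availability*100)            // 99.9580%
	fmt.Printf("error budget consumed: %.0f%%\n", budgetConsumed*100) // 42%
}
```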

Observable / Monitored [ 1 @copyconstruct | 2 @SREBook ]

  • Logging and tracing throughout the stack. A given request or type of request should be trackable across service boundaries. Service versioning must be included in the logs, to determine which version of software results in which log / response.
  • Separation of service-level vs host-level metrics
  • * Host-level metrics are often not applicable to services running on a scheduled job-management system like Kubernetes.
  • * Aggregated Service-level: showing the utilization of resources across an entire microservice can be helpful.
  • * Drilldown Service-level (eg: individual container) metrics should be findable, but not the primary source of alerting or viewing.
  • Both Whitebox and Blackbox systems.
  • * Whitebox: explicit instrumentation of the code, exposing internals of the system to monitoring rules and consoles: counters, queue lengths, virtual machine stats, garbage-collection stats, and other internal statistics.
  • * Blackbox: monitoring a service as a user would see it. Issue requests to the public API/URLs, measure response times and error codes, and optionally check for particular bits of data in the response (a probe sketch follows this list).
  • Dashboards that are understandable and accurately reflect the health of each service
  • * Health dashboards: metrics that show how the system is working, throughput rates, response times, etc.
  • * Product dashboards: metrics that show how the business-level metrics are doing, money earned, users logged-in, active sessions, empty shopping carts, etc.
  • Chargeback — a transparent economic model and understanding of resource usage per service, to ensure proper tradeoffs.
  • * Allow for teams to view and manage their own usage, weigh infra costs against development.
  • * Pricing and usage must be available for service owners, both for their service as well as their usage of shared infrastructure. The level of granularity should allow individual services to make decisions on their usage without conferring with external teams.
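
A minimal sketch of a blackbox probe as described above: issue a request to a public URL, measure status and latency, and check the response for expected content. The URL and expected string are placeholders.

```go
// A hedged sketch of a blackbox probe: hit a public endpoint as a user would,
// record latency and status, and check the body for expected content. The URL
// and the expected string are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func probe(url, mustContain string) error {
	client := &http.Client{Timeout: 5 * time.Second}

	start := time.Now()
	resp, err := client.Get(url)
	latency := time.Since(start)
	if err != nil {
		return fmt.Errorf("probe failed after %v: %w", latency, err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // cap the read at 1 MiB
	fmt.Printf("status=%d latency=%v\n", resp.StatusCode, latency)

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	if !strings.Contains(string(body), mustContain) {
		return fmt.Errorf("response missing expected content %q", mustContain)
	}
	return nil
}

func main() {
	if err := probe("https://example.com/healthz", "ok"); err != nil {
		fmt.Println("blackbox probe alert:", err)
	}
}
```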

Documented

  • Clear architecture diagram (including dependencies and exposed interfaces, request flows, available endpoints)
  • FAQs about a service, for future maintainers and future potential integrations
  • Oncall runbook. Given a failure mode, what knobs are available, how to diagnose and mitigate issues. These should be short and clearly written.
  • Aside from explicit knobs and mitigation, these should also include basic Troubleshooting and Debugging sections, detailing strategic and methodical ways to observe a given system.
  • Alerts should be tied to an actionable runbook entry.
  • Runbooks must be discoverable and searchable.
  • Thorough, updated, and centralized (discoverable) documentation for each service
  • Organizational understanding of teams around service: the oncall rotation that is responsible for a service, pointers to team structure, the SLA for a service.

Secure

  • (Out of Scope)

[fowler16] — Susan Fowler, Production-Ready Microservices: Building Standardized Systems Across an Engineering Organization. O’Reilly Media. http://shop.oreilly.com/product/0636920053675.do
