Operational Excellence

This story summarizes the essentials of Operational Excellence for managing your cloud environment. Although it refers to AWS, the principles apply equally to any cloud or on-premises environment.

AWS published the Well-Architected Framework in 2015 and described four pillars of well-architected systems, namely:

  • Reliability
  • Security
  • Performance Efficiency
  • Cost Optimization

Operational Excellence is the latest addition to the Well-Architected Framework. This story focuses on that pillar and summarizes its salient principles for quick reference, based on the AWS white papers.

Design Principles of Operational Excellence

  • Perform Operations as Code: Treat all operational aspects (infrastructure, applications) as code and script them. Apply the scripts uniformly across all environments and automate them to limit human error.
  • Annotated Documentation: Update annotated documentation with every change. Use the annotations as input to your operations code. For example, release notes can be generated automatically from GitHub commit messages and issue templates.
  • Make frequent, small and reversible changes: Design workloads so that components can be updated regularly, in changes small enough to reverse when they fail.
  • Refine Operations Procedures Frequently: As the workload evolves, evolve the operational procedures along with it. Conduct game days to validate the procedures.
  • Anticipate Failures: Perform failure tests to identify potential sources of failure, then mitigate or remove them. Check out http://principlesofchaos.org/
  • Learn from Operational Failures: Drive improvements through operational events and lessons learned. Focus on the problem, not the people.
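
The release-notes example from the Annotated Documentation principle can be sketched as follows. This is a minimal illustration, assuming commit messages follow a hypothetical "type: summary" prefix convention (feat, fix). The function name build_release_notes, the section titles and the sample commit messages are assumptions made for illustration, not part of any AWS or GitHub tooling.

```python
from collections import defaultdict

# Hypothetical mapping from commit-message prefix to release-note section.
SECTIONS = {"feat": "Features", "fix": "Bug Fixes"}


def build_release_notes(version, commit_messages):
    """Group commit messages by prefix and render plain-text release notes."""
    grouped = defaultdict(list)
    for message in commit_messages:
        prefix, sep, summary = message.partition(":")
        if sep and prefix.strip() in SECTIONS:
            grouped[SECTIONS[prefix.strip()]].append(summary.strip())
        else:
            # Messages without a recognized prefix are kept, not dropped.
            grouped["Other"].append(message.strip())

    lines = [f"Release {version}"]
    for section in [*SECTIONS.values(), "Other"]:
        if grouped[section]:
            lines.append("")
            lines.append(section)
            lines.extend(f"  - {item}" for item in grouped[section])
    return "\n".join(lines)


# Illustrative commit messages (made up for this sketch).
notes = build_release_notes(
    "1.4.0",
    ["feat: add health check endpoint", "fix: retry on throttling", "update README"],
)
print(notes)
```

In practice the commit messages would come from the version control history rather than a hard-coded list, so the notes stay in step with every change.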

Areas of Operational Excellence

Prepare

Operational Priorities

  • Understand the entire workload and the business goals
  • Understand regulatory and compliance requirements
  • Leverage AWS Trusted Advisor, AWS Cloud Compliance, and the Well-Architected Framework for best practices

Design for Operations

  • Design how workload can be deployed, updated and operated
  • Implement engineering practices for troubleshooting and defect fixing
  • Observe the system using logging/instrumentation and insightful business and technical metrics
  • Design your workload as code at all layers of the stack (Application/Infrastructure/Policies/Governance and Operations)
  • Version control the infrastructure
  • Implement CI/CD at all levels
  • Leverage metadata to identify resources for operational activities, e.g. use tags to identify the environment, owner, etc.
  • Publish and capture metrics
  • Leverage key AWS services like CloudWatch, Developer Tools, and X-Ray
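
The points on tags and metrics above can be sketched together. This is a minimal sketch, assuming a hypothetical in-memory inventory of resources with a "tags" dictionary. The helper names select_by_tags and build_metric_datum, the metric name and the namespace mentioned in the comments are assumptions. The datum follows the entry shape that the CloudWatch PutMetricData API accepts, but nothing is sent to AWS here.

```python
def select_by_tags(resources, **required_tags):
    """Return resources whose tags contain every required key/value pair."""
    return [
        resource
        for resource in resources
        if all(
            resource.get("tags", {}).get(key) == value
            for key, value in required_tags.items()
        )
    ]


def build_metric_datum(name, value, unit="Count", **dimensions):
    """Build one metric entry in the shape CloudWatch PutMetricData accepts."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ],
    }


# Hypothetical inventory; real data would come from the provider's APIs.
inventory = [
    {"id": "i-001", "tags": {"environment": "prod", "owner": "payments"}},
    {"id": "i-002", "tags": {"environment": "dev", "owner": "payments"}},
    {"id": "i-003", "tags": {"environment": "prod", "owner": "search"}},
]

prod = select_by_tags(inventory, environment="prod")
datum = build_metric_datum("TaggedInstances", len(prod), Environment="prod")
print([resource["id"] for resource in prod], datum)

# With boto3 installed and credentials configured, the datum could be sent as:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Example/Operations",  # hypothetical namespace
#       MetricData=[datum],
#   )
```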

Operational Readiness

  • Use consistent processes for deploying workloads
  • Use runbooks to automate routine activities such as deployments, and playbooks for issue resolution
  • Staff a right-sized operations team (Site Reliability Engineers)
  • Use comparable parallel environments to test failures and performance
  • Leverage services like EC2 Systems Manager to run scripts on EC2 instances and use AWS Lambda to respond to events
  • Anticipate failures and test for failures
  • Leverage AWS Config to track changes to vital configurations, e.g. CloudFormation templates (CFTs)
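
A runbook that automates a routine activity can be expressed as code. The sketch below is one possible shape, not a reference implementation: it assumes each step is a hypothetical (name, action, rollback) triple, runs the steps in order, and undoes the completed steps in reverse order when one fails. The step names and the simulated failure are made up for illustration.

```python
def run_runbook(steps):
    """Run (name, action, rollback) steps in order; undo completed steps on failure."""
    completed = []
    log = []
    for name, action, rollback in steps:
        try:
            action()
            log.append(f"ok: {name}")
            completed.append((name, rollback))
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            # Reverse the steps that already succeeded, most recent first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"rolled back: {done_name}")
            return False, log
    return True, log


# Hypothetical deployment state and steps, for illustration only.
state = {"traffic": "live", "version": "v1"}


def drain_traffic():
    state["traffic"] = "drained"


def restore_traffic():
    state["traffic"] = "live"


def deploy_v2():
    # Simulated failure so the rollback path is exercised.
    raise RuntimeError("health check failed")


def revert_to_v1():
    state["version"] = "v1"


succeeded, log = run_runbook(
    [
        ("drain traffic", drain_traffic, restore_traffic),
        ("deploy v2", deploy_v2, revert_to_v1),
    ]
)
print(succeeded, log, state)
```

Keeping the rollback next to each action makes the procedure testable in a parallel environment before it is needed in production.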

Operate

Understanding the operational health of workloads is key. Use business and technical metrics to observe events and respond to them.

Understanding Operational Health

  • Use metrics based on operational outcomes, e.g. successful logins per second during peak vs. non-peak periods, and identify deviations
  • Implement dashboards and technical viewpoints to help the operations team make informed decisions
  • Leverage logging services like CloudWatch Logs and dashboards like CloudWatch Dashboards
  • Analyze logs, e.g. ingest logs into Amazon Elasticsearch Service and use Kibana dashboards
  • Leverage the AWS Service Health Dashboard (SHD) and Personal Health Dashboard (PHD) to monitor higher-level events that might affect the system
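
Identifying deviations in an outcome metric such as successful logins per second can be sketched with a simple baseline comparison. This is one possible approach under stated assumptions: a value is flagged when it lies more than a chosen number of standard deviations from the baseline mean. The function name find_deviations, the threshold of 3.0 and the sample numbers are illustrative assumptions, not measured data.

```python
from statistics import mean, stdev


def find_deviations(baseline, observed, threshold=3.0):
    """Return (index, value) pairs that lie more than `threshold`
    standard deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A flat baseline: any different value counts as a deviation.
        return [(i, v) for i, v in enumerate(observed) if v != mu]
    return [
        (i, v) for i, v in enumerate(observed) if abs(v - mu) / sigma > threshold
    ]


# Illustrative logins-per-second samples (made up for this sketch).
peak_baseline = [118, 122, 120, 119, 121, 123, 117, 120]
current = [121, 119, 60, 122]

print(find_deviations(peak_baseline, current))
```

Peak and non-peak periods would each need their own baseline, since a value that is normal off-peak may be a sharp drop during peak hours.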

Responding to Events

  • Anticipate planned events such as sales promotions, peak sale days and paydays, as well as unplanned events such as component failures and cloud provider outages
  • Use runbooks and playbooks to respond to alerts consistently
  • Assign alerts to an accountable operations team
  • Conduct Root Cause Analysis (RCA) to refine runbooks and playbooks
  • Improve recovery by replacing failed components with last known good versions, and analyze the failed resources separately
  • Respond to events using available services, e.g. use CloudWatch Events to invoke Lambda functions, ECS tasks, etc., or use service APIs to connect to third-party services like Splunk or Sumo Logic
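
Responding to an event with a Lambda function can be sketched as a handler that maps an incoming event to a playbook. The handler below assumes the event has the shape of an EC2 instance state-change notification (source "aws.ec2", with "instance-id" and "state" under "detail"). The playbook names and the returned dictionary are hypothetical, and the handler only decides what to do rather than calling any AWS service.

```python
# Hypothetical mapping from instance state to the playbook to follow.
PLAYBOOKS = {
    "stopped": "playbook-restart-instance",
    "terminated": "playbook-replace-from-last-known-good",
}


def handler(event, context):
    """Lambda-style entry point: pick a playbook for an EC2 state-change event."""
    if event.get("source") != "aws.ec2":
        return {"action": "ignored", "reason": "unsupported source"}

    detail = event.get("detail", {})
    state = detail.get("state")
    playbook = PLAYBOOKS.get(state)
    if playbook is None:
        return {"action": "ignored", "reason": f"no playbook for state {state!r}"}

    return {
        "action": "run_playbook",
        "playbook": playbook,
        "instance_id": detail.get("instance-id"),
    }


# Sample event for local testing; the instance ID is a placeholder.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"},
}

result = handler(sample_event, None)
print(result)
```

Because the decision logic is a plain function, it can be exercised with sample events before it is wired to real alerts.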

Evolve

Improving over time is key to success. Implement small, incremental changes and evolve from the lessons learned.

Learning from Experience

  • When things fail, learn from the failures
  • Analyze failures and plan improvements
  • Review lessons learned widely across teams and validate them
  • Perform cross-team reviews with business, operations and developer teams to validate insights and identify areas for improvement
  • Leverage key AWS services like CloudWatch Logs, Athena, S3 and QuickSight to collect and analyze data
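
Analyzing failures to plan improvements can start with a simple summary of past incidents. The sketch below assumes hypothetical incident records with "root_cause" and "minutes_to_recover" fields. The function name summarize_incidents and the sample records are illustrative, and in practice the data might be queried from logs stored in S3 rather than listed inline.

```python
from collections import defaultdict


def summarize_incidents(incidents):
    """Return (root_cause, count, average_minutes_to_recover) rows,
    most frequent cause first, ties broken alphabetically."""
    totals = defaultdict(lambda: {"count": 0, "minutes": 0})
    for incident in incidents:
        entry = totals[incident["root_cause"]]
        entry["count"] += 1
        entry["minutes"] += incident["minutes_to_recover"]

    rows = [
        (cause, total["count"], total["minutes"] / total["count"])
        for cause, total in totals.items()
    ]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Illustrative incident records (made up for this sketch).
incidents = [
    {"root_cause": "config drift", "minutes_to_recover": 30},
    {"root_cause": "expired certificate", "minutes_to_recover": 20},
    {"root_cause": "config drift", "minutes_to_recover": 50},
]

for row in summarize_incidents(incidents):
    print(row)
```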

Share Learnings

  • Share learnings to increase the benefit across organizations
  • Socializing frequently occurring issues and improvement opportunities frees teams to focus on delivering more features
  • Share CloudFormation templates, AMIs and reusable Lambda functions for key operational actions
  • Use version-controlled code for all layers of the stack for tracking and sharing

Closing thoughts

  • Operational Excellence is an ongoing effort
  • Treat every failure and operational event as an opportunity to improve
  • Focus on incremental improvements and learn from failures/retrospectives

References

Find more in-depth references in the white papers:

AWS Well-Architected Framework

Operational Excellence White Paper