Operational Excellence

Published in

becloudy

4 min read · Nov 24, 2017

This story distills the essentials of Operational Excellence for managing a cloud environment. Although it refers to AWS throughout, the principles apply equally to any cloud or on-premises environment.

AWS published the Well-Architected Framework in 2015 and described four pillars of a well-architected system:

Reliability
Security
Performance Efficiency
Cost Optimization

Operational Excellence is the latest addition to the Well-Architected Framework. This story focuses on that pillar and summarizes its key principles for quick reference, based on the AWS white papers.

Design Principles of Operational Excellence

Perform Operations as Code: Treat all operational aspects (infrastructure, applications) as code and script them. Apply the same scripts uniformly across all environments, and automate them to limit human error.
Annotated Documentation: Update annotated documentation with every change, and use the annotations as input to your operations code. For example, release notes can be generated automatically from GitHub commit messages and issue templates.
Make frequent, small, and reversible changes: Design workloads so that components can be updated regularly, in changes small enough to reverse when they fail.
Refine Operations Procedures Frequently: As the workload evolves, evolve the operational procedures with it. Conduct game days to validate the procedures.
Anticipate Failures: Perform failure tests to identify potential sources of failure, then mitigate or remove them. Check out http://principlesofchaos.org/
Learn from Operational Failures: Drive improvements through operational events and lessons learned. Focus on the problem, not the people.
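As a rough illustration of the release-notes example above, the sketch below groups commit messages by a prefix and renders plain-text notes. The prefix convention (`feat:`, `fix:`, `ops:`) and the function name `build_release_notes` are assumptions made for this example and are not part of any AWS or GitHub tooling.

```python
from collections import defaultdict

# Hypothetical mapping from commit-message prefix to release-note section.
SECTIONS = {"feat": "Features", "fix": "Bug Fixes", "ops": "Operations"}


def build_release_notes(version, commit_messages):
    """Group commit messages by prefix and render plain-text release notes."""
    grouped = defaultdict(list)
    for message in commit_messages:
        prefix, sep, summary = message.partition(":")
        section = SECTIONS.get(prefix.strip().lower()) if sep else None
        if section is None:
            # Missing or unrecognized prefix: keep the whole message under "Other".
            section, summary = "Other", message
        grouped[section].append(summary.strip())

    lines = [f"Release {version}"]
    for section in list(SECTIONS.values()) + ["Other"]:
        if grouped[section]:
            lines.append("")
            lines.append(section)
            lines.extend(f"- {item}" for item in grouped[section])
    return "\n".join(lines)


notes = build_release_notes(
    "1.4.0",
    ["feat: add health endpoint", "fix: retry on timeout", "update readme"],
)
print(notes)
```

In practice the commit messages would come from the version control system as part of the build, so the notes are regenerated with every change rather than written by hand.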

Areas of Operational Excellence

Prepare

Operational Priorities

Understand the entire workload and its business goals
Understand regulatory and compliance requirements
Use AWS Trusted Advisor, AWS Cloud Compliance, and the Well-Architected Framework as sources of best practices

Design for Operations

Design how the workload will be deployed, updated, and operated
Implement engineering practices for troubleshooting and defect fixing
Observe the system using logging/instrumentation and insightful business and technical metrics
Design your workload as code at all layers of the stack (Application/Infrastructure/Policies/Governance and Operations)
Version control the infrastructure
Implement CI/CD at all levels
Leverage metadata to identify resources for operational activities, e.g. use tags to identify the environment, owner, etc.
Publish and capture metrics
Leverage key AWS services like CloudWatch, the AWS Developer Tools, and X-Ray
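The tagging point above can be sketched without any cloud API: given an inventory of resources with tags, select the ones an operational task should touch. The inventory shape, the resource IDs, and the helper name `find_by_tags` are hypothetical and chosen only for illustration.

```python
def find_by_tags(resources, **required_tags):
    """Return resources whose tags contain every required key/value pair."""
    return [
        resource
        for resource in resources
        if all(
            resource.get("tags", {}).get(key) == value
            for key, value in required_tags.items()
        )
    ]


# Hypothetical inventory; real data would come from the provider's API.
inventory = [
    {"id": "i-0a1", "tags": {"environment": "prod", "owner": "payments"}},
    {"id": "i-0b2", "tags": {"environment": "dev", "owner": "payments"}},
    {"id": "i-0c3", "tags": {"environment": "prod", "owner": "search"}},
    {"id": "i-0d4"},  # untagged resource
]

prod_payments = find_by_tags(inventory, environment="prod", owner="payments")
print([resource["id"] for resource in prod_payments])
```

The same filter can drive patching, cost reports, or alert routing, which is why consistent tag keys matter more than the specific tool used to query them.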

Operational Readiness

Use consistent processes for deploying workloads
Use runbooks to automate routine activities like deployments, and playbooks for issue resolution
Staff a right-sized operations team (Site Reliability Engineers)
Use comparable parallel environments to test failures and performance
Leverage services like EC2 Systems Manager to run scripts on EC2 instances, and use AWS Lambda to respond to events
Anticipate failures and test for failures
Leverage AWS Config to track changes to vital configurations, e.g. CFTs (CloudFormation templates)
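One way to picture a runbook as code is a list of steps that run in order, where a failure triggers the undo actions of the steps already completed. This is a minimal sketch under that assumption; the `Step` type and `execute_runbook` function are invented for this example and do not correspond to an AWS service API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    undo: Callable[[], None]


def execute_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        try:
            step.run()
        except Exception as exc:
            log.append(f"FAILED {step.name}: {exc}")
            for done in reversed(completed):
                done.undo()
                log.append(f"UNDONE {done.name}")
            return log
        completed.append(step)
        log.append(f"OK {step.name}")
    return log
```

Because every step carries its own undo action, the deployment stays small and reversible, and the returned log doubles as a record for later review.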

Operate

Understanding the operational health of your workloads is key. Use business and technical metrics to observe events and respond to them.

Understanding Operational Health

Use metrics based on operational outcomes, e.g. successful logins per second during peak vs. non-peak periods, and identify deviations
Implement dashboards and technical viewpoints to help the operations team make informed decisions
Leverage logging services like CloudWatch Logs and dashboards like CloudWatch Dashboards
Analyze logs, e.g. ingest logs into Amazon Elasticsearch Service and use Kibana dashboards
Leverage the AWS Service Health Dashboard (SHD) and Personal Health Dashboard (PHD) to monitor higher-level events that might affect the system
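To illustrate what "identify deviations" might look like for a metric such as successful logins per second, the sketch below flags observations that sit far from a baseline. The three-standard-deviation threshold and the function name `find_deviations` are assumptions for this example; a real setup would usually keep separate baselines for peak and non-peak periods.

```python
from statistics import mean, stdev


def find_deviations(baseline, observed, threshold=3.0):
    """Return (index, value) pairs in `observed` that lie more than
    `threshold` standard deviations from the baseline mean.

    `baseline` needs at least two samples, as required by statistics.stdev.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as a deviation.
        return [(i, v) for i, v in enumerate(observed) if v != mu]
    return [
        (i, v) for i, v in enumerate(observed) if abs(v - mu) / sigma > threshold
    ]


# Hypothetical logins-per-second samples for a non-peak period.
non_peak_baseline = [100, 102, 98, 101, 99]
latest = [101, 60, 99]
print(find_deviations(non_peak_baseline, latest))
```

A flagged value is only a prompt to look at the dashboard, not a diagnosis; the threshold should be tuned so the operations team is not flooded with alerts.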

Responding to Events

Anticipate planned events like sales promotions, peak sale days, and paydays, as well as unplanned events like component failures and cloud provider outages
Use runbooks and playbooks to respond to alerts consistently
Assign alerts to an accountable operations team
Conduct Root Cause Analysis (RCA) to refine runbooks and playbooks
Speed up recovery by replacing failed components with the last known good version, and analyze the failed resources separately
Respond to events using the available services, e.g. use CloudWatch Events to invoke Lambda functions or ECS tasks, or use service APIs to connect to third-party services like Splunk or Sumo Logic
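The event-driven response described above can be sketched as a Lambda-style handler that routes an incoming event to an action. The event fields used here (`source`, `detail-type`, `detail`) follow the general shape of CloudWatch Events for EC2 state changes, but treat the exact values as assumptions to verify against the AWS documentation; `replace_instance` is a placeholder that only returns a string.

```python
def replace_instance(detail):
    """Placeholder remediation; a real function would call a service API."""
    return f"replace:{detail['instance-id']}"


# Routing table keyed by (source, detail-type), then by instance state.
ROUTES = {
    ("aws.ec2", "EC2 Instance State-change Notification"): {
        "terminated": replace_instance,
        "stopped": replace_instance,
    },
}


def handler(event, context=None):
    """Lambda-style entry point: look up an action for the event, else ignore it."""
    key = (event.get("source"), event.get("detail-type"))
    detail = event.get("detail", {})
    action = ROUTES.get(key, {}).get(detail.get("state"))
    return action(detail) if action else "ignored"


# Hypothetical sample event for local testing.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0a1", "state": "terminated"},
}
print(handler(sample_event))
```

Keeping the routing in a table means a new response is a one-line change that can be reviewed and version controlled like any other code.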

Evolve

Improving over time is key to success. Implement small, incremental changes and evolve from the lessons learned.

Learning from Experience

When things fail, learn from the failures
Analyze failures and plan improvements
Review lessons learned widely across teams and validate them
Perform cross-team reviews with business, operations, and developer teams to validate insights and identify areas for improvement
Leverage key AWS services like CloudWatch Logs, Athena, S3, and QuickSight to collect and analyze data
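Before any data reaches a dedicated analytics service, the analysis step can start as a simple count of what keeps going wrong. The sketch below assumes each incident record is a dictionary with a `cause` field; that record shape and the function name `recurring_causes` are hypothetical.

```python
from collections import Counter


def recurring_causes(incidents, min_count=2):
    """Return (cause, count) pairs seen at least `min_count` times,
    most frequent first."""
    counts = Counter(incident["cause"] for incident in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


# Hypothetical incident log collected from past retrospectives.
incident_log = [
    {"id": 1, "cause": "expired certificate"},
    {"id": 2, "cause": "disk full"},
    {"id": 3, "cause": "expired certificate"},
    {"id": 4, "cause": "bad deploy"},
    {"id": 5, "cause": "disk full"},
    {"id": 6, "cause": "expired certificate"},
]
print(recurring_causes(incident_log))
```

The recurring causes are the natural candidates for a new runbook, an automated check, or an item on the agenda of the next cross-team review.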

Share Learnings

Share learnings across the organization to multiply their benefit
Socialize frequently occurring issues and improvement opportunities so that teams spend less time on repeat problems and more on delivering features
Share CloudFormation templates, AMIs, and reusable Lambda functions for key operational actions
Keep all layers of the stack in version-controlled code for tracking and sharing

Closing thoughts

Operational Excellence is an ongoing effort
Treat every failure and operational event as an opportunity to improve
Focus on incremental improvements and learn from failures and retrospectives

References

For more in-depth material, see the AWS white papers:

AWS Well-Architected Framework

Operational Excellence White Paper

Written by Sathiya Shunmugasundaram