How To Design For Operational Excellence in Software Applications

Kanika Modi · Published in CodeX · Jul 31, 2021

Do you envision a software application that is clean, tested, efficient, well monitored, requires negligible operational effort, and provides the best customer experience?

Well, GREAT NEWS!

I have compiled some operational best practices and lessons learned over time that can be leveraged to make sure your software runs properly, your customers are taken care of, and your team resolves defects in a timely manner:

Resiliency Best Practices

Resiliency is a system’s ability to recover from faults and maintain continuity of service so that the system keeps working despite failures.

  1. Recognize the most critical API and reserve capacity for it so the service stays afloat even if a portion of it is failing
  2. Prevent retry storms by aggregating retries at the process, host, or service level (see the retry-budget sketch after this list)
  3. Set timeouts on dependency calls such that, taken together, the calls stay below the service’s overall latency SLA
  4. Have lever-based mechanisms that allow you to turn off functionality and limit load to dependencies during outages, so failures don’t spread
  5. Routinely test for/with failure: Chaos, Stress, Soak Tests
  • Chaos Test: Testing the system by continuously but randomly injecting failures from dependencies
  • Stress Test: Testing the system beyond its normal operating points and evaluating its limits under extreme conditions
  • Soak Test: Testing the system under a high volume of load for a long duration to find performance issues and resource leaks (memory, thread pools, etc.)
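
To make the retry-storm guidance concrete, here is a minimal Python sketch of a process-level retry budget combined with jittered exponential backoff. The class and function names, the 10% budget ratio, and the attempt count are illustrative assumptions, not a prescribed implementation:

```python
import random
import time

# Hypothetical process-wide retry budget: retries are allowed only while they
# stay below a fixed fraction of total requests, which caps retry
# amplification when a dependency is having an outage.
class RetryBudget:
    def __init__(self, ratio=0.1):
        self.ratio = ratio      # at most ~10% of requests may trigger retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def record_retry(self):
        self.retries += 1

    def can_retry(self):
        return self.retries < self.ratio * max(self.requests, 1)


def call_with_retry(dependency, budget, attempts=3, base_delay=0.1):
    """Call a dependency with capped, jittered retries drawn from a shared budget."""
    budget.record_request()
    for attempt in range(attempts):
        try:
            return dependency()
        except Exception:
            if attempt == attempts - 1 or not budget.can_retry():
                raise               # budget exhausted or out of attempts: fail fast
            budget.record_retry()
            # Exponential backoff with full jitter so retries don't synchronize.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Because the budget is shared across the whole process, a failing dependency sees at most a bounded amount of extra traffic instead of an amplifying storm, and the jitter keeps the remaining retries from arriving in synchronized waves.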

Reliability Best Practices

Reliability is a system’s capability to perform failure-free operation in a given environment, for a specified period, and for a predefined number of input cases, assuming that the hardware and the input are error-free.

  1. Create a roadmap for fixing known OE and design problems
  2. Design services to scale to a minimum of 2X
  3. Architect for redundancy to avoid single points of failure
  4. Throttle abnormally high-volume and high-latency callers (see the token-bucket sketch after this list)
  5. Set up metrics on service usage trends
  6. Periodically perform logging hygiene, as bad logging can fill up your disks and cause a system to do even more work when it is already degrading
  7. Manage deployment risk through a change-management mechanism that ensures not all regions are deployed at the same time
  8. Automate deployments, verifications, and rollbacks to achieve consistency
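
As an illustration of throttling abnormal callers, below is a minimal per-caller token-bucket sketch in Python; the rate, burst size, and the “429” convention for rejected requests are assumptions chosen for the example:

```python
import time
from collections import defaultdict

# Simple token bucket: each caller accumulates tokens at a fixed rate up to a
# burst limit, and each request spends one token.
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate                    # tokens added per second
        self.burst = burst                  # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per caller; the limits here are illustrative.
buckets = defaultdict(lambda: TokenBucket(rate=50, burst=100))

def handle_request(caller_id, handler):
    """Reject requests from callers that exceed their per-caller rate limit."""
    if not buckets[caller_id].allow():
        raise RuntimeError("429: caller throttled")   # surface back-pressure to the caller
    return handler()
```

Rejecting excess traffic at the edge keeps one abusive or misbehaving caller from consuming the capacity that every other caller depends on.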

Scaling Best Practices

Scalability is a system’s ability to increase capacity and functionality based on its users’ demand.

  1. Bias towards a microservice design strategy where each service performs one business function, so a service is easy to cut or replace and failures are easier to contain
  2. Consider using serverless or container-based infrastructure for small services to reduce the effort of planning their scaling
  3. Configure early alerts for each service so abnormal threshold breaches are caught early, buying time for a fix
  4. Automate operational scaling with scaling planners like CloudTune for EC2 in AWS
  5. Have automated alarms and dashboards for resource utilization and performance metrics such as availability, latency, error, and fault SLAs (see the CloudWatch sketch after this list)
  6. Automate the running of load tests with every deployment release
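
As a sketch of codifying such alarms, the snippet below uses boto3 (the AWS SDK for Python) to create a CloudWatch alarm on p99 latency. The namespace, metric name, dimensions, threshold, and SNS topic ARN are placeholders to be replaced with your service’s real metrics:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on tail latency (p99), not the average, and set the threshold below
# the latency SLA so the alarm fires early enough to buy time for a fix.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-p99-latency-high",             # illustrative name
    Namespace="MyService",                                # hypothetical custom metric namespace
    MetricName="Latency",
    Dimensions=[{"Name": "Operation", "Value": "GetOrder"}],
    ExtendedStatistic="p99",
    Period=60,                                            # evaluate one-minute windows
    EvaluationPeriods=5,                                  # must breach for 5 consecutive minutes
    Threshold=500.0,                                      # milliseconds, deliberately below the SLA
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder SNS topic
    AlarmDescription="Early warning: p99 latency approaching the SLA for GetOrder",
)
```

Keeping alarm definitions in code means they are versioned, reviewed, and recreated consistently across regions instead of being hand-configured in a console.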

Remember, one size fits none!

Not all systems are built with the same customer expectations or availability standards. Operational Excellence is a continuous improvement process to build high availability systems for customers and to reduce developer effort in managing the service.

Thank you for reading! If you found this helpful, here are some next steps you can take:

  1. Send some claps my way!
  2. Follow me on Medium for the next part of the Operational Excellence in Design series!
  3. Connect with me on LinkedIn & Twitter for more tech blogs!
