How To Design For Operational Excellence in Software Applications

Kanika Modi · Published in CodeX · Jul 31, 2021

Do you envision a software application that is clean, tested, efficient, well monitored, requires negligible operational effort, and provides the best customer experience?

Well, GREAT NEWS!

I have compiled some operational best practices and lessons learned over time that can be leveraged to make sure your software runs properly, your customers are taken care of, and your team resolves defects in a timely manner:

Resiliency Best Practices

Resiliency is a system’s ability to recover from faults and maintain continuity of service so that the system keeps working despite failures.

  1. Recognize the most critical API and reserve capacity for it so the service stays afloat even if a portion of it is failing
  2. Prevent retry storms by aggregating retries at the process, host, or service level (see the retry-budget sketch after this list)
  3. Set timeouts on dependency calls such that, taken together, the calls stay below the service’s overall latency SLA
  4. Have lever-based mechanisms that allow you to turn off functionality and limit load to dependencies during outages, so failures don’t spread
  5. Routinely test for/with failure: Chaos, Stress, Soak Tests
  • Chaos Test: Testing the system by continuously but randomly injecting failures from dependencies
  • Stress Test: Testing the system beyond its normal operating points and evaluating its limits under extreme conditions
  • Soak Test: Testing the system under a high volume of load for a long duration to find performance issues and resource leaks (memory, thread pools, etc.)
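
To make the retry-storm guidance concrete, here is a minimal Python sketch of a process-level retry budget combined with jittered exponential backoff. The class and function names, the 10% budget ratio, and the attempt count are illustrative assumptions, not a prescribed implementation:

```python
import random
import time

# Hypothetical process-wide retry budget: retries are allowed only while they
# stay below a fixed fraction of total requests, which caps retry
# amplification when a dependency is having an outage.
class RetryBudget:
    def __init__(self, ratio=0.1):
        self.ratio = ratio      # at most ~10% of requests may trigger retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def record_retry(self):
        self.retries += 1

    def can_retry(self):
        return self.retries < self.ratio * max(self.requests, 1)


def call_with_retry(dependency, budget, attempts=3, base_delay=0.1):
    """Call a dependency with capped, jittered retries drawn from a shared budget."""
    budget.record_request()
    for attempt in range(attempts):
        try:
            return dependency()
        except Exception:
            if attempt == attempts - 1 or not budget.can_retry():
                raise               # budget exhausted or out of attempts: fail fast
            budget.record_retry()
            # Exponential backoff with full jitter so retries don't synchronize.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Because the budget is shared across the whole process, a failing dependency sees at most a bounded amount of extra traffic instead of an amplifying storm, and the jitter keeps the remaining retries from arriving in synchronized waves.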

Reliability Best Practices

Reliability is a system’s capability to perform failure-free operation in a given environment, for a specified period, and for a predefined number of input cases, assuming that the hardware and the input are error-free.

  1. Create a roadmap for fixing known OE and design problems
  2. Design services to scale to a minimum of 2X
  3. Architect for redundancy to avoid single points of failure
  4. Throttle abnormally high-volume and high-latency callers (see the token-bucket sketch after this list)
  5. Set up metrics on service usage trends
  6. Periodically perform logging hygiene, as bad logging can fill up your disks and cause a system to do even more work when it is already degrading
  7. Manage deployment risk through a change-management mechanism that ensures not all regions are deployed at the same time
  8. Automate deployments, verifications, and rollbacks to achieve consistency
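
As an illustration of throttling abnormal callers, below is a minimal per-caller token-bucket sketch in Python; the rate, burst size, and the “429” convention for rejected requests are assumptions chosen for the example:

```python
import time
from collections import defaultdict

# Simple token bucket: each caller accumulates tokens at a fixed rate up to a
# burst limit, and each request spends one token.
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate                    # tokens added per second
        self.burst = burst                  # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per caller; the limits here are illustrative.
buckets = defaultdict(lambda: TokenBucket(rate=50, burst=100))

def handle_request(caller_id, handler):
    """Reject requests from callers that exceed their per-caller rate limit."""
    if not buckets[caller_id].allow():
        raise RuntimeError("429: caller throttled")   # surface back-pressure to the caller
    return handler()
```

Rejecting excess traffic at the edge keeps one abusive or misbehaving caller from consuming the capacity that every other caller depends on.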

Scaling Best Practices

Scalability is a system’s ability to increase capacity and functionality based on its users’ demand.

  1. Bias towards a microservice design strategy where each service performs one business function, so a service is easy to cut or replace and failures are easier to contain
  2. Consider using serverless or container-based infrastructure for small services to reduce the effort of planning their scaling
  3. Configure early alerts for each service so abnormal threshold breaches are caught early, buying time for a fix
  4. Automate operational scaling with scaling planners like CloudTune for EC2 in AWS
  5. Have automated alarms and dashboards for resource utilization and performance metrics such as availability, latency, error, and fault SLAs (see the CloudWatch sketch after this list)
  6. Automate the running of load tests with every deployment release
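
As a sketch of codifying such alarms, the snippet below uses boto3 (the AWS SDK for Python) to create a CloudWatch alarm on p99 latency. The namespace, metric name, dimensions, threshold, and SNS topic ARN are placeholders to be replaced with your service’s real metrics:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on tail latency (p99), not the average, and set the threshold below
# the latency SLA so the alarm fires early enough to buy time for a fix.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-p99-latency-high",             # illustrative name
    Namespace="MyService",                                # hypothetical custom metric namespace
    MetricName="Latency",
    Dimensions=[{"Name": "Operation", "Value": "GetOrder"}],
    ExtendedStatistic="p99",
    Period=60,                                            # evaluate one-minute windows
    EvaluationPeriods=5,                                  # must breach for 5 consecutive minutes
    Threshold=500.0,                                      # milliseconds, deliberately below the SLA
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder SNS topic
    AlarmDescription="Early warning: p99 latency approaching the SLA for GetOrder",
)
```

Keeping alarm definitions in code means they are versioned, reviewed, and recreated consistently across regions instead of being hand-configured in a console.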

Remember, one size fits none!

Not all systems are built with the same customer expectations or availability standards. Operational Excellence is a continuous improvement process to build high availability systems for customers and to reduce developer effort in managing the service.

Thank you for reading! If you found this helpful, here are some next steps you can take:

  1. Send some claps my way!
  2. Follow me on Medium for the next part of the Operational Excellence in Design series!
  3. Connect with me on LinkedIn & Twitter for more tech blogs!
