Check out these 5 Design Principles of Reliability!!

Shannon Paige

Published in Plaintext Computer Science, Sep 17, 2020

You won’t believe number 3!!!!

Automatically recover from failure: Pick some business value metrics to measure (key performance indicators, or KPIs), and set up alerts/notifications for when one of these KPIs goes above or below its expected range. Once you have that in place, you can add a recovery process that is triggered by the alert. If you can make the alerts sensitive enough, you could even get to the point where you can predict and fix failures before they happen.
Test recovery procedures: Much like the famed Netflix “Chaos Monkey”, you can deliberately set up “failures” in your system and test how well you respond to and recover from them. This type of chaos testing can expose holes in your recovery process before real errors happen.
Scale horizontally to increase aggregate workload availability: Would you rather fight one horse-sized duck, or 100 duck-sized horses? Instead of having one big resource that could fail, split up that resource into lots of little ones. That way if one fails, it won’t bring the whole system down. Just make sure they don’t all rely on the same things, or they could still all come down at once.
Stop guessing capacity: If you don’t know the capacity of your systems, it’s easy to either a) go over the limit and cause a failure or b) stay well under the limit and lose money paying for resources you don’t need. In the cloud you can automate the scaling of resources up or down, and you can use load testing to get a sense of your workload at base, peak and stress levels.
Manage change in automation: Make changes to your infrastructure through automation rather than by hand. With infrastructure as code, every change is recorded, so you can monitor and review it.
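To make the first principle a bit more concrete, here is a minimal Python sketch of a KPI alert that triggers a recovery step when a reading falls outside its expected range. Everything in it is made up for illustration: the `KpiAlert` class, the `orders_per_minute` metric, the thresholds, and the `restart_checkout_workers` function are all assumptions, not part of any real monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KpiAlert:
    """Hypothetical alert: watches one KPI and runs a recovery step."""
    name: str
    low: float                      # lowest value considered healthy
    high: float                     # highest value considered healthy
    recover: Callable[[], None]     # recovery process triggered by the alert

    def check(self, value: float) -> bool:
        """Return True and trigger recovery if value is outside [low, high]."""
        if value < self.low or value > self.high:
            self.recover()
            return True
        return False


# Record of recovery actions, standing in for a real restart or failover.
recovered: List[str] = []


def restart_checkout_workers() -> None:
    """Placeholder recovery step (an assumption for this sketch)."""
    recovered.append("restart")


alert = KpiAlert(
    name="orders_per_minute",
    low=50,
    high=5000,
    recover=restart_checkout_workers,
)

# Three readings: healthy, too low (fires the alert), healthy again.
results = [alert.check(v) for v in (120, 48, 300)]
```

In a real system the readings would come from your monitoring service and the recovery step would do something like restart a service or shift traffic, but the shape is the same: measure, compare against the expected range, then react automatically.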
