Book Study: The Practice of Cloud System Administration — Part 1
I’m reading this book in the hope that understanding the art of being an amazing Cloud SysAdmin will help me become an excellent Cloud Security Architect, Incident Responder, and teacher on these topics.
The introduction chapter lists several ideas and ideals that might not be new, but they are certainly helpful. I’m going to define them here, hoping that it will be helpful for the future as I continue through the book. All of these ideas will be discussed in-depth as the book continues.
- Treat failures as an expectation, something that will happen for sure. Prepare accordingly. If we all did this, all of the time…
- Everything requires an API (programmable interface to allow for automation). If there’s no API, it’s no good.
- Large systems are made up of smaller services. Everything is a service. There are many tiny pieces that work together, so that if one fails the entire thing does not fail. This is called a service-oriented architecture (SOA). Each service can be individually enabled or disabled.
- Infrastructure must be created automatically, not manually. This means it needs to be saved as code, so that it can be created by machines, in seconds, instead of minutes or even weeks. Also known as “infrastructure as code”, we can (and should!) automate all of it.
- Don’t do mega-releases (many changes in a single release); that creates serious risk. Instead, improve and automate your release process (using a pipeline/packaging system), and release more often, for reduced risk.
- Practice the release automation until it’s perfect. Make it a habit to spend time improving your release process regularly.
- Ideally, you perform security testing as well as every other type of testing, using automation, as part of your release cycle.
- Only if it passes all the tests is code released.
- If code is released and it has serious issues, this means you need to fix your release pipeline by adding more tests.
- Once you have this rolling, you should spend more time improving the release process rather than doing the work of the release process.
- All software must be created so that it can be monitored and logged. Not just for security (my obvious bias) but for all forms of performance.
- Measure everything. Use this information to find problems when they are small, before they cause outages or incidents. Even measure your countermeasures, to understand when it’s time to automate.
- People who are on call are alerted via automation. This can be the Ops or the Incident Response (IR) team.
- Being on call should not be hell. It should be planned in advance, with a realistic number of alerts, with a backup person, and help in case the person on call can’t handle what is thrown at them.
- Playbooks (automated if possible) should exist for everything. If an alert or incident happens, your team should know exactly what to do, and what is expected of them. Hmmmm, who has said that before?
- Test your countermeasures by causing failures. That’s right, cause problems, on purpose! Paging the Red Team. This also means we should do security incident simulations.
- Implement auto-scaling. Up and down. Why pay for what you are not using?
- Release new features for users a few at a time, to allow for A/B testing. This means you can figure out if users like or dislike the new feature before the big announcement that it’s been released.
- Always have good hygiene. Regularly update documentation, tune your alerting, and review postmortem findings to create improvements.
- Dev and Ops are not two teams but one team that performs a variety of functions. *Everyone* participates in on-call duties.
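To make the "only if it passes all the tests is code released" idea from the list concrete for myself, here is a minimal Python sketch of a release gate. It is not from the book. The names `Check`, `failed_checks`, and `release` are hypothetical ones I made up for illustration, and a real pipeline would run actual test suites and security scanners instead of the stand-in functions used here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One automated test stage (hypothetical), e.g. unit tests or a security scan."""
    name: str
    run: Callable[[], bool]  # returns True when the stage passes


def failed_checks(checks: List[Check]) -> List[str]:
    """Run every check and return the names of the ones that did not pass."""
    failed = []
    for check in checks:
        try:
            passed = check.run()
        except Exception:
            # A check that crashes is treated as a failure, never as a pass.
            passed = False
        if not passed:
            failed.append(check.name)
    return failed


def release(checks: List[Check], deploy: Callable[[], None]) -> bool:
    """Call deploy() only if every check passes; report whether we released."""
    failures = failed_checks(checks)
    if failures:
        print("Release blocked by: " + ", ".join(failures))
        return False
    deploy()
    return True
```

In use, you would hand `release` a list of checks (unit tests, integration tests, a security scan) plus a deploy function. If a bad release still slips through, the fix in this model is to add another `Check` to the list, which matches the "fix your pipeline by adding more tests" point above.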
So far I am liking this book.
Up next! Design: Building It