What comes after DevOps?
If you ask 10 DevOps professionals “What is DevOps?”, you are likely to get 20 different answers. After many debates I have come to the conclusion that I am no longer practicing what is commonly referred to as DevOps, but what I am starting to call CodeOps.
What is CodeOps? CodeOps is the set of practices and tools you use when your operational workflows are all built around source control and automation. When you are practicing CodeOps, the following hold:
- You aren’t running scripts that do any work. The automation platform is running scripts for you. In my case, that means Jenkins is THE operational orchestrator, and it manages the other orchestration tools.
- There is no logging into machines to perform changes or maintenance of any kind. This includes provisioning, bootstrapping, OS updates, application configuration, backups, restores, etc. All changes to system state and configuration are performed by automated processes triggered by a Git commit. The only logging in is during development and troubleshooting. In my case, this means that my configuration management system (CMS), Ansible, is not triggered by me; it’s triggered by Jenkins.
- There is minimal usage of third-party UIs. You don’t log into the AWS portal on a daily basis; it’s there for information gathering only. And you never create any resources there. That all has to be done through code (API / CloudFormation / Terraform). As an example, I’ve just finished writing a Git-to-PagerDuty module for configuring PagerDuty services. The only “change” ever made in the PagerDuty UI was to generate an API key so the code could configure the rest. Unfortunately, not all services you need to work with have full-featured APIs, hence my comment about minimal usage instead of no usage.
- Interruptions to workflows (paging) must be minimized. There are plenty of articles explaining the high cost of an interrupt to development, so I won’t go into details here; let’s just say interrupts are bad. To accomplish this, you need to change your architecture to be fault tolerant so that not every failure requires a page.
- Alert responses or “playbooks” are code, not wiki documents. When alert X fires, playbook X is run.
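The last bullet can be sketched as a small dispatch table that lives in source control. This is only a minimal illustration in Python, not my actual tooling: the alert name `disk_full`, the `playbook` decorator, and the `handle_alert` function are all hypothetical names made up for the example.

```python
from typing import Callable, Dict

# Registry mapping an alert name to the playbook function that handles it.
# Every name in this sketch is hypothetical, for illustration only.
PLAYBOOKS: Dict[str, Callable[[str], bool]] = {}


def playbook(alert_name: str):
    """Register the decorated function as the playbook for one alert."""
    def register(func: Callable[[str], bool]) -> Callable[[str], bool]:
        PLAYBOOKS[alert_name] = func
        return func
    return register


@playbook("disk_full")
def clear_temp_files(host: str) -> bool:
    # A real playbook would trigger an automation job here instead of printing.
    print(f"cleaning temp files on {host}")
    return True


def handle_alert(alert_name: str, host: str) -> bool:
    """Run the playbook registered for the alert; True means it was resolved."""
    handler = PLAYBOOKS.get(alert_name)
    if handler is None:
        # No playbook in source control means no automated response.
        return False
    return handler(host)
```

Because the registry is plain code, adding or changing an alert response is a Git commit that can be reviewed and tested like any other change.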
Implementing CodeOps
And now for the hard part that makes it all work - Fault tolerance:
I teach at the Sith Lord school of system administration, and if an automated playbook can’t resolve an alert without human help, then I terminate the server. This has serious architecture ramifications, see my post https://medium.com/@anthony.b.hobbs/the-sith-lord-school-of-system-administration-on-rcas-7cafd150ee9 .
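As a rough sketch of that policy, assuming the playbook and the termination step are both exposed as plain functions: the names `remediate_or_terminate`, `run_playbook`, and `terminate` are hypothetical, and a real `terminate` would call whatever your cloud provider or orchestrator offers.

```python
from typing import Callable


def remediate_or_terminate(
    host: str,
    run_playbook: Callable[[str], bool],
    terminate: Callable[[str], None],
) -> str:
    """Try the automated playbook; if it fails or raises, terminate the server."""
    try:
        resolved = run_playbook(host)
    except Exception:
        # A crashing playbook counts as "could not resolve without human help".
        resolved = False
    if resolved:
        return "resolved"
    terminate(host)
    return "terminated"


# Example with stub functions standing in for real automation.
terminated_hosts = []
outcome = remediate_or_terminate(
    "app-07",
    run_playbook=lambda host: False,       # the playbook could not fix it
    terminate=terminated_hosts.append,     # record the host instead of destroying it
)
print(outcome, terminated_hosts)
```

Note that nobody is paged on either path; the architecture has to make the "terminated" outcome safe.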
NO PETS ALLOWED! Your data stores do not get a free pass on the above requirements.
I allow for chaos engineering to destroy databases in production on a whim. If a database fails at 3am, traffic is automatically shifted to the backup, the failed DB is destroyed, and a new backup is provisioned. This is a huge architecture requirement that cannot be stressed enough. Your databases must be able to be replaced routinely via automation without causing impact to your service. Your business may have a high RTO bar, but automation needs to work with replacing servers on a daily basis. This means two very important things:
- If your data must be accurate (like payments), you have to enforce data consistency at write time and not rely on eventual consistency configurations. This means your data writes will be slower as each write has to wait for quorum confirmation.
- You cannot have huge data stores; they must be sharded so any one of them can be replaced in a timely fashion. My rule of thumb is that if you cannot copy the data between two disks in 30 minutes, your database is too big to fail.
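The 30-minute rule of thumb is simple arithmetic: copy time is data size divided by sustained throughput. Here is a minimal sketch, where the function names and the 200 MB/s throughput figure are assumptions for illustration; measure your own disks and network before picking a shard size.

```python
def copy_time_minutes(data_size_gb: float, throughput_mb_per_s: float) -> float:
    """Estimated minutes to copy the data at a sustained throughput (1 GB = 1000 MB)."""
    return data_size_gb * 1000 / throughput_mb_per_s / 60


def too_big_to_fail(
    data_size_gb: float,
    throughput_mb_per_s: float,
    limit_minutes: float = 30.0,
) -> bool:
    """True if the copy would take longer than the limit, so the shard should be split."""
    return copy_time_minutes(data_size_gb, throughput_mb_per_s) > limit_minutes


# Assumed throughput of 200 MB/s, for illustration only.
print(copy_time_minutes(300, 200))  # 25.0 minutes, within the limit
print(too_big_to_fail(300, 200))    # False
print(too_big_to_fail(500, 200))    # True, roughly 41.7 minutes
```

At that assumed throughput, the limit works out to about 360 GB per shard (200 MB/s × 1,800 s).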
Setting up CodeOps is a big upfront investment in time, and places significant requirements on the way you operate your infrastructure. Most organizations historically had to rush to get application and database servers up so the product could be developed, and so they were developing/designing with short-term speed as the most important factor in their platform.
It is very hard to retrofit automation-centric workflows into existing operational workflows, as the fundamental decision criteria for every tool and process stem from the original workflow. It’s hard to budget the time to automate, after the fact, everything you have been doing manually, which is why it is often better to greenfield the new workflows on a new platform.
When you are practicing CodeOps, automation is everything. Automation is not free: it costs you upfront development, redundant resources, and workflows that aren’t always optimized for speed. However, because of the automation, your engineers can focus on new features instead of keeping the lights on, they can make large changes with more confidence, your customers have a better UX due to better stability, and your disaster recovery plan is not vaporware. The long-term agility benefits are significant, and you have to weigh them against the financial cost of redundant servers. Remember that time is money.
If you found any of this interesting, please like the article. You have the opportunity to make my day 😍