Artisanal Ops — Recklessly Releasing

The devops paradigm suggests you have highly efficient pipelines to deploy deltas to your systems, with everything ideally tested in integration at every step. Then, suddenly, an artisanal ops team is handed a 5-page paper document four weeks in advance and asked to give up a weekend, with every move scrutinised as they prepare to install a single code change on two servers by hand.

Traditional risk models may suggest that by having these “Releases” you are mitigating risk as much as possible: reducing the number of events which could, in theory, cause an outage, while adding a second set of eyes to the solution. However, in an agile world of continually delivering features, fixes and performance enhancements, this becomes a reckless endeavour.

The importance and usefulness of agile sprints are greatly reduced as team members see their work piling up in source control and in various non-production environments, which over time come to look nothing like the production platforms they are writing code for. In such scenarios, when releases are finally performed, everyone holds their breath, hoping that all the pending changes to code and data work together at the same time.

While there could be an argument for a rollback, data migrations and model changes sometimes pose a challenge here. A rollback of code artefacts may simply be a versioned Docker container or a checkout of a certain source control tag; such tasks can be done with little to no downtime. Many companies and setups, however, are just not equipped to roll back schema changes on the large, high-transaction databases that back their code. In such cases you could argue that a successful rollback is really “rolling forwards”, as you are making continuous edits to a component of your system.
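
As a rough illustration of “rolling forwards”, here is a minimal sketch using Python's built-in sqlite3 module. The table, the columns and the MIGRATIONS list are all hypothetical, and a large production database would need far more care; the point is only that a bad schema change is corrected by appending a new migration rather than by undoing an old one.

```python
import sqlite3

# Hypothetical, ordered, forward-only migrations. Nothing is ever removed
# from this list; a mistake is fixed by appending a corrective step.
MIGRATIONS = [
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, total_pence INTEGER NOT NULL)",
    "ALTER TABLE orders ADD COLUMN currency TEXT",
    # Corrective step: fix forward instead of dropping the new column.
    "UPDATE orders SET currency = 'GBP' WHERE currency IS NULL",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations not yet applied and return the schema version."""
    # SQLite keeps a free-form integer in the database header for this purpose.
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(statement)
        # PRAGMA does not accept bound parameters; version is a trusted int.
        conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print("schema version:", migrate(connection))
```

Running migrate a second time applies nothing, since the stored version already matches the length of the list.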

Having the ability to release code regularly (and hopefully with automation), combined with a suite of automated post-deploy tests, gives you control over any failure that may be encountered. Changes are smaller and less coupled; the ability to resolve, repair and “roll back” if needed becomes a less complex task. Failure can be dealt with swiftly and with direction. Such small changes and code pushes are easy to automate as part of your build pipeline, and the whole idea of releases as an “event” starts to change.
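
To make that concrete, here is a minimal sketch of such a pipeline step in Python. The release function, the deploy callable and the checks are all hypothetical stand-ins for whatever your own tooling provides; this shows the shape of the idea, not a finished implementation.

```python
from typing import Callable, Iterable


def release(
    new_version: str,
    previous_version: str,
    deploy: Callable[[str], None],
    checks: Iterable[Callable[[], bool]],
) -> str:
    """Deploy new_version, run post-deploy checks, roll back on failure.

    Returns the version that is left running.
    """
    deploy(new_version)
    for check in checks:
        try:
            ok = check()
        except Exception:
            # A check that blows up is treated the same as a failed check.
            ok = False
        if not ok:
            # Small change, known previous version: rolling back is one call.
            deploy(previous_version)
            return previous_version
    return new_version


if __name__ == "__main__":
    deployed = []  # stands in for a real deployment target
    running = release("v2", "v1", deployed.append, [lambda: True, lambda: False])
    print(running, deployed)
```

Because the second check fails, the example run deploys v2 and then redeploys v1. In a real pipeline the checks would be your automated post-deploy tests, and deploy would wrap your container or tag checkout.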

On some occasions, though, a “release event” may be called for. Consideration for the people involved always seems to be overlooked in order to fit within business SLAs. All staff are called to work at 1am, leaving no respite or additional capacity if issues arise and the process is drawn out. Staff struggle to get out of bed and maintain a functional level of awareness, as the early-morning operational window is an unnatural time for human productivity. With little chance of reaching out to peers to collaborate on any issue that may arise (half the company being asleep), you could say “no-one can hear you scream at 2am”.