Building a simple application as a developer, be it a personal portfolio site or a web application for a small business, is one thing (most of the time, all things being equal, it is hassle-free). Building an enterprise application for thousands, if not millions, of daily users, with rolling feature updates and a constant need to stay available in production, is an entirely different ball game.
If you work for a company, be it one of the FAANG companies or any company that ships enterprise applications to its users, you know first-hand what I mean. You understand the tension that tends to arise between Developers, who want their new features shipped, and Operators, who are responsible for keeping the application from breaking in production.
THE GREAT DIVIDE
Over the years, Developers and Operators have had a lot of friction working together. Devs prioritize their feature updates and would like to see them live right away, while Operators do not want a bug-ridden code base to break what is currently available to users in production, and so want to take things much slower. Part of the cause is that many Operators know little about the developers' code base, while most Devs know very little about keeping their applications (and the surrounding configuration) live in production without crashing the whole system. So it is a case of the Developer looking to move faster, the Agile way, in pushing features into production, while the Operator wants to keep things slower in order to maintain availability.
Now where does DEVOPS come in?
DevOps is a set of practices or a culture; better still, let's call it a philosophy. When it is introduced into an organization, it helps Developers, Operators and other IT departments bring down their walls so that they can collaborate and get work done at a controlled pace and in a more efficient manner.
DEVOPS IN A PICTURE
TENETS OF DEVOPS
I want to look at the whole DevOps lifecycle through the lens offered by Seth Vargo and Liz Fong-Jones, two Google Cloud engineers.
Reduce Organizational Silos: This can be achieved by breaking down barriers across teams, which leads to more collaboration and higher overall throughput. One example is having both Operators and Developers work with the same OS. Working with the same tools and stacks helps limit the avoidable bottlenecks that come up when teams use entirely different operating systems.
Accept Failure as Normal: Not expecting perfection in the development life-cycle creates room to accept failure as normal, which reduces the unnecessary tension that builds when the application does break in production. This helps Operators and Developers collaborate on debugging the problem as fast as possible. Operations also gets to work out a formula for balancing accidents and failures against new releases, so that common errors from the past stop resurfacing.
Implement Gradual Change: Feature changes should be small and gradual. A small change is easier and faster to inspect when something goes wrong, so a buggy feature can be detected and rolled back quickly.
Leverage Tooling and Automation: Manual setup always leaves room for human errors, such as over-provisioning CPU, which eats further into the billing budget. Pushing code into production is best done the automated way. Encouraging automation minimizes manual systems work so that effort can go into things that bring long-term value to the system. That could mean managing containerized workloads with Kubernetes (K8s), going serverless, or, better still, adopting Continuous Integration and Continuous Delivery (CI/CD) with Jenkins. There are many tools to aid automation at different stages of the project life-cycle.
Measure Everything: Tracking and measuring the process is a critical gauge of success, because it shows whether the other four practices are actually working.
These five practices help build a more productive team of Developers and Operators.
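The formula for balancing failures against new releases, mentioned under the second tenet, is often framed as an error budget: the amount of failure a reliability target leaves room for. The sketch below is only an illustration under stated assumptions. The function names, the request-based measurement, and the 99.9% target are all hypothetical choices, not something prescribed by the talk referenced above.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative once overspent).

    slo_target is the fraction of requests that should succeed, e.g. 0.999.
    All names and the request-based measurement are illustrative assumptions.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so none of the budget has been spent
    # The budget is the number of failures the target tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_release(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Allow new releases only while some error budget is left."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0


if __name__ == "__main__":
    # Hypothetical window: one million requests, 250 failures, 99.9% target.
    print(can_release(0.999, 1_000_000, 250))
```

The point of a rule like this is that it ties the Developers' wish to ship and the Operators' wish for stability to the same measured number, instead of leaving it to negotiation.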
WHO IS AN SRE?
A Site Reliability Engineer is the person responsible for enforcing Site Reliability Engineering best practices.
What is Site Reliability Engineering, then? In the words of Liz Fong-Jones (ex-SRE at Google): “If DevOps is a Philosophy then Site Reliability Engineering can be seen as a prescriptive way of implementing or accomplishing DevOps.”
In a programming analogy, we can say SRE is a concrete class that implements DevOps.
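To make the analogy literal, here is a minimal sketch in Python, where an abstract base class plays the role of the interface. The method names are hypothetical, and only the three tenets that this article pairs with a concrete SRE practice are modeled, so it is an illustration rather than a full mapping.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The philosophy: it says what should happen, not how."""

    @abstractmethod
    def reduce_silos(self) -> str: ...

    @abstractmethod
    def accept_failure(self) -> str: ...

    @abstractmethod
    def leverage_automation(self) -> str: ...


class SRE(DevOps):
    """A concrete, prescriptive way of carrying out each tenet."""

    def reduce_silos(self) -> str:
        return "share ownership of production with developers"

    def accept_failure(self) -> str:
        return "establish postmortems after failures"

    def leverage_automation(self) -> str:
        return "eliminate manual work as much as possible"


if __name__ == "__main__":
    sre = SRE()
    for practice in (sre.reduce_silos(), sre.accept_failure(), sre.leverage_automation()):
        print(practice)
```

Just as `DevOps` cannot be instantiated on its own, the philosophy needs a concrete practice such as SRE before anyone can act on it.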
In brief, here is what SREs do:
- Own production together with Developers: SREs share ownership of the production environment with the developers on their team.
- Establish postmortems: every failure is analyzed so that the same failure has no cause to happen again, which builds reliability.
- Eliminate manual work as much as possible.
On a final note, DevOps and SRE are not two competing methods for software development and operations. They are close friends, both designed to break down organizational barriers and deliver better software faster. You can learn more here.