What is DevOps?
From the beginning
(Feel free to skip this section if you’re familiar)
The operational complexity of early computers was immense by today’s standards. Early system administrators were basically computational handymen, constantly swapping tapes, expanding storage, and fixing hardware. One of the attractions of computers has always been the ability to automate tasks, and early sysadmins were no different. Cron was added to early versions of Unix, and was one of the first examples of DevOps.
In the early 21st century, warehouse-scale computers became the standard way to solve large computing problems. This was driven largely by the super-linear growth of audiences: the internet was exploding, and the companies that rode that incredible growth are now household names: Google, Yahoo, eBay, etc. These complex computational monstrosities were enormously expensive, so these companies developed strategies for building systems whose costs, whether measured in people's time, machines, or dollars, scaled sublinearly with the products they produced. These strategies have evolved into the DevOps movement as we know it today.
Today developers can fully automate the provisioning of servers, automatically scale the number of servers with growing traffic, geographically distribute their applications (for resiliency and lower latency), and even handle disaster recovery and failover automatically. The days of waiting for supply-chain management to ship parts for new servers, then racking and stacking them, are gone.
Even though “DevOps” is often mentioned in the same breath as automation, there’s a lot more to the story.
What do we think DevOps is?
DevOps is a high-efficiency software development methodology. It lowers operational costs, speeds up release cycles, and enables engineers to produce more value for more people, faster. It breaks down the barriers between software development, QA, reliability, release engineering, and traditional system operations by helping software developers automate, track, and optimize many operational tasks. A NOC full of people who detect production issues, attempt fixes from a run book, and then escalate to the right person, who finally starts debugging, is replaced by a software team where human laziness and system reliability are first-order design principles. If a task can be automated, it should be; we can then remove people from the process by putting the onus for managing a production system on the people who produced it.
Being good at DevOps requires enough architectural and organizational understanding to build the right tools that allow software engineers to effectively deploy, scale, and operate their code affordably and reliably. As the diagram below shows, a lot goes into production operations — this makes sense, because software spends most of its time in production.
Involving developers in operations poses a lot of new problems for them, and developers face problems by trying to optimize their way out of them. These optimizations lead to more resource-effective organizations. Organizations that implement practices allowing developers to support operations, rather than maintaining a dedicated operations team, practice what we call PostOps.
Advantages of PostOps
PostOps reduces the need for a large SysOps team and enables your software engineers to write code, test it, deploy it to users in production, monitor it, and scale it with ease. Each of these categories, which used to be a completely separate org unit, is now replaced by a toolkit for your software developers. They'll be able to use that toolkit to achieve a pace of innovation you wouldn't see in a typical organization, given the complexities of process and the weight of traditional operational methodologies (ITIL, etc.).
Service Oriented Architectures
PostOps allows developers to stop reasoning about all the complexities of setting up an operating system, network stack, and hardware, and focus on building a system of online and offline services that communicate with each other. By thinking in terms of separate tasks and jobs, the datacenter suddenly becomes the unit of compute rather than a single machine. Modern cloud datacenters have huge numbers of machines and networks with full bisection bandwidth, which enables radically different application designs. It's no longer unusual for a single web request from a browser to touch hundreds of backend services. Adopting a workflow that makes it easy to develop and deploy pieces of an application makes the application easier to reason about, and that means better software, faster.
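The fan-out pattern described above, where one request touches many backend services, can be sketched roughly as follows. The service functions here are hypothetical in-process stand-ins for what would be network calls in a real deployment:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for RPC/HTTP calls to separate backend services.
def fetch_profile(user_id):
    return {"user": user_id, "name": "Ada"}

def fetch_recommendations(user_id):
    return ["widget-1", "widget-2"]

def fetch_notifications(user_id):
    return 3

def handle_request(user_id):
    # One incoming web request fans out to several services in parallel,
    # then aggregates the responses into a single page payload.
    with ThreadPoolExecutor() as pool:
        profile = pool.submit(fetch_profile, user_id)
        recs = pool.submit(fetch_recommendations, user_id)
        notes = pool.submit(fetch_notifications, user_id)
        return {
            "profile": profile.result(),
            "recommendations": recs.result(),
            "unread": notes.result(),
        }

page = handle_request(42)
```

Because each service is independent, teams can develop, deploy, and scale them separately, which is exactly what makes the datacenter, not the machine, the unit of compute.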
Users today have high expectations for availability and reliability. Large clusters running distributed applications and storage solutions are commonplace for achieving this now. Geographical distribution is also important for disaster recovery and latency reasons.
Engineers should be able to make changes to their software, test them, and ship them to production in an automated, safe manner. This greatly reduces time to delivery. With a few gates in place, whether unit tests, metrics, or the help desk phone, you can build systems that let engineers deploy their software to production with more confidence.
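A deploy gate like the ones just described can be as simple as a predicate over test results and a production metric. This is a minimal sketch; the error-budget threshold and the function names are illustrative, not a specific product's API:

```python
def gates_pass(test_results, error_rate, error_budget=0.01):
    # Ship only if every test suite passed and the current production
    # error rate is within budget (0.01 here is an illustrative value).
    return all(test_results.values()) and error_rate <= error_budget

def deploy(build, test_results, error_rate):
    if not gates_pass(test_results, error_rate):
        return f"{build}: held back"
    return f"{build}: deployed"

status = deploy("v1.4.2", {"unit": True, "integration": True}, 0.002)
```

Real systems add more gates (canary metrics, gradual rollout), but the shape is the same: automated checks stand between a change and your users.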
Traditional organizations have been driven by NOCs and system-operations on-call rotations. This results in systems that are opaque and hard for developers to use. Many organizations still send alerts to system operators, who follow a run book and only then escalate to the developers. Traditional monitoring systems can make it difficult to find the developer who owns a piece of code. PostOps removes this separation and ensures the person responsible for the problem knows about it and can introduce a long-term fix immediately.
Site Reliability Engineers are an emerging industry trend. Their purpose is to build the PostOps infrastructure around the application code and ensure the application is structured to run smoothly in production even under dire circumstances. This can include performing failure analysis to identify where systems are likely to break and bolstering their fault tolerance, as well as building platforms for monitoring, deployment, and data storage.
Automated provisioning and load balancing can mean cost savings and fire prevention for companies subject to traffic surges. Everyone sleeps better at night knowing your latest big customer can come online without capacity issues.
What can MustWin do for you?
Technical Program Management
We can consult with your current software development and operations teams to create a plan to upgrade your workflow to PostOps. MustWin will identify opportunities to improve your development team’s throughput, reliability, and operational efficiency. Our extensive experience across multiple deployments allows us to act as architectural advisors, preventing problems before they occur and optimizing existing systems’ performance.
We can help you build out a platform-as-a-service strategy. Whether you're looking to build something on-prem, extend an existing cluster, or build a new deployment atop an existing public cloud, we can help you put the pieces together. We understand that your IP or data may be sensitive, and we can build agile solutions that are still cost-effective and on-prem. Every business is different, and the solutions are as well.
Whether you run your own datacenter or work in the cloud, we can set you up with a continuous integration workflow that takes your code from git, ensures your tests pass, and automatically (or at the push of a button) deploys your code to production.
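That git-to-production flow boils down to an ordered set of stages that stop at the first failure. A minimal sketch (the stage names and lambdas are placeholders for real checkout, test, and deploy steps):

```python
def run_pipeline(commit, stages):
    # Run each stage in order; a failing stage halts the pipeline
    # so broken code never reaches production.
    log = []
    for name, stage in stages:
        ok = stage(commit)
        log.append((name, ok))
        if not ok:
            break
    return log

stages = [
    ("checkout", lambda c: True),  # fetch the commit from git
    ("test",     lambda c: True),  # run the test suite
    ("deploy",   lambda c: True),  # automatic or push-button deploy
]
result = run_pipeline("abc123", stages)
```

Any real CI system (Jenkins, GitHub Actions, etc.) is this loop plus a lot of plumbing: triggers, isolated build environments, and artifact storage.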
Building distributed systems is tough, but we have plenty of experience managing modern applications running across many nodes, and we're happy to help get you set up.
Knowing your application can handle rapid, unexpected growth is fundamental to the PostOps world. We can handle setting your application up for high-scalability on major cloud providers.
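The core of any autoscaling policy is a sizing function: given current load, how many instances should be running? A minimal sketch, assuming a hypothetical per-instance capacity figure and illustrative floor/ceiling values:

```python
import math

def desired_instances(current_rps, rps_per_instance=500, floor=2, ceiling=50):
    # rps_per_instance is an assumed capacity figure you would measure
    # for your own application via load testing.
    n = math.ceil(current_rps / rps_per_instance)
    # Clamp to a floor (resilience: never run a single instance)
    # and a ceiling (cost control during runaway surges).
    return max(floor, min(ceiling, n))
```

Cloud providers' autoscaling services implement essentially this, driven by live metrics instead of a hand-fed request rate.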
Monitoring + Alerting
The first few 9s of availability come from simply knowing when you have a problem. We set up first-alert systems that embody our PostOps mindset to keep your developers accountable and your users happy. The more complex parts of monitoring often require bespoke systems to adequately handle application interactions; MustWin can build purpose-built monitoring tools that integrate with your existing stack.
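At its simplest, a first-alert system evaluates metric thresholds and routes whatever fires to the owning team. A sketch, with hypothetical rule names and threshold values:

```python
def check_alerts(metrics, rules):
    # Compare each metric against its rule's threshold; return the
    # names of rules that fired so they can be routed to the code owner.
    fired = []
    for name, (metric, threshold) in rules.items():
        if metrics.get(metric, 0) > threshold:
            fired.append(name)
    return fired

rules = {
    "high_error_rate": ("errors_per_min", 5),
    "slow_p99": ("p99_latency_ms", 800),
}
alerts = check_alerts({"errors_per_min": 12, "p99_latency_ms": 250}, rules)
```

The PostOps twist isn't in the evaluation, it's in the routing: the alert goes straight to the developers who shipped the code, not to a NOC with a run book.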
Virtualizing development environments
As architectures become more service oriented, the list of dependencies developers need to get things running locally (or at least separately) grows too. Virtualization and clever application design can help new developers get up and running quickly.
Datastore Deployments / Optimizations
Nearly all architectures are built around data storage. We know data storage is a complex problem, and nothing is more frustrating than your substrate failing you. We can help you decide which datastore is right for your application, and help you structure your application and data model to optimize for business value, whether that's a Dynamo-style store or a traditional ACID-compliant RDBMS. Let our team help you with query-pattern optimization, deployment, scaling, and monitoring of your datastore.
Content Delivery and Application Acceleration
In 2006, we learned from Google that the responsiveness and latency of web properties correlate strongly with user satisfaction. Fortunately, CDN technology has evolved massively over the past decade; unfortunately, application integration has become more complex along with it. Taking advantage of these technologies requires a deep understanding of frontend and backend interactions. We can help you identify the low-hanging fruit, and the engineering work required to capture the higher-touch aspects of application acceleration.
Performance Tuning / Application Optimization
Performance optimization is a hard challenge for many organizations because it requires a cross-cutting understanding of the application stack and a firm grip on how each piece works. MustWin has experts across the entire stack, from datacenter network topology and virtual-machine resource management to database query optimization and fixing bottlenecks in your application code. Usually the first step in resolving performance issues is forensic analysis of metrics from many parts of your application. We also have extensive experience setting up systems that help you track down and fix bottlenecks on your own. These tools keep adding value over time: they help you debug your application in production and provide KPIs for future work.
Like this? Join our mailing list to hear more about optimizing your engineering process using the latest technologies.
Mike Ihbe is a founding partner of The Must Win All Star Web & Mobile Consultancy.
Mike is an expert in a dizzying array of technologies and has loads of experience managing fast-moving dev teams, designing systems, and scaling large applications. When not on the job, he can be found cooking delicious meals on ski slopes in exotic locales.
Sargun Dhillon is a senior backend developer at The Must Win All Star Web & Mobile Consultancy.
Sargun is a backend developer who gets down and dirty with distributed systems, databases, networking, and the operational aspects of software engineering. He has experience in scaling deployments, and running large SaaS applications.