2018 was a significant year for the DevOps community. Kubernetes became the first project to graduate from the CNCF; Pivotal IPO-ed; HashiCorp became a unicorn at $1.9B; and VMware acquired Heptio for close to $600M. These events underscore the category’s importance.
Last January our microservices predictions included service meshes, event-driven architectures, container-native security, GraphQL, and chaos engineering. While these technologies continue to grow in popularity, we see some new trends on the horizon: 1) test automation, 2) Continuous Deployment/Verification (CD/CV), 3) incident response, 4) Cloud Service Expense Management (CSEM), and 5) Kubernetes extending to Machine Learning (ML).
1. Emerging test automation
Traditionally, individuals designed the test cases used to determine whether software would function correctly under different circumstances. Often, quality assurance (QA) engineers created and ran those test cases. Today, software engineers are taking on testing responsibilities from traditional QA teams because of test-driven development, in which developers run tests throughout the continuous integration (CI) pipeline. This testing burden falls on developers and decreases their productivity.
We believe businesses want a software testing solution that automatically designs and runs tests and reports the results. The solution should be frictionless: it should connect to CI systems, check new code in real time, and add comments much as a human engineer would. We’ve heard the testing solution should test through the User Interface (UI) so engineers can find issues the way users would encounter them and decrease false negatives.
Shifting software testing left helps decrease the resources needed to fix a bug. Once the autonomous software identifies a bug, it should automatically generate fixes. Simple bugs may get automatic patches, while complicated bugs could leverage human-designed templates or “mutation-based fixes,” which make small changes to the code until the bug is remediated. The recommendation engine can train on data from engineers’ previous fixes and generate informed suggestions that are pre-tested before human approval.
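To make the idea of a mutation-based fix concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular vendor works: the buggy `add` function, the tiny test suite, and the helper names (`candidate_patches`, `passes_tests`, `repair`) are all hypothetical, and the only mutation tried is swapping one arithmetic operator at a time. It relies on `ast.unparse`, so it assumes Python 3.9 or later.

```python
import ast
import copy

# Hypothetical buggy code under repair: it should add, but it subtracts.
BUGGY_SOURCE = "def add(a, b):\n    return a - b\n"

# Operators to try as single-step mutations (illustrative, not exhaustive).
CANDIDATE_OPS = [ast.Add, ast.Sub, ast.Mult, ast.FloorDiv]


def candidate_patches(source):
    """Yield variants of `source` that differ by exactly one binary operator."""
    tree = ast.parse(source)
    binop_count = sum(isinstance(n, ast.BinOp) for n in ast.walk(tree))
    for index in range(binop_count):
        for op in CANDIDATE_OPS:
            mutated = copy.deepcopy(tree)
            # ast.walk visits nodes in a deterministic order, so the same
            # index points at the same expression in every copy.
            target = [n for n in ast.walk(mutated) if isinstance(n, ast.BinOp)][index]
            if isinstance(target.op, op):
                continue  # identical to the original, not a mutation
            target.op = op()
            yield ast.unparse(mutated)


def passes_tests(source):
    """Run a tiny, hypothetical test suite against a candidate patch."""
    namespace = {}
    try:
        exec(source, namespace)
        add = namespace["add"]
        return add(2, 3) == 5 and add(-1, 1) == 0 and add(0, 0) == 0
    except Exception:
        return False  # a crashing candidate is a failing candidate


def repair(source, tests):
    """Return the first mutation that passes the tests, or None."""
    for patch in candidate_patches(source):
        if tests(patch):
            return patch
    return None


if __name__ == "__main__":
    print(repair(BUGGY_SOURCE, passes_tests))
```

In this sketch the patch is only a suggestion that has already passed the tests, which matches the pre-tested, human-approved flow described above. A real system would need a far richer mutation set and a much stronger test suite to avoid accepting patches that merely overfit the tests.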
We believe software testing is a great application for AI to dramatically improve productivity, cost, coverage, and accuracy. We’ve written about our excitement for ML-enabled software testing previously and continue to think it is a huge market (~$32B) ripe for disruption.
2. Continuous Deployment/Verification for improved productivity
Businesses continue to feel pressure to accelerate software release cycles. Continuous Deployment (CD) allows code that passes testing to be deployed to production automatically. Unlike continuous delivery, a set of design practices to ensure that code can be rapidly and safely deployed to production, CD takes the next step by managing the full deployment.
CD replaces DevOps engineers’ manual activities. We’ve heard that at some financial institutions one out of every ten DevOps employees works on deploying software to production. Assuming CD software can capture the value of 10% of global DevOps employees, we believe the total market size is close to $2B.
Continuous Verification (CV) adds an intelligence layer on top of CD. CV collects event data from logs and APMs and applies ML to understand the features that result in successful and failed deployments. CV should have a human-in-the-loop component so engineers can provide feedback to improve model accuracy and build trust in the system. CV usually rolls back failed deploys safely. We believe that in the future CV will help CD become an intelligent control point for multi-cloud environments. It will provide predictive capabilities, including insights on the best cloud, region, and configuration for deploying a service based on its characteristics.
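The verify-then-roll-back loop can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it uses fixed thresholds where a real CV product would apply a learned model, and the `DeployMetrics` fields, the `verify_deploy` function, and the default thresholds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeployMetrics:
    """Hypothetical summary of what logs and APM data report for a release."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def verify_deploy(baseline, candidate,
                  max_error_increase=0.01, max_latency_ratio=1.2):
    """Compare a new deploy with the previous release.

    Returns "rollback" if the error rate rose by more than
    `max_error_increase` (absolute) or p95 latency grew beyond
    `max_latency_ratio` times the baseline; otherwise "promote".
    Both thresholds are illustrative assumptions.
    """
    if candidate.error_rate - baseline.error_rate > max_error_increase:
        return "rollback"
    if candidate.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    previous = DeployMetrics(requests=10_000, errors=20, p95_latency_ms=180.0)
    new = DeployMetrics(requests=1_000, errors=50, p95_latency_ms=190.0)
    print(verify_deploy(previous, new))
```

The human-in-the-loop piece would sit around a function like this: engineers confirm or override each decision, and those labels become training data for whatever replaces the fixed thresholds.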
While there are numerous CD solutions, we’ve highlighted the 14 most popular below. They range from closed source to open source to managed services from public cloud vendors. The most well-known solution in the space, Spinnaker, an open source project, has achieved over 5.6K GitHub stars.
3. Incident response to the rescue
Site Reliability Engineers (SREs) arose as a response to complex distributed systems that often suffer resiliency challenges. According to Google’s Site Reliability Engineering book, SREs automate previously manual processes performed by sysadmins. They are responsible for “availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” Emergency/incident response is a critical SRE task.
Downtime has significant financial ramifications, so fast resolution is important. Gartner cites an average revenue loss of $5.6K per minute. Large web properties like Amazon can experience a loss of up to $220K per minute. Each minute a service isn’t operating, the business could be losing money and hurting its brand.
When a service fails, a team with distinct roles, including incident commander, is alerted and a series of workflows kicks off. The incident commander maintains an “Incident State Document” that describes the incident, circumstances, and fixes. Each team member should be executing against pre-defined, templated procedures for issue resolution. Once the issue is resolved, the team should participate in a post-mortem analysis to learn from the incident and minimize its recurrence. Google suggests the team pen a postmortem record of “the incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.”
We often hear that SRE teams use a mix of PagerDuty, Slack, Jira, Google Docs, and knowledge bases for incident response. We believe these point solutions can be tied together in an end-to-end SaaS platform that helps automate remediation activities and instills best practices. The platform would reduce Mean-Time-To-Recovery (MTTR) and improve collaboration and knowledge-sharing.
We’ve identified five solutions that provide modern incident response. A centralized platform should not only assign roles and kick off workflows, but also state the incident’s impact, status, event timeline, and effect over time. We believe these platforms complement chaos engineering, a resiliency testing best practice. In the future, incident information can be fed into chaos engineering solutions (like Gremlin) to inform which services should be tested preventatively. The platform’s continuous learning will improve backend resiliency and decrease the frequency of sleepless nights.
4. Cloud Service Expense Management (CSEM) saves $$$
Public cloud cost management is one of the few challenges that deeply affects not only engineering and IT teams but also the entire company. Most businesses have a hybrid cloud approach, but there is an increasing number of public-cloud-only organizations. According to Gartner, IaaS and PaaS revenue will grow from $46.2B in 2018 to $90.7B in 2021, a 25% CAGR. A Rightscale report found that of 997 IT professionals surveyed, 92% used the public cloud and 81% had a multi-cloud strategy. The public cloud offers many benefits including security, availability, reduced operations, utility pricing, and cost. With the public cloud’s expansive and increasing adoption, cost management and forecasting are, and will continue to be, important.
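The 25% CAGR follows directly from the two Gartner revenue figures quoted above. As a quick check, here is the standard compound annual growth rate formula in Python; the function name `cagr` is ours, and the inputs are the figures from the paragraph.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    # $46.2B in 2018 growing to $90.7B in 2021 is three years of compounding.
    growth = cagr(46.2, 90.7, 2021 - 2018)
    print(f"{growth:.0%}")
```

The result rounds to the 25% cited, which means the market roughly doubles over the three-year window.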
Cloud cost management is challenging for various reasons. Sometimes teams begin using public cloud services with limited oversight, almost like shadow IT. This lack of governance can cause service sprawl. Developers can feel pressure to “move fast” and may not consider cost during the evaluation process. The breadth of services and frequent price changes make spend challenging to track. Some cloud bills have over a billion expense lines, making them complicated to parse. These challenges have led Gartner to state that “Through 2020, 80% of organizations will overshoot their cloud IaaS budgets.”
Below we’ve identified 18 CSEM solutions that represent public cloud and third-party services. VMware sells CloudHealth, which it acquired in August 2018 for $500M. Microsoft acquired Cloudyn in 2017 for between $50M and $70M and rebranded it Azure Cost Management. In early January 2019, Amazon bought TSO Logic to enhance its offering. In 2018, Forrester’s “Cloud Cost Monitoring And Optimization” report analyzed nine vendors and found VMware CloudHealth and Rightscale led the category.
Despite a plethora of cloud cost management solutions, cost control continues to be a pain point. Operators often tell us that a CSEM solution should normalize results across platforms and map cloud resources to specific owners and teams so the finance department can allocate spend to particular products or business units. Gartner states cloud services can have a 35% underutilization rate in the absence of effective management, so the solution should also identify optimization opportunities. For example, a CSEM should identify oversized or idling resources. The software needs to support reserved and spot instances, ongoing rightsizing, chargebacks, custom discounts, and anomalous-spend flagging. Forecasting spend based on increased traffic, data storage requirements, and service utilization is also key. As public cloud usage expands, we expect cost management and forecasting’s importance to increase.
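Three of the capabilities operators ask for (spotting idle resources, allocating spend to owners, and flagging anomalous spend) can be sketched in plain Python. Everything here is an assumption for illustration: the record fields (`id`, `owner`, `avg_cpu_percent`, `monthly_cost`), the 5% CPU cutoff, and the leave-one-out z-score rule are ours, not any vendor’s schema or method.

```python
from statistics import mean, stdev


def find_idle_resources(resources, cpu_threshold=5.0):
    """Return resources whose average CPU use is below the threshold (percent)."""
    return [r for r in resources if r["avg_cpu_percent"] < cpu_threshold]


def spend_by_owner(resources):
    """Sum monthly cost per owning team, the basis for a chargeback report."""
    totals = {}
    for r in resources:
        totals[r["owner"]] = totals.get(r["owner"], 0.0) + r["monthly_cost"]
    return totals


def flag_anomalous_days(daily_spend, z_threshold=3.0):
    """Return indices of days whose spend is far above the other days.

    Each day is compared with the mean and standard deviation of all the
    *other* days, so a single large spike cannot hide itself by inflating
    the statistics it is measured against.
    """
    flagged = []
    for i, amount in enumerate(daily_spend):
        others = daily_spend[:i] + daily_spend[i + 1:]
        if len(others) < 2:
            continue  # not enough history to estimate a spread
        spread = stdev(others)
        if spread > 0 and (amount - mean(others)) / spread > z_threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical inventory and one week of daily spend in dollars.
    inventory = [
        {"id": "vm-a", "owner": "search", "avg_cpu_percent": 2.1, "monthly_cost": 310.0},
        {"id": "vm-b", "owner": "search", "avg_cpu_percent": 61.0, "monthly_cost": 620.0},
        {"id": "vm-c", "owner": "billing", "avg_cpu_percent": 0.4, "monthly_cost": 150.0},
    ]
    print([r["id"] for r in find_idle_resources(inventory)])
    print(spend_by_owner(inventory))
    print(flag_anomalous_days([100.0, 104.0, 98.0, 101.0, 350.0, 99.0, 102.0]))
```

The hard part in practice is upstream of this code: normalizing billing data across providers and tagging every resource with an owner so that functions like these have clean inputs.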
5. Kubernetes extends to ML
Kubernetes took the DevOps world by storm and is now the main orchestration solution for containers. Its scope continues to grow, and we expect it to be part of ML platform stacks. For example, Google released open-source Kubeflow, which extends the Kubernetes API by adding Custom Resource Definitions (CRDs) to a cluster so ML workloads are first-class citizens. During KubeCon Seattle 2018, Kubeflow was one of the most discussed cloud-native projects. Google isn’t alone. Lyft built its own ML platform using Kubernetes. We’ve heard other unicorns are trying to standardize on Kubernetes for ML and analytical workloads.
We’ll be watching test automation, CD/CV, incident response, CSEM, and Kubernetes-empowered ML over the next year. If you or someone you know is working on an open source project or start-up in these areas, it would be great to hear from you. Please comment below to let us know what we’re missing or whether you agree or disagree.