Applying AI and Machine Learning to the DevOps Toolchain

Jimmy Guerrero · SignifAI · Jul 25, 2017

In this post, we’ll look at how an organization that has adopted DevOps to create, release, and monitor software can benefit from augmenting its toolchain with the power of AI and machine learning.

What is AI? Artificial intelligence is the name given to programs written to solve problems (often very difficult ones) that humans can already solve. The goal of many researchers and programmers in this field is to create programs that can arrive at a problem’s solution autonomously (without supervision), using methods or logic that might differ from what a human might employ.

What is machine learning? Machine learning is the practical application of AI in the form of a set of algorithms or programs. The “learning” aspect relies on training data and time: the more relevant data you feed into the machine, the longer it has to evaluate that data, and the more sophisticated the algorithms it employs, the more the machine will ultimately “learn.”

Why DevOps?

There are technical, business, and cultural benefits that can be realized when an organization adopts DevOps, including:

  • Automation, automation, automation!
  • Faster development and “go to market” times
  • Continuous delivery of features and fixes with fewer defects
  • Reduced operational complexity
  • Faster remediation to problems
  • Greater developer satisfaction, meaning more time for innovation and less time spent fixing bugs or fighting fires
  • Increased collaboration and communication across teams

What exactly is the DevOps toolchain?

For the purposes of this blog, we can think of the “DevOps toolchain” as the set of tools, ideally integrated with each other, that help organizations adopt and implement DevOps. The basic “phases” of the toolchain include:

  • Planning: Creating software specs, requirements and release plans
  • Creating: Software design and actual coding
  • Verifying: Software testing and QA
  • Packaging, Releasing, and Configuring: Building, preparing, and pushing code to production
  • Monitoring: Ensuring that the software is running within accepted parameters
[Image: The phases of the DevOps toolchain. Image credit: Kharnagy.]

In an idealized adoption of DevOps, there is a continuous loop between the phases.

DevOps, are we there yet?

Despite the methodology and tools for DevOps having been around for over a decade, very few organizations have achieved the ultimate goal of DevOps: 100% automation from code change to production, without anyone having to intervene at any point in the process. The reality is that although some tools overlap between phases of the DevOps toolchain, no single tool or toolset can provide a 100% automated process. Instead, organizations end up deploying many DevOps tools that only do one or two phases really well. The negative consequence is that the organization has now created “data silos” or “islands of automation,” where it becomes nearly impossible to tightly integrate the tools end-to-end or to correlate any of the data the tools produce, process, or analyze.

The Planning Phase

In the initial “planning phase” of the DevOps process, organizations are concerned with driving efficiencies, specifically around creating specifications and defining release and maintenance processes for the software. This phase also covers gathering both the technical and business requirements. There is a variety of tools organizations can choose from in this phase: open source, proprietary, SaaS, or on-premises. Atlassian’s JIRA, Rally (acquired by CA), and GitHub Issues are popular examples.
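
Tools in this phase expose their data through APIs, which is what makes later correlation possible. As a minimal sketch, here is how open bug reports might be pulled from GitHub Issues using the public REST API; the owner and repository names are placeholders:

```python
import requests

# Placeholder repository; substitute your own owner and repo.
OWNER, REPO = "example-org", "example-app"
url = f"https://api.github.com/repos/{OWNER}/{REPO}/issues"

# The GitHub REST API returns open issues by default; each issue carries
# labels and timestamps that a correlation engine can use later.
# (Unauthenticated requests are rate limited; pass a token for real use.)
resp = requests.get(url, params={"state": "open", "labels": "bug"})
resp.raise_for_status()

for issue in resp.json():
    print(issue["number"], issue["title"], issue["created_at"])
```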

The Creation Phase

Next up is the “creation phase,” which focuses on improving the efficiency and quality of the actual development of software. In this phase an organization architects, designs, codes, tests, and builds the software. As in the planning phase, there are many tools an organization can choose from, for example Atlassian’s Bitbucket, GitLab, or GitHub.
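
One lightweight way to tie this phase back to planning is to reference issue keys in commit messages and extract the links programmatically. A minimal sketch, assuming a JIRA-style key convention and invented commit messages:

```python
import re

# JIRA-style issue keys such as "OPS-142" (assumed team convention).
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")

commit_messages = [
    "OPS-142 fix race condition in deploy hook",
    "refactor config loader (no ticket)",
    "OPS-142 OPS-157 add retry logic to packager",
]

# Map each issue key to the commits that reference it, so planning data
# (issues) and creation data (commits) can be joined later.
links = {}
for msg in commit_messages:
    for key in ISSUE_KEY.findall(msg):
        links.setdefault(key, []).append(msg)

print(links)
```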

The Verification Phase

In the “verification phase,” an organization that has adopted DevOps puts on its “QA hat” and focuses on code quality. The emphasis is on making sure there is ample test coverage that can be exercised quickly and accurately, and on getting feedback about test runs back to developers efficiently so that bugs and regressions can be addressed rapidly. Here, too, there is an array of tools to choose from, including Jenkins, CircleCI, and Travis CI.
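
Getting that feedback to developers quickly usually means pulling results out of the CI server programmatically. A minimal sketch against Jenkins’ JSON API, assuming a placeholder server URL and a hypothetical job named “app-tests”:

```python
import requests

JENKINS = "https://jenkins.example.com"  # placeholder server
JOB = "app-tests"                        # hypothetical job name

# Jenkins exposes build metadata as JSON at <job>/lastBuild/api/json.
# Real servers usually need auth, e.g. requests.get(..., auth=(user, token)).
resp = requests.get(f"{JENKINS}/job/{JOB}/lastBuild/api/json")
resp.raise_for_status()
build = resp.json()

# "result" is SUCCESS, FAILURE, UNSTABLE, or None while still running.
print(build["number"], build["result"], build["timestamp"])
```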

The Packaging, Releasing, and Configuration Phases

In the “packaging” or “pre-production” phase, an organization takes the necessary steps to prepare the software for production, which involves packaging, configuring, and staging. In the subsequent “release” and “configuration” phases, an organization implements all the tasks related to the actual “push to prod.” Considerations here include orchestration, resource provisioning, and the actual mechanics of going from deployment to the production environment. This can include spinning up databases, load balancers, proxies, web servers, and other systems or software required to support the application. Popular tools in this phase include Docker for container management and Ansible, Puppet, or Chef for automating and orchestrating deployments.
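
Because the release mechanics are script-driven, each deploy can also emit a structured event that later phases can correlate with the alerts that follow it. A minimal, hypothetical sketch that shells out to ansible-playbook; the playbook and inventory names are assumptions:

```python
import json
import subprocess
import time

# Hypothetical playbook and inventory; substitute your own.
cmd = ["ansible-playbook", "-i", "inventory/prod", "deploy.yml"]

start = time.time()
result = subprocess.run(cmd, capture_output=True, text=True)

# Emit a structured deployment event that monitoring and AI tooling
# can later correlate with alerts that follow the deploy.
event = {
    "type": "deployment",
    "command": " ".join(cmd),
    "ok": result.returncode == 0,
    "duration_sec": round(time.time() - start, 1),
}
print(json.dumps(event))
```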

The Monitoring Phase

In the “monitoring phase” of the DevOps toolchain, an organization makes use of an assortment of monitoring tools to ensure that the infrastructure, software, and services the application relies on are meeting QoS requirements. This phase has the most diverse group of tools to choose from, including everything from New Relic and AppDynamics for application monitoring, to Datadog for infrastructure monitoring, ThousandEyes for the network, Splunk or Elastic for logs, and VictorOps, PagerDuty, or Slack for notifications and collaboration.
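
Every one of these tools emits alerts in its own format, which is exactly where the data silos begin. A common first step is to normalize everything into a shared event schema. The sketch below is illustrative; the field names and payload shapes are assumptions, not any vendor’s actual format:

```python
from dataclasses import dataclass

@dataclass
class Event:
    source: str       # which tool produced the alert
    service: str      # affected service or host, if known
    severity: str     # normalized: "info", "warning", or "critical"
    message: str
    timestamp: float  # epoch seconds

def normalize(source, raw):
    """Map a tool-specific payload onto the shared schema (fields assumed)."""
    if source == "datadog":
        return Event("datadog", raw["host"], raw["alert_type"],
                     raw["title"], raw["date"])
    if source == "pagerduty":
        return Event("pagerduty", raw["service"], raw["urgency"],
                     raw["summary"], raw["created_at"])
    raise ValueError(f"unknown source: {source}")

# Example with an invented payload:
print(normalize("datadog", {"host": "web-1", "alert_type": "warning",
                            "title": "High CPU on web-1",
                            "date": 1500984000.0}))
```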

Realizing the Benefits of DevOps with AI and Machine Learning

Now, let’s look at how AI and machine learning can break down the data and communication silos that all these tools inevitably produce. Breaking down these silos creates higher orders of automation, reduces operational complexity, speeds up remediation of problems, increases collaboration, and makes for happier developers…and ultimately customers!

Automation

If we recall, the most ambitious (and elusive) goal of DevOps is complete automation across the toolchain. Complete automation, however, is a problem that won’t be solved anytime soon, so the best we can hope for is a substantial increase in the amount of automation across tools, not just within a single tool or phase.

How can AI and machine learning increase automation across the toolchain?

At SignifAI, we use AI to break down the “data silos” and “islands of automation” within the toolchain by automating the analysis of the event, log, and metric data the tools produce. For example, SignifAI can correlate all the relevant data within a toolchain that includes JIRA, GitHub, Jenkins, Ansible, Splunk, New Relic, PagerDuty, and Slack. The benefits of these automated correlations include reduced alert noise; faster, more accurate root cause analysis; and predictive insights informed by the entire toolchain, not just one individual tool or data source.
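
The core mechanic behind this kind of correlation can be sketched simply: group events that occur close together in time and share a service. The toy function below (which works on any objects exposing service and timestamp attributes, such as the Event schema sketched earlier) stands in for what a production system would do with far richer features:

```python
from collections import defaultdict

WINDOW_SEC = 300  # correlate events within a five-minute window (tunable)

def correlate(events):
    """Group events by service, then split each group on time gaps."""
    by_service = defaultdict(list)
    for e in sorted(events, key=lambda e: e.timestamp):
        by_service[e.service].append(e)

    groups = []
    for service, evts in by_service.items():
        current = [evts[0]]
        for e in evts[1:]:
            if e.timestamp - current[-1].timestamp <= WINDOW_SEC:
                current.append(e)
            else:
                groups.append(current)
                current = [e]
        groups.append(current)
    return groups
```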

Reduced Operational Complexity

DevOps aims to reduce complexity in the processes required to release software. With less complexity comes speed and efficiency.

How can AI and machine learning reduce operational complexity?

At SignifAI, we use AI to reduce complexity by delivering a “single pane of glass”: a single UI from which an engineer can see all of the alerts and relevant data produced by their tools. This avoids the need to context switch between tools or manually analyze data to find correlations. Prioritizing alerts, performing root cause analysis, ascertaining whether an anomaly is really an anomaly, and conducting predictive analytics are all inherently complex tasks, especially when their accuracy depends on looking at all the relevant data. AI and machine learning make it possible to get a high-level view of the toolchain while still zooming in when required.
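
To make the idea concrete, here is a toy sketch that collapses correlated event groups (as produced by the earlier correlate function) into a prioritized single-view summary; the severity ordering is an assumption:

```python
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}  # assumed ordering

def summarize(groups):
    """Collapse each correlated group to one line, worst severity first."""
    rows = []
    for group in groups:
        worst = min(group, key=lambda e: SEVERITY_RANK.get(e.severity, 3))
        sources = sorted({e.source for e in group})
        rows.append((SEVERITY_RANK.get(worst.severity, 3),
                     f"[{worst.severity.upper()}] {worst.service}: "
                     f"{len(group)} events from {', '.join(sources)}"))
    return [line for _, line in sorted(rows)]
```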

Faster Remediation to Problems

There are old adages among site reliability teams about how to solve issues: reboot, add more memory, or decide it is a DNS problem best handled by the networking folks! In reality, conducting root cause analysis, formulating a proper remediation, and implementing an accurate solution are time consuming.

How can AI and machine learning enable faster remediations to problems?

At SignifAI, we use AI to automatically prioritize the most important issues, collect all the relevant data associated with an issue (regardless of where in the toolchain it resides), and suggest a solution based on the industry’s best practices. SignifAI also applies its own expertise and, most importantly, the feedback and training data provided by the DevOps team. In the context of AI, the accuracy of a recommended solution is only going to be as good as the amount of relevant training data available and how well “labeled” that data is. By combining these three sources of training data, the solutions become more accurate over time as more feedback is provided to the system.
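
To illustrate how labeled training data drives a recommendation, here is a minimal sketch of a text classifier that maps alert text to a remediation category using scikit-learn. The alerts, labels, and categories are all invented; a real system would use far more data and features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented, tiny training set: alert text -> remediation label.
alerts = [
    "java.lang.OutOfMemoryError in checkout service",
    "heap space exhausted on worker node",
    "connection refused to db replica",
    "timeout connecting to postgres primary",
    "disk usage above 95% on /var/log",
    "no space left on device",
]
labels = [
    "increase_memory", "increase_memory",
    "check_database", "check_database",
    "free_disk_space", "free_disk_space",
]

# TF-IDF features plus a linear classifier: deliberately simple.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(alerts, labels)

print(model.predict(["GC overhead limit exceeded in api pod"]))
```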

Increased Collaboration

At organizations that have not embraced the DevOps methodology, there is still a sharp cultural and technical divide between developers and operations teams. Developers want to release code early and often; operations teams want the least amount of disruption to existing systems, especially ones that are performing well. At organizations that embrace DevOps, engineers write code and are on the hook for supporting it in production. When an engineer has to wear both of these hats, it is vital that information about how best to run the applications and systems flows freely, which requires copious amounts of communication and collaboration.

How can AI and machine learning improve developer collaboration and communication?

SignifAI uses AI to increase a DevOps team’s communication and collaboration by providing a single UI from which all the relevant data about the toolchain can be accessed. A team’s favorite communication tools are directly accessible within that UI, so JIRA issues, PagerDuty acknowledgements, and Slack messages can all be initiated, viewed, or responded to without having to switch tools. SignifAI also makes it easy to capture a team’s knowledge about how their systems and applications should run, and then surface that knowledge at relevant times, such as when alerts or anomalies are detected.
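
As a small, generic illustration of chat integration (not SignifAI’s internals), posting a correlated incident summary into Slack takes only an incoming webhook; the webhook URL below is a placeholder:

```python
import requests

# Placeholder URL; create a real one via Slack's "Incoming Webhooks" feature.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

summary = "[CRITICAL] checkout: 4 correlated events from datadog, pagerduty"

# Slack incoming webhooks accept a simple JSON payload with a "text" field.
resp = requests.post(WEBHOOK_URL, json={"text": summary})
resp.raise_for_status()
```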

Summary

AI and machine learning have the power to break down the silos that DevOps tools inevitably produce, creating higher orders of automation, reducing operational complexity, suggesting faster remediations to problems, and increasing collaboration.

Originally published at blog.signifai.io on July 25, 2017.
