What does AI mean for Enterprise IT Management?

There is a lot of buzz (and hype) around AI following its successes in human-centric games such as Jeopardy!, Go, and Dota 2. This article examines the ground realities of Machine Learning (the most popular branch of AI) and analyzes the implications for automating Enterprise IT Management workflows. The takeaway is that some IT workflow tasks are low-hanging targets for AI disruption (summarized in the table below). Additionally, IT decision-making tasks that require aggregating multiple sources of knowledge and telemetry will also become targets for AI-based decision support. Net-net, disruption of Enterprise IT Management is drawing ever closer. With the widespread availability of powerful Deep Learning platforms such as Google’s TensorFlow, the barrier to entry is lower still, placing incumbents and new entrants on a level playing field. Read the rest of the article for details!

As software eats every vertical market, Enterprise IT is transforming from a cost of doing business into a business differentiator. CIOs are busy with “digital transformation”: digitizing new and existing business workflows, and extracting competitive insights from the growing volume of data. To meet these objectives, CIOs are actively evaluating on-prem and cloud-based platforms that offer agility in delivering new applications while meeting functionality expectations. The key goal is to reduce the human effort spent on day-to-day IT management, shifting the focus from fire-fighting to innovation. The promise of “human-like intelligence” from AI resonates with CIOs who are under pressure to do more with less.

AI is really a collection of techniques. Intelligence is broadly defined as the ability to acquire knowledge and reason with it to solve problems. The branch of AI that has gained the most popularity today is machine learning. AI has been in the making for 60 years and is now being compared to electricity in how essential it may become. Skeptics point out that learning techniques such as neural networks are not new, so why now? Three trends are coming together: increasing availability of cheap parallel compute, a growing data corpus, and advances in algorithms for Machine Learning and Data Science. As CIOs evaluate AI-based technologies, a few success metrics would be: 1) reducing opex by automating a subset of manual tasks; 2) reducing losses due to data loss, security attacks, downtime, etc.; 3) optimizing costs through better resource usage, etc.

While it may appear that any problem can be solved with Machine Learning, in reality only a limited set of question categories is well suited to it today. How useful Machine Learning turns out to be depends on how precisely the question is formulated and whether data is available to support the answers. In simplified terms, the popular question categories are as follows:

  1. Anomaly detection: Is the latency unusually high?; Is the CPU utilization abnormal?
  2. Clustering (Find similar patterns): What is the load during different times of the day?; What is the performance for this workload pattern?
  3. Classification: What is the category of error?; What percentage of workloads will be affected?
  4. Regression (Predict outcomes): What is the expected latency for this workload?; What is the overload value for this resource?
  5. Reinforcement learning (Learn action impact): What will be the impact of changing this parameter?; Given a specific value of this knob, what will be the impact?
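To make the first category concrete, here is a minimal sketch of anomaly detection on CPU utilization. It assumes evenly spaced samples and uses a simple trailing-window z-score test; the function name `flag_anomalies`, the 12-sample window, and the 3-standard-deviation threshold are illustrative choices, not taken from any particular product. Real monitoring tools may use more robust models, for example ones that account for daily seasonality.

```python
from statistics import mean, stdev


def flag_anomalies(samples, window=12, threshold=3.0):
    """Return indices of samples that deviate from the trailing window
    by more than `threshold` standard deviations (a simple z-score test).

    Hypothetical helper for illustration only.
    """
    if window < 2:
        raise ValueError("window must contain at least two samples")
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]  # the `window` samples before i
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        z = (samples[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization (%) samples with one spike at index 12.
cpu = [22, 25, 23, 24, 26, 22, 25, 24, 23, 25, 24, 23, 91, 24, 25]
print(flag_anomalies(cpu))
```

On this made-up series only the spike at index 12 is flagged. One known weakness of the trailing window is that the spike then inflates the statistics for the next few samples, which is one reason production detectors use more robust estimators.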

Enterprise IT Management can be broadly defined as ensuring Service Levels for business-critical applications (whether internal or customer-facing). Service Level Objectives are typically defined in terms of measurable metrics for security, performance, availability, and scaling. The day-to-day activities involved depend on the lifecycle stage of Enterprise IT: Day 0, Day 1, and Day 2 are common terms for initial deployment, configuration and optimization, and maintenance, respectively. Best-practice frameworks such as ITIL define the activities involved in various Enterprise IT Management tasks. A few key categories are:

  • Capacity Planning: Both initial planning as well as ongoing scaling of deployments
  • Continuous Monitoring: Continuously tracking telemetry information as well as logs
  • Configuration Management: Ensuring the correctness of configuration parameters as well as optimizing them
  • Change Management: Ensuring timely patching, upgrades, service validation testing, etc.
  • Root-cause Diagnosis: End-to-end analysis of issues impacting applications
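As an illustration of how the regression category can map onto Capacity Planning, the sketch below fits a straight-line trend to periodic utilization samples and estimates how many periods remain before a capacity limit is reached. It assumes growth is roughly linear and that samples are evenly spaced. The function names `fit_trend` and `steps_until` and the storage numbers in the example are hypothetical.

```python
def fit_trend(ys):
    """Ordinary least-squares fit of ys against time steps 0..n-1.

    Returns (slope, intercept) of the fitted line.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def steps_until(ys, capacity):
    """Estimate how many steps after the last sample the fitted trend
    reaches `capacity`. Returns None if the trend is flat or falling."""
    slope, intercept = fit_trend(ys)
    if slope <= 0:
        return None
    crossing = (capacity - intercept) / slope  # time index where trend == capacity
    return max(0.0, crossing - (len(ys) - 1))


# Hypothetical weekly storage usage in TB against a 60 TB limit.
weekly_tb = [40, 42, 44, 46, 48]
print(steps_until(weekly_tb, 60))
```

Here the fitted line grows by 2 TB per week, so the estimate is 6.0, i.e. about six more weeks of headroom under the linear assumption. A straight line is the simplest choice. Workloads with seasonal or bursty growth would need a richer model.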

The table above is a simplified view of how Machine Learning applies to different Enterprise IT Management tasks. The cells in green represent low-hanging fruit with respect to what Machine Learning can do today. For instance, anomaly detection for monitoring is already available in a few offerings. In broad strokes, each green cell corresponds to some startup activity (either known or in stealth).

In closing, Enterprise IT Management is ripe for disruption. The democratization of building blocks for AI, and for machine learning in particular, will make this space increasingly active. The key to a winning solution is a deep understanding of today’s IT workflows, along with realism about the strengths of AI!