Machine learning and IT infrastructure management automation (AIOps)

Dec 11, 2018

Businesses can gain agility, productivity, and efficiency from the capabilities that AI and machine learning bring to IT infrastructure management automation. The added value is clear enough that many IT companies that own data centers, such as Google, Amazon, Microsoft and Facebook (e.g. Facebook: more than 500 TB per day in 2012), as well as non-IT companies that generate huge amounts of data every day, are racing to adopt these concepts and techniques, with goals such as:

· Cutting costs to be more competitive and offer lower prices for their services

· Efficient resource management, optimizing effort while right-sizing workloads

· IT platforms that run themselves autonomously, with minimal user intervention

Addressing IT challenges using Machine learning:

DevOps teams manage multiple systems across multiple monitors, tracking dashboards, metrics, incidents, and alerts to prevent problems and outages. They are often overwhelmed and lack time, visibility and clear priorities. They need a smarter incident management system that simplifies their work and saves time and effort by automating processes: a master monitoring system that sits on top of all their platforms to filter logs, correlate events, analyze metrics, raise alerts and perform human-like troubleshooting, finding solutions and applying them without human intervention.

Machine learning is the key to this kind of tool, which redefines the way infrastructure is managed by moving IT operations from a reactive to a predictive mode.

Machine learning & AIOps:

The application of machine learning to IT operations, automatically identifying issues and reacting to or resolving them in real time, is called AIOps, which stands for artificial intelligence for IT operations.

AIOps added value:

Companies generate massive data sets (events, logs, metrics) from hardware, operating systems, software (servers and applications), network elements (routers, firewalls, Intelligent Network), monitoring systems and trouble ticketing systems. These data sets can be collected, formatted and finally cut down to what is relevant. By learning operational patterns from this historical data, it becomes possible to improve and automate IT infrastructure management, and this is what we call AIOps.

AIOps tackles many aspects of IT platform supervision and maintenance to deliver solutions that simplify the whole process:
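To make the idea of learning patterns from historical logs a bit more concrete, here is a minimal Python sketch. It is only an illustration under simple assumptions, not how any particular AIOps product works: the masking rules, the function names (`to_template`, `learn_templates`, `is_relevant`) and the `min_count` threshold are all hypothetical. Each raw log line is reduced to a template by masking variable fields, template frequencies are counted over the history, and a new line is considered relevant when its template has rarely or never been seen.

```python
import re
from collections import Counter

# Hypothetical masking rules: variable parts of a log line become placeholders.
# Order matters: IP addresses and hex values are masked before plain numbers.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(line):
    """Collapse a raw log line into a template by masking variable fields."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def learn_templates(history):
    """Count how often each template appears in the historical log lines."""
    return Counter(to_template(line) for line in history)


def is_relevant(line, template_counts, min_count=2):
    """Return True when the line's template was seen fewer than min_count times."""
    return template_counts[to_template(line)] < min_count


history = [
    "Updated package 12 of 40",
    "Updated package 13 of 40",
    "Connection from 10.0.0.5 accepted",
    "Connection from 10.0.0.9 accepted",
]
counts = learn_templates(history)

# A routine update message matches a known template and is filtered out.
routine = is_relevant("Updated package 14 of 40", counts)
# A message with an unseen template is kept for investigation.
unusual = is_relevant("Disk /dev/sda1 failure detected", counts)
```

A real system would use a proper log parser and a much larger history, but the principle is the same: what has been seen many times is treated as routine, and what is new stands out.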

· Probable cause analysis: Aggregating, mining and correlating logs to get a complete overview, pinpoint related trends and patterns more easily, and detect relations between events and issues (for example, the correlation between an upgrade in one system and a breakage in another). As a result, troubleshooting addresses root causes more quickly and accurately.

· Reducing alert noise: Grouping, filtering and deprioritizing unimportant alerts and incidents. Machine learning algorithms learn to ignore routine log messages, such as regular system updates, while still allowing new or unusual messages to be detected and flagged for investigation.

· Capacity planning:

o Predictive scaling: forecasting demand, anticipating capacity requirements (disk, RAM, CPU) and automatically scaling out and in based on predictive insight learned from historical data.

o Mapping workloads to the right servers and VM configurations by recommending the right instance family type, storage choices and their I/O throughput, network configuration, …

o Detecting zombie VMs and unused resources.

· Anomaly detection: Raising alerts about hardware, software or security anomalies (breaches, violations, intrusions) by applying machine learning outlier detection techniques, so that risk events can be detected and avoided.

· Routing incidents: Assigning incidents to the right team at the right time in the trouble ticketing system

· Predictive event management: Giving early warning of issues and possible outages that may occur

· Self-monitoring and self-healing: Recognizing problems in the environment and initiating responses when failures occur (blocking ports, applying patches, upgrading hardware and software systems)
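As an illustration of the anomaly detection item in the list above, here is a small sketch of one simple outlier detection technique, the modified z-score based on the median and the median absolute deviation (MAD). It is just one possible choice among many, and the function name `robust_outliers`, the sample CPU values and the fallback for a zero MAD are assumptions made for this example. The constant 0.6745 and the 3.5 threshold are the values commonly used with this score.

```python
from statistics import median


def robust_outliers(values, threshold=3.5):
    """Return the indices of points whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption for this sketch: when more than half of the points are
        # identical, any point that differs from the median is reported.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical CPU usage samples (percent); the spike sits at index 6.
cpu_percent = [41, 40, 42, 39, 41, 40, 95, 41]
suspects = robust_outliers(cpu_percent)
```

The median and MAD are used instead of the mean and standard deviation because a single large spike would otherwise inflate the very statistics used to detect it.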
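The predictive scaling idea from the capacity planning item above can also be sketched in a few lines. This is a deliberately naive example: it fits a straight line to past load with ordinary least squares and extrapolates it, whereas real demand usually has seasonality that calls for richer models. The function names (`linear_forecast`, `instances_needed`), the 20% headroom and the sample numbers are all hypothetical.

```python
import math


def linear_forecast(history, steps_ahead):
    """Fit a least-squares line to history and extrapolate steps_ahead past its end."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2          # mean of the time indices 0..n-1
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def instances_needed(predicted_load, capacity_per_instance, headroom=0.2):
    """Instances required to serve the predicted load plus a safety margin."""
    required = predicted_load * (1 + headroom) / capacity_per_instance
    return max(1, math.ceil(required))


# Hypothetical requests per second observed over the last four intervals.
load_history = [100, 120, 140, 160]
predicted = linear_forecast(load_history, steps_ahead=2)
# Assuming each instance can handle 50 requests per second.
target = instances_needed(predicted, capacity_per_instance=50)
```

Comparing `target` with the number of instances currently running tells the platform whether to scale out or scale in before the demand actually arrives.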

Conclusion

With AIOps, failures and downtime are handled proactively, and systems progressively improve using historical and current service and technical data. AIOps replaces a software-defined infrastructure (SDI) with an artificial intelligence-defined infrastructure (ADI) that can self-learn and self-heal almost autonomously.

Written by Youssef Fenjiro