Episode XV: Magic-Ops

Fatih Nar
Published in Open 5G HyperCore
11 min read · Apr 20, 2023

Authors: Fatih Nar, Chief Technologist at Red Hat; Vanessa Martini, Senior Product Manager at Red Hat; Volker Tegtmeyer, Principal Product Marketing Manager at Red Hat; Fatih Baltaci, CTO at DDosify.

Article Thumbnail (Licensed from iStockPhoto CN:27675569)

“Any sufficiently advanced technology is indistinguishable from magic.” Arthur C. Clarke

Introduction

As 5G network deployment continues, the complexity of telecommunication networks and their operations is increasing rapidly. These networks now offer services in multi-cloud environments spanning a multilayer stack of containers, virtual machines (VMs), and physical components. Service providers therefore need a radically different approach to network operations to stay both efficient and timely. One technology that can help address this challenge is artificial intelligence (AI).

Artificial Intelligence for IT Operations (AIOps) is a rising approach that leverages big data analytics and machine learning to identify patterns and anomalies in IT infrastructure and services. It analyzes, in real time, the vast amounts of data generated by complex IT infrastructures, multi-cloud environments, applications, and services. This lets IT teams detect issues quickly, diagnose problems, perform root cause analysis (RCA), and automate remediation, so they can make informed decisions and resolve problems faster. The result is a lower risk of downtime, a reduced total cost of ownership (from a site reliability engineering perspective), and a better overall user experience.

Figure-1 Google Search Trends for AIOps over Years

As more solutions and systems are spread across geographies, especially to offer edge computing with lower latency and a better user experience, managing them becomes more complex and challenging. This is particularly true for granular edge IT infrastructures with high traffic (dense populations served by smaller coverage areas), which pushes operating models to embrace AIOps.

Figure-2 AIOps Market Growth (Reference: Link)

To raise awareness and educate our audience about AIOps, demonstrate its benefits, and promote promising solutions, we will publish a three-part series.

(I) The first article (the one you are reading), at a 101 level, explains what AIOps is, what it does, and how it works. It focuses on the benefits that AIOps offers to operations and business teams.

(II) The second article, at a 201 level, will highlight market-ready solutions by providing a complete architecture with commercial products and open-source projects.

(III) Finally, at a 301 level, the third article will showcase a sandboxed, tested example solution with use-case recordings and snapshots.

Note: Depending on our resources and timing, we may merge the second and third articles into a single publication.

AIOps is an approach to achieve better-informed IT operations!

Although we occasionally see misleading marketing materials in the field that position AIOps as a product, AIOps is actually a solution built from multiple products and services across various disciplines. The critical components of an AIOps solution include extensive data collection, ETL (extract, transform, and load), and the central AI/ML core where the “magic” happens.

Figure-3 High-Level Reference AIOps Solution Diagram

Data Collection from multiple channels

This foundation fuels the complex machinery of the AIOps Core (data being the oil), which analyzes the data and quickly surfaces patterns and anomalies that IT Ops teams generally would not find. Suppose you operate an extensive network, managing thousands of devices and millions of nodes globally. In that case, it is impossible to manually sift through all the available, highly complex data in your environments. Manually analyzing metrics, events, logs, and traces (MELT), along with other structured and unstructured supplementary data, and predicting an upcoming application or system failure within a reasonable time is less likely than winning an argument against your spouse.

An effective AIOps solution requires a highly flexible and comprehensive data collection mechanism. To meet this requirement, data collection can be implemented through both push mechanisms (e.g., webhooks) and pull mechanisms (e.g., SFTP).
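The two collection styles can be sketched as follows. This is a minimal illustration, not a reference design: the `Collector` class, the source names, and `fake_sftp_fetch` are all hypothetical. The push path models a webhook handler receiving a record, while the pull path polls a registered fetch function that stands in for an SFTP download.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Collector:
    """Hypothetical collector that accepts both pushed and pulled records."""
    records: List[Record] = field(default_factory=list)
    pull_sources: Dict[str, Callable[[], List[Record]]] = field(default_factory=dict)

    def receive(self, source: str, record: Record) -> None:
        """Push path: the source delivers a record (as a webhook call would)."""
        self.records.append({**record, "source": source, "mode": "push"})

    def register_pull(self, source: str, fetch: Callable[[], List[Record]]) -> None:
        """Pull path: remember how to fetch records from a source on demand."""
        self.pull_sources[source] = fetch

    def poll(self) -> int:
        """Fetch from every registered pull source; return the number of new records."""
        added = 0
        for source, fetch in self.pull_sources.items():
            for record in fetch():
                self.records.append({**record, "source": source, "mode": "pull"})
                added += 1
        return added


def fake_sftp_fetch() -> List[Record]:
    # Stand-in for downloading and parsing a file from an SFTP server.
    return [
        {"metric": "cpu_util", "value": 0.72},
        {"metric": "cpu_util", "value": 0.91},
    ]
```

In practice the push path tends to suit event-like data that should arrive immediately, while the pull path suits bulk files produced on a schedule.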

The “V3-attributes” of data (Volume, Velocity, and Variety) should be considered to ensure that the data is sufficient, valuable, and of high quality.

Data can come from various channels, including:

  • Network Fabric (Switch Fabric, Site and Backbone Routers, Domain Firewalls, Carrier-Grade NATs, etc.),
  • Baremetal Layer (HP iLO, Dell iDRAC, etc.),
  • Operating Systems,
  • Virtualization Layer (KVM, ESXi, etc.),
  • Infrastructure as a Service (IaaS) layer (VMware, OpenStack),
  • Platform as a Service (PaaS) layer (Kubernetes),
  • Big Data and Analytics systems containing insightful data,
  • OSS / BSS systems (Nagios, Zabbix, etc.),
  • Middleware Layer for Application Frameworks (such as Quarkus, Micronaut, Helidon, etc.), and
  • Tenant Applications (such as 5G CNFs).

Figure-4 Top Incident Management Challenges (Ref: Link)

By leveraging the V3-attributes of data from these sources, AIOps solutions can generate more accurate and actionable insights, ultimately improving the efficiency and effectiveness of IT operations. Besides collecting existing data out in the wild (oil transport), it may be necessary to generate additional valuable data (oil rig). New-generation observability platforms can serve as such a supplementary data channel (e.g., DDosify for application performance testing, APT).

ETL Engine makes sure data is accurate, consistent, and up-to-date

Data cleansing involves identifying and correcting or removing any inaccuracies, inconsistencies, or irrelevant information from data sets to improve their quality and reliability. Segregation sorts data into categories or groups based on specific criteria to enable easy analysis and processing. Anonymization masks or removes personal or sensitive information from data sets to comply with privacy regulations. Together, these components ensure that the AIOps solution has access to high-quality data that is organized, secure, and compliant with regulations (which may vary based on region/country).
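The three steps can be sketched as a small pipeline. Everything here is an assumption made for illustration: the field names (`source`, `message`, `severity`), the list of sensitive fields, and the fixed salt. Note that salted hashing is strictly pseudonymization rather than full anonymization; a real deployment would follow whatever its regulations require.

```python
import hashlib
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]
SENSITIVE_FIELDS = ("imsi", "msisdn")  # assumed examples of subscriber identifiers


def cleanse(record: Record) -> Optional[Record]:
    """Drop records missing required fields; normalize the severity label."""
    if not record.get("source") or not record.get("message"):
        return None
    cleaned = dict(record)
    cleaned["severity"] = str(record.get("severity", "info")).strip().lower()
    return cleaned


def anonymize(record: Record, salt: str = "demo-salt") -> Record:
    """Replace sensitive values with a salted, truncated SHA-256 digest."""
    masked = dict(record)
    for key in SENSITIVE_FIELDS:
        if key in masked:
            raw = (salt + str(masked[key])).encode("utf-8")
            masked[key] = hashlib.sha256(raw).hexdigest()[:12]
    return masked


def segregate(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Group records by the layer they came from."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[str(record["source"])].append(record)
    return dict(groups)


def run_etl(raw: Iterable[Record]) -> Dict[str, List[Record]]:
    """Cleanse, then anonymize, then segregate."""
    cleaned = (cleanse(r) for r in raw)
    return segregate(anonymize(r) for r in cleaned if r is not None)
```

Cleansing runs first so that malformed records never reach the later stages, and anonymization runs before grouping so that no downstream consumer sees raw identifiers.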

Figure-5 Data-Mesh for 5G Core Sample

Data correlation involves identifying and linking related data from various sources to better understand the system or application being monitored. Data mesh (see the example in Figure-5) refers to a decentralized approach to data management, where data is organized into smaller, domain-specific units to enable efficient and scalable analysis. Enrichment involves augmenting data with additional context or metadata to provide deeper insights and enable more accurate analysis.
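Correlation and enrichment can be illustrated with a deliberately simple rule: events from the same host that occur within a short time window are treated as related, and each event is enriched from an inventory lookup. The `INVENTORY` table, the event fields (`ts`, `host`), and the 60-second window are hypothetical choices for this sketch; real solutions typically correlate on topology and service dependencies as well.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]

# Hypothetical inventory used for enrichment: host -> context.
INVENTORY: Dict[str, Dict[str, str]] = {
    "worker-1": {"site": "edge-east", "role": "upf"},
    "worker-2": {"site": "edge-west", "role": "smf"},
}


def enrich(event: Event) -> Event:
    """Attach site and role metadata for the event's host, if known."""
    return {**event, **INVENTORY.get(str(event["host"]), {})}


def correlate(events: List[Event], window_s: float = 60.0) -> List[List[Event]]:
    """Group events per host that fall within window_s of the group's first event."""
    groups: List[List[Event]] = []
    open_group: Dict[str, List[Event]] = {}
    for event in sorted(events, key=lambda e: float(e["ts"])):
        host = str(event["host"])
        current = open_group.get(host)
        if current is not None and float(event["ts"]) - float(current[0]["ts"]) <= window_s:
            current.append(event)
        else:
            # Start a new group for this host.
            current = [event]
            open_group[host] = current
            groups.append(current)
    return groups
```

Each resulting group is a candidate incident: a metric spike, a log error, and a restart event on the same host within a minute are more informative together than as three separate alerts.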

Governance involves defining policies, procedures, and standards for data management to ensure data quality, security, and compliance with regulatory requirements. A data governor provides oversight and control over the data used by the AIOps solution, including monitoring and auditing data usage, ensuring data privacy and security, and managing data retention and disposal. It also involves establishing clear roles and responsibilities for data management, such as defining data ownership, access controls, and accountability.

AIOps Core processes and analyzes the data to extract information

Observability is an essential building block of an AIOps solution as it empowers IT teams to gain immediate and actionable insights into the performance and health of their systems and applications. An observability framework collects and analyzes clean and enriched data from the ETL (Extract, Transform, Load) layer, including logs, metrics, traces, and events, to provide a comprehensive view of the system’s behavior. This feedback can be used by developers to optimize their code and improve the system’s design, leading to better overall performance and reliability.

Communication and workflow engines are two critical building blocks of an AIOps solution, facilitating the smooth functioning of IT operations. The communication engine acts as a bridge between different teams, stakeholders, and the AIOps platform, enabling the team to stay informed about the system’s status and critical alerts. With an effective communication engine, the IT team can act swiftly and address issues before they escalate.

Figure-6 AIOps Capability Impact Ratio (Ref: Link)

On the other hand, the workflow engine automates the incident management process, from the initial diagnosis to resolution and closure, by following a well-defined workflow. By providing a centralized platform for collaboration, it streamlines communication and decision-making, leading to a more efficient and effective resolution process. The workflow engine ensures that incidents are escalated to the right team members based on their skills and availability, resulting in faster resolution times.
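The skills-and-availability routing described above can be sketched with a simple rule. The `Engineer` record and the "fewest open incidents wins" tie-breaker are assumptions made for this example, not a description of how any particular workflow engine behaves.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Engineer:
    """Hypothetical on-call record; field names are illustrative."""
    name: str
    skills: Set[str]
    available: bool
    open_incidents: int = 0


def assign_incident(required_skill: str, team: List[Engineer]) -> Optional[Engineer]:
    """Route to the available engineer with the required skill and the fewest open incidents."""
    candidates = [e for e in team if e.available and required_skill in e.skills]
    if not candidates:
        return None  # nobody qualifies: the workflow would escalate to the next tier
    chosen = min(candidates, key=lambda e: e.open_incidents)
    chosen.open_incidents += 1
    return chosen
```

Returning `None` rather than raising keeps the escalation decision with the caller, which is where a tiered workflow would normally handle it.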

To integrate these communication flows, AIOps solutions often use APIs to connect with existing platforms such as Slack, email, or CRM systems. This provides a standardized way of communicating and enables a more effective flow of information between different teams, leading to better collaboration and ultimately improving outcomes.

An AI/ML engine is the central component of an AIOps solution, providing the intelligence to automate IT operations and improve efficiency. Using machine learning algorithms and artificial intelligence techniques, it can identify patterns and anomalies in vast amounts of data, enabling IT teams to detect and diagnose issues proactively.

  • One of the vital vertical offerings of an AIOps solution is predictive maintenance, which uses equipment sensors, historical data, and maintenance records to identify potential equipment failures before they occur. This allows organizations to perform proactive maintenance and avoid costly downtime.
  • Another critical offering is root cause analysis, which correlates all types of data collected by the AIOps solution to identify the problem’s root cause. This can be challenging and require significant team involvement, but it helps prevent issues from recurring in the future and mitigates the risk of equipment failure.
  • Service and revenue assurance is another offering that helps ensure efficient and accurate service delivery while maximizing revenue. It uses billing records, customer feedback, and service tickets to identify issues and areas for improvement.
  • Fraud detection and prevention is critical for organizations dealing with sensitive data and transactions. It uses transaction records, customer profiles, and social media activity to identify patterns and anomalies that may indicate fraudulent activity.
  • The AIOps solution can also generate insights into an organization’s operations and help make informed decisions. By analyzing customer feedback, social media activity, and financial reports, it can identify trends, patterns, and opportunities for improvement.
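The pattern and anomaly detection at the heart of the AI/ML engine can be illustrated with a simple statistical stand-in. The sketch below uses a trailing-window z-score rather than a trained ML model, and the window size and threshold are arbitrary assumptions; it only shows the general idea of flagging values that deviate sharply from recent history.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore_anomalies(
    values: Sequence[float], window: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indexes whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined, so skip this point
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

For example, on a latency series such as `[10, 11, 10, 12, 11, 10, 11, 50, 11, 10]`, only the spike at index 7 is flagged. Production engines typically also account for seasonality and trends, which this sketch ignores.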

Other essential features and services that an AI/ML engine can provide include network optimization, capacity planning, and resource allocation. It can analyze data from various sources, including network traffic, server logs, and resource usage, to identify areas for optimization and improvement. This enables organizations to maximize their networks' and resources' efficiency and performance while minimizing costs and downtime.
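Capacity planning, in its simplest form, can be sketched as fitting a trend to historical usage and projecting when it reaches a limit. The example below uses an ordinary least-squares line over equally spaced samples; that is an assumption chosen for brevity, and real usage data usually calls for models that handle seasonality and bursts.

```python
import math
from typing import Optional, Sequence, Tuple


def fit_line(y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for y sampled at x = 0, 1, 2, ..."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def periods_until_capacity(y: Sequence[float], capacity: float) -> Optional[int]:
    """Periods after the last sample until the trend line reaches `capacity`.

    Returns None when usage is flat or shrinking, and 0 when the trend
    line is already at or above capacity.
    """
    slope, intercept = fit_line(y)
    if slope <= 0:
        return None
    x_cross = (capacity - intercept) / slope
    return max(0, math.ceil(x_cross - (len(y) - 1)))
```

With weekly utilization samples of `[40, 45, 50, 55, 60]` percent and a limit of 80 percent, the trend grows by 5 points per week and reaches the limit 4 weeks after the last sample.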

AIOps & OSS/BSS

Operations Support Systems (OSS) and Business Support Systems (BSS) are critical components of telecommunications and IT service providers’ operations and business processes. These systems facilitate network management, service provisioning, billing, and customer care.

Figure-7 Purpose of using AIOps (Ref: Link)

An AIOps solution can integrate with OSS and BSS using APIs, enabling effective communication and data exchange between systems. With access to OSS and BSS data, an AIOps solution can analyze information from various sources, such as logs, metrics, and events, providing a comprehensive view of the system’s health and performance. AIOps can report its findings, conclusions, and recommendations back to OSS and BSS, empowering them to make better-informed decisions. AIOps can impact organizations in the following ways:

  • Improved operational efficiency: AIOps helps automate routine tasks and reduces the time and effort required for manual analysis, which improves operational efficiency. Closed-loop automation enables automatic actions to address problems, which helps organizations reduce costs and improve service delivery.
  • Enhanced customer experience: AIOps helps identify and resolve issues quickly, improving service quality and customer satisfaction. This can help organizations retain customers and increase revenue.
  • Predictive maintenance: AIOps can predict and prevent potential issues before they occur, which reduces downtime, minimizes the impact of issues on customers, and optimizes operations.
  • Effective capacity planning: AIOps can provide accurate projections about capacity usage, including peak values and usage patterns. With AI-powered calculations, organizations can make informed capacity planning decisions based on comprehensive historical usage and future usage projections.
  • Faster problem resolution: AIOps can identify the root cause of issues quickly and accurately, which enables operations teams to resolve problems faster, minimizing the impact on customers. This results in improved Mean Time to Repair (MTTR).
  • Better decision-making: AIOps provides operations and business teams with insights into system performance, which helps them make better decisions and optimize their operations.

AIOps can be an empowering companion for OSS and BSS as it helps organizations improve operational efficiency, enhance customer experience, reduce costs, and increase revenue.

Which AIOps Solution to Pick?

As we have tried to highlight, AIOps is not a product but a sophisticated solution. When shopping for one, there are a few essential criteria to look for:

Figure-8 Concerns for AIOps (Ref: Link)

  1. Integration and interoperability: The AIOps solution should be able to integrate with your IT infrastructure, OS and virtualization layer, network functions, and monitoring tools. It should also provide a unified view of IT operations and correlate logs, metrics, events, and alerts across multiple tools.
  2. AI and machine learning capabilities: The AIOps solution should have advanced AI and machine learning capabilities that can analyze vast amounts of operational data in real time, covering its key aspects, and apply effective ML techniques to detect and resolve issues. It should also provide predictive analytics to help prevent future problems.
  3. Automation capabilities: The AIOps solution should automate routine tasks and workflows, freeing IT staff to focus on more strategic initiatives. It should also provide automated remediation workflows that resolve issues quickly, either by taking action itself or by triggering external tools (e.g., Ansible Tower), to deliver closed-loop automation.
  4. Scalability and performance: The AIOps solution should scale to meet your organization’s needs and provide real-time insights into the performance of applications and infrastructure.
  5. Usability and ease of use: The AIOps solution should be easy to use and offer a user-friendly interface that can be customized to meet your organization’s specific needs. It should also provide customizable dashboards and reports.
  6. Security and compliance: The AIOps solution should meet your organization’s security and compliance requirements. It should provide secure access to data and offer audit trails and compliance reporting.
  7. Cost and value: The AIOps solution should provide good value relative to its acquisition and running costs. It should offer flexible pricing options and provide a clear return on investment.
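One way to apply these criteria is a weighted decision matrix. The sketch below is only an illustration: the weights are invented placeholders that each organization would set for itself, and the 1-to-5 rating scale is an assumption.

```python
from typing import Dict, List

# Illustrative weights only (they sum to 1.0); tune them to your own priorities.
CRITERIA_WEIGHTS: Dict[str, float] = {
    "integration": 0.20,
    "ai_ml": 0.20,
    "automation": 0.15,
    "scalability": 0.15,
    "usability": 0.10,
    "security": 0.10,
    "cost_value": 0.10,
}


def weighted_score(
    ratings: Dict[str, float], weights: Dict[str, float] = CRITERIA_WEIGHTS
) -> float:
    """Weighted average of per-criterion ratings (for example on a 1-5 scale)."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total = sum(weights[c] * ratings[c] for c in weights)
    return total / sum(weights.values())


def rank(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(
        candidates, key=lambda name: weighted_score(candidates[name]), reverse=True
    )
```

Dividing by the sum of the weights keeps the score on the rating scale even if the weights are later changed and no longer add up to exactly one.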

Summary

This article aimed to provide an overview of AIOps (Artificial Intelligence for IT Operations) and how it can improve IT operations management. We also discussed the critical components of AIOps, such as data collection, ETL engine, AI/ML engine, observability framework, communication and workflow engine, and data governor. Furthermore, we highlighted how an AIOps solution could integrate with existing OSS and BSS solutions inside the enterprise ecosystem.

We conclude that, to find the right AIOps solution, you should consider your specific needs and requirements and select the one that integrates best with your existing OSS/BSS ecosystem.

In the coming part(s) of this series, we will dive into ready-to-use reference AIOps solution(s) with sandbox testing and real-world use-case analysis.
