Episode XVI: The Eye of the BEE-Holder

Fatih Nar · Published in Open 5G HyperCore · Jul 13, 2023

Authors: Fatih Nar (Red Hat), Abdurrahim Suslu (AIOPSONE), Eric Lajoie (Red Hat), Min Xie (Telenor), Sean Cohen (Red Hat), Dave Tucker (Red Hat), Shujaur Mufti (Red Hat)

DALL-E: Observing the world through the eye of a bee while feeling the stress & joy of a ship captain.

1.0 Introduction

Welcome to the second article in our AIOps series. In the first article (link), we covered the fundamentals of AIOps, exploring its definition, functionalities, and the benefits it brings to IT operations & enterprise business teams. Building upon that foundation, this article delves deeper into the AIOps technology stack resulting from the Red Hat, Telenor, and AIOPSONE collaboration. Here, we will provide insights into the components of the AIOps technology stack and present ready-to-use solution components backed by open-source reference projects.

Figure-1 AIOPSONE Solution Architecture

To begin our exploration, we will first focus on the critical observability data sources/backends that feed into the AIOps machinery and the related open-source technologies that can be leveraged to implement them effectively. By understanding observability data sources and technologies, you’ll see how AIOps can be built on top of observability data.

Next, we will delve into the importance of standardizing observability data collection, particularly aligning with OTel and 3GPP standards. By incorporating standardized practices, we ensure data collection compatibility, interoperability, and consistency across different network functions within 5G networks and beyond.

Finally, we will showcase real-world operational use cases that demonstrate the capabilities of AIOps. These use cases have been developed based on field-driven requirements and highlight the practical application of AI-driven insights and automation in optimizing network performance, minimizing downtime, and enhancing overall operational and business efficiency.

2.0 Fundamental Application Data

Here are the base observability data types that applications generate, gather, and store, which AIOps can then crunch, along with popular backend projects for each.

2.1 Metrics

A metric is a measurement of a service captured at runtime (i.e., time-bound informational elements). Logically, the moment of observing one of these measurements is known as a metric event that consists not only of the measurement itself but also the time it was captured and associated metadata. Application and request metrics are essential indicators of availability and performance. Custom metrics (such as 3GPP 5G CNF Metrics) can provide insights into how key performance indicators (KPI) impact the user experience or the business's success.

Metrics can come from infrastructure (CPU usage, memory usage, network IO, storage IO, etc.), the application platform (platform resource measures, utilization levels, performance counters, etc.), and/or custom metrics from your business application (i.e., tenant workloads such as 5G CNFs). Metrics are usually stored in time-series databases, and popular time-series-capable open-source technologies are:

  • Prometheus is an open-source system monitoring and alerting toolkit. It offers a multi-dimensional data model, a flexible query language, and integrates with many sources to gather metrics. In a Kubernetes (K8s) environment, Prometheus can be configured to scrape metrics from individual pods, services, etc. It primarily supports a pull model for gathering metrics but can also push metrics via the Pushgateway component. It’s widely adopted in cloud-native environments and has built-in service discovery for many types of infrastructure. However, it’s primarily designed for domain-bound operation and doesn’t natively support long-term storage or global views (multi-domain) across multiple clusters, which limits large-scale deployments.
  • Thanos is an extension of Prometheus and was designed to address some of Prometheus’s limitations, such as long-term storage, high availability, and multi-cluster monitoring. It integrates with existing Prometheus servers and can store data in object storage systems (like S3 or GCS), allowing for cost-effective long-term storage. Thanos components are designed to be highly available and to provide a global query view across all connected Prometheus servers, which can be a significant advantage in large, distributed environments.
  • InfluxDB is an open-source time series database that handles high write and query loads. It’s often used for applications in IoT, analytics, and monitoring. InfluxDB supports a SQL-like query language and can work with Grafana for visualization. It is generally easy to use and offers broad functionality; however, some advanced features require the enterprise (i.e., pay-to-use) version.
  • VictoriaMetrics is another open-source time series database that aims to provide a more cost-effective and scalable alternative to systems like Prometheus and Thanos. It can ingest data from Prometheus, Graphite, and InfluxDB, offering better performance and lower resource usage than similar systems. It can also replicate data for high availability, shard data for increased capacity, and provides long-term storage options.
  • Cortex is an open-source project that adds horizontal scalability to Prometheus. This is achieved by using a microservices architecture to distribute the responsibilities of Prometheus (like ingesting metrics, storing data, and querying data) across multiple nodes. This can significantly increase your monitoring setup’s scalability and add support for multi-tenancy, which means that different teams or customers can have separate, isolated views of their metrics. Like Thanos, Cortex supports long-term metrics storage in object stores like S3 or GCS.

[Field-View]: What we see in the field is that Prometheus is popular per K8s cluster domain; however, as noted above, it comes with domain-bound scalability and redundancy/resiliency limitations, and Thanos and Cortex fill those gaps to make Prometheus more suitable for large-scale, long-term use. A minimal instrumentation sketch follows below.
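
To make the metrics signal concrete, here is a minimal sketch of application-side instrumentation using the Python prometheus_client library (an assumption on our part; the metric names are illustrative 5G-flavored examples, not 3GPP-defined counters). A Prometheus server would scrape the exposed /metrics endpoint on its normal pull cycle.

```python
# Minimal sketch (assumptions: the prometheus_client package is installed and
# port 8000 is free): expose hypothetical custom KPIs alongside the default
# process metrics so a Prometheus server can scrape them.
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

# Hypothetical 5G CNF-style custom metrics; names are illustrative only.
SESSION_GAUGE = Gauge("amf_active_sessions", "Active registered sessions")
LATENCY_HIST = Histogram("upf_packet_latency_seconds", "Packet processing latency")

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        SESSION_GAUGE.set(random.randint(900, 1100))        # simulated session count
        LATENCY_HIST.observe(random.uniform(0.001, 0.005))  # simulated latency sample
        time.sleep(5)
```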

2.2 Logs

A log is a time-stamped text record with structured (recommended) or unstructured metadata. Of all telemetry signals, logs have the most significant legacy. Most programming languages have built-in logging capabilities, or well-known, widely used logging libraries are available (remember the famous Log4j? lol). Although logs are an independent data source, they may also be attached to spans.

Logs provide insights into what happened and when it happened in your application. They’re valuable for troubleshooting issues and understanding the sequence of events leading up to a problem. Popular logging-related technologies are:

  • Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus, designed to handle massive amounts of log data.
  • Fluentd / Fluent Bit are open-source data collectors that unify data collection and consumption. They can gather logs from many sources and deliver them to many destinations, including Elasticsearch, Cloudwatch, and Datadog.
  • Logstash is a server-side data processing pipeline that simultaneously ingests data from multiple sources, transforms it, and then sends it to a stash (like Elasticsearch). Logstash, like Fluentd, also offers a wide array of input, filter, and output plugins to customize its functionality. However, it’s generally more resource-intensive than Fluentd.
  • ELK refers to a stack composed of Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. It’s a widely used solution for log management and analytics, though it can be complex to manage at scale. It is now often referred to as the Elastic Stack since it can also include Beats, a platform for single-purpose data shippers.
  • EFK is similar to ELK but replaces Logstash with Fluentd. Fluentd is often lighter-weight and easier to manage than Logstash, making EFK a popular choice for Kubernetes logging.
  • Rsyslog is a high-performance legacy log processing system. It’s often used in traditional system logging setups and supports a wide variety of sources, transformations, and destinations. However, it is less oriented toward modern, cloud-native environments and does not provide its own storage or visualization layer the way ELK or EFK do.

[Field-View]: What we see in the field is that Fluentd is popular in infrastructure- and platform-layer software stacks; however, Loki is gaining popularity across the application DevOps community. For VNFs, we still see rsyslog used as the log delivery mechanism toward element management systems (EMS). A short structured-logging sketch follows below.
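
Since structured logs are the recommended form above, here is a minimal sketch (Python standard library only) of emitting JSON-formatted log lines that collectors such as Fluent Bit, Promtail/Loki, or Logstash can parse without fragile regular expressions; the field names are illustrative.

```python
# Minimal sketch: emit structured (JSON) logs with Python's standard logging
# module so downstream collectors can parse fields directly.
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S%z", time.localtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("upf").info("PDU session established")  # hypothetical event
```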

2.3 Events

Although a single standalone event may not be meaningful on its own from an observability perspective, events collectively can provide robust input for RCA (Root Cause Analysis) in AIOps operations. Events record notable changes in your platform/services and are typically carried over a messaging bus; popular message bus projects are:

  • Kafka is a distributed event streaming platform that can handle real-time data feeds. It can capture all types of events in a Kubernetes environment.
  • RabbitMQ is a messaging broker that accepts and forwards messages. It can be used to manage event-driven applications in a Kubernetes environment.
  • Pulsar is a scalable, high-performance publish-subscribe messaging platform that you can use as an alternative to Kafka. Its architecture separates the serving and storage layers, making it a good choice for cloud-native applications.
  • NATS is an open-source, lightweight, and high-performance messaging system. It doesn’t support the same level of message durability as Kafka or RabbitMQ, but it is straightforward to deploy and can deliver extremely high throughput with low latency.

[Field-View]: What we see in the field is that Kafka is the popular choice due to its scalability, durability, and real-time processing performance. However, NATS is gaining popularity across IT organizations due to its low resource-consumption footprint and ease of lifecycle management. A minimal event-publishing sketch follows below.
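
As a concrete illustration, here is a minimal sketch of publishing a platform event to Kafka using the kafka-python client (an assumed choice; the broker address, topic name, and event fields below are all hypothetical). An AIOps consumer could later subscribe to the same topic and correlate such events during RCA.

```python
# Minimal sketch (assumptions: a Kafka broker reachable at localhost:9092 and
# the kafka-python package installed): publish a notable-change event, not a
# full state snapshot, onto a message bus for downstream correlation.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "type": "PodRestarted",   # notable change in the platform
    "namespace": "5gcore",
    "pod": "upf-0",
    "reason": "OOMKilled",
}
producer.send("platform-events", value=event)
producer.flush()  # ensure the event is delivered before exit
```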

2.4 Traces

Tracing is a powerful tool that provides insights into the path and life of a request as it travels across a distributed system. It proves invaluable when addressing latency concerns and debugging intricate interactions between services. Here are some popular tracing tech stacks:

  1. Jaeger is an open-source, end-to-end distributed tracing system created by Uber. It tracks requests across multiple microservices in environments such as Kubernetes.
  2. Zipkin, although considered obsolete by some due to the rise of newer solutions, remains an open-source distributed tracing system that gathers the timing data needed to troubleshoot latency problems in service architectures.
  3. Grafana Tempo is an open-source, easy-to-use, and high-scale distributed tracing backend. Tempo is cost-efficient, requiring only object storage to operate, and is deeply integrated with Grafana, Prometheus, and Loki. Tempo can ingest standard open-source tracing protocols, including Jaeger, Zipkin, and OpenTelemetry.
  4. Elastic’s application performance monitoring (APM) system supports distributed tracing. It’s part of the Elastic Stack (ELK Stack), so it’s a seamless addition if you already use Elasticsearch for log storage.

[Field-View]: What we see in the field is that Jaeger is the popular choice due to its broad language support, multiple storage backends, and good integration with K8s. However, distributed tracing with OTel (see Section 5.0 for more details) is catching up with a high adoption rate. The sketch below shows the trace-context propagation that these systems automate.
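
To illustrate what a trace actually carries from hop to hop, here is a minimal, standard-library-only sketch of W3C Trace Context ("traceparent") propagation; tracing stacks such as Jaeger, Zipkin, Tempo, and the OTel SDKs generate and forward these identifiers for you.

```python
# Minimal sketch (standard library only; header layout follows the W3C Trace
# Context "traceparent" format): the trace ID is minted once at the edge and
# reused on every hop, while each hop mints its own span ID.
import secrets


def make_traceparent(trace_id: str) -> str:
    span_id = secrets.token_hex(8)        # 8-byte span ID, unique per hop
    return f"00-{trace_id}-{span_id}-01"  # version-traceid-spanid-flags (sampled)


trace_id = secrets.token_hex(16)                 # 16-byte trace ID shared by all hops
edge_header = make_traceparent(trace_id)         # edge service starts the trace
downstream_header = make_traceparent(trace_id)   # downstream service continues it

print("edge      :", edge_header)
print("downstream:", downstream_header)
```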

3.0 Advanced Telemetry Data

Here are some additional, highly valuable advanced observability data sources that AIOps can use; they may require different tools, stacks, and integration points to be utilized.

3.1 Topology/Correlation Data (Graph Data)

Topology data is a blueprint of your infrastructure/platform. It maps out the relationship and interconnections between various nodes (such as servers), services, and other resources within a network. By providing an interconnected view of all components, topology data helps you understand service dependencies and facilitates impact analysis in case a component fails or changes.

Consider a microservices-based architecture. It comprises multiple services that communicate with each other. The topology data for this environment would capture these services and the interaction paths between them. If a service fails, you can leverage topology data to identify other services that may be affected due to this failure.

Here’s a deeper dive into some potential sources for obtaining topology data:

1. The Kubernetes API serves as a treasure trove of data about the real-time state of a Kubernetes cluster. It tells you about the cluster’s nodes (or servers), the pods running on these nodes, and the services that tie them together. By harnessing this data, you can generate a topology map of your Platform-as-a-Service (PaaS) environment.

2. Similarly, the OpenStack API divulges comprehensive details about the current state of your Infrastructure-as-a-Service (IaaS) deployment. This comprises information about the compute nodes, active workloads, network configurations, storage volumes, and more. By collecting this data, you can draw up a topology map of your IaaS environment.

3. A service mesh emerges as a crucial consideration when assembling topology data for a modern, microservices-based environment. A service mesh is a dedicated service layer that manages the high volume of network-based inter-process communication among services through APIs.

4. Network fabric data includes details about your network setup, such as segmentation data, insights about network overlays, subnet details, firewall rules, etc. Such data is critical to comprehend how different resources interact and ensure optimal network security and performance.

[Field-View]: There is no winner or loser in this category; together, these sources offer a comprehensive understanding of your infrastructure/platform and how each piece interacts with the others, and all of them can be used when and where available for observability’s sake. A small Kubernetes API example follows below.
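
As an example of source (1) above, here is a minimal sketch using the official Kubernetes Python client (an assumed dependency) to derive a simple Service-to-Pod topology by matching Service selectors against Pod labels; a real topology layer would also fold in nodes, network policies, and service-mesh data.

```python
# Minimal sketch (assumptions: the kubernetes Python client is installed and a
# kubeconfig or in-cluster config is available): build a Service -> Pod map
# from the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

pods = v1.list_pod_for_all_namespaces().items
services = v1.list_service_for_all_namespaces().items

topology = {}
for svc in services:
    selector = svc.spec.selector or {}
    backends = [
        f"{p.metadata.namespace}/{p.metadata.name} (node={p.spec.node_name})"
        for p in pods
        if selector and selector.items() <= (p.metadata.labels or {}).items()
    ]
    topology[f"{svc.metadata.namespace}/{svc.metadata.name}"] = backends

for service, backends in topology.items():
    print(service, "->", backends)
```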

3.2 External Tools

Empowering AIOps machinery with data from external tools & services can significantly enhance the depth of insight and precision in handling operations. Here are some additional data sources that can be leveraged:

  • Container Monitoring Tools (Runtime Data): These specialized tools monitor container activities, looking for unusual patterns that may signify issues. They detect anomalies, enforce compliance rules, and protect against threats. Examples of such tools include Falco and Stackrox, an open-source project by Red Hat. They offer runtime security features and the ability to spot abnormal application behavior.
  • Application Performance Testing (APT): By applying performance tests and analyzing the outcomes, we can assess distributed applications’ performance, resilience, and reliability. This form of black-box testing is instrumental in detecting performance bottlenecks and potential failure points under load conditions. One of the popular open-source APT projects is DDosify.
  • Vulnerability Scanning Tools (Image Scanning Data): These specialized tools scan container images to identify known vulnerabilities. Examples include Stackrox, KubeArmor, Trivy, Clair, and Anchore. These tools ensure our container images are secure before deployment, helping prevent potential breaches and improve overall system security.
  • Network Performance Tools (Network Traffic Data): Network performance tools provide insights into the traffic between services in a Kubernetes environment. They can spot suspicious network behavior and, in some cases, prevent attacks. Tools like NetObserv Operator, Pixie, Calico, and Hubble (Cilium), which operate in Kubernetes environments, offer network policy enforcement and a certain degree of network visibility.
  • Identity and Access Management (IAM) Tools (Access Control Data): IAM tools maintain information about access controls in your Kubernetes cluster, detailing who has access to what resources. Examples include AWS IAM, Google Cloud IAM, and Azure Active Directory, all offering robust IAM capabilities on their respective platforms.
  • Compliance/Audit Tools (Compliance Data): These tools ensure that your Kubernetes configurations adhere to best practices or specific compliance frameworks. Stackrox, for instance, checks for secure deployment of Kubernetes by running checks documented in the CIS Kubernetes Benchmark. Meanwhile, Kyverno and OPA/Gatekeeper can enforce policies within your Kubernetes clusters, enhancing overall security.

3.3 Add-On Application Observability with eBPF

If an application is written without observability in mind (i.e., it does not generate its own consumable observability data), we can “inject” observability into it via:

  • The use of a service mesh, which inserts a sidecar container between an application and its incoming/outgoing traffic and collects metrics about the ingress and egress traffic in/out of the application. A service mesh does not offer application-performance-level intelligence or granular inside-the-application observability, only generic network-stack-level metrics.
  • Kernel-level application performance detail gathering with eBPF probes & maps for: capturing granular application performance data (metrics & logs), security posture analysis (sniffing), behavior modeling (low-level tracing), and advanced troubleshooting.

3.3.1 Use of eBPF for Observability

Kernel-level details about a running workload can provide valuable insights into the performance and behavior of your application that were previously difficult to access. eBPF (extended Berkeley Packet Filter) is a powerful tool that enables such capabilities.

For instance, eBPF programs like runqslower can report scheduling delays, shedding light on tasks waiting for CPU time. Another eBPF program, funclatency, can be used to instrument applications that don’t generate trace spans, revealing the average execution time of functions within a workload. Please see Figure-2 for other BPF capabilities, and the short BCC-based sketch that follows it.

Figure-2 BPF Tracing Tools (Ref: IOVISOR/BCC Link)
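
Below is a minimal sketch in the spirit of the BCC tools shown in Figure-2 (assumptions: a Linux host with the BCC toolkit and kernel headers installed, run as root). It attaches a kprobe to the execve() syscall and counts calls per PID in an eBPF map, which is exactly the kind of kernel-side data a userspace exporter could turn into metrics.

```python
import time

from bcc import BPF

# Kernel-side program: count execve() calls per PID in a BPF hash map.
bpf_text = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(exec_count, u64, u64);

int trace_execve(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid() >> 32;
    exec_count.increment(pid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_execve")

print("Counting execve() calls for 10 seconds ...")
time.sleep(10)

# Userspace side: read the eBPF map and print (or export) the counts.
for pid, count in b["exec_count"].items():
    print(f"pid={pid.value} execve_calls={count.value}")
```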

To simplify the management and monitoring of eBPF programs, bpfd has been developed. bpfd facilitates the loading, unloading, modifying, and monitoring of eBPF programs on individual hosts and Kubernetes clusters. It consists of the following core components:

  1. Daemon-Set: This system daemon supports the loading, unloading, modification, and monitoring of eBPF programs through a gRPC API.
  2. eBPF CRDs: bpfd provides Custom Resource Definitions (CRDs) such as XdpProgram and TcProgram, which allow expressing intent to load eBPF programs. It also includes a bpfd-generated CRD (BpfProgram) that represents the runtime state of loaded programs.
  3. bpfd-agent: This agent runs in a container within the bpfd daemon set and ensures that the requested eBPF programs for a given node are in the desired state.
  4. bpfd-operator: An operator built with the Operator SDK manages the installation and lifecycle of bpfd-agent and the CRDs within a Kubernetes cluster.
Figure-3 eBPF Probes-Based Kubernetes Application Observability

The solution offers several benefits, including enhanced security and improved visibility and debuggability:

  • Security: Only the tightly controlled bpfd daemon has the privilege to load eBPF programs, improving security. Access to the API can be controlled using standard RBAC methods, and administrators have control over program loading permissions.
  • Visibility & Debuggability: The solution provides better visibility into the eBPF programs running on a system, enhancing debugging capabilities for developers, administrators, and customer support. Even if not all applications use bpfd, it still offers visibility into all eBPF programs loaded on cluster nodes.
  • Multi-program Support: The solution supports the coexistence of multiple eBPF programs from different users. It utilizes the libxdp multiprog protocol, allowing multiple XDP programs on a single interface. This protocol is also supported for TC programs, providing a unified multi-program experience across both TC and XDP.
  • Productivity: The solution simplifies the deployment and lifecycle management of eBPF programs within Kubernetes clusters. Developers can focus on program logic, while bpfd handles program lifecycle tasks such as loading, attaching, and pin management. Existing eBPF libraries can still be used for development and interaction with program maps via well-defined pinpoints managed by bpfd.
  • eBPF Bytecode Image Specifications: bpfd provides eBPF Bytecode Image Specifications, enabling fine-grained versioning control for userspace and kernelspace programs. It also allows for signing these container images to verify bytecode ownership. The gathered information from kernel space probes can be written into eBPF maps, which userspace processes can export, transform, and load into the desired backend/sink (please refer to section 5.0 on open telemetry for data collector and exporter mechanisms).

By leveraging bpfd with eBPF application probes, we can gain valuable insights and streamline the management of eBPF programs within Kubernetes clusters.

3.4 Hardware Observability

Regarding hardware observability, two primary approaches can be employed. Each approach offers unique advantages and allows retrieving observability data from different sources. These approaches are:

  • Direct Communication with Hardware Baremetal Controller: One way to achieve hardware observability is by directly communicating with the hardware’s bare-metal controller. This involves leveraging well-defined and supported communication protocols.
  • Retrieving Host Operating System Data: Another approach to hardware observability involves retrieving data from the host operating system through system calls. The host operating system collects and exposes various metrics and information about the underlying hardware.

3.4.1 Bare Metal Level

Redfish (when and where properly available, i.e., compliant with the relevant specifications) provides a standardized, vendor-neutral REST API for hardware-level observability in modern data centers and server environments. The protocol manages distributed, converged, or software-defined resources and infrastructure.

We can leverage the Redfish API to collect real-time telemetry data from hardware components and gain insights into their health, performance, and operational status. The Redfish API offers a rich set of capabilities for hardware observability, including:

  1. Sensor Data Collection: The Redfish API allows the retrieval of sensor data from various hardware components, such as temperature sensors, power sensors, fan sensors, and voltage sensors. This data provides valuable information on the operating conditions of the hardware, enabling proactive monitoring, alerting, and predictive maintenance (see the short sketch at the end of this subsection). Caveats can exist with active collection of sensor data from a baseboard management controller (BMC). BMC access also needs to be restricted due to operational security needs, often resulting in a layer boundary that should be approached on a case-by-case basis.
  2. Events: Hardware events, such as component failures, power supply changes, or system restarts, can be captured and retrieved using the Redfish API. These event data provide a historical record of critical hardware-related incidents, facilitating troubleshooting, analysis, and compliance reporting.
Figure-4 Redfish Event Relay Data Flow

We can leverage an event relay approach (a solution to be implemented on the application platform) to subscribe applications running in your K8s cluster to events generated on the underlying bare-metal server, where the Redfish service publishes events on a node and transmits them to subscribed applications on an advanced message queue.

  3. Power Management: The Redfish API enables control and monitoring of power-related aspects of hardware, including power supply status, power consumption, and power capping. This allows organizations to optimize power usage, identify energy inefficiencies, and ensure stable power delivery to critical components.
  4. Inventory and Asset Information: Through the Redfish API, organizations can retrieve detailed inventory information about hardware assets, including manufacturer details, serial numbers, firmware versions, and component configurations. This data aids in asset tracking, warranty management, and lifecycle planning.

While the Redfish API is a widely agreed standard for hardware observability, other protocols and interfaces exist. For example, Intelligent Platform Management Interface (IPMI) is another widely used hardware management and monitoring interface.
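
To illustrate the sensor-data capability, here is a minimal sketch that reads chassis temperature readings over Redfish using the Python requests library (assumptions: the BMC address and credentials below are hypothetical, and resource paths follow the DMTF schema but can vary by vendor and firmware version).

```python
# Minimal sketch: read temperature sensors from the first chassis exposed by a
# Redfish-capable BMC. Address, credentials, and paths are illustrative.
import requests

BMC = "https://10.0.0.10"    # hypothetical BMC address
AUTH = ("admin", "password")  # hypothetical credentials
VERIFY_TLS = False            # many BMCs ship self-signed certificates

session = requests.Session()
session.auth = AUTH
session.verify = VERIFY_TLS

chassis_collection = session.get(f"{BMC}/redfish/v1/Chassis").json()
first_chassis = chassis_collection["Members"][0]["@odata.id"]

thermal = session.get(f"{BMC}{first_chassis}/Thermal").json()
for sensor in thermal.get("Temperatures", []):
    print(sensor.get("Name"), sensor.get("ReadingCelsius"), "C")
```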

3.4.2 Operating System Level

Another approach to hardware telemetry collection involves gathering data within the operating system (OS) using the following components.

The libsensors library (part of the lm-sensors project) provides an interface to access raw sensor data that the Linux kernel exposes through the sysfs interface.

By leveraging the libsensors library, we can extract hardware telemetry data directly from the sensors present in the system. This includes temperature sensors, fan speed sensors, voltage sensors, and other relevant hardware sensors. The library provides an abstraction layer that simplifies the process of retrieving sensor data from various hardware components.

The Prometheus node exporter is a specialized component that exposes system-level metrics and data for external consumption. The node exporter includes a collector known as “hwmon,” specifically designed to collect hardware telemetry. This collector interacts with the libsensors library to gather sensor data from the system. It retrieves the raw telemetry values and converts them into Prometheus-compatible metrics, which Prometheus can scrape for further analysis, visualization, and alerting.

By combining the kernel libsensors library with the Prometheus node exporter’s hwmon collector, we can effectively collect and expose hardware telemetry data within our observability infrastructure. This approach enables us to monitor and analyze critical hardware metrics, such as temperature, fan speed, and voltage, alongside other system-level metrics.

Figure-5 Prometheus Node Exporter HWMON (Ref: Link)
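
For intuition about what libsensors and the hwmon collector read under the hood, here is a minimal sketch that walks the standard /sys/class/hwmon tree directly with Python’s standard library (assumption: a Linux host that exposes this sysfs interface).

```python
# Minimal sketch: read raw temperature sensors from /sys/class/hwmon, the same
# data source that libsensors and the node exporter's hwmon collector build on.
from pathlib import Path

for hwmon in sorted(Path("/sys/class/hwmon").glob("hwmon*")):
    chip = (hwmon / "name").read_text().strip()
    for temp_file in sorted(hwmon.glob("temp*_input")):
        label_file = hwmon / temp_file.name.replace("_input", "_label")
        label = label_file.read_text().strip() if label_file.exists() else temp_file.stem
        millidegrees = int(temp_file.read_text().strip())  # values are in millidegrees C
        print(f"{chip}/{label}: {millidegrees / 1000:.1f} C")
```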

Project Kepler (Kubernetes-based Efficient Power Level Exporter) is a platform energy telemetry observation solution that also leverages eBPF technology (background provided in Section 3.3) to probe performance counters and kernel tracepoints, which encompass hardware telemetry as well.

Figure-6 Kepler Architecture

4.0 Data Processing / Data Pipelines

Creating an ETL (Extract, Transform, Load) layer with sub-layers like data cleansing, data segregation, data anonymizer, data correlation, data mesh, data enrichment, and data governance is crucial to handling large amounts of data. Here is how each sub-layer can be fulfilled using various open-source projects and popular products (a toy end-to-end sketch follows the list):

(I) Data Cleansing: This involves removing or correcting errors in the data, filling in missing values, removing duplicates, and validating the accuracy of the data.

  • OpenRefine is a powerful tool for working with messy data, cleaning it, transforming it from one format into another, and extending it with web services or external data.

(II) Data Segregation involves separating data based on specific criteria or rules.

  • Apache NiFi is a robust data ingestion and distribution framework that allows you to segregate and route data based on various attributes.

(III) Data Anonymizer ensures that sensitive data is anonymized to protect privacy.

  • ARX Data Anonymization tool is open-source software for anonymizing sensitive personal data.

(IV) Data Correlation involves discovering relationships or patterns across different fields.

  • Elasticsearch with the Kibana visualization tool can provide robust data correlation features.

>> Korrel8r, an advanced correlation engine, excels in navigating relationships and uncovering related data across multiple heterogeneous stores. It harnesses a comprehensive set of rules that define intricate connections between objects and signals. With a designated start object, such as an Alert within a cluster, and a specified goal, like “finding related logs,” Korrel8r dynamically scours through the available data, tracing a chain of rules to identify and retrieve the goal data associated with the start object. Korrel8r efficiently establishes connections and uncovers relevant insights through its intelligent rules-based approach.

(V) Data Mesh involves creating a distributed data architecture where teams can access data from where it’s naturally produced, promoting greater agility and scalability.

  • There is no single specific tool for this, as it’s more of an architectural principle. However, tools like Istio can help implement a data mesh architecture.

(VI) Data Enrichment involves improving raw data by linking it with additional data sources.

  • Apache Nifi has processors that can call out to external APIs to enrich data in the flow.

(VII) Data Governance involves managing the availability, usability, integrity, and security of data in systems.

  • Apache Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop.
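
To tie a few of these sub-layers together, here is a toy, plain-Python sketch of one ETL pass over log-like records, combining cleansing (dropping malformed and duplicate rows), anonymization (hashing subscriber identifiers), and enrichment (attaching site metadata); every field name and rule is illustrative.

```python
# Toy ETL sketch: cleansing, anonymization, and enrichment on log-like records
# before handing them to an AIOps pipeline. All names/rules are illustrative.
import hashlib

SITE_METADATA = {"edge-01": {"region": "north", "vendor": "acme"}}  # enrichment source


def etl(records):
    seen, output = set(), []
    for record in records:
        if not record.get("timestamp") or not record.get("imsi"):
            continue  # cleansing: drop malformed rows
        key = (record["timestamp"], record["imsi"], record.get("event"))
        if key in seen:
            continue  # cleansing: drop duplicates
        seen.add(key)
        record = dict(record)
        record["imsi"] = hashlib.sha256(record["imsi"].encode()).hexdigest()[:16]  # anonymize
        record.update(SITE_METADATA.get(record.get("site"), {}))                   # enrich
        output.append(record)
    return output


print(etl([
    {"timestamp": "2023-07-13T10:00:00Z", "imsi": "001010000000001",
     "event": "attach_failure", "site": "edge-01"},
]))
```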

5.0 Open Telemetry (OTel)

Let’s pause here to close a gap that we have not touched on previously: the “standardization and instrumentation” of data collection, processing, and sharing (export), which is exactly what OpenTelemetry (OTel) was born for!

Figure-7 OTel Collection

OpenTelemetry is an open-source observability framework that provides a set of APIs, libraries, and tools for collecting, transmitting, and processing telemetry data from distributed systems. It aims to standardize the collection and instrumentation of telemetry data, making it easier to observe, monitor, and debug complex applications regardless of which programming language you are using. The full power of OpenTelemetry gets unleashed when processors sample, filter and transform the telemetry data.

Figure-8 OTel Pipeline Flow

OTel benefits:

1. Standardization: It provides a vendor-neutral, community-driven standard for instrumentation and telemetry data collection across different programming languages, frameworks, and platforms. A very active community is working behind this standardization. Distributed traces have already reached stable status, and the other signals are getting closer to the same level of maturity day by day.

2. Distributed Tracing: It enables distributed tracing, which involves capturing and correlating trace information across multiple services or microservices. This helps identify the path of requests as they traverse different system parts, providing end-to-end visibility and allowing for performance optimization and troubleshooting.

3. Metrics Collection: It supports the collection of metrics, which are quantitative measurements of the system’s behavior and performance. By instrumenting applications with OpenTelemetry, developers can collect metrics related to CPU usage, memory consumption, latency, error rates, and other relevant indicators. These metrics provide valuable insights into system health and performance.

4. Integration with Logging and APM: It can be integrated with logging systems and Application Performance Monitoring (APM) tools. This integration enables the correlation of logs, traces, and metrics, providing a holistic view of system behavior and allowing for deeper analysis and debugging.

5. Flexibility and Extensibility: It provides flexibility regarding the choice of collectors, exporters, and backends. It supports various telemetry data formats and can be integrated with monitoring and observability systems. It also allows custom instrumentation and extensions to meet specific monitoring requirements. OpenTelemetry also provides SDKs to instrument the code and even auto-instrument it in a long (and growing) list of programming languages.
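
To make the pipeline in Figure-8 concrete, here is a minimal sketch using the OpenTelemetry Python SDK with the OTLP/gRPC exporter (assumptions: the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed and an OTel Collector is listening on localhost:4317; the service and attribute names are illustrative).

```python
# Minimal sketch: create a parent/child span and export it over OTLP to a
# Collector, which can then process and forward it to a tracing backend.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "upf-sim"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("upf-sim")

with tracer.start_as_current_span("handle-pdu-session") as parent:
    parent.set_attribute("slice.id", "embb-01")  # illustrative attribute
    with tracer.start_as_current_span("query-session-db"):
        pass  # downstream work would be traced here

provider.shutdown()  # flush spans before exit
```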

6.0 3GPP’s Take On Observability

3GPP (3rd Generation Partnership Project) has defined a framework for observability in 5G networks, specifically for the Service-Based Architecture (SBA). The SBA is a critical architectural component of 5G networks that enables flexible service deployment and scalability. Observability in this context refers to the ability to gather, analyze, and gain insights from various data sources within the 5G SBA to ensure efficient network operation and performance. Compared to 4G, 5G Networks extensively use network slicing, requiring additional capabilities in the AIOps framework to consolidate KPIs at the slice level to constitute a single output for a network slice.

The observability solution for 5G SBA is based on a set of specifications defined by 3GPP. These specifications outline the protocols, interfaces, and mechanisms to enable observability within the 5G network. 3GPP has introduced Network Data Analytics Function (NWDAF), which collects the network data from other network functions and performs data analysis to help improve 5G network management automation. The NWDAF collects data from the 5G Network and provides analytics to support network automation, closed-loop operations, self-healing, experience improvement, and reporting. NWDAF helps automate and deliver network and slicing optimization, cost efficiency, and resource management to meet SLAs and quality-of-service guarantees. Utilizing intelligent modeling on refined data streams and ingested statistics, it supports a broad spectrum of analytics capabilities ranging from threshold-driven to more sophisticated algorithm-based analytics.

Critical aspects of the 3GPP TS 23.288 NWDAF-based observability framework include:

1. Data Collection Coordination Function (DCCF): Data Collection Coordination and Delivery coordinates the collection and distribution of data requested by NF consumers. It prevents data sources from having to handle multiple subscriptions for the same data and from sending multiple notifications containing the same information due to uncoordinated requests from data consumers. The DCCF may collect the analytics and deliver it to the NF, or the DCCF may rely on a messaging framework to collect analytics and deliver it to the NF. A DCCF can support multiple data sources, data consumers, and messaging frameworks.

Figure-9 3GPP Custom Data Bus Architecture for 5G

2. Analytics Data Repository Function (ADRF): The 5G System architecture allows the ADRF to store and retrieve the collected data and analytics. The following options are supported:

  • ADRF exposes the Nadrf service for storing and retrieving data by other 5GC NFs (e.g., NWDAF), which access the data using Nadrf services.
  • Based on the NF request or configuration on the DCCF, the DCCF may determine the ADRF and interact directly or indirectly with the ADRF to request or store data. The interaction can be direct or indirect via a messaging framework.

3. Messaging Framework Adaptor Function (MFAF): MFAF offers 3GPP-defined services that allow the 5GS to interact with the Messaging Framework. Internally, the Messaging Framework may support the pub-sub pattern, where received data are published to the Messaging Framework, and requests from 3GPP Consumers result in Messaging Framework specific subscriptions. Alternatively, the Messaging Framework may support other protocols required for integration.

7.0 Observability Solution Architecture Samples

In our previous discussions, we have explored different aspects of observability data characterization, including its collection, storage, processing, and sharing across various operational landscapes. We will now illustrate these concepts with architectural diagrams that exemplify common approaches.

[A] Reactive Approach: Incidents occur frequently, and we should focus on learning from them using the resources at hand. This approach incorporates the concept of a message bus, where observability data is reported and carried over a best-effort delivery system. Listeners can then process the data based on availability.

Please note, as we covered in Section 2.3 Events, that message bus frameworks are designed to pass information about notable changes at the end systems, not necessarily complete state information.

Figure-10 Reactive Use of Observability

In Figure-10, we present a sample solution for observability data collection using a message bus-driven approach. It integrates various open-source backends, following a bottom-up approach (network-fabric -> server-hardware -> application-platform -> applications), and forwards the collected data to a central Operations Support Systems (OSS) umbrella system via a message bus. This approach aligns with the 3GPP approach depicted in Figure-9.

[B] Proactive Approach: The key is to have a substantial volume of data (big data) delivered with high velocity (fresh data) and encompassing variety (data from multiple endpoints). To achieve this, we should implement high-throughput and low-latency data stream pipelines that transfer data from applications to data sinks, from data backends to data pipelines, and from data pipelines to AIOps machinery. This enables proactive operational management based on real-time (or near-real-time) data-driven insights.

Figure-11 Proactive Use of Observability

Figure-11 illustrates a dynamic and data-focused observability solution topology. It utilizes different technology stacks, such as RedFish, Victoria, Icinga, Loki, and Logstash, to provide comprehensive data collection and processing. The valuable and actionable data is then shared with external OSS platforms. This solution continually observes system states, offering proactive management capabilities.

[Field-View]: Both reactive and proactive approaches have been implemented by various service providers, and neither approach is inherently better than the other by definition. The choice of solution design depends on specific operational goals to be achieved.

However, from a data collection framework perspective, we suggest several viable and future-proof options that leverage the power of the community and the freedom of open source. This allows us to easily select data backends and data processing pipelines that best suit different needs and circumstances:

  • Direct Collection from CNFs: CNF/VNF observability data can be collected directly from the CNFs using OTel’s OpenTelemetry Protocol (OTLP). This enables direct data ingestion from CNFs to a designated data backend (e.g., AWS S3 with per CNF buckets and isolated access credentials) via the OTel exporter.
  • Collection via OTel Collector: Alternatively, CNF observability data can be sunk to a cluster-scoped time-series backend (e.g., Cortex with tenant isolation) and then collected by an OTel Collector, which aggregates, processes, and exports the data to a central observability/AIOps platform.
  • Utilizing eBPF: Another option is to employ eBPF to collect observability data directly from CNFs/VNFs. This requires the CNF/VNF vendor (or a platform vendor) to provide configurable eBPF probes that collect and export data in an OTel-compliant format.

By leveraging OTel, our proposed approach enables efficient data collection, processing, and export of CNF/VNF observability data. It offers flexibility in data collection, while the OTel components handle aggregation, transformation, and integration with the central observability/AIOps platform.

These options align with modern practices in observability, utilizing standard protocols and components to achieve efficient and scalable data collection from CNFs/VNFs.

8.0 Multi-Site Observability + AIOps Insights Machinery

Red Hat, Telenor, and AIOPSONE have been collaborating to address the following:

  • Full stack (Network, Compute, Storage, OS, Platform, Application) observability, pre-integrated with Thanos, InfluxDB, VictoriaMetrics, Loki backend, and built-in Kafka client to receive published events.
  • Multi-Site single-pane dashboard with Openshift, K8s, and OpenStack API support.
  • Embedded visualization and on-demand data query & filtering with Grafana Dashboards and embedded native visualization capabilities.
  • ML/AI-driven insights generation with event-driven workflow triggers for action resolutions, pre-integrated with Ansible Tower.
Figure-12 AIOPSONE Single Pane Experience

We have worked on the following widespread operational problems with possible solution approaches enabled by AIOps:

1. Root Cause Analysis (RCA) with network observability

In our 5GCore Test Bed, we intentionally introduced a failure or bottleneck in the network fabric to simulate real-world scenarios. This results in anomalies within the Key Performance Indicators (KPIs) of the 5GCore Cloud Native Functions (CNFs), along with various reason indications. Here, the AIOPSONE engine showcases its capabilities by performing Root Cause Analysis (RCA) and accurately identifying the source of the problem.

Key Highlights:

  • Intentional network fabric failure/bottleneck simulation for realistic testing.
  • Anomalies observed in 5GCore CNF KPIs with exact reason indications.
  • AIOps engine performs accurate RCA to identify the root cause of the issue.

Key Data Sources:

  • OTel Metrics for platform network performance measurements.
  • Application data collected via eBPF probes.
  • NEP Performance KPI(s) via Rsyslogs.
  • OS network interface & device logs and metrics via Node Exporter.
Figure-13 AIOPSONE Prediction / Possible RCA for Network-Fabric Related Issue (MTU Misconfig)

Issue Resolution:

  • Such insight (Figure-13) enables quick troubleshooting and problem solving (e.g., using Ansible network fabric modules to configure the problematic hops accordingly), minimizing downtime and optimizing network performance.

2. VNF/CNF Performance Bottlenecks Analysis

In this scenario, we focused on performance bottlenecks after upgrading to a new release of the 5G CNF. Some CNF PODs start consuming excessive CPU/memory resources, leading to unforeseen issues reported within the CNF OTel data. The AIOPSONE engine leverages this data to perform RCA analysis, identifying the POD version change (i.e., release change) as the cause of the bottleneck.

Key Data Sources:

  • OTel Metrics for platform network performance measurements.
  • Application data collected via eBPF probes.
Figure-14 AIOPSONE Prediction/Possible-RCA for CNF Performance/Behaviour

Issue Resolution:

  • Such insight (Figure-14) enables Ansible Tower to execute an intelligent rollback to the previous POD image version, eliminating the bottleneck and ensuring stable performance. In addition, the application deployment manifest was updated by Ansible Lightspeed, using GenAI, to scale the affected CNF POD by increasing its replica-set (RS) size with optimal resource allocation.

3. Hardware Failure Prediction: Analyzing IaaS observability data and performing predictive maintenance modeling

In this use case, we analyzed the received IaaS observability data to predict hardware failure patterns and create resolution paths. By utilizing advanced predictive maintenance models, our AIOPSONE engine can proactively identify potential hardware failures and provide resolution paths to mitigate risks.

Key Highlights:

  • Analysis of IaaS data enables predictive hardware failure pattern recognition.
  • Proactive identification of potential hardware failures for timely resolution.
  • Predictive maintenance modeling reduces downtime and enhances network reliability.
  • Resolution paths were provided to address identified hardware issues efficiently.
Figure-15 AIOPSONE Prediction/Possible RCA for Hardware Failure

Key Data Sources:

  • Hardware Telemetry (mainly sensor data collection and event logging) over Redfish.
  • OS device metrics and logs via Node Exporter.

Issue Resolution:

  • Such insight (Figure-15) enables the execution of an external workflow order (e.g., via Ansible ServiceNow modules) for hardware replacement in a specific data center and server enclosure.

9.0 Summary

This article delves into various aspects of observability and AIOps, exploring different technology stacks and implementation approaches. We emphasized the importance of data collection and highlighted vital sources such as metrics, logs, events, traces, and topology data. Open-source technologies like Prometheus, Thanos, Loki, Fluentd, Kafka, and Jaeger were discussed for data collection, processing, and enrichment.

Standardization and instrumentation in data collection led us to the OpenTelemetry framework, which provides a standardized approach for collecting, transmitting, and processing telemetry data from distributed systems. This framework enhances the observability and monitoring of complex applications.

The 3GPP’s approach to observability in 5G networks, specifically the Service-Based Architecture (SBA) and the Network Data Analytics Function (NWDAF) framework, was explored. NWDAF standardizes data collection and management from user equipment, network functions, and operations and maintenance systems within the 5G network.

For application add-on observability, eBPF was highlighted as a powerful tool for injecting observability into applications. The bpfd software stack simplifies the management and monitoring of eBPF programs, providing enhanced security, visibility, multi-program support, and improved productivity.

In terms of hardware observability, two significant approaches were discussed. Direct communication with the bare-metal hardware controller using protocols like Redfish enables real-time telemetry data collection, while retrieving host operating system data through system calls and libraries like libsensors and the hwmon collector provides insights into hardware behavior.

The collaboration between Red Hat, Telenor, and AIOPSONE in addressing operational challenges and implementing AIOps solutions was showcased, highlighting the power of AI-driven insights and automation in optimizing network performance and reducing downtime.

In conclusion, the collaboration between Red Hat, Telenor, and AIOPSONE, along with open-source projects and standardization efforts like OpenTelemetry and bpfd, empowers organizations to collect, process, and analyze data for improved observability and decision-making without vendor lock-in. Striving for fully open-source solutions with scalability, resilience, and efficiency remains a crucial objective.
