Expanding the Observable Universe🌌 (or Scalable Model-Driven Telemetry with SR Linux Custom Agents and gNMI)

Published in Geek Culture, Sep 20, 2021

The Observable Universe [source: Wikipedia]

Compared to the universe, a data center represents a relatively small and well-organized place. These artificial digital systems designed by human engineers have a relatively small set of states they can be in, and an equally small set of orchestrated behaviors that are entirely predictable and reproducible. In theory, at least.

In practice, things are not quite that simple. Sure, there is the Simple Network Management Protocol (SNMP), from RFC 1098 (1989) to v3 as defined in RFC 3414 (2002); it provides a model-driven telemetry framework (“avant la lettre”, one could say) through its standardized MIBs, with management objects organized in tree structures. Notification events (called “traps”) can be sent using UDP, with a mix of generic and vendor-specific OIDs (the tmnxDyingGasp trap, for example). And even though the protocol is aging and has many known issues and drawbacks (such as the lack of security/encryption before v3, limited filtering and data retrieval options, use of unreliable transport, and inconsistent encoding between versions), it is still commonly used for network monitoring today.

Streaming Telemetry

More recently, the internet community introduced new concepts like “streaming telemetry” and OpenConfig, alongside the IETF's NETCONF (2011) and RESTCONF (2017) protocols to manage YANG (2016) data models. In parallel, Google open sourced gRPC (2015) and the gRPC Network Management Interface (gNMI) as an alternative to NETCONF. Major network vendors started supporting these concepts and protocols in network devices about 5 years ago, around 2016 (Nokia SROS 16.0R1, Juniper JunOS 16.1R3, Cisco IOS XR 6.0.0, Arista EOS 4.20.1F).

At a high level, streaming telemetry is said to provide the following benefits over “traditional” SNMP-based monitoring:

More efficient collection (with less impact on device CPU)
Higher granularity/quality data
More accurate and real-time (proactive “push” versus reactive “pull”, collected data includes timestamps)
More secure (better authentication, encryption)

These benefits are all associated with having newer protocols, management architectures and hardware support, and all vendors are quick to claim them by virtue of having implemented gNMI protocol support and YANG models. However, not all implementations are equal — and the differentiators are in the details.

Periodic sampling versus “on change” events

gNMI supports two types of streaming retrieval methods: SAMPLE and ON_CHANGE. The former periodically sends updates, with an interval defined in nanoseconds(ns) — though typical usage scenarios today are measured in seconds. SAMPLE is a direct replacement for SNMP polling. ON_CHANGE only sends an update when something changes, enabling event-driven responses to alarming situations. This is more similar to SNMP traps, but much more powerful — when implemented properly. The question is: What constitutes an “alarming” change, and how can one determine this efficiently?
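To make the difference concrete, here is a minimal Python sketch that simulates both modes over the same series of readings. It does not talk to a real gNMI target: the function names (`sample_updates`, `on_change_updates`) and the CPU data are hypothetical, and the only detail borrowed from gNMI is that the sample interval is expressed in nanoseconds.

```python
from typing import List, Tuple

Reading = Tuple[int, int]  # (timestamp in nanoseconds, value)
NS_PER_SECOND = 1_000_000_000


def sample_updates(readings: List[Reading], interval_ns: int) -> List[Reading]:
    """SAMPLE-like: emit the most recent value once per interval."""
    if interval_ns <= 0:
        raise ValueError("interval_ns must be positive")
    if not readings:
        return []
    updates: List[Reading] = []
    idx = 0
    tick = readings[0][0]
    end = readings[-1][0]
    while tick <= end:
        # Advance to the latest reading taken at or before this tick.
        while idx + 1 < len(readings) and readings[idx + 1][0] <= tick:
            idx += 1
        updates.append((tick, readings[idx][1]))
        tick += interval_ns
    return updates


def on_change_updates(readings: List[Reading]) -> List[Reading]:
    """ON_CHANGE-like: emit a value only when it differs from the last one sent."""
    updates: List[Reading] = []
    for ts, value in readings:
        if not updates or updates[-1][1] != value:
            updates.append((ts, value))
    return updates


# Hypothetical CPU utilization (%), one reading per second, with a 1-second spike.
cpu = [10, 10, 10, 95, 10, 10, 10, 10, 10, 10, 10]
readings = [(t * NS_PER_SECOND, v) for t, v in enumerate(cpu)]

sampled = sample_updates(readings, interval_ns=5 * NS_PER_SECOND)
changed = on_change_updates(readings)

print([v for _, v in sampled])  # values seen with a 5 s sample interval
print([v for _, v in changed])  # values seen with on-change reporting
```

With this data, the 5-second sampler reports 10 three times and never sees the spike, while the on-change stream reports 10, 95, 10 and carries the timestamp of the spike itself.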

Traditional monitoring can miss events that conditional on-change reporting captures

Custom conditional “on change” alerts using agents

Threshold Crossing Alerts (TCAs) are a well-known concept in system monitoring. Some variables (e.g. those related to resource usage, like CPU/disk utilization, memory, buffers) change frequently, but most changes are irrelevant until they reach a certain limit. System vendors can easily incorporate features to send SNMP traps or gNMI events when such limits are crossed, and even make the limits user configurable.

To illustrate how this can be insufficient, consider a situation we once faced in our Nokia labs in Ottawa. The management network was using a pair of redundant switches running VRRP, but every so often the gateway would stop responding and we would lose connectivity. Upon investigation, it was found that a high level of multicast traffic on an unrelated port could overwhelm the CPU — the very same one that was processing VRRP for us.

When troubleshooting a situation like this, it is not sufficient to just stream CPU utilization samples every X seconds and respond if they cross a threshold. Even when gNMI ON_CHANGE is supported (and some vendors don’t support it), by the time a central monitoring system has received the event and decided to get stats on some other system component, there’s a good chance things have changed, and you no longer see the issue. There is no observability.

To address such gaps, a modern monitoring architecture benefits from — if not “requires” — having the ability to run custom logic/software on each node. These software “agents” can:

Implement complex, custom logic that aggregates a set of arbitrary conditions (“report CPU utilization changes as they happen, but only those over 80%; include statistics on multicast traffic queues at that moment”).
Respond to local conditions in a timely fashion, quicker than any centralized system could
Be implemented independently by anyone, not only as part of a vendor’s release cycle
Scale along with the deployment (more nodes=more agent processing capacity)
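The rule quoted in the first point of this list can be sketched in a few lines of Python. This illustrates the logic only and is not the SR Linux agent API: `ThresholdReporter`, `read_multicast_queue_stats` and the report layout are hypothetical names, and the 80% threshold is simply the one from the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ThresholdReporter:
    """Report CPU changes above a threshold, with locally captured context."""
    threshold: float
    collect_context: Callable[[], Dict[str, int]]
    last_reported: Optional[float] = None
    reports: List[dict] = field(default_factory=list)

    def on_cpu_sample(self, cpu_percent: float) -> Optional[dict]:
        if cpu_percent <= self.threshold:
            # Below the threshold nothing is sent; forget the last value so
            # that the next crossing is reported again.
            self.last_reported = None
            return None
        if cpu_percent == self.last_reported:
            return None  # unchanged, so no new event
        # Capture related state on the node at the moment of the event.
        report = {"cpu_percent": cpu_percent, "context": self.collect_context()}
        self.last_reported = cpu_percent
        self.reports.append(report)
        return report


# Hypothetical stand-in for reading multicast queue counters on the node.
queue_stats = {"multicast_queue_drops": 0}


def read_multicast_queue_stats() -> Dict[str, int]:
    return dict(queue_stats)


reporter = ThresholdReporter(threshold=80.0,
                             collect_context=read_multicast_queue_stats)
reporter.on_cpu_sample(50.0)        # below threshold: nothing reported
queue_stats["multicast_queue_drops"] = 1234
print(reporter.on_cpu_sample(92.5)) # above threshold: report with context
```

The point of the design is that the context is collected in the same process and at the same moment as the trigger, rather than after a round trip to a central collector.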

Troubleshooting modern software defined networks requires flexible and advanced distributed monitoring techniques, because failure conditions are exceptional by nature and cannot be anticipated. There is a need for programmable, custom logic that can be triggered on arbitrary conditions, and collect any data necessary to validate any possible hypothesis on what might be going on.

Custom agents: Edge computing for data centers

Similar to the ideas behind Multi-access Edge Computing (MEC), the local processing enabled by custom software on network nodes reduces the latency between the moment some event occurs, and the moment a management application becomes aware of it. This increases the range of possible applications, the amount of data that can be collected, and the level of detail and timeliness of state information retrieved upon occurrence.

In troubleshooting scenarios for example, this can make the difference between observing a certain issue and reaching conclusions about possible correlations and mitigations, or not seeing it. Note that this does not eliminate the need for centralized monitoring — far from it. In essence, local agents augment the capabilities of the centralized monitoring system, and supply it with more detailed, more accurate, more tailored and more timely information.

Prototype: The SRL Docter Agent

This SRL Docter Agent project on GitHub provides an example implementation to help you get started. It illustrates the following use cases:

Generic ON_CHANGE monitoring of summarized data/control plane health, composed of configurable contributing elements

Monitoring changes in BGP paths as part of CONTROL_PLANE health

Timely reporting of arbitrary system state upon programmable conditions (the CPU threshold crossing example illustrated above)

Per-application CPU usage reporting upon threshold crossing

Scalable local SAMPLE monitoring over a longer time period, feeding a central ON_CHANGE dashboard

Reporting unbalanced uplink usage as ON_CHANGE events for DATA_PLANE health

Visualizing custom health issues reported by agents
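As a rough illustration of the unbalanced-uplink use case, the sketch below samples per-uplink traffic locally and only emits an event when the balanced/unbalanced state flips. It is not taken from the agent's code: the imbalance metric, the `max_ratio` parameter, the event strings and the interface names are all assumptions made for the example.

```python
from typing import Dict, Optional


def imbalance_ratio(uplink_bps: Dict[str, float]) -> float:
    """Spread between busiest and least busy uplink, relative to the mean.

    0.0 means perfectly balanced (or no traffic at all).
    """
    if not uplink_bps:
        return 0.0
    values = list(uplink_bps.values())
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return (max(values) - min(values)) / mean


class UplinkBalanceMonitor:
    """Turn locally sampled uplink rates into on-change health events."""

    def __init__(self, max_ratio: float) -> None:
        self.max_ratio = max_ratio
        self.unbalanced = False

    def update(self, uplink_bps: Dict[str, float]) -> Optional[str]:
        now_unbalanced = imbalance_ratio(uplink_bps) > self.max_ratio
        if now_unbalanced == self.unbalanced:
            return None  # state unchanged: nothing to send upstream
        self.unbalanced = now_unbalanced
        if now_unbalanced:
            return "DATA_PLANE: uplinks unbalanced"
        return "DATA_PLANE: uplinks balanced"


monitor = UplinkBalanceMonitor(max_ratio=0.5)
# Hypothetical interface names and bit rates.
print(monitor.update({"ethernet-1/49": 100.0, "ethernet-1/50": 100.0}))
print(monitor.update({"ethernet-1/49": 300.0, "ethernet-1/50": 100.0}))
```

Many samples are taken on the node, but the central dashboard only receives the occasional state transition.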

Monitoring state for which ON_CHANGE is not supported

The BGP RIB subtree does not support ON_CHANGE events, and can only be sampled. Here a given application prefix (8.8.8.8/32) is monitored for availability, using 10-second intervals over a 100-second history period.
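One way to picture this is a fixed-size window of local samples that is summarized into a single value and reported only when that value changes. The sketch below assumes a hypothetical `is_present` callback in place of a real RIB lookup; the class name and the availability metric are illustrative, and only the prefix and the 10 s / 100 s figures come from the example above.

```python
from collections import deque
from typing import Callable, Deque, Optional, Set


class PrefixAvailabilityMonitor:
    """Sample a prefix locally and report availability only when it changes."""

    def __init__(self, prefix: str, is_present: Callable[[str], bool],
                 interval_s: int = 10, history_s: int = 100) -> None:
        self.prefix = prefix
        self.is_present = is_present
        self.interval_s = interval_s
        # 100 s of history at 10 s intervals keeps the last 10 samples.
        self.history: Deque[bool] = deque(maxlen=max(1, history_s // interval_s))
        self.last_reported: Optional[float] = None

    def poll(self) -> Optional[float]:
        """Take one sample; return availability (0.0 to 1.0) if it changed."""
        self.history.append(self.is_present(self.prefix))
        availability = sum(self.history) / len(self.history)
        if availability == self.last_reported:
            return None
        self.last_reported = availability
        return availability


# Hypothetical stand-in for the node's BGP RIB.
rib: Set[str] = {"8.8.8.8/32"}
monitor = PrefixAvailabilityMonitor("8.8.8.8/32", lambda p: p in rib)

print(monitor.poll())  # first sample is always reported
print(monitor.poll())  # unchanged availability: nothing to report
rib.clear()            # the route is withdrawn
print(monitor.poll())  # availability dropped, so it is reported again
```

A real agent would call `poll()` from a timer every `interval_s` seconds; the scheduling is left out here to keep the example short.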

Expanding your adjacent possible with SR Linux

The “adjacent possible” (by Stuart Kauffman) articulates how systems and organisms naturally evolve to a set of next possible configurations. By doing so they increase the diversity of what can happen next. If you are an architect or network engineer responsible for designing or operating a modern network, consider Nokia SR Linux, with its built-in support for model-driven custom agents, so you can know what’s going on in your critical network. Q.E.D.