High-Performance Alerting Platform at ThousandEyes

by Rahul Pandey, Software Engineering Technical Leader, and Balint Kurnasz, Software Engineering Technical Leader at Cisco ThousandEyes

ThousandEyes’ Alerting Platform empowers customers to gain a proactive edge by delivering in-depth visibility into their networks and reducing the mean time to resolution (MTTR) for network issues. The platform continuously monitors network metrics against user-configured conditions. If any deviations or anomalies are detected based on these conditions, the platform immediately alerts users through various predefined notification channels. This blog dives into the real-time data streaming architecture and functionality of the ThousandEyes Alerting Platform, exploring how it facilitates the seamless introduction of new features.

Key architecture goals for the alerting system

1. Real-time Data Processing with Low Latency and High Throughput: To handle increasing data volumes with low latency and high throughput, the alerting platform requires several key capabilities. Seamless integration with upstream services is crucial for consuming and processing high-volume, real-time data streams without introducing latency. The platform must continuously update the state of thousands of alerts at high frequency in response to incoming events. Moreover, to ensure seamless restarts and robust failure recovery, the platform must include mechanisms to efficiently maintain snapshots of the processing state at any given point in time, kept both in memory and in persistent storage.

2. Handling Stream Imperfections: The platform must seamlessly manage data-streaming imperfections such as late-arriving, missing, or out-of-order data. Handling these imperfections allows the system to process delayed events without impacting overall streaming velocity, to define thresholds for delayed events and take appropriate actions (e.g., discarding them), and to perform idempotent processing to ensure accuracy when handling duplicate events (see the sketch after this list).

3. Scaling and High Availability: As the number of customers grows, the platform must scale to accommodate the increased volume of events and continue to process them reliably. The alerting platform must also be highly available and durable to ensure seamless recovery from failures affecting part of the platform or the platform as a whole.
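To make the second goal concrete: the engine described later in this post is built on Apache Flink, and in Flink terms late and out-of-order data is typically handled through watermarks. The sketch below only illustrates that idea, with a hypothetical MeasurementEvent type and illustrative tolerances rather than the platform's actual values.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Hypothetical event type; the real platform's schema is not shown in this post.
record MeasurementEvent(String testId, String agentId, long eventTimeMillis) {}

class LatenessHandlingSketch {

    // Tolerate events arriving up to 2 minutes out of order before the watermark
    // advances past them; later stages can then decide to accept or discard stragglers.
    static DataStream<MeasurementEvent> withWatermarks(DataStream<MeasurementEvent> events) {
        return events.assignTimestampsAndWatermarks(
                WatermarkStrategy.<MeasurementEvent>forBoundedOutOfOrderness(Duration.ofMinutes(2))
                        .withTimestampAssigner((event, recordTimestamp) -> event.eventTimeMillis())
                        // Keep the watermark moving even if a partition goes quiet.
                        .withIdleness(Duration.ofMinutes(5)));
    }
}
```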

Alerting Platform Architecture

Figure 1. Alerting platform architecture

Before we dive deeper into each component of the alerting platform, let’s take a look at the high-level design to visualize the different tiers constituting the platform and their functions. Figure 1 shows that the alerting platform consists of three main components.

In the Alerter Control Plane, real-time synchronization of alert configurations is achieved through continuous ingestion of configuration updates and metadata state from configuration services. Change data capture (CDC) events from datastores are also leveraged to maintain the most current alert configuration.

The Alerts Engine applies alert conditions to raw data events in real time, comparing event metrics with complex conditions and thresholds to determine anomalies. It aggregates these anomalies across multiple intervals and triggers or clears alerts based on the specified rule configurations. As a result, alerts are generated at each interval and published to a results stream.

The Alerter Data Plane acts as a centralized storage solution for all generated alert events, making them accessible to downstream internal services through real-time Kafka topics or REST APIs. This allows for seamless integration with other systems, enabling efficient processing and analysis of alert data.

Alerter Control Plane

ThousandEyes customers can create alert rules using the web application or the public APIs. The alert rules are composed of thresholds for one or more key performance indicators (KPIs), conditional evaluations, lookback periods, and scope defined by the number of agents or sensors reporting the issue.

For instance, an alert rule applied to HTTP measurements might look like this: “(resolutionTime > 2 standard deviations over mean) && (responseTime > 3 standard deviations over mean) for 2 intervals of 5 minutes occurring across at least 3 sensors”. Rules like these can be applied to measurements across various layers, such as HTTP servers, DNS servers, BGP routes, network device metrics, API, and web transactions. The rule can also specify the integration channels for sending notifications, such as email, Slack, PagerDuty, ServiceNow, or any custom webhook.
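To make the structure of such a rule a bit more tangible, here is a rough sketch of how its pieces (KPI conditions, lookback intervals, sensor scope, and notification channels) could be modeled. The types and field names are hypothetical and do not reflect the actual ThousandEyes configuration schema.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical model of an alert rule; not the actual ThousandEyes configuration schema.
record KpiCondition(String metric, String condition) {}  // e.g. "responseTime", "> 3 stddev over mean"

record AlertRule(
        String ruleId,
        List<KpiCondition> conditions,        // KPI thresholds combined with && / ||
        int requiredIntervals,                // "for 2 intervals"
        Duration intervalLength,              // "... of 5 minutes"
        int minimumSensors,                   // "occurring across at least 3 sensors"
        List<String> notificationChannels) {} // email, Slack, PagerDuty, ServiceNow, webhook, ...

class AlertRuleExample {
    // The HTTP rule from the example above, expressed with this hypothetical model.
    static final AlertRule HTTP_RULE = new AlertRule(
            "http-rule-1",
            List.of(new KpiCondition("resolutionTime", "> 2 stddev over mean"),
                    new KpiCondition("responseTime", "> 3 stddev over mean")),
            2, Duration.ofMinutes(5), 3,
            List.of("email", "slack"));
}
```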

Apart from custom alert rules set up by customers, we also provide default alert rules that cover many common use cases. These alert rule configurations are stored in a dedicated set of tables in the MySQL database. Capturing updates to alert rules is essential for evaluating network data and generating alerts. Alerter Debezium connects to a read-only replica of the MySQL database, observes changes to the alert rule configuration tables, and produces them as events on dedicated Kafka topics.
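For illustration, registering a Debezium MySQL connector against the read replica usually comes down to a small set of connector properties. The sketch below shows the general shape; hostnames, credentials, and table names are placeholders, and exact property names can differ between Debezium versions.

```java
import java.util.Properties;

// Sketch of a Debezium MySQL source connector configuration; all values are placeholders.
class AlerterDebeziumConfigSketch {
    static Properties connectorConfig() {
        Properties p = new Properties();
        p.put("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        p.put("database.hostname", "mysql-read-replica.internal");  // read-only replica
        p.put("database.port", "3306");
        p.put("database.user", "cdc_reader");
        p.put("database.password", "********");
        p.put("database.server.id", "184054");                      // unique ID for the replication client
        p.put("topic.prefix", "alerter");                           // prefix for the CDC Kafka topics
        p.put("table.include.list", "alerting.alert_rules,alerting.alert_conditions"); // hypothetical tables
        return p;
    }
}
```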

Figure 2. Alerter Control Pipeline

Alerter Engine

The Alerter Engine is an intricate system that efficiently processes two distinct data streams to generate and clear alerts in real time. The first stream, sourced from upstream services, contains enriched events representing measurements for network paths or network assets, each comprising several metrics and metadata. Raw events are generated by ThousandEyes agents, enriched by upstream services, and then published to Kafka topics. The second stream consists of CDC events from the control plane, which provide the rule configuration metadata and formulas that must be applied to each enriched data event.

To achieve near real-time results and ensure exceptional platform reliability, we’ve harnessed the power of the Apache Flink framework. Flink’s ability to process continuous and finite data streams, coupled with its rich set of state and window functions, empowers us to perform complex data processing and evaluation with low latency. Furthermore, Flink’s robust fault tolerance mechanisms, including incremental checkpoints and savepoints, support our platform’s resilience in the case of failures, making it a highly reliable and responsive alerter engine.
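As a minimal sketch of the fault-tolerance side, this is roughly what enabling incremental RocksDB checkpoints looks like when setting up a Flink job. The checkpoint interval, state backend choice, and storage location shown here are illustrative, not the production configuration.

```java
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.StateBackendOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: a Flink environment with incremental RocksDB checkpoints to durable storage.
class AlerterEngineCheckpointingSketch {
    static StreamExecutionEnvironment createEnvironment() {
        Configuration conf = new Configuration();
        conf.set(StateBackendOptions.STATE_BACKEND, "rocksdb");        // keeps large keyed state off-heap
        conf.set(CheckpointingOptions.INCREMENTAL_CHECKPOINTS, true);  // upload only changed state
        conf.set(CheckpointingOptions.CHECKPOINTS_DIRECTORY, "s3://alerter/checkpoints"); // placeholder

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        env.enableCheckpointing(60_000);  // snapshot processing state every 60 seconds
        return env;
    }
}
```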

Three stages form the core of the alerter engine pipeline, as shown in Figure 3: data ingestion, data evaluation, and multi-datapoint aggregation with alert generation.

Figure 3. Alerter Engine Pipeline

Data Ingestion

During data ingestion, the alerting pipeline transforms incoming raw data points based on business logic and enriches them with control plane and baseline statistics information. This information tells the pipeline how to evaluate the data points and how to aggregate them in later stages.
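One common Flink pattern for this kind of enrichment is broadcast state: the low-volume rule configuration stream is broadcast to every parallel instance, and each incoming measurement is joined against it. The sketch below illustrates the idea with hypothetical types; the platform’s actual enrichment logic (including baseline statistics) is more involved.

```java
import java.util.Map;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Sketch: enrich each raw measurement with the alert rule configuration that applies
// to its test, using Flink's broadcast state pattern. All types here are hypothetical.
class EnrichmentSketch {

    record MeasurementEvent(String testId, String agentId, Map<String, Double> metrics) {}
    record RuleConfig(String testId, String expression) {}
    record EnrichedEvent(MeasurementEvent event, RuleConfig rule) {}

    static final MapStateDescriptor<String, RuleConfig> RULES =
            new MapStateDescriptor<>("alert-rules", Types.STRING, TypeInformation.of(RuleConfig.class));

    static DataStream<EnrichedEvent> enrich(DataStream<MeasurementEvent> measurements,
                                            DataStream<RuleConfig> configUpdates) {
        BroadcastStream<RuleConfig> rules = configUpdates.broadcast(RULES);

        return measurements
                .connect(rules)
                .process(new BroadcastProcessFunction<MeasurementEvent, RuleConfig, EnrichedEvent>() {
                    @Override
                    public void processElement(MeasurementEvent event, ReadOnlyContext ctx,
                                               Collector<EnrichedEvent> out) throws Exception {
                        RuleConfig rule = ctx.getBroadcastState(RULES).get(event.testId());
                        if (rule != null) {
                            out.collect(new EnrichedEvent(event, rule));  // drop events with no matching rule
                        }
                    }

                    @Override
                    public void processBroadcastElement(RuleConfig rule, Context ctx,
                                                        Collector<EnrichedEvent> out) throws Exception {
                        // A CDC update from the control plane: refresh the broadcast rule state.
                        ctx.getBroadcastState(RULES).put(rule.testId(), rule);
                    }
                });
    }
}
```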

Data Evaluation

The engine then passes the transformed and enriched data points to the next stage, where each data point is evaluated against one or more user-configurable expressions. The expression evaluator first parses the raw alert expression into an Abstract Syntax Tree (AST), which is then transformed into an evaluable predicate chain, as shown in Figure 4.

Finally, the predicate chain is evaluated against a set of key-value pairs representing the measurements collected by the agent and their corresponding values. The evaluator performs a name-based lookup to fetch the current value of each metric and executes the chain against it. The evaluation determines whether the data point is eligible for further aggregation and whether it is a potential candidate for firing an alert.

For instance, an expression like ((responseTime > 500 ms) && (waitTime >= 100 ms)) would evaluate a data point with [responseTime: 0.75, waitTime: 0.55] (values in seconds) as ‘true’, making it a candidate for triggering an alert.
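A stripped-down version of this evaluation step might look like the sketch below: the parsed expression becomes a chain of predicates over a metric-name-to-value map, and the data point is an alert candidate only if the whole chain passes. The types and the hard-coded chain are hypothetical; the real evaluator builds the chain from the AST and supports richer conditions such as deviations over a baseline.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch: evaluate a parsed alert expression as a chain of predicates over a map of
// metric names to values. The parsing step and these types are hypothetical.
class ExpressionEvaluationSketch {

    // One leaf of the predicate chain: metric name plus a comparison on its value.
    record MetricPredicate(String metric, Predicate<Double> comparison) {
        boolean test(Map<String, Double> measurements) {
            Double value = measurements.get(metric);  // name-based lookup
            return value != null && comparison.test(value);
        }
    }

    // ((responseTime > 500 ms) && (waitTime >= 100 ms)), with measurement values in seconds.
    static final List<MetricPredicate> CHAIN = List.of(
            new MetricPredicate("responseTime", v -> v > 0.5),
            new MetricPredicate("waitTime", v -> v >= 0.1));

    static boolean isAlertCandidate(Map<String, Double> measurements) {
        return CHAIN.stream().allMatch(p -> p.test(measurements));
    }

    public static void main(String[] args) {
        // The example from the text: responseTime = 0.75 s, waitTime = 0.55 s -> true.
        System.out.println(isAlertCandidate(Map.of("responseTime", 0.75, "waitTime", 0.55)));
    }
}
```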

Figure 4. Alert expression AST

Data Aggregation and Alert Generation

Regardless of the evaluation results, the evaluator sends data points on for multi-point aggregation. Here, Apache Flink’s windowing capabilities (tumbling, session, and global windows) group data by event time. Alerts are triggered or cleared when the minimum required number of sensors (as specified in the alert configuration) meets the alert condition.

At this point, we can already answer practical questions like “Does this particular agent at this particular instant violate this exact expression?” We can also answer more complicated questions, such as, “Do at least 80% of the total agents of this test satisfy this expression for at least 2 intervals of 5 minutes?”
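A simplified version of this aggregation stage might look like the following sketch: evaluated data points are keyed by alert rule, grouped into five-minute event-time tumbling windows, and a decision is emitted once the number of distinct sensors meeting the condition reaches the rule’s configured minimum. The types and the single-window trigger are illustrative; the production engine also tracks consecutive intervals and emits clear events.

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Sketch: per-rule, per-interval aggregation of evaluated data points.
// EvaluatedPoint and AlertDecision are hypothetical types.
class AggregationSketch {

    record EvaluatedPoint(String ruleId, String agentId, boolean conditionMet, int minimumSensors) {}
    record AlertDecision(String ruleId, long windowEnd, boolean trigger, int violatingSensors) {}

    static DataStream<AlertDecision> aggregate(DataStream<EvaluatedPoint> evaluated) {
        return evaluated
                .keyBy(EvaluatedPoint::ruleId)
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))  // one alerting interval
                .process(new ProcessWindowFunction<EvaluatedPoint, AlertDecision, String, TimeWindow>() {
                    @Override
                    public void process(String ruleId, Context ctx, Iterable<EvaluatedPoint> points,
                                        Collector<AlertDecision> out) {
                        Set<String> violatingSensors = new HashSet<>();
                        int minimumSensors = Integer.MAX_VALUE;
                        for (EvaluatedPoint p : points) {
                            minimumSensors = p.minimumSensors();  // same configured minimum on every point
                            if (p.conditionMet()) {
                                violatingSensors.add(p.agentId());
                            }
                        }
                        boolean trigger = violatingSensors.size() >= minimumSensors;
                        out.collect(new AlertDecision(ruleId, ctx.window().getEnd(),
                                trigger, violatingSensors.size()));
                    }
                });
    }
}
```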

Alerter Data Plane

To handle high write and read volumes, the Alerter Data Plane leverages NoSQL datastores like DynamoDB and Elasticsearch, which excel at scaling to meet these demands. Figure 5 shows that the data plane ensures data integrity by applying business logic to events before storing them in DynamoDB and Elasticsearch. These events are then streamed for consumption by downstream services.
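For illustration, persisting one generated alert event with the AWS SDK for Java v2 might look like the sketch below. The table name, key schema, and attribute names are placeholders rather than the platform’s actual data model.

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

// Sketch: write one alert event to DynamoDB. Table and attribute names are placeholders.
class AlertEventWriterSketch {

    private final DynamoDbClient dynamoDb = DynamoDbClient.create();

    void storeAlertEvent(String alertId, String ruleId, long timestampMillis, String state) {
        Map<String, AttributeValue> item = Map.of(
                "alertId", AttributeValue.builder().s(alertId).build(),  // partition key (hypothetical)
                "eventTimestamp", AttributeValue.builder().n(Long.toString(timestampMillis)).build(),
                "ruleId", AttributeValue.builder().s(ruleId).build(),
                "state", AttributeValue.builder().s(state).build());     // e.g. TRIGGERED or CLEARED

        dynamoDb.putItem(PutItemRequest.builder()
                .tableName("alert-events")   // placeholder table name
                .item(item)
                .build());
    }
}
```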

Additionally, the data plane provides APIs for internal services to query alerts with specific parameters. The frontend service utilizes these APIs to populate the Alert list view. This view enables users to correlate network metrics and identify the root cause of alerts, allowing them to take appropriate action.

Figure 5. Alerter Data Plane

Alerts Platform as a Service

The alerting platform serves a dual purpose, catering to both our thousands of end customers, including many Fortune 500 companies, and our internal product engineering teams. The platform is data agnostic, ingesting and processing data from diverse sources regardless of format. This flexibility enables seamless integration for product engineering teams, who can extend interfaces at any layer of the platform to match their specific needs. This adaptability has accelerated innovation, leading to the launch of alerts for over six ThousandEyes products in just the past three years. Moreover, the platform’s configurable pipelines facilitated the implementation of advanced features like Dynamic Baseline and Anomaly Detection, significantly boosting the platform’s overall alerting capabilities.

Conclusion

The ThousandEyes Alerting Platform empowers users to stay ahead of network issues with actionable, real-time alerts. Our scalable platform processes data at Internet scale, helping to ensure critical alerts reach users instantly. To further elevate your digital experience assurance, we have recently introduced Anomaly Detection and Real-Time Alert Suppression Windows. Stay tuned for part two of this blog, where we’ll delve deeper into real-time Anomaly Detection.

But that’s not all! We are constantly innovating, and exciting new features are on the horizon for FY 2024. Are you passionate about building world-class network observability solutions? Join our team and be a part of this journey! Please consider exploring our open engineering roles today.

Want to be a part of our team? ThousandEyes is hiring! Please see our Careers page for open roles.
