System Design Interview: Metrics Monitoring and Alerting System
Check out ByteByteGo’s popular System Design Interview Course
In today’s cloud-native and highly distributed computing environments, observability is essential. Without it, even a small blind spot can lead to cascading failures, costly downtime, and frustrated users. This article presents the architecture of a scalable metrics monitoring and alerting system designed to deliver real-time insight into the health and behavior of a large-scale infrastructure. A well-architected system ensures operational continuity by detecting anomalies early, reducing downtime, and improving response times.
While industry-standard platforms like Datadog, InfluxDB, Nagios, Prometheus, Munin, Grafana, and Graphite are widely adopted, this discussion focuses on designing an internal observability solution optimized for large-scale organizations, one that prioritizes flexibility, reliability, and cost efficiency.
Don’t waste hours on Leetcode. Learn patterns with the course Grokking the Coding Interview: Patterns for Coding Questions.
Requirement Gathering
We will design a system meant for internal use by a large-scale company. The goal is to provide engineers and SREs with real-time operational visibility and reliable alerting on infrastructure health.
The following functional requirements define the core functionality of the system to meet operational monitoring goals:
- Scale & Environment: The system must monitor a large, distributed infrastructure emitting on the order of 10 million metrics.
- Types of Metrics: We will collect system-level metrics like CPU, memory, disk usage, network I/O, request throughput, and queue/message counts.
- Data Retention: We will keep raw data for 7 days and then downsample it to 1-minute and 1-hour aggregates as time progresses.
- Alerting: Alerts are delivered through various channels like email, PagerDuty, and HTTP webhooks for fast incident response.
The system does not handle:
- Log aggregation & analysis (use the ELK stack instead)
- Distributed tracing across microservices (tools like Dapper or Zipkin handle this)
Non-functional requirements of our system include:
- Scalability: Must support rapid growth in metrics volume and infrastructure size
- Low Latency: Fast queries for dashboards and alerting systems
- High Reliability: No missed alerts
- Flexibility: Must be modular and adaptable to integrate with evolving infrastructure and tools
High-Level System Design
Once the requirements and scope are well-defined, the next step is to translate these expectations into a high-level system architecture. The metrics monitoring and alerting system consists of six key components.
- Metrics Sources are the origin points of metrics such as application servers, databases, message queues, or any infrastructure component that emits performance data like CPU usage or request counts.
- Metrics Collector ingests raw metrics data from various sources and forwards it to the storage layer.
- Time-Series Database is the core of the system. It stores metrics as time-series data, indexed by metric name and labels (tags). Popular time-series databases include InfluxDB and Prometheus.
- Query Service exposes a query-friendly interface to other components like the alerting engine or visualization dashboards.
- Alerting System continuously monitors incoming data using scheduled queries. It triggers alerts when thresholds are breached or anomaly patterns are detected. Alerts are pushed to various destinations, such as email, PagerDuty, webhooks, and SMS or phone calls.
- Visualization Layer presents real-time and historical metrics data in graphs, charts, or heatmaps. This helps engineers identify trends, patterns, or issues at a glance.
Now let’s discuss our design in detail.
Step 1: Collecting Metrics
In the first part of the system, data flows from the Metrics Sources to the Metrics Collector. Metrics data can be collected using either a pull or a push model. Both have trade-offs, and large-scale systems often support both for different use cases.
Pull Model
In a pull model, a pool of metrics collectors periodically polls services for metrics data over a predefined HTTP endpoint (such as /metrics). This model keeps control in the hands of the monitoring system as it decides when and what to pull.
The challenge is ensuring collectors know where to pull from. In static environments, this could be a list of IPs, but in dynamic infrastructure (such as auto-scaling environments), this can become unmanageable. In such cases, services register themselves with a service discovery tool like etcd or ZooKeeper, and collectors are automatically notified of any changes in the server pool.
For scalability, multiple collectors are used. These must coordinate to avoid scraping the same servers twice. A common solution is consistent hashing, where each collector owns a range of the hash ring and is assigned the servers that fall into it. This ensures even distribution and prevents overlap.
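To make this concrete, here is a minimal Python sketch of consistent hashing for assigning servers to collectors. The class and names are illustrative, not tied to any particular monitoring product:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, collectors, vnodes=100):
        # Place several virtual nodes per collector for a more even spread.
        self._ring = sorted(
            (_hash(f"{c}#{i}"), c) for c in collectors for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def collector_for(self, server_id: str) -> str:
        """Return the collector responsible for scraping this server."""
        idx = bisect.bisect(self._keys, _hash(server_id)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["collector-1", "collector-2", "collector-3"])
print(ring.collector_for("web-server-42"))  # always maps to the same collector
```

Because a server's position on the ring is stable, adding or removing a collector only reassigns the servers in its range rather than reshuffling everything.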
Push Model
In the push model, each metrics source runs a collection agent that gathers metrics and pushes them to the collectors at fixed intervals. Push-based collection can better handle frequent changes in server topology. However, if the collector is overwhelmed or unreachable, data loss is a risk. To mitigate this, agents may buffer data locally and retry.
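Below is a minimal sketch of such an agent that buffers metrics locally and retries when the collector is unreachable. The endpoint URL and payload format are assumptions for illustration:

```python
import json
import time
import urllib.request
from collections import deque

COLLECTOR_URL = "http://metrics-collector.internal:9091/push"  # hypothetical endpoint
buffer = deque(maxlen=10_000)  # bounded buffer: oldest points are dropped if it fills up

def record(name: str, value: float, labels: dict) -> None:
    buffer.append({"name": name, "value": value, "labels": labels, "ts": int(time.time())})

def flush() -> None:
    """Try to push everything in the buffer; keep data on failure and retry later."""
    while buffer:
        point = buffer[0]
        req = urllib.request.Request(
            COLLECTOR_URL,
            data=json.dumps(point).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=2)
            buffer.popleft()          # only drop the point once delivery succeeded
        except OSError:
            break                     # collector unreachable: retry on the next flush

while True:                           # the agent's push loop
    record("cpu.usage", 0.42, {"host": "web-1"})
    flush()
    time.sleep(15)                    # push interval
```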
For reliability, metric collectors should be deployed behind a load balancer in an auto-scaling cluster. This ensures the system can dynamically handle changes in traffic volume without dropping metrics or slowing down.
Pull vs Push: Pros and Cons
✅ Pull Benefits
- Easier debugging: hit /metrics and instantly see what a service exposes
- Doubles as a health check: if scraping fails, the service is likely down
- The collector controls when and what to fetch, reducing the risk of ingesting untrusted data
- Works well for long-running, stable services
❌ Pull Limitations
- Struggles with short-lived jobs as they may terminate before being scraped
- Requires service discovery in dynamic environments (autoscaling, ephemeral nodes)
- Harder across NATs or firewalls since collectors must reach every endpoint
✅ Push Benefits
- Ideal for short-lived or ephemeral workloads: metrics are sent before a job disappears
- Simpler networking: only outbound connections are needed
- Can deliver metrics with lower latency, especially in high-throughput setups
❌ Push Limitations
- Harder to debug: missing data could mean a service failure or a network hiccup
- Collector overload or downtime risks data loss unless agents buffer and retry
- Needs authentication/whitelisting to avoid accepting untrusted metrics
So… Which one should you use?
In real-world systems, it’s rarely an either/or choice. Pull works best for stable, long-running services where you need strong control and easy health checks. Push is ideal for dynamic environments with short-lived jobs, serverless functions, or autoscaling clusters. Many production setups combine both approaches to balance reliability and flexibility.
Step 2: Storing Metrics
How time-series data is stored is central to a metrics monitoring system, and selecting the right database is key to achieving performance and scalability.
Relational databases (e.g., MySQL) and NoSQL stores (e.g., Cassandra, Bigtable) can hold time-series data but struggle at scale. Relational databases hit write bottlenecks under heavy ingestion and require complex SQL for time-series queries, while NoSQL stores need careful schema tuning to stay efficient.
In contrast, purpose-built time-series databases (TSDBs) are optimized for this workload. They offer high-throughput ingestion, native label/tag indexing, efficient compression and rollups, and custom query languages tailored for time-series analysis.
Popular TSDBs include InfluxDB, Prometheus, AWS Timestream, OpenTSDB, and MetricsDB. These databases are designed to handle billions of points efficiently, without the operational overhead of tuning general-purpose systems.
Land a higher salary with Grokking Comp Negotiation in Tech.
Time-series Data Model
The time-series data model is a format optimized for efficiently storing and querying metric values over time. Each time series is uniquely defined by three components:
- Metric Name: A unique identifier
- Labels: Key-value pairs that identify the source or context, allowing the system to slice and group data
- Values with Timestamps: The actual data points collected over time, recorded as a sequence of <timestamp, value> pairs
This structure enables both fine-grained queries (e.g., retrieve the exact request count for a specific host at a specific time) and aggregated views (e.g., average request count across a cluster over 5 minutes). A sample time series for the http.request.count metric might look like this:
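The labels and values below are illustrative only, expressed as a Python structure for clarity:

```python
# One time series: metric name + labels identify it; data is <timestamp, value> pairs.
series = {
    "metric": "http.request.count",
    "labels": {"host": "web-1", "region": "us-east-1", "status": "200"},
    "points": [
        (1723456000, 112),   # (unix timestamp, request count)
        (1723456015, 98),
        (1723456030, 120),
    ],
}
```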
Scaling with Kafka
As your system scales, the volume of metrics grows rapidly, and collectors must handle continuous data streams. To keep up, they're often deployed in auto-scaling clusters that adjust elastically with load, spinning up more instances during traffic spikes and scaling down during idle times. But even with dynamic scaling, a critical question remains: what happens when the time-series database slows down or becomes unavailable?
One of the most effective ways to handle this is to insert a queueing system like Kafka between the metrics collectors and the database. Instead of writing directly to the database, metrics are first published to Kafka, where they are temporarily stored and then consumed by downstream services that push them to the database.
Kafka’s built-in partitioning model offers a natural way to distribute workload across consumers. It allows partitioning by metric name so that all updates to the same metric land in the same partition, making aggregation and compaction easier.
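As a hedged sketch, here is how a collector might key messages by metric name using the kafka-python client (other clients work similarly); the broker address and topic name are assumptions:

```python
import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],          # assumed broker address
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish(metric_name: str, labels: dict, value: float, ts: int) -> None:
    # Keying by metric name means the default partitioner routes all points for
    # the same metric to the same partition, which keeps downstream aggregation
    # and compaction simple.
    producer.send(
        "metrics",                               # assumed topic name
        key=metric_name,
        value={"name": metric_name, "labels": labels, "value": value, "ts": ts},
    )

publish("http.request.count", {"host": "web-1"}, 120, 1723456030)
producer.flush()
```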
But what if you don’t want to use Kafka?
Running a production-grade Kafka cluster is complex, and not every team is ready for that overhead. Some teams rely on highly available, write-optimized databases like Facebook's Gorilla, an in-memory time-series engine built to remain available even during network failures. But such solutions are usually reserved for massive, custom infrastructures. There's no one-size-fits-all answer. Kafka is reliable and widely adopted, but it comes with operational cost. For simpler setups, you can skip Kafka and use a more direct pipeline between collectors and storage, but be aware that if the database slows down or fails, you risk losing data.
Summarizing Metrics
After scaling ingestion with Kafka, the next challenge is deciding where to reduce metric volume without losing critical insights. There are three natural aggregation points in the pipeline: at the source, during ingestion, or at query time. Each comes with tradeoffs in complexity, precision, and performance.
- Client-side aggregation happens before metrics even leave the source, reducing what the collector needs to handle.
- Ingestion-time aggregation happens in the pipeline after the collector receives data but before it’s stored in the TSDB. It is often implemented with Kafka Streams or Flink sitting between collectors and storage.
- Query-time aggregation is handled by the query service layer, after the TSDB stores the raw data.
Choosing where to aggregate is a tradeoff between precision, storage cost, and query performance.
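To illustrate the second option above, ingestion-time aggregation could be a small consumer that rolls raw points into fixed windows before writing to the TSDB. This sketch uses plain Python rather than Kafka Streams or Flink, and the one-minute window is an assumption:

```python
from collections import defaultdict

WINDOW = 60  # aggregate raw points into 60-second buckets (assumed)

# (metric name, labels, window start) -> [running sum, count]
windows = defaultdict(lambda: [0.0, 0])

def ingest(name: str, labels: tuple, ts: int, value: float) -> None:
    # Labels are a tuple of (key, value) pairs so they can be used as a dict key.
    key = (name, labels, ts - ts % WINDOW)
    acc = windows[key]
    acc[0] += value
    acc[1] += 1

def flush_window(name: str, labels: tuple, window_start: int) -> tuple:
    """Emit one averaged point per metric/label set per window to the TSDB."""
    total, count = windows.pop((name, labels, window_start))
    return (name, labels, window_start, total / count)

ingest("http.request.count", (("host", "web-1"),), 120, 110)
ingest("http.request.count", (("host", "web-1"),), 150, 130)
print(flush_window("http.request.count", (("host", "web-1"),), 120))
# -> ('http.request.count', (('host', 'web-1'),), 120, 120.0)
```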
Step 3: Optimizing Metrics Storage
The time-series database (TSDB) is the core of the metrics storage layer, but not all TSDBs are equal. Some excel at high write throughput, others at fast queries or long-term data retention.
So how do you choose?
In practice, most workloads focus on recent, high-velocity data. This suggests prioritizing short-term performance with fast caching, compression, and indexing while keeping long-term queries reasonably efficient. If you optimize for what's most frequently accessed, you'll meet the majority of user needs without overengineering the rest.
Space Optimization
Metrics data grows rapidly in any large system. Without careful planning, storage can become expensive over time. To manage this growth, systems use techniques like compression, downsampling, and tiered retention to balance performance, cost, and long-term accessibility.
Encoding and Compression
Instead of storing full timestamps (e.g., 1723456000, 1723456015, 1723456030), systems store a base value plus small deltas (15, 15). This encoding cuts storage needs dramatically, improving both cost efficiency and I/O performance.
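Here is a minimal sketch of delta encoding for timestamps; real TSDBs go further with techniques such as delta-of-delta and variable-length encoding:

```python
def delta_encode(timestamps: list[int]) -> tuple[int, list[int]]:
    """Store the first timestamp in full, then only the gaps between samples."""
    base = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return base, deltas

def delta_decode(base: int, deltas: list[int]) -> list[int]:
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out

base, deltas = delta_encode([1723456000, 1723456015, 1723456030])
print(base, deltas)                      # 1723456000 [15, 15]
assert delta_decode(base, deltas) == [1723456000, 1723456015, 1723456030]
```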
Downsampling
Not all data needs high resolution forever. Downsampling is the process of reducing the resolution of old metrics to save space. For example, keep 1-second metrics for 7 days, 1-minute metrics for 30 days, and 1-hour metrics for a year. Imagine a disk_io metric collected every 10 seconds: values like 10, 16, 20, 30 can be downsampled into 30-second intervals, storing only the max (e.g., 20 for the first half-minute, 30 for the next). This not only reduces storage costs but also accelerates long-range queries, since the system has fewer data points to scan.
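A sketch of that max-based downsampling, using the disk_io example above:

```python
def downsample_max(points, bucket_seconds=30):
    """Roll raw <timestamp, value> points into coarser buckets, keeping the max."""
    buckets = {}
    for ts, value in points:
        start = ts - ts % bucket_seconds
        buckets[start] = max(buckets.get(start, value), value)
    return sorted(buckets.items())

# disk_io sampled every 10 seconds
raw = [(0, 10), (10, 16), (20, 20), (30, 30)]
print(downsample_max(raw))   # [(0, 20), (30, 30)]
```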
The key is to define retention and rollup policies based on use cases. Business-critical metrics might stay untouched, while less important data can be aggregated or discarded over time.
Cold Storage
Eventually, some data becomes so infrequently accessed that it's not worth keeping on fast, expensive disks. Cold storage systems like AWS S3 or Google Cloud Archive offer cheap long-term retention.
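A hedged sketch of tiering old TSDB blocks to S3 with boto3; the bucket name, on-disk block layout, and 90-day policy are assumptions:

```python
import os
import shutil
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "metrics-cold-storage"             # assumed bucket name
BLOCK_DIR = "/var/lib/tsdb/blocks"          # assumed on-disk block layout
MAX_AGE = 90 * 24 * 3600                    # archive blocks older than 90 days (assumed policy)

for block in os.listdir(BLOCK_DIR):
    path = os.path.join(BLOCK_DIR, block)
    if time.time() - os.path.getmtime(path) > MAX_AGE:
        archive = shutil.make_archive(path, "gztar", path)   # pack the block directory
        s3.upload_file(archive, BUCKET, f"blocks/{block}.tar.gz")
        shutil.rmtree(path)                                  # free the fast local disk
        os.remove(archive)
```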
Step 4: Querying Metrics
Once metrics are in your time-series database, the next challenge is how to extract insights from them quickly and reliably.
Query Service: Do You Really Need One?
A query service sits between your time-series database and tools like dashboards or alerting engines. It translates queries and adds caching, load balancing, and custom logic such as access control or logging. This layer offers flexibility (it is easier to swap databases or visualization tools) and improves performance by caching repeated queries.
However, many modern tools like Grafana, Datadog, or New Relic integrate directly with popular TSDBs, making a separate query service unnecessary. You only need one if you’re at extreme scale, using multiple backends, or require custom logic. Otherwise, native integrations are simpler and more efficient.
Time-Series Query Language
Time-series data needs specialized operations like time-window aggregation, trend detection, and temporal analysis. Expressing these in SQL can be overly complex and hard to maintain. Purpose-built time-series databases offer concise, optimized query languages that simplify these tasks and improve developer productivity. The right query interface reduces complexity, speeds up implementation, and scales better with your system’s growth.
Step 5: Monitoring Metrics for Alerts
It's not enough to just collect and store metrics; we also need to act when things go wrong. An alerting system makes sure that anomalies are caught and escalated before they turn into outages.
Alert rules, often defined in YAML, specify conditions such as "trigger an alert if an instance is unreachable for 5 minutes." These are loaded into cache servers for fast access by the alert manager, which periodically queries the metrics backend. If thresholds are breached, alerts are created.
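As a sketch, a rule like the one above could be represented and evaluated roughly as follows; the rule fields, query string, and query function are hypothetical:

```python
import time

# Hypothetical in-memory representation of a rule that would normally live in a YAML file.
rule = {
    "name": "InstanceDown",
    "query": "up{job='web'}",      # metric that is 1 while the instance is reachable
    "condition": lambda value: value == 0,
    "for_seconds": 300,            # condition must hold for 5 minutes before firing
    "severity": "page",
}

pending_since = None

def evaluate(rule, query_backend) -> str:
    """Return the alert state after one evaluation cycle: inactive, pending, or firing."""
    global pending_since
    value = query_backend(rule["query"])      # latest value from the metrics backend
    if not rule["condition"](value):
        pending_since = None
        return "inactive"
    if pending_since is None:
        pending_since = time.time()
        return "pending"
    if time.time() - pending_since >= rule["for_seconds"]:
        return "firing"
    return "pending"
```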
The alert manager performs multiple tasks:
- Groups similar alerts in a short window to prevent spamming engineers or downstream systems.
- Restricts who can modify alert rules or silences to reduce errors and improve security.
- Ensures notifications are delivered at least once, retrying if downstream systems are unavailable.
- Maintains alert states (inactive, pending, firing, resolved) in a durable key-value store.
Validated alerts are sent to Kafka, decoupling generation from delivery. Alert consumers pull from Kafka and dispatch notifications via email, SMS, PagerDuty, or webhooks.
In practice, off-the-shelf tools like Prometheus Alertmanager, Grafana OnCall, or PagerDuty are used for easier integration and reliability.
If you are interviewing, consider buying our #1 course for Java Multithreading Interviews.
Step 6: Visualizing Metrics
Once your metrics are being collected, stored, and processed, the next challenge is making sense of them. Raw numbers alone don’t drive decisions. That’s where visualization comes in.
The visualization system sits on top of the data layer and transforms streams of time-series metrics into meaningful charts, graphs, and dashboards. These visual tools allow engineers to track system health, observe trends over time, and spot anomalies in real-time.
Get insights on designing machine learning systems with Grokking Machine Learning Design.
Dashboards typically show metrics like request rates, CPU/memory usage, page load times, network traffic, and login success rates. Alerts are also surfaced to aid quick incident diagnosis.
Building a visualization system from scratch is rarely worth the effort. Tools like Grafana are the industry standard, offering rich integrations, plugins, templated dashboards, and alert overlays with minimal setup.
Final Thoughts
Designing a scalable metrics monitoring and alerting system means balancing trade-offs in data collection, processing, storage, and visualization. By combining pull/push models, Kafka for durability, a suitable time-series database, and optimizations like downsampling and caching, you get an efficient, adaptable setup. Continuously reassess build vs. buy choices and refine the design with real-world feedback. Ultimately, the best system meets reliability goals without sacrificing maintainability.
Your Comprehensive Interview Kit for Big Tech Jobs from DesignGurus
1. Grokking the System Design Interview
Learn how to prepare for system design interviews and practice common system design interview questions.
2. Grokking Dynamic Programming Patterns for Coding Interviews
Faster preparation for coding interviews.
3. Grokking the Advanced System Design Interview
Learn system design through architectural review of real systems.
4. Grokking the Coding Interview: Patterns for Coding Questions
Faster preparation for coding interviews.
5. Grokking the Object Oriented Design Interview
Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions

