What does “streaming” mean as it relates to monitoring? Why is it better?

Arijit Mukherji · Published in The Startup · Jun 5, 2019 · 11 min read

Streaming technology is revolutionizing data processing, and monitoring vendors are adopting it as a result. In my role as CTO@SignalFx (one of those vendors), I get asked about streaming a lot, especially as it relates to the observability domain. Below is a framing of how I look at both the challenges and the opportunities here, and how you can evaluate streaming monitoring technology yourself.

Part 1: The Backdrop

We live in fast times. Teslas accelerate from 0–60 mph in under 3 seconds, news travels around the world instantly through social media, and online flash sales sell out within minutes. Digital applications and services that power many of these trends have become highly dynamic and fast moving as a result. Monitoring these digital services effectively is key to “Move fast, don’t break things!”, and monitoring systems need to keep up with the pace of modern cloud native environments.

Part 2: The Hype

It’s no wonder then that phrases like “real time” and “streaming analytics” are reaching peak hype in the monitoring landscape. Commercial vendors and OSS solutions loudly proclaim these capabilities. And why not? Technologies like Apache Spark Streaming have truly transformed big data analytics, allowing us to process far larger data volumes far more quickly — making concepts like streaming data processing familiar to most.

Part 2(a): Streaming for monitoring datasets faces unique challenges

“Processing more data, faster” never hurt anyone, and streaming technology would obviously benefit the monitoring space. The exponentially growing data volumes and complexity of modern environments are rendering traditional batch approaches insufficient. However, streaming technology, when applied to monitoring datasets, faces unique challenges. Telemetry data tends to be smaller (gigabytes to terabytes instead of petabytes and exabytes). Payloads are smaller and more numerous (many individual events instead of fewer, larger data blocks). Timing and proper ordering of data are also far more critical in monitoring, because these systems are used to maintain high SLAs for digital services. Clearly, monitoring analytics is not exactly the same as traditional big data processing.

Part 2(b): The Question

Given that streaming monitoring can be greatly beneficial to businesses yet is harder to achieve, the question arises — “What does streaming mean in the context of monitoring? Why is it better?”. What should we expect from a streaming monitoring system? What qualities should we design for if we’re building such a system? As an adopter or buyer of this technology, how can you cut through the marketing hype and objectively evaluate “streaming” monitoring systems?

This article is an attempt to answer these questions.

Part 3: Why is streaming better for monitoring? What qualities must it have?

When someone mentions streaming monitoring, the audience subconsciously associates multiple qualities with it. If we break things down objectively, there are three qualities necessary to truly qualify as an effective streaming monitoring system, and a system lacking even one of them loses the benefits that streaming provides. That’s a strong statement that needs justification, so let me dig into what these qualities are, why they are beneficial, and why we need all of them.

Part 3(a): Scale: Streaming systems should support dramatically more concurrent queries and users compared to batch systems

This is probably the most obvious quality of a streaming system, and one that everyone will agree with intuitively. For monitoring use cases, efficiency matters because it translates directly into performance. Have you counted the total number of containers you have lately? How many zeroes do you need to express the number of hosts in your environment? How about the number of humans and bots that are using your monitoring system concurrently? Answer these questions and you’ll quickly realize why you need a scalable monitoring system (or will need one soon).

Streaming solves the performance problem by breaking large queries (e.g. average latency of a 1,000-container service over the last week) into small “incremental” chunks that are processed in sequence. This not only makes very large queries tractable (because the system does not have to load *all* the data into memory at the same time), but also allows dramatically more queries to run concurrently (because each query uses far fewer resources at any given time).

Let me explain with some numbers. In the example above, let us assume your 1,000 containers are reporting business-critical metrics like latency every second. In batch mode, you’ll be processing 1,000 * 86,400 * 7 = roughly 600 million datapoints each time you run the 1-week query! To make matters worse, this query is probably backing just one chart out of 10–20 similar charts in a single dashboard, out of hundreds of dashboards, being viewed by a single user out of many users in your organization! In streaming mode, however, you’ll process 1,000 datapoints each second (one new datapoint reported by each container every second), which is *orders* of magnitude more efficient. In other words, 600,000 concurrent streaming queries will process data at the same rate as the batch query re-run just once per second. Furthermore, this 600,000:1 ratio for a 1-week chart remains the same whether your containers report data every second or every hour!
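To make the incremental idea concrete, here is a minimal Python sketch (illustrative only, not how any particular product is implemented) contrasting a streaming query that keeps a little state and does constant work per new datapoint with a batch query that rescans the entire window on every refresh:

```python
# Minimal sketch: incremental (streaming) vs. rescan (batch) computation of a
# windowed average. Names and structure are illustrative only; a production
# system would keep coarser rollups rather than raw datapoints.

from collections import deque

class StreamingWindowAverage:
    """Keeps a running sum and count so each new datapoint is O(1) work."""
    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.points = deque()        # (timestamp, value)
        self.running_sum = 0.0

    def add(self, timestamp, value):
        self.points.append((timestamp, value))
        self.running_sum += value
        # Evict datapoints that have fallen out of the window.
        while self.points and self.points[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.points.popleft()
            self.running_sum -= old_value

    def current(self):
        return self.running_sum / len(self.points) if self.points else None


def batch_window_average(all_points, now, window_seconds):
    """Batch style: rescan every datapoint in the window on every refresh."""
    window = [v for (t, v) in all_points if now - window_seconds < t <= now]
    return sum(window) / len(window) if window else None
```

The constant per-update cost is what allows hundreds of thousands of such queries to run concurrently, whereas the batch version re-reads roughly 600 million points every time it refreshes.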

Clearly, streaming is a superior approach. So how do you evaluate this capability? If you’re doing it yourself and building a streaming system, make sure it can update its results incrementally with the latest data to arrive; this will, of course, require you to maintain some state with each streaming query. If you’re evaluating a 3rd party solution, investigate how the product works and whether it can support a large number of concurrent queries. The key is to smoke out the fakers — ones that simply run batch-mode queries more frequently to ‘simulate’ the feeling of streaming — because that approach will not scale and will lead to unpleasant surprises down the line.

Part 3(b): Timeliness: Streaming monitoring systems should produce results in near real time

Timeliness is the second important requirement in modern monitoring, along with scale. Everybody’s a service provider for someone — whether it is a SaaS serving its customers, or an internal microservice providing a function to other microservices within a company. Service providers must maintain SLAs. What’s more, in today’s social media age of instant communication, bad news spreads like wildfire, and a poor user experience with your service can get virally amplified before you can spell o.u.t.a.g.e. So yes, reacting to problems as close to real time as possible is important.

Since streaming works in an incremental fashion and looks only at data that arrived in the very recent past, it has the potential to produce results very quickly, very close to real time. In fact, we often automatically assume that “if it’s streaming, it must be real time.” Worse still, this tendency allows the term to be exploited to deceptively market non-real-time services. So are we justified in assuming streaming services must be timely? Unfortunately, the answer is no. Let me explain …

A streaming system that produces results a few hours late is obviously not a real-time system. For example, one can run streaming Spark pipelines every night to perform routine data processing tasks like data rollup and aggregation. While it is true that those pipelines work in a ‘streaming’ fashion, they won’t magically make the overall service real-time by any stretch of the imagination. In fact, as far as an end user is concerned, it would make no difference whatsoever whether these offline data processing tasks were performed using a batch or streaming technology. So, while evaluating streaming monitoring, one must understand how and where streaming techniques are employed, and whether they actually make the service timely.

What does timeliness mean? How quickly should a ‘timely’ alert be fired?

A more important question is — what is the timeliness standard that must be met, i.e. what does real time mean? If you talk to a high-frequency trader, real time could mean microseconds. If you talk to monitoring vendors, they might say ‘seconds.’ So what’s the right answer? It depends on the end goal we are trying to achieve. For high-frequency traders, the end goal is to execute trades before other traders can. For monitoring, the end goal is to maintain service uptime and SLAs.

Unfortunately, alerts are only the first step in an incident’s lifetime. If you get alerted about a slowing API endpoint, most of the time will be spent triaging and isolating the offending component (e.g. microservice X), debugging the root cause (e.g. CPU starvation), coming up with a remediation plan (e.g. add more capacity), executing that plan (e.g. deploy new containers for microservice X), and finally verifying the fix (i.e. ensuring that the API slowness alert clears). If you are aiming for 4 9’s of uptime, which gives you about 4 minutes per month across all incidents end-to-end, alerting needs to happen within single-digit seconds to give automation (alert bots) or orchestrators (e.g. Kubernetes) time to react and take remedial action (e.g. auto-scaling). If you are aiming for 3 9’s (about 40 minutes per month across all incidents end-to-end), then alerting must happen within a few tens of seconds.
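The arithmetic behind those budgets is easy to sanity-check yourself; here is a quick back-of-the-envelope calculation (assuming a 30-day month):

```python
# Back-of-the-envelope downtime budgets, assuming a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for nines, availability in [(3, 0.999), (4, 0.9999)]:
    budget_minutes = MINUTES_PER_MONTH * (1 - availability)
    print(f"{nines} nines -> {budget_minutes:.1f} minutes of downtime per month")

# 3 nines -> 43.2 minutes/month: detection can afford a few tens of seconds.
# 4 nines ->  4.3 minutes/month: alerting must land within single-digit seconds
#             to leave any time for triage, remediation, and verification.
```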

Part 3(c): Accuracy: Streaming monitoring systems must address data-ordering issues in order to prevent false alarms and results

False alarms are the bane of operations. Operators are drowning in alert volume already, and false alarms only exacerbate an already bad situation. They waste valuable time and reduce the quality of life of engineers through unnecessary interruptions. Far more worrisome is that when operators lose trust in their monitoring system, they will start ignoring valid alarms too, leading to further service degradations.

Data accuracy is by far the hardest problem in streaming monitoring. It is also the challenge that is least obvious and least understood. In the real world, the vast majority of streaming data analytics pipelines largely ignore timing. Similarly, batch monitoring systems (ones that rely on repeatedly querying a database) solve the accuracy problem at the expense of delay: they recommend that critical queries like alerts be run a few minutes late, allowing data to “settle” and all relevant data to arrive before a condition is evaluated. Offline data pipelines (e.g. nightly jobs), whether streaming or batch, don’t need to worry about timing because they process data minutes or hours after it arrives. Streaming monitoring, however, has a higher bar. Since it needs to produce accurate results in a timely fashion, it must reason about the logical timestamp of data (i.e. when it was generated) as opposed to when it was received by the monitoring system. Not doing so will produce inaccurate results and false alarms — in other words, a monitoring system that is unfit to monitor anything.

Let me explain with an example. Say we want to monitor the number of API errors across the previously discussed service running on 1,000 containers, with each container reporting an error count every second. It is pretty much guaranteed that those 1,000 containers won’t all report data at the same time. There are numerous reasons for data to get delayed: some due to conditions on the server (e.g. slowness from high CPU), and some external (e.g. network latency). Making matters worse, the number of containers keeps changing — containers tend to die suddenly for no apparent reason, or Kubernetes might spin up a few more when load is high. Getting accurate results requires that we aggregate datapoints and events that share the same logical timestamp, regardless of when they reached our monitoring system. For example, we don’t want to compute the total number of errors across data measured by host1@12:00, host2@12:01 and host3@11:58 just because those measurements happened to reach our monitoring system at the same time. But how long do we wait to make sure we got all the measurements made @12:00? Therein lies the true challenge with streaming monitoring, the hardest nut to crack: when do we compute a result so that it is accurate yet timely?
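To see concretely why arrival time is the wrong thing to group by, here is a tiny illustrative snippet (hypothetical hosts, timestamps, and counts) that aggregates the same three delayed measurements both ways:

```python
# Illustrative only: aggregate error counts by the timestamp the measurement
# was taken (logical/event time), not by when it happened to arrive.

from collections import defaultdict

# (host, logical_timestamp, error_count, arrival_timestamp)
datapoints = [
    ("host1", "12:00", 3, "12:02"),
    ("host2", "12:01", 1, "12:02"),
    ("host3", "11:58", 7, "12:02"),  # delayed on the network
]

by_arrival = defaultdict(int)
by_event_time = defaultdict(int)

for host, logical_ts, errors, arrival_ts in datapoints:
    by_arrival[arrival_ts] += errors      # wrong: mixes 11:58, 12:00 and 12:01
    by_event_time[logical_ts] += errors   # right: each bucket is one moment in time

print(dict(by_arrival))     # {'12:02': 11}  -- a meaningless total
print(dict(by_event_time))  # {'11:58': 7, '12:00': 3, '12:01': 1}
```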

As we already discussed, waiting an arbitrary number of minutes or hours is not a solution, because it throws timeliness out the window. Another approach is to generate partial results and update them as new data is seen — a strategy sometimes used by AWS CloudWatch. For example, I might produce a total API error count quickly from the 800 servers that have already reported, and then update the result two or three more times as the other 200 send in their data. The problem with this approach is that the consumer of these ‘evolving’ results has no idea when a result becomes final. When is it a partial result and when is it not? In the end, this is no better than the previous strategy, because our best bet is still to wait for data to settle — which again violates the timeliness requirement. The only effective solution is for the streaming monitoring system to be acutely aware of the timing characteristics of every data stream individually, so that it can produce one answer with high confidence and low latency.
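One way to picture that kind of per-stream awareness, purely as a rough sketch and not any vendor’s actual algorithm, is to track how late each stream typically reports and emit a timestamp’s total once the slowest expected straggler has had time to arrive:

```python
# Rough sketch: emit a window's result once the slowest contributing stream's
# *expected* delay has passed, rather than waiting a fixed number of minutes.
# Names and the smoothing constant are illustrative assumptions.

from collections import defaultdict

class LagAwareAggregator:
    def __init__(self):
        self.expected_lag = {}            # stream_id -> typical delay in seconds
        self.buckets = defaultdict(dict)  # logical_ts -> {stream_id: value}

    def observe(self, stream_id, logical_ts, value, arrival_ts):
        # Track how late this stream usually is (exponentially weighted).
        lag = max(0.0, arrival_ts - logical_ts)
        previous = self.expected_lag.get(stream_id, lag)
        self.expected_lag[stream_id] = 0.9 * previous + 0.1 * lag
        self.buckets[logical_ts][stream_id] = value

    def ready_results(self, now):
        """Return totals for timestamps that all known streams should have reported by now."""
        deadline = max(self.expected_lag.values(), default=0.0)
        results = {}
        for ts in sorted(self.buckets):
            if now >= ts + deadline:
                results[ts] = sum(self.buckets.pop(ts).values())
        return results
```

The wait here is derived from the observed behavior of the streams themselves rather than a fixed ‘settle’ delay, which is what lets the system aim for a single answer that is both accurate and timely.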

This sounds hard, and it is, but it is solvable, and systems exist that have solved it. As a builder of streaming monitoring, you will need to solve the same problem yourself. If you are evaluating a streaming monitoring system, you must dig into the question of accuracy and understand how the system in question guarantees accurate results and prevents false alarms.

Part 4: The Conclusion

While streaming analytics technologies are gaining traction throughout the industry, streaming analytics for the purposes of monitoring has to deal with some additional challenges. When it comes to monitoring, engineers associate the word streaming with qualities that may or may not be present in the system in question. If you are evaluating (or building) such a system, you must be mindful and ensure that these requirements are met.

Part 4(a): A framework for evaluating streaming monitoring systems

To summarize, here is a quick cheat-sheet to help guide you on this journey:

1. Scale: Does the system update results incrementally, or does it just re-run batch queries more often? Can it support orders of magnitude more concurrent queries and users?

2. Timeliness: Where exactly is streaming employed, and does it make the end-to-end service timely? Can alerts fire within seconds to a few tens of seconds, depending on your SLA?

3. Accuracy: Does the system aggregate by logical (event) timestamps rather than arrival time, and how does it decide when a result is final, so that it prevents false alarms without sacrificing timeliness?

I hope this makes the case for the three qualities of streaming monitoring and provides justification for why all of them are necessary. What do you think? What have you seen? Any interesting stories to share? Would love to hear your feedback to continue the discussion.

Arijit Mukherji is a pun slinging observability wonk and CTO@SignalFx. “Your AWS budget is EC2 burn”