Completeness, Speed, and Cost — Three Knobs Controlling Your Data Analytics Strategy
How do speed, accuracy, and cost-effectiveness influence the choice between batch and streaming analytics?
We use data analytics to make better and improved business decisions. Data analytics systems can be broadly categorized into batch and streaming (real-time) analytics systems based on their event time and processing time semantics.
Completeness, speed, and cost are crucial for selecting batch or streaming analytics systems to implement a specific analytics use case. This article explores batch and streaming systems in light of the above three factors and helps you pick the right technology for the job.
Businesses keep producing data as long as they remain in business. Website visits, e-commerce orders, and ride-hailing requests are a few example data points intercepted and captured by operational systems in the business.
We can see facts about past transactions if we zoom into each data point collected. For example, the following purchase record captures three facts: the user, item, and time.
A user with ID 1234 purchased item 567 on 2022/06/12 at 12:23:212
We can use events to represent these facts. An event has event time, the time at which the event occurred, and a payload describing additional facts about the event.
Data analytics — making sense of events
A business can make better and more informed business decisions by analyzing the events generated in the business. That is often done by a data analytics system, which collects, stores, and processes business events to make them ready for decision-making.
Event time vs. processing time
An event has two attributes related to specific points in time: event time and processing time. Event time was when the event occurred, whereas the processing time was when the analytics system processed it.
We can categorize data analytics technologies based on the variance of event time and processing time.
- Batch or historical analytics
- Streaming or real-time analytics
Batch analytics systems
Batch analytics systems collect events from operational systems in large batches, transform them, and load them into storage suitable for faster querying. These operations are done using periodically running ETL/ELT pipelines.
Although the business continuously generates events, batch analytics forces us to draw artificial boundaries across events based on the event time. So we can see a significant lag between event time and processing time in batch analytics systems, making it impossible to analyze events as they occur.
For example, such a system collects orders from an e-commerce website and writes them to a Parquet file. A periodic batch process reads that file, analyses it, and finally writes the results into a data warehouse, allowing reports and BI dashboards to be generated.
Assuming the batch process runs hourly, the dashboard user must wait until 9.10 AM to see the sales summary between 8 and 9 AM.
Time to insights = Total (event ingestion time + event processing time + query time)
Thus, latency is the biggest drawback of batch analytics systems.
Streaming analytics systems
Streaming analytics derives insights from business events as they happen, enabling faster decision-making and quick reactions. That is also called real-time analytics.
A streaming analytics system generates insights by collecting and processing a continuously flowing “stream of events.” Hence the name streaming analytics.
Unlike its batch counterpart, there’s no batching involved, allowing end-users to access insights soon after the events occur. So there’s no latency between event generation and processing, enabling us to take immediate actions afterward. For example, an anomaly detection system detects an anomaly in a stream of sensor reading events and triggers an alert to take corrective measures.
Completeness, speed, and cost — know the three knobs
There’s no one-size-fits-all solution in data analytics. We can’t claim that streaming is always better and more modern than batch analytics or vice versa. Both technologies have their strengths and weaknesses. So we must be mindful about picking either of the technologies depending on the use case.
We can consider the following three criteria to evaluate analytics use cases before deciding to use either batch or streaming analytics.
- Completeness of data
I will consider ACME Corp, a fictitious e-commerce store use case, for a more straightforward explanation of that criteria. Suppose ACME’s analytics system should provide the following metrics.
- Detect and block abnormal user login attempts.
- Provide a daily spending report for the marketing department on the ads displayed on the website.
- Provide a real-time dashboard for e-commerce ad performance.
- Train an ML model that makes product recommendations.
- Perform ad-hoc analyses on past sales data to detect trends.
Speed — when the whole business depends on the speed of making decisions.
Speed matters in a business where we need to analyze events as they happen and also react to them with minimal latency.
When we prioritize the speed, the events going through the analytics system should have identical event time and processing time. In other words, it should analyze events as they occur and react to them ASAP. That leaves us with streaming analytics as the only viable option as batch analytics systems have an inevitable latency of ETL.
In our ACME example, we should use a stream processing engine to process and detect anomalies in user logins. Detection alone is not enough, as we need to take countermeasures to block such login attempts. For example, a login attempt must be prevented if a user tries to log in with IP addresses in different geographic locations within 24 hours.
The streaming analytics system must be capable of triggering reactions almost immediately as events happen. Oftentimes, human decision-making is impossible at this level, so we program and deploy application logic on stateful stream processors (such as Flink) and event-driven Microservices frameworks (Kafka Streams, Akka, Dapr, etc.).
Completeness of source data — when the accuracy matters
Accuracy is also a demanding factor when choosing batch or streaming analytics for a given use case. The accuracy of analytics degrades when we don’t have the entire data set when making decisions, causing inconsistencies and financial losses for applications.
Streaming analytics process events as they arrive and don’t have access to the entire set while processing data. Some events may fail to arrive on time for various reasons, including network outages, users going offline for specific periods, etc. That causes a skew in the event time vs. processing time curve.
For example, a user playing a mobile game during a flight may sync the game score only after the flight lands. The real-time application processing user scores now has to deal with late-arriving scores for this user. Stream processors like Apache Flink employ watermarking to wait for delayed events for a specific threshold. Events arriving late than the watermark are dropped or handover to the user application to be dealt with. High watermarks can cause latencies, whereas low watermarks can cause inaccuracies. So stream processors provide best-effort accuracy most of the time.
Conversely, batch analytics systems can “wait” until the entire data set arrives. They collect data in batches, enabling late-arriving events to be backfilled easily compared to streaming systems. Also, including schema evolutions in upstream systems in the results is straightforward compared to their streaming counterparts.
That leads us to pick batch analytics systems when we need accuracy over speed.
In the ACME example above, we can use batch analytics to produce a daily report on marketing spending on the ads displayed on the website. Additionally, we can use a stream processor to provide an estimate of spending at any given point in time.
Cost of generating insights
Cost is also crucial for picking an analytics strategy as use cases are prioritized based on the organizational budget.
Stream processing systems must always be available; the processing infrastructure must run 24x365, incurring additional operational costs. Also, their cost is unpredictable as the input data set is infinite, causing the system to scale on-demand to accommodate more inflow of events while maintaining a consistent latency.
Conversely, operating batch analytics systems is more cost-effective. Their compute and storage cost is usually predictable as the input data set is finite, enabling us to utilize on-demand cloud infrastructure to run periodic workloads and decommission them when not in use.
Therefore, batch analytics systems are the preferred choice when your analytics strategy is cost-effective, and speed is not the need of the hour.
Considering the ACME use case above, we can use a batch analytics pipeline to train its product recommendation ML model. Ad-hoc data exploration can also be another use case. A data scientist might use a notebook environment like Databricks to perform an exploratory data analysis on past sales data to figure out trends.
So far, we have explored batch and streaming capabilities from speed, accuracy, and cost perspectives. Picking either technology should be driven case-by-case, based on those three “knobs.”
Streaming analytics is ideal when you want to analyze events as they happen and respond to them quickly. Therefore anomaly detection, real-time monitoring, and event-driven Microservices become de facto use cases for stream processing. Real-time dashboards can also be a good use case when you can deal with metrics based on estimates.
Batch analytics is desired when the accuracy of results and cost is considered over speed. Use cases that demand high accuracy, such as BI and reporting, billing, etc., use batch analytics systems to take benefits of complete data set. Also, ad-hoc data analysis and training ML models are use cases that concern the cost, which typically uses data lake architectures to run ad-hoc analytics workloads.
Although promising concepts like data lakehouses claim that they provide these three knobs; speed, accuracy, and cost-effectiveness in one box, we should patiently wait and see their tradeoffs. I believe there’s no silver bullet for containing these knobs in a single solution. Organizations should maintain an ecosystem of diverse analytics systems to accomplish different analytics needs.
I envision something similar to the following, which can’t be framed precisely as either Lambda or Kappa architecture. But it will deliver most of the analytics needs today.