Architecting a Kafka-centric Retail Analytics Platform — Part 3
What are the different real-time processing methods available to uncover insights from the data ingested into Kafka?
This is the third installment of the article series “Architecting a Kafka-centric Retail Analytics Platform.”
Previous posts in the series discussed:
- What retail metrics are crucial for business decision-making?
- What data sources should we tap into?
- How can we leverage the Kafka ecosystem to ingest retail data from diverse data sources?
This post discusses what to do with the ingested data: how it can be processed to surface valuable insights in real-time and acted upon before its business value fades.
In case you missed the first two posts in the series, you can find them here.
Data has a short shelf life of actionability.
So far, we have discussed collecting data from different sources and storing them in Kafka until they are processed.
In this post, we discuss real-time data processing to extract retail business metrics. In other words, data is processed as it is ingested into Kafka, in contrast to traditional batch-oriented ETL analytics systems.
Why? Because every business event has a shelf life. The sooner you extract insights, the greater your competitive edge.
Analytics engines are Kafka consumers
Four types of analytical consumers can process data in Kafka as it arrives.
- Event-driven Microservices
- Stateful stream processors
- Real-time OLAP databases
- Streaming data pipelines (data lake/warehouse analysis)
Although they differ in architecture, they are all Kafka consumers that adhere to the Kafka consumer protocol. Deep down, each one implements a Kafka consumer, which gives them the scalability, load balancing, and failover features built into the protocol. The Kafka partition model allows many analytics engines to consume from the same topic in parallel without stepping on each other, and offset commits and replication add more reliability into the mix.
The rest of the post describes each of these analytical styles in depth, along with its architecture, retail use cases, implementation choices, and the consumers who would benefit from it.
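To make that concrete, here is a minimal sketch of the poll-and-commit loop that every one of these engines runs under the hood; the broker address, group id, and the "orders" topic are placeholders rather than anything prescribed by the platform:

```java
// A bare-bones Kafka consumer loop; every analytics engine below embeds something like this.
// The broker address, group id, and "orders" topic are placeholders.
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "retail-analytics");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit offsets explicitly below

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Consumers in the same group split the topic's partitions among themselves,
            // which is how all four analytical styles scale out and fail over.
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand the event to the analytics logic (microservice, aggregator, pipeline, ...).
                    System.out.printf("order %s -> %s%n", record.key(), record.value());
                }
                consumer.commitSync(); // committed offsets give at-least-once processing on restart
            }
        }
    }
}
```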
Event-driven Microservices
Event-driven Microservices react to business events coming from Kafka. The reaction must happen before the event goes stale and becomes irrelevant; typically, reaction latencies should be within the millisecond range.
For example, consider a Microservice reacting to a transaction event to determine whether it’s fraudulent or not.
Architecture
Depending on the use case, an event-driven Microservice can be stateless or work with state stores such as databases. They leverage native stream processing constructs such as window operations, temporal joins, filters, and transformations to detect specific patterns in the incoming stream of events.
The above figure illustrates two architectural styles. In (a), the Microservice acts as a pure function, receiving input, performing computation, and saving the output to another Kafka topic. An example would be analyzing payment events to determine anomalies and firing alerts to another Kafka topic.
In (b), the Microservice triggers one or more side effects. An example would be receiving an e-commerce order and then triggering a fulfillment workflow and inventory re-order.
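As a rough sketch of style (a), a pure-function Microservice built with Kafka Streams might look like the following; the topic names, the key/value types, and the hard-coded threshold are illustrative assumptions standing in for a real fraud-detection rule or model:

```java
// A minimal sketch of style (a): a stateless Kafka Streams service that reads payment
// events, flags suspiciously large amounts, and emits alerts to another topic.
// Topic names, the Double payload, and the threshold are illustrative assumptions.
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentAnomalyService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-anomaly-service");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Double().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Key: customer id, value: payment amount.
        KStream<String, Double> payments = builder.stream("payments");

        payments
            .filter((customerId, amount) -> amount != null && amount > 10_000.0) // naive rule in place of a real model
            .to("payment-alerts"); // pure function: input topic in, alert topic out

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because the service is just a Kafka Streams application, you can run several instances side by side and let the consumer protocol spread the partitions of the payments topic across them.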
Retail use cases
- Anomaly detection: Quickly detect anomalies in transactions.
- Real-time recommendations: When a customer browses an e-commerce store, product recommendations can be generated and presented based on their browsing patterns.
- Real-time pricing: Analyze order and inventory streams to decide item prices based on demand and supply.
- Streamlined inventory fulfillment: Rather than waiting for the next day, the supply chain can be triggered in real-time based on the stream of items sold so far.
Implementation choices
Event-driven microservices are nothing but Kafka consumers. There are many language-specific frameworks to build them. Some include:
- Kafka Streams (Java)
- ksqlDB (SQL)
- Spring Cloud Stream (Java)
- Quarkus (Java)
- Faust (Python)
- Akka Streams (Scala)
- Microsoft Dapr (multi-language support)
The above frameworks don’t need a distributed runtime to work. They run embedded in the Microservice itself, which gives you more control over scalability.
Stream processing engines, in contrast, operate as distributed systems and manage scalability and reliability themselves. Most of them let you write the Microservices in SQL or provide language bindings.
Some examples include:
- Apache Storm
- Apache Flink
- Apache Spark Structured Streaming
- WSO2 Streaming Integrator (SQL-like Siddhi language)
- Apache Beam (Google Cloud Dataflow)
- AWS Kinesis Analytics
- Azure Stream Analytics
Consumers
Consumers of this type of analytics include customer-facing applications, monitoring and alerting systems, event-driven APIs, etc. They demand ultra-low reaction latency; hence, there’s little or no human involvement.
Stateful stream processors
Unlike event-driven Microservices, stateful stream processors (or streaming databases) are not primarily designed to react to events. Instead, they perform stateful operations on incoming streams of events to maintain durable, scalable local state. This state can be quickly accessed by downstream consumers, such as BI dashboards and Microservices.
For example, consider a stream processing application aggregating a stream of order events to maintain a running total of sales at any given time. Later, another application like a dashboard can update its UI by querying that running total.
Architecture
Stateful stream processors allow you to define incrementally updated materialized views with standard SQL. These views are defined by queries that perform stateful aggregations on incoming events, such as counts, sums, and lookup joins. Materialized views are persisted to the stream processor’s local storage and replicated across the cluster to achieve fault tolerance and scalability.
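Although this style is usually expressed in SQL, the same idea can be sketched with the Kafka Streams DSL. Here is the running-total example from above, where the "orders" topic, the key/value types, and the store name are assumptions made for illustration:

```java
// A sketch of an incrementally updated materialized view with the Kafka Streams DSL:
// a running total of sales per store, kept in a fault-tolerant local state store.
// The "orders" topic, key/value types, and store name are assumptions.
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class RunningSalesTotal {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "running-sales-total");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Double().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Key: store id, value: order amount.
        KStream<String, Double> orders = builder.stream("orders");

        // The view is updated incrementally as each order arrives and is backed by a
        // changelog topic, which gives it fault tolerance and scalability.
        orders.groupByKey()
              .reduce(Double::sum, Materialized.as("sales-by-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // A dashboard or API can read the view via interactive queries.
        // (In a real service, wait for the instance to reach RUNNING before querying.)
        ReadOnlyKeyValueStore<String, Double> view = streams.store(
            StoreQueryParameters.fromNameAndType("sales-by-store", QueryableStoreTypes.keyValueStore()));
        System.out.println("Running total for store-1: " + view.get("store-1"));
    }
}
```

In ksqlDB or Materialize, the equivalent view is a few lines of CREATE ... AS SELECT SQL.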
You can find more information on this architecture in my recent post:
Retail use cases
- Real-time BI dashboards: Aggregated results can be pushed out to a real-time dashboard like Kibana or Grafana. For example, a sales dashboard may display metrics like orders received, processed, and shipped in real-time so that business stakeholders can make decisions based on them.
- Event-driven APIs and data products: These products query aggregated state from the stream processor to update themselves in real-time.
- Real-time personalization
Implementation choices
Each of the following treats Kafka as a streaming data source, and they internally implement the Kafka consumer protocol.
- Apache Flink: Flink’s DataStream API, ProcessFunctions, and SQL support can be leveraged to build streaming analytics applications.
- ksqlDB: Tightly coupled with the Kafka ecosystem to build materialized views with SQL.
- Materialize: A great streaming database that allows you to build materialized views using SQL.
Consumers
- BI dashboards
- Stream processors that can perform lookup joins against the materialized state.
- Data products that demand real-time processed data such as social timelines, news feeds, etc.
Real-time OLAP databases
These are a special class of databases that ingest events from Kafka topics and build highly performant index structures that can answer complex OLAP queries with millisecond latencies.
For example, consider a database ingesting a stream of order events and enabling a user-facing merchant dashboard to query them while the data is still fresh.
Architecture
These databases don’t maintain materialized views. Well, some variants do, but that’s not their core feature. Instead, they produce queryable “segments” from ingested real-time data. Segments are laid out on disk in columnar format and indexed with multiple strategies to enable low-latency querying.
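Whichever engine builds those segments, the consuming side typically looks like plain SQL over JDBC or HTTP. Here is a hedged sketch of a merchant-dashboard query; the connection URL is a placeholder to be replaced with the driver and URL of whichever database you pick, and the table and column names (and the exact SQL dialect) are illustrative only:

```java
// A hedged sketch of a merchant-dashboard query against a real-time OLAP database.
// Druid, Pinot, and ClickHouse all speak SQL and ship JDBC drivers; the URL below is
// a placeholder, and the table/column names and date arithmetic are illustrative only.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MerchantDashboardQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:<your-olap-driver>://localhost:<port>"; // replace with your database's driver and URL

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Top products over the last hour, computed on data just ingested from Kafka.
             ResultSet rs = stmt.executeQuery(
                 "SELECT product_id, COUNT(*) AS orders, SUM(amount) AS revenue "
                     + "FROM orders "
                     + "WHERE order_time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
                     + "GROUP BY product_id ORDER BY revenue DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.printf("%s orders=%d revenue=%.2f%n",
                    rs.getString("product_id"), rs.getLong("orders"), rs.getDouble("revenue"));
            }
        }
    }
}
```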
Retail use cases
- User-facing analytics dashboards: These dashboards are open to the end-users — for example, the merchant dashboard of an e-commerce store should display real-time store performance information.
- Real-time personalization: A user should be presented with the next best alternative based on their browsing patterns.
- BI dashboards: Data analysts want to perform low-latency ad-hoc querying on large volumes of operational data while it is fresh.
- Fast root cause analysis (RCA): Fast, interactive analysis can be performed on the indexed data with fine-grained dimensions to figure out what went wrong with a system at a specific point in time.
Implementation choices
The following databases provide means to ingest streaming data from Kafka and make it queryable instantly.
- Apache Druid
- Apache Pinot
- ClickHouse
- Rockset
Consumers
- User-facing dashboards
- Data products such as news feeds and social timelines
- Data analysts who need to perform exploratory analysis on large volumes of data
- APIs
Data lake/Data warehouse analysis
Events landing in Kafka can’t stay there forever. They must eventually be moved out to cheap and durable storage for historical data analysis and regulatory compliance. This movement can happen in parallel while events are being processed in real-time.
For example, we can move all the processed orders to a data lake such as an S3 bucket or HDFS. Later, a data analyst or a data scientist could use that data set to discover patterns in customer purchase behavior.
Architecture
A real-time data pipeline subscribes to Kafka topics like a regular Kafka consumer, reading incoming data and loading it into a data warehouse or a data lake.
The events landing in Kafka may not be in a format optimized for batch analytics. They need to be cleansed, normalized, transformed, and enriched before loading into the data lake. To minimize latency, we can do all of that while moving events from Kafka to the destination, so you can think of this as a “streaming ETL” pipeline.
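Kafka Connect is a common way to build such a pipeline. The sketch below registers Confluent’s S3 sink connector through the Connect REST API, assuming a Connect worker on localhost:8083 with that connector plugin installed; the topic, bucket, and region values are placeholders:

```java
// A hedged sketch: registering Confluent's S3 sink connector via the Kafka Connect REST API.
// Assumes a Connect worker on localhost:8083 with the S3 sink plugin installed; the topic,
// bucket, and region values are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterS3Sink {
    public static void main(String[] args) throws Exception {
        String connectorJson = """
            {
              "name": "orders-s3-sink",
              "config": {
                "connector.class": "io.confluent.connect.s3.S3SinkConnector",
                "topics": "orders",
                "s3.bucket.name": "retail-data-lake",
                "s3.region": "us-east-1",
                "storage.class": "io.confluent.connect.s3.storage.S3Storage",
                "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
                "flush.size": "1000"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Light cleansing can be handled by Connect’s single message transforms; heavier enrichment is usually done by a stream processing job that writes to an intermediate topic, which the sink connector then picks up.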
Once the data lands in the data lake or warehouse, it can be analyzed with batch analytics tools such as Spark or Hadoop.
Retail use cases
- Move Kafka events to long-term storage: This is for data archival and backup.
- Historical data analysis: Data analysts can use archived data to discover patterns over time.
- Train machine learning models: Archived data can be fed into an ML model for training. The output could be a model predicting something like customer churn rate or buying propensity.
- BI dashboards: Data lake engines like Databricks and Presto can directly query data stored in a data lake to produce metrics and plot them on a dashboard.
Conclusion
Data coming from Kafka can be processed in different styles to uncover insights. Real-time processing styles range from quickly reacting to events to simply landing them in a data lake for later analysis. Each style has its own technical complexity, toolset, and associated business value. Where to begin depends solely on your business priorities.
For example, in the beginning, you may move data in Kafka into a data lake like S3 using Kafka Connect for historical analysis. As you understand the business needs better and your team grows, you may proceed with robust ML training pipelines or fraud detection use cases.
Building a retail analytics platform is a never-ending process. It doesn’t stop after harnessing business insights. There are other aspects such as data governance, schema evolution, data quality, etc. A data platform is incomplete without a proper control plane.
We will discuss that in the next episode.