Architecting a Kafka-centric Retail Analytics Platform — Part 1
What is retail analytics, and why does it matter to your business? How do you build a Kafka-centric analytics platform that ingests and processes business data at scale?
You either work for a retail organization, or you own one.
You may already have invested in a Kafka cluster or plan to do so in the future.
You have a tremendous amount of business data coming towards you. But you have no idea how to extract meaningful insights from it so that you can make data-driven decisions for the betterment of the business.
If all the above applies to you, you are in the right place. This article series discusses architecting an analytics platform to ingest, store, and analyze retail data at scale. More specifically, we will focus on an analytics platform that revolves around Apache Kafka and its ecosystem.
This post sets the stage for a discussion that spans multiple posts. It provides an overview of retail analytics and why you should care about it from a business standpoint. We will then briefly look at Kafka and its ecosystem before considering what type of data to capture for analysis.
We will discuss the architectural details of the platform in detail as we progress through the series.
What is retail analytics, and why does it matter to your business?
A business of any scale regularly produces many “business events” or “signals” over time. These signals originate from internal business systems, consumer devices, and other sources like business partners and social media.
Retail analytics collects and processes real-time and historical business events to measure consumer behavior and sales performance. These measures are then presented to business stakeholders for decision-making.
A retail analytics platform is a systematic approach to collecting, processing, and communicating business data to address three problems.
1) Understand what is happening and why it happened
Descriptive analytics analyzes business data to figure out what is happening in the business. It primarily focuses on producing the set of KPIs that matter to the business. An example would be calculating quarterly sales performance.
If we see a variance in a business KPI, we can use diagnostic analytics to figure out why. For example, if there was a drop in sales last week, we can analyze past data to identify the cause.
Typically, these problems are answered by business intelligence (BI) products today.
2) React to business events in real time
What are the top-selling products in the last hour? Did the promotion campaign contribute to that? What actions would maximize the campaign's ROI? Is this transaction fraudulent?
These are a few examples of how real-time analytics can be used in the context of a retail business. Real-time analytics enables you to react to business events as they occur so that you can take the necessary actions or make decisions on time.
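To make the first question concrete, here is a minimal Kafka Streams sketch (assuming Kafka Streams 3.x) that counts sales per product over one-hour tumbling windows. The topic names ("orders" and "hourly-product-sales") and the assumption that each order record is keyed by product ID are hypothetical; your topics and keying strategy will differ.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class HourlyProductSales {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-product-sales");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Assume each record on the "orders" topic is keyed by product ID.
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                // Count sales per product in one-hour tumbling windows.
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                .count(Materialized.as("sales-per-product-per-hour"))
                .toStream()
                // Publish (product, count) pairs for downstream consumers.
                .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count.toString()))
                .to("hourly-product-sales", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

A live dashboard could then subscribe to the output topic to render the top sellers as the window advances.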
3) Predict the future of the business
Predictive analytics tells you what is likely to happen in the business. It is made possible by training machine learning (ML) models on historical data.
For example, a recommendation system makes product suggestions for users based on their past buying patterns.
Each of the three problems above comes with its own level of complexity. Typically, the complexity increases as you move from historical to real-time to predictive analytics.
So, it is critical to think from a business standpoint before designing a retail analytics platform to address them. An organization may start with a BI solution today and adopt real-time analytics and machine learning as it grows.
Apache Kafka and its ecosystem
Architecting a retail analytics platform is no doubt a challenging task. Although many packaged solutions claim to deliver a one-size-fits-all answer, they are often expensive and offer little flexibility to customize for your own analytics needs.
This article series discusses building an analytics platform centered around Apache Kafka, an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. If you already own a Kafka cluster or are planning to invest in one, this series will make sense to you.
The primary reason we chose Kafka is the broad interoperability it provides with the analytics ecosystem. While Kafka sits in the middle, providing durable and scalable storage for incoming business events, its ecosystem components provide the means to analyze those events at scale.
For example, Kafka Connect provides connectors to many enterprise systems and data stores, letting you build pipelines that move data in and out of Kafka. Moreover, you can build Kafka Streams and ksqlDB applications to process events coming out of Kafka in real time. Fraud detection pipelines, recommendation engines, and live dashboards are just a few examples of what this ecosystem enables.
We will talk about these integrations in detail as we progress through the post series.
Deciding what data to capture
Before building the platform, we need to think about what data to capture and how it should be done.
A retail business generates a vast amount of data while in operation. We can group this data into two broad categories based on its origin.
Transactional data
The data generated by transactional systems falls into this category, such as POS transactions, credit card payments, inventory restockings, supplier purchase orders, etc.
Apart from that, we can collect operational data from third-party systems as well. Two examples are advertising data extracted from Facebook Ads and campaign performance data from a marketing automation system.
Customer behavioral data
These are not business transactions, but they describe the interactions customers have had with different business touchpoints.
For example, if the business has an online store, we can track the pages customers have clicked on, abandoned shopping carts, and failed payments (see the event sketch after the list below).
Other examples include:
- Comments made by customers on social media
- Mobile application events
- Support tickets raised by customers and call center interactions
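To illustrate, here is one way a behavioral event could be modeled as a Java record (requires Java 16+). The type name and every field are assumptions for illustration; your actual schema depends on which touchpoints you instrument.

```java
import java.time.Instant;

// A hypothetical page-view event from an online store's frontend.
public record PageViewEvent(
        String sessionId,   // anonymous browser session
        String customerId,  // null until the customer signs in
        String pageUrl,     // the page the customer landed on
        String referrer,    // where the customer came from
        Instant occurredAt  // event time, not ingestion time
) {}
```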
While these two categories cover the most widely captured data points, you can also define your own data formats depending on the business use case. For example, a delivery company can track the live location of its fleet via IoT devices.
Collecting this data into the analytics platform is called data ingestion. There are several strategies for data ingestion, and Kafka often plays a critical role there.
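As a minimal sketch of what ingestion can look like, the following snippet publishes a hypothetical POS transaction to a Kafka topic using the plain Java producer API. The topic name ("pos-transactions"), the JSON payload, and the choice of store ID as the record key are all assumptions for illustration.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PosTransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // A hypothetical POS transaction serialized as JSON. In production, you would
        // more likely use Avro or Protobuf together with a schema registry.
        String transaction =
                "{\"storeId\": \"S-042\", \"items\": [{\"sku\": \"SKU-1001\", \"qty\": 2}], \"total\": 34.50}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by store ID keeps each store's transactions ordered within a partition.
            producer.send(new ProducerRecord<>("pos-transactions", "S-042", transaction));
        }
    }
}
```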
We will discuss the ingestion process in the next post in detail.
Where next?
We have now covered what a retail analytics platform is and the value it brings to a business. We also learned why Apache Kafka is a good choice for building such a platform and what types of business data you can feed into it.
In the next post, we will discuss data ingestion in detail, focusing on Kafka and Kafka Connect. The rest of the series discusses extracting value from real-time and historical data using components that interface closely with Kafka.
You can find part 2 here.
If you have prior experience building such systems, please share your opinion as a comment or DM me on Twitter. Your input will be valuable for the rest of the series.