Architecting a Kafka-centric Retail Analytics Platform — Part 2

The layered architecture of the platform, the types of data that exist in the retail domain, and how that data can be ingested into Kafka using its ecosystem components.

Dunith Danushka
Tributary Data
8 min read · Oct 31, 2021


Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

The first post of this series discussed retail analytics in detail and its business value to retailers. We also explored how Apache Kafka and its ecosystem can be beneficial when building a data platform to ingest, process, and consume retail data.

In this second installment of the series, we discuss data ingestion in detail. We will explore the retail data landscape to understand what data to capture and how it can be ingested into Kafka using its ecosystem.

Platform architecture from 33,000 feet above

Instead of treating the platform as a giant monolith, we should partition its architecture into a set of layers that work together to achieve a common goal.

We can categorize these layers based on their role.

  1. Ingestion layer: This layer is responsible for ingesting structured, semi-structured, and unstructured data into the platform from different data sources. The ingestion layer is the main entry point to the analytics platform, and this is where we place Kafka.
  2. Processing layer: Once the data is ingested, it passes through this layer to derive valuable insights. Processing happens along two paths, “hot path” (real-time) processing and “cold path” (batch) processing, which we will discuss in detail in the next post.
  3. Consumption layer: The processed data is consumed at this layer for business decision-making. Typically, BI and ad-hoc analytics fall into this layer. Low-latency OLAP databases and ML model building can also be included here to cater to modern data needs.
  4. Discovery and governance layer: This layer oversees the operational aspects of the platform: security provisioning, schema evolution, metadata management, and data sharing.

You will see how different components of the Kafka ecosystem fit into each layer as we walk through the series.

Layered architecture of a typical retail analytics platform

The retail data landscape

Today, omnichannel retailers offer customers many touchpoints to maintain an optimal shopping experience. These include web, mobile, and in-store channels that generate a vast volume of data every day.

Omnichannel retail model

Based on their origin and nature, we can classify those data as follows.

Data generated by business applications: These are the customer-facing web and mobile applications and the microservices behind them. The retailer’s e-commerce store and mobile application are some examples. We can also include the internal line-of-business (LoB) applications used for retail operations; ERP, CRM, warehouse management, and inventory systems are a few examples.

Data captured in OLTP databases: Operational databases are another source for analytics. Business applications store their state in these databases. That includes transactional records such as orders, payments, and shipment information.

Data captured in external SaaS applications: Retailers may use SaaS platforms for internal operations like ERP, CRM, customer support, marketing automation, and payment processing. Some examples are Salesforce, SAP, ServiceNow, Zendesk, HubSpot, and Stripe. This data often resides in the vendor’s infrastructure and is extracted through periodic API calls.

Data captured in legacy systems: Many retailers are still stuck with legacy COTS packages and mainframes for operational purposes. This data has to be extracted for analytics using special techniques (e.g., integration middleware), such as file-based integrations, message broker integrations, etc.

Clickstream data: User interactions on retail channels are usually recorded as log files. Pageviews, button clicks, ad campaign performance, and A/B testing results are a few examples.

Social media data: This data includes the activities performed by users on different social media platforms. For example, tweets and the retailer’s Facebook page comments reflect user sentiment and brand loyalty.

Data comes in different shapes, sizes, and velocities

We need to collect all this data and feed it into the analytics platform to derive meaningful insights. But doing so comes with several challenges. First, data comes in different formats and sizes: structured, semi-structured, and unstructured data arrive from different sources, ranging from small individual records to large files.

Data also arrives at different velocities. Some data items stream in very fast, while others trickle in slowly. Clicks on the e-commerce site and inventory restockings, for example, have very different velocities.

One thing is clear here: the incoming data is not in good enough shape to run analytics workloads on directly. It needs to be cleansed and enriched beforehand. After extraction, we also need to store it until it is ready for analysis.

That process is called data ingestion. Let’s look at it in the next section.

Data ingestion layer

The ingestion layer

The role of the ingestion layer is to extract data out of the systems we discussed above. Apache Kafka plays a critical role in this layer, as ingested data is stored in Kafka until it is analyzed.

Data ingestion can be periodic or streaming. Either way, Kafka sees every incoming data item as an event. Each event has metadata that describes the event and a payload that contains the actual business data.

For example, we can represent a customer order like this:
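As a purely illustrative sketch (all field names and values below are made up), such an order event could be modeled as a JSON document with a metadata section and a payload section. In Kafka terms, the metadata typically maps onto the record key, headers, and timestamp, while the payload becomes the record value.

```java
public class OrderEventExample {

    // Hypothetical "OrderPlaced" event: metadata that describes the event itself,
    // plus a payload that carries the actual business data. Field names are illustrative.
    static final String ORDER_PLACED_EVENT = """
        {
          "metadata": {
            "eventId": "b7f1c2e0-3a4d-4c1e-9f2b-1d2e3f4a5b6c",
            "eventType": "OrderPlaced",
            "source": "ecommerce-web",
            "occurredAt": "2021-10-31T09:15:00Z"
          },
          "payload": {
            "orderId": "ORD-1042",
            "customerId": "CUST-0017",
            "items": [
              { "sku": "SKU-001", "quantity": 2, "unitPrice": 19.99 }
            ],
            "totalAmount": 39.98,
            "currency": "USD"
          }
        }
        """;

    public static void main(String[] args) {
        System.out.println(ORDER_PLACED_EVENT);
    }
}
```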

Why Kafka?

There are several advantages to using Kafka as the ingestion store.

  • Kafka provides highly scalable, low-latency event storage for incoming events. Its architecture is optimized for fast sequential writes to an append-only log structure, storing all data durably.
  • Kafka’s partition-based design breaks this commit log into chunks and stores them across the cluster nodes with multiple replicas. That provides durable, fault-tolerant storage and lets you grow capacity by scaling out the cluster nodes (see the sketch after this list).
  • Kafka decouples event producers from event consumers, in this case the analytical applications. Producers and consumers are no longer tightly coupled, which promotes a highly scalable pub-sub architecture where many data processing components can consume the same set of events without stepping on each other’s toes.
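As a small sketch of how the partition-based design surfaces in practice, the snippet below uses Kafka’s AdminClient to create a topic with several partitions and replicas. The topic name, partition count, replication factor, and broker address are assumptions chosen for illustration.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; adjust for your cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 6 partitions spread the commit log across brokers for parallelism;
            // replication factor 3 keeps redundant copies for fault tolerance.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```

Each partition is an independent append-only log that can live on a different broker, so adding brokers and partitions is how the cluster scales its storage and throughput.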

Kafka offers several interfaces for ingestion. Let’s look at them in detail.

1. Native Kafka producer protocol

Kafka provides producer APIs for its clients to publish events, with client libraries available for multiple programming languages, including Java, Python, and .NET.

Various business applications, including web applications, microservices, and API backends, can use these APIs to send events to Kafka.
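For instance, a minimal plain-Java producer might look like the sketch below. The broker address, topic name, and payload here are assumptions made for this example.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by order id so all events for the same order land in the same partition.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                "orders", "ORD-1042", "{\"orderId\":\"ORD-1042\",\"totalAmount\":39.98}");

            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Published to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any buffered records
    }
}
```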

For example, frameworks like Spring Boot and Quarkus give Java applications a rich developer experience for communicating with Kafka.

Moreover, enterprise middleware systems can also build integration artifacts on top of Kafka producer APIs. For example, an ESB can extract supplier invoices from a remote SFTP site and publish each invoice to Kafka as an event.

Quarkus reference application that publishes to Kafka

2. Kafka Connect

Kafka Connect is a free, open-source component of Apache Kafka that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems.

Kafka Connect architecture in a nutshell

Kafka Connect is a significant component of the Kafka ecosystem for pulling data into and out of Kafka from different systems. For that, it provides a set of pre-built source and sink connectors, which eliminates the need to write and maintain custom code to integrate those systems with Kafka.

Extracting data from OLTP databases: You can use the Kafka Connect JDBC Source connector to import data from any relational database with a JDBC driver into Apache Kafka topics. The JDBC connector supports a wide variety of databases without requiring custom code for each one.
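As a rough sketch of how that might look, the snippet below registers a JDBC source connector through the Kafka Connect REST API. The connector name, database URL, credentials, table, and column names are assumptions, and the exact configuration keys should be checked against the connector version you run.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSourceConnector {
    public static void main(String[] args) throws Exception {
        // Connector config as JSON; key names follow the Confluent JDBC source connector,
        // but verify them against your connector version.
        String connectorConfig = """
            {
              "name": "retail-orders-jdbc-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:postgresql://orders-db:5432/retail",
                "connection.user": "analytics",
                "connection.password": "secret",
                "mode": "incrementing",
                "incrementing.column.name": "order_id",
                "table.whitelist": "orders",
                "topic.prefix": "oltp-",
                "tasks.max": "1"
              }
            }
            """;

        // Submit the config to an assumed Kafka Connect worker at localhost:8083.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```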

Change data capture (CDC) with Debezium: Debezium provides source connectors for Kafka Connect that listen to the changes made to relational databases, capture them, and reliably transport them to Kafka.

The benefit of CDC is that you get to analyze relational data as it arrives, while it is still fresh. That eliminates costly periodic ETL pipelines that delay insights.

For example, we can use Debezium to stream online orders from the orders database to Kafka. Later, an analytics process can use that data to power a real-time dashboard.
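To sketch the consuming end of that pipeline, a plain Java consumer could read the change events Debezium writes for the orders table. The topic name below follows Debezium’s usual server.schema.table convention, but the broker address, group id, and topic name are assumptions that depend on how the connector is configured.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderChangeEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-dashboard");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assumed topic created by a Debezium connector for the orders table.
            consumer.subscribe(List.of("retail.public.orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a Debezium change-event envelope (before/after/op/ts_ms)
                    // that a dashboard pipeline could parse and aggregate.
                    System.out.println(record.value());
                }
            }
        }
    }
}
```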

Debezium architecture

Periodic extracts from file systems

Kafka Connect provides File System connectors to pull out data from different file systems. We can use these source connectors to extract clickstream logs and stream them to Kafka as events.

3. Pre-built ETL connectors

With the rise of the Modern Data Stack, many vendors now offer support for building low-code and no-code ETL pipelines. They provide a collection of connectors to well-known operational systems to extract and load data so that you don’t have to reinvent the wheel.

Stitch, Fivetran, Singer, and Airbyte are a few examples in this domain. Almost all of these tools support Kafka as a destination.

Some of the connectors offered by Singer, an open-source ETL tool

4. Application event aggregators

Apart from that, platforms like Segment, Mixpanel, and Snowplow provide embeddable SDKs for developers to track user activities performed inside their applications.

For example, a button click event inside a mobile app can be captured and propagated as an event to a destination of your choice.

Segment’s Javascript SDK to trigger an event

These platforms integrate well with Kafka: user behavioral events can be ingested into Kafka and later used for analyses like A/B testing, user journey mapping, and conversion funnel optimization.

Where next?

We have now walked through the retail data landscape and discussed the different strategies you can use to ingest data into Kafka, the main entry point to the analytics platform.

The following post discusses the processing layer and its two modes, real-time (“hot path”) processing and batch (“cold path”) processing, which we can use to derive insights from the ingested data.

If you have missed the first post of the series, you can find it below.
