Real-Time Sales Analytics using Kappa Architecture

Sysco LABS Sri Lanka · Apr 8, 2021

This article was written by Nilaxan Satgunanantham, Hashan Gallage, Malith Wickramaratne and Milroy Fernando.

Customer behavior has never been more dynamic. To make quick decisions, it is important to analyze various data-driven business segments in real time. One such segment is sales.

Why capture real-time Sales?

Most organizations implement traditional business intelligence systems to analyze company performance. In many cases, they treat performance metrics (e.g. sales) as “current” even though they are based on transactional data that was generated hours or days earlier and stored in a data warehouse or an operational database. As a result, companies fail to capitalize on opportunities as they occur, because the sales data and the analyses are neither current nor “real-time”.

In this article, we explore the ways in which we believe real-time Sales would help a company to capitalize on opportunities or mitigate threats.

Making campaign adjustments in real-time.

For example, in the restaurant industry, social media initiatives depend on real-time sales to drive daily promotions (e.g. a buy-one-get-one-free (BOGOF) offer).

Monitoring customer behavior.

For example, a company could detect a significant increase in purchases of a specific item during the day, in real time, and swiftly respond by increasing production of that item to satisfy the excess demand.

In batch processing, data is processed as a group of transactions collected over a specific period (depending on the required time frame). Accordingly, batch processes are executed daily, weekly, or monthly.

Because of this, there is a delay between collecting the data and generating the results. Businesses must wait for the batch process to complete before they can see the results for that period.

To offer good service, you need to attend to customer needs and improve overall satisfaction while monitoring loyalty. In this journey, it is important to gain insights from data that reflects the current, up-to-date status of the business. Real-time sales data processing can play a significant role in this process.

Advantages

  • Using real-time processing instead of batch processing to analyze data allows businesses to make quick decisions without a significant delay in response.
  • Unlike batch processing, real-time processing works with master files that are always up to date, so the information, and the actions based on it, are always current.
  • It allows businesses to immediately see what their customers prefer, what they need, and the problems they face, enabling a more personalized service.
  • Corrective measures can be taken as soon as an issue appears.
  • It makes it easier to identify changes in the business as they happen.

Data Processing Architectures

When it comes to real-time data processing, various data processing architectures are used around the globe. Let’s look at the Lambda and Kappa architectures in detail and find out what makes each of them special and in what circumstances one should be preferred over the other.

Lambda Architecture

Lambda architecture is composed of three layers: a batch layer, a streaming (or real-time) layer, and a serving layer. Both the batch and real-time layers receive events in parallel.

The serving layer then aggregates and merges computation results from both layers into a complete answer.

It consists of two parallel paths for data flow:

  • A cold path (batch layer) stores the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored in the serving layer.
  • A hot path (speed layer) analyzes data in real-time. This layer is designed for low latency, at the expense of accuracy.
(source: https://www.oreilly.com/content/applying-the-kappa-architecture-in-the-telco-industry/)
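As a minimal illustration of the two paths, the sketch below shows how a serving layer might answer a query by merging a precomputed batch view with the speed layer’s running totals. All names and figures here are invented for illustration, not taken from the solution described later.

    # Batch view: recomputed periodically over the full raw dataset.
    batch_view = {"2021-04-07": 18250.00}

    # Speed view: maintained incrementally by the stream job for recent data.
    speed_view = {"2021-04-08": 1430.50}

    def sales_total(dates):
        """Serving layer: merge both views into one complete answer."""
        return sum(batch_view.get(d, 0.0) + speed_view.get(d, 0.0) for d in dates)

    print(sales_total(["2021-04-07", "2021-04-08"]))  # 19680.5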

Kappa Architecture

Kappa Architecture is a simplification of the Lambda Architecture system with the absence of batch processing. It has the same basic goals as the Lambda architecture, but with an important distinction: all data flows through a single path, using a stream processing system.

(source: https://www.oreilly.com/content/applying-the-kappa-architecture-in-the-telco-industry/)
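By contrast, a Kappa-style job keeps a single code path: one stream processor folds the event log into a view, and reprocessing simply means replaying the same log through new logic. In this illustrative sketch, an in-memory list stands in for a durable, replayable stream such as Kafka or Kinesis.

    # Stand-in for a durable, replayable event log (e.g. a Kafka topic).
    event_log = [
        {"date": "2021-04-08", "amount": 120.00},
        {"date": "2021-04-08", "amount": 75.50},
    ]

    def process(events):
        """The single code path: fold the stream into a materialized view."""
        view = {}
        for e in events:
            view[e["date"]] = view.get(e["date"], 0.0) + e["amount"]
        return view

    live_view = process(event_log)     # normal operation
    rebuilt_view = process(event_log)  # "reprocessing" = replay from the start
    print(live_view)                   # {'2021-04-08': 195.5}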

How is Kappa different from Lambda architecture?

A drawback of the Lambda architecture is its complexity. Processing logic appears in two different places, batch processing and stream processing, using different frameworks. This duplicates the computation logic and adds the burden of maintaining two separate code bases for the batch and streaming layers, which makes debugging difficult.

Kappa architecture, on the other hand, removes the duplicated computation logic and the complexity of managing two layers. It is important to note, however, that Kappa is not a blanket replacement for the Lambda architecture; it is preferable only where a pure streaming model fits your use case.

Solution for real-time Sales Analytics

The deployment diagram below depicts collecting sales data from various source systems in real time, processing it periodically, and visualizing the summarized analysis and information on a dashboard.

Data Source Layer

The first layer is the Data Source layer, from which the streaming data is generated. In this solution, sales data can reside in databases, flat files, etc. As denoted in the diagram, the data is passed into the streaming layer, which consists of Amazon Kinesis and Informatica Cloud Mass Ingestion.

Streaming Layer

Two services/platforms are used in this layer to gather data from the sources and process it in real time:

Amazon Kinesis

Amazon Kinesis Data Streams can be used to collect and process large streams of data records in real time. Kinesis acts as a highly available conduit to stream messages between data producers and data consumers.

Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. Generally, the speed/stream layer reduces processing latency to seconds or milliseconds; real-time processing happens while the data is in motion, as records from the Kinesis stream enter the streaming layer.

Even though Amazon Kinesis is used here as the technology to produce data streams, Apache Kafka, Confluent, Google Cloud Dataflow, or Azure Event Hubs could be used as alternatives.

In this case, we are consuming a Kinesis Data Stream that is already being produced.
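For context, a producer writing to such a stream could look like the following boto3 sketch. The stream name, region, and record shape are assumptions made for illustration, since in our solution the stream is produced upstream.

    import json

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

    sale = {"order_id": "A-1001", "category": "meat", "amount": 42.75}

    # Each record carries a partition key, which determines the shard it lands on.
    kinesis.put_record(
        StreamName="sales-stream",  # assumed stream name
        Data=json.dumps(sale).encode("utf-8"),
        PartitionKey=sale["order_id"],
    )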

Informatica Cloud Mass Ingestion

A consumer is an application that processes data from a Kinesis data stream; in our solution, we use Informatica Cloud Mass Ingestion as the consumer application.

Informatica Cloud Mass Ingestion is an enterprise-level cloud service, a Platform as a Service (PaaS) offering from Informatica designed specifically to ingest data efficiently through three main ingestion types:

  • From local files (File Ingestion task)
  • From Databases (Database Ingestion task)
  • From Streaming data (Streaming Ingestion task)

The Cloud Mass Ingestion service can ingest data into cloud or on-premises data lakes and data warehouses. Informatica Cloud Mass Ingestion is a wizard-based authoring solution that supports a variety of sources, including Apache Kafka, Amazon Kinesis, flat files, JMS, MQTT, and web logs. It also supports targets including Apache Kafka, Amazon S3, Amazon Kinesis, Amazon Kinesis Data Firehose, Databricks Delta, Google Cloud, and Microsoft Azure. Further, Informatica provides a wizard dashboard to monitor the performance of the system and its ingestion jobs in real time.

Even though Informatica Mass Ingestion is used here as the technology to process and transform data streams, Azure Stream Analytics, IBM Streams, Google Cloud Stream Analytics, or Oracle Stream Analytics could be used as alternatives.

In our solution, Informatica Cloud Mass Ingestion is used to read real-time sales data from an Amazon Kinesis Stream (the Source), transform JSON formatted stream data according to the business logic, and then push the transformed data into an Amazon S3 bucket (the Target). To accomplish this, the following steps need to be completed:

  1. Defining a Streaming Ingestion Task: The first step is to create a Streaming Ingestion task from the wizard, providing the task name, a folder location for the project, and the runtime environment.
  2. Configuring a source for the Amazon Kinesis Stream: When configuring the source, you first create an Amazon Kinesis Stream source connection, providing a user-defined name, the relevant AWS access key pair, the region, etc. Once the connection is created, its name should be selected from the drop-down, which should automatically populate “Amazon Kinesis” as the connection type in the wizard.
  3. Configuring a target for Amazon S3: When configuring the target, you first create an Amazon S3 target connection, providing a user-defined name, the relevant access and secret keys, the bucket folder path, etc. Once the connection is created, its name should be selected from the drop-down, which should automatically populate “Amazon S3 V2” as the connection type in the wizard. The S3 target file name should then be given as the “object” in the wizard.
  4. Transforming ingested data based on business logic: Streaming data can be transformed via Informatica Mass Ingestion only if the streams are provided in Binary, JSON, or XML format. In our use case, the streaming data is ingested as JSON. Informatica Mass Ingestion supports four types of transformations, each of which can be added multiple times in a single ingestion task:
  • Combiner: Using this transformation, multiple events from a Kinesis Stream can be combined into a single event. To combine multiple JSON formats, the transformation combines the incoming data into an array of data and returns JSON array objects as the output. Always ensure that this is the final transformation in the task flow.
  • Filter: Filter transformation is used to filter out the data streams based on various conditions. To filter out JSON data, a JSONPath expression needs to be provided. E.g. $.food.meat[?(@.price<100)]
  • Python: Python code, or the path to a Python script, can be provided as the script input type in the Python transformation configuration wizard. The incoming JSON message is available as a string in the inputData variable, and the outgoing JSON message must be written as a string to the outputData variable (see the sketch after these steps). Also, ensure that the directory of the Python path libraries is entered in the wizard.
  • Splitter: A Splitter transformation can be used to split message arrays into separate messages based on specified conditions. For example, a JSON file can be split into separate files based on the array element specified by a JSONPath expression. Each generated file is comprised of an element of the specified array.

  5. Configuring Runtime Options: Runtime options need to be configured to notify users about rejected events and errors. Parameters such as user email addresses and the notification frequency need to be entered.
  6. Deploying the Streaming Ingestion Task: The ingestion task can be deployed after you save it. If the deployment is successful, the job is queued to run.
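The Python transformation from step 4 can be illustrated with a minimal sketch. Per the description above, the runtime supplies the incoming JSON message as a string in inputData and expects the transformed message back as a string in outputData; the record fields and the business logic below are invented for illustration, and inputData is simulated so the script can be run standalone.

    import json

    # In the Informatica runtime, inputData is injected by the service;
    # it is simulated here so the sketch can be tested on its own.
    inputData = '{"order_id": "A-1001", "category": "meat", "amount": "42.749"}'

    record = json.loads(inputData)

    # Illustrative business logic: normalize the amount and tag the event.
    record["amount"] = round(float(record.get("amount", 0)), 2)
    record["processed"] = True

    # The runtime reads the transformed message back from outputData.
    outputData = json.dumps(record)
    print(outputData)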

Storage Layer

Amazon S3

Amazon Simple Storage Service (S3) is a cloud object storage service built to store, back up, and archive data and applications on AWS. In our solution, once the data is processed and transformed through Informatica Mass Ingestion, it is pushed into the target S3 storage as JSON files.

To convert the JSON files into a format that can be queried from a BI tool, the JSON data is converted into CSV within S3 using AWS Lambda.
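A minimal sketch of that conversion step follows, assuming the Lambda function is triggered by S3 object-created events and that each object contains a JSON array of flat sales records; the bucket layout and the csv/ output prefix are assumptions, not the actual configuration.

    import csv
    import io
    import json

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Convert an uploaded JSON file of sales records into a CSV file."""
        for rec in event["Records"]:
            bucket = rec["s3"]["bucket"]["name"]
            key = rec["s3"]["object"]["key"]

            # Read the JSON object written by the ingestion task.
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            rows = json.loads(body)

            # Write the same records out as CSV.
            buf = io.StringIO()
            writer = csv.DictWriter(buf, fieldnames=sorted(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

            # Writing under a separate prefix keeps a trigger scoped to the
            # JSON prefix from firing again on the converted file.
            out_key = "csv/" + key.rsplit("/", 1)[-1].replace(".json", ".csv")
            s3.put_object(Bucket=bucket, Key=out_key, Body=buf.getvalue())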

Serving/Visualization Layer

This layer is used to create datasets from the files in the Amazon S3 buckets and use them to serve end-user queries on an ad-hoc basis, either in real time or periodically. To accomplish this, our solution uses Amazon QuickSight.

Amazon QuickSight

Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include Machine Learning-powered insights. QuickSight dashboards can be accessed from any device, and seamlessly embedded into your applications, portals, and websites.

- Amazon Web Services

Amazon powers the QuickSight platform with its SPICE (Super-fast, Parallel, In-memory Calculation Engine) data engine, which performs optimized in-memory calculations and is designed for quick, up-to-date analysis. QuickSight has its own unique architecture compared with competitors such as Microsoft Power BI and Tableau.

Amazon QuickSight supports a variety of data sources that you can use to provide data for analysis. It reads data from Amazon Aurora, Amazon Redshift, Amazon Relational Database Service, Amazon Simple Storage Service (S3), Amazon DynamoDB, Amazon Elastic MapReduce, and Amazon Kinesis.

The service also integrates with on-premises databases, file uploads, and API-based data sources, such as Salesforce.

The right information at your fingertips is the key to making sound business decisions. However, your ability to interpret this information and the speed at which you can access it are equally important.

In our solution, we provide interactive dashboards with the following metrics from sales data in real-time.

  • Sales Performance:

When data is collected in real time, it can be used to monitor sales performance as it happens. The dashboard above contains several visualizations, updated in near real time (every hour), to monitor a business’s ongoing sales: a line chart displaying hourly total sales, a tree map, refreshed hourly, that aggregates the day’s sales by category, and a geo map indicating location-based sales during the last hour.

  • Sales Growth Between Periods

Real-time data can also be used to track a company’s sales growth. The line chart, updated in near real time, compares the company’s current-year sales with the previous year’s sales.

This chart shows healthy sales growth this year compared to the previous year.

The vertical bar chart indicates that the current year’s sales for September are significantly higher than those of September last year.

The horizontal bar chart compares year-to-date sales across the company’s sales managers, which helps identify the best-performing sales managers within the current month, quarter, and year.

  • Sales Target (Actual Revenue vs Forecasted Revenue)

These two charts show the company’s actual sales within the day (hourly and aggregated hourly) against the forecasted sales, predicted by a time series model based on the company’s actual sales over the last six months.

Even though hourly sales exceed the forecast in some hours, the aggregated chart shows that actual aggregated sales for the day are $3,535.00 below the forecasted aggregate by 7:00 PM.

Further, the aggregated chart is supported by a trend line, updated in near real time, from a simple linear regression model fitted to the day’s actual sales. As of the last update at 7:00 PM, the trend line shows a good fit to the actual sales data, with a high R-squared value.
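A sketch of how such a trend line and its R-squared value can be computed from the day’s aggregated hourly sales is shown below; the sales figures are invented for illustration.

    import numpy as np

    hours = np.arange(9, 20)  # 9 AM through 7 PM
    sales = np.array([310, 640, 980, 1405, 1820, 2260,
                      2690, 3040, 3450, 3870, 4310], dtype=float)

    # Simple linear regression (least squares) over the day's aggregated sales.
    slope, intercept = np.polyfit(hours, sales, 1)
    fitted = slope * hours + intercept

    # R-squared: the closer to 1, the better the trend line fits the data.
    ss_res = np.sum((sales - fitted) ** 2)
    ss_tot = np.sum((sales - sales.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot

    print(f"trend: {slope:.1f}/hour, R^2 = {r_squared:.3f}")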

In summary, data is the new gold. In today’s highly competitive business environment, making fast and accurate decisions is the key to thriving and staying a cut above your competition. Real-time analysis is the answer.
