Implementing Serverless Real-Time Analytics for SaaS with AWS Services

Jari Ikävalko
Skillwell
Sep 6, 2023 · 11 min read


In today’s data-driven world, the ability to make real-time decisions based on analytics is crucial to the success of any SaaS business. Real-time analytics offer SaaS business owners a powerful tool for improving customer experience, operational efficiency, and bottom-line metrics.

Benefits of real-time analytics for a SaaS business include:

  1. Real-time analytics can help you understand how users are interacting with the platform at any given moment. This makes it possible to identify which features are most popular and which ones may be causing user drop-offs, allowing you to make immediate adjustments to improve user experience and retention.
  2. With real-time data, it is possible to track how features or updates are received instantly. This allows for agile product development as it is possible to quickly iterate based on immediate user feedback.
  3. Real-time analytics can reveal bottlenecks or inefficiencies in the system, like slow-loading pages or features that are not being utilized as intended. Quick identification and resolution of these issues can improve operational efficiency.
  4. Real-time monitoring allows a SaaS business’s operations team to better understand resource utilization. This enables the team to make immediate adjustments to resource allocation, saving costs in the long run.
  5. By tracking customer behavior in real-time, it is possible to identify upsell or cross-sell opportunities more readily. For instance, if a user spends a considerable amount of time looking at a premium feature, an automated prompt could offer them a limited-time discount on that feature.
  6. Real-time analytics can help in identifying abnormal patterns or potential security breaches as they happen, allowing for immediate action to prevent data loss or compliance violations.
  7. Access to real-time insights enables faster decision-making, giving an edge over competitors who may be relying on batch processing or less sophisticated data analytics tools.
  8. Real-time analytics can also provide a SaaS business’s support team with invaluable insights into customer issues, allowing them to deliver quicker and more effective solutions.

So real-time analytics for a SaaS business has many valuable benefits. However, implementing a robust real-time analytics solution can be complex and costly.

In this blog post I have chosen to build the real-time analytics solution using a specific suite of AWS services that allow for easy ingestion, processing, and storage of large streams of real-time data.

The services in this suite are:

  • Amazon Kinesis Data Streams: Ideal for collecting large streams of data in real time.
  • AWS Lambda: Offers serverless computing, allowing you to run code without provisioning or managing servers.
  • Amazon Redshift Serverless: A fast, fully-managed data warehouse that allows complex SQL queries on structured data.

Together, these services can be used to create a seamless, serverless real-time analytics pipeline that is robust and cost-effective. Each service within this architecture comes with built-in fault tolerance and disaster recovery features, ensuring that the real-time analytics system remains operational.

Use Case: Real-Time User Behavior Analytics for a SaaS Platform

Imagine you are operating a SaaS application and you want to track how users interact with your platform. By analyzing this data in real-time, you can improve your service by understanding what features are most used, identifying bottlenecks, and even detecting fraudulent activities.

In this blog post I do not cover how the user activities are tracked or the details of communicating with Amazon Kinesis Data Streams. There are different alternatives for these, and the selected method depends on the SaaS solution itself and its architecture.
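That said, to make the rest of the post concrete, here is a minimal, illustrative sketch of what the producer side could look like, using the AWS SDK for JavaScript v3. The stream name matches the templates later in this post; the function name and event fields are assumptions.

import { KinesisClient, PutRecordCommand } from '@aws-sdk/client-kinesis';

const kinesis = new KinesisClient({});

// Sends one user-interaction event to the stream defined later in this post.
export const sendEvent = async (userId: number, eventType: string): Promise<void> => {
  const event = {
    userId,
    eventType,
    eventTimestamp: new Date().toISOString(),
  };

  await kinesis.send(new PutRecordCommand({
    StreamName: 'SaaSAppStream',
    // Kinesis records are raw bytes; the consumer Lambda base64-decodes and JSON-parses them.
    Data: Buffer.from(JSON.stringify(event)),
    // The partition key determines the shard; using the user id spreads writes across shards.
    PartitionKey: String(userId),
  }));
};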

Neither does this blog post cover in detail the different data analysis methods available after the data is written into Amazon Redshift.

Instead, I focus on the data ingestion part: how to set up a serverless solution that can receive streaming data with Amazon Kinesis and store it in Amazon Redshift.

Implementation Steps

The implementation of the solution consists of the following steps:

  1. Data Ingestion with Kinesis: Configure a Kinesis stream to collect events from your SaaS application. These events could include user logins, feature usage, and other interactions.
  2. Data Processing with Lambda: Attach a Lambda function to the Kinesis stream. This function will be triggered whenever new data arrives. You can then transform the raw data into a more digestible format or filter out irrelevant events.
  3. Data Storage in Redshift: Use Lambda to populate a Redshift table. This table will serve as the foundation for your analytics queries.
  4. Analytics Queries: Use SQL queries on your Redshift table to gain insights into user behavior.

Configuring Kinesis Data Stream for Data Ingestion

In our company, we primarily use AWS CloudFormation and the Cloud Development Kit (CDK) for provisioning AWS resources. The choice between these two options depends on a variety of factors. As a general rule of thumb, we opt for CDK when we need to deploy unique instances of a solution for multiple customers across single or multiple AWS accounts. In other scenarios, CloudFormation is our go-to choice.

Here’s a simple CloudFormation template for setting up a Kinesis Data Stream:

Resources:
  SaaSAppKinesisStream:
    Type: "AWS::Kinesis::Stream"
    Properties:
      Name: "SaaSAppStream"
      ShardCount: 1
      RetentionPeriodHours: 24

And here’s a similar setup written in TypeScript using the CDK:

import * as cdk from '@aws-cdk/core';
import * as kinesis from '@aws-cdk/aws-kinesis';

export class SaaSRealTimeStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new kinesis.Stream(this, 'SaaSAppKinesisStream', {
      streamName: 'SaaSAppStream',
      shardCount: 1,
      retentionPeriod: cdk.Duration.hours(24)
    });
  }
}

Both of these methods ultimately deploy a CloudFormation stack that includes a Kinesis Data Stream resource. In the example provided, two key configuration parameters are emphasized: shardCount and retentionPeriod. The former determines the data throughput capacity of the stream, while the latter specifies the duration, in hours, for which the data records remain accessible in the stream. It's worth noting that a single shard can accommodate up to 1MB per second or 1,000 records per second in write throughput.

It’s important to grasp that both of these example setups leverage a robust AWS managed service. Kinesis Data Stream is not only reliable for ingesting streaming data but also offers built-in scalability options through shard count adjustments. Additionally, this scaling can be further automated as required, offering a highly adaptable solution.
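As a rough illustration of that automation, the sketch below resizes the stream to a target shard count with the AWS SDK for JavaScript v3. In practice you would trigger something like this from a CloudWatch alarm on the stream’s write metrics, or sidestep shard management entirely with Kinesis on-demand capacity mode; the trigger wiring and scaling policy are omitted here and the helper is purely hypothetical.

import { KinesisClient, UpdateShardCountCommand } from '@aws-sdk/client-kinesis';

const kinesis = new KinesisClient({});

// Illustrative scaling helper: sets the stream to a new shard count.
export const scaleStream = async (targetShardCount: number): Promise<void> => {
  await kinesis.send(new UpdateShardCountCommand({
    StreamName: 'SaaSAppStream',
    TargetShardCount: targetShardCount,
    // UNIFORM_SCALING splits or merges shards evenly across the key space.
    ScalingType: 'UNIFORM_SCALING',
  }));
};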

Data Processing with Lambda

After the Kinesis Data Stream is set up, we need to define a Lambda function that processes data from the stream. In our company we usually write these kinds of data-processing Lambda functions in TypeScript, targeting the AWS Lambda Node.js runtime.

Here’s an example of the Lambda function that receives data records from the Kinesis Stream:

import { KinesisStreamEvent } from 'aws-lambda';

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  console.log(`Received ${event.Records.length} records.`);

  for (const record of event.Records) {
    // Decode the Kinesis record payload
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf-8');
    const parsedPayload = JSON.parse(payload);

    // Process the received data
    console.log(`Processing record ${record.kinesis.sequenceNumber}:`, parsedPayload);

    // TODO: Here's a place for storing the data to Redshift
  }
};

The example code assumes that the data arriving in the Kinesis Stream is in JSON format.
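For reference, here is the kind of payload the function assumes each record to carry. The field names are illustrative; they match the fields used when writing to Redshift later in this post, but should of course follow whatever your tracking code actually sends.

// An example of the JSON payload a single Kinesis record might carry (field names are illustrative).
const exampleEvent = {
  userId: 42,
  eventType: 'feature_opened',
  eventTimestamp: '2023-09-06T12:34:56Z',
  additionalData: { feature: 'reports', durationMs: 1250 }
};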

After we have written the Lambda function we need to define it as a resource and configure it to receive the data from the Kinesis Stream.

Here’s an example that extends the above CloudFormation template with the needed additions:

Resources:
  SaaSAppKinesisStream:
    Type: "AWS::Kinesis::Stream"
    Properties:
      Name: "SaaSAppStream"
      ShardCount: 1
      RetentionPeriodHours: 24

  KinesisLambdaExecutionRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service: "lambda.amazonaws.com"
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: "KinesisAccess"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "kinesis:GetRecords"
                  - "kinesis:GetShardIterator"
                  - "kinesis:DescribeStream"
                  - "kinesis:ListStreams"
                Resource: "*"

  KinesisLambdaFunction:
    Type: "AWS::Lambda::Function"
    Properties:
      FunctionName: "KinesisLambdaFunction"
      Handler: "index.handler"
      Role: !GetAtt KinesisLambdaExecutionRole.Arn
      Code:
        S3Bucket: "example-bucket-name"
        S3Key: "lambda-function.zip"
      Runtime: "nodejs16.x"

  KinesisLambdaEventSourceMapping:
    Type: "AWS::Lambda::EventSourceMapping"
    Properties:
      FunctionName: !Ref KinesisLambdaFunction
      EventSourceArn: !GetAtt SaaSAppKinesisStream.Arn
      StartingPosition: "TRIM_HORIZON"

And here’s the same for the CDK example:

import * as cdk from '@aws-cdk/core';
import * as kinesis from '@aws-cdk/aws-kinesis';
import * as lambda from '@aws-cdk/aws-lambda';
import * as iam from '@aws-cdk/aws-iam';
import * as lambdaEventSources from '@aws-cdk/aws-lambda-event-sources';

export class SaaSRealTimeStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create a Kinesis Stream
    const kinesisStream = new kinesis.Stream(this, 'SaaSAppKinesisStream', {
      streamName: 'SaaSAppStream',
      shardCount: 1,
      retentionPeriod: cdk.Duration.hours(24)
    });

    // Create an IAM role for the Lambda function
    const lambdaRole = new iam.Role(this, 'KinesisLambdaRole', {
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
    });

    lambdaRole.addToPolicy(new iam.PolicyStatement({
      actions: [
        "kinesis:GetRecords",
        "kinesis:GetShardIterator",
        "kinesis:DescribeStream",
        "kinesis:ListStreams"
      ],
      resources: [kinesisStream.streamArn],
    }));

    // Create a Lambda function
    const kinesisLambda = new lambda.Function(this, 'KinesisLambdaFunction', {
      functionName: 'KinesisLambdaFunction',
      handler: 'index.handler',
      runtime: lambda.Runtime.NODEJS_16_X,
      code: lambda.Code.fromAsset('path/to/lambda/code'),
      role: lambdaRole,
      environment: {
        STREAM_NAME: kinesisStream.streamName
      }
    });

    // Add Kinesis as an event source to the Lambda function
    kinesisLambda.addEventSource(new lambdaEventSources.KinesisEventSource(kinesisStream, {
      startingPosition: lambda.StartingPosition.TRIM_HORIZON
    }));
  }
}

It’s important to note that when these AWS services interact, specific permissions are required for the Lambda function to read from the Kinesis Data Stream. To address this, the examples include an IAM role specifically for Lambda execution. This role is endowed with the necessary IAM policies to fetch data records from the stream.

Data Storage in Redshift

Once data is successfully channeled through AWS Lambda and Kinesis, our next step is to establish a robust and scalable storage solution to transform this data into actionable insights. To achieve this, we’ll employ a serverless variant of Amazon Redshift, which is AWS’s fully managed data warehouse service optimized for high-performance analytics.

When setting up Amazon Redshift Serverless we need to define a workgroup and a namespace. A workgroup is a collection of compute resources and a namespace is a collection of database objects and users.

Setting Up a Workgroup for Amazon Redshift Serverless

For an Amazon Redshift Serverless workgroup, we need to define various compute resources such as RPUs, VPC subnet groups, and security groups. It’s important to note that because Amazon Redshift Serverless runs within a VPC, this configuration also affects the Lambda function and its ability to access the data storage. For secure access to Redshift, I recommend also running the Lambda function in a VPC.

It’s also worth noting the pricing structure for Amazon Redshift Serverless. Costs are calculated based on Redshift Processing Units (RPUs). For instance, in the US East (N. Virginia) region, the rate is currently $0.375 per RPU-hour. The minimum capacity is 8 RPUs, and you pay for the workloads you run in RPU-hours on a per-second basis (with a 60-second minimum charge). As a simple example, a workload running at the 8-RPU minimum for one full hour would cost 8 × $0.375 = $3.00. The configured data warehouse capacity automatically scales up and down to meet workload demands and shuts down during periods of inactivity, during which you incur no compute costs.

A tip especially for development environments: Given that the Kinesis Data Stream retains data for a specific retention period, you can also manage Redshift costs by controlling the timing of data ingestion from the stream via the Lambda function. One approach is to enable and disable the Lambda function’s Kinesis event source mapping on a schedule, as sketched below.
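Here is a rough sketch of that idea (not part of the templates above): a small function that enables or disables the event source mapping with the AWS SDK for JavaScript v3. It could be invoked by two scheduled EventBridge rules, one in the evening and one in the morning. The mapping UUID is assumed to come from an environment variable, for example a stack output.

import { LambdaClient, UpdateEventSourceMappingCommand } from '@aws-sdk/client-lambda';

const lambdaClient = new LambdaClient({});

// Enables or disables the Kinesis -> Lambda event source mapping.
// ESM_UUID is assumed to hold the UUID of the mapping.
export const toggleIngestion = async (enabled: boolean): Promise<void> => {
  await lambdaClient.send(new UpdateEventSourceMappingCommand({
    UUID: process.env.ESM_UUID,
    Enabled: enabled,
  }));
};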

Here’s an example CloudFormation template for defining a workgroup for the Amazon Redshift Serverless:

Resources:
  RedshiftServerlessWorkgroup:
    Type: "AWS::RedshiftServerless::Workgroup"
    Properties:
      BaseCapacity: 8
      EnhancedVpcRouting: true
      NamespaceName:
        Ref: NamespaceName
      PubliclyAccessible: false
      SecurityGroupIds:
        - Fn::ImportValue: !Sub "${app}-${env}-redshift-sg"
      SubnetIds:
        Fn::Split:
          - ","
          - Fn::ImportValue: !Sub "${app}-${env}-subnet"
      WorkgroupName: "SaaSRealTimeWorkGroup"

As shown in the template, Redshift requires defining subnet and security group details. And because of this, the definition of the Lambda function needs to be updated to also contain a VPC configuration:

KinesisLambdaFunction:
  Type: "AWS::Lambda::Function"
  Properties:
    FunctionName: "KinesisLambdaFunction"
    Handler: "index.handler"
    Role: !GetAtt KinesisLambdaExecutionRole.Arn
    Code:
      S3Bucket: "example-bucket-name"
      S3Key: "lambda-function.zip"
    Runtime: "nodejs16.x"
    VpcConfig:
      SecurityGroupIds:
        - Fn::ImportValue: !Sub "${app}-${env}-lambda-sg"
      SubnetIds:
        Fn::Split:
          - ","
          - Fn::ImportValue: !Sub "${app}-${env}-subnet"

For simplification, I do not update the templates to contain all the needed details, like the subnet and security group definitions, or the additional EC2 network interface permissions the Lambda execution role needs when the function runs in a VPC.

This time, I’ll also omit the CDK example, as I don’t yet have experience setting up Amazon Redshift Serverless using CDK.

Setting Up a Namespace for Amazon Redshift Serverless

Together with the workgroup we also need to define a namespace that will contain the database objects and users.

The definition can be done using CloudFormation. Here’s an example template where the values are passed in as parameter references.

Resources:
  RedshiftServerlessNamespace:
    Type: "AWS::RedshiftServerless::Namespace"
    Properties:
      AdminUsername:
        Ref: AdminUsername
      AdminUserPassword:
        Ref: AdminUserPassword
      DbName:
        Ref: DatabaseName
      NamespaceName:
        Ref: NamespaceName
      IamRoles:
        - Ref: IAMRole

From Infrastructure to Action: Setting Up the Database Schema and Facilitating Data Analytics

Having covered the foundational elements of our serverless real-time analytics architecture — including Kinesis Data Stream, Lambda function, and Amazon Redshift Serverless — it’s time to shift our focus to the operational aspects.

The missing steps before being able to analyze the data are:

  1. Setting up the Redshift database schema
  2. Updating the Lambda function to write data into Redshift
  3. Deciding on a tool for data analysis

Setting Up the Redshift Database Schema

To create the required database schema, we first need a database connection. This can be done through the AWS Management Console (using the Redshift query editor v2), with a SQL client that supports PostgreSQL (Redshift is based on this database engine), or programmatically via the Redshift Data API, which is sketched after the SQL statements below.

For this example use case let’s define just a simple table for tracking user interactions within a SaaS application. Note that although Redshift speaks the PostgreSQL protocol, it does not support PostgreSQL’s SERIAL or JSONB types; the table below therefore uses an IDENTITY column and the SUPER type for semi-structured data. The following SQL command creates the table to store this data.

CREATE TABLE saas_user_interactions (
  id BIGINT IDENTITY(1,1) PRIMARY KEY,
  user_id INT NOT NULL,
  event_type VARCHAR(50),
  event_timestamp TIMESTAMP,
  additional_data SUPER
);

And let’s also add a sort key to speed up queries that filter on the event timestamp.

-- Adding a sort key
ALTER TABLE saas_user_interactions
ALTER SORTKEY (event_timestamp);
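If you prefer to script the schema creation instead of running these statements from a client, here’s a hedged sketch using the Redshift Data API, which lets you run SQL against Redshift Serverless over HTTPS without managing a database connection. The workgroup name matches the CloudFormation example above; the database name is an assumed value for the DatabaseName parameter, and the helper function itself is illustrative.

import { RedshiftDataClient, ExecuteStatementCommand } from '@aws-sdk/client-redshift-data';

const redshiftData = new RedshiftDataClient({});

// Runs a single DDL statement against the serverless workgroup.
export const runDdl = async (sql: string): Promise<void> => {
  await redshiftData.send(new ExecuteStatementCommand({
    WorkgroupName: 'SaaSRealTimeWorkGroup', // from the workgroup template above
    Database: 'saas_analytics',             // assumed value of the DatabaseName parameter
    // If you use database credentials from Secrets Manager, add SecretArn here as well.
    Sql: sql,
  }));
};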

Updating the Lambda function to write data into Redshift

To write the event data into Redshift, let’s update the Lambda function to use a library named pg-promise.

Here’s an example of the updated function code:

import { KinesisStreamEvent } from 'aws-lambda';
import pgPromise from 'pg-promise';

// pg-promise must be initialized first; the resulting library object then creates the database instance.
const pgp = pgPromise();

const db = pgp({
  host: process.env.REDSHIFT_HOST,
  port: Number(process.env.REDSHIFT_PORT),
  database: process.env.REDSHIFT_DATABASE,
  user: process.env.REDSHIFT_USERNAME,
  password: process.env.REDSHIFT_PASSWORD
});

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  console.log(`Received ${event.Records.length} records.`);

  for (const record of event.Records) {
    // Decode the Kinesis record payload
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf-8');
    const parsedPayload = JSON.parse(payload);

    // Process the received data
    console.log(`Processing record ${record.kinesis.sequenceNumber}:`, parsedPayload);

    // Extract required fields from the event data
    const { userId, eventType, eventTimestamp, additionalData } = parsedPayload;

    try {
      // JSON_PARSE stores the additional data as a SUPER value rather than a plain string
      await db.none(
        'INSERT INTO saas_user_interactions(user_id, event_type, event_timestamp, additional_data) VALUES($1, $2, $3, JSON_PARSE($4))',
        [userId, eventType, new Date(eventTimestamp), JSON.stringify(additionalData)]
      );
    } catch (error) {
      console.error(`Failed to insert event: ${error}`);
    }
  }
};

In addition to updating the Lambda function, we would also need to handle the database credentials securely, for example by storing them in AWS Secrets Manager and granting the function’s IAM role permission to read them. I will omit that now for simplification.
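A note on throughput as well: the updated function issues one INSERT per record, which is fine for modest volumes but adds a database round trip for every event. A common refinement is to combine all records from an invocation into a single multi-row INSERT. Here’s a rough sketch of that idea using pg-promise’s helpers; the connection configuration and table are the same as above, the JSON_PARSE wrapping mirrors the single-row insert, and error handling is omitted.

import { KinesisStreamEvent } from 'aws-lambda';
import pgPromise from 'pg-promise';

const pgp = pgPromise();

const db = pgp({
  host: process.env.REDSHIFT_HOST,
  port: Number(process.env.REDSHIFT_PORT),
  database: process.env.REDSHIFT_DATABASE,
  user: process.env.REDSHIFT_USERNAME,
  password: process.env.REDSHIFT_PASSWORD
});

// Describes the target table once; pg-promise uses it to build a single multi-row INSERT.
const columns = new pgp.helpers.ColumnSet(
  [
    'user_id',
    'event_type',
    'event_timestamp',
    {
      name: 'additional_data',
      mod: ':raw', // inject pre-formatted SQL so the value can be wrapped in JSON_PARSE()
      init: (col) => pgp.as.format('JSON_PARSE($1)', [JSON.stringify(col.value)])
    }
  ],
  { table: 'saas_user_interactions' }
);

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  const rows = event.Records.map((record) => {
    const payload = JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf-8'));
    return {
      user_id: payload.userId,
      event_type: payload.eventType,
      event_timestamp: new Date(payload.eventTimestamp),
      additional_data: payload.additionalData
    };
  });

  if (rows.length === 0) {
    return;
  }

  // One INSERT with all rows instead of one INSERT per record.
  await db.none(pgp.helpers.insert(rows, columns));
};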

Deciding to Use Amazon QuickSight for Data Analysis

My choice for data analysis and visualization in this example solution would be Amazon QuickSight. This tool offers several advantages:

  1. Seamless integration with Amazon Redshift: QuickSight effortlessly connects with Amazon Redshift, allowing for smooth data transfer and analysis.
  2. Efficient Handling of Large Datasets: The platform is designed to manage large volumes of data, making it ideal for robust analytics.
  3. Cost-Effectiveness: Being a fully managed service, QuickSight eliminates the overhead costs of managing infrastructure and software updates.
  4. ML Insights: QuickSight integrates machine learning capabilities, providing predictive analytics and anomaly detection for more advanced insights.
  5. Accessibility and Collaboration: QuickSight is accessible from multiple devices and allows for easy sharing of dashboards and reports, making collaboration simpler across teams.

These features make Amazon QuickSight a compelling option for our data analytics needs.

Given that this blog post has expanded beyond its originally planned length, I will reserve the details of setting up and using Amazon QuickSight for a future discussion.

Conclusion

In this blog post, we’ve journeyed through the process of setting up a Serverless Real-Time Analytics system for SaaS using a range of AWS services. Starting with data ingestion through Amazon Kinesis Data Streams, we’ve looked at how AWS Lambda functions can be leveraged to process the incoming data. We then delved into storing this data in Amazon Redshift Serverless, a scalable and robust data warehouse solution.

Though we’ve just scratched the surface of what’s possible, especially in terms of data analysis and visualization using Amazon QuickSight, the foundation laid here offers limitless possibilities for future enhancements.

Whether you’re a technical decision-maker, a software architect, or a SaaS business owner, understanding and implementing these technologies can provide you with actionable insights, operational efficiencies, and a significant competitive edge in the marketplace.

Thank you for reading, and stay tuned for more in-depth articles on leveraging AWS services for your SaaS solutions.


Jari Ikävalko
Skillwell

Solutions Architect at Skillwell. AWS Ambassador. Specializing in SaaS and AWS integrations. Author on scalable, secure SaaS.