Stories by Kamalakar Bejugama on Medium

Unpacking Pelago Data Platform Layered Architecture

Kamalakar Bejugama — Wed, 17 Sep 2025 03:53:32 GMT

Introduction:

Pelago, a rapidly growing traveler attractions & experiences ticketing platform from Singapore Airlines, faced significant challenges with its initial data platform. The previous setup was not standardised, leading to issues with scaling, data refresh times, and ease of making changes. This resulted in:

Challenges:

Cluttered tables and views with interdependent logic.
Data sets and dashboards taking several hours to refresh.
Complex data patching tasks when changes were needed.

These issues underscored the urgent need for a more structured, resilient, and scalable data platform.

To address them, we implemented a layered data architecture using dbt, organised into Bronze, Silver, and Gold layers. This approach is inspired by the popular Medallion Architecture principles and is designed to ensure data quality, reliability, and accessibility. Ingestion tools such as Airflow, Glue, and Hevo bring data in from various sources and systems, covering both real-time and batch loads.

Modern Ingestion Pipelines:

To ensure a robust foundation for our layered architecture, we established a modern ingestion pipeline utilising a combination of industry-standard tools:

Airflow: Orchestrates complex data workflows, ensuring timely and reliable data loading from various sources.
AWS Glue: Used for serverless ETL jobs, particularly for processing large volumes of data and handling diverse data formats.
Hevo Data: No-Code ETL.

Crucially, all ingested data undergoes a set of platform-defined standardisation rules:

Date/Time Format Handling: Consistent UTC timestamps and standardised date formats across all datasets.
Currency Format Handling: All monetary values adhere to a single currency standard (e.g., SGD) and format.
Semi-Structured Data Handling: JSON and other semi-structured fields are parsed and flattened into usable columns.
Addition of DE Tracking Columns: Essential metadata columns such as sys_process_date, sys_process_time, etc. are added for improved data governance and lineage tracking.
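
As a rough illustration of the four rules above, here is a minimal Python sketch applied to a single raw record. The field names (booked_at, currency, amount, payload), the fixed exchange rate, and the function names are assumptions made for this example and are not taken from Pelago's pipelines; only sys_process_date and sys_process_time come from the list above.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical fixed rates to SGD, for illustration only.
RATES_TO_SGD = {"SGD": Decimal("1"), "USD": Decimal("1.35")}


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level with underscore-joined keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def standardise(record, now=None):
    """Apply the four ingestion rules to one raw record (assumed field names)."""
    now = now or datetime.now(timezone.utc)
    out = {}

    # 1. Date/time: parse an ISO-8601 timestamp with offset and convert to UTC.
    booked_at = datetime.fromisoformat(record["booked_at"])
    out["booked_at_utc"] = booked_at.astimezone(timezone.utc).isoformat()

    # 2. Currency: convert the amount to SGD, rounded to cents.
    rate = RATES_TO_SGD[record["currency"]]
    out["amount_sgd"] = (Decimal(record["amount"]) * rate).quantize(Decimal("0.01"))

    # 3. Semi-structured data: parse the JSON payload and flatten it into columns.
    out.update(flatten(json.loads(record["payload"])))

    # 4. DE tracking columns.
    out["sys_process_date"] = now.date().isoformat()
    out["sys_process_time"] = now.time().isoformat(timespec="seconds")
    return out
```

In practice these rules would live in the ingestion jobs themselves; the point of the sketch is only that each rule is a small, deterministic step that can be applied uniformly to every source.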

Layered Data Architecture: Bronze, Silver, Gold


At the heart of Pelago’s modernised data platform is our layered architecture, implemented using dbt (data build tool). This framework has been instrumental in defining, transforming, and managing our data models across three distinct layers, inspired by the popular Medallion Architecture.

Data Architecture

Bronze Layer (T0 — Raw/Source of Truth):

The Bronze layer holds data as-is from its source, preserving an immutable source of truth.

Purpose: To capture raw, untransformed data. This layer acts as a readily available historical archive, enabling re-runs and reprocessing in case of unexpected failures in downstream pipelines without needing to re-ingest from external sources.
Key Characteristics: Minimal transformations applied, primarily focused on schema inference and initial data type conversion. Our platform-defined ingestion standardizations (as mentioned above) are applied here to create a consistent raw foundation.

Silver Layer (T1 — Curated & Conformed):

The Silver layer is where medium to complex transformations are applied to the Bronze layer tables. This layer focuses on data cleansing, enrichment, and conforming data from various sources into a unified, business-friendly structure.

Purpose: To build curated, highly usable, and often denormalized record-level data tailored for specific business cases. This layer integrates data from multiple Bronze tables, resolves discrepancies, and applies business rules.
Key Characteristics: Data quality checks, deduplication, joining related entities, and initial feature engineering for analytical purposes. The Silver layer is designed to be highly reusable for various analytical and operational needs.
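
To make the Silver-layer steps concrete, the following Python sketch deduplicates Bronze rows, joins a related entity, and applies a simple data quality check. Our actual models are dbt models; this is only an illustration of the logic, and the table shapes and column names (booking_id, product_id, amount_sgd, and so on) are assumed for the example.

```python
def build_silver_bookings(bronze_bookings, bronze_products):
    """Deduplicate bookings and join product attributes (hypothetical schema)."""
    # Deduplicate: keep the most recently processed row per booking_id.
    latest = {}
    for row in bronze_bookings:
        current = latest.get(row["booking_id"])
        if current is None or row["sys_process_time"] > current["sys_process_time"]:
            latest[row["booking_id"]] = row

    # Join: look up product attributes by product_id.
    products = {p["product_id"]: p for p in bronze_products}

    silver = []
    for row in latest.values():
        product = products.get(row["product_id"])
        # Data quality check: skip rows with an unknown product or a negative amount.
        if product is None or row["amount_sgd"] < 0:
            continue
        silver.append({
            "booking_id": row["booking_id"],
            "product_name": product["name"],
            "country": product["country"],
            "amount_sgd": row["amount_sgd"],
        })
    return sorted(silver, key=lambda r: r["booking_id"])
```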

Gold Layer (T2 — Aggregated & Domain-Specific):

The Gold layer represents the highly curated, aggregated, and business-domain-specific data. This is the layer primarily exposed to business users and tools for direct consumption.

Purpose: To provide readily consumable, day-level metric data and key performance indicators (KPIs) for each business domain. This layer is optimised for fast querying and easy integration into dashboards and reporting tools, leading to faster decision-making across the organization.
Key Characteristics: Aggregations (e.g., daily sales, weekly bookings), pre-calculated metrics, and highly denormalised tables optimised for specific reporting requirements. Data in this layer is often structured to directly power specific dashboards or analytical applications.
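
A day-level Gold aggregation can be sketched as follows. This is illustrative Python rather than one of our dbt models, and the input columns (booking_date, country, amount_sgd) and metric names are assumptions made for the example.

```python
from collections import defaultdict


def build_gold_daily_sales(silver_bookings):
    """Aggregate record-level bookings into one row per day and country."""
    totals = defaultdict(lambda: {"bookings": 0, "sales_sgd": 0.0})
    for row in silver_bookings:
        key = (row["booking_date"], row["country"])
        totals[key]["bookings"] += 1
        totals[key]["sales_sgd"] += row["amount_sgd"]

    # Emit rows ordered by date and country so dashboards read them predictably.
    return [
        {"booking_date": day, "country": country, **metrics}
        for (day, country), metrics in sorted(totals.items())
    ]
```

Because the metrics are pre-calculated at day level, a dashboard only has to read a handful of rows per day instead of scanning record-level data.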

Metadata, Data Governance, and Security

Effective management of our data assets goes beyond just layering. Pelago places a strong emphasis on robust metadata management, data governance, and data security to ensure data quality, compliance, and controlled access:

Metadata Management: Comprehensive metadata, including data lineage, schema definitions, transformation logic, and business context, is meticulously captured and maintained.

This allows us to understand where data comes from, how it is transformed, and what it means to the business, which is crucial for debugging, auditing, and future development. This metadata now also powers our NLQ tool (DataBot).

Data Sensitivity Handling: Data is classified based on its sensitivity (e.g., PII, financial data). This classification drives how data is handled throughout its lifecycle, including encryption, anonymisation, or tokenisation where necessary.
Robust Data Governance: We implement a clear data governance framework that defines roles, responsibilities, and policies for data ownership, quality, and usage. This ensures accountability and consistency across the data platform.
Coarse-Grained Access Control: Access to our data assets is managed at a higher level, primarily through table and view-level permissions. This ensures that only authorised teams or individuals can access specific datasets relevant to their roles. For instance, financial data might be accessible only to the finance department.
Fine-Grained Access Control: For highly sensitive data, we implement row and column-level security. This allows us to restrict access to specific rows (e.g., only show data for a user’s own region) or columns (e.g., mask personal identifiers like email addresses) even within an accessible table or view, providing granular control and enhancing data privacy.
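
The row- and column-level rules described above can be pictured with a small Python sketch. In a warehouse these policies would normally be enforced by the database itself; the function names, the region and email fields, and the hashing choice below are assumptions for illustration only.

```python
import hashlib


def mask_email(email):
    """Replace an email with a short, stable hash so it is no longer readable."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()[:12]


def apply_fine_grained_policy(rows, user_region, can_see_pii):
    """Return only the rows and column values this user is allowed to see."""
    visible = []
    for row in rows:
        # Row-level security: only rows from the user's own region.
        if row["region"] != user_region:
            continue
        row = dict(row)  # copy so the source rows are not modified
        # Column-level security: mask personal identifiers unless permitted.
        if not can_see_pii:
            row["email"] = mask_email(row["email"])
        visible.append(row)
    return visible
```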

Finally, this robust architecture enables powerful downstream data-dependent applications, providing:

Enhanced Reporting & Analytics: Rapidly generated, highly accurate dashboards and reports for business intelligence and performance monitoring.

Personalised Customer Experiences: Data-driven recommendations and tailored offers based on customer behaviour and preferences.
Optimised Business Operations: Insights for supply chain management, pricing optimisation, and operational efficiency.

Powering Next-Generation AI and NLQ Use Cases

A significant advantage of this layered architecture is its ability to serve new, cutting-edge AI-powered use cases, including Natural Language Query (NLQ) capabilities. By structuring data into distinct layers, Pelago’s data platform provides a clean, curated, and easily accessible foundation for advanced analytics and machine learning.

Machine Learning & AI Initiatives: The Gold layer, with its highly curated and aggregated data, serves as an ideal source for training and deploying machine learning models, enabling predictive analytics and advanced insights.

Natural Language Query (NLQ) (Databot): With standardised and well-defined data in the Silver and Gold layers, business users can leverage NLQ tools to ask questions about data in plain English, eliminating the need for complex SQL queries and accelerating data-driven decision-making. This democratizes data access and empowers a wider range of users to extract valuable insights.

Use Cases & Impact

The implementation of this layered architecture has yielded significant improvements across Pelago:

Dramatic Reduction in Data Refresh Times: Critical dashboard refresh times have dropped from several hours to minutes; for example, key operational dashboards went from 4 hours to 15 minutes end-to-end (E2E). This has led to more timely insights and more responsive business operations.
Increased Data Agility and Reliability: The modular nature of dbt models within each layer allows for faster development, easier maintenance, and more robust error handling. Reprocessing data due to errors is now a streamlined process, minimising downtime.
Enhanced Business Insights: Our curated Silver and Gold layers provide a single source of truth for key metrics, eliminating data discrepancies and fostering greater trust in data. This has empowered our analysts, product managers, and business stakeholders to derive deeper, more reliable insights.

Conclusion & Advice

Building a robust, scalable, and reliable data platform is an ongoing journey. Our experience at Pelago demonstrates the profound impact a well-structured layered data architecture, powered by tools like dbt, can have on an organization’s ability to leverage its data effectively.

For other data engineers and architects embarking on similar transformations, our key takeaways are:

Embrace Layering: The Bronze-Silver-Gold approach simplifies data management, improves data quality, and accelerates consumption.
Invest in Tooling: Tools like dbt are invaluable for managing complexity, ensuring data governance, and promoting collaboration.
Standardise Early: Define and enforce ingestion and transformation standards from the outset to avoid technical debt.
Focus on Business Value: Always align your data architecture decisions with the actual business problems you’re trying to solve, whether it’s faster reporting or enabling advanced AI.

By following these principles, organisations can build data platforms that not only meet today’s analytical needs but are also future-proof for emerging AI and NLQ-powered use cases.

Want to connect and talk about data? Feel free to follow or connect with me on LinkedIn.

Event Streaming @ Pelago

Kamalakar Bejugama — Thu, 13 Mar 2025 06:50:14 GMT

Real-Time Data Processing Pipeline with AWS Lambda, Kafka, Faust, and DynamoDB

This blog post is the first installment in our three-part series exploring how we built and scaled a robust data platform at www.pelago.com. Throughout this series, we’ll deep dive into our design principles, data architecture strategies, and efficient data processing pipelines that drive Pelago’s impactful Data Platform.

Problem Statement:

With the growing scale of traveller interactions at Pelago, we face the challenge of capturing, processing, and analysing massive volumes of event data in real time. Traditional batch-based data pipelines struggle to keep up with dynamic user behaviours, leading to delayed insights, inconsistent user experiences, and missed opportunities for personalisation. A scalable, low-latency event streaming architecture is essential to ensure data integrity, power ML-driven recommendations, and optimise decision-making, all while keeping infrastructure costs in check.

Introduction

In today’s data-driven world, real-time event processing has become crucial for businesses that rely on user behaviour analytics, recommendation engines, and machine learning (ML) models. This article explores how Pelago overcame this problem by adopting a robust event-driven architecture using AWS Lambda, API Gateway, AWS MSK (Kafka), Faust, ECS, DynamoDB, and Redis for scalable and efficient real-time data processing.

Event Streaming with Realtime Data Processing

Note: We prefer to pick our technology from the native services of the AWS ecosystem.

Architecture Overview

The architecture is designed to efficiently capture, process, and store user activity events in real time while adhering to the Kappa architecture pattern. This approach ensures flexibility in extending the pipeline while maintaining hot and cold data and a single source of truth for clickstream data.

1. Event Collection: Lambda with API Gateway (a.k.a. Producer)

The event ingestion layer consists of AWS Lambda integrated with API Gateway, responsible for collecting events from mobile and web platforms.
These events undergo basic schema validation and standardisation before being published to Kafka topics.
This ensures real-time capture of user actions, which helps recommendation systems adapt quickly to user preferences.
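
The producer step can be sketched as a Lambda handler behind an API Gateway proxy integration, where the request body arrives as a JSON string in event["body"]. The required fields, the topic naming scheme, and the publish_to_kafka placeholder are assumptions for illustration; a real deployment would call a Kafka producer client at that point.

```python
import json
from datetime import datetime, timezone

REQUIRED_FIELDS = {"user_id", "event_type", "timestamp"}  # assumed schema


def publish_to_kafka(topic, message):
    """Placeholder publisher; a real deployment would use a Kafka producer client."""
    print(f"publish to {topic}: {message}")


def handler(event, context, publish=publish_to_kafka):
    """Validate an incoming event, standardise it, and publish it to a topic."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    # Basic schema validation: the body must be an object with the required fields.
    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected an object"})}
    missing = sorted(REQUIRED_FIELDS - payload.keys())
    if missing:
        return {"statusCode": 400, "body": json.dumps({"error": f"missing fields: {missing}"})}

    # Standardisation: normalise types and casing, and stamp the receive time in UTC.
    message = {
        "user_id": str(payload["user_id"]),
        "event_type": str(payload["event_type"]).strip().lower(),
        "timestamp": payload["timestamp"],
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
    publish(f"events.{message['event_type']}", json.dumps(message))
    return {"statusCode": 202, "body": json.dumps({"status": "accepted"})}
```

Passing the publisher in as a parameter keeps the validation logic testable without a running Kafka cluster.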

2. Message Broker: AWS MSK (Kafka)

AWS Managed Streaming for Apache Kafka (MSK) acts as the central message broker, facilitating real-time event handling.
Events are held in main topics, while interim topics are used to collate related events before they are processed as a group.
This approach optimises I/O operations, enabling efficient data streaming for analytical and recommendation workflows.

3. Processing Layer: Faust Agents & Workers on ECS

The Faust framework running on AWS ECS processes incoming Kafka events in real time.
Each Kafka topic is assigned a dedicated Faust agent, following a structured event processing flow as per business workflows.
Events are validated and transformed before being stored in DynamoDB for downstream ML applications.
Cross-device sync management ensures user experience consistency across multiple devices.
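
The validate, transform, and store flow of an agent can be sketched in plain Python with asyncio. The code below deliberately does not import Faust so that it runs on its own; with Faust, a function of this shape would be registered against a topic using the @app.agent decorator and would iterate over the stream in the same way. The field names, key format, and the store callback are assumptions for illustration.

```python
import asyncio

REQUIRED = ("user_id", "event_type", "timestamp")  # assumed event fields


def transform(event):
    """Validate and reshape one event; return None to drop invalid events."""
    if any(not event.get(field) for field in REQUIRED):
        return None
    return {
        "pk": f"USER#{event['user_id']}",
        "category": event["event_type"],
        "product_id": event.get("product_id"),
        "ts": event["timestamp"],
    }


async def process_events(stream, store):
    """Agent body: consume events from an async stream and persist valid ones."""
    async for event in stream:
        item = transform(event)
        if item is not None:
            store(item)  # stands in for a DynamoDB write


async def demo_stream(events):
    """Stand-in for a Kafka-backed stream, yielding events one at a time."""
    for event in events:
        yield event
```

A short usage example: calling asyncio.run(process_events(demo_stream(events), saved.append)) with a list named saved collects the transformed items that would otherwise be written to the store.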

4. Storage Layer: DynamoDB as a Real-Time Data Persistent Store

DynamoDB is used to persist processed user activity data.

A maximum of 30 data points per user activity category are stored in a time-recency order.
Each data point is associated with a timestamp, which helps in adjusting weightage using a time decay factor for ML feature processing.
Data retention is managed with a TTL of 90 days, ensuring data sanitisation and reducing unnecessary storage costs.
Optimised vertical partitioning in DynamoDB ensures that read and write costs are minimised by avoiding unnecessary data retrieval.
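
The storage rules above (at most 30 points per category, recency ordering, time decay, and a 90-day TTL) can be sketched as three small helpers. The exponential half-life form of the decay and its 7-day value are assumptions chosen for illustration, as are the function and field names; DynamoDB TTL itself expects an expiry time expressed in epoch seconds.

```python
MAX_POINTS = 30        # from the text: 30 data points per activity category
TTL_DAYS = 90          # from the text: 90-day retention
HALF_LIFE_DAYS = 7     # assumed decay half-life, for illustration only
SECONDS_PER_DAY = 86400


def add_data_point(points, new_point, max_points=MAX_POINTS):
    """Insert a point and keep only the newest `max_points`, newest first."""
    merged = sorted(points + [new_point], key=lambda p: p["ts"], reverse=True)
    return merged[:max_points]


def decay_weight(point_ts, now_ts, half_life_days=HALF_LIFE_DAYS):
    """Weight that halves every `half_life_days` as a data point ages."""
    age_days = max(0.0, (now_ts - point_ts) / SECONDS_PER_DAY)
    return 0.5 ** (age_days / half_life_days)


def ttl_epoch(now_ts, ttl_days=TTL_DAYS):
    """Expiry time in epoch seconds, suitable for a DynamoDB TTL attribute."""
    return int(now_ts) + ttl_days * SECONDS_PER_DAY
```

With these assumptions, a data point that is one half-life old contributes half the weight of a fresh one, so recent activity dominates the ML features without older activity being discarded outright.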

5. Warehouse Storage (Redshift)

AWS S3 object storage is used as a staging layer before event data is loaded into AWS Redshift. Data is processed in chunks and batches, written to S3 as Parquet files, and then loaded into Redshift with the Redshift COPY command, making it efficiently available to analytics dashboards and data science model training.
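
The staging flow can be sketched in two parts: splitting events into batches, and building the COPY statement that loads the staged Parquet files. The table name, S3 prefix, and IAM role shown in the usage note are placeholders, and writing the Parquet files themselves (for example with a Parquet library) is left out of this sketch.

```python
def chunk(events, size):
    """Yield consecutive batches of at most `size` events."""
    for start in range(0, len(events), size):
        yield events[start:start + size]


def copy_statement(table, s3_prefix, iam_role):
    """Build a Redshift COPY statement for Parquet files under an S3 prefix."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET;"
    )
```

For example, copy_statement("analytics.events", "s3://example-bucket/events/", "arn:aws:iam::123456789012:role/example-role") produces a statement that could be run once per staged batch; all three arguments here are hypothetical values.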

6. Caching Layer: Redis for High-Frequency Data Syncing

Redis serves as the caching layer for frequently changing business rules and small, frequently updated data exchanges.
This ensures fast access to dynamic data and minimises latency for critical workflows.
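
A common way to use such a cache is the cache-aside pattern, sketched below. The InMemoryCache class is a stand-in written for this example so the code runs without a Redis server; it exposes only get and setex, which mirror the redis-py client methods of the same names. The key format, the TTL, and the loader function are assumptions.

```python
import json


class InMemoryCache:
    """Stand-in exposing the two redis-py methods used below: get and setex."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        return self._data.get(name)

    def setex(self, name, time, value):
        # A real Redis server would expire the key after `time` seconds.
        self._data[name] = value


def get_business_rule(cache, key, load_from_source, ttl_seconds=60):
    """Return a rule from the cache, loading and caching it on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    value = load_from_source(key)
    cache.setex(key, ttl_seconds, json.dumps(value))
    return value
```

A short TTL suits frequently changing rules: readers get low-latency lookups, and a changed rule is picked up once the cached copy expires.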

Scalability and Optimisation

The architecture is designed to scale efficiently with increasing data volume.
Ensures minimum read/write costs by optimizing data storage strategies, such as vertical partitioning in DynamoDB.
Supports analytical workflows, ensuring that data is utilised by recommendation engines and other systems to enhance user experience.

Conclusion

This event-driven architecture efficiently handles real-time data ingestion, processing, and storage for ML applications. By leveraging AWS MSK, Faust App on ECS, and DynamoDB, businesses can build scalable, high-performance event pipelines that support advanced analytics and machine learning models.

If you’re planning to implement a similar architecture, consider optimising Kafka topic partitioning, tuning Faust worker concurrency, and using Redis for fine-grained caching.

Stay tuned for the upcoming parts of this series, where we’ll further explore the intricacies of our data platform. In Part 2, we’ll examine our data architecture in greater detail, while Part 3 will focus on impactful product and business decisions as we continue to build and strengthen the data-driven culture at www.pelago.com.

“In the world of data, clarity turns noise into insight.”