Unlocking the Power of Data: Exploring the Tamara Data Platform — A Deep Dive

Duy Nguyen
Tamara Tech & Product
5 min read · Mar 15, 2024

Tamara is a buy now, pay later (BNPL) platform for consumers in Saudi Arabia and the wider GCC region. After its latest funding round, Tamara reached a valuation of $1 billion. It is a hassle-free, interest-free payment solution designed with your best interests in mind.

Introduction

Data infrastructure is a critical component of any modern business. It enables organizations to harness the power of data, leading to smarter decisions, improved operational efficiency, and better customer experiences. However, building and maintaining data infrastructure requires significant investment and expertise, highlighting the need for skilled data professionals in today’s workforce.

In this article, we will walk through the high-level design of our current data infrastructure:

  • Overview of our data infrastructure
  • Data sources and data warehouse
  • Metrics and Monitoring
  • Data exploration
  • Data governance and quality

Overview

Let’s start with an overview of the infrastructure. The Tamara Data team applies GitOps (thanks to the Argo CD project) and uses the DataOps methodology (check out https://datakitchen.io/what-is-dataops/ if you are not familiar with the term); every single change has to be tracked in Git.

We rely heavily on Kubernetes for deployment and use Kubernetes Operators as much as possible to deploy our stack.

Below is the very high-level design of our data infrastructure.

Data sources and Data Integration

Our primary data sources are our MySQL transactional databases. At Tamara, we apply a microservices architecture, so we have many databases. The Data Team doesn’t read data directly from the primary databases but from read replicas. One challenge is managing, maintaining, and monitoring these replica databases.

When it comes to BNPL data, several key data points are typically collected and analyzed:

  1. User Information: This includes demographic data such as age, location, and income level. It can also include credit score and payment history.
  2. Transaction Details: Information about the purchase itself, such as the product or service bought, the purchase amount, and the date and time of the transaction.
  3. Payment Plan Details: This includes the terms of the BNPL agreement, such as the number of instalments, the instalment amount, and the due dates.
  4. Repayment History: Data on whether payments are made on time, early, or late, and if there are any defaults.
  5. Merchant Information: Details about the retailers offering BNPL, including the industry and location.
  6. Customer Behavior: This can include data on how often customers use BNPL, their average transaction size, and their preferred merchants.

These data points provide valuable insights into consumer behaviour, credit risk, and market trends, helping BNPL providers, retailers, and financial institutions make informed decisions. However, it’s important to note that all this data must be handled in compliance with data privacy regulations.
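To make these data points a bit more concrete, here is a minimal Python sketch of how a single BNPL transaction record could be modelled. The class and field names are illustrative placeholders, not our actual schema.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional


@dataclass
class BnplTransaction:
    """Illustrative record combining the data points above (not our real schema)."""

    # 1. User information
    customer_id: str
    customer_age: Optional[int]
    customer_location: Optional[str]
    credit_score: Optional[int]

    # 2. Transaction details
    order_id: str
    purchase_amount: float
    currency: str
    purchased_at: datetime

    # 5. Merchant information
    merchant_id: str
    merchant_industry: str

    # 3. Payment plan details
    instalment_count: int
    instalment_amount: float
    due_dates: list[date] = field(default_factory=list)

    # 4. Repayment history
    payments_on_time: int = 0
    payments_late: int = 0
    defaulted: bool = False
```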

Beyond that, we also have many data integrations with third-party providers. The integrations vary; we both read data from providers and push data to them. We use Airbyte for these integrations, while some others are Python pipelines triggered from Airflow (a sketch follows below).
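For those Airflow-triggered pipelines, a minimal sketch (assuming a recent Airflow 2.x) might look like the following. The provider API, endpoint, and DAG details are hypothetical placeholders rather than one of our real integrations.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_provider_data(**context):
    """Fetch one day of data from a hypothetical third-party API."""
    ds = context["ds"]  # Airflow's logical date, e.g. "2024-03-15"
    resp = requests.get(
        "https://api.example-provider.com/v1/settlements",  # placeholder URL
        params={"date": ds},
        timeout=60,
    )
    resp.raise_for_status()
    records = resp.json()
    # A real pipeline would load these records into the warehouse
    # (BigQuery/ClickHouse); here we only log the record count.
    print(f"Pulled {len(records)} records for {ds}")


with DAG(
    dag_id="third_party_provider_sync",  # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    PythonOperator(
        task_id="pull_provider_data",
        python_callable=pull_provider_data,
    )
```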

In the context of BNPL services, the key data challenge lies in handling the volume, velocity, and variety of transaction data. These services process a large volume of transactions rapidly, often from diverse sources. Ensuring reliable data processing and analytics is crucial for accurate risk assessment, fraud detection, and personalized customer experiences. Without robust data management, BNPL providers risk making flawed decisions that impact both consumers and their own business outcomes.

Data warehouse

CDC: MySQL to BigQuery

We use Datastream (Google Cloud’s serverless, real-time data replication service) to read MySQL change data capture (CDC) events and write the data to BigQuery.

Initially, we paid on demand, but we have since applied for a slot reservation. Costs are going down, and we continue to optimize them.
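Once Datastream lands the CDC data, downstream jobs read it like any other BigQuery table. Here is a minimal sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

# Assumes application-default credentials; project/dataset/table are placeholders.
client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT order_id, status, updated_at
    FROM `my-analytics-project.datastream_mysql.orders`
    WHERE DATE(updated_at) = CURRENT_DATE()
    LIMIT 100
"""

for row in client.query(query).result():
    print(row.order_id, row.status, row.updated_at)
```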

CDC: MySQL to ClickHouse

We are building our own CDC solution based on Debezium (an open-source distributed platform for change data capture). In the first stage, we mainly use a Kafka sink connector to write data to ClickHouse.

The current setup is a fairly typical Debezium deployment:

  • We deploy a Strimzi Connect cluster in Kubernetes.
  • We use Redpanda instead of Kafka.

We maintain a fork of the Altinity Sink Connector, with many enhancements and optimizations, to persist data from Kafka to ClickHouse Cloud.
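In production the Strimzi-managed sink connector does this work, but as a rough illustration of what it does, here is a hedged Python sketch that consumes Debezium change events from Redpanda (Kafka API-compatible) and writes the row images into ClickHouse. Broker, topic, table, and column names are placeholders.

```python
import json

import clickhouse_connect               # pip install clickhouse-connect
from confluent_kafka import Consumer    # pip install confluent-kafka

# Placeholder connection details for Redpanda and ClickHouse.
consumer = Consumer({
    "bootstrap.servers": "redpanda.example.com:9092",
    "group.id": "orders-clickhouse-sink",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["tamara.orders"])  # hypothetical Debezium topic name

ch = clickhouse_connect.get_client(host="clickhouse.example.com", username="default")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Debezium envelope: the row image sits under "payload.after" with the default
    # JSON converter (or directly under "after" when schemas are disabled).
    payload = event.get("payload", event)
    after = payload.get("after")
    if after is None:  # deletes/tombstones are skipped in this sketch
        continue
    ch.insert(
        "orders",  # placeholder target table
        [[after["id"], after["status"], after["updated_at"]]],
        column_names=["id", "status", "updated_at"],
    )
```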

Data Exploration

We use Superset to serve several purposes:

  • Visualizing our business metrics
  • Providing an ad-hoc query UI for the Data Analytics and Product teams
  • Sending alerts for key business metrics

The connections vary; we connect to MySQL replicas, ClickHouse, BigQuery, and more. Permissions are essential in our Superset instance, as we have PII data and business dashboards that only a limited set of people are allowed to access.
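As a small, hedged illustration of working with Superset programmatically, the sketch below authenticates against Superset's REST API and lists the dashboards visible to a given user, which is handy when auditing access. The host and credentials are placeholders.

```python
import requests

SUPERSET_URL = "https://superset.example.com"  # placeholder host

# Log in against Superset's REST API and obtain a JWT access token.
login = requests.post(
    f"{SUPERSET_URL}/api/v1/security/login",
    json={"username": "auditor", "password": "***", "provider": "db", "refresh": True},
    timeout=30,
)
login.raise_for_status()
token = login.json()["access_token"]

# List the dashboards visible to this user (handy for access audits).
resp = requests.get(
    f"{SUPERSET_URL}/api/v1/dashboard/",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["result"]:
    print(item["id"], item["dashboard_title"])
```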

Data governance and quality

We deployed a DataHub (a metadata platform for the modern data stack) instance and ingest all of our metadata from MySQL and BigQuery. We are investing more time in adding integrations, primarily with Superset, Airflow, and dbt. From there, we can visualize all of our data lineage and metadata.
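Ingestion is recipe-driven; as a rough, hedged sketch, the snippet below runs a MySQL metadata ingestion programmatically with the acryl-datahub Python package. The hosts and credentials are placeholders; in practice our recipes live in Git like everything else.

```python
# pip install 'acryl-datahub[mysql]'
from datahub.ingestion.run.pipeline import Pipeline

# Placeholder hosts and credentials; real recipes are version-controlled in Git.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "mysql-replica.example.com:3306",
                "username": "datahub_reader",
                "password": "***",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms.example.com:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if ingestion reported errors
```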

Regarding data quality, we use Soda Core. We have team members experienced with Great Expectations, but it seemed too complicated to deploy and too complex to extend. Soda Core checks are primarily SQL-based, and both Data Engineers and Data Analysts love SQL. We have dashboards to visualize our data quality, and alerts notify us when a data quality problem appears.
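As an illustration (with a made-up data source name and table), a programmatic Soda Core scan looks roughly like this:

```python
# pip install soda-core-bigquery (or the connector matching your warehouse)
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("bigquery_dwh")              # placeholder data source name
scan.add_configuration_yaml_file("configuration.yml")  # warehouse connection details

# SodaCL checks: SQL-flavoured assertions that analysts can read and write.
scan.add_sodacl_yaml_str(
    """
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
"""
)

scan.execute()
scan.assert_no_checks_fail()  # raises if any check fails, which feeds our alerting
```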

Metrics and monitoring

We collect various metrics to understand and improve our system’s performance and stability. These metrics provide insights into different aspects of our operations, such as usage trends, system health, user behaviour, pipelines, data quality, and data integration.

We use dashboards to visualize these metrics in a user-friendly manner. Dashboards provide a real-time snapshot of the system’s performance and can be customized to highlight key metrics.

We also have an alert system in place. An alert is triggered if a metric crosses a certain threshold or exhibits unusual behaviour. This allows us to proactively address potential issues and maintain the high quality of our data platform.
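As a simplified illustration of the idea, not our actual implementation, the sketch below checks a table-freshness metric against a threshold and posts to a chat webhook. The query, table, threshold, and webhook URL are all placeholders.

```python
import requests
from google.cloud import bigquery

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook
FRESHNESS_THRESHOLD_MINUTES = 30                              # placeholder threshold

client = bigquery.Client()
row = next(iter(client.query(
    """
    SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(updated_at), MINUTE) AS lag_minutes
    FROM `my-analytics-project.datastream_mysql.orders`
    """
).result()))

if row.lag_minutes is not None and row.lag_minutes > FRESHNESS_THRESHOLD_MINUTES:
    requests.post(
        WEBHOOK_URL,
        json={
            "text": f"orders table is {row.lag_minutes} min stale "
                    f"(threshold: {FRESHNESS_THRESHOLD_MINUTES} min)"
        },
        timeout=30,
    )
```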

Deployment

As mentioned, we store our infrastructure manifests in Git and use Argo CD to deploy them to the Kubernetes cluster via Kustomize.
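As a small local sanity check before pushing, not something Argo CD requires, one can render an overlay the same way Argo CD's Kustomize integration would and diff it against the cluster. The overlay path below is a placeholder.

```python
import subprocess

OVERLAY = "k8s/overlays/production"  # placeholder path to a Kustomize overlay

# Render the overlay the same way Argo CD's Kustomize integration would.
rendered = subprocess.run(
    ["kubectl", "kustomize", OVERLAY],
    check=True,
    capture_output=True,
    text=True,
).stdout

# Show what would change in the cluster without applying anything.
# (kubectl diff exits non-zero when differences exist, so no check=True here.)
subprocess.run(["kubectl", "diff", "-f", "-"], input=rendered, text=True)
```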

Challenges

We are a fairly new team, and we are currently handling a lot of requests, from building core data infrastructure (data warehouse, data streaming) to ad-hoc requests (data integrations).

Kafka (Redpanda) and Debezium are new to us, and we are still learning from other users and the community.

We are building a scalable, robust data platform. Despite the challenges, it is an excellent opportunity to explore new tools and revolutionize the way millions shop, pay, and bank.
