Building Bazaar’s Data Platform

A platform capable of solving petabyte-scale problems

Umair Abro
Bazaar Engineering
4 min read · Oct 25, 2021


I recently joined Bazaar Technologies, a Pakistani startup with big dreams and a super-talented team to make those dreams a reality.

At Bazaar, many people rely on data every day to do their jobs. In just fifteen months we have added 6+ applications to support the business, with dozens of micro-services at the back end that serve more than 200 brands and 750k+ merchants in Pakistan. Tons of data is generated every day, and while some data solutions already exist for analysis, we have started to overload them.

So, we embarked on a journey to re-imagine our data platform. We wanted to be as lean as possible while serving the following personas.

  • Data Engineers
  • ML Engineers & Data Scientists
  • Data Analysts

After deep brainstorming sessions and a little bit of hair pulling, we finally came up with blueprints for our new data platform. At Bazaar we refer to it as “Buraq”, which signifies speed and multi-dimensional travel.

The Burāq is a creature in Islamic tradition that was said to be a transport for certain prophets.

Buraq is based on the delta architecture and a mix of our unique take on the Data Lakehouse and Data Mesh philosophies.

Buraq is provisioned on two different types of clusters: a Kubernetes cluster, primarily used to host supporting tools and applications (Airflow, Hue, Superset, Prometheus, etc.), and a dedicated MapReduce cluster for heavy data processing and machine learning tasks.

Buraq can be broken down into the following layers.

  • Ingestion/Integration Layer
  • Processing Layer
  • Service API Layer
  • Query Layer
Buraq High Level Design

Ingestion / Integration Layer

At Bazaar we adopted the data mesh practice of decentralized data ownership. Teams at Bazaar can simply bring their own bucket (object storage) with data in Parquet, Avro, CSV, or JSON format and connect it to Buraq in a plug-and-play fashion. This approach frees the central data team to focus on further evolving and improving Buraq.
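Conceptually, attaching a bucket amounts to registering a small source manifest with the platform. The sketch below is hypothetical — the field names and `register_source` helper are ours for illustration, not Buraq's actual API:

```python
# Hypothetical sketch of a "bring your own bucket" registration step.
# All names here are illustrative, not Buraq's real interface.
def register_source(manifest):
    """Validate a team's source manifest before attaching its bucket."""
    supported_formats = {"parquet", "avro", "csv", "json"}
    required_keys = {"team", "bucket", "format"}
    missing = required_keys - manifest.keys()
    if missing:
        raise ValueError(f"manifest missing keys: {sorted(missing)}")
    if manifest["format"] not in supported_formats:
        raise ValueError(f"unsupported format: {manifest['format']}")
    return {**manifest, "status": "attached"}

source = register_source({
    "team": "growth",
    "bucket": "s3://growth-raw-events",
    "format": "parquet",
})
```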

After attaching the data buckets, we move data through our base-layer cleaning and governance process, which is built on Apache Hudi. There are four levels in this process:

  • Raw Level — unmodified data that reflects the actual state of the data source
  • Bronze Level — the governance and quality level, which enforces necessary compliance and provides the ability to clean, transform, and filter data
  • Silver Level — in this level we apply our take on Data Vault modeling techniques to create a de-normalized lakehouse
  • Gold Level — the final state of our data life cycle, where we create the Data Products consumed by our end users
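As a concrete flavour of the level-to-level promotion, an upsert write into a Hudi table needs little more than a record key, an ordering (precombine) field, and a table type. The option keys below are standard Apache Hudi datasource options; the table and field names are made up for illustration, not our actual schema:

```python
# Hudi write options for a hypothetical Bronze-level table.
hudi_options = {
    "hoodie.table.name": "bronze_orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

# With a SparkSession in hand, the write itself would look roughly like:
#   (raw_df.write.format("hudi")
#       .options(**hudi_options)
#       .mode("append")
#       .save("s3://bronze-bucket/orders"))
```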

Processing Layer

This layer is highly abstract: most of the time, the people creating a pipeline don't know whether the back-end data will be real-time or batch. This is achieved with a continuously running Hudi DeltaStreamer, which reads data from both the buckets and Kafka topics and maintains MOR (merge-on-read) tables. Our main processing framework is Apache Spark, which works really well for us.
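For reference, a continuous DeltaStreamer job of this kind is typically launched roughly as follows; the paths, topic configuration, and bundle jar are placeholders, not our actual deployment:

```shell
# Sketch of a continuously running Hudi DeltaStreamer job that tails a Kafka
# topic and maintains a merge-on-read table. All names are placeholders.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field updated_at \
  --target-base-path s3://lake/bronze/orders \
  --target-table bronze_orders \
  --props kafka-source.properties \
  --continuous
```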

Service API Layer

From the get-go, we planned Buraq to be DaaS (data as a service), so that we can enable analytics and machine learning inside our core apps. The Buraq API layer is a set of APIs that cater to these use cases.
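A minimal sketch of what one such endpoint could look like, with the query engine stubbed out — the function, table, and field names are hypothetical, not Buraq's real API:

```python
import json

def run_query(sql):
    # Stub standing in for the query layer; a real implementation would
    # execute `sql` against the Gold-level data products.
    return [{"merchant_id": "m-42", "orders_last_30d": 17}]

def merchant_stats_endpoint(merchant_id):
    """Serve per-merchant analytics to a core app as a JSON payload."""
    rows = run_query(
        "SELECT merchant_id, count(*) AS orders_last_30d "
        "FROM gold.orders WHERE merchant_id = ? GROUP BY merchant_id"
    )
    return json.dumps({"merchant_id": merchant_id, "data": rows})

response = merchant_stats_endpoint("m-42")
```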

Query Layer

As all of our data resides in data buckets, we needed a query engine to retrieve it. Here Trino was a simple decision, as we already had hands-on experience with it; some of us were already contributing to the project and have been formally acknowledged on Trino's website.
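Querying a Gold-level data product through Trino then looks like an ordinary SQL call. The server address, catalog, schema, and table below are placeholders:

```shell
# Placeholder example of querying bucket-backed tables through the Trino CLI.
trino --server http://trino.internal:8080 \
      --catalog hive \
      --schema gold \
      --execute "SELECT brand, count(*) AS order_count FROM orders GROUP BY brand LIMIT 10"
```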

Bazaar (بازار) listed on Trino’s page

What are the next efforts?

1. Automated Data Quality

Bazaar's data analysts and data scientists rely on the data ingested and processed on Buraq, so solving for data quality is the need of the hour. We have a few solutions in the works.
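As a flavour of what automated data quality can mean in practice, here is a toy check that flags columns whose null rate exceeds a threshold. This is purely illustrative; our actual solution is still in the works and may look nothing like it:

```python
def null_rate_violations(rows, threshold=0.1):
    """Return columns whose fraction of null values exceeds `threshold`.

    `rows` is a list of dicts (one per record); a missing key or a None
    value counts as null. A toy check, not Buraq's production tooling.
    """
    if not rows:
        return {}
    columns = {key for row in rows for key in row}
    violations = {}
    for col in columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        rate = nulls / len(rows)
        if rate > threshold:
            violations[col] = rate
    return violations

sample = [
    {"order_id": 1, "brand": "acme"},
    {"order_id": 2, "brand": None},
    {"order_id": 3, "brand": "acme"},
]
bad_columns = null_rate_violations(sample, threshold=0.25)
```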

2. User Facing Analytics

As a data-driven organization, we may want to enable our customers to use certain kinds of analytics.

3. Feature Store

Data science can bring a competitive advantage to any startup. We are working to reduce the friction and the time it takes to get models into production, and building a feature store is one of those efforts.

4. Minimize Compute Cost

We are constantly looking for ways to reduce the cost of owning Buraq; we are now trying efficient ARM instances along with some optimizations to our Kubernetes cluster.

We will follow up with more detailed blogs on our data platform's design decisions and tech stack. Each component of Buraq deserves its own blog post. Buraq is all set to solve petabyte-scale problems.

