Bifrost: Data Serving Layer at Myntra

Abhinav Dangi
Published in Myntra Engineering
6 min read · Jan 18, 2024

Myntra Data Platform (MDP) integrates data from different teams and different types of sources, performs various kinds of processing, and generates insights.

In the previous blogs, we discussed ingesting different kinds of data from different types of sources (Sourcerer), as well as processing big data and creating aggregates (Janus). With the growth of Myntra, our ingestion and processing platforms in MDP have scaled to collect and process varied data, well over petabytes, across an expanding customer base and set of touch-points. With all of this data, we needed a solution to process and serve it in as little time as possible.

In this blog, we will discuss how we solved for Data Serving at Myntra.

What is a Data Serving Layer?

In simple words, a Data Serving Layer is a single, central point of contact for users to query data from the Data Platform.

This layer provides visibility into workloads and optimizes their execution. It should support multiple backend data stores, with users remaining agnostic of the data store being used.

Based on the cost of computation in different data stores, it should be able to select the best data store for a query, convert the query to that store's dialect, and submit it for execution.

Accordingly, access policies and audits should be unified.
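To make the store-selection idea concrete, here is a minimal sketch of cost-based routing, assuming a hypothetical DataStore interface with per-store cost estimation, dialect translation and submission; none of these names come from Bifrost itself.

```python
# Hypothetical sketch of cost-based store selection; the DataStore interface,
# cost model and names here are illustrative, not Bifrost's actual code.
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class DataStore:
    name: str                              # e.g. "trino-hive", "druid", "mysql"
    estimate_cost: Callable[[str], float]  # rough cost of running the query here
    translate: Callable[[str], str]        # convert ANSI SQL to the store's dialect
    submit: Callable[[str], Any]           # execute the translated query


def route_query(ansi_sql: str, stores: Dict[str, DataStore]) -> Any:
    """Pick the cheapest capable store, translate the query, and submit it."""
    best = min(stores.values(), key=lambda s: s.estimate_cost(ansi_sql))
    return best.submit(best.translate(ansi_sql))
```

In practice, the cost estimate would be fed by the workload statistics the layer collects, as described in the guiding principles below.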

Guiding principles

Let us look at the following major guiding principles of a Data Serving Layer.

  • Multi-store support — The Serving Layer should support different backend data stores which may contain different types of data, e.g. MySQL or Hive containing transactional data and Druid containing clickstream data — all should be accessible from the Serving Layer. Joins across different data stores should also be possible.
  • Abstraction — If two data stores hold the same data, users should not have to choose between them. That decision should depend on the type of query submitted and its performance in the different data stores.
  • Workload Prioritization — Some workloads have higher priority than the rest. The Serving Layer should serve priority workloads first with a minimum SLA guarantee (see the sketch after this list).
  • Statistics Collection — The Serving Layer should collect statistics for each workload. Based on these statistics, query compute selection should happen and optimized clusters should be chosen, completing the feedback loop.
  • Visualization — The Serving Layer should offer the ability to visualize query results in different forms.
  • Performance — The overall performance of the layer should be strong: the average query run time should stay below 30 seconds.
  • Multi-client support — The layer should support different types of clients. We have users like analysts and data scientists who run their queries in an ad hoc manner, and we have systemic integrations where processes interact with the layer and fetch the required results.
  • Support for different output formats — Outputs may be required in different formats: analysts query through a workbench, while systemic integrations may require file-based or streaming outputs.
  • Authentication and Authorization with Audits — Not all datasets in Myntra can be accessed by everyone. There may be financial data or Personally Identifiable Information (PII) that should be accessed by only a few analysts. Proper protection of data through authentication and authorization is a must.
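As a purely illustrative example of the workload-prioritization principle above, queued queries could be ordered by a priority tier before dispatch; the tiers and functions below are assumptions, not Bifrost internals.

```python
# Illustrative only: a minimal priority ordering of serving-layer workloads.
# The priority tiers and queue shape are assumptions, not Bifrost internals.
import heapq
import itertools

PRIORITY = {"critical": 0, "systemic": 1, "adhoc": 2}  # lower value = served first
_counter = itertools.count()                            # FIFO tie-breaker within a tier
_queue: list = []

def enqueue(sql: str, tier: str = "adhoc") -> None:
    heapq.heappush(_queue, (PRIORITY[tier], next(_counter), sql))

def next_workload() -> str:
    """Return the highest-priority query waiting to run."""
    return heapq.heappop(_queue)[2]
```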

Evolution of Serving in Data Platform

The serving platform did not evolve in a single day. It took us a few years to build a performant solution. Let’s look at the journey:

Serving Platform Evolution

Phase 1: UDP & DDP

We started with the Data Democracy Platform (DDP) as the serving platform, where users could submit their queries and get results. Behind this frontend, we had a single data warehouse into which all the data was ingested. We used Rolebook, an in-house tool, for AuthNZ. For visualization in the form of bar charts and graphs, we built an internal tool, the Universal Dashboarding Platform (UDP).

Data Democracy Platform (DDP)
Universal Dashboarding Platform (UDP)

Challenges:

  • In the Data Warehouse, storage was coupled with compute. Hence, even just storing historical data increased the cost.
  • We observed high planning times, 1–2 minutes, for some of our queries.
  • The Data Warehouse had its own SQL dialect, with syntax different from the universally used ANSI SQL.
  • We faced performance issues with the Data Warehouse at high scale. Our clickstream data alone was in the tens of TBs per day.
  • Other technical issues included restricted user-defined WLMs, generic error logs, limited telemetry, etc.

Phase 2: Data Lake and Data Warehouse

With the growth of Myntra, we could no longer ingest the data directly into the Data Warehouse. Apart from the increase in volume, different types of data were being generated as well. We introduced a layer in between and stored the data in a data lake: data would be ingested into the data lake and then into the Data Warehouse. The majority of the processing was done on data lake datasets.

Only selected datasets were kept in the Data Warehouse, and whenever required, they were fetched from the data lake.

But this was not a long-term solution due to the disadvantages of the Data Warehouse at scale.

Phase 3: Bifrost

Let’s look at how we re-architected and solved the challenges.

Immutable Datasets Querying

First, we solved for clickstream and other immutable datasets. Apache Hive and Trino were set up to query the Data Lake directly using external tables. This effectively separated storage from compute, and it also solved historical data querying.

Trino Hive Connector
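As an illustration, a client could query such an external table through the Trino Python client roughly as follows; the host, catalog, schema and table names are placeholders rather than our actual setup.

```python
# Minimal sketch using the trino Python client (pip install trino).
# Host, catalog, schema and table names are placeholders, not Myntra's.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",        # Hive connector backed by external tables on the data lake
    schema="clickstream",
)
cur = conn.cursor()
cur.execute(
    "SELECT event_type, count(*) "
    "FROM events WHERE dt = '2024-01-01' GROUP BY event_type"
)
for row in cur.fetchall():
    print(row)
```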

For AuthNZ, we integrated Apache Ranger with Trino. Apache Solr was used for audits.

Transactional Datasets Querying

On the data lake, updating mutable transactional data, such as critical orders, items and customer fact tables, became challenging. For this, we chose Hive ACID, which allowed updates to datasets on the Data Lake. The Hive catalog was used to manage table schemas and metadata.
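For illustration, a Hive ACID table that allows in-place updates on the lake can be created and modified roughly as below; the connection details and the orders table are assumptions, not our actual schema.

```python
# Sketch of a Hive ACID (transactional) table allowing in-place updates.
# Connection details and the orders table are illustrative assumptions.
from pyhive import hive

conn = hive.connect(host="hive.example.internal", port=10000, username="etl")
cur = conn.cursor()

# ACID tables in Hive must be stored as ORC with transactional support enabled.
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id BIGINT,
        status   STRING,
        amount   DOUBLE
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")

# Unlike plain external tables, rows can now be updated in place.
cur.execute("UPDATE orders SET status = 'DELIVERED' WHERE order_id = 12345")
```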

With immutable as well as transactional datasets onboarded to Hive and Trino, we were able to solve for querying of data. Trino also supports multiple data stores on the backend, so users need not learn a different SQL syntax for every data store: Trino accepts ANSI SQL and internally converts it for them.
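Because all backend stores sit behind one ANSI-SQL surface, a single Trino query can even join lake data with a transactional store; a hypothetical example follows, with all catalog, schema and table names assumed.

```python
# Hypothetical federated query joining a Hive (data lake) table with a MySQL
# table through Trino; all catalog, schema and table names are placeholders.
import trino

conn = trino.dbapi.connect(host="trino.example.internal", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT o.order_id, o.amount, c.city
    FROM hive.sales.orders AS o
    JOIN mysql.crm.customers AS c
      ON o.customer_id = c.customer_id
    WHERE o.dt = '2024-01-01'
""")
print(cur.fetchall())
```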

Visualization of Data

We introduced Apache Superset, linked with Trino, for visualizing data as bar graphs and charts on engineering dashboards.

Apache Superset

We used Power BI for visualizing business metrics.

Power BI

Phase 4: Bifrost II

We evolved Bifrost to use Delta Lake instead of Hive ACID. Delta Lake, by Databricks, is an optimized storage layer that provides ACID properties on top of files. It had the following advantages over Hive ACID:

  • Delta stores metadata in files, so there is no need to connect to the HMS. Table scanning time also reduces, making queries run faster.
  • Delta handles ACID properties better than Hive. In Hive, compaction had to be run to delete the extra files created with each update, which is compute heavy.
  • Schema evolution across multiple compute engines is supported in Delta.
  • The state of the data is versioned in Delta. If data gets corrupted, it can be reverted to an earlier version (see the sketch below).
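A minimal sketch of the operations these points refer to, using PySpark with the delta-spark package; the path, table and columns are placeholders, not our production layout.

```python
# Sketch of Delta Lake update and time travel (paths and columns assumed).
# Requires pyspark with the delta-spark package available on the session.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3://lake/orders_delta"   # placeholder data lake path

# ACID update directly on files -- no compaction jobs as with Hive ACID.
orders = DeltaTable.forPath(spark, path)
orders.update(condition="order_id = 12345", set={"status": "'DELIVERED'"})

# Table state lives in the _delta_log, so an older, known-good version
# can be read back if the data gets corrupted.
old = spark.read.format("delta").option("versionAsOf", 0).load(path)
```

Because the transaction log lives alongside the data files, these operations need neither HMS round-trips nor separate compaction jobs.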

This blog gives a ten-thousand-foot view of Bifrost and its major components. We will continue this series with the architecture of Bifrost and the various features introduced to handle PB-scale Data Serving at Myntra.
