Janus: Data processing framework at Myntra

Kedar Gupte
Myntra Engineering
Nov 18, 2022

Data-driven decisions play an important role at Myntra. As India’s leading fashion e-commerce portal, understanding the customer and their evolving needs is the driving factor behind increasing engagement, providing the right search results, personalised recommendations, and relevant, targeted notifications, rewarding loyalty, and much more. This is enabled by data ingested from multiple sources, including thousands of tables in transactional systems and hundreds of events from clickstream data, with innumerable ways to join, slice, and dice. The net result is hundreds of data pipelines processing terabytes of data at any given point in time, and the scale continues to grow with the adoption of e-commerce in India.

For more information on Data Ingestion at Myntra, read about Sourcerer.

Data processing is not one fixed tool but a combination of multiple technologies. The problem is there are too many of them! Spark, Presto, Ray, Hive, and Flink, to name a few compute engines, read and write across messaging queues like Kafka and RabbitMQ; file formats like Parquet and ORC; table formats like Hive ACID, Iceberg, and Delta; aggregate-friendly databases like Druid and Cassandra; caching and feature stores like Redis and Aerospike; and data warehouses like Snowflake, Google BigQuery, and Azure Synapse, with language choices of Scala, Python, Java, and more.

The Processing framework at Myntra aims to create an environment where developers can write workflows operating over different technology choices, and to simplify the lifecycle management of processing jobs. We named it Janus after the Roman god of transitions and time.

Let’s understand the lifecycle of data-processing before we move into the architecture of Janus.

Lifecycle of data-processing

Requirement Translation

What is the data

Looking at thousands of tables, each with hundreds of columns, is intimidating. Hence, it is important for consumers of data to be able to locate the right datasets, to facilitate reuse that avoids duplicate processing, and to remove knowledge silos.

Querying the data

Once the data is located, the next need is the ability to query it and experiment. Connectors to datastores, compute frameworks, and storage services improve with new versions. Here it is important to provide interfaces/SDKs to access the data while the platform team maintains the infrastructure, version upgrades, table formats, and so on.
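A minimal sketch of what such an access interface could look like, assuming Spark SQL as the query engine; the table and column names below are hypothetical and used only for illustration.

```python
from pyspark.sql import SparkSession

# Assumption: the platform hands developers a pre-configured SparkSession,
# so infrastructure details (cluster, connectors, table formats, versions) stay hidden.
spark = SparkSession.builder.appName("adhoc-exploration").getOrCreate()

# Hypothetical table and columns, for illustration only.
daily_orders = spark.sql("""
    SELECT order_date, category, COUNT(*) AS order_count
    FROM analytics.orders
    WHERE order_date >= date_sub(current_date(), 7)
    GROUP BY order_date, category
""")

daily_orders.show(20, truncate=False)
```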

Identifying compute infrastructure

Frameworks such as Spark, Presto, Flink, Ray, and Hive can distribute the processing of large datasets over multiple partitions/tasks. Each of these tools is purpose-built, hence identifying what fits where is vital. This enablement must be backed by enough compute power (memory and CPU) for the jobs. If resources are insufficient, a job will throttle or fail. For a scheduled job, the availability of optimal resources for every execution is important to meet the SLA.
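As an illustration, here is how resource sizing might be expressed for a batch Spark job; the numbers are placeholders and, in practice, the platform would derive them from data volume and SLA rather than leaving them hard-coded by the developer.

```python
from pyspark.sql import SparkSession

# Illustrative resource sizing for a batch Spark job; all values are placeholders.
spark = (
    SparkSession.builder
    .appName("daily-aggregation")
    .config("spark.executor.instances", "8")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .config("spark.dynamicAllocation.enabled", "true")    # let the job scale with load
    .config("spark.dynamicAllocation.maxExecutors", "32") # cap to protect the cluster
    .getOrCreate()
)
```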

Data Pipeline Development

Low code and Code first

Low code requires an application that consumers can use to write pipelines with minimal overhead. Low code is favorable when a SQL-driven approach is used.
Code first can be powered by standard IDEs or by notebooks. Code first is suitable for power users such as data scientists who are experimenting with statistical libraries, or service developers who consume the data over APIs or messaging queues.
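A purely hypothetical sketch of the two styles side by side; neither the declarative format nor the function signature is the actual Janus specification.

```python
# Hypothetical low code step: the author supplies only SQL and a target table;
# the platform fills in engine, resources, and scheduling.
low_code_step = {
    "name": "top_categories_daily",
    "type": "sql",
    "query": """
        SELECT category, COUNT(*) AS orders
        FROM analytics.orders
        GROUP BY category
    """,
    "output_table": "analytics.top_categories_daily",
}

# Equivalent code first step: a plain function over a DataFrame, suited to power
# users who want full control over libraries and logic.
def top_categories_daily(orders_df):
    return (
        orders_df
        .groupBy("category")
        .count()
        .withColumnRenamed("count", "orders")
    )
```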

Test driven development

Writing multi-step data pipelines without a test driven development approach limits productivity. Python has been adopted as the de facto standard by the data science community primarily because, as an interpreted language, it facilitates fast, iterative, test driven development. Test driven development while building data pipelines is vital for users to use the platform effectively.
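A minimal pytest sketch of what a unit test for a transformation could look like, assuming PySpark in local mode; the transformation and schema are illustrative, not taken from an actual Janus pipeline.

```python
# test_top_categories.py -- a minimal sketch; names and schema are illustrative.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Local-mode session so the test runs on a laptop or in CI without a cluster.
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()


def top_categories_daily(orders_df):
    return orders_df.groupBy("category").count().withColumnRenamed("count", "orders")


def test_top_categories_counts_rows_per_category(spark):
    orders = spark.createDataFrame(
        [("tshirts",), ("tshirts",), ("shoes",)], ["category"]
    )
    result = {r["category"]: r["orders"] for r in top_categories_daily(orders).collect()}
    assert result == {"tshirts": 2, "shoes": 1}
```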

Job dependencies

Job dependency is absolutely critical for operating inter-dependent pipelines. The depth and breadth of dataset dependencies are amplified as the number and complexity of data pipelines increase. Without explicit dependencies, uncaught delays result in partial executions that require expensive backfills, and teams are left blind to the impact radius of an upstream failure.
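For illustration, explicit dependencies could be expressed as below, assuming an Airflow-style orchestrator (the post does not name one); the DAG, schedule, and task names are hypothetical.

```python
# Sketch of explicit job dependencies, assuming Airflow 2.x; names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="daily_sales_rollup",
    start_date=datetime(2022, 11, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
) as dag:
    ingest_orders = EmptyOperator(task_id="ingest_orders")      # upstream load
    ingest_returns = EmptyOperator(task_id="ingest_returns")    # upstream load
    build_rollup = EmptyOperator(task_id="build_sales_rollup")  # waits for both loads
    publish = EmptyOperator(task_id="publish_to_dashboard")

    # The rollup runs only after both ingests finish; publishing waits for the rollup.
    [ingest_orders, ingest_returns] >> build_rollup >> publish
```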

Deployment and Validation

Build and deploy

Packaging code and deploying an executable can be a time consuming process, more so if it has to be repeated over many iterations to get the right outcome. We have all stared at the build and deploy screen wishing it were faster! This is a serious and often overlooked issue in data engineering. Hence, lowering development time with effective CI-CD is an important consideration.

Data validation and veracity

Verifying the sanity of processed data requires functional knowledge of the expected output. Business teams are the best judge to perform this validation. Hence, providing the right tools to facilitate these validations in both online and offline modes increases trust in data. And trust in data is the single most important prerequisite for delivering data-driven decisions with confidence.
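A hypothetical sketch of automated sanity checks that could complement business validation; the rules, column names, and thresholds are illustrative and would in practice come from the business owners of the dataset.

```python
from pyspark.sql import functions as F

def validate_daily_orders(df, expected_min_rows=1):
    """Illustrative sanity checks; real rules would be defined by business teams."""
    failures = []

    if df.count() < expected_min_rows:
        failures.append("row count below expected minimum")

    null_order_ids = df.filter(F.col("order_id").isNull()).count()
    if null_order_ids > 0:
        failures.append(f"{null_order_ids} rows with null order_id")

    negative_amounts = df.filter(F.col("order_amount") < 0).count()
    if negative_amounts > 0:
        failures.append(f"{negative_amounts} rows with negative order_amount")

    if failures:
        raise ValueError("Validation failed: " + "; ".join(failures))
```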

Operations

Monitoring and Alerting

Visibility into the health of the pipelines should be conveyed transparently to all tenants. Setting the right process within the organization to convey the quality of data builds a trustworthy data driven culture. The time spent maintaining the processing ecosystem is many times greater than that spent on development. Any pipeline deployed has to follow standard guiding principles that enable, at the very least, centralized monitoring of data availability and of compute and storage resource consumption. Manual or automated recovery should follow as an outcome of the alerts.
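As a hedged sketch, a data-freshness check of this kind could drive such alerts; the alerting hook and dataset names are placeholders, and a real implementation would push metrics to a central monitoring system.

```python
from datetime import datetime, timedelta, timezone

def send_alert(severity: str, message: str):
    # Placeholder: print instead of integrating with a real alerting backend.
    print(f"[{severity.upper()}] {message}")

def check_freshness(last_partition_ts: datetime, max_lag: timedelta, dataset: str):
    """Alert if the newest partition of a dataset is older than the allowed lag."""
    lag = datetime.now(timezone.utc) - last_partition_ts
    if lag > max_lag:
        send_alert(
            severity="high",
            message=f"{dataset} is stale by {lag}, SLA allows {max_lag}",
        )

# Example usage with hypothetical values.
check_freshness(
    last_partition_ts=datetime(2022, 11, 18, 2, 0, tzinfo=timezone.utc),
    max_lag=timedelta(hours=6),
    dataset="analytics.orders",
)
```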

Scaling of infrastructure

Processing infrastructure in e-commerce has to handle data that follows seasonal trends, surges during sale events, and grows both organically and inorganically. Recognising the load on the systems and adjusting the scale to meet SLAs, while not over-provisioning and keeping cloud compute costs optimized, is an important driving parameter.

Modifications and Backfill

Modifying the pipeline and performing data backfill are common operational concerns which should be possible without significant overhead.

Job Optimization

‘There is always room for improvement’

Adoption of newer and more performant technologies can greatly optimize job performance, thereby saving on compute and storage infrastructure expenses.

Architecture of Janus Processing Platform

The Janus processing framework integrates all the steps in the lifecycle of writing a processing job into one application. It is extensible to add more connectors, processing engines, and multi-cloud support using supported languages, to meet the needs of analysts and data scientists alike. In the ever changing world of data engineering it is important to keep systems agile enough to adopt new and better technologies for the organization. Below is the high level architecture of the platform with a breakdown of each component.

Data Catalog

The data catalog surfaces the metadata of table definitions and workflows, along with the tools required for ensuring quality and policy enforcement. It is the source of truth for the table definitions that are part of the data ecosystem and the point of contact for setting access control policies, enabling quality checks, and defining retention/purging policies. The flow of data across different workflows can be visualized on the data catalog using the lineage functionality.

Pipeline Modeling

How to develop an ETL pipeline is the next question to be addressed. This is also what we call the ‘Design Time’ phase. Here the developer writes the transformation code using a multi-step approach. These steps connect to form a Directed Acyclic Graph (DAG).

Writing a DAG is feasible either with the low code application platform developed at Myntra or with the standard code first approach of using IDEs and notebooks. The important point during the design phase is that the developer focuses on the logic and not on runtime aspects such as data volume and frequency. However, the runtime aspects cannot be ignored altogether, because the choice of technology dictates how effectively the processing can happen for a given requirement. For instance, Spark Streaming and Flink are suitable choices for stream processing, while Spark and Trino are suitable choices for batch processing. Many more considerations follow to drill down to the right technology for a use case. Data platforms should provide the right interfaces to access, transform, and write the data so that future upgrades, code modifications, and maintenance are possible with ease. This must be accompanied by a test driven environment and iterative development of code.
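A minimal sketch of such a multi-step DAG expressed as composed transformation functions; the step interface and names are illustrative, not the Janus specification.

```python
# Hypothetical multi-step pipeline: each step is a pure transformation function,
# and the wiring below forms the DAG. Volumes, schedules, and engines are runtime concerns.
from pyspark.sql import DataFrame, functions as F

def clean_orders(orders: DataFrame) -> DataFrame:
    return orders.dropDuplicates(["order_id"]).filter(F.col("order_amount") > 0)

def enrich_with_catalog(orders: DataFrame, catalog: DataFrame) -> DataFrame:
    return orders.join(catalog, on="product_id", how="left")

def daily_category_revenue(enriched: DataFrame) -> DataFrame:
    return enriched.groupBy("order_date", "category").agg(
        F.sum("order_amount").alias("revenue")
    )

# DAG: clean_orders -> enrich_with_catalog -> daily_category_revenue
```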

Pipeline Deployment

Packaging and deploying code into an executable is a rarely discussed aspect of data engineering. Multi-tenant usage, versioning of code, and governance during deployment (depending on the sensitivity of the data accessed by the transformations), along with executable generation, are concerns that deployment should address. Lowering the turnaround time for deployment lowers the time to test iteratively, modify, backfill, and in general maintain the data pipelines. Janus provides a CI-CD layer for low code config generation and script deployment. It is extensible for developers to onboard their own executables and orchestrate data pipelines using Janus.

Pipeline Execution

Scheduled or ad hoc execution of a pipeline from an executable is what constitutes pipeline execution, also called ‘Runtime’.

Runtime should provide tools to redirect execution to multiple compute engines without being coupled to a vendor or cloud. Hence, it is important to provide interfaces that will redirect a job to the compute technology requested by the developer according to the use case. The data platform should enable this redirection with configurations that help in job optimisation, debugging, and governance: for instance, specification of queues to limit cluster usage, tuning parameters for effective query execution, garbage collection, logging levels, and so on. The Janus platform is extensible to multiple compute engines with multi-cloud support.
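For a Spark job, those knobs could look like the following; the values are placeholders that the platform would typically inject per use case, and other engines would expose their own equivalents.

```python
from pyspark.sql import SparkSession

# Illustrative runtime knobs; all values below are placeholders.
spark = (
    SparkSession.builder
    .appName("category-revenue-batch")
    .config("spark.yarn.queue", "analytics")                     # queue to limit cluster usage
    .config("spark.sql.shuffle.partitions", "400")               # tuning for query execution
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")   # garbage collection choice
    .getOrCreate()
)
spark.sparkContext.setLogLevel("WARN")                           # logging level
```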

Access to the data layer from the compute engine decides the domain flexibility. For instance, producers and consumers for messaging queues like Kafka help in developing streaming pipelines for near real time workflows. Sinks to Redis or feature stores are necessary for feature generation use cases that ML models act on. Pushing data to Druid and HBase enables low latency aggregate use cases.
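As one sketch of such access, a near real time pipeline could read from and write to Kafka with Spark Structured Streaming; broker addresses, topics, and paths below are placeholders, and the job needs the Spark Kafka connector package on the classpath. Redis, Druid, and other sinks would plug in through their own connectors in the same spirit.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream-to-kafka").getOrCreate()

# Read a raw clickstream topic (placeholder broker and topic names).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "raw_clickstream")
    .load()
)

# The Kafka sink expects a string/binary `value` column.
enriched = events.select(F.col("value").cast("string").alias("value"))

query = (
    enriched.writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("topic", "enriched_clickstream")
    .option("checkpointLocation", "/tmp/checkpoints/enriched_clickstream")
    .start()
)
```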

For more information on Near Realtime Processing at Myntra, read about QuickSilver.

Processing on the data lake for batch and intraday use cases requires connectors to file and table formats on, say, Azure Blob Storage or S3, where all datasets are stored centrally. Thus the data access layer during runtime forms a core part of the Janus processing platform.
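A batch read and write against object storage could then look like this; the container, account, and paths are hypothetical, Parquet is shown for simplicity, and table formats such as Delta or Iceberg would be used through their respective connectors in the same way.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-batch").getOrCreate()

# Hypothetical ADLS paths; credentials/connector configuration are assumed to be
# supplied by the platform.
orders = spark.read.parquet(
    "abfss://lake@storageaccount.dfs.core.windows.net/bronze/orders"
)

daily = orders.groupBy("order_date").count()

(
    daily.write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("abfss://lake@storageaccount.dfs.core.windows.net/silver/daily_order_counts")
)
```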

Operations

Operations involve the core engineering excellence functions of monitoring, alerting, and scaling of infrastructure. For data engineering, though, it goes a step beyond. This includes maintaining data freshness and job dependencies for a complex web of data pipelines; backfilling data in case of corruption; periodically upgrading open source versions to bring more efficiency into execution; understanding the scope for optimisation in pipelines to improve performance and consequently operational costs; and tiering data between hot, cool, and archive based on the policy settings in the catalog. Data tiering over a petabyte scale data lake provides significant savings on storage costs.

Data engineering activities consume operational overheads many times larger than development because of the very nature of data in motion and the need to ensure correctness within tight latencies. Janus treats operations as a first class module rather than an engineering excellence activity that follows development and deployment.

Up next

We have seen a high level overview of the data processing platform and its components. In subsequent posts we will cover in detail the platform design that powers low code and code first data pipelines. Do watch this space to understand how Myntra is developing the next generation, sustainable data platform to power its growth as India’s top fashion e-commerce player.
