Data Products — Self Service Query Platform (DDP)
My journey with Myntra started in early 2015 with Analytics team and I was assigned responsibility to develop data products at Myntra. This was the first time that I will work on analytics or big data, I had always been a full stack application developer on Java ecosystem.
First goal was to build a self-service platform that enables all the business users at Myntra with ability to get data, understand it and able to take decisions based on it. The journey started with building a tool that helps users to query the data available in data-warehouse
Our data platform team had chosen Amazon Redshift as it was a petabyte scale data warehouse, didn’t need any definite investment in hardware, managed services and cloud deployment that is easy to scale. Our platform team was busy migrating the warehouse from MySQL to Amazon Redshift and meanwhile there was a thought process to build a tool that can provide capability to query on Redshift rather than giving workbench access to all users. This will help in controlling data access, auditing data and even providing more capabilities on top of data warehouse.
Data Democracy Portal (a.k.a DDP)
A web based self-service query platform named as Data Democracy Portal, famously know as DDP at Myntra, germinated with primary goal of abstracting user access to data warehouse and provide additional capabilities to users.
First release of the tool had a single page web app that allowed user to submit the query and get the results. Data warehouse queries may take very long time to execute and hence we took a decision to have asynchronous query execution flow and send a notification email to user on completion of query.
Over the time DDP was enhanced to have many features which won’t have been possible with any traditional workbench:
- Query List: List queries fired by user, helps to track and search queries fired by user
- Access Restriction: Role based access restriction on the underlying data as well as approval workflow for getting roles
- Schedule Query: Provides capability to re-execute user query at scheduled time (daily, weekly, monthly). This is one of the most used feature as users need to get similar data repeatedly and now they can get it by scheduling query without need of coming to DDP.
- Query Grouping: User can group multiple queries and schedule them
- Temporary table: Create temporary tables in data warehouse by uploading CSV data file
- Schema explorer: Allows users to explore tables and columns in data warehouse along with capability to see sample data without need of submitting a query
- Direct API: Generic API that allowed other applications to access data programatically
Technical Stack Under the Hood
DDP is basically divided into two modules at very basic level, User Interface and Application Services.
User Interface (Front-End)
- Unity Framework of Myntra that uses ReactJS and NodeJS for user interface
Unity Framework is a user-interface framework at Myntra that is used by all the internal applications at Myntra. Teams develop reusable components using ReactJS that can be reused across multiple applications with standard interface.
Application Services (Back-End)
- Java, Tomcat & Jersey for implementing REST web-services and application service layer
- Redis for implementing application Queue
- MySQL for storing application Data
- AWS Redshift as data warehouse
As DDP became very popular with our business users, we started experiencing performance issues with Redshift. Basically if there are too many queries executing concurrently on Redshift then its performance will be impacted. Also due to concurrency limit in Redshift, many queries will wait in Redshift WLM (Work Load Management) queues and will finally get aborted based on max. query execution time configured. In order to solve this, we introduced our own queueing mechanism that ensures only N queries are executing on database at any particular time and rest are waiting in application queue for their turn. We also implemented some intelligence to allow prioritisation of queries waiting in the application queue.
DDP also started blocking of queries based on parameters like incorrect joins or having too many values within IN clause (typically data warehouse are not designed for queries having thousands of items within IN clause). In order to implement blocking of queries, DDP implemented query plan (explain plan) parser that helps us discover a lot of parameters on how a query will perform.
Even after implementing these features, many a time DDP server used to die with OOM (Out Of Memory) errors and it was because some user would fire a query that fetched data which can’t fit in application memory. Obviously we cannot have RAM equivalent to huge data available in data warehouse. We had to implement cursor based data fetch to ensure that our server never runs out of memory. Also for queries that needs very huge data we utilized UNLOAD feature of Redshift which allows to directly upload the result data to S3 without need to pull the data to application server.
Looking back, DDP became the first self-service tool that enabled lot of automated reporting by user themselves even to the extent that people don’t want to move out of DDP even when well defined reports are provided to them as of today. By providing self-service of data, DDP enabled engineering team to do many important technical tasks without getting caught in data delivery and reporting only. Thanks to all the contributors and fantastic data products team who helped in building DDP.
This is just the first post from the series of products that have been built by data products team at Myntra. Stay tuned for further posts.