Blueprint: Qualitative and Quantitative Clickstream Event Analysis
The amount of clickstream data being processed and stored on the Analytics front was enormous. If we could offer users a platform through which they could derive results without needing the required technical expertise, it would automate many processes at Myntra that are currently done manually.
The less human intervention involved in deriving results and analysis from data, the better the conclusions.
The major motivation to build an application to verify clickstream events, both qualitatively and quantitatively, arose when we noticed the cross-team dependencies involved in getting the required data. Often the Applications team would have to rely on the Data Insights team to verify whether the roll-out of a new application was successful, in terms of receiving the desired number of events and maintaining a similar ratio across various dimensions (e.g., screen-load events per session).
Another use case where Blueprint could play an important role was in the multi-tenancy efforts being undertaken across the organisation. The event-quality leg of Blueprint could make the event definitions accessible to the tenants (Myntra and Jabong), and the various teams internally could refer to the existing event definitions to unify the structures in future releases.
Business Use Cases
- Notifying appropriate owners about obsolete applications and events, which have not been recorded for a considerable amount of time.
- Reporting new applications, events and keys contained within the event body.
- Eliminating the manual effort required to draw conclusions from the numbers related to clickstream data.
- Easing the attainment of multi-tenancy by minimising the error ratio and unifying the event structures.
- Ensuring a safe application roll-out through an automated process with multiple count/percentage verification stages.
The architecture primarily consists of two parts: a) Data Ingestion and Processing, and b) Services.
Blueprint is powered by:
- Analytics Data Pipeline, Meterial: It records clickstream events from various user-facing applications and allows the Elastic Load Balancer (ELB) to distribute the events to various Heimdals, which are Kafka producers written in Go. Secor then takes the responsibility of reading clickstream events from the various Kafka queues and putting them into the Data Lake.
- Apache Spark and Database: We consume the raw JSON events, which are stored in specific Amazon S3 buckets. The Spark jobs, written in Scala, act on them in order to crunch the useful information out of them and store it in MySQL, which we use as the primary Blueprint database.
- Services: The Blueprint backend services have been designed using Spring Boot, written in Java, while the frontend is powered by ReactJS.
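As a rough illustration, a clickstream event flowing through this pipeline into the Data Lake could be modelled as follows. The field names here are hypothetical, not Myntra's actual event schema:

```java
// Hypothetical shape of a clickstream event as it lands in the Data Lake.
// Field names are illustrative; the real event body is tenant-specific JSON.
public class ClickstreamEvent {
    public final String tenant;     // e.g. "Myntra" or "Jabong"
    public final String appName;    // e.g. "MyntraRetail v3.2 Android"
    public final String eventName;  // e.g. "addToCart"
    public final String screen;     // e.g. "Products Page"
    public final long   timestamp;  // epoch millis when the event fired

    public ClickstreamEvent(String tenant, String appName, String eventName,
                            String screen, long timestamp) {
        this.tenant = tenant;
        this.appName = appName;
        this.eventName = eventName;
        this.screen = screen;
        this.timestamp = timestamp;
    }
}
```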
The data processing layer acts as the backbone of the entire Blueprint project. This process can logically be divided into two legs: the Metadata Processor and the Events Processor.
The metadata processor ran once, at the time of Blueprint's initiation, and consumed a month's data. This was done in order to obtain all the applications and events received in that period. The job was also able to produce the valid combinations of the two.
The second leg, the events processor, runs every hour. It consumes clickstream events for all the available tenants (Myntra and Jabong, at the time of writing this post) and checks for any new event, new application, or new combination of the two. As an additional step, it stores the count of individual events received from a specific application and triggered from an internal screen. For example: addToCart was recorded 1,200 times from the Products Page on MyntraRetail v3.2 on Android.
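The counting step above can be sketched as an aggregation keyed on the (application, screen, event) triple. In production this is a Spark job in Scala over the raw S3 JSON; the sketch below only shows the core tallying logic, with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the hourly events-processor counting step: tally how many times
// each (application, screen, event) combination was seen in the last hour.
// Illustrative only; the real implementation is a Spark job in Scala.
public class EventCounter {
    // Keyed on "app|screen|event"; values are occurrence counts.
    private final Map<String, Long> counts = new HashMap<>();

    public void record(String app, String screen, String event) {
        counts.merge(app + "|" + screen + "|" + event, 1L, Long::sum);
    }

    public long countFor(String app, String screen, String event) {
        return counts.getOrDefault(app + "|" + screen + "|" + event, 0L);
    }
}
```

At the end of each hour, every entry in such a map would be persisted to MySQL, yielding rows such as "addToCart: 1,200 times, Products Page, MyntraRetail v3.2 Android".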
Another functionality handled by the hourly job is the upsert (update/insert) operation for event and application combinations in the database. If we receive a combination that already exists in our database, we need to update the last-seen timestamp of that specific event. And in the cases where a new combination is recorded, the first-seen timestamp has to be inserted so that appropriate notifications can be triggered.
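These first-seen/last-seen semantics map naturally onto MySQL's `INSERT ... ON DUPLICATE KEY UPDATE`. A minimal in-memory sketch of the same behaviour follows; the table and column names in the comment are hypothetical, not Blueprint's actual schema:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the first-seen/last-seen upsert for an (application, event)
// combination. In MySQL this would roughly be (hypothetical table/columns):
//   INSERT INTO app_event (app, event, first_seen, last_seen)
//   VALUES (?, ?, ?, ?)
//   ON DUPLICATE KEY UPDATE last_seen = VALUES(last_seen);
public class SeenTracker {
    public static final class Seen {
        public long firstSeen;
        public long lastSeen;
        Seen(long ts) { firstSeen = ts; lastSeen = ts; }
    }

    private final Map<String, Seen> table = new HashMap<>();

    // Returns true when the combination is brand new, so that a
    // "new application/event" notification can be triggered.
    public boolean upsert(String app, String event, long ts) {
        String key = app + "|" + event;
        Seen existing = table.get(key);
        if (existing == null) {
            table.put(key, new Seen(ts)); // insert: first_seen = last_seen = ts
            return true;
        }
        existing.lastSeen = ts;           // update: only last_seen moves
        return false;
    }

    public Seen get(String app, String event) {
        return table.get(app + "|" + event);
    }
}
```

The boolean return value is one way the "new combination" case could feed the notification step described above.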
This project wouldn’t have been possible without the support received from the entire Data Products and Platform Team. I’d like to thank Apurva Shah for the constant motivation during my internship. I greatly value the guidance received from my mentor, Karshit Jaiswal, throughout the development process.