Creation of data marts

Dweep Sharma
Sep 29 · 2 min read

Repeat user data mart by various dimensions — By Dweep Sharma, redBus

Motivation

There are three main entities in the redBus data lake: the redBus B2C platform, bus operators, and customers (end users). Every transaction at redBus records data for all three entities, and insights are drawn by pivoting that data on a list of dimensions.

Repeat user analysis is crucial for assessing several business insights. Breaking repeat users down into specific cohorts (for example, repeat users with OperatorId as the dimension) gives knowledge workers greater clarity when setting the feature roadmap. The dimensions vary by use case, so an optimal solution is to inject a dimension list into a computation job, which then triggers the creation of a data mart.

How is it done?

We use Amazon S3 for our data lake. Spark jobs run on AWS Glue, and the generated data mart is queried with Athena.

We start by ingesting the entire historical transaction data into a Spark job, which computes data points such as lastDoj (doj being the date of journey), currentDoj, interval (currentDoj minus lastDoj) and intervalChange for each confirmed ticket. These data points can be computed along whichever dimensions are injected into the Spark job. Before the dataset is written to S3, it is partitioned by a dimension (doj) to improve query speed and reduce computation costs. Partition columns can be selected based on prior knowledge of the query filters. The Parquet file format is chosen for its efficient columnar compression.
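To make the per-ticket data points concrete, here is a minimal sketch in plain Python rather than Spark. It is an illustration under assumptions, not the actual redBus job: the function name repeat_metrics, the ticket fields userId and operatorId, and the sample data are hypothetical, and intervalChange is assumed to mean the current interval minus the previous interval, which the post does not define.

```python
from datetime import date
from itertools import groupby


def repeat_metrics(tickets, dimensions):
    """Compute lastDoj, currentDoj, interval and intervalChange per ticket.

    tickets: list of dicts with 'userId', 'doj' (a date) and the dimension keys.
    dimensions: injected list of column names to group by, in addition to userId.
    """
    key_fields = ["userId", *dimensions]

    def key_of(ticket):
        return tuple(ticket[k] for k in key_fields)

    # Sort by group key, then by date of journey, so each group is chronological.
    ordered = sorted(tickets, key=lambda t: (key_of(t), t["doj"]))
    rows = []
    for _, group in groupby(ordered, key=key_of):
        last_doj = None
        last_interval = None
        for t in group:
            # interval is undefined for a user's first ticket in the group.
            interval = (t["doj"] - last_doj).days if last_doj is not None else None
            # Assumption: intervalChange = current interval - previous interval.
            if interval is not None and last_interval is not None:
                change = interval - last_interval
            else:
                change = None
            row = {k: t[k] for k in key_fields}
            row.update(lastDoj=last_doj, currentDoj=t["doj"],
                       interval=interval, intervalChange=change)
            rows.append(row)
            last_doj, last_interval = t["doj"], interval
    return rows


# Hypothetical confirmed tickets for one user across two operators.
sample_tickets = [
    {"userId": "u1", "operatorId": 10, "doj": date(2020, 1, 1)},
    {"userId": "u1", "operatorId": 10, "doj": date(2020, 1, 11)},
    {"userId": "u1", "operatorId": 20, "doj": date(2020, 1, 5)},
    {"userId": "u1", "operatorId": 10, "doj": date(2020, 1, 16)},
]
by_operator = repeat_metrics(sample_tickets, ["operatorId"])
overall = repeat_metrics(sample_tickets, [])
```

Passing a different dimensions list changes the cohort without touching the logic, which is the point of injecting dimensions into the job. In Spark the same grouping would typically be expressed with a window partitioned by the key fields and ordered by doj.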

For successive (daily) runs, we ingest the repeat user data for day t-1 and use it to create the dataset for day t, which is then written to the repeat user data mart, and so on.
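The daily run can be sketched the same way, again in plain Python instead of Spark. This assumes the data already in the mart holds at least the most recent row per user and dimension combination; the function name incremental_run, the field names userId and operatorId, and the sample rows are hypothetical, and intervalChange is again assumed to be the current interval minus the previous one.

```python
from datetime import date


def incremental_run(previous_rows, todays_tickets, dimensions):
    """Build day t's rows from the mart's earlier rows plus day t's tickets."""
    key_fields = ["userId", *dimensions]

    def key_of(record):
        return tuple(record[k] for k in key_fields)

    # Keep only the latest known row per key as the carried-over state.
    state = {}
    for row in previous_rows:
        key = key_of(row)
        if key not in state or row["currentDoj"] > state[key]["currentDoj"]:
            state[key] = row

    new_rows = []
    for t in sorted(todays_tickets, key=lambda t: t["doj"]):
        prev = state.get(key_of(t))
        last_doj = prev["currentDoj"] if prev else None
        last_interval = prev["interval"] if prev else None
        interval = (t["doj"] - last_doj).days if last_doj is not None else None
        # Assumption: intervalChange = current interval - previous interval.
        if interval is not None and last_interval is not None:
            change = interval - last_interval
        else:
            change = None
        row = {k: t[k] for k in key_fields}
        row.update(lastDoj=last_doj, currentDoj=t["doj"],
                   interval=interval, intervalChange=change)
        new_rows.append(row)
        state[key_of(t)] = row  # later tickets on the same day see this row
    return new_rows


# Hypothetical state already in the mart and tickets for the new day.
mart_rows = [
    {"userId": "u1", "operatorId": 10, "lastDoj": date(2020, 1, 1),
     "currentDoj": date(2020, 1, 11), "interval": 10, "intervalChange": None},
]
new_tickets = [
    {"userId": "u1", "operatorId": 10, "doj": date(2020, 1, 16)},
    {"userId": "u2", "operatorId": 10, "doj": date(2020, 1, 16)},
]
day_t_rows = incremental_run(mart_rows, new_tickets, ["operatorId"])
```

Carrying state forward this way avoids rescanning the full transaction history on every run, at the cost of depending on the previous day's output being complete.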

How is this data used?

This cohort of users can provide insights to bus operators for specific service routes. It will be used to power our bus operator dashboard, RedProWin.

It can also help calculate the total lifetime value (TLV) of a customer.

Conclusion

In this short post, we discussed an approach to create data marts for repeat users pivoted by dimensions.

redBus India Blog