Creation of data marts

Dweep Sharma
Published in redbus India Blog
2 min read · Sep 29, 2020

Repeat user data mart by various dimensions — By Dweep Sharma, redBus

Motivation

The redBus data lake covers three main entities: the redBus B2C platform, bus operators, and customers (end users). Every transaction at redBus records data for all three entities, and insights are drawn by pivoting that data on a list of dimensions.

Repeat user analysis is crucial to several business insights. Breaking repeat users down into specific cohorts (for example, repeat users with OperatorId as a dimension) gives knowledge workers greater clarity when setting the feature roadmap. The dimensions vary by use case, so the optimal solution is to inject a dimension list into a computation job, which then triggers the creation of a data mart.

How is it done?

We use Amazon S3 for our data lake. Spark jobs run on AWS Glue, and the generated data mart is queried with Amazon Athena.

We start by ingesting the entire historic transaction data into a Spark job, which computes data points such as lastDoj (date of journey), currentDoj, interval (currentDoj minus lastDoj) and intervalChange for each confirmed ticket. These data points are computed per combination of the dimensions injected into the Spark job. Before the dataset is written to S3, it is partitioned by a dimension (doj) to improve query speed and reduce computation costs. Partition columns can be selected based on prior knowledge of the query filters. The Parquet file format is chosen for its high compression.
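The per-ticket computation above can be sketched in plain Python as a stand-in for the Spark job. This is only an illustration under stated assumptions: the function name `repeat_user_rows`, the field names `userId` and `operatorId`, and the sample tickets are hypothetical, and intervalChange is assumed to mean the current interval minus the previous interval for the same user and dimension values, which the post does not spell out.

```python
from datetime import date
from itertools import groupby


def repeat_user_rows(tickets, dimensions):
    """Compute lastDoj, currentDoj, interval and intervalChange per ticket.

    tickets: list of dicts with 'userId', 'doj' (a date) and one field per
    dimension. dimensions: the injected dimension list, e.g. ["operatorId"].
    """
    def key(t):
        # Grouping key: the user plus every injected dimension.
        return tuple(t[d] for d in ("userId", *dimensions))

    rows = []
    ordered = sorted(tickets, key=lambda t: (key(t), t["doj"]))
    for _, group in groupby(ordered, key=key):
        last_doj, last_interval = None, None
        for t in group:
            # interval = currentDoj - lastDoj, in days (None for a first trip).
            interval = (t["doj"] - last_doj).days if last_doj else None
            # Assumed definition: change relative to the previous interval.
            change = (interval - last_interval
                      if interval is not None and last_interval is not None
                      else None)
            rows.append({**t, "lastDoj": last_doj, "currentDoj": t["doj"],
                         "interval": interval, "intervalChange": change})
            last_doj, last_interval = t["doj"], interval
    return rows


# Hypothetical sample data, not taken from the redBus data lake.
tickets = [
    {"userId": "u1", "operatorId": "opA", "doj": date(2020, 1, 1)},
    {"userId": "u1", "operatorId": "opA", "doj": date(2020, 1, 11)},
    {"userId": "u1", "operatorId": "opB", "doj": date(2020, 1, 5)},
    {"userId": "u1", "operatorId": "opA", "doj": date(2020, 1, 31)},
]
by_operator = repeat_user_rows(tickets, ["operatorId"])
overall = repeat_user_rows(tickets, [])
```

Passing a different dimension list changes the grouping key without changing the logic, which is the point of injecting dimensions. In Spark, the same grouping would typically be expressed as a window partitioned by the key and ordered by doj.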

For successive (daily) runs, we ingest the repeat user data for day t-1 and create a dataset for day t, which is then written to the repeat user data mart, and so on.
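One way to picture the daily run is as a function that takes the state as of day t-1 and the confirmed tickets for day t, and returns the new rows plus the updated state. The sketch below is a simplified assumption, not the actual Glue job: `advance_state`, the shape of the state dictionary, and the sample values are all hypothetical, and intervalChange again assumes "current interval minus previous interval".

```python
from datetime import date


def advance_state(prev_state, todays_tickets, dimensions):
    """Build day t rows from day t-1 state without rescanning history.

    prev_state maps (userId, *dimension values) to
    {"lastDoj": date, "interval": int or None} as of day t-1.
    """
    state = dict(prev_state)  # shallow copy so the t-1 state is not mutated
    rows = []
    for t in sorted(todays_tickets, key=lambda t: t["doj"]):
        k = tuple(t[d] for d in ("userId", *dimensions))
        prev = state.get(k)
        interval = (t["doj"] - prev["lastDoj"]).days if prev else None
        change = (interval - prev["interval"]
                  if prev and prev["interval"] is not None else None)
        rows.append({**t,
                     "lastDoj": prev["lastDoj"] if prev else None,
                     "currentDoj": t["doj"],
                     "interval": interval,
                     "intervalChange": change})
        # Carry forward what the next run needs for this key.
        state[k] = {"lastDoj": t["doj"], "interval": interval}
    return state, rows


# Hypothetical state for day t-1 and tickets for day t.
prev_state = {("u1", "opA"): {"lastDoj": date(2020, 1, 11), "interval": 10}}
today = [
    {"userId": "u1", "operatorId": "opA", "doj": date(2020, 1, 31)},
    {"userId": "u2", "operatorId": "opA", "doj": date(2020, 1, 31)},
]
new_state, new_rows = advance_state(prev_state, today, ["operatorId"])
```

The rows for day t would then be appended to the data mart, and the returned state becomes the input for day t+1.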

How is this data used?

This cohort of users can give bus operators insights into specific service routes. It will be used to power our bus operator dashboard, RedProWin.

It can also help calculate the total lifetime value (TLV) of a customer.

Conclusion

In this short post, we discussed an approach to create data marts for repeat users pivoted by dimensions.
