The RedDeal story

Pradhan Boccsa
Published in redbus India Blog
11 min read · Feb 2, 2022

At redBus, engineers get the invaluable opportunity to see features go from concept to business USP. In this story I would like to share my experience of watching RedDeals grow from one of many features on the search listing page into a major differentiator for redBus among its competitors.

What are RedDeals

redBus is a marketplace for bus operators (vendors) to sell their inventory (journey tickets/seats). RedDeals is the feature on redBus through which vendors can configure multiple offers on each of their products (trips). Having multiple offers on each product helps the vendor target multiple cohorts of users, such as:
- First time customers
- Returning customers
- Last minute travel planners
- Early bird travelers
- Flash sales available to all customers

Apart from the above, RedDeals also helps vendors increase bus occupancy by automatically activating/removing deals on buses depending on
- “Occupancy levels”, to maintain a minimum occupancy
- “Pricing bands”, as prices vary dynamically
- “Time” to the start of the journey, to maximize occupancy

As is evident, there are 4 parameters which determine the availability of an offer and the discount amount. Here is the list of them and how fast their values change:

  1. Occupancy and price: redBus has close to 100 K trips (Source-Destination pairs) listed per date, and we maintain up to 90 days of inventory, totaling about 5–6 million active listings. To keep price and occupancy fresh, redBus generates about 60–80 K real time update (RTU) events per minute. RTU events are generated at bus and date of journey granularity, which we refer to as RT_DOJ (RouteID+DOJ) in this story.
  2. Loyalty of customer to vendor: We need to show special deals depending on customer loyalty on our Search Listing pages (we call it SRP). The SRP on redBus is accessed at 25 K req/min, and each page, defined by a combination of Source-Destination-Date of journey (S_D_DOJ), has on average 50 buses. Some busy routes have up to 400 buses listed on them.
  3. Time to journey: We maintain RedDeal validity based on time at both day level (for early bird offers) and minute level (for last minute offers). Hence we need to keep cleaning up the list of active offers throughout the day, and a bulk cleanup is also required at midnight.
  4. Vendor actions: Vendors can activate/deactivate deals at any point using the operations dashboard. One such action affects thousands of routes and DOJs, as RedDeals are often configured to apply to multiple DOJs and routes. There are about 8 K offers live from 3 K vendors, and on a daily basis there are about 200–300 actions to enable or disable offers manually.

RedDeal V1 AKA Operator Offer

When RedDeals were first launched, the feature was called “Operator Offer” and was implemented using an XML based rule engine in .NET. The first version of the SRP API was implemented as part of a monolith and hence did all the computation after it had built the search listing. This meant 5 million computations were done every minute (25 K SRP req/min X 50 buses per listing X 4 deals per bus on average).

Even though this first approach computed results with real time data, it had 3 problems:

  1. A lot of the computation was duplicated across independent SRP requests.
  2. The API did not scale well, and computation became a bottleneck during peak traffic hours and holiday seasons.
  3. Introducing new types of offers would be difficult.

Over time, redBus inventory has grown healthily and the penetration of RedDeals has risen to more than 25%.

After a few weeks of throwing 12 beefy machines at the problem (and still failing to get the desired latency), we decided to revisit the overall architecture behind RedDeals. The new system was expected to address all the limitations of the previous implementation.

What could be done better

Some key decisions were taken upfront to guide the overall architecture during the revisit. They were as follows:

  1. Any price or occupancy change should be handled only once.
  2. Any status change of deals should be applied to all the affected SRP listings in the background, reducing the computation needed in the SRP API to a minimum.
  3. Freshness of price and discount on the SRP can be kept at 1–2 minutes accuracy, whereas in the fare breakup before payment completion the price has to be 100% accurate.

All the new requirements were pointing in the direction of building a system which would

  1. Pre-compute the deals for all the inventory and all 90 DOJs upfront for SRP
  2. Keep the discount amount up to date by subscribing to RTU feeds and recomputing the best deal available
  3. Update the computed results in the background on deal enable/disable actions and recompute the best deal available
  4. Restrict SRP API computation to merging personalized offers for logged in users

Choosing the data model

It is important to understand the access patterns for both read and write traffic to come up with a data model which satisfies all the requirements listed above. At a high level, the natural starting point is a single object per S_D_DOJ that aggregates the inventory, configured offers and computed results for every bus on that listing; a rough sketch follows.
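For illustration only, such an aggregate could look roughly like the following in Go; the type and field names are hypothetical, meant to show how much state would live inside one S_D_DOJ object rather than the actual production schema.

```go
package reddeals

import "time"

// Hypothetical sketch of one aggregate keyed by S_D_DOJ.
// None of these names are the actual production schema.
type Offer struct {
	OfferID     string
	DiscountPct float64
	ValidFrom   time.Time
	ValidTill   time.Time
}

type BusDeals struct {
	RouteID   string
	Fare      float64 // updated by the RTU feed
	SeatsLeft int     // updated by the RTU feed
	Offers    []Offer // configured by the vendor
	BestOffer string  // recomputed on every change
}

type SDDOJ struct {
	Source, Dest, DOJ string
	Buses             []BusDeals // ~50 on average, up to ~400 on busy routes
}
```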

Even though S_D_DOJ seems to be a natural choice to aggregate all the data points, there are a few drawbacks to this model:
- There would be contention between read and write traffic accessing different parts of the object
- Multiple RT_DOJ objects and offer validities would be getting updated concurrently within a single S_D_DOJ object
- Both read and write hot spots would be created for popular S_D_DOJs

So we decided to split the above object into 2 pieces (a rough sketch of both follows the list):

  1. S_D_DOJ object: This object contains only the best (highest discount amount) offer for every RT_DOJ inside an SRP listing. So no computation is required on read by the SRP API to find the best offer. The only write required is to update the best offer.
  2. RT_DOJ object: Here we keep all the detailed information about configured offers, offer validity flags and pre-computed results. All the write traffic from occupancy/fare changes, time based activation and vendor actions is applied here, and only when the best offer changes do we update the corresponding S_D_DOJ objects.
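Again as a hedged sketch with assumed names, the split keeps the write-heavy detail on the RT_DOJ side and only the already-chosen best offer on the SRP side:

```go
package reddeals

// Write-heavy side: one object per RouteID+DOJ (RT_DOJ).
type RTDOJDeals struct {
	RouteID, DOJ string
	Fare         float64            // refreshed from the RTU feed
	SeatsLeft    int                // refreshed from the RTU feed
	OfferIDs     []string           // vendor-configured offers
	Evaluated    map[string]float64 // offerID -> pre-computed discount
	BestOfferID  string             // shipped to the SRP object only when it changes
}

// Read-heavy side: one object per SRP listing (S_D_DOJ),
// holding just the best offer per bus so the SRP API does no computation.
type BestOffer struct {
	OfferID  string
	Discount float64
}

type SRPDeals struct {
	Source, Dest, DOJ string
	Best              map[string]BestOffer // keyed by RT_DOJ
}
```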

Also, to distribute the data and read/write traffic equally across the nodes in the data store, we adopted the following partitioning strategy (a sketch of the shard selection follows the list).

  1. (S+D+DOJ day of week)% N for SRP (S_D_DOJ) objects
  2. (RouteID)%N for RT_DOJ objects
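As a minimal illustration of the idea: the post gives the formulas above, while the sketch below hashes string keys onto N nodes; the use of FNV on strings is an assumption, the original may simply sum numeric IDs before taking the modulus.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shard maps a key onto one of n storage nodes.
// FNV-1a is an assumption; any stable hash (or a raw ID sum) works the same way.
func shard(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % n
}

func main() {
	const nodes = 7 // matches the 7-node Redis cluster described below

	// SRP (S_D_DOJ) objects: partition on Source + Destination + day of week.
	fmt.Println(shard("BLR|HYD|Tue", nodes))

	// RT_DOJ objects: partition on RouteID alone.
	fmt.Println(shard("route-123456", nodes))
}
```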

A small side effect of this was the introduction of a queue to ship the data updated inside RT_DOJ objects to the corresponding S_D_DOJ SRP structure. But this was limited to shipping only the best deals inside each RT_DOJ object.

To handle deal invalidation at midnight we created one data set per DOI (date of sale/issue) and switched from old to new at the stroke of midnight. This made the SRP data structure very simple to read and process, as all the data present in it was valid. It also meant we would be pre-computing today all the deals which need to be activated tomorrow.

We chose to implement this partitioning strategy over a Redis cluster consisting of 7 nodes. Each Redis node has a slave node in an automatic failover setup. The daily rotation of data sets was handled by simply using the 16 DBs available inside Redis, with DBs 1–7 representing the data sets to be used from Sunday to Saturday.
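A small sketch of that rotation, assuming the go-redis client (the post does not name the client library); the only real logic is mapping the day of week onto DBs 1–7, so the midnight switch is just a change of DB index.

```go
package reddeals

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// dbForDay maps a date to one of the 16 logical Redis DBs:
// DBs 1-7 hold the pre-computed data sets for Sunday through Saturday.
func dbForDay(t time.Time) int {
	return int(t.Weekday()) + 1 // Sunday=0 -> DB 1 ... Saturday=6 -> DB 7
}

// clientForDOI returns a client pointed at the data set valid for that date.
func clientForDOI(addr string, doi time.Time) *redis.Client {
	return redis.NewClient(&redis.Options{Addr: addr, DB: dbForDay(doi)})
}
```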

Choosing the compute model

The following factors determined the scale of computation required by the new system:
- On average 3–5 offers are configured per RT_DOJ
- There are 5–10 million RT_DOJ in the system spread across 90 DOJs
- RTU feeds are coming in at 80K/min
- Vendors add or remove approximately 300 offers daily, each of which results in updates to a few hundred to several thousand RT_DOJ objects

I had previously tried to ingest the RTU feed using Java and could achieve at best 120 writes/sec on a single machine, with the memory footprint running into GBs. Hence for this project we decided to benchmark Golang with the same data.

The RTU POST API

In Golang we were able to write a POST API accepting 1300 RTU updates/sec of 50 KB size each while running under a 40 MB resident set. We achieved good optimization of memory utilization and compute by the following (a rough sketch follows the list):

  1. Starting 5–10 persistent goroutines and pushing the POST data to them through channels. This let us respond to the RTU caller quickly without spawning a new background goroutine per request.
  2. POST data in the form of a JSON string was converted to a Golang struct in a way that avoided the data escaping to the heap, so allocation was done on the stack, which does not require GC.
  3. We avoided parsing the whole JSON object and concentrated on extracting only the price and availability data using the super lightweight GJSON library.
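Below is a minimal sketch of that shape, assuming the gjson package and illustrative JSON paths (not the real RTU field names): the handler only enqueues the raw body, and a fixed pool of persistent workers extracts the two fields of interest.

```go
package main

import (
	"io"
	"net/http"

	"github.com/tidwall/gjson"
)

// rtuQueue hands raw RTU payloads to a fixed pool of persistent workers,
// so the handler can acknowledge the caller immediately.
var rtuQueue = make(chan []byte, 1024)

func rtuHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	rtuQueue <- body // no per-request goroutine is spawned
	w.WriteHeader(http.StatusAccepted)
}

func worker() {
	for body := range rtuQueue {
		// Extract only the fields we care about instead of unmarshalling the
		// whole ~50 KB payload. The JSON paths here are assumptions.
		fare := gjson.GetBytes(body, "fareDetails.baseFare").Float()
		seats := gjson.GetBytes(body, "availability.totalSeats").Int()
		_, _ = fare, seats // feed the diff analyzer and downstream processing
	}
}

func main() {
	for i := 0; i < 8; i++ { // 5-10 persistent workers, as described above
		go worker()
	}
	http.HandleFunc("/rtu", rtuHandler)
	http.ListenAndServe(":8080", nil)
}
```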

One more optimization we introduced was a diff analyzer in the RTU pipeline, which reduced the load on the downstream processors by more than 90%. We made the diff analyzer fast by encoding the price and availability data into a binary blob and doing a very quick binary comparison of the previously stored state and the new state.
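A minimal sketch of this idea, with the fields and layout assumed for illustration: pack the volatile values into a fixed binary layout and compare the blobs byte for byte.

```go
package reddeals

import (
	"bytes"
	"encoding/binary"
)

// encodeState packs the volatile RTU values into a compact binary blob.
// The choice of fields (fare in paise, seats left) is an assumption of this sketch.
func encodeState(farePaise uint32, seatsLeft uint16) []byte {
	buf := make([]byte, 6)
	binary.BigEndian.PutUint32(buf[0:4], farePaise)
	binary.BigEndian.PutUint16(buf[4:6], seatsLeft)
	return buf
}

// changed reports whether the new state differs from the previously stored blob:
// a single byte-wise comparison instead of field-by-field checks.
func changed(prev, next []byte) bool {
	return !bytes.Equal(prev, next)
}
```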

Compute in Lua

One of the challenges in RedDeals was the amplification of computation. This was due to multiple offers being configured for the same RT_DOJ, and also to the need to compute for multiple cohorts of users. Then there is the requirement to find the best offer in each segment.

This would have resulted in multiple round trips to the storage engine, and network calls would have become a bottleneck. To avoid this we took a call to package all of the above computation, which requires just one set of price and availability as input but queries offer master data multiple times, into a small Lua script. We found the performance of this approach to be acceptable.
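To make the idea concrete (this is not the production script), a stripped-down evaluation could read every configured offer from an RT_DOJ hash, pick the highest discount for the latest fare and persist it, all in one round trip; the key layout, the flat percentage-discount model and the go-redis client are assumptions of this sketch.

```go
package reddeals

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Stand-in for the evaluation script: walk every offer configured on the
// RT_DOJ hash (offerID -> discount percent), pick the best for the given fare
// and store it, all inside Redis so only one network round trip is needed.
const bestOfferLua = `
local fare = tonumber(ARGV[1])
local offers = redis.call('HGETALL', KEYS[1])
local bestID, bestDisc = '', 0
for i = 1, #offers, 2 do
  local disc = fare * tonumber(offers[i+1]) / 100
  if disc > bestDisc then bestID, bestDisc = offers[i], disc end
end
redis.call('HSET', KEYS[2], 'bestOffer', bestID, 'bestDiscount', tostring(bestDisc))
return bestID
`

func updateBestOffer(ctx context.Context, rdb *redis.Client, rtDojKey, bestKey string, fare float64) (string, error) {
	res, err := rdb.Eval(ctx, bestOfferLua, []string{rtDojKey, bestKey}, fare).Result()
	if err != nil {
		return "", err
	}
	return res.(string), nil
}
```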

Eventually we have a system which takes 15–25 ms overall to:
- Parse each RTU feed object
- Compute all the offers
- Update the best discount for multiple categories.

The Lua script, which runs on the single threaded execution model inside Redis, gave us an average latency of 2 ms per invocation. That gives us a healthy throughput of 400–600 writes/sec per Redis node against an expected load of 130 writes/sec after the diff analyzer pass.

SRP API

As we have partitioned the SRP data by S_D_DOJ, we can get the best offers for all the routes in a single read, within 2 ms, from Redis. But we still need to add loyalty based offers depending on the transaction history of the user with each vendor. If we pre-computed these, it would become a huge data set and each personalized output would see very little utilization.

But the silver lining is that there are only 3 types of personalized offers. So we pre-compute them and choose one depending on the customer category: “First time customer”, “Return trip customer” or “Nth transaction customer”. The user's transaction history is fetched from the Personalization micro service and is used to choose the loyalty offer on the fly.
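A hedged sketch of the merge step that remains in the SRP API path, with hypothetical types: since the three loyalty variants are already pre-computed, serving is just an index lookup driven by the category returned by the personalization service.

```go
package reddeals

// Cohorts for which loyalty variants are pre-computed, mirroring the
// categories described above; the structure itself is an assumption.
type Category int

const (
	FirstTime Category = iota
	ReturnTrip
	NthTransaction
)

type DealView struct {
	OfferID  string
	Discount float64
}

// PrecomputedDeal holds the generic best offer plus the three loyalty variants
// for one bus on the SRP.
type PrecomputedDeal struct {
	Generic DealView
	Loyalty [3]DealView // indexed by Category
}

// pickDeal is the only computation left in the SRP API: choose between the
// generic offer and the loyalty variant for the caller's category.
func pickDeal(d PrecomputedDeal, loggedIn bool, c Category) DealView {
	if loggedIn && d.Loyalty[c].OfferID != "" {
		return d.Loyalty[c]
	}
	return d.Generic
}
```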

After the above optimizations we are able to serve the SRP API with < 5 ms latency from 2 X 4 core machines, a huge reduction from the 12 X 8 core machines required before.

Scope for future improvements

Overall design and development of the new system took close to 4 months with 3 developers. During this period many technology choices were made to balance business needs with the long term maintainability of the code base. Sure, that has resulted in us taking on some technical debt, but there is a clear separation of functionalities which makes addressing it easy. We will discuss some of these below.

Parallel processing

Between the RTU feed, operator actions and time, we have 3 ways for an offer to be enabled/disabled. Protecting each RT_DOJ and S_D_DOJ object from concurrent updates is essential to avoid loss of data and, worse, corruption.

Even though the Lua scripts take less than 2 ms per invocation (thanks to the data structures used), this is still not a very scalable model in the long term. We can add more nodes to the cluster to get more compute power, but that comes at the cost of wasting memory.

A better long term approach would be to introduce processor/thread ownership at the RT_DOJ/S_D_DOJ level using Etcd/ZooKeeper/Redis. These processors could then be written in any scalable compute framework outside Lua.

Evaluation Queues

We have heavily used the List data structure available in Redis as queues for batch evaluation, SRP updates and user actions. We could easily replace these queues with external durable queues like Kafka, which would give us

  1. Durability
  2. Repeatability: We can re-run the events in case of node failures or for debugging purposes.
  3. Scalability and concurrency control: By using RT_DOJ/S_D_DOJ as message keys we get ordering of events even with multiple parallel processors.

Undo / Roll Back Queues

We get the RouteID list from our inventory system every time an offer is activated or deactivated. Sometimes this API does not return the same set of RouteIDs when called after a long duration, because RouteIDs get disabled for various reasons.

This behavior results in incomplete cleanup of the SRP after an offer has been disabled, or incomplete population of the SRP after activation.

A solution to this problem is to store the RT_DOJ set generated for every action and use it to apply the UNDO/ROLLBACK action.

Deployment

As mentioned earlier there are only 3 machines handling all the work, so deployment has not been a main concern and is still a manual process. But if we go down the path of moving the evaluation jobs outside Lua for scaling purposes, we would need robust deployment automation.

Durable storage

In the current implementation only Redis holds the computed result sets. Even though we have a master-slave setup with a failover mechanism to handle any node outage, we can still do better in terms of durability. In the worst case scenario of losing both master and slave, we have no option but to recompute all the RT_DOJ in that particular shard, which typically takes about 1 hour.

A better alternative with both replication and durability needs to be explored, but this new system has to give the same or better write throughput than Redis. Perhaps durable storage for the RT_DOJ data structure and in-memory storage for the S_D_DOJ structure would give us the right balance.

Key takeaways

I have been closely involved with the development of RedDeals from day 1. Some of the key learnings from this 5 year long journey can be summarized as follows:

  1. Understanding read/write patterns and adjusting your data model to optimize those paths gives huge performance and scale advantages. For existing systems, adding domain specific metrics at a fine granularity can help identify bottlenecks.
  2. Sometimes users use the platform in completely unexpected ways, so keeping track of all the steps the system took to handle a request is essential from day 1, both to assist with user queries and to unearth hidden bugs.
  3. Having a real time feed like RTU enables a lot of innovation. We have implemented BP-DP search, Near-by search, the Tentative unblock process and the Inventory change feed on top of this raw seat layout feed.
  4. We used binary encoding for storing seat fare and availability in the RTU diff service, but plain text CSV to store (offerID, net fare, discounted fare) tuples in the All offer evaluation results. The former provided fast comparison for diffing, whereas the latter proved invaluable in quickly debugging issues. This might be a small detail, but we learnt that by giving up a little space optimization we gained a lot of visibility into the working of the rules.
  5. Any feature on the platform can grow to be your USP, hence embracing a micro services approach enables quicker iterations and innovations.
