Swachh Data — redBus Style
“Without a systematic way to start and keep data clean, bad data will happen.”
Fun fact: Despite being only the 7th largest country by area, India has the 2nd largest road network in the world.
‘Not so fun’ fact: Unlike in trains or flights, where passengers can board and alight at designated stations or airports, bus stations can crop up anywhere on our massive road network.
With buses connecting 90% of Indian cities, the number of boarding and dropping points comes close to the number of pincodes in the country.
This makes choosing a boarding or dropping point a nightmare for travelers, especially those visiting a new city. In Hyderabad, similar-looking location names like “Lingampalli” and “Bagh Lingampalli” are in reality 24 km apart.
Each bus operator has its own naming convention for boarding points. For example, Kempegowda Bus Stand is also known as Majestic Bus Stand, and some operators abbreviate it to K.B.S.
The problem is compounded by the fact that redBus has more than 70 different software integrations pooling in data from 1,800 bus operators, covering 67,000+ routes and serving 3,000+ cities. Add to this the nuances in naming conventions: a place like Bangalore had 25,000+ boarding point names!
Before we decided to draft a process to handle this problem, there were 300 permutations of BTM Layout (an area in Bangalore) alone!
For example, Bangalore’s boarding points before the clean-up:
For a customer, filtering bus services by boarding point in an area like BTM Layout or Jaya Nagar was a near-impossible task.
We needed two things:
1. A list of all valid boarding points in every city: the grid-list
2. A process to map any boarding point configured by a bus operator to one item on the grid-list
With the help of text clustering and the local knowledge of our inventory on-boarding team, we built a grid-list of all valid boarding points in each city. This is now the reference grid-list: sanitized and geo-tagged.
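To give a feel for what text clustering of boarding point names can look like, here is a minimal sketch. It is not the redBus implementation: the clustering method, the `normalize` and `cluster_names` helpers, and the 0.8 threshold are all assumptions chosen for illustration, using only Python's standard `difflib`.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def cluster_names(names, threshold=0.8):
    """Greedy single-pass clustering (an illustrative stand-in).

    Each name joins the first cluster whose representative is similar
    enough; otherwise it starts a new cluster. The threshold is a
    hypothetical value, not one taken from the article.
    """
    clusters = []  # list of (representative, [original names])
    for raw in names:
        key = normalize(raw)
        for representative, members in clusters:
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                members.append(raw)
                break
        else:
            clusters.append((key, [raw]))
    return clusters


sample = ["BTM Layout", "B.T.M. Layout", "BTM layout.", "Jaya Nagar", "Jayanagar"]
for representative, members in cluster_names(sample):
    print(representative, "<-", members)
```

In a setup like this, each cluster would still be reviewed by someone with local knowledge before one clean, geo-tagged name is chosen for the grid-list.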
For the second leg, the process developed at redBus ensures that whenever an operator configures a new boarding point, it is automatically mapped to an existing boarding point in the grid-list, thereby keeping the data clean. This is tricky, given how unstructured boarding point names can be and that the data can flow in from any of the 70 different software integrations.
The mapping algorithm uses a weighted approach including Soundex, Split Minima, Reverse Matching, Grep Scan, a Naive Bayes classifier (and other statistics jargon to confuse the reader :))
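The idea of a weighted approach can be sketched as follows. This is an illustration only, not the actual redBus algorithm: it combines a standard Soundex code with two simple stand-in signals (character similarity and token containment), and the weights, helper names, and sample grid-list are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Standard American Soundex letter-to-digit table.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate equal codes
            prev = digit
    return (code + "000")[:4]


def tokens(name: str):
    return re.findall(r"[a-z0-9]+", name.lower())


def phonetic_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the Soundex codes of the tokens of each name."""
    sa = {soundex(t) for t in tokens(a)} - {""}
    sb = {soundex(t) for t in tokens(b)} - {""}
    return len(sa & sb) / len(sa | sb) if sa and sb else 0.0


def char_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, " ".join(tokens(a)), " ".join(tokens(b))).ratio()


def containment(candidate: str, reference: str) -> float:
    """Fraction of reference tokens that appear verbatim in the candidate."""
    ref = set(tokens(reference))
    return len(set(tokens(candidate)) & ref) / len(ref) if ref else 0.0


# Hypothetical weights, chosen for illustration only.
WEIGHTS = {"phonetic": 0.4, "chars": 0.4, "contained": 0.2}


def match_score(candidate: str, reference: str) -> float:
    return (WEIGHTS["phonetic"] * phonetic_overlap(candidate, reference)
            + WEIGHTS["chars"] * char_similarity(candidate, reference)
            + WEIGHTS["contained"] * containment(candidate, reference))


def best_match(candidate: str, grid_list):
    """Return (grid-list entry, score) with the highest weighted score."""
    return max(((ref, match_score(candidate, ref)) for ref in grid_list),
               key=lambda pair: pair[1])


# Made-up grid-list for illustration.
grid = ["Lingampalli", "Bagh Lingampalli", "BTM Layout", "Kempegowda Bus Stand"]
print(best_match("Bagh Lingampally", grid))
```

With these illustrative weights, the misspelt "Bagh Lingampally" is matched to "Bagh Lingampalli" rather than to the similar-looking "Lingampalli". Combining several weak signals like this is what makes a weighted approach more robust than any single matcher.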
With this, we were able to automate the mapping of boarding/dropping points with 99% accuracy. Bus operators constantly commission (and decommission) services, thereby adding new boarding point names. In any given week, the top 6 cities alone see around 2,000 new boarding and dropping point names. Automating the mapping is the only way to maintain data sanity. The support team takes care of the un-mapped points (remember, the algo takes care of 99% of the cases).
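The split between automatic mapping and the support team can be pictured as a confidence threshold. The sketch below is an assumption about how such routing might work, not a description of the redBus system: the `similarity` function is a simple `difflib` stand-in for the real scorer, and `AUTO_MAP_THRESHOLD` is a hypothetical value.

```python
from difflib import SequenceMatcher

AUTO_MAP_THRESHOLD = 0.85  # hypothetical cut-off, not from the article


def similarity(a: str, b: str) -> float:
    """Stand-in scorer: character similarity on lowercased, trimmed names."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def route(new_names, grid_list, threshold=AUTO_MAP_THRESHOLD):
    """Map confident matches automatically; queue the rest for manual review."""
    mapped, needs_review = {}, []
    for name in new_names:
        best = max(grid_list, key=lambda ref: similarity(name, ref))
        if similarity(name, best) >= threshold:
            mapped[name] = best
        else:
            needs_review.append(name)
    return mapped, needs_review


# Made-up names for illustration.
mapped, needs_review = route(
    ["BTM Layout.", "Btm  layout", "Silk Board"],
    ["BTM Layout", "Majestic Bus Stand"],
)
print(mapped)
print(needs_review)
```

Names that clear the threshold are mapped without human effort, while anything ambiguous lands in a review queue instead of being forced onto the wrong grid-list entry.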
For the top 10 cities alone, we were able to map 200,000 boarding points to 750 unique geo-tagged ones.
After the clean-up of redundant boarding points, this is what the boarding point selection list looks like:
This is a snapshot of the process we use to handle boarding and dropping point geo-tagging.
With this being an ongoing exercise, we now have the process and the technology to keep the data clean.
More power to redBus customers!
Originally published at blog.redbus.in on February 4, 2016.