Leveraging Machine Learning for Automation in an Inventory Platform

Mohit Goyal
redBus India Blog
Sep 26, 2024

In any rapidly growing inventory platform, seamless integration with various Global Distribution Systems (GDS) is essential for efficient operations. Each GDS often follows its own standards for metadata such as boarding point, city name, and bus type. As a platform, we need to unify and map this diverse incoming data to our internal identifiers. One key challenge is mapping boarding points across different GDS systems to internal redBus subregions/grids.
Automation is crucial to solving this challenge, as it helps with both scalability and accuracy.

Use Case: Automating Boarding Point Mapping

Automating the mapping of boarding point addresses to a particular area not only reduces manual work but also delivers several broader business benefits:

  1. Expanding Search Size: By automating the process, we can expand the scope of BP/DP-based search (instead of searching “Bangalore” to “Hyderabad”, search “Madiwala” to “Gachibowli”).
  2. Improving Customer Experience: The automated system provides clearer context about BP/DP details for customers when selecting a boarding point, as operator-provided names/addresses often contain irregularities, misspellings, and missing key information.

Addressing a Common Industry Challenge

Many leading e-commerce companies like Myntra and Flipkart have faced a similar challenge when trying to make last-mile delivery more accessible. The addresses provided by customers often include inconsistencies, such as spelling mistakes, incomplete details, or incorrect locality information. To tackle this issue, these companies have developed systems that map unstructured addresses to more standardised delivery zones or regions [3].

Initial POC: Fuzzy Matching

Fuzzy matching helped to a certain extent by identifying slight variations or misspellings in addresses. For example, it could successfully map “MG Road” to “Mahatma Gandhi Road” or handle misspelled words like “Yashwantpur” instead of “Yeshwanthpur.”

However, there were significant challenges. Not all boarding point names/addresses include the specific area name. A case in point is “Govardhan Theatre”, a well-known boarding point in Bangalore whose corresponding area is “Yeshwanthpur”. In this case, fuzzy matching fails since there’s no textual similarity between “Govardhan Theatre” and “Yeshwanthpur”, leading to incorrect mappings.

To overcome these limitations, we realized that a more robust solution was needed — one that could understand the context of addresses, even when area names weren’t explicitly mentioned. This is where Machine Learning (ML) came into the picture.

Solution: Leveraging Historical Data to Build a Machine Learning Pipeline

Given our company’s long-standing presence in the industry, we have accumulated a wealth of data over the years. Of the roughly 20 lac unique boarding points spread across India, nearly 8 lac have already been mapped manually by our support teams. This raised an important question:

Why not leverage this rich historical data to automate the mapping of new BP/DP?

Trying Classification Models: Logistic Regression vs. Random Forest

As part of the solution, I experimented with a few classification models after performing basic data cleaning.

High-Level Design of the Pipeline

Logistic Regression: Initial Results and Limitations

I first applied Logistic Regression to the cleaned dataset. However, the results were less accurate than expected due to several limitations:

1. Difficulty with non-linearity in the dataset

2. Handling high dimensionality

3. Inability to capture complex interactions

Random Forest Classifier: A Better Fit for the Dataset

Next, I tried the Random Forest Classifier, which performed significantly better. Random Forests are particularly well-suited for handling large and complex datasets, dealing with high-dimensional feature spaces, and providing insights into feature importance.

How Does Random Forest Classification Work?

Random Forest Classification is an ensemble learning technique designed to enhance the accuracy and robustness of classification tasks. The algorithm builds a multitude of decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees. Each decision tree in the random forest is constructed using a subset of the training data and a random subset of features, introducing diversity among the trees and making the model more robust and less prone to overfitting.
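As a minimal sketch of the API (scikit-learn’s RandomForestClassifier [1], with toy feature vectors standing in for our real address features):

from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors and labels; the real model is trained on
# vectorized boarding point addresses (see the TF-IDF section below)
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = ["area_a", "area_b", "area_a", "area_b"]

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)

print(clf.predict([[0, 1]]))      # class chosen by majority vote across the trees
print(clf.feature_importances_)  # per-feature importance scores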

Source: https://dsc-spidal.github.io/harp/docs/examples/rf/

Results on different approaches

Data Cleaning: A Key Step in Achieving Accurate Address Mapping

One of the biggest challenges in mapping boarding points is the inconsistency and noise in the data. Address strings often contain typos, abbreviations, unnecessary information like phone numbers, and variations in spacing. To improve the accuracy of our machine learning model, I applied several data cleaning techniques to ensure the input data is structured and uniform before feeding it into the model.

Data Preprocessing flow

Below are the key techniques I used:

1. Basic Cleaning

The first step in the data cleaning process was basic preprocessing, which involved removing elements that add no value to the address, such as pin codes and mobile numbers, as well as non-significant words like “Near”, “Opposite”, and “To”.
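A minimal sketch of this step (the regex patterns and stopword list are illustrative, not our exact production rules):

import re

NON_SIGNIFICANT = {"near", "opposite", "to"}

def basic_clean(address):
    address = address.lower()
    address = re.sub(r"\b\d{6}\b", " ", address)   # drop 6-digit pin codes
    address = re.sub(r"\b\d{10}\b", " ", address)  # drop 10-digit mobile numbers
    address = re.sub(r"[^a-z0-9 ]", " ", address)  # drop punctuation and symbols
    tokens = [t for t in address.split() if t not in NON_SIGNIFICANT]
    return " ".join(tokens)

print(basic_clean("Near Govardhan Theatre, Yeshwanthpur - 560022, 9876543210"))
# expected: "govardhan theatre yeshwanthpur"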

2. Spell Correction

One of the most significant sources of inconsistency in the dataset was spelling variation. Names of places can be spelled differently by various operators, or even within the same operator’s dataset, which adds noise to the model. For example, “Madiwala” and “Madiwaala” refer to the same location but are spelled differently.

To solve this, I employed spell correction techniques inspired by leader clustering. Here’s how it works:

  • We use a combination of Levenshtein distance (edit distance) and the Metaphone algorithm to cluster words that are likely spelling variants of each other.
  • Levenshtein distance calculates the minimum number of single-character edits required to turn one word into another.
  • The Metaphone algorithm (phonetic) groups words based on their phonetic similarity.

For example, two tokens Ta and Tb are considered to be spelling variants if they satisfy both of the following conditions:

  • Metaphone(Ta) = Metaphone(Tb), i.e. the tokens share the same phonetic code, and
  • LevenshteinDistance(Ta, Tb) < threshold.

We set the threshold value equal to 3 and only consider tokens of length greater than 6 as candidates for spell correction.

## create a cluster of each token and its spelling variants
from metaphone import doublemetaphone
from Levenshtein import distance as levenshtein_distance

def get_phonetic_representation(label):
    return doublemetaphone(label)[0]  # primary phonetic code

def leader_clustering(tokens, threshold, min_length):
    clusters = []
    for token in tokens:
        if len(token) <= min_length:  # only consider words of length > 6
            continue
        phonetic_token = get_phonetic_representation(token)
        found_cluster = False
        for leader, cluster in clusters:
            phonetic_leader = get_phonetic_representation(leader)
            # combine the phonetic match with a Levenshtein distance check
            if phonetic_token == phonetic_leader and levenshtein_distance(token, leader) < threshold:
                cluster.append(token)
                found_cluster = True
                break
        if not found_cluster:
            clusters.append((token, [token]))
    return clusters
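For example, with a few hypothetical tokens:

tokens = ["madiwala", "madiwaala", "yeshwanthpur", "near"]
print(leader_clustering(tokens, threshold=3, min_length=6))
# expected: [('madiwala', ['madiwala', 'madiwaala']), ('yeshwanthpur', ['yeshwanthpur'])]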

3. Probabilistic Splitting

Addresses often contain compound words where adjacent terms are merged, like “HSRlayout” instead of “HSR Layout”.

  • We construct a term-frequency dictionary for the entire corpus of boarding point address tokens.
  • For each token (e.g., “HSRlayout”), the method splits the token at different positions (e.g., “HSR” and “layout”) and compares the probabilities of the individual tokens.
  • If the joint probability of the split tokens exceeds the probability of the compound token, the algorithm stores the new split in a dictionary and applies it to all future cases.
  • The token “hsrlayout” is split into “hsr” and “layout” when the combined probability of these tokens being used together is higher than that of the compound token. This improves the consistency of address representation.
def should_split(token, token_counts):
    if len(token) <= 6:  # only consider tokens longer than 6 characters
        return None
    max_prob = 0
    best_split = None
    total_count = sum(token_counts.values())

    for i in range(3, len(token) - 2):  # ensure each part is at least 3 characters
        left, right = token[:i], token[i:]

        # joint probability of the two parts appearing as separate tokens
        prob_split = (token_counts.get(left, 0) / total_count) * (token_counts.get(right, 0) / total_count)
        prob_token = token_counts.get(token, 0) / total_count

        # compare the joint probability with the probability of the compound token
        if prob_split > prob_token and prob_split > max_prob:
            max_prob = prob_split
            best_split = (left, right)

    return best_split
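A hypothetical usage, building the term-frequency dictionary with collections.Counter:

from collections import Counter

token_counts = Counter({"hsr": 5, "layout": 5, "hsrlayout": 1})
print(should_split("hsrlayout", token_counts))
# expected: ('hsr', 'layout')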

4. Probabilistic Merging

Conversely, some addresses contain unnecessary spaces or are split into separate tokens, such as “lay out” instead of “layout.” For these cases, I applied probabilistic merging:

  • Similar to splitting, we merge adjacent tokens if the merged token has a higher occurrence probability than the separate tokens.
  • The tokens “lay” and “out” will be merged into “layout” if the occurrence count of the compound token exceeds that of the individual tokens.
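Mirroring the splitting logic, a minimal sketch of the merge decision (should_merge is a hypothetical helper, reusing the same term-frequency dictionary):

from collections import Counter

def should_merge(left, right, token_counts):
    total_count = sum(token_counts.values())
    merged = left + right
    prob_merged = token_counts.get(merged, 0) / total_count
    # joint probability of the two tokens occurring as separate terms
    prob_pair = (token_counts.get(left, 0) / total_count) * (token_counts.get(right, 0) / total_count)
    return merged if prob_merged > prob_pair else None

print(should_merge("lay", "out", Counter({"layout": 5, "lay": 1, "out": 1})))
# expected: 'layout'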

Feature Extraction: Using TF-IDF Vectorization

To convert the boarding point name and address strings into a numerical format that machine learning models can work with, I used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. This method transforms the text data into vectors by assigning a score to each word based on how frequently it appears in the address (Term Frequency) and how unique or rare it is across all addresses (Inverse Document Frequency).

Why TF-IDF?

  • Captures Importance of Words: TF-IDF helps prioritize important words by assigning higher scores to terms that are more meaningful and less common across the dataset. For example, it would reduce the impact of frequent but less informative words like “Pickup” or “Near” while emphasizing more relevant words like “Govardhan” or “Yeshwanthpur.”
  • Efficient Representation: It converts the textual data into a structured numerical form (vectors), allowing machine learning models like Random Forest to process and classify the data effectively.
  • Reduces Noise: By focusing on the relative importance of terms, TF-IDF reduces the impact of high-frequency but non-informative words, which is crucial in a dataset where names and addresses can contain unnecessary details or common terms.

Mathematically, it can be defined as follows:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t)) and df(t) is the number of documents containing the term t.

In our case, an address is a document d, each token in the address is a term t, the collection of addresses is the corpus D, and the total number of addresses is N.
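A small illustration with scikit-learn’s TfidfVectorizer [2], using toy addresses:

from sklearn.feature_extraction.text import TfidfVectorizer

addresses = [
    "govardhan theatre pickup",
    "yeshwanthpur circle pickup",
    "madiwala checkpost pickup",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(addresses)  # sparse matrix: one row per address

# "pickup" appears in every address, so it gets the lowest idf weight;
# rarer tokens like "govardhan" are weighted higher
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_)))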

Why Not Use Large Language Models (LLMs)?

While Large Language Models (LLMs) excel at processing unstructured data and understanding context in free-form text, they may not always be the best choice when we already have labeled, structured data like our boarding point data.

In contrast, traditional Machine Learning models are well-suited for structured data like our historical mappings, where clear labels are available. These models can efficiently learn from patterns and labeled examples, providing faster, more accurate results without the overhead of training and maintaining a large LLM. By focusing on the basics of machine learning — using feature engineering, geospatial proximity, and supervised learning — we can achieve robust results more effectively.

Optimising Model Predictions with City-Level Segmentation

The shift to city-level models has been instrumental in addressing two significant challenges in our machine learning pipeline: the issue of large weight files and the potential for false positives in address mapping.

Initially, our comprehensive model, trained on a vast dataset, resulted in a large weight file that caused slow predictions and increased latency, hindering real-time processing essential for an inventory platform. Additionally, the global model struggled with misclassifications, particularly when boarding points shared similar names across different cities — leading to false positives and compromised accuracy. To mitigate these challenges, we transitioned to city-level models.
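Conceptually, the serving layer keeps one lightweight model per city. The sketch below is hypothetical (the names and structure are illustrative, not our production code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

city_models = {}  # one small weight file per city instead of a single large one

def train_city_model(city, addresses, area_labels):
    model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
    model.fit(addresses, area_labels)
    city_models[city] = model

def predict_area(city, address):
    # a boarding point is only ever matched against grids of its own city,
    # which removes cross-city false positives for similar names
    return city_models[city].predict([address])[0]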

Integration of Model by Data Science Team to Avoid Wrong Predictions

After the model was set up and integrated into the inventory platform, an issue arose where some boarding points were historically mapped incorrectly. This led to a few wrong predictions, which could potentially compromise the quality of the mapping. To mitigate this, a final validation layer was introduced before committing any prediction to the database.

The final check is designed to ensure that the predicted mapping is accurate by verifying the following:

  1. Geolocation Distance Check:
    The model compares the geolocation of the boarding point (captured by YourBus) with the geolocation of the predicted area (grid). The predicted area must lie within a fixed radius (e.g., 1.5 km) of the boarding point. This ensures geographical proximity between the actual and predicted locations.
  2. Confidence Score & Address-Grid Name Consistency:
    If the geolocation check fails, the system falls back on the confidence score provided by the prediction model. In this case, the confidence score must be well above the threshold (over 90), and the boarding point name or address must contain elements of the predicted grid name.
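A minimal sketch of this validation layer, assuming hypothetical bp and grid records with lat/lon/name fields and a confidence score on a 0-100 scale:

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two (lat, lon) points, in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def is_prediction_valid(bp, grid, confidence, radius_km=1.5, score_cutoff=90):
    # 1. Geolocation distance check: predicted grid must lie within the radius
    if haversine_km(bp["lat"], bp["lon"], grid["lat"], grid["lon"]) <= radius_km:
        return True
    # 2. Fallback: very high confidence AND name consistency with the grid
    text = (bp["name"] + " " + bp["address"]).lower()
    return confidence > score_cutoff and grid["name"].lower() in text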

Results

The algorithm was initially deployed for the top 100 cities based on transaction volume, successfully mapping approximately 1.2 lac boarding points.

Following this initial success, the algorithm was extended to an additional 200 cities, mapping around 1 lac more boarding points.
This enhanced the mapped boarding point coverage in transactions, increased the BP/DP-based search size for top routes, and significantly reduced manual support work.

Thanks to Hitesh Vaghani for the guidance and Sasank Sankaran for the amazing collaboration.

Thanks for reading this article!

References

[1] scikit-learn documentation, RandomForestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

[2] scikit-learn documentation, TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

[3] T. Ravindra Babu and Vishal Kakkar. 2017. Address Fraud: Monkey Typed Address Classification for e-Commerce Applications.
