Inventory Data Cleanup Method in an OTA company

Kevin Adinata
tiket.com
Published in
3 min readDec 21, 2021

As an OTA company that is thriving in the travel and tourism industry, it is important to be competitive in the market. One common way to be recognised distinctively can be achieved by boasting a broad range of products/ inventories and more often than not, it requires them to work together with other OTA companies and onboard them as vendors.

However, this process is often associated with many intrinsic challenges as inventories can be provided by many sources. Hence, we need to ensure the uniqueness of these inventories, to be distinct enough for further data processing and innovative products offering.

Due to the voluminous inventory data sourced from different vendors and internal databases, tech companies suffer from one common issue, the duplicates of inventories or SKUs (Stock Keeping Units). As an example, see the following table of records below for “transylvania hotel”:

There are several solutions that can be applied by the data analysts to merge these inventories, for example:

  • Count distinct inventory names
    A very straightforward solution for inventories with 100% same names from different vendors.
    This method can be done on SQL by using count(distinct ) function
  • Regex Replace with alphanumeric characters
    A regular expression (shortened as regex or regexp) is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.
    A very effective solution for inventory names with foreign alphabets or unique characters.
    This method can be done on SQL by using regex_replace function
  • Calculating Levenshtein Distance between Inventory Names
    The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other
    Ie: Measuring tape vs measurement tape, the levenshtein distance is 4
    This method can be achieved in bigquery using the User Defined Function feature to store javascript functions.
    I had the pleasure of learning about this feature by reading a medium article by Felipe Hoffa.
  • Removing Decimal Places of Latitude and Longitude Data
    For OTA Accommodation inventories, it is possible to merge inventories by removing some decimals in the latitude and longitude, provided that they have a similar name and are located in the same building.
    This method is effective when there are multiple apartment owners listing their unit in the OTA platform and we need to know how many unique properties (building level) are listed in the platform.
    The table below provides a guideline on the accuracy level of various latitude and longitude decimal places.
    This method can be achieved in SQL by utilizing the roundup function
Accuracy vs Decimal Place
  • Natural Language Processing (NLP)
    Last but most importantly, we can leverage a machine learning method called Natural Language Processing (NLP) to label inventories based on their characteristics.
    More detailed information about NLP can be found here.
    This method can be achieved by creating a machine learning model, typically done using Python.

What is the best method?

There is no best method for every case, depending on the magnitude of differences between inventory naming. Counting distinct works best when the inventories are clean and you need a quick way to estimate unique inventories. On the other hand, NLP would be preferable especially when there are a lot of different variables that we can derive from two same inventories. In this case, this is performed on a longer term project where you are required to clean and merge the inventories.

REFERENCES:

[1] https://en.wikipedia.org/wiki/Regular_expression

[2] https://hoffa.medium.com/new-in-bigquery-persistent-udfs-c9ea4100fd83

[3] https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b

--

--