How to Sort Thousands of Hotels with the Help of Data Science

Seda Alacan
Published in Metglobal
Aug 3, 2017

OVERVIEW

Creating a Balanced Scorecard (BSC) is one of the best strategic moves an organization can make, and any organization, no matter its size, can build its own. Recently, otel.com has begun developing a new scorecard for ranking the hotels on its web pages.

It is not a traditional scorecard, because it works a little differently.

otel.com is designed to be user friendly, so travelers can book their hotel accommodation quickly and easily. The hotel ranking on otel.com is built for each type of traveler, such as social travelers, value seekers and habitual travelers, so that it can address each traveler’s concerns.

Type of travellers [1]

otel.com does not assume that travelers simply want the cheapest hotel in the search results, and this is how it reaches many kinds of travelers. It reviews many criteria, for example the comfort, demand and quality of a hotel, along with its price. It combines these fields using statistical techniques, and its smart interface sorts the hotels for the traveler accordingly.

GOALS

  1. Focus on performance targets as they relate to customers and the market.
  2. Focus on the financial performance of the organization, covering the revenue and profit targets of a commercial company as well as its budget and cost-saving targets.

SPECIFICATIONS

The otel.com Scorecard Algorithm ranks hotels in relation to the countries its customers come from. It combines the results into a performance measurement and management system that allows employees to set targets.

Hotel sorting should match user demand across different geographies and cultures, so it is based on the customer’s country. This means that hotels in the same locality are ranked differently for customers in different countries.

otel.com uses this variation because lifestyles and demand differ from country to country. For instance, Scandinavian customers are less sensitive to the hotel price; instead, they pay attention to other details, such as the following:

Type of travellers [2]

TECHNICAL SPECIFICATIONS

Basic Principles of Methodology

There are N main categories to consider when scoring the hotel inventory, such as a hotel’s content score and financial score, and each category is given a percentage weight based on business expertise. On this basis, all available score variables were grouped into categories according to specific criteria.

Each attribute within these N main categories (for instance, “hotel stars” is a characteristic and “1–3” is an attribute) is assigned points based on statistical analyses that take into account factors such as the predictive strength of the characteristics, the correlation between characteristics, and operational factors. The total score of a hotel is the sum of the points for each attribute in the scorecard that applies to that hotel.
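As a rough illustration of this additive scoring, the sketch below looks up the points for each attribute of a hotel and sums them. The characteristics, attribute bins and point values are hypothetical; the actual otel.com tables are not given here.

```r
# Hypothetical point tables: one named vector per characteristic,
# mapping each attribute (bin) to its points. All values are made up.
points <- list(
  hotel_stars = c("1-3" = 10, "4" = 25, "5" = 40),
  has_pool    = c("no" = 0, "yes" = 15)
)

# Total score of a hotel = sum of the points of the attributes it has.
score_hotel <- function(hotel, points) {
  total <- 0
  for (characteristic in names(points)) {
    attribute <- hotel[[characteristic]]
    total <- total + points[[characteristic]][[attribute]]
  }
  total
}

hotel <- list(hotel_stars = "4", has_pool = "yes")
score_hotel(hotel, points)  # 25 + 15 = 40
```

Keeping the points in plain lookup tables means the statistical work only has to produce the tables; scoring a hotel stays a simple sum.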

The main techniques used to create the scorecard are shown below:

Basic Methodology of Scorecard

Thinking and Analyses

Cycle of statistical design

The steps followed in developing the statistical part of the scorecard are listed below:

  1. Define the Problem
  2. Make a Plan
  3. Data Gathering (Data Access)
  4. Data Preparation (Statistics)
  5. Univariate and Multivariate Analysis
  6. Clustering (Machine Learning)
  7. Conclusion

One key element of the scorecard is the dependent variable, Conversion Rate, which is selected by the employee; the independent variables are analyzed against it. The conversion rate is used to measure the association of the other variables. The main categories are:

  • Content Score, with subgroups covering the hotel’s static properties (e.g. hotel stars, hotel facilities).
  • Financial Score, with subgroups covering the financial profit a hotel provides, etc.

In the last step before the statistical part, several pre-transformations were applied to the data, including outlier elimination with the IQR method and feature scaling (unity-based normalisation).
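A minimal sketch of these two pre-transformations in base R follows. The 1.5 × IQR fence is the common convention and is an assumption here, as are the function names and the sample prices.

```r
# Drop values outside [Q1 - k * IQR, Q3 + k * IQR].
# k = 1.5 is the usual convention, assumed for this sketch.
remove_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  spread <- q[[2]] - q[[1]]  # interquartile range
  keep <- x >= q[[1]] - k * spread & x <= q[[2]] + k * spread
  x[keep]
}

# Unity-based (min-max) normalisation: rescale values to [0, 1].
# Note: a constant vector would divide by zero here.
normalise_unity <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

prices  <- c(40, 55, 60, 62, 70, 75, 80, 900)  # made-up nightly prices
cleaned <- remove_outliers_iqr(prices)          # 900 is dropped
scaled  <- normalise_unity(cleaned)             # now between 0 and 1
```

Removing outliers before scaling matters: a single extreme value such as 900 would otherwise squeeze all other prices into a narrow band near zero.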

How to find interquartile range

The statistical and machine learning procedure used to develop the scorecard is linear modelling and regression. To assign scores, otel.com decided to split some fields into categories, and used k-means clustering to do so.
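One way to turn a numeric field into categories with k-means is sketched below using base R's `kmeans`. The field (`review_count`), its values, and the choice of three clusters are assumptions made for illustration only.

```r
set.seed(42)  # k-means starts from random centres

# Made-up review counts for 12 hotels.
review_count <- c(3, 5, 8, 10, 45, 50, 52, 60, 200, 210, 230, 250)

# Cluster the single field into k = 3 groups (k is assumed here).
fit <- kmeans(review_count, centers = 3, nstart = 25)

# Cluster ids are arbitrary, so relabel them by the order of their
# centres: 1 = lowest centre, 3 = highest centre.
ordered_label <- rank(fit$centers[, 1])[fit$cluster]
category <- factor(ordered_label, labels = c("low", "medium", "high"))

table(category)
```

Each resulting category can then be treated as an attribute and given its own points in the scorecard.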

After deciding the number of clusters (by k-means) for each field, otel.com ran several analyses, such as:

  • Chi-square tests, which are statistical hypothesis tests in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. They are used to obtain p-values for the association between variables.
  • AIC, which is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models.
  • ANOVA, which tests whether the means of several groups differ from one another.
  • Hypothesis tests, evaluated through their p-values, together with the correlation between variables.
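The analyses in this list can be run with base R functions, as sketched below on simulated data. The variables (`stars`, `booked`, `price`) and the use of a binomial `glm` for the AIC comparison are assumptions, not the models otel.com actually fitted.

```r
set.seed(1)

# Simulated data: 300 hotel listings with made-up fields.
n <- 300
stars  <- factor(sample(c("1-3", "4", "5"), n, replace = TRUE))
booked <- rbinom(n, size = 1, prob = ifelse(stars == "5", 0.5, 0.3))
price  <- rnorm(n, mean = ifelse(stars == "5", 150, 90), sd = 20)

# Chi-square test of independence between star band and booking.
chi <- chisq.test(table(stars, booked))
chi$p.value

# AIC: compare two candidate models of booking; lower is better.
m0 <- glm(booked ~ 1,     family = binomial)
m1 <- glm(booked ~ stars, family = binomial)
aic_table <- AIC(m0, m1)
aic_table

# One-way ANOVA: does the mean price differ between star bands?
aov_fit <- aov(price ~ stars)
summary(aov_fit)
```

AIC values are only meaningful relative to each other, so the comparison is always between models fitted to the same data.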

Why is hypothesis testing so important?

Hypothesis testing is a set of statistical tools that quantifies how confident we can be that a measured difference is significant. It uses the measurements to calculate the probability (the p-value) of seeing a difference at least as large as the measured one purely by chance, assuming there is no ‘real’ difference.
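To make the idea of a difference arising by chance concrete, here is a small permutation test in base R. It is only one possible kind of hypothesis test, chosen for illustration; the two groups and their booking rates are invented.

```r
set.seed(7)

# Made-up booking outcomes (1 = booked) for two hypothetical hotel groups.
group_a <- rbinom(200, 1, 0.35)
group_b <- rbinom(200, 1, 0.25)
observed <- mean(group_a) - mean(group_b)

# Under the null hypothesis the group labels are interchangeable, so
# shuffle them many times and record the difference each time.
pooled <- c(group_a, group_b)
shuffled <- replicate(5000, {
  idx <- sample(length(pooled), length(group_a))
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value: share of shuffles at least as extreme as observed.
p_value <- mean(abs(shuffled) >= abs(observed))
p_value
```

A small p-value means that shuffled labels rarely produce a gap as large as the observed one, which is evidence against the assumption of no real difference.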

A road map of Hypothesis Testing

As mentioned above, every field goes through univariate analysis, and multivariate analysis is carried out between the variables. The score percentage decided for each scorecard field gives its share within its own main category.

At the end of the analysis section:

Go Parallel with Your Code in R

It is imperative to optimize the code for production. To parallelize the code, otel.com used the R packages ‘foreach’ and ‘doParallel’.

Some tips for parallelism:

library(doParallel)  # also attaches foreach and parallel

# Create a cluster that leaves one core free for the rest of the system
workers <- makePSOCKcluster(detectCores() - 1)
# Register the cluster as the parallel backend for foreach's %dopar%
registerDoParallel(workers)

A good number of workers is the number of cores minus one. Using every core on the machine prevents it from doing anything else (the computer becomes unresponsive until the R script has finished).
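Once a backend is registered, a `foreach` loop can run its iterations on the workers with `%dopar%`. The sketch below uses a toy task and a fixed cluster of two workers to stay light; the chunking and the per-chunk mean are placeholders for real scoring work.

```r
library(doParallel)  # attaches foreach and parallel as dependencies

# Two workers keep the example light; on a real machine this would be
# detectCores() - 1 as described above.
workers <- makePSOCKcluster(2)
registerDoParallel(workers)

# Hypothetical task: compute a summary per chunk of hotel scores.
chunks <- split(1:100, rep(1:4, each = 25))
chunk_means <- foreach(chunk = chunks, .combine = c) %dopar% {
  mean(chunk)
}

# Release the workers when the parallel section is done.
stopCluster(workers)

chunk_means
```

Calling `stopCluster()` at the end matters in production scripts, because otherwise the worker processes stay alive after the work is finished.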

Grafana is “the leading tool for querying and visualising time series and metrics” [3].

otel.com uses Grafana widely to analyse applications and to monitor CPU and memory usage by visualizing time series data. otel.com also looks at its log files, so it is possible to see when the code starts running in parallel:

“2017-08-03 09:15:17 MSK”

As the log shows, parallelism started at that time!

Monitoring with Grafana

In conclusion, inventory sorting is a vital process for an online hotel booking business, which is why otel.com tries to make it as data-driven as it can.

Data science covers a very large area, and otel.com tries to use it ever more effectively, as in this project.
