VALUE-BASED SEGMENTATION FOR HOTELS WITH STRUCTURED DATA SCIENCE PRINCIPLES

Abhiram Prasad
MiQ Tech and Analytics
Nov 12, 2019

I-BUSINESS UNDERSTANDING

WHY AND WHAT IS VALUE-BASED SEGMENTATION?

VALUE-BASED SEGMENTATION

Market segmentation is one of the most important tasks in marketing: identifying subgroups guides the marketing and sales decision-making process. For example, customers who are price-insensitive and not well served by competitors can be charged more than customers who are price-sensitive and well served by competitors. Without value-based segmentation in place, we might end up undercharging some segments and overcharging others. In short, segmentation enables tailored strategies for each segment rather than a one-size-fits-all approach.

As a working definition, customer segmentation is the practice of dividing a customer base into groups that are similar in specific ways, such as personal characteristics, preferences or behaviors, which in turn correlate with the behaviors that drive customer profitability. Segmentation also helps assign future customers to these groups based on how closely their characteristics match. Some of the benefits that customer segmentation provides:

  • Increased understanding of customer needs and wants, which in turn can lead to increased sales and customer satisfaction.
  • Identification of the most and least profitable customers, allowing companies to focus on the profitable ones when it is not feasible to focus on all customers.
  • More focused marketing efforts — products can be emphasized through more targeted advertising using more appropriate media to enhance message delivery.

There are 3 main approaches to market segmentation:

1. A priori segmentation is the simplest approach and uses a classification scheme based on publicly available characteristics. The type and number of segments are pre-determined, so this rule-based approach produces segments largely as expected. Examples include demographics-based and usage-based segmentation.

2. Needs-based segmentation is based on differentiated and validated drivers that customers express for the specific product or service being offered. In this approach, the segments are demarcated by these different needs.

3. Value-based segmentation differentiates customers by the value they add to the business, grouping customers with similar value levels into individual segments that can be targeted distinctly.

CLUSTER ANALYSIS VS RULE-BASED SEGMENTATION:

There are multiple ways to segment a market, but a more precise and statistically valid approach is cluster analysis. Clustering is the process of using machine learning algorithms to identify how data points are related and to group similar objects into the same category. Common clustering algorithms include k-means clustering, hierarchical clustering, spectral clustering and mean-shift clustering. Unlike rule-based segmentation (segmenting by age, frequency of sale, income, etc.), clustering algorithms are not based on fixed rules; rather, the data itself reveals the customer prototypes that naturally exist within the customer population.

Problems with Rule-based Segmentation:

  1. Because the rules are pre-determined, the final segments usually just reflect the initial assumptions, whereas in cluster analysis the data is free to form its natural clusters, often revealing insights that were not anticipated.
  2. It becomes increasingly difficult to segment customers manually when there are more than two dimensions.
  3. There is often still a large variance in attribute values within each segment.

CASE STUDY: VALUE-BASED SEGMENTATION FOR HOTEL VERTICAL

PROJECT FLOW


II-DATA UNDERSTANDING:

a) DATA SOURCE:

VARIABLES CAPTURED

For the VBS carried out for hotels, the data was first-party data. It consisted of over 25 attributes, ranging from the referrer, booking timestamp and check-in/check-out dates to the number of rooms, revenue, UIDs, etc. A snapshot of some of the important variables is shown in the figure above. After cleaning with regular expressions, the data was packaged into tables with the relevant columns. For this project, one month of data was used, consisting of roughly 4.5 lakh (~450,000) rows.

III-DATA PREPARATION

a) HYPOTHESIS GENERATION:

It is critical to develop customer-segment hypotheses and candidate variables and to validate them. This is particularly true for value-based segmentation schemes. Ultimately, hypotheses should be formed around customer characteristics or factors that allow one to clearly separate current customers into distinct value-based segments. After several discussions with various stakeholders, a few important attributes that were consistent across different clients in the same vertical were listed, including booking timestamp, check-in/check-out dates, number of rooms and revenue.

b) FEATURE GENERATION:

The next logical step was to drop masked UIDs (which may occur for various reasons, such as the user browsing in incognito mode), since these attributes cannot be associated with a particular UID. After this, columns with a high null rate (>50%) were dropped. In the feature generation step, based on the available data and business understanding, we tried to engineer features that could potentially play a crucial role in defining different value-based segments. The following features were generated:

GENERATED FEATURES

Once this was done, a pivot table was generated with UIDs as the index, aggregating length of stay, advance booking period, revenue, number of rooms, time since last booking (recency) and frequency (booking count).
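For illustration, here is a minimal pandas sketch of this aggregation step. The file name and column names (uid, revenue, len_stay, rooms, advance_booking, booking_ts) are hypothetical stand-ins for the actual first-party schema.

```python
import pandas as pd

# Hypothetical schema: uid, revenue, len_stay, rooms, advance_booking, booking_ts.
bookings = pd.read_csv("hotel_bookings.csv", parse_dates=["booking_ts"])

snapshot_date = bookings["booking_ts"].max()

# One row per UID, aggregating the attributes used downstream.
features = bookings.groupby("uid").agg(
    mean_revenue=("revenue", "mean"),
    mean_len_stay=("len_stay", "mean"),
    mean_rooms=("rooms", "mean"),
    mean_advance_booking=("advance_booking", "mean"),
    frequency=("booking_ts", "count"),
    last_booking=("booking_ts", "max"),
)

# Recency = days since the UID's most recent booking.
features["recency"] = (snapshot_date - features["last_booking"]).dt.days
features = features.drop(columns="last_booking")
```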

c) OUTLIER TREATMENT:

Once the features were formed, outlier treatment was absolutely necessary, since the clustering algorithms we would use rely on distance-based metrics to measure similarity between data points. There are various ways to treat outliers, including removing them completely, imputing them with mean values, or clipping them to upper and lower thresholds. The right treatment depends on many factors, including the reason for the outlier and the percentage of outliers. After examining all the columns, the highest percentage of outliers in any column was 9%, so it was decided to clip the outliers to Q3 + 1.5*IQR and Q1 - 1.5*IQR for all the columns.
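A sketch of the IQR-based clipping, continuing from the hypothetical features table in the previous snippet:

```python
import pandas as pd

def clip_iqr_outliers(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to those bounds."""
    out = df.copy()
    for col in cols:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return out

# Clip every generated feature to the IQR-based thresholds.
features = clip_iqr_outliers(features, features.columns)
```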

d) UNIVARIATE AND BIVARIATE ANALYSIS, FEATURE RELEVANCE, CORRELATION MATRIX:

SCATTER MATRIX

In this step, a scatter matrix was plotted to observe the distributions of the various columns and the correlations between the variables. Before performing cluster analysis, it is important to check feature relevance to see whether a particular feature is really necessary. To check feature relevance, one feature at a time was taken as the output and the remaining features as inputs; a decision tree regressor was fit to the data and the R² score was calculated. This was repeated with a different feature as the output each time.

CORRELATION MATRIX (2nd approach)

A high R² score meant that most of the variance in that feature, treated as the dependent variable, could be predicted from the other features, which essentially means the feature is not really needed. A negative score means the feature is essential, since the model failed to fit it from the others; similarly, a low R² score means we should keep that feature. A correlation matrix was then generated to confirm the R² scores. The results aligned well, and there was no significant correlation between the variables.
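A sketch of this feature-relevance check, again continuing from the hypothetical features table; the split ratio and random seed are illustrative choices:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# For each feature, try to predict it from the remaining features.
# A high R^2 suggests the feature is largely redundant; a low or negative
# R^2 suggests it carries information the other features do not.
for target in features.columns:
    X = features.drop(columns=target)
    y = features[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
    print(f"{target}: R^2 = {model.score(X_test, y_test):.3f}")
```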

e) SCALING AND STANDARDIZING OF FEATURES:

Many machine learning algorithms perform better when features are on similar, smaller scales. Scaling and standardizing help bring features into a more digestible form for these algorithms. In clustering algorithms especially, it is important to have features on a similar scale so that all features are given similar importance when clustering. The Python sklearn preprocessing library offers many scaling methods, such as StandardScaler, MinMaxScaler and RobustScaler.

StandardScaler subtracts the mean from the values and divides by the standard deviation, giving a distribution with zero mean and a standard deviation of one. Deep learning and regression-based algorithms tend to perform well on standard-scaled data as the distribution approaches a normal distribution; however, standard scaling distorts the relative distances between the features. MinMaxScaler scales values into [0, 1] by subtracting the minimum and dividing by the range. It preserves the shape of each feature's distribution and only changes its range, but it is sensitive to outliers, so the effect of outliers persists after scaling. RobustScaler scales by subtracting the median and dividing by the interquartile range, which makes it more robust to outliers. The method to choose depends on the algorithm being used and the kind of data at hand. Min-max scaling was used here, as it gave better results with the data in hand.
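A minimal sketch of the min-max scaling step with sklearn, keeping the DataFrame structure of the hypothetical features table:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Min-max scale every feature into [0, 1] while preserving column names and index.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
```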

f) DEFINING THE VALUE METRIC:

Defining value was one of the most challenging tasks. We tested and visualized results with two approaches, differing in how value was defined and which attributes were input to the clustering.

1. In the first approach, the value parameter consisted of 6 attributes (mean revenue, mean length of stay, mean number of rooms, frequency, recency and mean advance booking period). Recency, frequency and monetary value (RFM) is a proven value metric, but here we consider the more specific case of hotels, where it makes intuitive sense to treat people with longer average stays, more rooms booked and last-minute bookings as more valuable. In this approach, weights were assigned to each attribute, with higher weights given to monetary value, frequency and recency.

Value = (2*mean_revenue + 2*frequency + mean_rooms + mean_len_stay) / (2*recency + mean_advance_booking_period)

Here, recency and advance booking period are in the denominator (the less recent a customer, the less valuable they are).

2. The second approach is more about differentiating premium-value users from non-premium users. The value metric is a combination of 3 attributes (total revenue, total rooms booked and total number of nights stayed). To be precise:

Value = sum_revenue / (total_rooms * total_stay_length)

Consider an example where customer 1 books a single room worth 8000 bucks for 1 night, versus customer 2 who books a single room for 2 nights at a total price of 8000 bucks. Wouldn't this indicate that customer 1 has more buying capacity (is more premium) than customer 2? The first metric, however, would give a higher value to customer 2. So the second approach differentiates customers by their buying capacity and their willingness to go for higher-priced rooms.

After defining the value metric, outliers in value were detected and clipped to Q3 + 1.5*IQR in both approaches. Value was then scaled with a min-max scaler so that all attributes fall on the same scale.
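A hedged sketch of both value definitions plus the clipping and scaling, reusing the hypothetical bookings and features frames from the earlier snippets. The weights mirror the formulas above; real data would also need guards against zero denominators.

```python
import pandas as pd

# Approach 1: weighted RFM-style value per UID (weights follow the formula above).
value_1 = (
    2 * features["mean_revenue"]
    + 2 * features["frequency"]
    + features["mean_rooms"]
    + features["mean_len_stay"]
) / (2 * features["recency"] + features["mean_advance_booking"])

# Approach 2: revenue per room-night, built from per-booking totals.
totals = bookings.groupby("uid").agg(
    sum_revenue=("revenue", "sum"),
    total_rooms=("rooms", "sum"),
    total_nights=("len_stay", "sum"),
)
value_2 = totals["sum_revenue"] / (totals["total_rooms"] * totals["total_nights"])

def clip_and_scale(value: pd.Series) -> pd.Series:
    """Clip at Q3 + 1.5*IQR, then min-max scale to [0, 1]."""
    q1, q3 = value.quantile([0.25, 0.75])
    clipped = value.clip(upper=q3 + 1.5 * (q3 - q1))
    return (clipped - clipped.min()) / (clipped.max() - clipped.min())

value_1 = clip_and_scale(value_1)
value_2 = clip_and_scale(value_2)
```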

IV-MODELING

a) SHORTLISTING A CLUSTERING ALGORITHM:

The next big task was to identify the most suitable clustering algorithm. The choice of algorithm depends on the kind of data we have. Things to keep in mind while selecting a clustering algorithm:

1. Intuitive hyper-parameters: It should not be a difficult task to set the hyper-parameters; there has to be some way to make a good choice quickly rather than a random one.

2. Stability: There should not be significant variation in the results if the same algorithm is run again (i.e. the same data points being assigned to different clusters on different runs).

3. Performance: This is perhaps the biggest thing to keep in mind while choosing an algorithm. There are any number of clustering algorithms to choose from, but very few can handle large amounts of data. The time complexity of the algorithm should also be considered.

We went ahead with K-means clustering because:

1. After a fair deal of exploratory analysis of the data, it was not difficult to set the hyper-parameters. The number of clusters could be set by plotting an elbow curve (inertia scores) and verified with the silhouette score.

2. K-means suffers from the drawback of random initialization, which can give different results on different runs. One way to fight this is k-means++ initialization, which chooses centroids smartly and ensures they are not close to each other: the first centroid is chosen at random, and the probability of a point being chosen as the next centroid is proportional to the square of its distance from the closest existing centroid.

3. The biggest win for k-means is that it can handle large amounts of data quite easily, unlike many other algorithms. The time complexity of affinity propagation is O(k*n²), hierarchical clustering is O(n³) and mean shift is O(n²), whereas k-means is linear, O(n). Some other algorithms, such as DBSCAN, can match the performance of k-means, but DBSCAN performs poorly when clusters have varying densities and is also highly sensitive to its hyper-parameter settings.

b) FINDING THE OPTIMAL NUMBER OF CLUSTERS:

SSE CURVE (clusters vs inertia)

After selecting k-means as a suitable algorithm for the data, the next step was to choose the approximate number of clusters. This can be done by plotting an elbow curve, a plot of the sum of squared errors (SSE) against the number of clusters. The sum of squared errors, also called inertia, indicates how close the points within each cluster are to one another. The optimal number of clusters is chosen at the elbow of the curve, i.e. where there is no further significant decrease in SSE.

SILHOUETTE SCORE

The second method used to verify the number of clusters was the silhouette score. Unlike inertia, this score also takes inter-cluster distance into account: ideally, the inter-cluster distance should be high and the intra-cluster distance low. The closer this score is to 1, the better the clustering; a score near 0 indicates poor clustering, and negative scores indicate that points have been assigned to the wrong clusters. Once the optimal number of clusters was found, k-means was applied with that number of clusters and k-means++ initialization. In the second approach to defining value, only the frequency and value attributes were input to the clustering algorithm.
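A sketch of the elbow-curve and silhouette checks with scikit-learn, followed by the final k-means fit. It continues from the scaled frame in the scaling snippet; the candidate range of k, the random seed and the final k = 4 are purely illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = scaled.values  # the scaled attributes fed to the clustering step

inertias, silhouettes = [], []
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X, km.labels_))

# Elbow curve: look for the point where inertia stops dropping sharply.
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("inertia (SSE)")
plt.show()

# Fit the final model at the chosen k (k = 4 here is only an example).
final_km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)
labels = final_km.labels_
```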

V-SUMMARIZATION

a) SUMMARIZE THE ATTRIBUTES:

The next step was to summarize the attributes cluster-wise to get a sense of the mean values of the various attributes per cluster. These would, however, be scaled-down values, so the clustering result was tied back to the original data (before scaling) and then summarized. To get a better idea of the differences between the clusters, graphs were plotted of cluster number against the mean values of the various attributes.
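A sketch of this cluster-wise summarization, reusing the unscaled features table and the labels from the final fit in the previous snippet:

```python
import matplotlib.pyplot as plt

# Tie the cluster labels back to the unscaled features and summarize per cluster.
profile = features.assign(cluster=labels)
cluster_means = profile.groupby("cluster").mean()
print(cluster_means)

# Example plot: mean revenue per cluster (any attribute can be swapped in).
cluster_means["mean_revenue"].plot(kind="bar")
plt.xlabel("cluster")
plt.ylabel("mean revenue")
plt.show()
```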

CLUSTERS VS VALUE (SCALED, 2nd APPROACH)

VI-EVALUATION OF CLUSTERS FORMED

a) HOMOGENEITY CHECK:

STANDARD DEVIATION OF ATTRIBUTES WITHIN CLUSTERS

This step checked how homogeneous each cluster was. Ideally, we wanted to verify that the standard deviation of each attribute (input to the clustering algorithm) within a particular cluster is lower than that attribute's population standard deviation, meaning that attributes within a cluster are more similar than they were in the population as a whole. This was also verified by checking the interquartile ranges of the attributes input to k-means, which gives an idea of how separable the clusters are.
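A minimal sketch of this homogeneity check, continuing from the profile frame built in the summarization snippet:

```python
# Compare within-cluster standard deviations to the population standard deviation.
population_std = features.std()
within_std = profile.groupby("cluster").std()

# Ratios below 1 indicate a cluster is more homogeneous than the population.
homogeneity_ratio = within_std / population_std
print(homogeneity_ratio.round(2))
```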

BOX-PLOTS OF VALUE ACROSS CLUSTERS

VII-NEXT STEPS

The segmentation of customers is, in fact, just the beginning. Tying these results to other profiling attributes that differentiate the clusters and using them to drive actionable results is a whole process of its own. Once the clustering is done, it opens up a wide range of possibilities, such as building look-alike segments that allow new users to be targeted by assigning them to one of the clusters based on their characteristics. We are currently working on the post-clustering steps to identify the general characteristics of users in each segment and then drive actionable results to target users from different clusters; a second blog post continuing from this one will probably follow!

Co-Author: Sahil Khan
