RFM Analysis For Customer Segmentation Using Hierarchical & K-means Clustering

Vijaya Patil
Nov 15, 2018

Introduction

This study examines transactional data from different markets for Adidas (a sample dataset) to identify meaningful customer segments that could help Adidas create effective marketing campaigns and improve its market share. The analysis was carried out in SPSS, IBM’s statistical software package. The dataset contains transactional data about Adidas customers between December 16, 2004, and September 17, 2012.

I used the RFM model to choose the variables ORDER_NO_nu_1, REVENUE_sum_sum, Profit_sum, and MonthDiff for the initial analysis. I split the data into a calibration sample (60%) and a validation sample (40%) and ran a hierarchical cluster analysis to identify the clusters, then ran a K-Means cluster analysis to compare the results. After running the hierarchical and K-Means cluster analyses several times, I identified 8 important segments based on the variables used. Finally, I conducted a post-hoc analysis of the final customer segments using secondary variables such as channel, product_category_ID, payment_method, and zip codes to better inform the planning of the loyalty program.

The sportswear industry is competitive and fragmented, with many players, so customers have many brands to choose from. In addition, according to the Pareto principle, roughly 80% of a company’s profits come from 20% of its customers. It is therefore crucial for Adidas to retain such profitable and loyal customers. This study aims to better understand different types of customers, target the profitable ones effectively, and identify areas for improvement that can help Adidas increase its market share.

In this study, I examined a dataset that gives a comprehensive view of the transactional history of products purchased by customers in the USA from December 16, 2004, to September 17, 2012. I followed the RFM (recency, frequency, and monetary) analysis model and used the key variables ‘number of orders’, ‘revenue’, ‘profit’, and ‘months difference’ (for recency). To prepare the dataset of 100,000 customers, I split it into calibration and validation samples. I then conducted a series of Hierarchical and K-means cluster analyses and identified valuable customer segments. I also performed a post-hoc analysis on these segments to gain more insight into the product categories, zip codes, channels, and payment methods used. The outcome of this analysis supports significant findings about individual customer segments.

The results indicate that Adidas customers can be classified into four prominent groups based on the revenue and profit generated and on how often the customers place orders. These groups are the most profitable, highly profitable, more profitable, and less profitable customers. Approximately 61% of the customers purchased Adidas products through the web channel. The majority of the customers used Visa, MasterCard, and American Express as their preferred payment methods. Based on these findings, I recommend that Adidas offer a loyalty program named “Adidas Star” to its “best” customers.

Project Overview

Overview of Methodology

Background research shows that Adidas faces fierce competition in the sportswear industry. I was therefore assigned a large dataset in which to identify different customer segments, especially the profitable ones. The study also aims to propose a loyalty program to help Adidas improve its market share.

First, I aggregated the variables in the dataset to convert the transactional data file into a customer-level data file. Then, I split the overall sample into two groups of approximately 60% and 40% to create a calibration sample and a validation sample. This step was followed by multiple Hierarchical Cluster Analyses and K-Means Cluster Analyses on 10% of each sample. From these, I concluded that the eight segments produced by Ward’s Method would provide appropriate initial seeds for the K-Means Cluster Analyses. I then executed a K-Means cluster analysis on the entire sample to assign each customer to one of the segments. After repeating the above steps several times, I finalized the results once the segments were consistent between the calibration and validation samples. Finally, I conducted a post-hoc analysis to gain more insights from the data.

Flowchart of Methodology

Data Source

The dataset consists of over 226,000 records, which reflect over 137,000 orders from 100,000 random U.S. customers (representative of all Adidas customers). The data includes all orders, dollars, items, and returns for those customers between December 16, 2004, and September 17, 2012, across transactions in all channels (store, web, and multi-brand showroom).

Data Aggregation and Identifying Variables

To prepare the customer data file, I aggregated the transactional data file twice. The first aggregation kept only the applicable variables. After calculating the number of orders for each customer, I aggregated by the customer number variable, which created a new dataset containing the variables order number, profit, revenue, and months difference.

I created the variable profit, calculated as the difference between revenue and cost, because the total profit generated by each customer was of more interest. I also created a new variable called “LastOrderDate” (September 17th, 2012), the final date of data collection. I then calculated the number of months between each order date (which falls between December 16th, 2004 and September 17th, 2012) and LastOrderDate, and stored the result in a new variable called “MonthsDifference” (Table 1). Because 99 rows of data had no order date, I treated them as missing data and deleted them all. To determine which customers are the best ones, I adopted the RFM (recency, frequency, monetary) model, which considers how recently a customer has purchased (recency), how often they purchase (frequency), and how much the customer spends (monetary).
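The aggregation described above was done in SPSS. As a rough illustration only, the following Python sketch collapses transaction rows into one RFM record per customer. The row layout, the customer and order IDs, and the function names are hypothetical, and the sketch assumes that months difference is measured from each customer's most recent order to the final date of data collection.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction rows: (customer_id, order_no, order_date, revenue, cost).
# These values are invented for illustration; they are not from the Adidas dataset.
transactions = [
    ("C1", "O1", date(2011, 3, 10), 120.0, 70.0),
    ("C1", "O1", date(2011, 3, 10), 80.0, 50.0),   # second item in the same order
    ("C1", "O2", date(2012, 6, 17), 200.0, 110.0),
    ("C2", "O3", date(2009, 9, 17), 60.0, 45.0),
]

LAST_ORDER_DATE = date(2012, 9, 17)  # final date of data collection (from the text)


def months_between(earlier, later):
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1
    return months


def aggregate_rfm(rows, reference_date):
    """Collapse transaction rows into one record per customer."""
    orders = defaultdict(set)      # distinct order numbers -> frequency
    revenue = defaultdict(float)   # summed revenue -> monetary
    profit = defaultdict(float)    # summed (revenue - cost) -> monetary
    last_order = {}                # most recent order date -> recency
    for customer, order_no, order_date, rev, cost in rows:
        orders[customer].add(order_no)
        revenue[customer] += rev
        profit[customer] += rev - cost
        if customer not in last_order or order_date > last_order[customer]:
            last_order[customer] = order_date
    return {
        c: {
            "orders": len(orders[c]),
            "revenue": revenue[c],
            "profit": profit[c],
            "months_diff": months_between(last_order[c], reference_date),
        }
        for c in orders
    }


rfm = aggregate_rfm(transactions, LAST_ORDER_DATE)
for customer, row in sorted(rfm.items()):
    print(customer, row)
```

Counting distinct order numbers rather than rows matters here because the dataset has more records than orders, so one order can span several rows.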

Table 1: New Variables

After data aggregation, I finalized 9 variables. (Table 2)

Table 2: Variables Identification

Data Files Preparation

After identifying all the variables to be used in the analysis, I standardized Order Number, Revenue, Profit, and Months Difference, since standardization puts variables measured on different scales onto a common range and variability. Next, the data set was randomly split into two groups: the calibration sample and the validation sample. The calibration sample contained approximately 60% of the records, while the validation sample contained the other 40%. This random division allowed for cross-validation and increased the opportunity to identify important customer segments.
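The standardization and the 60/40 split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPSS procedure used in the study: the customer rows are invented, the function names and the seed are hypothetical, and z-scores are computed with the sample standard deviation.

```python
import random
from statistics import mean, stdev

# Hypothetical customer-level rows; the values are invented for illustration.
customers = [
    {"id": i, "orders": o, "revenue": r, "profit": p, "months_diff": m}
    for i, (o, r, p, m) in enumerate([
        (1, 60.0, 15.0, 36), (2, 400.0, 170.0, 3), (1, 45.0, 10.0, 60),
        (4, 900.0, 380.0, 2), (1, 80.0, 22.0, 48), (3, 520.0, 200.0, 6),
        (1, 30.0, 8.0, 70), (2, 150.0, 55.0, 20), (5, 1300.0, 600.0, 1),
        (1, 95.0, 30.0, 30),
    ])
]

CLUSTER_VARS = ["orders", "revenue", "profit", "months_diff"]


def standardize(rows, columns):
    """Return copies of `rows` with each listed column replaced by its z-score."""
    stats = {}
    for col in columns:
        values = [r[col] for r in rows]
        stats[col] = (mean(values), stdev(values))  # sample standard deviation
    result = []
    for r in rows:
        z = dict(r)
        for col, (mu, sigma) in stats.items():
            z[col] = (r[col] - mu) / sigma if sigma > 0 else 0.0
        result.append(z)
    return result


def split_sample(rows, calibration_share=0.6, seed=2012):
    """Randomly split rows into (calibration, validation) samples."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * calibration_share)
    return shuffled[:cut], shuffled[cut:]


standardized = standardize(customers, CLUSTER_VARS)
calibration, validation = split_sample(standardized)
print(len(calibration), "calibration rows,", len(validation), "validation rows")
```

Standardizing before the split, as the text describes, means both samples share the same scale, which makes their cluster centers directly comparable later.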

The calibration sample was analyzed first, followed by the validation sample using the same method. Each analysis comprised multiple Hierarchical Cluster Analyses to identify the possible number of clusters and the cluster centers (initial seeds) for the K-Means Cluster Analysis. The Hierarchical Cluster Analyses were performed on random subsets comprising approximately 5% and 10% of the calibration and validation samples. Since the 10% subset is larger and more representative of the whole sample, I chose it over the 5% subset.

Hierarchical Cluster Analyses

For each subset, I ran the Furthest Neighbor Method and Ward’s Method using Squared Euclidean distance, as this measure is relatively faster to compute than Euclidean distance and places progressively greater weight on observations that are farther apart. The results of each of these analyses were used to identify initial seeds for the K-Means Cluster Analyses. The convergence criterion was set to 0.

Since the Furthest Neighbor Method produced customer segments with plenty of extreme values, I decided to use Ward’s Method, which gave reasonable segments and pragmatic figures, for the rest of the analysis. For the first 10% of the calibration and validation samples, I identified eight clusters each using Ward’s Method. The complete results are shown in Tables 3 and 4.

Table 3: Hierarchical Cluster Analysis result for the Calibration sample
Table 4: Hierarchical Cluster Analysis result for Validation sample
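To make the idea behind Ward’s Method concrete, here is a minimal Python sketch that repeatedly merges the pair of clusters whose merge adds the least within-cluster sum of squares, then takes the cluster centroids as seeds. It is a naive illustration for a small subsample, not the SPSS implementation. The function names and the two-dimensional sample points are hypothetical, and the data is assumed to be standardized already.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_clusters(points, k):
    """Agglomerate until k clusters remain, using Ward's merge cost.

    The cost of merging clusters A and B is
    |A| * |B| / (|A| + |B|) * ||centroid(A) - centroid(B)||^2,
    which is the increase in total within-cluster sum of squares.
    """
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                weight = len(a) * len(b) / (len(a) + len(b))
                cost = weight * squared_distance(centroid(a), centroid(b))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


# Hypothetical 2-D points with three obvious groups (invented for illustration).
sample = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
    (10.0, 0.0), (10.1, 0.2),
]

clusters = ward_clusters(sample, k=3)
seeds = [centroid(c) for c in clusters]  # candidate initial seeds for K-Means
for c, s in zip(clusters, seeds):
    print(len(c), "points, seed =", [round(v, 3) for v in s])
```

The pairwise search makes this sketch slow on large inputs, which is one practical reason to run hierarchical clustering on a 10% subset and hand the resulting centers to K-Means.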

K-Means Cluster Analyses

The results of the hierarchical cluster analyses led to the identification of the cluster centers and the creation of the seed files used in the K-Means analyses. I ran each K-Means cluster analysis with eight segments multiple times for both the calibration and validation samples. The results are presented as Set 2 and Set 4 (see Tables 5 and 6).

Table 5: K-Means Analysis result for Calibration sample — Set 2
Table 6: K-Means Analysis result for Validation sample — Set 4
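A seeded K-Means run can be sketched as follows. This is a plain-Python illustration of the general algorithm rather than the SPSS procedure: the function names, sample points, and seeds are hypothetical. The `tol=0.0` default mirrors the idea of a convergence criterion of 0, in the sense that iteration stops only when the centers no longer move (or the iteration cap is reached).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda c: squared_distance(point, centers[c]))


def kmeans(points, seeds, max_iter=100, tol=0.0):
    """K-Means started from the given seeds instead of random centers."""
    centers = [list(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(centroid(members) if members else centers[c])
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return labels, centers


# Hypothetical points and seeds (invented for illustration).
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
    (10.0, 0.0), (10.1, 0.2),
]
seeds = [(1.0, 1.0), (4.0, 4.0), (9.0, 1.0)]

labels, centers = kmeans(points, seeds)
print(labels)
```

Starting from seeds found by the hierarchical step makes the result repeatable, whereas random starting centers can lead K-Means to different segmentations from run to run.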

After extensive analysis of the calibration and validation samples, I concluded that Set 2, which includes eight important customer segments, was a more appropriate customer segmentation than Set 4. As shown in Tables 5 and 6, the results of the K-Means analyses on the calibration sample (Set 2) and the validation sample (Set 4) matched. I therefore ran a K-Means cluster analysis on the entire data set, and the results were consistent with the earlier analyses of the two samples. Table 7 presents detailed information about the final eight customer segments.
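One way to check that two sets of cluster centers match is to pair each calibration center with its closest validation center and inspect the distances. The sketch below does this greedily. The study does not say how the match was judged, so the pairing rule is an assumption, and the center values and labels are invented.

```python
import math


def match_centers(calibration, validation):
    """Greedily pair each calibration center with a distinct validation center.

    Pairs are taken in order of increasing Euclidean distance, so the closest
    centers are matched first and every center is used at most once.
    """
    pairs = sorted(
        (math.dist(c_vec, v_vec), c_id, v_id)
        for c_id, c_vec in calibration.items()
        for v_id, v_vec in validation.items()
    )
    matched, used_c, used_v = {}, set(), set()
    for distance, c_id, v_id in pairs:
        if c_id not in used_c and v_id not in used_v:
            matched[c_id] = (v_id, distance)
            used_c.add(c_id)
            used_v.add(v_id)
    return matched


# Hypothetical standardized cluster centers (invented for illustration).
calibration_centers = {"A": (0.0, 1.0), "B": (2.0, -1.0), "C": (-1.5, 0.5)}
validation_centers = {"X": (2.1, -0.9), "Y": (-1.4, 0.4), "Z": (0.1, 1.1)}

matches = match_centers(calibration_centers, validation_centers)
for c_id, (v_id, distance) in sorted(matches.items()):
    print(f"{c_id} <-> {v_id}  distance = {distance:.3f}")
```

Small distances for every pair suggest that the two samples produced the same segments. A pair with a large distance points to a segment that does not replicate.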

Table 7: Final Cluster

Conclusions

Adidas usually segments its customers based on the target market highlighting demographic, geographic, psychographic and behavioral factors. This study is crucial to Adidas as it provides a thorough understanding of the segmentation based on profitability. As per the analysis, I have recognized some key findings based upon the final eight customer segments. Below is a detailed explanation of each finding.

1. As shown in Table 8, this group consists of two customers who have placed four orders on average. The revenue and profit generated by these two customers are the highest across all customer segments. I conclude that the customers in this segment are likely sports clubs or sports associations that purchased products in bulk. The drawback of this group is that its last orders were placed approximately three years ago.

Table 8: Most profitable group

2. This group (see Table 9) combines cluster 5 and cluster 6. It contains 671 customers, who generate profits in the range of $1,112 to $3,338. Group 2 is regarded as a highly profitable group. These customers purchase Adidas products frequently, as relatively few months have passed since their last order.

Table 9: Highly profitable group

3. This group (Table 10) comprises 6,533 customers and combines segments 3 and 7. The profits generated by the customers in the two segments are almost the same, although the months difference since the last order in segment 7 is approximately twice that of segment 3. As most of the variables are similar across the two segments, I have combined them into a single group. This group can be termed more profitable customers.

Table 10: More profitable group

4. The three segments in group 4 are the least profitable for Adidas compared to the other segments, as shown in Table 11. Even though the customer count is huge, these customers have placed an average of one order. The customers in segment 8 have been idle for more than five years. I assume that customers in this group, particularly those in segment 8, are either not satisfied with Adidas products or have decided to switch brands.

Table 11: Less profitable group

5. In the post-hoc analysis, I analyzed the payment methods and purchasing channels favored by the customers. I found that the web was the most popular purchase channel and multi-brand showrooms were the least popular across all customers, as the shares below show.

Web: 61%, Store: 36%, and Multi-brand Showrooms: 3%

6. From Table 12, I can infer that the web is the most preferred channel among highly profitable and more profitable customers, and that less profitable customers likewise opt for the web channel to purchase Adidas products.

Table 12: Purchasing channels
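A channel breakdown per segment, like the one summarized in Table 12, amounts to a simple cross-tabulation. The following Python sketch shows the computation on a handful of invented rows. The segment labels, the counts, and the function name are hypothetical, and the output does not reproduce the study's figures.

```python
from collections import Counter, defaultdict

# Hypothetical (segment, channel) pairs, one per order; invented for illustration.
orders = [
    ("highly profitable", "web"), ("highly profitable", "web"),
    ("highly profitable", "web"), ("highly profitable", "store"),
    ("less profitable", "web"), ("less profitable", "store"),
]


def channel_share(rows):
    """Percentage of orders per channel within each segment."""
    counts = defaultdict(Counter)
    for segment, channel in rows:
        counts[segment][channel] += 1
    return {
        segment: {ch: 100.0 * n / sum(by_channel.values())
                  for ch, n in by_channel.items()}
        for segment, by_channel in counts.items()
    }


shares = channel_share(orders)
for segment, by_channel in shares.items():
    print(segment, by_channel)
```

The same function works for payment method or product category by swapping the second element of each pair.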

7. Looking at the payment methods, I found that Visa, MasterCard, and American Express are the most prominent ones compared to the others.

8. Purchases paid with AMEX, MasterCard, and Visa cards account for 89% of Adidas’s profit on the total sales of Footwear, Clothing, and Apparels.

Table 13: Percentage of profits for different product categories

9. Customers in New York, California, Delaware, New Jersey, and Pennsylvania are highly profitable, bringing an average profit of $1,320 per customer. Customers in Illinois, Texas, Florida, Mississippi, Washington, Colorado, Maryland, North Carolina, and Oklahoma are more profitable, bringing an average profit of $400 per customer, as shown in the figure (red is highly profitable, black is more profitable).

Highly and more profitable states

Proposed Customer Rewards Program

The right gear, expert guidance, and access to incredible events. The program provides its highly and more profitable customers with access to the latest and greatest gear. They can connect with Adidas Star Experts online, in-store, or through the Adidas Star apps. They can also book their spot and join the crew for special events and weekly group workouts.

Adidas Star Program:

1. On the Adidas Star Rewards Card, earn 1 point for every $10 spent. 1 point = $1. Points have unlimited validity

2. On the purchase of Footwear, Clothing, or Apparels, get a 30% discount on the rest of the products in the same order

3. Extra 10% discount on Reebok products with the purchase of Adidas products. To improve Reebok’s market share, the company can provide special offers on Reebok products to customers who buy Adidas products

4. Get 10% cashback on Adidas products when paying with American Express, MasterCard, or Visa

5. Free shipping on all orders from adidas.com

6. All the highly profitable and more profitable customers will earn a chance to participate in the lucky draw on March 31st, 2018 for tickets to the final of the UEFA Champions League 2017. They get a once-in-a-lifetime opportunity to take a selfie with the winning team

7. Additional 10% discount on the personalized products

8. Exclusive invitation to Adidas VIP events for all the highly profitable customers
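As a worked example of the points rule and the card cashback rule in the list above, the following Python sketch computes both for a single order. It is only one possible reading of the rules: it assumes points are granted per whole $10 spent, that cashback is rounded down to the cent, and that amounts are held in cents to avoid rounding errors. The function and constant names are hypothetical.

```python
# Card names taken from the cashback rule; the matching logic is an assumption.
CASHBACK_CARDS = {"american express", "mastercard", "visa"}


def points_earned(amount_cents):
    """1 point per whole $10 spent (assumption: partial $10 blocks earn nothing)."""
    return amount_cents // 1000


def points_value_cents(points):
    """Redeemable value of points, at 1 point = $1."""
    return points * 100


def card_cashback_cents(amount_cents, payment_method):
    """10% cashback when paying with an eligible card, otherwise nothing."""
    if payment_method.strip().lower() in CASHBACK_CARDS:
        return amount_cents * 10 // 100
    return 0


order_total = 12550  # a hypothetical $125.50 order, in cents
print(points_earned(order_total), "points")
print(card_cashback_cents(order_total, "Visa"), "cents cashback")
```

Under this reading, a $125.50 order paid by Visa earns 12 points (worth $12) and $12.55 in cashback, so the combined reward is close to 20% of the order value, a figure worth weighing against the profit per customer in each segment.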

Limitations

This study aims at narrowing down certain profitable customer segments for Adidas in order to devise an effective loyalty program. While the research has certain limitations, the findings suggest several future research directions. The limitations are as follows:

1. The analysis uses nine variables to identify profitable customer segments: order number, customer number, revenue, cost, channel, payment method, product category, zip code, and transaction date. The study would have been more comprehensive if it had also incorporated details such as age, gender, and the return and exchange rates of products.

2. Due to time and resource limits, it was difficult to consider all aspects of the data thoroughly.

3. There are some missing values in the data set which can be an area of concern for analyzing the data. For example, the order date is not specified for some of the products.

4. The report uses only Ward’s Method and the Furthest Neighbor Method for the analysis. There might be other, better-suited methods for analyzing this problem.

5. The data provided in the dataset is not up to date. The last transaction date was September 17th, 2012.

Appendix

Initial Cluster Centers
Final Cluster Centers
