A Combined Approach for Customer Profiling in Video on Demand Services Using Clustering and Association Rule Mining

Dhanasree Rajamani
This article proposes a combined data mining approach for analyzing and profiling customers in Video on Demand (VoD) services. The approach integrates clustering and association rule mining, and can help IPTV service providers implement effective Customer Relationship Management and customer-based marketing strategies.


The Video on Demand (VoD) service, also known as the Pay-TV business model, offers broadcast transmissions of media content to paying subscribers. The goal is to increase the quality of the network. Internet Protocol Television (IPTV) supports VoD, focusing on content and system quality according to subscribers’ choices. With continued expansion on the web, VoD and live TV services are becoming even more popular. Studies anticipate that the VoD market could grow from USD 38.9 billion in 2019 to USD 87.1 billion in 2024, an average annual growth rate of 17.5%. Other reports state that the number of VoD subscribers worldwide now exceeds 1 billion and is predicted to reach 1.1 billion by 2024. These figures suggest that such services will remain appealing to audiences in the coming years.

This research proposes a hybrid approach combining clustering and association rule mining techniques for profiling subscribers in VoD services. For customer segmentation, the LRFMP (Length, Recency, Frequency, Monetary, Periodicity) model is combined with the K-means algorithm; the Apriori algorithm is then used to generate association rules between the identified customer groups and content genres. We use real-world data obtained from an IPTV operator to demonstrate the application of the approach. The four customer groups identified are high-consuming-valuable subscribers, less-consuming subscribers, less-consuming-loyal subscribers, and disloyal subscribers. For each of these groups, a different marketing strategy or action is proposed, such as campaigns, special day promotions, discounted materials, and offers of favorite content.

Customer Relationship Management

Customer Relationship Management (CRM) comprises the methodology and practices for using customer data in a business, where decisions based on such data give a competitive advantage over the rest of the sector. Customer analysis using data mining techniques informs CRM strategies and increases customer value. Clustering methods for customer segmentation and profiling based on similar attributes help service providers gain a better understanding of subscribers’ needs and behaviors, so that they can introduce improved products and services. Such clustering commonly uses the RFM (Recency, Frequency, Monetary) model, which captures customer behavior and individual characteristics.

Proposed Combined Approach

This study proposes a combined approach that uses data mining methods for clustering and rule analysis to examine subscribers’ VoD behaviors in the IPTV sector. As illustrated in the figure below, the proposed approach includes three phases: preparing the dataset, segmenting the customers, and creating the association rules.

In Phase 1, data extracted from different sources are cleaned and preprocessed for analysis, and the LRFMP variables are constructed for each IPTV subscriber. Phase 2 begins with the standardization of the variables using min-max normalization (scaling each variable to the range 0 to 1) to reduce the effect of differences in variable scale. The next steps are determining the appropriate number of clusters (to find the customer groups) and profiling customers based on the clustering. In Phase 3, association rule mining determines the subscribers’ rental preferences: the Apriori algorithm is applied to each customer group to recommend new content types for that group. In this way, the approach profiles IPTV subscribers by clustering and determines their rental preferences by association rule mining. Finally, the rental content favored by these subscribers is predicted per customer group in order to devise appropriate marketing strategies.
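The normalization step in Phase 2 can be illustrated with a minimal sketch. The function name `min_max_normalize` and the sample frequency values are hypothetical and chosen only for illustration; the article does not specify an implementation.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0, since it carries
        # no information that could separate subscribers.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical frequency values for five subscribers.
frequency = [2, 4, 6, 8, 10]
normalized = min_max_normalize(frequency)
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Applying the same rescaling to each of the five LRFMP variables puts them on a common scale, so that no single variable dominates the distance calculations used in clustering.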


The LRFMP model segments the IPTV subscribers based on their content rental behaviors. Table 1 represents the rental transactions of a given hypothetical IPTV subscriber.

Length: The number of days between the subscriber’s first and last rentals. Based on the sample below, the value of ‘length’ is 68.

Recency: The number of days between the subscriber’s last rental date (July 24, 2019) and the last date of the observation period (July 31, 2019). In this example, the recency value is 7.

Frequency: The total number of content rentals by the subscriber during the observation period. In the sample below, ‘Frequency’ is calculated as 9.

Monetary: The average amount spent per content rental by the subscriber. In the example below, the average amount spent by the customer is 2.19.

Periodicity: The standard deviation of the subscriber’s inter-rental times (IRT), where an inter-rental time is the number of days between two consecutive rentals made on different dates. ‘Periodicity’ is measured as 6.77 in the example below.
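The five definitions above can be sketched in a few lines of Python. The rental log below is hypothetical and is not the data from Table 1; the function name `lrfmp` is illustrative, and the use of the population standard deviation for periodicity is an assumption, since the article does not say which form is used.

```python
from datetime import date
from statistics import pstdev


def lrfmp(rentals, period_end):
    """Compute (L, R, F, M, P) from a list of (rental_date, price) pairs."""
    dates = sorted(d for d, _ in rentals)
    length = (dates[-1] - dates[0]).days        # first to last rental
    recency = (period_end - dates[-1]).days     # last rental to period end
    frequency = len(rentals)                    # total number of rentals
    monetary = sum(price for _, price in rentals) / frequency
    # Inter-rental times use distinct dates only, so same-day rentals add no gap.
    unique_dates = sorted(set(dates))
    irts = [(b - a).days for a, b in zip(unique_dates, unique_dates[1:])]
    # Assumption: population standard deviation; undefined without any gap.
    periodicity = pstdev(irts) if irts else None
    return length, recency, frequency, monetary, periodicity


# Hypothetical rental log (not the article's Table 1).
rentals = [
    (date(2019, 7, 1), 2.0),
    (date(2019, 7, 1), 3.0),   # second rental on the same day
    (date(2019, 7, 3), 1.0),
    (date(2019, 7, 9), 2.0),
    (date(2019, 7, 11), 2.0),
    (date(2019, 7, 17), 2.0),
]
result = lrfmp(rentals, period_end=date(2019, 7, 31))
print(result)  # (16, 14, 6, 2.0, 2.0)
```

In this toy log the inter-rental times are 2, 6, 2 and 6 days, so the periodicity is 2.0; a subscriber who rents at perfectly regular intervals would have a periodicity of 0.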

Customer Segmentation

In the proposed approach, the LRFMP model is combined with clustering for customer segmentation. After standardization, each LRFMP parameter is treated as equally significant in the cluster analysis. The K-means algorithm is used to minimize the total within-cluster sum of squares, and the elbow method is used to find the optimal number of clusters. Cluster analysis is then performed, and customer segments are profiled based on customer values according to the LRFMP model. Finally, association rule analysis is performed to determine the relationship between the customers and the rented content types.
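As a rough illustration of how K-means and the within-cluster sum of squares (WCSS) fit together, the sketch below runs a plain Lloyd-style K-means on a tiny two-dimensional toy dataset and reports the WCSS for several values of k. The data, the function names, and the farthest-first initialization (chosen here only to keep the example deterministic) are assumptions; the study itself clusters five normalized LRFMP variables and does not describe its initialization.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_farthest_first(points, k):
    """Deterministic start: repeatedly add the point farthest from all centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100):
    """Return (labels, centroids, wcss) after alternating assign/update steps."""
    centroids = init_farthest_first(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
wcss_by_k = {k: kmeans(points, k)[2] for k in range(1, 5)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 3))
```

On this toy data the WCSS drops sharply from k = 1 to k = 2 and only slightly afterwards; that bend is the "elbow" used to pick the number of clusters.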

Association Rule Mining

Rented Content Types

All rented content types are tagged and named in a genre list (G), as shown in the picture. The five most rented genres are selected, and the Apriori algorithm is applied to each to find its relationship with the other four genres. Each association rule is composed of an antecedent (X, the left side) and a consequent (Y, the right side). The antecedent is the item with the most rented content type, whereas the consequent is the item set relevant to X. Three measures quantify the probability of the rental content types and express the degree of uncertainty about a rule. Support indicates how frequently the item sets appear in the dataset. Confidence represents how often the rule holds, that is, how often Y appears in the transactions that contain X. Lift measures the strength of the relationship between the antecedent and the consequent, relative to what would be expected if the two were independent.

Support, Confidence and Lift
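The three measures can be computed directly from their standard definitions: support(X → Y) = P(X and Y), confidence(X → Y) = P(X and Y) / P(X), and lift(X → Y) = confidence(X → Y) / P(Y). The sketch below applies them to a small hypothetical set of rental baskets; treating each basket as the set of genres rented by one subscriber is an assumption made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_xy = support(transactions, set(antecedent) | set(consequent))
    s_x = support(transactions, antecedent)
    s_y = support(transactions, consequent)
    confidence = s_xy / s_x
    lift = confidence / s_y
    return s_xy, confidence, lift


# Hypothetical baskets: the genres rented by eight subscribers.
baskets = [
    {"action", "comedy"},
    {"action", "comedy"},
    {"action", "comedy"},
    {"action", "drama"},
    {"comedy"},
    {"drama"},
    {"drama", "sci-fi"},
    {"sci-fi"},
]
metrics = rule_metrics(baskets, {"action"}, {"comedy"})
print(metrics)  # (0.375, 0.75, 1.5)
```

A lift above 1, as in this toy example, means that renting the antecedent genre makes the consequent genre more likely than it is among subscribers overall; a lift of exactly 1 would indicate independence.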

By applying the Apriori algorithm, we obtain the content types most preferred by IPTV subscribers and the content types they are likely to rent in the future. Content type analysis is conducted on the customer profiles obtained by clustering, and reliable association rules are derived to examine the dependencies among the different content genres.

Application of the Proposed Approach

Data Preparation

The service provider in question is one of the major digital broadcasting platforms in Turkey. The original dataset was extracted from the set-top-box (STB) data of 277,808 subscribers over a two-year period. Subscribers whose length value covered only one rental date, those with zero monetary value, and those with only one content rental were excluded from the study, because without at least two rental dates no inter-rental time (IRT) exists and the periodicity value cannot be calculated. In the data preprocessing stage, records with missing subscriber information and incorrect transaction records were removed. Finally, the rental records of 195,493 subscribers were analyzed. In the dataset, each transaction contains the subscriber’s ID, platform information, rental date, content ID, content price, content name and content type. The LRFMP model variables were produced for each subscriber, and the table below shows the descriptive statistics of the LRFMP variables.
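The exclusion rules described above could be expressed as a simple filter. This is a sketch under one reading of the criteria (at least two rentals, at least two distinct rental dates, and a positive total spend); the function name `is_eligible`, the record layout and the sample subscribers are hypothetical.

```python
from datetime import date


def is_eligible(rentals):
    """Keep a subscriber only if periodicity and monetary value are meaningful.

    rentals is a list of (rental_date, price) pairs for one subscriber.
    """
    distinct_dates = {d for d, _ in rentals}
    total_spent = sum(price for _, price in rentals)
    return len(rentals) > 1 and len(distinct_dates) > 1 and total_spent > 0


# Hypothetical subscribers keyed by an illustrative ID.
subscribers = {
    "s1": [(date(2019, 1, 5), 2.0), (date(2019, 2, 9), 3.0)],   # kept
    "s2": [(date(2019, 1, 5), 2.0)],                            # one rental only
    "s3": [(date(2019, 1, 5), 2.0), (date(2019, 1, 5), 1.0)],   # one rental date
    "s4": [(date(2019, 1, 5), 0.0), (date(2019, 3, 1), 0.0)],   # zero monetary value
}
kept = [sid for sid, rentals in subscribers.items() if is_eligible(rentals)]
print(kept)  # ['s1']
```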

Customer Segmentation Results

Prior to clustering, min-max normalization to the range 0 to 1 is used to standardize the LRFMP variables. The K-means algorithm is then run on the LRFMP values for different numbers of clusters. The chart below shows the resulting within-cluster sum of squares, from which the optimum number of clusters is chosen as 4.

Table 3 below shows the sample size, the average values of the LRFMP variables and the scores for each cluster. Cluster values for L, R, F, M and P that are above the aggregate average are marked with (↑); all others are marked with (↓).
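The arrow notation can be reproduced with a short sketch that compares each cluster's mean to the overall mean of the same variable. The four normalized rows and two cluster labels below are hypothetical, and taking the "aggregate average" to be the mean over all subscribers is an assumption.

```python
def mark_clusters(rows, labels, names="LRFMP"):
    """Return {cluster: 'L↑R↓...'} comparing cluster means to overall means."""
    n_features = len(rows[0])
    overall = [sum(r[i] for r in rows) / len(rows) for i in range(n_features)]
    marks = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        means = [sum(r[i] for r in members) / len(members) for i in range(n_features)]
        marks[cluster] = "".join(
            f"{name}{'↑' if m > o else '↓'}"
            for name, m, o in zip(names, means, overall)
        )
    return marks


# Hypothetical normalized LRFMP rows for four subscribers in two clusters.
rows = [
    (0.8, 0.1, 0.6, 0.4, 0.2),
    (0.6, 0.3, 0.8, 0.2, 0.4),
    (0.2, 0.7, 0.2, 0.6, 0.8),
    (0.4, 0.9, 0.0, 0.8, 0.6),
]
labels = [0, 0, 1, 1]
marks = mark_clusters(rows, labels)
print(marks)  # {0: 'L↑R↓F↑M↓P↓', 1: 'L↓R↑F↓M↑P↑'}
```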

To explain the different customer groups, related profiles are created based on the results obtained in the cluster analysis. Then customer value and customer relationship matrices are employed to interpret the segmentation results. The table below represents the names of the subscriber groups, LRFMP scores and size of each customer group.

High-consuming-valuable subscribers maintain long-term relationships with the VoD service provider; in terms of marketing strategy, it would be profitable to offer them various promotions and discounts. Less-consuming subscribers tend to rent high-priced content, so it is important to make the services more attractive and offer more content. Less-consuming-loyal subscribers prefer high-priced content; identifying this content more accurately and increasing its variety could encourage longer subscriptions with the service provider. Disloyal subscribers make up the largest group, and detailed studies are needed to make them more active.

Association Rule Mining based on Customer Profiling

The content types and sub-categories of the rented contents were extracted from the dataset. A sample is shown below.

The Apriori algorithm is used to determine the rental volume of the content types for the different subscriber groups. The most popular types or genres are then identified in order to examine the interrelations among them. The minimum support value (the ratio of the number of rentals to total rentals) is set to 0.001, and the minimum confidence is set to 0.9 to determine the best rules. The Apriori algorithm is applied with these thresholds to find the relationship between each customer group and the most rented content types.
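For readers unfamiliar with Apriori, the following is a minimal, unoptimized sketch of the level-wise algorithm followed by rule generation. The baskets are hypothetical, the function names are illustrative, and the support threshold of 0.25 is chosen only because the toy dataset is tiny (the study uses 0.001 on its full dataset); the confidence threshold of 0.9 matches the one stated above.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    level = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while level:
        current = {}
        for candidate in level:
            s = sum(candidate <= t for t in transactions) / n
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {
            c for c in joined
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
    return frequent


def generate_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence, lift) for each qualifying rule."""
    rules = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                confidence = s / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, rhs, s, confidence, confidence / frequent[rhs]))
    return rules


# Hypothetical baskets: the genres rented by eight subscribers in one group.
baskets = [
    {"action", "adventure", "comedy"},
    {"action", "adventure", "comedy"},
    {"action", "comedy"},
    {"adventure", "comedy"},
    {"comedy", "drama"},
    {"comedy"},
    {"drama"},
    {"drama", "sci-fi"},
]
frequent = apriori(baskets, min_support=0.25)
rules = generate_rules(frequent, min_confidence=0.9)
for lhs, rhs, s, conf, lift in rules:
    print(sorted(lhs), "->", sorted(rhs), round(s, 3), round(conf, 3), round(lift, 3))
```

In this toy data, every rule that clears the confidence threshold has comedy as its consequent, which mirrors the kind of pattern the article reports for several customer groups. Running the same procedure separately on each cluster's baskets gives group-specific rules.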

For each group, a series of association rules is produced according to the minimum support and minimum confidence values, and the rules above these thresholds are identified. The support and confidence values are calculated with the Apriori algorithm to reveal frequently occurring items. Using knowledge of the associations between items, the algorithm keeps only the relevant association rules and reduces the number of frequent items considered. The sample of subscribers’ preferences shown below indicates that IPTV subscribers prefer the comedy genre, while more than 96% of all other rentals include action, adventure, drama and sci-fi. Additionally, 99% of the customers renting these genres also rented comedy.

In Figures 3–6, the green circles represent the popular content types rented by the given customer group, and the size of each circle indicates the rental volume for that content type. The arrows represent the relationships between the item sets. The red/purple circles indicate the lift value; larger circles imply stronger lift.

Based on the analysis for “high consuming-valuable subscribers” (cluster 1), the most rented content types are comedy, adventure, action, drama and sci-fi, in order of preference. The Apriori results suggest that customers who rent any of the other popular genres also prefer comedy content. Figure 3 shows that the lift value is high between the LHS item set (adventure, action, drama, sci-fi) and the RHS item (comedy), implying that comedy is a preferred content type in this group. IPTV service providers could offer content types relevant to these subscribers’ preferences through email and social media to increase satisfaction in the long term.

For the “less consuming subscribers” (cluster 2), Figure 4 shows that those renting action, comedy, drama or animation content also chose the adventure genre. The lift value between LHS (animation, comedy) and RHS (adventure) is higher. Hence IPTV service providers could add such genres and categories, increase the number of items on offer and add detail to the metadata to achieve greater customer satisfaction.

The “less consuming loyal subscribers” (cluster 3) rent adventure, action, drama and sci-fi, as well as comedy. Figure 5 shows that the lift values between LHS (action and adventure) and RHS (comedy) are higher, indicating that comedy is a preferred content type for those renting action and adventure. IPTV service providers may employ different notification tools to inform users of reduced fees, enriched content and so on to encourage further rentals.

For the “disloyal subscribers” (cluster 4), Figure 6 reveals that users who rent action, adventure, drama and sci-fi also opt for comedy, resembling cluster 1. The lift values between LHS (adventure, drama and sci-fi) and RHS (comedy) are higher, indicating that comedy is a preferred content type here. Given that these subscribers form the largest group, service providers should carry out more in-depth surveys and analyses to uncover the reasons behind these users’ behaviors, on which future marketing decisions and promotional initiatives could be based.


With the development of the Internet and the enormous diversity of IPTV services, the number of subscribers has surged. Hence the problems related to identifying customers’ preferences and expectations are likely to multiply. Companies should make additional efforts to provide appropriate content to their subscribers by gaining insight into their behaviors and identifying customer values. This will pave the way to improved decision making, relevant marketing strategies and, ultimately, increased profitability.

The combined approach proposed here includes clustering and association rule mining for customer profiling in the VoD sector. The LRFMP model is used to extract subscribers’ values, and the K-means algorithm carries out the segmentation process. Subscribers are then grouped into four categories. Association rule mining is used to account for the rented content type information, and the Apriori algorithm is used to identify the genres preferred by customers in the different clusters.

From a theoretical standpoint, this study contributes to the current literature by proposing a novel combined data mining approach in the VoD sector that determines customer groups based on content preferences and analyzes the relations among these preferences. It provides insight into how to develop a combined approach using clustering and association rule mining techniques for profiling the subscribers of VoD service providers. From a practical standpoint, the findings of this study contribute to the CRM and marketing strategies of IPTV service providers, helping them offer subscribers appropriate promotions and advertising campaigns and recommend more relevant content types and categories based on users’ preferences and profiles. Better strategies can help firms reach more prospective customers while retaining existing ones.

The limitations of this study are:

  • This study uses subscribers’ VoD transaction records from STB devices only, not from other devices such as computers and mobile phones.
  • Other factors such as age, gender, profession, income level and region are not considered; future research can include them.
  • The data come from a single company in one country; the analysis could be expanded to different geographical areas.
  • Various other algorithms could be used for clustering and association rule mining.


