Data Science Project: Rule-Based Classification

Creating Personas from Existing Customer Profiles and Classifying According to Customer Purchasing

Cem ÖZÇELİK
8 min read · Oct 23, 2022
Photo by Lukas on Pexels

This article was written by Cem ÖZÇELİK and Alparslan Mesri.

Organizations spend a great deal of money to provide the best service to their existing customers. Knowing customer profiles and spending habits well allows this spending to be directed where it is most effective. How much profit a customer with a given profile brings in can depend on several variables in that profile. For example, depending on the organization’s field of activity, the customer’s nationality, age and gender, the type of device they shop with, the time and season of the purchase, and how often they buy from the organization can all be useful variables when building a customer profile. Today, organizations can draw on various methods to create these profiles.

From the perspective of data science and machine learning, various segmentation and clustering algorithms are well suited to this kind of task. However, although clustering algorithms such as K-Means may sound cooler, sometimes the data set at hand does not need such complex algorithms. For smaller, compact data sets with little variability and uncertainty, we can instead make rule-based classifications using certain features of our customers. This approach not only saves our team a great deal of work, but also gives us a more general and easily accepted solution to the business problem.

In this study, as an example business problem, an international gaming company wants to create new level-based customer definitions (personas) from some features of its customers, to form segments from these personas, and to estimate how much new customers can be expected to bring in on average according to these segments. For example, the company wants to determine how much a 26-year-old Android user in Turkey can be expected to bring in on average.

Now we can get to work. First, let’s get to know our data set:

Our data set consists of 5 variables:

  • ‘PRICE’,
  • ‘SOURCE’,
  • ‘SEX’,
  • ‘COUNTRY’,
  • ‘AGE’.

Let’s also get to know these variables:

  • PRICE: The amount the customer spent.
  • SOURCE: The type of operating system of the device the customer used.
  • SEX: The customer’s gender.
  • COUNTRY: The customer’s country.
  • AGE: The customer’s age.

First, let’s import the libraries that we will use in the study.

Next, let’s import our dataset and see the summary of the dataset.
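As a minimal sketch of these two steps: the real data would normally be read from a file with `pd.read_csv`, but since the file name is not given here, the example below builds a small hypothetical DataFrame with the same five columns. All row values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real file; these rows are invented.
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 19, 9, 59],
    "SOURCE":  ["android", "android", "android", "ios", "android", "ios", "ios", "android"],
    "SEX":     ["male", "male", "male", "female", "male", "female", "female", "male"],
    "COUNTRY": ["bra", "bra", "usa", "tur", "fra", "deu", "can", "usa"],
    "AGE":     [17, 17, 26, 33, 21, 45, 19, 52],
})

print(df.head())   # first five rows
print(df.shape)    # (number of rows, number of columns)
df.info()          # column dtypes and non-null counts
```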

Dataset Overview

Now let’s examine the data set in general terms so that we have some preliminary information about it:
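A helper along the following lines could produce such an overview. The function name `check_df` and the exact set of summaries are assumptions made for this sketch, and the sample rows are invented.

```python
import pandas as pd

def check_df(dataframe, head=5):
    """Print a quick overview of a DataFrame (illustrative helper)."""
    print("Shape:", dataframe.shape)
    print("Dtypes:\n", dataframe.dtypes, sep="")
    print("Head:\n", dataframe.head(head), sep="")
    print("Missing values per column:\n", dataframe.isnull().sum(), sep="")
    # describe() covers the numerical columns only (PRICE and AGE here)
    print("Descriptive statistics:\n", dataframe.describe().T, sep="")

# Invented sample rows with the article's column names.
df = pd.DataFrame({
    "PRICE":   [9, 29, 39, 59],
    "SOURCE":  ["ios", "android", "android", "ios"],
    "SEX":     ["female", "male", "male", "female"],
    "COUNTRY": ["can", "tur", "usa", "bra"],
    "AGE":     [15, 26, 33, 66],
})

check_df(df)
```

Histograms of the numerical columns can then be drawn with `df.hist()`, which requires matplotlib to be installed.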

We can see the output of this function block in the following image:

Descriptive Statistics of Dataset
Histogram Charts of PRICE & AGE Features

When we look at the data set, we see that there are no missing observations. Of course, we should not forget that this is an example data set; in real-world data science projects we rarely encounter data sets this clean and easy to understand. We have two numerical variables (AGE, PRICE) in the data set. Looking at their statistics, we can see that PRICE varies between 9 and 59 units, with an average of 34 units. Likewise, when we look at the lower and upper quartiles of PRICE, we can say that it has a balanced structure; there does not seem to be much skewness. However, the histogram shows that the values are not continuous; we need to keep in mind that PRICE has a discrete structure. Turning to the AGE variable, the ages of the users in our data set vary between 15 and 66. Unlike PRICE, AGE is right-skewed: the users in our data set are mostly young people. Given that our business problem belongs to a mobile gaming company, this is to be expected.

Now let’s ask the data set some questions and find the answers:

The first question is:

How many sales were made for each SOURCE type?
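Frequency questions like this one can be answered with `value_counts`. The sketch below uses invented rows; only the column name SOURCE comes from the data set described above. The same call on PRICE or COUNTRY answers the frequency questions that follow.

```python
import pandas as pd

# Invented sample rows; only the column name comes from the article.
df = pd.DataFrame({
    "SOURCE": ["android", "android", "ios", "android", "ios", "android", "ios", "android"],
})

n_sources = df["SOURCE"].nunique()            # number of distinct operating systems
source_counts = df["SOURCE"].value_counts()   # sales per operating system, most frequent first

print(n_sources)
print(source_counts)
```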

How many sales are there in each PRICE value?

As we can see from the output here, the products we offer to our customers are priced at 9, 19, 29, 39, 49 and 59.

How many sales were made in each COUNTRY?

When we look at the output, we can see that sales are concentrated in the US and Brazil. Among the European countries, Germany and Turkey have similar values, while France has the fewest sales. Overall, the fewest sales were in Canada.

How much was earned in total from sales by COUNTRY ?
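One way to answer this is to group by COUNTRY and aggregate PRICE. The sketch below reports the number of sales next to the total so the two can be compared; the rows are invented and do not reproduce the article’s figures.

```python
import pandas as pd

# Invented sample rows for illustration.
df = pd.DataFrame({
    "COUNTRY": ["usa", "usa", "usa", "usa", "bra", "bra", "tur", "deu", "deu"],
    "PRICE":   [29, 29, 39, 19, 49, 39, 49, 29, 39],
})

country_summary = (
    df.groupby("COUNTRY")["PRICE"]
      .agg(["count", "sum"])                # number of sales and total earnings
      .sort_values("sum", ascending=False)  # highest-earning country first
)
print(country_summary)
```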

When the earnings by country are examined, as expected, the most was earned from users in the USA and Brazil. However, although users in Brazil made almost half as many purchases as US users, the earnings from Brazil were equivalent to 73% of the earnings from the US. Among the European countries, although more products were sold in Germany than in Turkey, the earnings from Turkey are higher than those from Germany. This can be interpreted as “Users in Turkey and Brazil have purchased products with higher prices than users in the USA and Germany.”

What are the sales amounts by SOURCE type?

As we can deduce from the results here, the earnings from Android users are ~65% higher than those from iOS users.
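A percentage comparison of this kind can be computed from the grouped totals. The rows below are invented, so the resulting percentage differs from the ~65% reported for the real data.

```python
import pandas as pd

# Invented sample rows for illustration.
df = pd.DataFrame({
    "SOURCE": ["android", "android", "android", "ios", "ios"],
    "PRICE":  [39, 49, 29, 39, 29],
})

revenue = df.groupby("SOURCE")["PRICE"].sum()

# Relative difference: how much larger the android total is than the ios total
pct_more = (revenue["android"] / revenue["ios"] - 1) * 100

print(revenue)
print(f"android total is {pct_more:.1f}% higher than ios total")
```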

What are the PRICE averages by COUNTRY ?

This confirms the comment we made on the question “How much was earned in total from sales by COUNTRY ?” in the previous section. When we look at the earnings per sale, users in Turkey and Brazil are at the top of the list, followed by Germany and the USA. It was not visible in the previous section, but users in Canada also pay more per sale.

What are the PRICE averages by SOURCE ?

When we look at the average earnings by the type of device the users have, we can see that Android users bring in more per sale than iOS users.

What are the PRICE averages by COUNTRY-SOURCE?

When the results here are examined, as expected, the averages of Android users are higher in each country. Surprisingly, however, the USA, which has the most sales and the highest earnings, has the lowest averages. In Canada, due to the low number of sales, the averages by operating system are higher than in the other countries.
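Grouping by two columns gives one average per country and operating system pair. As a sketch on invented rows, `unstack` then turns the result into a country by source table, which makes the per-country comparison easier to read.

```python
import pandas as pd

# Invented sample rows for illustration.
df = pd.DataFrame({
    "COUNTRY": ["usa", "usa", "usa", "tur", "tur", "tur"],
    "SOURCE":  ["android", "android", "ios", "android", "ios", "ios"],
    "PRICE":   [39, 29, 29, 49, 39, 29],
})

country_source_mean = df.groupby(["COUNTRY", "SOURCE"])["PRICE"].mean()
country_source_table = country_source_mean.unstack()  # countries as rows, sources as columns

print(country_source_mean)
print(country_source_table)
```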

What are the average earnings by COUNTRY-SOURCE-SEX-AGE ?
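The same pattern extends to all four columns. The sketch below sorts the averages in descending order and calls `reset_index` so that the grouping keys become ordinary columns again, which is convenient for the later steps. The name `agg_df` and the rows are assumptions of this example.

```python
import pandas as pd

# Invented sample rows for illustration.
df = pd.DataFrame({
    "PRICE":   [39, 49, 29, 19, 59, 39],
    "SOURCE":  ["android", "android", "ios", "ios", "android", "android"],
    "SEX":     ["male", "male", "female", "female", "male", "male"],
    "COUNTRY": ["bra", "bra", "fra", "fra", "usa", "bra"],
    "AGE":     [17, 17, 24, 24, 35, 30],
})

agg_df = (
    df.groupby(["COUNTRY", "SOURCE", "SEX", "AGE"])["PRICE"]
      .mean()                          # average price per combination
      .sort_values(ascending=False)    # highest average first
      .reset_index()                   # turn the group keys back into columns
)
print(agg_df)
```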

According to this result, male Android users in Brazil and the USA bring in the most. Although they are few in number, the amount they pay per sale is higher than that of young users. Young female Android users in France are also among the user profiles that pay the most per sale.

We asked the data set the questions we wanted answered and drew some conclusions. Now, to create personas from customers, let’s first categorize users according to their age. To do so, we divide the AGE variable into the age categories 0_18, 19_23, 24_30, 31_40 and 40_70.
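A sketch of this binning with `pd.cut` follows. The bin edges are an assumption chosen to match the category names above, and the ages are invented.

```python
import pandas as pd

# Invented ages for illustration.
df = pd.DataFrame({"AGE": [15, 18, 19, 26, 33, 40, 41, 66]})

# Assumed bin edges, chosen to match the category names in the text.
bins = [0, 18, 23, 30, 40, 70]
labels = ["0_18", "19_23", "24_30", "31_40", "40_70"]

# pd.cut uses right-closed intervals by default: (0, 18], (18, 23], ...
# so with integer ages the last category effectively covers 41 to 70.
df["AGE_CAT"] = pd.cut(df["AGE"], bins=bins, labels=labels)
print(df)
```

With these edges, an age of exactly 40 falls into 31_40 rather than 40_70.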

We now concatenate the variables to create a level-based customer definition variable.
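One possible way to build this variable is a list comprehension that joins the four values of each row with underscores and upper-cases them. The column name AGE_CAT and the rows below are assumptions of this sketch.

```python
import pandas as pd

# Invented aggregated rows; AGE_CAT is an assumed name for the age category column.
agg_df = pd.DataFrame({
    "COUNTRY": ["fra", "fra", "tur"],
    "SOURCE":  ["ios", "ios", "android"],
    "SEX":     ["female", "female", "male"],
    "AGE_CAT": ["31_40", "31_40", "24_30"],
    "PRICE":   [29.0, 39.0, 49.0],
})

cols = ["COUNTRY", "SOURCE", "SEX", "AGE_CAT"]

# Join the four values of every row into one upper-case persona string
agg_df["customers_level_based"] = [
    "_".join(str(value).upper() for value in row)
    for row in agg_df[cols].values
]
print(agg_df[["customers_level_based", "PRICE"]])
```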

The point we need to pay attention to here is that, after creating the customers_level_based values with a list comprehension, these values need to be made unique. For example, there may be more than one row with the value FRA_IOS_FEMALE_31_40.

We then look at the PRICE averages according to the CUSTOMERS_LEVEL_BASED variable we created.
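Grouping by the persona string and averaging PRICE leaves exactly one row per persona, which takes care of repeated values. A minimal sketch on invented rows, with `persona_df` as a hypothetical name:

```python
import pandas as pd

# Invented rows; the first persona appears twice on purpose.
agg_df = pd.DataFrame({
    "customers_level_based": [
        "FRA_IOS_FEMALE_31_40",
        "FRA_IOS_FEMALE_31_40",
        "TUR_ANDROID_MALE_24_30",
    ],
    "PRICE": [29.0, 39.0, 49.0],
})

persona_df = (
    agg_df.groupby("customers_level_based")["PRICE"]
          .mean()          # one average price per persona
          .reset_index()
)
print(persona_df)
```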

We divide users into segments according to the average price of the personas we create here. We name these segments as A, B, C, D.
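A quantile-based split with `pd.qcut` is one way to form four segments. It is an assumption of this sketch rather than something stated in the text, although it is consistent with the equal segment sizes reported further down. The persona names and prices are invented.

```python
import pandas as pd

# Invented persona averages for illustration.
persona_df = pd.DataFrame({
    "customers_level_based": [f"PERSONA_{i}" for i in range(8)],
    "PRICE": [30.0, 31.5, 33.0, 34.0, 35.0, 36.5, 38.0, 45.0],
})

# Four equal-sized groups by average price; labels run from lowest (D) to highest (A)
persona_df["SEGMENT"] = pd.qcut(persona_df["PRICE"], q=4, labels=["D", "C", "B", "A"])
print(persona_df)
```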

Finally, let’s examine the statistical properties of the segments we have created:
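Such a summary can be produced by grouping on the segment and aggregating PRICE. The values below are invented and do not reproduce the article’s table.

```python
import pandas as pd

# Invented persona averages with already assigned segments.
persona_df = pd.DataFrame({
    "PRICE":   [30.0, 31.5, 33.0, 34.0, 35.0, 36.5, 38.0, 45.0],
    "SEGMENT": ["D", "D", "C", "C", "B", "B", "A", "A"],
})

segment_stats = (
    persona_df.groupby("SEGMENT")["PRICE"]
              .agg(["count", "mean", "min", "max"])
              .sort_values("mean", ascending=False)  # best-paying segment first
)
print(segment_stats)
```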

Statistics of Segments

As we can see in this table, there are 27 users in each segment. The segment with the highest earnings is segment A, and the one with the lowest is segment D. The highest value in segment A is 45.42 units, the lowest is 36.06 units, and the average is 38.69 units.

Here, as a different approach, we could combine the segments that are most similar to each other. For this problem, B and C are the two most similar segments: the average of segment B is 34.99, while the average of segment C is 33.50, and their maximum values are also close. However, combining similar segments can sometimes cause us to lose distinctive information. That’s why we don’t combine these two segments in this business problem.

Now, finally, we will define a few sample users and find out which segment these users fall into in our system.
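A lookup of this kind might be sketched as follows. The helper `persona_key` is hypothetical, and the PRICE and SEGMENT values in the table are placeholders invented for the example, not the results of the study.

```python
import pandas as pd

# Age ranges follow the categories named in the article; integer ages assumed.
AGE_BINS = [(0, 18, "0_18"), (19, 23, "19_23"), (24, 30, "24_30"),
            (31, 40, "31_40"), (41, 70, "40_70")]

def persona_key(country, source, sex, age):
    """Build a persona string such as TUR_ANDROID_MALE_31_40 (hypothetical helper)."""
    age_cat = next(label for low, high, label in AGE_BINS if low <= age <= high)
    return f"{country}_{source}_{sex}_{age_cat}".upper()

# Invented persona table; PRICE and SEGMENT are placeholder values only.
persona_df = pd.DataFrame({
    "customers_level_based": [
        "TUR_ANDROID_MALE_31_40",
        "FRA_IOS_FEMALE_31_40",
        "TUR_ANDROID_MALE_24_30",
    ],
    "PRICE":   [41.0, 33.0, 36.0],
    "SEGMENT": ["A", "C", "B"],
}).set_index("customers_level_based")

new_users = [
    ("tur", "android", "male", 33),
    ("fra", "ios", "female", 39),
    ("tur", "android", "male", 26),
]

for user in new_users:
    key = persona_key(*user)
    row = persona_df.loc[key]          # expected average price and segment
    print(key, row["SEGMENT"], row["PRICE"])
```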

The output of this for block is:

In the first example, we segmented a 33-year-old male Android user in Turkey; in the second example, a 39-year-old female iOS user in France; and in the last example, a 26-year-old male Android user in Turkey.

We have come to the end of our study. In this study, we performed a rule-based classification on a data set with low uncertainty, without the need for complex algorithms.
