How to Calculate Potential Customer Return with Rule-Based Classification

Bitaazari
5 min read · Dec 18, 2021

In this article, we are going to analyze the potential return from prospective customers with Python. Since we will use rule-based classification, let's start with a brief explanation of it.

There are several classification methods, such as decision trees, rule-based classification, and random forests. The term rule-based classification can refer to any classification scheme that makes use of IF-THEN rules, organized into "rulesets", for class prediction.

**IF condition THEN conclusion**
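To make the pattern concrete, here is a minimal sketch of a ruleset in Python. The thresholds, class labels, and names (`rules`, `classify`, `DEFAULT_CLASS`) are made up for illustration and are not part of this project's analysis.

```python
# Hypothetical ruleset: each rule is an (IF condition, THEN conclusion) pair.
# The thresholds and labels below are assumptions for illustration only.
rules = [
    (lambda case: case["age"] < 18, "minor"),
    (lambda case: case["age"] < 40, "adult"),
]
DEFAULT_CLASS = "senior"  # fallback so that every case receives some class


def classify(case):
    """Return the conclusion of the first rule whose condition holds."""
    for condition, conclusion in rules:
        if condition(case):
            return conclusion
    return DEFAULT_CLASS


labels = [classify({"age": age}) for age in (15, 25, 63)]
print(labels)  # ['minor', 'adult', 'senior']
```

The rules are checked in order, so when two conditions overlap the earlier rule wins.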

Rule-based classification consists of a rule induction algorithm and rule ranking measures. The former is the process of extracting relevant IF-THEN rules from the data, while the latter are values used to measure how useful a rule is in providing accurate predictions. Rule ranking measures are sometimes used inside the rule induction algorithm to eliminate unnecessary rules and improve efficiency. Their other use, and the one we rely on in this project, is class prediction: anticipating the class of new cases. To answer where these rules come from, we would need to study the characteristics of the method, which include mutually exclusive rules and exhaustive rules. For this project we don't need to go that deep, so let's move on to the business problem and the analysis.

Business problem:

A game company would like to create new level-based customer definitions (personas) by using its customers' features. It will create segments from these personas and would like to estimate roughly how much money it can expect from potential new customers in those segments.

Example:

How much can the company earn on average from a 25-year-old male IOS user from Turkey?

Dataset:

The Persona.csv dataset shows the number of products sold by an international game company and some demographic information about the users who bought them. The dataset consists of records created at each sales transaction, which means the table is not deduplicated. In other words, a user with certain demographic characteristics may have made more than one purchase.

Variables:

  1. PRICE: Amount spent by the customer
  2. SOURCE: The type of device the customer connects from (IOS/Android)
  3. SEX: Gender of the customer
  4. COUNTRY: Country of the customer
  5. AGE: Age of the customer

So far, we know the business problem and the dataset, and we understand the variables in this project. Before starting our analysis, let's think about some required steps:

  • Re-read the business problem to find out exactly what we need to achieve.
  • Check the dataset and variables to get a general idea of the data.
  • Decide which tools and libraries we need.
  • Finally, plan the analysis: how to start and finish the work in order to produce an efficient result.

Now it's time to code:

First of all, we need to import the required library, which in this case is pandas.

Some of you may ask what pandas is (though I'm pretty sure that if you are reading this article, you already know). In short, pandas is a Python library for working with datasets; it has functions for analyzing, cleaning, exploring, and manipulating data.

Next, we read the data from the dataset and run some general analysis: summary statistics, a check for missing values, the shape, and the columns.

General dataset statistics and analysis
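A minimal sketch of this first look might be the following. To keep it runnable without the file, it builds a tiny made-up sample with the same five columns; the values are assumptions, not rows from Persona.csv, and the file name in the comment is assumed as well.

```python
import pandas as pd

# Tiny made-up sample with the same columns as the dataset (values are assumptions).
# With the real file you would instead use something like:
#   df = pd.read_csv("persona.csv")
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 19, 39],
    "SOURCE":  ["android", "ios", "android", "ios", "android", "ios", "android"],
    "SEX":     ["male", "male", "female", "female", "female", "male", "female"],
    "COUNTRY": ["tur", "tur", "fra", "usa", "tur", "fra", "tur"],
    "AGE":     [17, 25, 33, 41, 35, 25, 35],
})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.isnull().sum())    # number of missing values per column
print(df.describe().T)      # summary statistics of the numeric columns
```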

Now, let's get into more detail about the variables:

  • How many unique SOURCE values are there? What are their frequencies?
  • How many unique PRICE values are there?
  • How many sales were made at each PRICE?
  • How many sales were made in each country?

Unique value counts and frequencies, plus the number of sales by country and price
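These questions map onto `nunique()` and `value_counts()`. Here is a sketch on a small made-up sample (the values are assumptions, not the real data):

```python
import pandas as pd

# Made-up sample with the dataset's columns (values are assumptions).
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 19, 39],
    "SOURCE":  ["android", "ios", "android", "ios", "android", "ios", "android"],
    "COUNTRY": ["tur", "tur", "fra", "usa", "tur", "fra", "tur"],
})

n_sources = df["SOURCE"].nunique()               # how many unique SOURCE values
source_counts = df["SOURCE"].value_counts()      # frequency of each SOURCE
n_prices = df["PRICE"].nunique()                 # how many unique PRICE values
sales_per_price = df["PRICE"].value_counts()     # number of sales at each PRICE
sales_per_country = df["COUNTRY"].value_counts() # number of sales in each country

print(n_sources, n_prices)
print(source_counts)
print(sales_per_price)
print(sales_per_country)
```

Because every row is one sales transaction, counting rows per value is the same as counting sales.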

Then, going further, it becomes more interesting. To understand the values of several variables together and observe their breakdown, we use groupby and then apply aggregation functions. If you don't know these functions yet, the pandas documentation describes them.

groupby used to compute the sum and mean across multiple variables
Total sale by country
The PRICE averages in the COUNTRY-SOURCE breakdown
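As a sketch, the two breakdowns named in the captions could be computed like this. The sample values are made up, and the result names `total_by_country` and `mean_by_country_source` are hypothetical.

```python
import pandas as pd

# Made-up sample with the dataset's columns (values are assumptions).
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 19, 39],
    "SOURCE":  ["android", "ios", "android", "ios", "android", "ios", "android"],
    "COUNTRY": ["tur", "tur", "fra", "usa", "tur", "fra", "tur"],
})

# Total sales per country
total_by_country = df.groupby("COUNTRY")["PRICE"].sum()

# Average PRICE in the COUNTRY-SOURCE breakdown
mean_by_country_source = df.groupby(["COUNTRY", "SOURCE"])["PRICE"].mean()

print(total_by_country)
print(mean_by_country_source)
```

The same pattern extends to `.agg({"PRICE": ["sum", "mean"]})` when both statistics are wanted in one table.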

After all this, we sort the output and reset its index to get a clearer picture. Here we sort the output by PRICE in descending order and name it agg_df.

Reverse sorting
index for agg_df
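A sketch of this step is below. It assumes that agg_df is the total PRICE in the COUNTRY, SOURCE, SEX, AGE breakdown; the article does not state the exact grouping, so treat that choice and the sample values as assumptions.

```python
import pandas as pd

# Made-up sample with the dataset's columns (values are assumptions).
df = pd.DataFrame({
    "PRICE":   [39, 39, 49, 29, 49, 19, 39],
    "SOURCE":  ["android", "ios", "android", "ios", "android", "ios", "android"],
    "SEX":     ["male", "male", "female", "female", "female", "male", "female"],
    "COUNTRY": ["tur", "tur", "fra", "usa", "tur", "fra", "tur"],
    "AGE":     [17, 25, 33, 41, 35, 25, 35],
})

# Assumed grouping: total PRICE per COUNTRY, SOURCE, SEX, AGE combination,
# sorted from highest to lowest ("reverse" sorting).
agg_df = (
    df.groupby(["COUNTRY", "SOURCE", "SEX", "AGE"])
      .agg({"PRICE": "sum"})
      .sort_values("PRICE", ascending=False)
)

# After groupby the grouping keys live in the index; reset_index turns them
# back into ordinary columns.
agg_df = agg_df.reset_index()
print(agg_df)
```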

After that, we want to divide the customers into groups by converting the numerical AGE variable into a categorical variable. This is part of the segmentation process and brings us closer to our objective.

Let's talk a bit about cut(). We use the cut() function to segregate array elements into separate bins; it is also useful when we want to go from a continuous variable to a categorical one. The pandas documentation has more on cut().

There is another function called qcut(). The difference between the two is that qcut() computes the bin edges from quantiles so that the data is distributed equally across the bins: all bins have (roughly) the same number of observations, but the bin ranges vary. cut(), on the other hand, is used when we want to define the bin edges explicitly.

Anyway, there is no need to worry too much about these definitions; there are many sources and videos you can learn from later. Let's continue our project with the next step.
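To see the difference in practice, here is a small sketch on made-up ages. The bin edges and labels passed to cut() are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([15, 19, 22, 25, 31, 38, 45, 66])  # made-up ages

# cut: we define the bin edges ourselves (edges and labels are assumptions).
# Bins are right-inclusive by default: (0, 18], (18, 23], (23, 30], ...
age_cat = pd.cut(
    ages,
    bins=[0, 18, 23, 30, 40, 70],
    labels=["0_18", "19_23", "24_30", "31_40", "41_70"],
)

# qcut: pandas derives the edges from quantiles, so each bin holds
# roughly the same number of observations.
age_quartile = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(age_cat.value_counts(sort=False))       # bin sizes differ
print(age_quartile.value_counts(sort=False))  # two observations in every bin
```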

It’s time to segment customers and calculate some descriptive statistics:

describe the segments
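One way this segmentation could look is sketched below. Everything beyond the five dataset columns is an assumption: the age bins, the column names `AGE_CAT`, `customers_level_based`, and `SEGMENT`, the choice of four segments labeled D to A, and the made-up input values.

```python
import pandas as pd

# Made-up aggregated table (values are assumptions, not real results).
agg_df = pd.DataFrame({
    "COUNTRY": ["tur", "tur", "fra", "fra", "usa", "usa", "tur", "fra"],
    "SOURCE":  ["android", "android", "android", "ios", "ios", "android", "ios", "android"],
    "SEX":     ["female", "female", "female", "male", "female", "male", "male", "female"],
    "AGE":     [33, 38, 35, 25, 41, 17, 25, 20],
    "PRICE":   [88, 40, 49, 19, 29, 59, 39, 9],
})

# Numerical age -> age category (assumed bin edges and labels).
agg_df["AGE_CAT"] = pd.cut(
    agg_df["AGE"],
    bins=[0, 18, 23, 30, 40, 70],
    labels=["0_18", "19_23", "24_30", "31_40", "41_70"],
)

# Level-based persona, e.g. "TUR_ANDROID_FEMALE_31_40" (hypothetical naming).
agg_df["customers_level_based"] = (
    agg_df["COUNTRY"] + "_" + agg_df["SOURCE"] + "_" + agg_df["SEX"]
    + "_" + agg_df["AGE_CAT"].astype(str)
).str.upper()

# Several rows can map to the same persona, so average PRICE per persona.
persona_df = agg_df.groupby("customers_level_based", as_index=False)["PRICE"].mean()

# Split personas into four segments by average PRICE (A = highest).
persona_df["SEGMENT"] = pd.qcut(persona_df["PRICE"], q=4, labels=["D", "C", "B", "A"])

# Descriptive statistics per segment.
segment_stats = persona_df.groupby("SEGMENT", observed=True)["PRICE"].agg(["mean", "max", "sum"])
print(persona_df)
print(segment_stats)
```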

Finally, we can find out which segment a new customer belongs to and how much they are likely to spend on average.

Let’s try it for a 31–40-year-old woman from Turkey who uses Android, and do the same for a woman in France.

Check the output yourself and share it with me in the comments 💕😊🤞
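A lookup like this could be sketched as follows, assuming a persona-level table with hypothetical columns `customers_level_based`, `PRICE`, and `SEGMENT`. The rows here are invented placeholders so the snippet runs on its own; they are not the real results.

```python
import pandas as pd

# Placeholder persona table (invented values, NOT the real output).
persona_df = pd.DataFrame({
    "customers_level_based": [
        "TUR_ANDROID_FEMALE_31_40",
        "FRA_ANDROID_FEMALE_31_40",
        "USA_IOS_MALE_24_30",
    ],
    "PRICE":   [64.0, 49.0, 19.0],
    "SEGMENT": ["A", "B", "D"],
})


def lookup(persona):
    """Return the row(s) of persona_df matching the given persona string."""
    return persona_df[persona_df["customers_level_based"] == persona]


result_tur = lookup("TUR_ANDROID_FEMALE_31_40")
result_fra = lookup("FRA_ANDROID_FEMALE_31_40")
print(result_tur)
print(result_fra)
```

An empty result means the persona does not exist in the table, which can happen for combinations that never appeared in the sales data.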

Resources:

Tedros M. Berhane, Charles R. Lane, Qiusheng Wu, Bradley C. Autrey, Oleg A. Anenkhonov, Victor V. Chepinoga, and Hongxing Liu. Decision-Tree, Rule-Based, and Random Forest Classification of High-Resolution Multispectral Imagery for Wetland Mapping and Inventory.
