Exploring Customer Demographic Insights: Exploratory Data Analysis for Attrited Customers in Bank Credit
Welcome to my portfolio, showcasing the project I worked on as Virtual Internship Experience Data Engineering Rakamin in BTPN Syariah. In this article, I want to share my final project as a virtual internship, a project that involved creating Entity Relationship Diagram, creating a schema, loading data into the database, and querying data to support business intelligence and data analysis efforts. Through this project, I was able to demonstrate my skills in database design, common table expression using PostgreSQL, and data visualization
Business Objective
The bank manager is frustrated with an increasing number of customers leaving their credit card services. They would greatly appreciate it if someone could identify customer profiles so they can know which customers are likely to leave, so they can proactively approach customers to provide better service and change customers’ decisions in the opposite direction.
As a response to the trend of customers leaving the credit card services offered by banks, a solution is sought to better understand customer profiles. This project aimed to provide valuable customer insight that could be used in improving customer satisfaction, reducing customer attrition, and maintaining existing customers. In order to achieve that, the business objective is :
- Identifying what factors are causing customers to be late in paying their credit arrears.
- Visualizing the condition of customers who have overdue credit payments
- Determining the most frequent factor causing customers to fall behind on payments.
in this project, I created an Entity Relationship Diagram, Build a database, Data exploring, and analyze customer demographic, transaction history, and product usage patterns to help the bank take proactive measures to retain them.
Setting up the PostgreSQL for Customer Data History Using Python
The first step to exploring customer data provided in the form of a CSV file is to load it into the database management system. For this project, PostgreSQL was selected as the database management system due to its scalability, reliability, and support for advanced data analytics. The data was loaded into PostgreSQL using Python with the psycopg2 library and using SQL query such as COPY command to efficiently transfer the CSV data into the database.
I made four python files with classes each file that have different functions.
data_connection.py
data_connection.py contains DataConnection class, which is used to perform connection to the PostgreSQL database, fetching and executing queries. This class has the following methods:
import psycopg2
class DataConnection:
def __init__(self, database, user, password, host='localhost', port=5432):
'''
Initialize the connection to the database :
host (str) : the host of the database (default : 'localhost')
database (str) : the name of the database
user (str): the username for the database
password (str) : the password for the database
port (int) : the database port (default : 5432)
'''
self.host = host
self.port = port
self.database = database
self.user = user
self.password = password
def connect(self):
'''
Connect to the database using the provided paramaters
'''
try:
self.conn = psycopg2.connect(
host = self.host,
port = self.port,
database = self.database,
user = self.user,
password = self.password
)
self.cursor = self.conn.cursor()
print('Connected to the database')
except (Exception, psycopg2.DatabaseError) as e:
print(f"Error: {e}")
def close(self):
'''
Close the connection to the database
'''
try :
if self.conn:
self.cursor.close()
self.conn.close()
print('Connection closed.')
else:
print('There is no opened database')
except (Exception, psycopg2.DatabaseError) as e:
print(f"Error: {e}")
def fetching(self, query):
'''
Fetch the data from database using SELECT query
query (str) : the query to be executed
return fetching data and database column names
'''
try:
cursor = self.conn.cursor()
cursor.execute(query)
result = cursor.fetchall()
column = [desc[0] for desc in cursor.description]
return result, column
except (Exception, psycopg2.DatabaseError) as e:
print(f"Error: {e}")
def exec(self, query, values=None):
'''
Execute custome query
query (str) : sql query in string
values : list of values if want to insert more than 1 values
'''
cursor = self.conn.cursor()
try:
if values:
cursor.executemany(query, values)
else:
cursor.execute(query)
self.conn.commit()
except (Exception, psycopg2.DatabaseError) as e:
self.conn.rollback()
print(f"Error: {e}")
data_crud.py
data_crud.py contain DataCrud class which is used to perform crud (create, read, update, delete). This file has the following methods :
import psycopg2
class DataCrud:
def __init__(self, data_connection):
self.data_connection = data_connection
def createDatabase(self, database_name:str):
'''
Create a database into postgresql server
database_name (str) : your new database name
'''
try:
sql = f"CREATE DATABASE {database_name}"
self.data_connection.exec(sql)
print(f"database {database_name} successfully created")
except (Exception, psycopg2.DatabaseError) as e:
print(f"Error : {e}")
def dropDatabase(self, database_name:str):
'''
Delete a database from postgresql server
database_name (str) : your new database name
'''
try:
sql = f"DROP DATABASE {database_name}"
self.data_connection.exec(sql)
print(f"Database {database_name} successfully deleted")
except (Exception, psycopg2.DatabaseError) as e:
print(f"Error : {e}")
def createTable(self, table_name:str, column:list):
'''
Create a new table into database
table_name (str) : new table name that not exist in database
column (list) : list of column name and datatype, must be a string
'''
try:
col = ", ".join(column)
sql = f"CREATE TABLE {table_name}({col})"
self.data_connection.exec(sql)
except (Exception, psycopg2.DatabaseError) as e:
print(f"Error : {e}")
def importFile(self, path:str, columns:list, table_name:str ):
'''
Importing csv file into postgresql database
path (str) : csv file location
columnns (list) : List of column name that want to be inserted by each values from csv files
table_name (str) : table name
'''
try:
col = ','.join(columns)
sql =f'''copy {table_name}({col})
FROM {path}
DELIMITER ','
CSV HEADER;'''
self.data_connection.exec(sql)
except (Exception, psycopg2.DatabaseError) as e:
print(f"Error : {e}")
In importFile function, I’m using COPY command from PostgreSQL that can import CSV files into the table on your server. You can do this easily with pgadmin4, dbeaver, or another GUI-based software, but since I’m working on this in the WSL2 Ubuntu environment, I had to use a python script to import CSV data into my database.
data_explorer.py
data_explorer.py contain DataExplorer class which is used to perform importing data into DataFrame in order to explore data using Python libraries or perform predictive models in the future and perform a simple visualization. This file has the following methods :
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
class DataExplorer:
def __init__(self, data_connection):
self.data_connection = data_connection
def query_to_dataFrame(self, query:str):
'''
Execute a query and return the result to the DataFrame
query (str) : query to be executed
'''
data, column = self.data_connection.fetching(query)
df = pd.DataFrame(data, columns=column)
return df
def visualizer(self, df, col_x, col_y, vis:str, title:str):
'''
Create a simple data visualization
df = the dataframe
col_x = column for x axis
col_y = column for y axis
vis = type of visualization (bar, line, or scatter)
'''
vis_map = {
'bar' : sns.barplot,
'line' : sns.lineplot,
'scatter' : sns.lineplot
}
if vis not in vis_map:
print('Error, vis must be input with "bar", "line", or "scatter"')
return
vis_map[vis](x=col_x, y=col_y, data=df)
plt.title(title)
return plt.show()
main.py
main.py contains the main codes that tie everything together.
Connecting to PostgreSQL Server
from data_connection import DataConnection
from data_crud import DataCrud
# establish postgresql server connection
conn = DataConnection(
user='your-username',
password='your-password',
database='customer_credit',
)
# Connecting into PostgreSQL Sever
conn.connect()
In conn variable, I assigned the user, password, and database. I’m using PostgreSQL Server on my local computer. You can change that according to your own database credentials.
Data Model
Creating Table
I need to create a 5 table since the given data is divided into 5 tables, that is category_db, customer_data_history, education_db, marital_db, and status_db.
# Call DataCrud Class
crud = DataCrud(conn)
# create table
## Column
category_col = [
"id INTEGER PRIMARY KEY",
"card_category VARCHAR(128) UNIQUE"
]
marital_col = [
"id INTEGER PRIMARY KEY",
"marital_status VARCHAR(50) UNIQUE"
]
status_col = [
"id INTEGER PRIMARY KEY",
"status VARCHAR(50) UNIQUE"
]
education_col = [
"id INTEGER PRIMARY KEY",
"education_level VARCHAR(50) UNIQUE"
]
customer_col = [
"clientnum INTEGER PRIMARY KEY",
"idstatus INTEGER",
"customer_age INTEGER",
"gender VARCHAR(50)",
"dependent_count INTEGER",
"educationid INTEGER",
"maritalid INTEGER",
"income_category VARCHAR(50)",
"card_categoryid INTEGER",
"months_on_book INTEGER",
"total_relationship_count INTEGER",
"months_inactive_12_mon INTEGER",
"contacs_count_12_mon INTEGER",
"credit_limit NUMERIC",
"total_revolving_bal INTEGER",
"avg_open_to_buy NUMERIC",
"total_trans_amt INTEGER",
"total_trans_ct INTEGER",
"avg_utilization_ratio NUMERIC",
"FOREIGN KEY (idstatus) REFERENCES status_db(id)",
"FOREIGN KEY (educationid) REFERENCES education_db(id)",
"FOREIGN KEY (maritalid) REFERENCES marital_db(id)",
"FOREIGN KEY (card_categoryid) REFERENCES category_db(id)"
]
## Executing function to create table in database
crud.createTable('category_db', category_col)
crud.createTable('marital_db', marital_col)
crud.createTable('status_db', status_col)
crud.createTable('education_db', education_col)
crud.createTable('customer_data_history', customer_col)
now let's look at the table inside the database,
It seems all the tables have been created successfully. Currently, all the tables have no values, so we need to insert values into all tables. At this time, I will import the value inside the CSV file into each table
# Import csv files
## column
category_col_csv = ['id', 'card_category']
marital_col_csv = ["id","marital_status"]
status_col_csv = ["id", "status"]
education_col_csv = ["id", "education_level"]
customer_col_csv = [
"clientnum","idstatus","customer_age","gender","dependent_count","educationid","maritalid","income_category",
"card_categoryid","months_on_book","total_relationship_count","months_inactive_12_mon","contacs_count_12_mon",
"credit_limit","total_revolving_bal","avg_open_to_buy","total_trans_amt","total_trans_ct","avg_utilization_ratio"]
# execute importFile method
crud.importFile("'/home/firdaus/project/customer_data/category_db.csv'",category_col_csv, 'category_db')
crud.importFile("'/home/firdaus/project/customer_data/marital_db.csv'",marital_col_csv, 'marital_db')
crud.importFile("'/home/firdaus/project/customer_data/status_db.csv'",status_col_csv, 'status_db')
crud.importFile("'/home/firdaus/project/customer_data/education_db.csv'",education_col_csv, 'education_db')
crud.importFile("'/home/firdaus/project/customer_data/customer_data_history.csv'",customer_col_csv, 'customer_data_history')
# close connection
conn.close()
Now let’s look at the database again and make sure that all the data has been inserted correctly
It seems I made a mistake naming column “contacs_count_12_mon” it should be “contacts_count_12_mon”. So I need to rename it to the proper column name.
ALTER TABLE customer_data_history RENAME COLUMN contacs_count_12_mon TO contacts_count_12_mon;
Data Exploration With PostgreSQL
For this data exploration project, I am focusing on utilizing PostgreSQL queries to hone my skills in SQL. I’m using grafana and psql terminal to do all the queries and visualization.
Joining data
Firstly, I create a pseudo-table that join all the table into one table called customer_data_history_join
CREATE VIEW customer_data_history_join AS
SELECT cdh.*,
md.marital_status,
sd.status,
cd.card_category,
ed.education_level
FROM customer_data_history cdh
JOIN marital_db md
ON md.id = cdh.maritalid
JOIN status_db sd
ON sd.id = cdh.idstatus
JOIN category_db cd
ON cd.id = cdh.card_categoryid
JOIN education_db ed
ON ed.id = cdh.educationid;
Result
customer_credit=# \d
List of relations
Schema | Name | Type | Owner
--------+----------------------------+-------+----------
public | category_db | table | firdaus
public | customer_data_history | table | firdaus
public | customer_data_history_join | view | postgres
public | education_db | table | firdaus
public | marital_db | table | firdaus
public | status_db | table | firdaus
(6 rows)
SELECT * FROM customer_data_history_join LIMIT 10;
clientnum | idstatus | customer_age | gender | dependent_count | educationid | maritalid | income_category | card_categoryid | months_on_book | total_relationship_count | months_inactive_12_mon | contacts_count_12_mon | credit_limit | total_revolving_bal | avg_open_to_buy | total_trans_amt | total_trans_ct | avg_utilization_ratio | marital_status | status | card_category | education_level
-----------+----------+--------------+--------+-----------------+-------------+-----------+-----------------+-----------------+----------------+--------------------------+------------------------+-----------------------+--------------+---------------------+-----------------+-----------------+----------------+-----------------------+----------------+-------------------+---------------+-----------------
768805383 | 1 | 45 | M | 3 | 1 | 1 | $60K - $80K | 1 | 39 | 5 | 1 | 3 | 12691 | 777 | 11914 | 1144 | 42 | 0.061 | Married | Existing Customer | Blue | High School
818770008 | 1 | 49 | F | 5 | 2 | 2 | Less than $40K | 1 | 44 | 6 | 1 | 2 | 8256 | 864 | 7392 | 1291 | 33 | 0.105 | Single | Existing Customer | Blue | Graduate
713982108 | 1 | 51 | M | 3 | 2 | 1 | $80K - $120K | 1 | 36 | 4 | 1 | 0 | 3418 | 0 | 3418 | 1887 | 20 | 0 | Married | Existing Customer | Blue | Graduate
769911858 | 1 | 40 | F | 4 | 1 | 3 | Less than $40K | 1 | 34 | 3 | 4 | 1 | 3313 | 2517 | 796 | 1171 | 20 | 0.76 | Unknown | Existing Customer | Blue | High School
709106358 | 1 | 40 | M | 3 | 3 | 1 | $60K - $80K | 1 | 21 | 5 | 1 | 0 | 4716 | 0 | 4716 | 816 | 28 | 0 | Married | Existing Customer | Blue | Uneducated
713061558 | 1 | 44 | M | 2 | 2 | 1 | $40K - $60K | 1 | 36 | 3 | 1 | 2 | 4010 | 1247 | 2763 | 1088 | 24 | 0.311 | Married | Existing Customer | Blue | Graduate
810347208 | 1 | 51 | M | 4 | 4 | 1 | $120K + | 2 | 46 | 6 | 1 | 3 | 34516 | 2264 | 32252 | 1330 | 31 | 0.066 | Married | Existing Customer | Gold | Unknown
818906208 | 1 | 32 | M | 0 | 1 | 3 | $60K - $80K | 3 | 27 | 2 | 2 | 2 | 29081 | 1396 | 27685 | 1538 | 36 | 0.048 | Unknown | Existing Customer | Silver | High School
710930508 | 1 | 37 | M | 3 | 3 | 2 | $60K - $80K | 1 | 36 | 5 | 2 | 0 | 22352 | 2517 | 19835 | 1350 | 24 | 0.113 | Single | Existing Customer | Blue | Uneducated
719661558 | 1 | 48 | M | 2 | 2 | 2 | $80K - $120K | 1 | 36 | 6 | 3 | 3 | 11656 | 1677 | 9979 | 1441 | :
Data Cleaning
Handling Missing Values
Let’s assess if there are any missing values present in our data, I’m using the FILTER clause in my query to filter all null values in all columns and count all of them.
SELECT
COUNT(*) FILTER (WHERE clientnum IS NULL OR
idstatus IS NULL OR
customer_age IS NULL OR
gender IS NULL OR
dependent_count IS NULL OR
educationid IS NULL OR
maritalid IS NULL OR
income_category IS NULL OR
card_categoryid IS NULL OR
months_on_book IS NULL OR
total_relationship_count IS NULL OR
months_inactive_12_mon IS NULL OR
contacts_count_12_mon IS NULL OR
credit_limit IS NULL OR
total_revolving_bal IS NULL OR
avg_open_to_buy IS NULL OR
total_trans_amt IS NULL OR
total_trans_ct IS NULL OR
avg_utilization_ratio IS NULL OR
marital_status IS NULL OR
status IS NULL OR
card_category IS NULL OR
education_level IS NULL
) AS missing_count
FROM customer_data_history_join cdhj;
This SQL query will count the number of records in the “customer_data_history_join” table that has a missing value in one or more of the columns specified (clientnum, idstatus, customer_age, gender, dependent_count, educationid, maritalid, income_category, card_categoryid, months_on_book, total_relationship_count, months_inactive_12_mon, contacts_count_12_mon, credit_limit, total_revolving_bal, avg_open_to_buy, total_trans_amt, total_trans_ct, avg_utilization_ratio, marital_status, status, card_category, education_level). The result will be stored in the “missing_count” column.
missing_count
---------------
0
(1 row)
It seems, our data has no missing values, so far we’re good to go.
Duplicate Value
In order to ensure the integrity of our data, it’s important whether our data has duplicate value or not. Duplicate value could lead to inconsistency and error in the analysis, which could affect the accuracy of making business decisions
SELECT (cdhj.*)::text as column, count(*)
FROM customer_data_history_join cdhj
GROUP BY cdhj.*
HAVING count(*) > 1;
This query is counting the occurrences of each record in the customer_data_history_join and aggregates the count of each unique combination column. The output of this query will be a list of columns, with the count of occurrences for each unique combination of values in those columns. The HAVING clause is to filter the result to show the column that has a count greater than 1 that indicates a duplicate. (cdhj.*)::text syntax is CAST the entire record as text to ensure that each record in all columns is considered as one single unique value in the aggregation
result
column | count
--------+-------
(0 rows)
It seems our data has no duplicate values.
After a thorough analysis of the customer data, it has been found that all the columns in the customer_data_history_join table have no unusual data. The values in each column such as marital_status, status, education_level, and card_category, adhere to the expected range of values, and there are no anomalies present. This ensures that the data is reliable and can be used to make informed decisions for the credit services offered by the bank.
The consistency and integrity of the data are critical to the bank’s operations, and this analysis confirms that the data is in good shape for further analysis and utilization.
Exploring Demographic Insight
I use Grafana for further data exploration and data visualization
For data visualization, you could visit my Dashboard here
Customer Attired Analysis by Gender
select
count(*) filter (where status = 'Attrited Customer') as attrited,
count(*) filter (where status = 'Existing Customer') as existing
from customer_data_history_join;
This query returns the total number of customers in the customer_data_history_join table who have either “Attrited Customer” or “Existing Customer” status. The result will have two columns, “attrited” and “existing” that show the count of customers who have either status.
attrited | existing
----------+----------
1627 | 8500
(1 row)
From that data, you can see that there is a difference in the count of attrited and existing customers, let’s break down the attired customer based on gender
with base_q as (
select customer_age as age, gender,
count(*) filter (where status = 'Attrited Customer') as attrited
from customer_data_history_join
group by 1,2
order by 3 desc
)
select round(sum(attrited) filter (where gender = 'M')/
sum(attrited) * 100, 1) as male_attired_percent,
round(sum(attrited) filter (where gender = 'F')/
sum(attrited) * 100, 1) as female_attired_percent
from base_q;
This query calculates the percentage of attrited customers among all customers, filtered by gender.
male_attired_percent | female_attired_percent
----------------------+------------------------
42.8 | 57.2
(1 row)
Based on this data, it can be concluded that the percentage of male customers who have attrited is 42.8% and the percentage of female customers who have attrited is 57.2%. This information can be used to make data-driven decisions to improve customer retention and satisfaction, such as identifying the reasons why male and female customers in these specific income categories are more likely to attrite and address those factors. Additionally, this information can be used to target specific customer segments for marketing and retention efforts.
Distribution of Attrited Customers and Existing Customers By Gender and Income Category
SELECT income_category, gender,
COUNT(*) FILTER(WHERE status = 'Attrited Customer') AS attrited,
COUNT(*) FILTER(WHERE status = 'Existing Customer') AS existing
FROM customer_data_history_join
WHERE gender = 'M'
GROUP BY 1, 2
ORDER BY 3 DESC;
income_category | gender | attrited | existing
-----------------+--------+----------+----------
$80K - $120K | M | 242 | 1293
$60K - $80K | M | 189 | 1213
$120K + | M | 126 | 601
$40K - $60K | M | 105 | 671
Less than $40K | M | 30 | 247
Unknown | M | 5 | 47
(6 rows)
SELECT income_category, gender,
COUNT(*) FILTER(WHERE status = 'Attrited Customer') AS attrited,
COUNT(*) FILTER(WHERE status = 'Existing Customer') AS existing
FROM customer_data_history_join
WHERE gender = 'F'
GROUP BY 1, 2
ORDER BY 3 DESC;
income_category | gender | attrited | existing
-----------------+--------+----------+----------
Less than $40K | F | 582 | 2702
Unknown | F | 182 | 878
$40K - $60K | F | 166 | 848
(3 rows)
This SQL query aggregates the data of male customers in the customer_data_history_join table, grouping the data by the income_category and gender columns. It then counts the number of "Attrited Customer" and "Existing Customer" in each group and returns the results. The query filters the data to only include male customers and orders the results by the number of attrited customers in descending order.
This information could suggest that male customers in the $80K — $120K income category are more likely to leave the bank, while female customers in the Less than $40K category are also prone to attrition. This insight could help the bank understand which customer segments are at a higher risk of attrition and target them with retention strategies. It’s important to note that this is just a snapshot of the data and further analysis and investigation are required to confirm these patterns and their underlying causes.
Attrited Customer by Customer Age in Each Gender
select customer_age,
count(*) filter (where gender = 'F') as female_count
from customer_data_history_join
where status = 'Attrited Customer'
group by 1
order by 2 desc
limit 10;
select customer_age,
count(*) filter (where gender = 'M') as male_count
from customer_data_history_join
where status = 'Attrited Customer'
group by 1
order by 2 desc
limit 10;
This SQL query aggregates “Attrited Customer” in the customer_data_history_join table that is grouped by customer_age column and counts the number of Female in each group, order by count in descending order, and return the top 10 result
customer_age | female_count
--------------+--------------
44 | 56
46 | 49
48 | 48
54 | 46
47 | 45
43 | 45
41 | 44
45 | 43
49 | 43
51 | 40
(10 rows)
customer_age | male_count
--------------+------------
43 | 40
48 | 37
49 | 36
45 | 36
40 | 35
50 | 34
46 | 33
41 | 32
47 | 31
52 | 31
(10 rows)
The highest count of Female Customers that attrited is 44 years old and The highest count of Male Customers that attrited is 43 years old
Attrited Customer by Card Category in Each gender
select card_category,
count(*) filter (where gender = 'M') as Male,
count(*) filter (where gender = 'F') as Female
from customer_data_history_join
where status = 'Attrited Customer'
group by 1
order by 1 desc;
This query will return the number of customers who have attrited and are Male or Female for each card category. The result will be ordered by the card category in descending order.
card_category | male | female
---------------+------+--------
Silver | 54 | 28
Platinum | 1 | 4
Gold | 13 | 8
Blue | 629 | 890
(4 rows)
The highest distribution of card categories among male customers is the Blue category, with 629 customers. For female customers, the highest distribution is also the Blue category, with 890 customers. The distribution of the Platinum and Gold categories is relatively low for both male and female customers.
Analyzing the average “months on the book” for different age ranges and comparing it between attrited and existing customers
WITH base_q AS (
SELECT
CASE
WHEN customer_age BETWEEN 18 AND 25 THEN '18-25'
WHEN customer_age BETWEEN 26 AND 35 THEN '26-35'
WHEN customer_age BETWEEN 36 AND 45 THEN '36-45'
WHEN customer_age BETWEEN 46 AND 55 THEN '46-55'
WHEN customer_age BETWEEN 56 AND 65 THEN '56-65'
ELSE '65+'
END AS age_range,
status,
AVG(months_on_book) AS avg_month_on_book
FROM customer_data_history_join
GROUP BY 1,2
)
SELECT
age_range,
AVG(avg_month_on_book) FILTER (WHERE status = 'Attrited Customer') AS avg_month_on_book_attrited,
AVG(avg_month_on_book) FILTER (WHERE status = 'Existing Customer') AS avg_month_on_book_existing
FROM base_q
GROUP BY 1
ORDER BY 2 DESC;
This query is calculating the average “months on book” for different age ranges and compares it between attrited and existing customers. The age ranges are defined as 18–25, 26–35, 36–45, 46–55, 56–65, and 65+. The results show the average “months on book” for each age range, separated by whether the customer is attrited or an existing customer.
age_range | avg_month_on_book_attrited | avg_month_on_book_existing
-----------+----------------------------+----------------------------
65+ | 54.0000000000000000 | 52.3750000000000000
56-65 | 46.1052631578947368 | 46.2535971223021583
46-55 | 38.7427325581395349 | 38.8001160429358863
36-45 | 31.8960396039603960 | 31.8242984693877551
26-35 | 25.6885245901639344 | 24.5759096612296110
(5 rows)
On the result, we can see that the average of the month on the book of the attrited customers is decreased as customers get younger. And from this data, we could understand how each age range will likely leave the bank.
How many customers for each age range group?
SELECT
CASE
WHEN customer_age BETWEEN 18 AND 25 THEN '18-25'
WHEN customer_age BETWEEN 26 AND 35 THEN '26-35'
WHEN customer_age BETWEEN 36 AND 45 THEN '36-45'
WHEN customer_age BETWEEN 46 AND 55 THEN '46-55'
WHEN customer_age BETWEEN 56 AND 65 THEN '56-65'
ELSE '65+'
END AS age_range,
COUNT(*) FILTER (where idstatus = 1) as Existing_Customer,
COUNT(*) FILTER (where idstatus = 2) as Attrited_Customer
FROM customer_data_history_join cdhj
GROUP BY 1
ORDER BY 1;
This query is calculating the count of customers for different age range and compared it between Existing and Attrited Customers. The age ranges are defined as 18–25, 26–35, 36–45, 46–55, 56–65, and 65+. The results show the count of customers for each age range, separated by whether the customer is attrited or an existing customer.
age_range | existing_customer | attrited_customer
-----------+-------------------+-------------------
26-35 | 797 | 122
36-45 | 3136 | 606
46-55 | 3447 | 688
56-65 | 1112 | 209
65+ | 8 | 2
(5 rows)
From this perspective, we can see that the largest customers count is in the age group 46–55 and 36–45. This information can be used to help identify potential areas to focus on for reducing customer attrition.
Income Category in Attrited Customer and Existing Customer for Each Category
SELECT
CASE
WHEN customer_age BETWEEN 18 AND 25 THEN '18-25'
WHEN customer_age BETWEEN 26 AND 35 THEN '26-35'
WHEN customer_age BETWEEN 36 AND 45 THEN '36-45'
WHEN customer_age BETWEEN 46 AND 55 THEN '46-55'
WHEN customer_age BETWEEN 56 AND 65 THEN '56-65'
ELSE '65+'
END AS age_range,
count(*) filter (where income_category = 'Unknown') as unknown,
count(*) filter (where income_category = 'Less than $40K') as less_than_40K,
count(*) filter (where income_category = '$40K - $60K') as between_40_and_60K,
count(*) filter (where income_category = '$60K - $80K') as between_60_and_80,
count(*) filter (where income_category = '$80K - $120K') as between_80_and_120K,
count(*) filter (where income_category = '$120K +') as more_than_120K
from customer_data_history_join cdhj
where status = 'Existing Customer'
group by 1 order by 1;
SELECT
CASE
WHEN customer_age BETWEEN 18 AND 25 THEN '18-25'
WHEN customer_age BETWEEN 26 AND 35 THEN '26-35'
WHEN customer_age BETWEEN 36 AND 45 THEN '36-45'
WHEN customer_age BETWEEN 46 AND 55 THEN '46-55'
WHEN customer_age BETWEEN 56 AND 65 THEN '56-65'
ELSE '65+'
END AS age_range,
count(*) filter (where income_category = 'Unknown') as unknown,
count(*) filter (where income_category = 'Less than $40K') as less_than_40K,
count(*) filter (where income_category = '$40K - $60K') as between_40_and_60K,
count(*) filter (where income_category = '$60K - $80K') as between_60_and_80,
count(*) filter (where income_category = '$80K - $120K') as between_80_and_120K,
count(*) filter (where income_category = '$120K +') as more_than_120K
from customer_data_history_join cdhj
where status = 'Attrited Customer'
group by 1 order by 1;
This SQL query will group customers based on their age range, and for each age range it will calculate the count of customers with different income categories. The query first uses a CASE statement to categorize the customer_age into different age ranges, and then uses the count() function with a filter clause to count the number of customers in each income category for each age range. Finally, the query groups the results by the age_range column.
--Existing Customer
age_range | unknown | less_than_40k | between_40_and_60k | between_60_and_80 | between_80_and_120k | more_than_120k
-----------+---------+---------------+--------------------+-------------------+---------------------+----------------
26-35 | 117 | 314 | 163 | 97 | 76 | 30
36-45 | 313 | 1074 | 573 | 490 | 501 | 185
46-55 | 361 | 1132 | 549 | 488 | 593 | 324
56-65 | 132 | 426 | 231 | 138 | 123 | 62
65+ | 2 | 3 | 3 | 0 | 0 | 0
(5 rows)
--Attrited Customer
age_range | unknown | less_than_40k | between_40_and_60k | between_60_and_80 | between_80_and_120k | more_than_120k
-----------+---------+---------------+--------------------+-------------------+---------------------+----------------
26-35 | 14 | 57 | 17 | 14 | 14 | 6
36-45 | 57 | 227 | 118 | 85 | 81 | 38
46-55 | 90 | 249 | 104 | 58 | 123 | 64
56-65 | 25 | 78 | 32 | 32 | 24 | 18
65+ | 1 | 1 | 0 | 0 | 0 | 0
(5 rows)
Based on these insights, the company can focus on retaining customers in the 46–55 age range, as this age range has the highest number of customers across all income categories. The company can also focus on retaining customers in the 36–45 age range, especially those with an income of less than 40K.
Average Months on Book for Each Income Category
WITH base_q AS (
SELECT
income_category,
status,
AVG(months_on_book) AS avg_month_on_book
FROM customer_data_history_join
GROUP BY 1,2
)
SELECT
income_category,
AVG(avg_month_on_book) FILTER (WHERE status = 'Attrited Customer') AS avg_mob_attrited,
AVG(avg_month_on_book) FILTER (WHERE status = 'Existing Customer') AS avg_mob_existing
FROM base_q
GROUP BY 1
ORDER BY 1 DESC;
This query will find the average of the “months_on_book” column for each combination of “income_category” and “status” by using a common table expression (CTE) named “base_q”.
Then, it will select the average “months_on_book” for each income category and status, specifically “Attrited Customer” and “Existing Customer”, and display the result with the “income_category” column in descending order.
income_category | avg_mob_attrited | avg_mob_existing
-----------------+---------------------+---------------------
Unknown | 36.6844919786096257 | 35.5654054054054054
Less than $40K | 36.0392156862745098 | 35.8491013903017972
$80K - $120K | 36.3223140495867769 | 36.0278422273781903
$60K - $80K | 36.0740740740740741 | 35.5226710634789777
$40K - $60K | 35.7749077490774908 | 35.8907175773535221
$120K + | 36.8492063492063492 | 36.9001663893510815
(6 rows)
From the data above, we can see, average months on book in different income category relatively similar
Purchasing Pattern
select
total_trans_amt
from customer_data_history_join cdhj where idstatus = 1;
select
total_trans_amt
from customer_data_history_join cdhj where idstatus = 2;
The Total Annual Transaction Amount varies a lot and forms a multimodal distribution
Insight
- The highest distribution of customers is for those in the 36–45 age range.
- The majority of customers hold a Blue card.
- The average months on book for attrited customers is slightly higher compared to existing customers.
- There is not much difference in the average months on book between the different income categories.
- Based on the data, 43% of the customers who attrited are male with an income category of $80K — $120K.
- Based on the analysis, it can be concluded that 57% of the attrited customers are female and have an income category of less than $40K.
It seems Age and income could be contributing factors to customer attrition, So we could do an auto screening for a customer that has income less than $40K and an age range between 36–55 to the attrited potential customer, and put concern on the male customer that’s in $80K -= 120K income category