Data Bank — Where SQL Queries are My Deposits

Case Study 4 of Danny Ma’s #8WeekSQLChallenge

8 min readOct 23, 2023

INTRODUCTION

Hi! If you’ve been following this series of articles, you’ll know I took quite the hiatus. Life. We’ll just get right back into it.

This is the fourth installment in my #8WeekSQLChallenge series by Danny Ma. This project is important, as it was used to calculate growth metrics, customers’ activities over specified periods of time and the amount of cloud data customers will be entitled to, based on their activity. For this project, I opted to use PostgreSQL for the first time, instead of MySQL which I used for the first three.

Interested? Let’s begin.

THE PROBLEM

Data Bank is a Neo-bank with no physical branches (think Kuda, if you’re Nigerian). As such, it isn’t only for banking activities, they also have the world’s most secure distributed data storage platform!

Customers are allocated cloud data storage limits which are directly linked to how much money they have in their accounts. The management team at Data Bank want to increase their total customer base — but also need some help tracking just how much data storage their customers will need.

THE DATA

Danny provided three tables: regions, customer_nodes and customer_transactions, connected as shown in the following entity relationship diagram (ERD):

**The Entity Relationship Diagram (ERD)**

regions: This table represents the continents of the world with their unique identifiers. Note that Antartica was not listed (obviously), and North and South America were merged into one America.
customer_nodes: Customers are randomly distributed across the nodes according to their region — this also specifies exactly which node contains both their cash and data. This random distribution changes frequently to reduce the risk of hackers getting into Data Bank’s system and stealing customer’s money and data.
customer_transactions: This table stores all customer transactions, which could be deposits, withdrawals and purchases, made using their Data Bank debit card.

DATA EXPLORATION

This study was split into 5 sections, but the fifth will be discussed in the next part of this article.

Customer Nodes Exploration
Customer Transactions
Data Allocation Challenge
Extra Challenge
Extension Request

To save time to discuss other longer parts of this project, I’ll only write my answers to the first 4 questions in the Customer Nodes Exploration and Customer Transactions sections.

A. Customer Nodes Exploration

How many unique nodes are there on the Data Bank system?

SELECT COUNT(DISTINCT node_id)
FROM customer_nodes;

**There are 5 unique nodes on the Data Bank system**

2. What is the number of nodes per region?

SELECT r.region_name, COUNT(DISTINCT n.node_id)
FROM regions as r
JOIN customer_nodes as n
ON r.region_id = n.region_id
GROUP BY 1
ORDER BY 2;

3. How many customers are allocated to each region?

SELECT r.region_name, COUNT(DISTINCT n.customer_id)
FROM regions as r
JOIN customer_nodes as n
ON r.region_id = n.region_id
GROUP BY 1
ORDER BY 2 DESC;

**Australia and Europe were allocated the highest and lowest number of customers respectively**

4. How many days on average are customers reallocated to a different node?

SELECT CONCAT(CEIL(AVG(day_to_change)), ' days') as avg_day_to_change
FROM
(SELECT customer_id, start_date, node_id,
LEAD(start_date) OVER (PARTITION BY customer_id ORDER BY start_date) 
  AS next_date,
CASE WHEN LEAD(start_date) OVER (PARTITION BY customer_id ORDER BY start_date) IS NOT NULL 
THEN (LEAD(start_date) OVER (PARTITION BY customer_id ORDER BY start_date) - start_date) 
  ELSE 0 END AS day_to_change
FROM customer_nodes) T1;

Customers are reallocated every fortnight on average

B. Customer Transactions

What is the unique count and total amount for each transaction type?

SELECT txn_type, COUNT(DISTINCT customer_id) AS unique_count,
SUM(txn_amount) AS total
FROM customer_transactions
GROUP BY 1
ORDER BY 2 DESC;

Deposits were the most transactions made, and made the highest amount

2. What is the average total historical deposit counts and amounts for all customers?

SELECT ROUND(AVG(deposit_count)) AS avg_deposit_count, 
ROUND(AVG(deposit_amount)) AS avg_deposit_amount
FROM
(SELECT customer_id, COUNT(txn_type) AS deposit_count, SUM(txn_amount) AS deposit_amount
FROM customer_transactions
WHERE txn_type = 'deposit'
GROUP BY customer_id) deposit_txns;

**Customers made 5 deposits on average, with $2718 as the average amount**

3. For each month — how many Data Bank customers make more than 1 deposit and either 1 purchase or 1 withdrawal in a single month?

WITH type_counts AS
(SELECT customer_id, 
COUNT(CASE WHEN txn_type = 'deposit' THEN customer_id END) AS deposit_count,
COUNT(CASE WHEN txn_type = 'purchase' THEN customer_id END) AS purchase_count,
COUNT(CASE WHEN txn_type = 'withdrawal' THEN customer_id END) AS withdrawal_count
FROM customer_transactions
GROUP BY customer_id)
SELECT TO_CHAR(c.txn_date, 'Month') as month_name,
CONCAT(COUNT(DISTINCT t.customer_id), ' customers') AS ret_customers
FROM type_counts t
JOIN customer_transactions c
ON t.customer_id = c.customer_id
WHERE t.deposit_count > 1 AND (t.purchase_count > 0 OR t.withdrawal_count > 0)
GROUP BY 1
ORDER BY 2 DESC;

**January had the highest amount of returning customers, by this metric**

4. What is the closing balance for each customer at the end of the month?

WITH customer_funds AS
(SELECT customer_id, TO_CHAR(txn_date, 'Month') as month_name,
SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -1 * txn_amount END) AS total_amount
FROM customer_transactions
GROUP BY customer_id, 2
ORDER BY 1)
SELECT customer_id, month_name,
SUM(total_amount) OVER(PARTITION BY customer_id ORDER BY month_name) AS closing_balance
FROM customer_funds;

This is a section of each customer’s monthly closing balance

C. Data Allocation Challenge

The Data Bank team wanted to run an experiment where different groups of customers would be allocated data using 3 different options:

Option 1: data is allocated based off the amount of money at the end of the previous month using a running customer balance column that includes the impact of each transaction.
Option 2: data is allocated on the average amount of money kept in the account in the previous 30 days using customer balance at the end of each month.
Option 3: data is updated real-time using minimum, average and maximum values of the running balance for each customer.

This is to help the Data Bank team estimate how much data will need to be provisioned for each option.

The queries are really long! So to make it easier, I’ll only add the queries to generate the first elements. The queries to determine the allocations will be on my GitHub.

-- running customer balance column that includes the impact of each transaction
WITH amounts AS
(SELECT customer_id, EXTRACT('Month' FROM txn_date) as month_num, txn_type, txn_amount, 
CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -1 * txn_amount END AS total_amount
FROM customer_transactions
ORDER BY customer_id)

SELECT customer_id, month_num, total_amount, 
SUM(total_amount) OVER (PARTITION BY customer_id, month_num 
 ORDER BY month_num ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
   AS running_balance
FROM amounts;

-- customer balance at the end of each month
SELECT customer_id, EXTRACT('Month' FROM txn_date) as month_num, 
SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -1 * txn_amount END) AS monthly_balance
FROM customer_transactions
GROUP BY 1, 2
ORDER BY customer_id
;

-- minimum, average and maximum values of the running balance for each customer
WITH amounts AS
(SELECT customer_id, EXTRACT('Month' FROM txn_date) as month_num, txn_type, txn_amount, 
CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -1 * txn_amount END AS total_amount
FROM customer_transactions
ORDER BY customer_id),

balances AS 
(SELECT customer_id, month_num, total_amount, 
SUM(total_amount) OVER (PARTITION BY customer_id, month_num 
 ORDER BY month_num ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
   AS running_balance
FROM amounts)

SELECT customer_id, 
MIN(running_balance) as min_bal, 
ROUND(AVG(running_balance),2) AS avg_bal, 
MAX(running_balance) AS max_bal
FROM balances
GROUP BY 1;

**From top to bottom, each table shown in the picture above determines the allocation based on Options 1, 2 and 3**

This was a very tricky section but in summary: Data Bank wanted to use the elements listed above to determine the most efficient way to allocate cloud storage space to their customers. Would that be running balance, average monthly balance or min-avg-max running balance? For me, it was Option 1 (running balance, table 1). I felt so for two reasons:

It was more equitable, as the figures weren’t far off from one another.
Using an average to determine the allocation could be heavily skewed by a huge purchase (which would add disproportionately) or a huge withdrawal (which would do the opposite).

D. Extra Challenge

Data Bank wanted to determine allocations based on a 6% per annum interest rate. I had to whip out some secondary school mathematics for this one. Recall the formula for simple interest? It was very useful here.

WITH monthly_balances AS
(SELECT customer_id, EXTRACT('Month' FROM txn_date) as month_num, 
 
SUM(CASE WHEN txn_type = 'deposit' THEN txn_amount ELSE -1 * txn_amount END) AS monthly_balance
FROM customer_transactions
GROUP BY 1, 2
ORDER BY customer_id),

interest_earned AS
(SELECT *,
ROUND(((monthly_balance * 6 * 1)/(100.0 * 12)),2) AS interest
FROM monthly_balances
GROUP BY customer_id, month_num, monthly_balance
ORDER BY customer_id, month_num),

total_earnings AS
(SELECT customer_id, month_num, 
(monthly_balance + interest) AS earnings
FROM interest_earned
GROUP BY 1, 2, 3
ORDER BY 1, 2)

SELECT month_num,
SUM(CASE WHEN earnings < 0 THEN 0 ELSE earnings END) as allocation
FROM total_earnings
GROUP BY 1
ORDER BY 1;

**Allocations using the 6% interest rate**

Now this is an improvement on the running balance option from the last section, as it still made use of the balances (but as interest amounts), and dealt with smaller amounts, thus requiring less space. Easily the definition of efficiency.

DATA VISUALIZATION & REPORTING

This is the Extension Request section of the project. In this section, I was asked to create a PowerPoint presentation containing the key insights derived from the data and all the questions asked.

Check out the PowerPoint presentation here.

Based on the questions answered, the following recommendations were made:

To improve the security of the reallocations, the reallocation rate should reduce to an average of 7 days. If these reallocations can’t take place faster, they should be optimized to quicken during peak periods, such as around the end of year.
Customers should be duly notified of their reallocations at most 2 days before they occur.
Cloud Data storage spaces should be shared based on accrued interest for efficiency and equitability.
Saving plans should be included into customers’ payments to increase their earned interest and storage allocations, and ultimately, to earn money for Data Bank.

FINAL WORDS

This was a very difficult challenge, particularly in the Data Allocation sections. I used a lot more mathematics here, in calculating running balances, medians and percentiles, and simple interest using specified rates in SQL. It also highlighted the importance of good presentation skills. Looking forward to Case Study 5!

As always, I am open to questions and corrections on this project and collaborations on future projects.

Till my next article, ¡adiós!

PS: Follow me on Twitter and connect with me on LinkedIn. View my portfolio here. All code used and the link to the interactive report are stored in the Data Bank folder on my GitHub.