Data Science in Business
A real-world scenario illustrating the importance of communication, domain knowledge, and the ability to make swift decisions in data science.
“There is a saying, ‘A jack of all trades and a master of none.’ When it comes to being a data scientist you need to be a bit like this, but perhaps a better saying would be, ‘A jack of all trades and a master of some.’” — Brendan Tierney
I believe the word some in the above quote includes communication and domain knowledge. You might have read many articles focusing on the technical facets of data science. In this article, we will discuss some not-so-technical facets that data scientists encounter in their day-to-day lives by walking through a scenario.
I work as a data practitioner for the online department of Eastside, a large retail company. My manager passes by my desk on his way to a meeting, asks me to figure out “our best customers,” and leaves in a flash.
What does best mean here? Does it mean the customers who have spent the most, or the customers who buy most often? Notice that spending the most and making the most purchases are two completely different things.
The situation above is a common occurrence in the field of data: the use of fuzzy (vague) language. More often than not, we hear people express their ideas in natural language that sounds fine at first but, on close inspection, is ill-defined.
In the situation above, you can see how poor communication can have an adverse impact. A LinkedIn study found that communication is the most sought-after soft skill. Even though my manager was not precise in his request, I could have sought clarification. If we find out the end goal of the request, i.e., why he wants to know the best customers, we can decide on our approach.
When I follow up with my manager, he explains that there is $1,000 left in the marketing budget, and he wants to use that money to convert some physical-store customers into online customers by emailing them free coupons. One caveat: we should not steal the physical store's customers, as that may create a problem for the physical store's head. He also mentions that this task must be accomplished within two hours!
This is where domain knowledge comes into the picture. What does stealing customers mean? It means we should not send coupons to active customers of the physical store, as that may discourage them from visiting it. Instead, we can send coupons to some of the best churned customers.
Customer churn means that a customer ceases to be a customer. (I bought a Netflix subscription for 3 months and unsubscribed from it later; I am a churned customer.)
I explained to my manager that we will consider a customer churned if they have not purchased anything from the physical store in the past 3 months. (Most customers buy only groceries from the physical store, so it is safe to assume that someone who has not purchased anything in the past 3 months has churned.) My manager agrees and gives me the dataset of all the physical store's customers.
You may think this definition of churned customers is not perfect. Situations arise in data science where we cannot know the truth, whether due to time constraints or the inability to measure it. Instead, we use approximations that are close to the truth, called proxies. When a request is urgent, it is common to use proxies.
Let us explore the physical store's data.
import pandas as pd
import datetime

data = pd.read_csv("/content/es_phy_store.txt")
The Output —
There are 125,000 rows in total and 3 columns. Our goal is to return the IDs of the best customers. We can also see that none of the columns contain null values, indicating that the data is clean.
To find the churned customers, we will group the data by customer_id and find the latest transaction_date of each customer.
group_by_customer = data.groupby("customer_id")
last_transaction = group_by_customer["transaction_date"].max()
Since customers are considered churned if their last transaction was three or more months ago, we will create a cutoff date of May 1st, 2020, and label the customers accordingly.
We will create a separate data frame called best_churn, which consists of the transaction_date and a boolean column churned denoting whether the customer has churned or not.
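The labeling step can be sketched as follows. The tiny inline data frame below is invented for illustration and stands in for the article's es_phy_store.txt; the May 1st, 2020 cutoff is the one from the text.

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the article's es_phy_store.txt.
data = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "transaction_date": pd.to_datetime([
        "2020-01-10", "2020-03-15",                 # customer 1: inactive since March
        "2020-07-01",                               # customer 2: recent purchase
        "2020-02-01", "2020-04-20", "2020-06-25",   # customer 3: recent purchase
    ]),
    "transaction_amount": [40, 60, 25, 30, 55, 20],
})

group_by_customer = data.groupby("customer_id")
last_transaction = group_by_customer["transaction_date"].max()

# Anyone whose last purchase falls before the cutoff is labeled churned (1).
cutoff_day = pd.Timestamp(2020, 5, 1)
best_churn = pd.DataFrame(last_transaction)
best_churn["churned"] = (best_churn["transaction_date"] < cutoff_day).astype(int)

print(best_churn)
```

With this toy data, only customer 1 (last purchase in March) is labeled churned.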
Ranking the Customers
We have found the churned customers, but the main aim is to find the best churned customers. First, we need to rank the customers based on some criteria, and then we need to find a threshold to identify the best ones.
Due to the time constraints, we cannot use a complex ML/DL model. Instead, we can use a simple weighted-sum model to score customers. This model assigns a number (score) to each customer denoting how good they are. In our case, we need to consider two criteria: the amount spent and the number of purchases made. Both are given the same weight, i.e., a customer who spends a lot is considered equivalent to a customer who makes many purchases. So we can define the customer score as:
Score = (1/2 × Number of purchases) + (1/2 × Amount spent)
For example, if a customer made 2 purchases worth $500 his score would be (1/2 × 2) + (1/2 × 500) = 251.
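As a quick sanity check, the score formula can be expressed as a one-line helper (the function name is mine, not from the article):

```python
def customer_score(n_purchases, amount_spent):
    """Weighted-sum score with equal weights on both criteria."""
    return 0.5 * n_purchases + 0.5 * amount_spent

# The worked example from the text: 2 purchases worth $500 in total.
print(customer_score(2, 500))  # → 251.0
```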
Let us find the number of transactions per customer and store it in a separate column. This can be accomplished by grouping the data by customer_id and using the size() method. We can also find the total amount spent by using the sum() method on the transaction_amount column. We will also drop the transaction_date column, which is no longer required.
best_churn["no_of_transactions"] = group_by_customer.size()
best_churn["amount_spent"] = group_by_customer["transaction_amount"].sum()
best_churn = best_churn.drop("transaction_date", axis=1)
Everything seems good, but if we take a closer look at the formula, we notice a defect. We saw that a customer who spent $500 across 2 purchases scored 251. A customer who spent $400 across 20 purchases would score only 210, which seems unfair: the second customer is more regular than the first and shows more potential to spend in the long run. This happens mainly for two reasons.
1) The amount spent is on a much larger scale than the number of transactions, so it dominates the score.
2) We are using the same weights for both criteria.
Let us find the minimum and maximum of the number of transactions and the amount spent in the best_churn data frame.
best_churn[["no_of_transactions", "amount_spent"]].describe().loc[["min", "max"]]
We can see that the number of transactions is far smaller in magnitude than the amount spent. To overcome this, we will use min-max scaling, which lets us compare values on different scales in a meaningful way. The formula for min-max scaling is: scaled value = (value − min) / (max − min).
Let us apply the above formula to the no_of_transactions and amount_spent columns, compute the score using the scaled values, and sort the data frame by score.
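A minimal sketch of the scaling and scoring step, using a tiny hand-made best_churn frame (the data values are invented for illustration; the ×100 on the score is just to make it easier to read):

```python
import pandas as pd

# Invented example frame; in the article this comes from the grouped store data.
best_churn = pd.DataFrame({
    "churned": [1, 1, 0],
    "no_of_transactions": [2, 20, 5],
    "amount_spent": [500, 400, 150],
}, index=pd.Index([1, 2, 3], name="customer_id"))

# Min-max scale each criterion to [0, 1] so both sit on the same scale.
for col in ["no_of_transactions", "amount_spent"]:
    col_min, col_max = best_churn[col].min(), best_churn[col].max()
    best_churn["scaled_" + col] = (best_churn[col] - col_min) / (col_max - col_min)

# Equal-weight score on the scaled values, then rank best first.
best_churn["score"] = 100 * (0.5 * best_churn["scaled_no_of_transactions"]
                             + 0.5 * best_churn["scaled_amount_spent"])
best_churn = best_churn.sort_values("score", ascending=False)

print(best_churn[["churned", "score"]])
```

Notice that after scaling, the $400 / 20-purchase customer now outranks the $500 / 2-purchase customer, which matches the intuition about regular buyers discussed above.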
How do we find out the threshold score value?
Should we choose the first 20 customers, the first 50, or the top 10%? What should the criterion be? Again, domain knowledge plays a crucial role here. We know that the budget is $1,000. The coupon value is not specified, so we must decide it ourselves. It cannot be too high, because a higher coupon value means fewer customers can receive one.
We know that a 30% discount on one transaction is a pretty decent deal. So, let us find the mean value of all 125,000 transactions in the initial data frame and take 30% of it.
coupon = data["transaction_amount"].mean() * 0.3
Output — 19.4976
Let us round this up to 20, so each coupon is worth $20. We know that our budget is $1,000, and 1000 / 20 = 50. Hence we can select the top 50 customers from the best_churn data frame whose churned value is 1 and mail them the coupons.
top_50_churned = best_churn.loc[best_churn["churned"] == 1].head(50)
In this article, we looked at the importance of communication and how the use of fuzzy language can be a hindrance. We also took a real-life scenario and solved a problem that had many constraints and required communication skills, domain knowledge, and quick decision-making. I hope you learned something new today.
If you would like to get in touch, connect with me on LinkedIn.