A case-study for applying Survival Analysis on real business problem.
Sample code accompanied with this article: https://github.com/vfa-phuclkh/online_retail_survival_analysis/blob/master/motorbike_online_retailer.ipynb
Over the past few years, Machine Learning has been a very useful tool in the decision-making processes for companies all over the world. In this post, I will show how to apply Survival Analysis — an interesting but not very popular approach in Machine Learning — to enhance the business of motorbike online retailer. From this case study, you will see that Survival Analysis can be applied to many problems and helps companies conduct their business more efficiently.
Imagine you’re running an online retailer that sell used motorbike. The routine business operations consist of:
- stocking the used motorbikes
- publishing them with detailed information and some photos
- responding to inquiries and order for it.
From the retailer’s website, let collect a little dataset of many ads of the same motorbike types that have been sold. It might include:
- How old is the motorbike?
- The distance it has traveled.
- The price tag for the motorbike.
- When is it sold?
- How long does it take to sold the motorbike?
- Publish motorbike’s advertisement can be automated in most part, except for setting the price where human input is needed. It took time and create a bottleneck in the process speed. If we can automate this decision as well, the entire process can be done automatically.
- More importantly, does human really know how best to set the price?
Let’s examine the money flow in the business:
- Retailer purchases old bikes -> Outgoing money 💸
- Stocking and maintenance -> Outgoing money 💸
- Sales <- Incoming money 💰
In this article, we focus on the process after purchasing, so purchase cost is considered sunk cost and will be ignored, as it can not be controlled.
In order to maximize the difference between Incoming money and Outgoing money, let consider two extreme actions:
- On one end, to minimize the “Stocking and maintenance” cost, the price should be always set at 0. This ensures the bike would be sold in no time.
- On the other end of the spectrum, set an extremely expensive price would greatly increase the “Sale” income. However, cost for stocking and maintaining increase for every single day that bikes are still in stock.
In total, we have Profit = Sales revenue- Cost. The goal is to find the sweet spot to set the price in between these two extremes using the accumulated data on the retailer’s website.
Goal: Find the best price to maximize profit.
Why not regression?
In the case of our motorbike data, Linear regression is not an appropriate approach because it can not handle censored data. More detail about censored and uncensored data is discussed later in the Survival Analysis section but in short, an action that dramatically change the property of an entry such as price down operation is not possible to be introduced in Linear regression.
Scikit-survival is an open source library built on top of Scikit-learn. This library makes it possible to apply Machine Learning method to Survival Analysis.
In statistic, Survival Analysis is used to study the time to happen of an event, such as:
- How long will it take for patients to recover from illness?
- How long will it take for industrial products to be broken? …
Unlike usual machine learning, in Survival Analysis it is possible to obtain the output as the probability of an event happen along timeline, which is called as Survival curve.
Survival curve (function)
A plot of a survival function is a series of declining horizontal step, with the vertical axis represent the probability of surviving of the population over the horizontal time axis.
Given a large enough sample size, the estimator will approach the true survival function of the population.
Survival function by types
It’s useful to analyze survival function for some specific feature.
In this case, the data is divide into 2 groups: the group that has price lower than average and the other group has price higher than the average. Then two survival functions of each group are plotted side by side.
Interestingly, the survival function for the bike that has higher-than-average price is to the left of the other. That means the population for the high-price group decreases faster and bikes in that group are more likely to be sold first.
Does that means the bike should be priced higher if we want to sell it faster?
Not so fast, correlation does not imply causation.
This may simply because the bikes that are set with a higher price is in better quality and that’s why it is sold faster.
2. Frame the motorbike data as a Survival Analysis problem
One interesting property of Survival Analysis is that it can be used even in the case when training data can only be partially observed.
In Survival Analysis there are two type of data: censored and uncensored. Censored data is when we do not know if the event happen during the observation period, while Uncensored data is when the time an event happen to the sample is observed.
In case of motorbike retail, event of interest is when the motorbikes are sold:
In current data, some samples does not have Publish_period values and it can be considered as censored data. For other samples that also have a price change operation, it can be divided into a censored and uncensored event as follow:
- The censored entry A has Published_period equals the period from beginning to when the price is changed.
- The uncensored entry B has Published_period equals the period from when the price is changed until the purchase date.
Now the Survival Analysis model treats the newly created entries as two different bike-advertisements.
3. Model Profit with Survival curves
Profit = Revenue - Cost
After a Survival Analysis estimator is fitted using the data prepared above, the plan to find the best price for maximum profit is as follow:
- Step 1: Estimate Survival curves for all the possible values over the price range of each motorbike.
- Step 2: For a survival function at price p, convert it to a table of sell amount over time.
- Step 3: Revenue is exactly the price p that it’s sold at:
- Step 4.1: Cost is modeled linearly in this example, as a constant C amount of money per day.
- Step 4.2: Intuitively, we know that Total_Cost would be equal to:
On the other hand, the average day it takes to sell a motorbike is equal to the area under the survival curve of that bike. This can be calculated as integral of Motorbikes_Remains over time t, as a result:
- Step 5: Finally, calculate Profit at price p from Revenue and Cost
Figures below plot the calculated Profit over the price range of two different bikes, with Profit in vertical axis and Price in horizontal direction:
We now have a Machine Learning model based on Survival Analysis that can automatically set the price for a new bike advertisement with the best expected profit.
Survival Analysis is an interesting approach in statistic but has not been very popular in the Machine Learning community. Through this case study, now you can add a new technique to your Machine Learning toolbox.
As powerful as the tool can get, this case study proves that having a good understanding of the business processes and the ability to apply Machine Learning techniques in a flexible manner are critical factors to the success of a Machine Learning project.