A case study in applying Survival Analysis to a real business problem.

Sample code accompanying this article: https://github.com/vfa-phuclkh/online_retail_survival_analysis/blob/master/motorbike_online_retailer.ipynb

Over the past few years, Machine Learning has become a very useful tool in the decision-making processes of companies all over the world. In this post, I will show how to apply Survival Analysis, an interesting but not very popular approach in Machine Learning, to improve the business of an online motorbike retailer. From this case study, you will see that Survival Analysis can be applied to many problems and can help companies run their business more efficiently.

Situation

Imagine you’re running an online retailer that sells used motorbikes. The routine business operations consist of:

  • stocking the used motorbikes
  • publishing them with detailed information and some photos
  • responding to inquiries and orders for them.

Data

From the retailer’s website, let’s collect a small dataset of ads for motorbikes of the same type that have been sold. It might include:

  • How old the motorbike is.
  • The distance it has traveled.
  • The price tag for the motorbike.
  • When it was sold.
  • How long it took to sell the motorbike.
Data timeline

Problems

  • Publishing a motorbike’s advertisement can be mostly automated, except for setting the price, which needs human input. This takes time and creates a bottleneck in the process. If we can automate this decision as well, the entire process can run automatically.
  • More importantly, do humans really know how best to set the price?

Business flow

Let’s examine the money flow in the business:

  • Retailer purchases old bikes -> Outgoing money 💸
  • Stocking and maintenance -> Outgoing money 💸
  • Sales <- Incoming money 💰
In this article, we focus on the process after purchasing, so the purchase cost is considered a sunk cost and is ignored, as it cannot be controlled.

Goal

In order to maximize the difference between incoming and outgoing money, let’s consider two extreme actions:

  • On one end, to minimize the “Stocking and maintenance” cost, the price would always be set at 0. This ensures the bike is sold in no time, but it brings in no revenue.
  • On the other end of the spectrum, setting an extremely high price would greatly increase the “Sales” income per bike. However, the cost of stocking and maintenance grows with every day that the bike stays in stock.

In total, we have Profit = Sales revenue - Cost. The goal is to use the data accumulated on the retailer’s website to find the sweet spot for the price between these two extremes.

Goal: Find the best price to maximize profit.

Methods

Why not regression?

In the case of our motorbike data, Linear regression is not an appropriate approach because it cannot handle censored data. Censored and uncensored data are discussed in more detail in the Survival Analysis section, but in short, Linear regression needs a known time-to-sale for every entry, so it has no way to use an ad that is still unsold or an ad whose price was lowered partway through its listing.
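
To make the censoring problem concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the exponential time-to-sale with a 30-day mean, the 40-day observation window, and the variable names are not taken from the retailer’s data. It compares averaging only the sold bikes, which is all a plain regression target can use, with a simple censoring-aware estimate.

```python
import random

random.seed(0)

TRUE_MEAN_DAYS = 30.0   # assumed average time-to-sale (illustrative)
WINDOW_DAYS = 40.0      # assumed length of the observation period
N_BIKES = 5000

# Simulate the real (partly unobservable) time-to-sale of each bike.
true_days = [random.expovariate(1.0 / TRUE_MEAN_DAYS) for _ in range(N_BIKES)]

# What we can actually record: the bike is either sold inside the window,
# or still in stock when the window closes (censored).
observed_days = [min(d, WINDOW_DAYS) for d in true_days]
sold = [d <= WINDOW_DAYS for d in true_days]

# Naive target: average over sold bikes only, dropping censored ones.
sold_days = [d for d, s in zip(observed_days, sold) if s]
naive_mean = sum(sold_days) / len(sold_days)

# Censoring-aware estimate for exponential data:
# total days in stock (sold or not) divided by the number of sales.
aware_mean = sum(observed_days) / sum(sold)

print(f"true mean:       {TRUE_MEAN_DAYS:.1f} days")
print(f"sold bikes only: {naive_mean:.1f} days")
print(f"censoring-aware: {aware_mean:.1f} days")
```

Dropping the unsold bikes throws away exactly the slow sellers, so the naive average comes out too low. The second estimate still uses the fact that each unsold bike lasted at least 40 days.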

Survival Analysis:

Scikit-survival is an open-source library built on top of Scikit-learn. It makes it possible to apply Machine Learning methods to Survival Analysis.

1. Introduction

In statistics, Survival Analysis is used to study the time until an event happens, such as:

  • How long will it take for patients to recover from illness?
  • How long will it take for industrial products to break down? …

Unlike typical machine learning models, which output a single predicted value, a Survival Analysis model can output the probability that the event has not yet happened at each point along the timeline. This output is called the Survival curve.

Survival curve (function)

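The survival function gives the probability that the event has not yet happened by time t. It starts at 1 and never increases, often tracing a reversed S shape.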

A plot of an estimated survival function is a series of declining horizontal steps, with the vertical axis representing the probability of surviving and the horizontal axis representing time.

Given a large enough sample size, the estimated curve approaches the true survival function of the population.
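
The article does not name the estimator behind this step plot. A common choice is the Kaplan-Meier estimator, so the sketch below assumes it. The function name `kaplan_meier` and the six toy observations are made up for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: days each bike was listed
    observed:  True if the sale was seen, False if the entry is censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    curve = []
    survival = 1.0
    for t in event_times:
        # Bikes still listed just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Sales that happened exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Toy data: two of the six bikes were still unsold when last seen.
days = [5, 8, 8, 12, 15, 20]
sold = [True, True, False, True, False, True]

for t, s in kaplan_meier(days, sold):
    print(f"day {t:>2}: {s:.3f}")
# day  5: 0.833
# day  8: 0.667
# day 12: 0.444
# day 20: 0.000
```

Each step down happens at an observed sale. Censored bikes never cause a step, but they do shrink the at-risk count for later steps.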

Survival function by types

It’s useful to compare survival functions across the values of a specific feature.

In this case, the data is divided into 2 groups: bikes priced lower than average and bikes priced higher than average. The survival functions of the two groups are then plotted side by side.

Survival function by price

Interestingly, the survival function for bikes with a higher-than-average price lies to the left of the other. That means the population of the high-price group decreases faster, so bikes in that group tend to be sold sooner.

Does that mean a bike should be priced higher if we want to sell it faster?

Not so fast: correlation does not imply causation.

This may simply be because bikes with a higher price are of better quality, and that is why they sell faster.
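
A toy example shows how such a reversal can arise. The eight rows below are invented for illustration, the `condition` column is a hypothetical stand-in for quality, and every bike is assumed sold so that plain averages can replace survival curves.

```python
from statistics import mean

# (condition, price, days_to_sell) -- invented toy rows, all bikes sold.
rows = [
    ("good", 140_000, 5), ("good", 150_000, 7),
    ("good", 160_000, 9), ("good", 170_000, 11),
    ("poor", 80_000, 20), ("poor", 90_000, 24),
    ("poor", 100_000, 28), ("poor", 110_000, 32),
]


def mean_days_by_price_half(rows):
    """Split rows at their mean price; return (mean days low, mean days high)."""
    cutoff = mean(price for _, price, _ in rows)
    low = [days for _, price, days in rows if price <= cutoff]
    high = [days for _, price, days in rows if price > cutoff]
    return mean(low), mean(high)


pooled = mean_days_by_price_half(rows)
good = mean_days_by_price_half([r for r in rows if r[0] == "good"])
poor = mean_days_by_price_half([r for r in rows if r[0] == "poor"])

print("all bikes :", pooled)  # (cheaper half, pricier half)
print("good only :", good)
print("poor only :", poor)
```

Pooled, the pricier half sells faster (8 days against 26). Within each condition group the pricier half sells slower (10 against 6, and 30 against 22). The pooled comparison mostly measures quality, not the effect of price.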

2. Frame the motorbike data as a Survival Analysis problem

One interesting property of Survival Analysis is that it can be used even when the training data is only partially observed.

In Survival Analysis there are two types of data: censored and uncensored. Data is censored when the event was not observed during the observation period, so we only know that it had not happened by the end of that period. Data is uncensored when the time at which the event happened to the sample was observed.

In the case of motorbike retail, the event of interest is the sale of a motorbike:

Censored and Uncensored data in Motorbike online retail.

In the current data, some samples do not have Publish_period values, and they can be treated as censored data. Other samples that include a price change operation can each be divided into a censored and an uncensored entry as follows:

Converting one sample into a censored entry and an uncensored entry.
  • The censored entry A has Published_period equal to the period from the beginning until the price is changed.
  • The uncensored entry B has Published_period equal to the period from the price change until the purchase date.

Now the Survival Analysis model treats the newly created entries as two different bike-advertisements.
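
The conversion above can be sketched as follows. Only `Published_period` comes from the article. The other field names (`publish_date`, `price_change_date`, `sold_date`, `initial_price`, `new_price`, `sold`) and the example ad are hypothetical, and the accompanying notebook may structure this differently.

```python
from datetime import date


def split_on_price_change(ad):
    """Turn one ad with a price change into two survival-analysis entries."""
    entry_a = {  # old price: no sale was observed at this price
        "price": ad["initial_price"],
        "Published_period": (ad["price_change_date"] - ad["publish_date"]).days,
        "sold": False,  # censored
    }
    entry_b = {  # new price: the sale was observed
        "price": ad["new_price"],
        "Published_period": (ad["sold_date"] - ad["price_change_date"]).days,
        "sold": True,  # uncensored
    }
    return [entry_a, entry_b]


# Hypothetical ad: listed on Jan 1, repriced on Jan 21, sold on Feb 5.
ad = {
    "publish_date": date(2020, 1, 1),
    "price_change_date": date(2020, 1, 21),
    "sold_date": date(2020, 2, 5),
    "initial_price": 150_000,
    "new_price": 130_000,
}

for entry in split_on_price_change(ad):
    print(entry)
# {'price': 150000, 'Published_period': 20, 'sold': False}
# {'price': 130000, 'Published_period': 15, 'sold': True}
```

Entry A tells the model that the bike survived 20 days at the old price without selling. Entry B records a sale after 15 days at the new price.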

3. Model Profit with Survival curves

Profit = Revenue - Cost

After a Survival Analysis estimator is fitted using the data prepared above, the plan to find the best price for maximum profit is as follows:

  • Step 1: Estimate Survival curves for all the possible values over the price range of each motorbike.
A survival curve at price p.
  • Step 2: For a survival function at price p, convert it to a table of the number of bikes sold over time.
Converting a survival function into a table of the number of bikes sold over time.
  • Step 3: Revenue is exactly the price p that the bike is sold at: Revenue(p) = p
  • Step 4.1: Cost is modeled linearly in this example, as a constant amount of money C per day.
  • Step 4.2: Intuitively, Total_Cost is equal to C multiplied by the number of days the bike stays in stock: Total_Cost = C × (number of days in stock)

On the other hand, the average number of days it takes to sell a motorbike is equal to the area under the survival curve of that bike. This can be calculated as the integral of Motorbikes_Remains over time t. As a result: Total_Cost(p) = C × ∫ Motorbikes_Remains(t) dt

  • Step 5: Finally, calculate Profit at price p from Revenue and Cost: Profit(p) = Revenue(p) - Total_Cost(p)
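
The five steps can be sketched in plain Python as below. This is a minimal illustration and not the notebook’s code. `toy_survival_curve` is a hypothetical stand-in for the fitted estimator, and its selling rate, the 365-day horizon, the daily cost of 2000 and the price grid are all assumed values.

```python
import math


def expected_days_in_stock(times, probs, horizon):
    """Area under a step survival curve from day 0 to `horizon`.

    The curve equals 1.0 before times[0] and probs[i] from times[i] onward.
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, probs):
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (horizon - prev_t)


def profit(price, times, probs, cost_per_day, horizon):
    """Profit = Revenue - Cost, with Revenue = price and Cost = C * area."""
    return price - cost_per_day * expected_days_in_stock(times, probs, horizon)


def toy_survival_curve(price, horizon):
    """HYPOTHETICAL stand-in for a fitted model: pricier bikes sell slower."""
    rate = 0.05 * (100_000 / price) ** 4  # assumed daily selling rate
    times = list(range(1, horizon + 1))
    return times, [math.exp(-rate * t) for t in times]


HORIZON = 365        # assumed planning horizon in days
COST_PER_DAY = 2000  # assumed constant C

# Steps 1-5 for every candidate price, then pick the most profitable one.
prices = list(range(50_000, 150_001, 5_000))
profits = [profit(p, *toy_survival_curve(p, HORIZON), COST_PER_DAY, HORIZON)
           for p in prices]
best_price = prices[profits.index(max(profits))]
print("best price on the grid:", best_price)
```

With a real fitted model, only `toy_survival_curve` would change: it would return the predicted step curve for the bike’s features at each candidate price.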

4. Results

The figures below plot the calculated Profit over the price range for two different bikes, with Profit on the vertical axis and Price on the horizontal axis:

The suggested best price, which yields the maximum profit of 100000, is around 142000.
The suggested best price, which yields a maximum profit of over 50000, is nearly 120000.

We now have a Machine Learning model based on Survival Analysis that can automatically set the price for a new bike advertisement with the best expected profit.

Conclusion

Survival Analysis is an interesting approach in statistics that has not been very popular in the Machine Learning community. Through this case study, you can now add a new technique to your Machine Learning toolbox.

However powerful the tool may be, this case study shows that a good understanding of the business processes and the ability to apply Machine Learning techniques flexibly are critical to the success of a Machine Learning project.