# A case study: applying Survival Analysis to a real business problem

Sample code accompanying this article: https://github.com/vfa-phuclkh/online_retail_survival_analysis/blob/master/motorbike_online_retailer.ipynb

Over the past few years, Machine Learning has become a very useful tool in decision-making processes for companies all over the world. In this post, I will show how to apply **Survival Analysis**, an interesting but not very popular approach in Machine Learning, to improve the business of an online motorbike retailer. From this case study, you will see that Survival Analysis can be applied to many problems and can help companies run their business more efficiently.

### Situation

Imagine you’re running an online retailer that sells used motorbikes. The routine business operations consist of:

- stocking the used motorbikes
- publishing them with detailed information and some photos
- responding to inquiries and orders for them.

#### Data

From the retailer’s website, let’s collect a small dataset of ads for the same motorbike type that have already been sold. It might include:

- How old is the motorbike?
- The distance it has traveled.
- The price tag for the motorbike.
- When was it sold?
- How long did it take to sell the motorbike?

#### Problems

- Publishing a motorbike’s advertisement can be mostly automated, except for setting the price, where human input is needed. This takes time and creates a bottleneck in the process. If we can automate this decision as well, the entire process can run automatically.
- More importantly, do humans really know how best to set the price?

#### Business flow

Let’s examine the money flow in the business:

- Retailer purchases old bikes -> Outgoing money 💸
- Stocking and maintenance -> Outgoing money 💸
- Sales <- Incoming money 💰

In this article, we focus on the process after purchasing, so the purchase cost is considered a sunk cost and will be ignored, as it cannot be controlled.

#### Goal

In order to maximize the difference between *Incoming money* and *Outgoing money*, let’s consider two extreme actions:

- On one end, to minimize the “Stocking and maintenance” cost, the price should always be set to 0. This ensures the bike is sold in no time.
- On the other end of the spectrum, setting an extremely high price would greatly increase the “Sales” income. However, the cost of stocking and maintenance increases for every single day that the bike is still in stock.

In total, we have **Profit = Sales revenue - Cost**. The goal is to find the sweet spot for the price between these two extremes, using the data accumulated on the retailer’s website.

Goal: Find the best price to maximize profit.

### Methods

#### Why not regression?

In the case of our motorbike data, linear regression is not an appropriate approach because it cannot handle *censored* data. *Censored* and *uncensored* data are discussed in more detail later in the **Survival Analysis** section, but in short, an operation that dramatically changes the properties of an entry, such as a *price down* operation, cannot be represented in linear regression.
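To make the censoring problem concrete, here is a minimal sketch on synthetic data (not the retailer's data). It assumes sale times follow an exponential distribution and that ads are only observed for a fixed window; both are illustrative assumptions. Any method that only sees the ads that were sold inside the window, as a plain regression on sold ads would, works from a sample that underestimates how long bikes really take to sell.

```python
import random

random.seed(0)  # reproducible synthetic data

TRUE_MEAN_DAYS = 60.0   # assumed true average time to sell
WINDOW_DAYS = 45.0      # assumed observation window

# Simulate the true time-to-sale of 10,000 hypothetical ads.
sale_times = [random.expovariate(1.0 / TRUE_MEAN_DAYS) for _ in range(10_000)]

# Ads not sold within the window are censored: we only know "still unsold at day 45".
sold_in_window = [t for t in sale_times if t <= WINDOW_DAYS]

true_mean = sum(sale_times) / len(sale_times)
naive_mean = sum(sold_in_window) / len(sold_in_window)

print(f"share of ads censored: {1 - len(sold_in_window) / len(sale_times):.0%}")
print(f"true mean days to sell:        {true_mean:.1f}")
print(f"mean over sold-only ads:       {naive_mean:.1f}")
```

Dropping the censored ads throws away the information that those bikes survived at least 45 days, which is exactly the information Survival Analysis keeps.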

### Survival Analysis

Scikit-survival is an open source library built on top of Scikit-learn. This library makes it possible to apply Machine Learning methods to Survival Analysis.

#### 1. Introduction

In statistics, Survival Analysis is used to study the time until an event occurs, such as:

- How long will it take for patients to recover from illness?
- How long will it take for industrial products to break? …

Unlike typical machine learning, Survival Analysis can output the probability that the event has not yet happened at each point along the timeline. This output is called a Survival curve.

**Survival curve (function)**

A plot of a survival function is a series of declining horizontal steps, with the vertical axis representing the probability of survival in the population and the horizontal axis representing time.

Given a large enough sample size, the estimated curve will approach the true survival function of the population.
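The article does not name the estimator behind the curve. A common non-parametric choice is the Kaplan-Meier estimator, which is assumed in the minimal pure-Python sketch below; the function name `kaplan_meier` and the toy data are illustrative only. scikit-survival ships a ready-made version as `kaplan_meier_estimator` in `sksurv.nonparametric`.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[float],
                 observed: Sequence[bool]) -> Tuple[List[float], List[float]]:
    """Return (event_times, survival_probabilities) of the Kaplan-Meier estimate.

    durations[i] is the time until the event or until censoring;
    observed[i] is True when the event (a sale) was actually seen.
    """
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    probabilities = []
    for t in event_times:
        # Everyone whose duration is at least t is still "at risk" just before t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, seen in zip(durations, observed) if d == t and seen)
        survival *= 1.0 - events / at_risk
        probabilities.append(survival)
    return event_times, probabilities


# Toy data: days on the site, and whether the ad ended in a sale (True) or was censored (False).
days = [3, 5, 5, 8, 10, 12]
sold = [True, True, False, True, False, True]

times, probs = kaplan_meier(days, sold)
for t, p in zip(times, probs):
    print(f"day {t}: {p:.3f} of ads still unsold")
```

Each step down happens at a day where at least one sale was observed; censored ads never cause a step, but they shrink the at-risk count for later days.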

**Survival function by types**

It’s useful to analyze survival function for some specific feature.

In this case, the data is divided into two groups: one with a *price* lower than average and the other with a *price* higher than average. The survival functions of the two groups are then plotted side by side.

Interestingly, the survival function for the bikes with a *higher-than-average* price is to the left of the other. That means the population of the *high-price* group decreases faster, and bikes in that group are more likely to be sold first.

Does that mean a bike should be priced higher if we want to sell it faster?

Not so fast, **correlation does not imply causation**.

This may simply be because the bikes that are priced higher are of better quality, and that is why they sell faster.
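The simulation below sketches how such a misleading pattern can arise. Everything in it is an assumption made for illustration: a hidden `good_condition` flag drives both the price and the speed of sale, and within each condition a higher price slows the sale. It compares mean days to sell rather than full survival curves to stay short.

```python
import math
import random

random.seed(1)  # reproducible synthetic data, not the retailer's data

ads = []
for _ in range(20_000):
    good_condition = random.random() < 0.5
    typical_price = 1500.0 if good_condition else 1000.0
    price = random.gauss(typical_price, 100.0)
    # Assumed mechanism: good bikes sell faster, and overpricing slows any sale.
    base_rate = 0.10 if good_condition else 0.02
    daily_sale_rate = base_rate * math.exp(-(price - typical_price) / 200.0)
    days_to_sell = random.expovariate(daily_sale_rate)
    ads.append((good_condition, price, days_to_sell))


def mean_days(rows):
    """Average number of days to sell over a list of (condition, price, days) rows."""
    return sum(days for _, _, days in rows) / len(rows)


average_price = sum(price for _, price, _ in ads) / len(ads)
high = [row for row in ads if row[1] > average_price]
low = [row for row in ads if row[1] <= average_price]

# Same comparison, but only among bikes in good condition.
good = [row for row in ads if row[0]]
good_pricey = [row for row in good if row[1] > 1500.0]
good_cheap = [row for row in good if row[1] <= 1500.0]

print(f"all bikes:  high price {mean_days(high):.1f} days, low price {mean_days(low):.1f} days")
print(f"good bikes: high price {mean_days(good_pricey):.1f} days, low price {mean_days(good_cheap):.1f} days")
```

Across all bikes the high-price group sells faster, yet among comparable bikes the higher price sells slower, which is why the grouped curves alone cannot justify raising prices.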

#### 2. Frame the motorbike data as a Survival Analysis problem

One interesting property of Survival Analysis is that it can be used even when the training data is only partially observed.

In Survival Analysis there are two types of data: *censored* and *uncensored*. *Censored* data is when the event was not observed during the observation period, so we only know that it had not happened up to a certain time. *Uncensored* data is when the time at which the event happened to the sample is observed.

In the case of motorbike retail, the *event of interest* is the sale of a motorbike.

In the current data, some samples do not have *Publish_period* values, and they can be considered *censored* data. A sample that also has a *price change* operation can be divided into a *censored* and an *uncensored* entry as follows:

- The *censored* entry A has *Published_period* equal to the period from the beginning until the price is changed.
- The *uncensored* entry B has *Published_period* equal to the period from when the price is changed until the purchase date.

Now the Survival Analysis model treats the newly created entries as two different bike-advertisements.
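The split into entries A and B can be sketched as follows. The `Ad` and `SurvivalEntry` record layouts, their field names, and the `observed_until` cut-off for unsold ads are all hypothetical; the accompanying notebook may structure this differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Ad:  # hypothetical raw record for one advertisement
    price: float
    published_on: date
    sold_on: Optional[date]
    price_changed_on: Optional[date] = None
    new_price: Optional[float] = None


@dataclass
class SurvivalEntry:  # hypothetical row fed to the survival model
    price: float
    published_period: int  # days
    sold: bool             # True = uncensored, False = censored


def to_entries(ad: Ad, observed_until: date) -> List[SurvivalEntry]:
    """Turn one ad into one or two survival entries."""
    entries = []
    start, price = ad.published_on, ad.price
    if ad.price_changed_on is not None:
        # Entry A: censored at the moment the price changes.
        period = (ad.price_changed_on - start).days
        entries.append(SurvivalEntry(price, period, sold=False))
        start, price = ad.price_changed_on, ad.new_price
    # Entry B (or the only entry): runs until the sale, or until the
    # end of the observation window if the bike is still unsold (assumption).
    end = ad.sold_on if ad.sold_on is not None else observed_until
    entries.append(SurvivalEntry(price, (end - start).days, sold=ad.sold_on is not None))
    return entries


example_ad = Ad(price=1000.0, published_on=date(2020, 1, 1), sold_on=date(2020, 1, 31),
                price_changed_on=date(2020, 1, 11), new_price=900.0)
entries = to_entries(example_ad, observed_until=date(2020, 3, 1))
for entry in entries:
    print(entry)
```

For the example ad, entry A covers the 10 days at the original price and is censored, while entry B covers the 20 days at the reduced price and ends in a sale.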

#### 3. Model Profit with Survival curves

Profit = Revenue - Cost

After a Survival Analysis estimator is fitted using the data prepared above, the plan for finding the price with the maximum profit is as follows:

**Step 1:** Estimate **Survival curves** for all possible values over the price range of each motorbike.

**Step 2:** For the survival function at price *p*, convert it into a table of the number of motorbikes sold over time.

**Step 3:** *Revenue* is the price *p* at which the motorbike is sold.

**Step 4.1:** *Cost* is modeled *linearly* in this example, as a constant amount of money *C* per day.

**Step 4.2:** Intuitively, we know that *Total_Cost* would be equal to *C* multiplied by the number of days the motorbike stays in stock.

On the other hand, the average number of days it takes to sell a motorbike is equal to the area under the survival curve of that bike. This can be calculated as the integral of **Motorbikes_Remains** over time *t*. As a result, the expected *Total_Cost* is *C* multiplied by that integral.

**Step 5:** Finally, calculate *Profit* from *Revenue* and *Cost* for each price *p*.
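The original equations were figures that are not reproduced here, so the following is a reconstruction under stated assumptions rather than the article's exact formulas. It assumes the revenue per bike is simply the price *p*, and it treats **Motorbikes_Remains** as the fraction of bikes still unsold at time *t*, written $S_p(t)$. The symbols $S_p$ and $\bar{T}$ are introduced here for illustration.

```latex
\begin{aligned}
\text{Revenue}(p) &= p \\
\bar{T}(p) &= \int_0^{\infty} S_p(t)\,dt
  && \text{(average days to sell = area under the survival curve)} \\
\text{Total\_Cost}(p) &= C \cdot \bar{T}(p) = C \int_0^{\infty} S_p(t)\,dt \\
\text{Profit}(p) &= \text{Revenue}(p) - \text{Total\_Cost}(p)
  = p - C \int_0^{\infty} S_p(t)\,dt
\end{aligned}
```

The best price is then the value of *p* that maximizes $\text{Profit}(p)$ over the price range considered in Step 1.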

#### 4. Results

The figures plot the calculated *Profit* over the price range of two different bikes, with *Profit* on the vertical axis and *Price* on the horizontal axis.

We now have a Machine Learning model based on Survival Analysis that can automatically set the price for a new bike advertisement with the best expected profit.
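The price search itself can be sketched in a few lines. The `survival_curve` function below is a hypothetical stand-in for the fitted model's prediction (it assumes the daily sale rate falls exponentially as the price rises), and `DAILY_COST` and the price grid are made-up values; only the structure of the search follows the steps in section 3.

```python
import math

DAILY_COST = 10.0     # C: assumed stocking and maintenance cost per day
HORIZON_DAYS = 5000   # integration horizon, long enough for the curves to reach ~0


def survival_curve(price: float, day: float) -> float:
    """Hypothetical S_p(t): probability a bike priced at `price` is unsold on `day`.

    In a real pipeline this would come from the fitted survival model.
    """
    daily_sale_rate = 0.2 * math.exp(-price / 600.0)
    return math.exp(-daily_sale_rate * day)


def expected_days_in_stock(price: float) -> float:
    """Area under the survival curve (trapezoidal rule with 1-day steps)."""
    values = [survival_curve(price, day) for day in range(HORIZON_DAYS + 1)]
    return sum(values) - 0.5 * (values[0] + values[-1])


def expected_profit(price: float) -> float:
    """Profit = Revenue - Total_Cost, assuming Revenue equals the price."""
    return price - DAILY_COST * expected_days_in_stock(price)


candidate_prices = range(500, 2501, 50)
best_price = max(candidate_prices, key=expected_profit)
print(f"best price on the grid: {best_price}")
print(f"expected profit there:  {expected_profit(best_price):.2f}")
```

A grid search is used because the profit curve from a fitted model has no closed form in general; a finer grid or a one-dimensional optimizer would work equally well.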

### Conclusion

Survival Analysis is an interesting approach in statistics, but it has not been very popular in the Machine Learning community. Through this case study, you can now add a new technique to your Machine Learning toolbox.

However powerful the tool, this case study shows that a good understanding of the business processes and the ability to apply Machine Learning techniques flexibly are critical to the success of a Machine Learning project.