Survival Analysis

Common Pitfalls and How to Survive them

Eleni Nisioti
Applied Data Science
7 min read · Feb 4, 2020

Regardless of whether we call it death, a mechanical failure or a financial downfall, our expectations of the occurrence of an irrevocable event have always been a guiding factor in our decisions. Will undertaking therapy improve your chances of recovery? Are there any steps that will help your business survive an impending financial catastrophe?

Survival analysis has been offering an answer to these questions for decades. Being an art of statistical origin, it requires a clear understanding of the data, a careful selection of the prediction models and an appropriate interpretation of the conclusions.

Statistics is a wide field that has employed, among others, mathematicians, data scientists, market analysts and healthcare practitioners. This combination of popularity and diversity has, however, created some problems: common practices are often not correct and correct statistical conclusions unfortunately not that common.

In this post, we break down survival analysis into steps and identify some of the caveats involved. To make concepts more tangible, we view its application through a real-world example. The year is 2007 and a financial crisis is on its way.

A case study: the global financial crisis and long-term investment in Croatia

In 2007, the housing market bubble in the US quickly evolved into a major financial crisis. The rest of the world waited for the consequences, which took a couple of years to reach Europe. Croatia was hit in the first quarter of 2009, which gave firms a window of almost two years to plan ahead and react. One is therefore tempted to ask: did firms come up with any financial strategy that proved effective, in retrospect?

In a research paper published in 2016, researchers analysed data related to financial activity in Croatia from 2003 to 2015 and examined the following hypothesis: did long-term investment in the early years of the crisis play a major role in the survival of firms?

Spotting a financial crisis is easy in retrospect

Step 1: Know your terms

The terminology that one encounters in survival analysis is quite intuitive. The concept of interest is the time to event T, which in our case is the time to financial collapse of a firm. This is a random variable, meaning that it takes values according to a probability density function f(t). The cumulative distribution function F(t) = Pr(T < t) gives the probability that the collapse has occurred before some duration t. These common statistical concepts suffice to define the survival function, which is commonly estimated from data with the Kaplan-Meier estimator and describes the probability that a collapse has not occurred before t years have passed:

S(t) = Pr(T ≥ t) = 1-F(t)
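To make the estimation step concrete, here is a minimal sketch of the Kaplan-Meier estimator using only the Python standard library. The function name, the variable names and the six-firm data set are made up for illustration and do not come from the study. A firm is "censored" when it was still alive the last time it was observed, so its collapse time is unknown.

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(t, S_hat)] at each distinct event time.

    durations: time until collapse, or until the firm was last seen
    observed:  True if the collapse was seen, False if censored
    S_hat is the estimated share of firms still alive just after time t.
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # firms leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the share of at-risk firms that survive time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # collapsed and censored firms both drop out
    return curve

# Hypothetical data: years of follow-up for six firms.
durations = [1, 2, 2, 3, 4, 5]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 3))
```

Censored firms never lower the curve directly. They only shrink the number of firms at risk for later event times.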

Another important concept in this field is the hazard function, which represents the instantaneous rate at which collapses occur at time t among firms that have survived up to t. The popular Cox proportional hazards model expresses the hazard in terms of different parameters affecting survival time, called covariates. In our case, covariates include the legal form, size and ownership type of the firm.
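As a rough illustration of what "proportional hazards" means, the sketch below evaluates the Cox form h(t | x) = h0(t) · exp(β · x) for two firms. The coefficients and the two covariates are invented for the example and are not estimates from the study.

```python
import math

def cox_hazard(baseline_hazard, coefficients, covariates):
    """Cox model: h(t | x) = h0(t) * exp(sum_k beta_k * x_k)."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)

# Hypothetical coefficients for the covariates [invested_long_term, is_micro_firm].
beta = [-0.5, 0.3]
investing_firm = [1, 1]
non_investing_firm = [0, 1]

# Whatever value the baseline hazard h0(t) takes over time,
# the ratio between the two firms' hazards stays the same.
ratios = [
    cox_hazard(h0, beta, investing_firm) / cox_hazard(h0, beta, non_investing_firm)
    for h0 in (0.01, 0.05, 0.20)
]
hazard_ratio = math.exp(beta[0])  # effect of investing, other covariates equal
print(ratios, hazard_ratio)
```

A hazard ratio below 1 for a covariate means that, under the model, firms with that characteristic collapse at a lower rate at every point in time.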

Step 2: Get your hypothesis straight

The purpose of every statistical test is to challenge a hypothesis. In our case, our objective is to investigate whether long-term investment affected the survival time of firms. Being in possession of relevant data, we divide them into two distinct populations: one with firms that invested in long-term assets and one with firms that didn’t.

As part of our experiment we need to formulate the null and the alternative hypothesis. The null hypothesis, in our case study, is that these two populations do not differ in survival time. In contrast, the alternative hypothesis is that survival times differ significantly for the two populations.

A common misconception is that the test can tell us whether our alternative hypothesis is true. In reality, a statistical test can only help us reject or fail to reject the null hypothesis. Think of it this way: your test is a jury that may fail to convict a defendant. This doesn’t prove that the defendant is innocent: it only means the prosecution failed to prove the defendant’s guilt beyond a reasonable doubt.

Step 3: Find the right test

Before picking the shelf you are going to take your statistical test from, you need to carefully go through a list:

  • are your covariates continuous or categorical?
  • is the analysis multivariate or univariate?

While some simple tests, such as Kaplan-Meier curves and the logrank test, only work for categorical covariates and can only investigate the effect of a single parameter on survival time, methods such as the Cox proportional-hazards model offer more freedom. One needs to strike a balance: the assumptions of the test must not be violated, yet the model should not be so complex that it requires more data than one has available.
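For the simplest setting, two groups and one categorical covariate, the logrank test can be sketched in a few lines of plain Python. This is an illustrative implementation under the usual textbook formulation, not the code used in the study, and the two small groups at the bottom are invented.

```python
import math

def logrank_test(durations_a, observed_a, durations_b, observed_b):
    """Two-group logrank test. Returns (chi-square statistic, p-value)."""
    records = [(t, e, 0) for t, e in zip(durations_a, observed_a)]
    records += [(t, e, 1) for t, e in zip(durations_b, observed_b)]
    event_times = sorted({t for t, e, _ in records if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in records if u >= t)                # at risk overall
        n_a = sum(1 for u, _, g in records if u >= t and g == 0)   # at risk in group A
        d = sum(1 for u, e, _ in records if u == t and e)          # collapses at t
        d_a = sum(1 for u, e, g in records if u == t and e and g == 0)
        # Under the null, group A gets its proportional share of the collapses.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value

# Hypothetical survival times in years; every collapse was observed.
investing = [6, 7, 8, 9, 10]
non_investing = [1, 2, 3, 4, 5]
chi2, p_value = logrank_test(investing, [True] * 5, non_investing, [True] * 5)
print(chi2, p_value)
```

The test compares, at every collapse time, how many collapses group A actually had with how many it would be expected to have if both groups shared the same survival curve.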

Step 4: Deal with confounding

Confounding has been a statistical nightmare since the dawn of the field. When trying to understand whether a variable X (in our case long-term investment) is causing another variable Y (survival), we often wrongly detect causation based on our data. However, it is possible that a third variable Z is actually the cause of both X and Y. So, although X and Y tend to occur together, the former is not causing the latter.

Confounding is very common in health studies, for example investigations of the effects of smoking on health, where factors such as age and other habits could be causing both the bad habit and a disease.

To address this problem, statisticians employ a technique called matching. In our case study, the researchers divided their data into firms that invested in long-term assets and firms that didn’t. Then, they calculated the average values of the other covariates for the two groups to see if they differ. When the average values for a covariate are very close in the two groups, that covariate is unlikely to be a confounding factor. This was the case for all covariates, except for the firm size, whose average value differed significantly between the two populations. Therefore, the researchers needed to test their hypothesis separately for micro, small, medium and large firms.

Matching: the mean values for long-term assets, short-term assets, capital, revenues and loan do not differ significantly between micro investing and non-investing firms
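A balance check of this kind can be sketched as follows. Instead of comparing raw means, the example uses the standardized mean difference, which is a swapped-in but common way of putting covariates with different units on the same scale. The covariate names, the values and the 0.1 threshold are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, stdev

def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Hypothetical covariate values for investing and non-investing firms.
investing = {"capital": [10, 12, 11, 13, 9], "employees": [40, 55, 60, 45, 50]}
non_investing = {"capital": [11, 10, 12, 12, 10], "employees": [8, 12, 10, 9, 11]}

THRESHOLD = 0.1  # rule-of-thumb cut-off; an assumption, not from the paper
imbalanced = [
    name
    for name in investing
    if abs(standardized_mean_difference(investing[name], non_investing[name])) > THRESHOLD
]
print(imbalanced)
```

A covariate flagged here, like firm size in the study, is a candidate for splitting the analysis into separate groups.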

Step 5: Understand p-values

The p-value holds a special place in the history of statistics. In its simplicity, it has become a double-edged sword. On one hand it allows for an easy way to reach a statistical conclusion. On the other hand, it can be, and often is, misinterpreted.

In essence, a p-value answers the following question: How likely is it that I would get data at least as extreme as the data I have, assuming that the null hypothesis is true? If the p-value is less than your selected significance level, typically set at 0.05, you reject the null hypothesis in favor of the alternative hypothesis.

However, one should always bear in mind that a small p-value means that there would be a small chance of observing such a difference between the two populations if the null hypothesis were true, but does not necessarily mean that the compared populations are much different in practice. Furthermore, a large p-value would not indicate that long-term investment does not affect survival time. As we previously explained, it would just mean that we couldn’t reject the possibility that it doesn’t.
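One way to see what a p-value measures is a permutation test: shuffle the group labels many times and count how often the shuffled data show a difference at least as large as the observed one. The sketch below does this for a difference in mean survival time. It ignores censoring for simplicity, and the function name, data and number of permutations are illustrative assumptions.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under the null, group labels are interchangeable
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical, fully observed survival times in years.
p_different = permutation_p_value([6, 7, 8, 9, 10], [1, 2, 3, 4, 5])
p_identical = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(p_different, p_identical)
```

The result says how surprising the data would be if the null hypothesis held. It says nothing about how large or how important the difference is.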

So, does long-term investment affect the survival rate of firms?

The answer, according to the authors, is that it depends. For medium and large firms, investment strategies did not seem to play a big role. However, micro and small firms investing in long-term assets did have higher survival rates. This can probably be attributed to the fact that investments relative to total assets will necessarily be lower for larger firms than for smaller ones, which diminishes their importance for survival.

What is statistical power?

Statistical power is the probability that your test rejects the null hypothesis when it is actually false. The conclusions that you reach are only as strong as the methodology that you followed. Did you have a small sample of data? Were some of them of questionable quality? Did you use a model that makes assumptions not validated by your data? For every positive answer, the statistical power of your methodology drops and you need to downplay the significance of your conclusions a bit.
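To get a feel for how sample size drives power, the sketch below computes the approximate power of a two-sided, two-sample z-test. This is a simplified stand-in chosen for illustration. For survival tests such as the logrank test, power depends mainly on the number of observed events, not the raw number of firms. The effect size and sample sizes are made-up values.

```python
from statistics import NormalDist

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size: difference in means, in units of the common standard deviation
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the test statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

# Hypothetical scenario: a modest effect studied with growing samples.
for n in (10, 50, 200):
    print(n, round(two_sample_power(0.3, n), 2))
```

With no true effect, the function returns the significance level itself, which is the rate of false alarms the test accepts by design.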

Finally, data analysis cannot be your sole guide when stepping from reaching a statistical conclusion to shaping your future strategies. While understanding historical events is important, extrapolating learned lessons to future scenarios is a task that requires care, experience and the ability to strip complex decision problems of their irrelevant, real-world details.

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. — Ronald Fisher

If you want to learn more about how we apply survival analysis and other data science techniques for our clients, check out our website here.

Applied Data Science Partners is a London-based consultancy that implements end-to-end data science solutions for businesses, delivering measurable value. If you’re looking to do more with your data, please get in touch via our website.

Eleni Nisioti
Applied Data Science

PhD student in AI. Deep learning is not just for machines. I like my coffee like I like my code. Without bugs.