
# Motivation

There are many pedagogical resources that explain the theory behind the Cox model. There are also many resources that provide guidance on applying the Cox model using existing statistical software packages for survival analysis. However, there are few sources that bridge the gap between theory and practice by explaining the programmatic implementation of the Cox model. In this article, I 1) explain the key mathematical results that are necessary to implement the Cox model, 2) illustrate sample code that can be used to implement the Cox model, and 3) compare the output of the manually implemented function to that of survival::coxph().

# Introduction

Estimating almost any (semi) parametric regression model follows a similar procedure:

1. Specify a model with parameters β that we want to estimate
2. Determine either a) a loss function to minimize, or b) a likelihood to maximize
3. Find the maximum (for likelihood) or minimum (for loss function) of the expression above by taking the derivative with respect to the parameters of interest and setting it to zero (using calculus)
4. Find the values of the parameters of interest (β̂) that yield a derivative of zero (using either algebra or numerical optimization)

Every regression model has its own quirks that may make any one of these steps more or less difficult. For instance, the loss function for linear regression (squared-error) has a “clean” derivative, which means step (4) is just a matter of solving for the optimal parameter values through algebra. Logistic regression, on the other hand, has a loss function with a “messy” derivative, which means step (4) requires the use of a numerical optimization algorithm as opposed to simple algebra.
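To make this contrast concrete, here is a small R sketch (on simulated data, purely illustrative) verifying that the closed-form OLS solution β̂ = (XᵀX)⁻¹Xᵀy matches what lm() produces:

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Closed-form OLS solution: beta_hat = (X'X)^{-1} X'y
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() arrives at the same solution (internally via a QR decomposition)
fit <- lm(y ~ x)
all.equal(as.vector(beta_hat), unname(coef(fit)))  # TRUE
```

For logistic regression, and for the Cox model below, no such closed-form expression exists, so step (4) becomes an iterative numerical search.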

Here is an outline of the above four steps, specific to the Cox proportional hazards model.

1. Our model is: λ(t|X) = λ₀(t)eˣᵝ
2. We will maximize the likelihood (instead of minimizing a loss function). One quirk of the Cox model is that its estimation uses the so-called “partial likelihood” instead of the full likelihood.
3. We will maximize the partial likelihood by setting the derivative to zero.
4. The derivative is “messy”, so we will use the Newton-Raphson algorithm to computationally arrive at parameter estimates that yield a derivative of zero (instead of solving for them algebraically).

# 1. Model specification

The Cox model is λ(t|X) = λ₀(t)eˣᵝ, where X is the covariate matrix, β is the vector of coefficients we want to estimate, and λ₀ is the so-called baseline hazard function. The hazard function can be interpreted as a patient’s “risk” of failure at time t. Mathematically, it is defined as follows:

λ(t) = lim_{Δt → 0} P(t ≤ T < t + Δt | T ≥ t) / Δt = f(t) / S(t)

where f(t) is the PDF of T, the survival times for all patients, and S(t) = P(T > t) is the survival function.

When the covariates in the Cox model are zero, the patient’s hazard function λ(t|X) is equal to the baseline hazard function λ₀(t). Non-zero covariates affect the patient’s hazard by multiplying the baseline hazard by the eˣᵝ term.

For our purposes, this explanation will suffice. Many other sources have thoroughly explained the Cox model specification in more detail (see Therneau and Grambsch 2000 for example).

# 2. Specifying the likelihood

A key characteristic of the Cox model is that its hazard can be factored as follows:

λ(t|X) = g(t)h(X), where g(t) = λ₀(t) and h(X) = exp(Xβ)

where g is the baseline hazard function and h is the exponentiated covariates and betas. Importantly, g does not depend on X and h does not depend on t. This allows us to compare the hazard functions of two different patients without having to estimate g(t):

λ(t|x₁) / λ(t|x₂) = [g(t)h(x₁)] / [g(t)h(x₂)] = h(x₁) / h(x₂) = exp((x₁ − x₂)β)

This allows for so-called semi-parametric inference, wherein we parameterize h(X) but allow g(t) to be more free-form.
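As a quick illustration (the coefficients and covariate values below are made up, not estimates from any fitted model), the relative hazard of two patients depends only on h(X):

```r
# Hypothetical coefficients and covariates for two patients
beta <- c(0.5, -0.2)
x1 <- c(1, 60)  # e.g., treatment = 1, age = 60
x2 <- c(0, 50)  # e.g., treatment = 0, age = 50

# Hazard ratio: lambda(t|x1) / lambda(t|x2) = exp((x1 - x2) . beta)
# The baseline hazard g(t) cancels, so we never need to estimate it
hazard_ratio <- exp(sum((x1 - x2) * beta))
hazard_ratio  # exp(-1.5), about 0.223
```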

We see this idea in play when we specify the likelihood L(λ₀, β). Cox showed that L(λ₀, β) can be factored as L₁(λ₀, β) · L₂(β), where the β̂ that maximizes L₂ is an asymptotically normal, consistent estimator of β, whereas L₁ contains relatively little information about β. Cox proposed ignoring L₁ and basing all inference on maximizing L₂. Because it uses only part of the full likelihood, L₂ is called the partial likelihood.

For simplicity, we will refer to L₂ simply as L, with the understanding that L represents the partial likelihood. Cox specified the contribution of patient j to the partial likelihood as follows:

Lⱼ(β) = P(patient j fails at Tⱼ | patient j ∈ R(Tⱼ)) / Σ_{k ∈ R(Tⱼ)} P(patient k fails at Tⱼ | patient k ∈ R(Tⱼ))

The numerator is the probability that patient j fails at time Tⱼ (patient j’s failure time), given that patient j was at risk for death at time Tⱼ. Being in the risk set R(Tⱼ) means that, just before Tⱼ, the patient has neither failed nor been censored. The denominator is the sum, over all patients k in R(Tⱼ), of the probability that patient k fails at Tⱼ given that k is in the risk set.

Mathematically, this can be represented as follows:

Lⱼ(β) = λ(Tⱼ|xⱼ) / Σ_{k ∈ R(Tⱼ)} λ(Tⱼ|xₖ) = λ₀(Tⱼ)exp(xⱼβ) / Σ_{k ∈ R(Tⱼ)} λ₀(Tⱼ)exp(xₖβ) = exp(xⱼβ) / Σ_{k ∈ R(Tⱼ)} exp(xₖβ)

Note that the baseline hazard λ₀(Tⱼ) cancels from the numerator and denominator, which is why β can be estimated without estimating λ₀.

The entire partial likelihood can be written as:

L(β) = ∏ⱼ₌₁ⁿ exp(xⱼβ) / Σ_{k ∈ R(Tⱼ)} exp(xₖβ)

If the dataset has censoring, the likelihood expression can be modified:

L(β) = ∏ⱼ₌₁ⁿ [ exp(xⱼβ) / Σ_{k ∈ R(Tⱼ)} exp(xₖβ) ]^δⱼ

where δⱼ is 0 if the patient is censored and 1 otherwise. For simplicity, we will not consider censoring in the rest of the mathematical results or code.
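Before deriving the estimator, it may help to compute this partial likelihood directly for a toy example. The helper below is an illustrative sketch (the function name is my own), assuming no ties and no censoring:

```r
# Partial likelihood for the Cox model, assuming no ties and no censoring.
# time: failure times; X: n x p covariate matrix; beta: p x 1 coefficients
cox_partial_likelihood <- function(time, X, beta) {
  eta <- as.vector(X %*% beta)  # linear predictors x_j * beta
  terms <- sapply(seq_along(time), function(j) {
    risk_set <- which(time >= time[j])     # patients still at risk at T_j
    exp(eta[j]) / sum(exp(eta[risk_set]))  # patient j's contribution
  })
  prod(terms)
}

# Toy data: 3 patients, 1 covariate
time <- c(2, 5, 9)
X <- matrix(c(1, 0, 1), ncol = 1)
cox_partial_likelihood(time, X, beta = c(0.3))
```

A useful sanity check: at β = 0 every patient in the risk set is equally likely to fail, so each contribution is 1/|R(Tⱼ)| and the toy example above gives (1/3)(1/2)(1/1) = 1/6.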

# 3. Maximizing the likelihood

A note about notation before proceeding: since we will eventually be implementing the math that follows in code, it’s important to be clear about notation, particularly matrix/vector algebra:

1. X is an n × p matrix representing p covariates for n patients
2. xⱼ is a 1 × p vector representing the p covariates for patient j
3. β is a p × 1 vector representing the coefficients for the p covariates
4. Based on the above notation, Xβ yields an n × 1 vector, and xⱼβ yields a scalar (a vector of length 1)

As in many cases of MLE, maximizing the log-likelihood is easier than, and equivalent to, maximizing the likelihood, since f(x) = log(x) is monotonically increasing. The partial log-likelihood for the Cox model is:

ℓ(β) = log L(β) = Σⱼ₌₁ⁿ [ xⱼβ − log( Σ_{k ∈ R(Tⱼ)} exp(xₖβ) ) ]

Taking the derivative with respect to β yields the score (gradient):

∂ℓ/∂β = Σⱼ₌₁ⁿ [ xⱼ − Σ_{k ∈ R(Tⱼ)} xₖ exp(xₖβ) / Σ_{k ∈ R(Tⱼ)} exp(xₖβ) ]

The second term inside the sum is a weighted average of the covariates of the patients at risk at Tⱼ. Setting this gradient to zero has no closed-form solution, so we use the Newton–Raphson algorithm, which repeatedly updates β ← β − H⁻¹U, where U is the gradient above and H is the Hessian (the matrix of second derivatives).
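Putting the pieces together, a minimal Newton–Raphson implementation might look like the sketch below. This is my own illustrative version, assuming no ties and no censoring, and is not necessarily identical to the coxph_AS() function used in the next section; the Hessian here is the standard second derivative of the partial log-likelihood, i.e., minus the sum over j of the weighted covariance of the covariates in R(Tⱼ).

```r
# Newton-Raphson for the Cox partial likelihood (illustrative sketch).
# Assumes no ties and no censoring. time: failure times; X: n x p matrix.
cox_newton_raphson <- function(time, X, tol = 1e-8, max_iter = 50) {
  X <- as.matrix(X)
  p <- ncol(X)
  beta <- rep(0, p)                       # initialize at beta = 0
  for (iter in seq_len(max_iter)) {
    eta <- as.vector(X %*% beta)
    w <- exp(eta)                         # risk scores exp(x_k beta)
    grad <- rep(0, p)
    hess <- matrix(0, p, p)
    for (j in seq_along(time)) {
      R <- which(time >= time[j])         # risk set at T_j
      wR <- w[R]
      XR <- X[R, , drop = FALSE]
      xbar <- colSums(XR * wR) / sum(wR)  # weighted mean of covariates
      grad <- grad + X[j, ] - xbar        # score contribution of patient j
      xxbar <- t(XR * wR) %*% XR / sum(wR)
      hess <- hess - (xxbar - tcrossprod(xbar))  # minus weighted covariance
    }
    if (max(abs(grad)) < tol) break       # gradient ~ 0: converged
    beta <- beta - solve(hess, grad)      # Newton update: beta - H^{-1} U
  }
  list(coefficients = beta, gradient = grad, iterations = iter)
}

# Simulated example: baseline hazard 1, true beta = 0.5
set.seed(1)
n <- 50
x <- rnorm(n)
time <- rexp(n, rate = exp(0.5 * x))
fit <- cox_newton_raphson(time, matrix(x, ncol = 1))
fit$coefficients
```

Because the partial log-likelihood is concave, Newton–Raphson started from β = 0 typically drives the gradient to machine-level zero in a handful of iterations.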

# Testing it out

```r
library(survival)

data(veteran)
set.seed(12345)
sim <- coxph_AS(Surv(time) ~ trt + karno + age, data = veteran)
```

Comparing the numerical outputs from the two models shows that our estimates (with suffix “_AS”) match survival::coxph() pretty closely.

We can also see how our β̂ estimates and corresponding confidence intervals compare to survival::coxph() (in red).

Finally, we can see how (quickly!) our gradient converges to zero.

# References

Cox, D. R. (1972). “Regression Models and Life-Tables.” Journal of the Royal Statistical Society, Series B, 34(2), 187–220.

Therneau, T. M., and Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. New York: Springer.
