Published in

CodeX

Simulating Common, Censored, Outcome Variables as Dependent Variables in R [part 1]

[Under Review]

Following what we did here, what constitute the response variable Y in this short article, are individual times or durations that mark the time of an event of interest. They are commonly called time-to-event (tte). These could be in days, weeks months or years or times converted in some other forms. The event may have happened before an investigator starts investigating with a planned trial study, the event may happen during the study, or it may happen after the study. Not all patients/participants have experienced or will experience the event of interest. A common scenario is presented this way: enrolled participants either experience the event and, the time the event happened is recorded and they are stopped being followed, OR they don’t experience the event and they choose to quit being followed or the investigator lost track of them and the investigator has no idea whether the patient experienced the event or not and so the last time the patient was seen is recorded as censored time, OR the end time of the study reaches and the times recorded for the patients who have not experience the event yet, correspond to the study end time and these times are also marked as censored times. All these individual times are recorded in the response variable Y. Censored times in this kind of scenario are qualified as right-censored. Other alternative scenarios, not of interest in this short article, are left-censored and interval-censored times.

To simulate times from the described right-censoring mechanism, we generate two variables for times: the tte variable T and the time to right-censoring variable C, assuming T is independent of C given a set of covariates X. Then we consider the observations in Y to be the minimum between the T and the C values. An important variable necessary in survival analysis is the indicator binary variable of 1’s for tte’s and 0’s for censored times. The event of interest is usually death, from which we get the name survival times (time you survived until you died). Another common term is the overall survival times (OS) when event is death from all causes (a patient may die of a road accident while the event of interest is death from the observed health condition), and progression free survival times (PFS) when the event of interest include death and disease progression.

We simulate survival times that can be fitted in the general three classes of survival models for right-censored data: A class of parametric models (commonly the exponential, Weibull, Rayleigh, Gompertz, loglogistic and lognormal models), which include the accelerated failure time (AFT) models, and the cox proportional (cox-PH) semi-parametric models. We present simulation strategies for AFT models in the next article. There are three fundamental functions in survival modelling:

The hazard function h() which measures for each individual i, the (instantaneous) risk of experiencing the event of interest at time t. It depends on the baseline (i.e measured at the start of the study) hazard h0(), and the linear predictor of potential covariates. We will use the same linear predictor from the other article about simulations, considering that survival times depend on those variables. However, the set of coefficients beta, does not include the intercept coefficient b0 for the parametric models (we exclude the AFT models in this restriction). You may get some clarifications about this, here. The cumulative hazard function Hi() integrates or sum up to t, the hazard function hi(), when the latter is continuous or discrete respectively. The survival function Si() measures how likely an individual i can experience the event of interest after some time t or the probability to “survive” up to time t.

The parametric models and the cox-PH model are essentially based on these functions (check here how they are one-to-one related). One most note that:

“A hazard function may remain constant with respect to time (correspond-
ing to an exponential density); may increase as a function of time according
to some power function (corresponding to a Weibull density); may increase
linearly with time (corresponding to a Rayleigh density); or may increase ex-
ponentially with time (corresponding to a Gompertz density). In addition,
there may be intervals of time where the hazard may alternatingly decrease
or increase according to some power, linear, or exponential function of time.” — Ding and Karl E (2013) chapter 5.

Many functions have been defined for the hazard h0() measured at baseline; it could be a constant a polynomial some log or some function of exponents of t. Commonly hazards related to the exponential, the gompertz and the weibull distributions are:

Parametric models make distributional assumptions on the baseline hazard, which is equivalent to making distributional assumptions on the survival times, due to their one-to-one relationships. The cox-PH model requires no specific distribution for baseline hazard or the survival function , making the model flexible to fit any data parametric models can fit. From the definition of the hazard function, we get the equality log[hi(t)/h0(t)] = beta*Xi, from which the assumption of constant log hazard ratio (lhr) over time is made (i.e the linear predictor beta*Xi is considered not to depend on t), also known as the proportional hazard (PH) assumption. The latter is a particularly known assumption under the semi-parametric cox-PH modelling framework.

Simulating survival times from standard parametric survival distributions (Assuming proportional hazards)

This has been facilitated by the R package simsurv (web documentation, YouTube).

set.seed(10082021)
b<-runif(10,-.05,.05)
betas<-c(Xage=b[2], Xbmi=b[3], Xfemale=b[4], Xexsmoker=b[5], Xnonsmoker=b[6], XB=b[7], XC=b[8], XD=b[9], XE=b[10]). “betas” is a named vector.

We use the same generated covariates from this article stored in data2 in which we add an individual index column required by the simsurv() R function:

data2<-data.frame(id = 1:1000, gender=factor(gender), Xage=age, Xbmi=bmi, smoking.status=factor(smoking.status), trtgps=factor(trtgps), Xfemale=contrast.matrix[,2], Xexsmoker=contrast.matrix[,3], Xnonsmoker=contrast.matrix[,4], XB=contrast.matrix[,5], XC=contrast.matrix[,6], XD=contrast.matrix[,7], XE=contrast.matrix[,8])

The Exponential Model

exp.survdat <- simsurv(dist = “exponential”, lambdas = 1.3, betas = betas, x = data2, seed=77)

“exp.survdat” is a data frame of 3 variables. The “status” variable is a vector of 1’s indicating none of the “eventtimes” is censored. To create censored times, one may assume that, any eventtime of 4 or more than 4 is considered (right) censored. We specify this in the simsurv() function with the component maxt=4 and get the following. Remember to set the seed within the simsurv function for me and you to have the same simulated survival times.

We don’t have much censored times in this data. The censoring rate is 28/1000*100 = 2.8%. The censored times are fixed values, which is not really common. Another way to create right-censored times, is to generate two exponential eventtimes T and C, say with the objects exp1.survdat and exp2.survdat respectively. The final event time would be the minimum min(Ti, Ci) and the status variable would be 1 if Ti less than or equal to Ci and 0 otherwise. We vary the lambdas values (say 1.3 and 1) and the seeds (say 77 and 78) in exp1.survdat and exp2.survdat respectively. Assuming we have generated exp1.survdat and exp2.survdat, we get the new data frame exp.survdat as

exp.survdat<-data.frame(id=exp1.survdat\$id, T=exp1.survdat\$eventtime, C=exp2.survdat\$eventtime, eventtime=pmin(exp1.survdat\$eventtime, exp2.survdat\$eventtime),status=as.numeric(exp1.survdat\$eventtime<= exp2.survdat\$eventtime))

The first 5 values of the data frame look like these. This time censored values have diverse values. We proceed the same way the remaining models.

The gompertz model

gomp1.survdat <- simsurv(dist = “gompertz”, lambdas = .03, gammas = .5, betas = betas, x = data2, seed = 77)

gomp2.survdat <- simsurv(dist = “gompertz”, lambdas = .01, gammas = .3, betas = betas, x = data2, seed = 78)

The resulting survival times are in the data frame gomp.survdat:

gomp.survdat<-data.frame(id=gomp1.survdat\$id, T=gomp1.survdat\$eventtime, C=gomp2.survdat\$eventtime, eventtime=pmin(gomp1.survdat\$eventtime,gomp2.survdat\$eventtime), status=as.numeric(gomp1.survdat\$eventtime<=gomp2.survdat\$eventtime))

The weibull model

weib1.survdat <- simsurv(dist = “weibull”, lambdas = .3, gammas = 1.5, betas = betas, x = data2, seed = 77)

weib2.survdat <- simsurv(dist = “weibull”, lambdas = .1, gammas = 1.3, betas = betas, x = data2, seed = 78)

weib.survdat<-data.frame(id=weib1.survdat\$id, T=weib1.survdat\$eventtime, C=weib2.survdat\$eventtime, eventtime=pmin(weib1.survdat\$eventtime, weib2.survdat\$eventtime), status=as.numeric(weib1.survdat\$eventtime<=weib2.survdat\$eventtime))

The simulations setting above (positive lambda and gamma under gompertz with gamma>1 for weibull) imply that the baseline hazards are monotonically(strictly) increasing for both the weibull and gompertz (the exponential hazard is constant). However, in reality, h0(t) could also decrease. A 2-component mixture model relaxes this assumption of monotonicity of the baseline hazard such that survival times under a 2-component mixture gompertz model for example can be generated this way:

2cgomp.survdat <- simsurv(dist = “gompertz”, lambdas = c(0.03, 0.01), gammas = c(.3, .5), betas = betas, mixture = TRUE, pmix = 0.5, x = data2, maxt = 9.00, seed=77)

Simulating survival times from standard parametric survival distributions (Assuming non proportional hazards)

This assumption implies the linear predictor beta*Xi depends on t. For example, beta*Xi can be of the form beta*Xi = beta1*Xi + beta2*Xi*log(t). We then define two sets of coefficients beta1 and beta2, they have the same dimension, beta1 is the lhr when log(t)=0, beta2 is the change in lhr per unit change in log(t).

set.seed(10082021)
b<-runif(10,-.05,.05)
beta1<-c(Xage=b[2], Xbmi=b[3], Xfemale=b[4], Xexsmoker=b[5], Xnonsmoker=b[6], XB=b[7], XC=b[8], XD=b[9], XE=b[10]).

b2<-runif(10,-.01,.05)
beta2<-c(Xage=b2[2], Xbmi=b2[3], Xfemale=b2[4], Xexsmoker=b2[5], Xnonsmoker=b2[6], XB=b2[7], XC=b2[8], XD=b2[9], XE=b2[10])

Survival times are finally obtained with the line code below:

nonPHgomp.survdat <- simsurv(dist = “gompertz”, lambdas = 0.03, gammas = .5, betas = beta1, tde = beta2, x = data2, tdefunction = “log”, maxt = 4, seed=77)

Another scenario would be to define the beta2 for the treatment groups only:

beta2<-c(XB=b2[7], XC=b2[8], XD=b2[9], XE=b2[10]) and the code above would still generate the corresponding survival times.

Fitting the simulated survival times

We fit the gompertz-PH and the coxPH models to 2cgomp.survdat and nonPHgomp.survdat respectively, and test for PH assumption, taking into account for the potential correlation between the survival times and the covariates age, bmi, gender, smoking status and the 5 treatments under investigation. We need the R package survival. We use the survreg() R function from the package to fit the gompertz parametric model, then use the cox.ph() R function from the same package to fit a coxPH model. The response variable is specified in these functions as Surv(eventtime, Status). Unfortuntely, survreg() only supports the following distributions: “extreme”, “logistic”, “gaussian”, “weibull”, “exponential”, “rayleigh”, “loggaussian”, “lognormal”, “loglogistic”, “t”. Maybe other functions do, such as the flexsurvreg() function in the flexsurv R package. However, we can rely on the coxph() function to fit the data, as we mentioned earlier, we do not need to specify a component like dist=”gompertz in coxph() the way it is required in the survreg() function. We need to include all the covariates stored in data2, in 2cgomp.survdat and keep the same name 2cgomp.survdat for the survival data frame.

Fitting the 2-component mixture data

2cgomp.survdat<-data.frame(2cgomp.survdat, data2)
fitcph.mixGomp <- coxph(Surv(eventtime, status) ~ Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, 2cgomp.survdat)

Fitting the non-PH data

nonPHgomp.survdat<-data.frame(nonPHgomp.survdat, data2)
fitcph.nonPHGomp <- coxph(Surv(eventtime, status) ~ Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, nonPHgomp.survdat)

Testing for the proportional hazard assumption

We use Schoenfeld test for PH defined by the R function cox.zph(). The null hypothesis H0 states that “hazards are proportional (or hazard ratio is constant overtime)”, the alternative Ha states that “hazards are not proportional (or hazard ratio is not constant overtime)”. Below is the R output for cox.zph(fitcph.mixGomp) and cox.zph(fitcph.nonPHGomp).

The cox.zph() tests each of the covariates and provide an overall test of the model as well. Use cox.zph()[] to get each one of the components of the test. We get what we expected, the test is significant (pvalue=1.4e-06) with the survival times simulated under the non-PH setting, and not significant (pvalue=0.084) with the survival times simulated under the PH setting. We can further plot cox.zph()[] for Xfemale and Xexsmoker for example, from the significant test, along with the 0 line of no change in hazard ratio overtime on each plot, to see how often the line travels the 95% confidence band around the hazard ratio change over time.

par(mfrow=c(1,2))
plot(
cox.zph(fitcph.nonPHGomp)[3])
abline(h=0,col=2)
plot(
cox.zph(fitcph.nonPHGomp)[4])
abline(h=0,col=2)

On the fist plot, the 0 line crosses the band just briefly, compared to the second plot where it crosses almost all the time, confirming non-constant hazard ratio in the first plot (pvalue<.05) and constant ratio (pvalue>.05) in the second.

Extra analysis

model selection

In case we wish to select the covariates worth for better model performance, we use the step() R function for stepwise model selection in combination to the anova() R function to select statistically significant covariates.

The function anova(), anova(step(fitcph.mixGomp)) to be specific, produces a lengthy output of several steps. We present the last step.

We are left with 4 significant covariates: age bmi treatment groups C and D. We repeat the same analysis with anova(step()) on the model reduced to those 4 variables (where datMixGomp = 2cgomp.survdat).

This new analysis suggests we drop D. So the summary statistics from the final reduced model is presented below.

Similarly, fitcph.nonPHGomp regression model is reduced to age bmi female and D.

Testing for linearity assumption

Some covariates may be best fitted with their quadrature cubic exponential or log version. Let’s check overall linearity from the reduced models above. We visualize this with the line codes below

par(mfrow=c(1,2))
plot(predict(reducedGomp.cPHmodel), residuals(reducedGomp.cPHmodel, type=”martingale”), xlab = “fitted values”, ylab = “martingale residuals”, main = “Residual plot”, las=1)
abline(h=0)
lines(smooth.spline(predict(reduced.cPHmodel), residuals(reduced.cPHmodel, type=”martingale”)), col=”red”)
plot(predict(reducedGomp.nonPHmodel), residuals(reducedGomp.nonPHmodel, type=”martingale”),
xlab = “fitted values”, ylab = “martingale residuals”, main = “Residual plot”, las=1)
abline(h=0)
lines(smooth.spline(predict(reducedGomp.nonPHmodel), residuals(reducedGomp.nonPHmodel, type=”martingale”)), col=”red”)

There is a suspicion of non collinearity between the survival times (simulated under the PH setting) and the covariates from the first graph. The second graph looks satisfactory, from the model fitted on survival times simulated under the nonPH setting.

Recommendations:

When PH assumption not met: categorize or stratify the concerned covariates, or apply some other possible validated transformations, or fit time-dependent coefficient models, or fit the data under the AFT model framework (see our next article).

When the linearity assumption not met: Identify the non linear variables, include quadratic cubic terms (i.e add polynomial terms), categorize/stratify the variable, apply some log transformation or exponent transformation, etc.

References

1. Clinical Trial Data Analysis Using R by Ding-Geng (Din) Chen, Karl E. Peace. International Statistical Review, 2013

--

--