# Simulating Common, Censored, Outcome Variables as Dependent Variables in R [part 1]

*[Under Review]*

Following what we did here, what constitute the response variable Y in this short article, are individual times or durations that mark the time of an event of interest. They are commonly called *time-to-event (tte)*. These could be in days, weeks months or years or times converted in some other forms. The event may have happened before an investigator starts investigating with a planned trial study, the event may happen during the study, or it may happen after the study. Not all patients/participants have experienced or will experience the event of interest. A common scenario is presented this way: enrolled participants either experience the event and, the time the event happened is recorded and they are stopped being followed, OR they don’t experience the event and they choose to quit being followed or the investigator lost track of them and the investigator has no idea whether the patient experienced the event or not and so the last time the patient was seen is recorded as *censored time*, OR the end time of the study reaches and the times recorded for the patients who have not experience the event yet, correspond to the study end time and these times are also marked as *censored times*. All these individual times are recorded in the response variable Y. Censored times in this kind of scenario are qualified as *right-censored*. Other alternative scenarios, not of interest in this short article, are left-censored and interval-censored times.

To simulate times from the described *right-censoring mechanism*, we generate two variables for times: the *tte variable T and the time to right-censoring variable C, *assuming T is independent of C given a set of covariates X.* *Then we consider the observations in Y to be the minimum between the T and the C values. An important variable necessary in survival analysis is the indicator binary variable of *1’s for tte’s* and *0’s for censored times*. The event of interest is usually death, from which we get the name *survival times* (time you survived until you died). Another common term is the *overall survival times (OS)* when event is *death from all causes* (a patient may die of a road accident while the event of interest is death from the observed health condition), and *progression free survival times (PFS)* when the event of interest include *death and disease progression.*

We simulate survival times that can be fitted in the general three classes of survival models for right-censored data: A *class of parametric models (commonly the exponential, Weibull, Rayleigh, Gompertz, loglogistic and lognormal models), which include the accelerated failure time (AFT) models, and the cox proportional (cox-PH) semi-parametric models.* We present simulation strategies for AFT models in the next article. There are three fundamental functions in survival modelling:

The *hazard function h()* which measures for each individual i, the *(instantaneous) *risk of experiencing the event of interest at time t. It depends on *the baseline* (*i.e measured at the start of the study*) *hazard h0()*, and the linear predictor of potential covariates. We will use the same linear predictor from the other article about simulations, considering that survival times depend on those variables. However, the set of coefficients *beta*, does not include the intercept coefficient b0 for the parametric models (we exclude the AFT models in this restriction). You may get some clarifications about this, here. The *cumulative hazard function Hi()* integrates or sum up to t, the hazard function *hi()*, when the latter is continuous or discrete respectively. The *survival function* *Si()* measures how likely an individual i can experience the event of interest after *some time t* or the probability to “*survive” *up to time t.

The parametric models and the cox-PH model are essentially based on these functions (check *here* how they are one-to-one related). One most note that:

“A hazard function may remain constant with respect to time (correspond-

ing to an exponential density); may increase as a function of time according

to some power function (corresponding to a Weibull density); may increase

linearly with time (corresponding to a Rayleigh density); or may increase ex-

ponentially with time (corresponding to a Gompertz density). In addition,

there may be intervals of time where the hazard may alternatingly decrease

or increase according to some power, linear, or exponential function of time.” —Ding and Karl E (2013) chapter 5.

Many functions have been defined for the hazard *h0()* measured at baseline; it could be a *constant *a *polynomial* some *log* or some *function of exponents of t*. Commonly hazards related to the *exponential, *the* gompertz* and the *weibull* distributions are:

Parametric models make distributional assumptions on the baseline hazard, which is equivalent to making distributional assumptions on the survival times, due to their one-to-one relationships. The cox-PH model requires no specific distribution for baseline hazard or the survival function , making the model flexible to fit any data parametric models can fit. From the definition of the hazard function, we get the equality *log[hi(t)/h0(t)] = beta*Xi, *from which* the assumption of constant log hazard ratio (lhr) *over time is made (i.e the linear predictor *beta*Xi *is considered not to depend on t), also known as the proportional hazard (PH) assumption*. *The latter is a particularly known assumption under the semi-parametric cox-PH modelling framework.

## Simulating survival times from standard parametric survival distributions (Assuming proportional hazards)

This has been facilitated by the R package *simsurv (*web documentation, YouTube*).*

*set.seed(10082021)**b**<-runif(10,-.05,.05)*

*betas**<-c(Xage=b[2], Xbmi=b[3], Xfemale=b[4], Xexsmoker=b[5], Xnonsmoker=b[6], XB=b[7], XC=b[8], XD=b[9], XE=b[10]). “betas” is a named vector.*

We use the same generated covariates from this article stored in *data2* in which we add an individual index column required by the *simsurv()* R function:

*data2**<-data.frame(id = 1:1000, gender=factor(gender), Xage=age, Xbmi=bmi, smoking.status=factor(smoking.status), trtgps=factor(trtgps), Xfemale=contrast.matrix[,2], Xexsmoker=contrast.matrix[,3], Xnonsmoker=contrast.matrix[,4], XB=contrast.matrix[,5], XC=contrast.matrix[,6], XD=contrast.matrix[,7], XE=contrast.matrix[,8])*

**The Exponential Model**

*exp.survdat** <- simsurv(dist = “exponential”, lambdas = 1.3, betas = betas, x = data2, seed=77)*

*“exp.survdat” *is a data frame of 3 variables. The “status” variable is a vector of 1’s indicating none of the *“eventtimes” *is censored. To create censored times, one may assume that, any *eventtime* of 4 or more than 4 is considered (right) censored. We specify this in the *simsurv()* function with the component *maxt=4 *and get the following. Remember to set the seed within the* simsurv *function for me and you to have the same simulated survival times.

We don’t have much censored times in this data. The censoring rate is 28/1000*100 = 2.8%. The censored times are fixed values, which is not really common. Another way to create right-censored times, is to generate two exponential *eventtimes *T and C, say with the objects ** exp1.survdat **and

**respectively. The final event time would be the minimum**

*exp2.survdat**min(Ti, Ci)*and the status variable would be 1 if Ti less than or equal to Ci and 0 otherwise. We vary the lambdas values (say 1.3 and 1) and the seeds (say 77 and 78) in

**and**

*exp1.survdat***respectively. Assuming we have generated**

*exp2.survdat***and**

*exp1.survdat***we get the new data frame**

*exp2.survdat,***as**

*exp.survdat**exp.survdat**<-data.frame(id=exp1.survdat$id, T=exp1.survdat$eventtime, C=exp2.survdat$eventtime, eventtime=pmin(exp1.survdat$eventtime, exp2.survdat$eventtime),status=as.numeric(exp1.survdat$eventtime<= exp2.survdat$eventtime))*

The first 5 values of the data frame look like these. This time censored values have diverse values. We proceed the same way the remaining models.

## The gompertz model

*gomp1.survdat** <- simsurv(dist = “gompertz”, lambdas = .03, gammas = .5, betas = betas, x = data2, seed = 77)*

*gomp2.survdat** <- simsurv(dist = “gompertz”, lambdas = .01, gammas = .3, betas = betas, x = data2, seed = 78)*

The resulting survival times are in the data frame *gomp.survdat:*

*gomp.survdat**<-data.frame(id=gomp1.survdat$id, T=gomp1.survdat$eventtime, C=gomp2.survdat$eventtime, eventtime=pmin(gomp1.survdat$eventtime,gomp2.survdat$eventtime), status=as.numeric(gomp1.survdat$eventtime<=gomp2.survdat$eventtime))*

## The weibull model

*weib1.survdat** <- simsurv(dist = “weibull”, lambdas = .3, gammas = 1.5, betas = betas, x = data2, seed = 77)*

*weib2.survdat** <- simsurv(dist = “weibull”, lambdas = .1, gammas = 1.3, betas = betas, x = data2, seed = 78)*

*weib.survdat**<-data.frame(id=weib1.survdat$id, T=weib1.survdat$eventtime, C=weib2.survdat$eventtime, eventtime=pmin(weib1.survdat$eventtime, weib2.survdat$eventtime), status=as.numeric(weib1.survdat$eventtime<=weib2.survdat$eventtime))*

The simulations setting above *(positive lambda and gamma under gompertz with gamma>1 for weibull)* imply that the baseline hazards are monotonically(strictly) increasing for both the weibull and gompertz *(the exponential hazard is constant)*. However, in reality, h0(t) could also decrease. A 2-component mixture model relaxes this assumption of monotonicity of the baseline hazard such that survival times under a 2-component mixture gompertz model for example can be generated this way:

*2cgomp.survdat **<- simsurv(dist = “gompertz”, lambdas = c(0.03, 0.01), gammas = c(.3, .5), betas = betas, mixture = TRUE, pmix = 0.5, x = data2, maxt = 9.00, seed=77)*

## Simulating survival times from standard parametric survival distributions (Assuming non proportional hazards)

This assumption implies the linear predictor *beta*Xi *depends on t.* *For example,* beta*Xi *can be of the form* beta*Xi = beta1*Xi + beta2*Xi*log(t). *We then define two sets of coefficients *beta1* and *beta2*, they have the same dimension*, beta1 *is the* lhr* when* log(t)=0, beta2 *is the change in *lhr* per unit change in log(t).

*set.seed(10082021)**b**<-runif(10,-.05,.05)*

*beta1**<-c(Xage=b[2], Xbmi=b[3], Xfemale=b[4], Xexsmoker=b[5], Xnonsmoker=b[6], XB=b[7], XC=b[8], XD=b[9], XE=b[10]).*

*b2**<-runif(10,-.01,.05) *

*beta2**<-c(Xage=b2[2], Xbmi=b2[3], Xfemale=b2[4], Xexsmoker=b2[5], Xnonsmoker=b2[6], XB=b2[7], XC=b2[8], XD=b2[9], XE=b2[10])*

Survival times are finally obtained with the line code below:

*nonPHgomp.survdat** <- simsurv(dist = “gompertz”, lambdas = 0.03, gammas = .5, betas = beta1, tde = beta2, x = data2, tdefunction = “log”, maxt = 4, seed=77)*

Another scenario would be to define the beta2 for the treatment groups only:

*beta2**<-c(XB=b2[7], XC=b2[8], XD=b2[9], XE=b2[10])* and the code above would still generate the corresponding survival times.

## Fitting the simulated survival times

We fit the *gompertz-PH *and the *coxPH* models to ** 2cgomp.survdat **and

**respectively,**

*nonPHgomp.survdat***and test for PH assumption, taking into account for the potential correlation between the survival times and the covariates**

*age, bmi, gender, smoking status*and

*the 5 treatments*under investigation. We need the

*R package survival*. We use the

*survreg()*R function from the package to fit the gompertz parametric model, then use the cox.ph() R function from the same package to fit a coxPH model. The response variable is specified in these functions as

*Surv(eventtime, Status).*Unfortuntely,

*survreg()*only supports the following distributions:

*“extreme”, “logistic”, “gaussian”, “weibull”, “exponential”, “rayleigh”, “loggaussian”, “lognormal”, “loglogistic”, “t”.*Maybe other functions do, such as the

*flexsurvreg()*function

*in the*

*flexsurv R package*. However, we can rely on the

*coxph()*function to fit the data, as we mentioned earlier, we do not need to specify a component like

*dist=”*gompertz

*”*in

*coxph()*the way it is required in the

*survreg() function.*We need to include all the covariates stored in

*data2,*in

*2cgomp.survdat*and keep the same name

*2cgomp.survdat*for the survival data frame.

**Fitting the 2-component mixture data**

*2cgomp.survdat**<-data.frame(2cgomp.survdat, data2)*

*fitcph.mixGomp**<- coxph(Surv(eventtime, status) ~ Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, 2cgomp.survdat)*

**Fitting the non-PH data**

*nonPHgomp.survdat**<-data.frame(nonPHgomp.survdat, data2)*

*fitcph.nonPHGomp**<- coxph(Surv(eventtime, status) ~ Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, nonPHgomp.survdat)*

**Testing for the proportional hazard assumption**

We use *Schoenfeld test* for PH defined by the R function *cox.zph()*. The null hypothesis H0 states that *“hazards are proportional (or hazard ratio is constant overtime)”*, the alternative Ha states that *“hazards are not proportional (or hazard ratio is not constant overtime)”. *Below is the R output for* cox.zph(**fitcph.mixGomp**) *and* cox.zph(**fitcph.nonPHGomp**).*

The *cox.zph()* tests each of the covariates and provide an overall test of the model as well. Use *cox.zph()[] *to get each one of the components of the test.* *We get what we expected, the test is significant *(pvalue=1.4e-06)* with the survival times simulated under the non-PH setting, and not significant *(pvalue=0.084) *with the survival times simulated under the PH setting. We can further plot *cox.zph()[] *for* Xfemale and Xexsmoker *for example, from the significant test*, *along with* the 0 line of no change in hazard ratio overtime *on each plot*, *to see how often the line travels* the 95% confidence band around the hazard ratio change over time.*

*par(mfrow=c(1,2))plot(*

*cox.zph(fitcph.nonPHGomp)[3]**)*

abline(h=0,col=2)

plot(

abline(h=0,col=2)

plot(

*cox.zph(fitcph.nonPHGomp)[4]**)*

abline(h=0,col=2)

abline(h=0,col=2)

On the fist plot, the 0 line crosses the band just briefly, compared to the second plot where it crosses almost all the time, confirming non-constant hazard ratio in the first plot *(pvalue<.05)* and constant ratio *(pvalue>.05)* in the second.

**Extra analysis**

**model selection**

In case we wish to select the covariates worth for better model performance, we use the *step()* R function for *stepwise model selection* in combination to the *anova()* R function to select statistically significant covariates.

The function *anova(), anova(step(fitcph.mixGomp)) *to be specific, produces a lengthy output of several steps. We present the last step.

We are left with 4 significant covariates: *age bmi* *treatment groups C* and *D. *We repeat the same analysis with* anova(step()) *on the model* *reduced to those 4 variables (where *datMixGomp = 2cgomp.survdat*).

This new analysis suggests we drop D. So the summary statistics from the final reduced model is presented below.

Similarly, ** fitcph.nonPHGomp **regression

**model is reduced to**

*age bmi female and D.*

**Testing for linearity assumption**

Some covariates may be best fitted with their quadrature cubic exponential or log version. Let’s check overall linearity from the reduced models above. We visualize this with the line codes below

*par(mfrow=c(1,2))**plot**(predict(reducedGomp.cPHmodel), residuals(reducedGomp.cPHmodel, type=”martingale”), xlab = “fitted values”, ylab = “martingale residuals”, main = “Residual plot”, las=1)*

*abline**(h=0)*

*lines**(smooth.spline(predict(reduced.cPHmodel), residuals(reduced.cPHmodel, type=”martingale”)), col=”red”)*

*plot**(predict(reducedGomp.nonPHmodel), residuals(reducedGomp.nonPHmodel, type=”martingale”),*

xlab = “fitted values”, ylab = “martingale residuals”, main = “Residual plot”, las=1)

xlab = “fitted values”, ylab = “martingale residuals”, main = “Residual plot”, las=1)

*abline**(h=0)*

*lines**(smooth.spline(predict(reducedGomp.nonPHmodel), residuals(reducedGomp.nonPHmodel, type=”martingale”)), col=”red”)*

There is a suspicion of non collinearity between the survival times (simulated under the PH setting) and the covariates from the first graph. The second graph looks satisfactory, from the model fitted on survival times simulated under the nonPH setting.

**Recommendations:**

When PH assumption not met: categorize or stratify the concerned covariates, or apply some other possible validated transformations, or fit time-dependent coefficient models, or fit the data under the AFT model framework (see our next article).

When the linearity assumption not met: Identify the non linear variables, include quadratic cubic terms (i.e add polynomial terms), categorize/stratify the variable, apply some log transformation or exponent transformation, etc.

# References

1. *Clinical Trial Data Analysis Using R* by Ding-Geng (Din) Chen, Karl E. Peace. International Statistical Review, 2013