Simulating Common, Non-Censored, Outcome Variables as Dependent Variables in R

Laure Ngouanfo Fopa
Published in Analytics Vidhya
7 min read · Aug 9, 2021

[Under Review]

Simulated (or otherwise synthetic) data are a valuable aid for understanding statistical modelling techniques and for preparing to handle real-world data. In healthcare, for example, a simulated trial often guides the investigator towards the number of participants expected to yield a higher clinical success rate.

Here we focus on clinical trial data. The data-generating process for such data is commonly assumed to yield normally, binomially, or Poisson-distributed outcomes; sometimes multinomial or multivariate normally distributed outcomes are used for a joint analysis of multiple outcomes while accounting for the correlations between them.

Let's say we are administering 5 treatments to 5 groups of patients, 1000 patients in total. Beforehand, we collect information on their age (in years), body mass index (bmi, in kg/m²), gender, and smoking status. The last two variables, together with the treatment variable, are categorical variables with 2, 3, and 5 levels respectively: a patient is either male or female; a smoker, ex-smoker, or non-smoker; and is given treatment A, B, C, D, or E. We assume that the outcome response (Y) to treatment depends on all these variables (X). For simplicity, we take the expected value of Y, E(Y), to be linearly related to these variables through an invertible function g (often referred to as a link function) such that

g(E(Y)) = b0 + b.age*(Xage - mean(Xage)) + b.bmi*(Xbmi - mean(Xbmi)) + b.female*Xfemale + b.exsmoker*Xexsmoker + b.nonsmoker*Xnonsmoker + b.B*XB + b.C*XC + b.D*XD + b.E*XE

The reference levels for the three categorical variables are male, smoker, and A respectively. The b's are the true regression coefficients: b0 is the true effect on the outcome response at the mean age and mean bmi, when the binary variables Xfemale, Xexsmoker, Xnonsmoker, XB, XC, XD, XE are all zero. In other words, b0 is the true effect for male smokers enrolled in treatment group A with average age and bmi. This is the main reason for centering the continuous (and typically non-zero) variables age and bmi: it gives the intercept b0 a reasonable interpretation. For the simulation and fitting in R, we will use the following packages: MASS, tidyverse, truncnorm and fitdistrplus.
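Assuming these packages are already installed, they are loaded with library(); the comments indicate which of the functions used below come from each package:

library(MASS) #glm.nb()
library(tidyverse) #mutate() and the %>% pipe
library(truncnorm) #rtruncnorm()
library(fitdistrplus) #fitdist(), denscomp(), gofstat()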

Simulating the linear predictor

First, we simulate the 5 variables (covariates) the following way:

set.seed(10082021) #to generate the same values each time (do not worry if you do not get the same simulated values anyway!)
gender<-sample(c("male", "female"), size=1000, replace = T, prob=c(.4,.6)) #binary categorical variable
age<-sample(17:103, size=1000, replace = T, prob = NULL) #age in years, sampled uniformly from 17 to 103
bmi<-rtruncnorm(1000, a=0, b=50, mean=20.7, sd=6.5) #bmi from a normal distribution truncated to (0, 50)
smoking.status<-sample(c("smoker","ex-smoker", "non-smoker"), size=1000, replace = T, prob=c(.3,.2,.5)) #three-level categorical variable
trtgps<-sample(LETTERS[1:5], size=1000, replace = T, prob=c(.15,.1,.35,.15,.25)) #treatment groups A to E

One could use other distributions to generate these values: a binomial distribution for gender; uniform, truncated Weibull, truncated normal (truncnorm), exponential, truncated Poisson, or any other appropriate continuous distribution for the continuous variables age and bmi; and a multinomial distribution for the smoking status and treatment group (trtgps) variables. The simulated variables, bundled in a data frame as data1<-data.frame(gender=factor(gender), age=age, bmi=bmi, smoking.status=factor(smoking.status), trtgps=factor(trtgps)), look like this:

With View(data1) in R

The categorical variables are wrapped in the R function factor() inside the data frame data1; otherwise they would remain character variables, which are not appropriate for statistical analysis in R.
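One way to confirm this (just a quick check with base R) is:

str(data1) #gender, smoking.status and trtgps should be listed as factors, age and bmi as numeric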

Next, we specify the reference levels of the categorical variables with the following code:

relevel.data1 <- data1 %>%
mutate(gender = relevel(gender, ref = "male"),
smoking.status = relevel(smoking.status, ref = "smoker"),
trtgps = relevel(trtgps, ref = "A"))

A categorical variable with k levels is handled in the analysis in such a way that the k-1 non-reference levels are coded as binary variables. For example, XB=1 if trtgps=B and XB=0 otherwise. To easily detach these levels from their respective parent variables, we use the R function model.matrix(), which creates a (contrast) matrix: contrast.matrix <- model.matrix(~gender+smoking.status+trtgps, data = relevel.data1). The matrix looks like the one below; the first column (Intercept) is not of interest here.

With View(contrast.matrix) in R
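As a quick check of this dummy coding (a suggestion beyond the original code; model.matrix() names each dummy column by pasting the variable name and the level, e.g. trtgpsB), one can cross-tabulate a parent variable against the corresponding column:

table(relevel.data1$trtgps, contrast.matrix[, "trtgpsB"]) #all B rows should fall in the "1" column, every other group in the "0" column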

We add the last 7 columns to data1 and form a new dataset data2:

data2<-data.frame(gender=factor(gender), Xage=age, Xbmi=bmi,smoking.status=factor(smoking.status),trtgps=factor(trtgps), Xfemale=contrast.matrix[,2], Xexsmoker=contrast.matrix[,3], Xnonsmoker=contrast.matrix[,4], XB=contrast.matrix[,5], XC=contrast.matrix[,6], XD=contrast.matrix[,7], XE=contrast.matrix[,8])

With View(data2) in R

We generate arbitrary values b<-runif(10,-.05,.05) for the true coefficients c(b0, b.age, b.bmi, b.female, b.exsmoker, b.nonsmoker, b.B, b.C, b.D, b.E) respectively. These coefficients are deliberately small values in the interval [-.05,.05]: they act on the log or log-odds scale when the link function g is the log or the logit (see below), so large values may yield unrealistic responses (you can verify this by widening the bounds of the uniform distribution). We then define the linear expression/predictor:

#the columns Xage, Xbmi, ..., XE live in data2, so the expression is evaluated inside it
linear.expr <- with(data2, b[1]+b[2]*(Xage-mean(Xage))+b[3]*(Xbmi-mean(Xbmi))+b[4]*Xfemale+b[5]*Xexsmoker+b[6]*Xnonsmoker+b[7]*XB+b[8]*XC+b[9]*XD+b[10]*XE)
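Before simulating responses, a quick optional check of what this linear predictor implies on the response scale for the log and logit links used below can be done with exp() and plogis() (the inverse logit in base R):

summary(exp(linear.expr)) #implied means under a log link
summary(plogis(linear.expr)) #implied success probabilities under a logit link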

Simulating the response variable Y with respect to the linear predictor

Scenario 1: we impose no known distribution on the outcome response variable Y

In this case, Y is simply the linear expression defined above.

Y<-with(data2, linear.expr)

A quick summary of Y indicates that it is a continuous variable with some negative values. Its mean and median are similar, so it could plausibly follow a normal distribution.
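For example (the exact numbers depend on the seed and on the simulated coefficients):

summary(Y) #minimum, quartiles, median, mean and maximum of the simulated response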

Two common distributions we can fit to Y are the uniform and the normal distributions. We do that with the fitdist() R function

fit.u <- fitdist(Y, "unif")
fit.n <- fitdist(Y, "norm")

and compare both fitted distributions in the plot generated by the R function denscomp(list(fit.u, fit.n), legendtext=c("uniform", "normal")).

Plot with denscomp(list(fit.u, fit.n), legendtext=c("uniform", "normal"))

and compare the goodness-of-fit statistics returned by the R function gofstat().
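A call along the following lines does this (the fitnames argument simply labels the two fits in the printed output):

gofstat(list(fit.u, fit.n), fitnames=c("uniform", "normal")) #AIC, BIC and other goodness-of-fit statistics for both fits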

The difference in AIC or BIC is small (< 3), and the plot shows that either a uniform or a normal model would provide reasonable estimates of the parameters that describe Y. A more thorough investigation of which other distributions might fit Y best can be carried out within the fitdistrplus R package.
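One possible starting point for such an investigation is the Cullen and Frey graph produced by descdist() in fitdistrplus, which places the skewness and kurtosis of Y among several candidate distributions:

descdist(Y, discrete = FALSE, boot = 100) #skewness-kurtosis plot with 100 bootstrap samples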

Scenario 2: we impose a normal distribution on the outcome response variable Y

Two parameters are needed to simulate a normally distributed variable in R: the mean and the standard deviation sd. There are two cases to distinguish, to account for the possible link functions. We choose arbitrary values for sd, while the mean is a function of the predefined linear combination of the covariates.

case 1: g is the identity link g(E(Y))=E(Y)

Y<-with(data2, rnorm(1000, mean = linear.expr, sd = 3.7))

case 2: g is the log link g(E(Y))=log(E(Y))

Y<-with(data2, rtruncnorm(1000, a=0, b=Inf, mean = exp(linear.expr), sd = 0.3))

Scenario 3: we impose a binomial distribution on the outcome response variable Y

Only one parameter is needed here: prob, the probability of the event of interest, commonly characterized as the probability of success.

case 1: g is the logit link g(E(Y))=log(E(Y)/(1-E(Y))) (also known as the log odds)

Y<-with(data2, rbinom(1000, size=1, prob=exp(linear.expr)/(1+exp(linear.expr))))

case 2: g is the probit link g(E(Y))=Phi^(-1)(E(Y)), where Phi^(-1) is the inverse cdf of the standard normal distribution N(0,1). The pnorm() R function generates cdf values for N(0,1), so E(Y)=pnorm(linear predictor).

Y<-with(data2, rbinom(1000, size=1, prob=pnorm(linear.expr)))

case 3: g is the complementary log-log link function g(E(Y))=log(-log(1-E(Y)))

Y<-with(data2, rbinom(1000, size=1, prob=1-exp(-exp(linear.expr))))
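For any of the three binomial cases, a quick optional sanity check is to compare the observed proportion of 1s with the average of the success probabilities passed to rbinom(); for the logit case, for example:

mean(Y) #observed proportion of successes
mean(exp(linear.expr)/(1+exp(linear.expr))) #average of the probabilities passed to rbinom()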

Scenario 4: we impose a Poisson distribution on the outcome response variable Y

Here too, only one parameter is needed to simulate Poisson data: lambda, the rate of events of interest.

case 1: g is the log link g(E(Y))=log(E(Y))

Y<-with(data2, rpois(1000, lambda=exp(linear.expr)))
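As with the binomial case, one can optionally check that the sample mean of the counts is close to the average of the rates passed to rpois():

mean(Y) #sample mean of the simulated counts
mean(exp(linear.expr)) #average of the lambda values passed to rpois()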

Fitting the response variable Y

Each of the response variables from the different scenarios can then be fitted within the generalized linear model (glm) framework.

fit.Y<-glm(Y~Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+
XB+XC+XD+XE, family=distribution(link="."), data=data2)

Replace the distribution component with gaussian (normal), binomial, or poisson, and set the link component to identity, log, logit, probit, or cloglog accordingly. GLMs are designed for distributions in the exponential family. The uniform distribution, however, is not part of the exponential family, so other models should be sought for that case (a simple linear model with the lm() R function could do). There is also a concern of over- or under-dispersion when fitting binomial or Poisson data, for example. Over-dispersion occurs when the residual deviance of the fitted model is greater than its degrees of freedom (under maximum likelihood estimation, the residual deviance is asymptotically chi-squared with those degrees of freedom); under-dispersion occurs when the deviance is smaller than the degrees of freedom. Alternatives for dealing with dispersion include quasi-likelihoods (such as the quasibinomial and quasipoisson families) and fitting a negative binomial model with the R function glm.nb() to the over- or under-dispersed data. The fact that a single parameter describes Poisson data, for example, makes it harder to find the best fit for such data, which is one reason dispersion issues arise.
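As a concrete illustration (a sketch, not part of the original post: each call assumes Y currently holds the response simulated in the corresponding scenario), the template above could be instantiated as follows; glm.nb() comes from the MASS package loaded earlier:

#logistic regression for the binomial response of Scenario 3, case 1 (Y must hold that response)
fit.logit <- glm(Y~Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, family=binomial(link="logit"), data=data2)

#Poisson regression for the count response of Scenario 4, with a quasi-Poisson fit if dispersion is a concern
fit.pois <- glm(Y~Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, family=poisson(link="log"), data=data2)
fit.quasi <- glm(Y~Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, family=quasipoisson(link="log"), data=data2)

#negative binomial alternative for over-dispersed counts
fit.nb <- glm.nb(Y~Xage+Xbmi+Xfemale+Xexsmoker+Xnonsmoker+XB+XC+XD+XE, data=data2)

#rough dispersion check: residual deviance against residual degrees of freedom
deviance(fit.pois)/df.residual(fit.pois)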

