# Survival analysis using Cox regression

When we analyze the longitudinal sequences measurements and time-to-event data, the response variable is time dependent. Given the situation that the distributions of length of time-to-event are usually heavily skewed, it cannot be added in a simple linear model directly. The Cox proportional hazards model consider the survival time and is more robust for various types of data distributions. Thus, Bayesian Cox proportional hazards model is suitable to analyze such kind of data. Note that we need to assume proportional-hazards in Cox regression, i.e., if an subject has a risk of death at the initial time point which is twice as high as that of another subject, then at all later times the mortality rate is assumed to be twice as high.

For example, we want to examine the association of COVID-19 cases’ mortality on time \(t\) with their blood test parameters. Here, we have response variable \(dN_it\) counts the number of death as they occur for individuals at time \(t\). If patient \(i\) is observed to die during the time interval \([t,t+dt)\), \(dN_it\) would be 1, otherwise \(dN_it\) equals 0. Predictor variables are three lab measurements: lactic dehydrogenase (LDH), lymphocyte and high-sensitivity C-reactive protein (hs-CRP).

In the model, response variable is assumed to be Poisson distributed (Equation 1). The corresponding intensity process for \(dN_it\) is defined as \(I_i(t)dt\). Indicator parameter \(Y_i(t)\) has value 1 or 0, presenting that whether or not the data of patient \(i\) is observed at time \(t\). Constant parameters \(\beta_1, \beta_2, \beta_3\) are regression coefficients for three bio-markers and assumed to be normal distributed independently. Variables \(X_1,X_2,X_3\) contain the observed values of three bio-markers we selected. Parameter \(d\Lambda_0(t)\) is the increments of baseline conditional instantaneous mortality rate at time interval \([t,t+dt)\), i.e. the mortality rate for a survival patient at time interval \([t,t+dt)\) when all of bio-markers \(X_1,X_2,X_3\) equal 0. In some ways, parameter \(d\Lambda_0(t)\) is similar as the random effect term \(\beta_i\) for individuals \(i\) in the random effects model, while ‘individual’ for \(d\Lambda_0(t)\) in the cox regression is each time interval \([t,t+dt)\). We assumed the increments of baseline conditional instantaneous mortality rate gamma distributed at every time interval \([t,t+dt)\) (Equation 3), which is suggested by Kalbfleisch. Here, \(d\Lambda_0^*(t)\) represents a prior guess at the unknown \(d\Lambda_0(t)\), with \(c\) representing the degree of confidence in this guess. Constant \(r\) is a guess at the mortality rate per time interval \([t,t+dt)\). Note that if we take the same length of time interval \(dt\), then every \(d\Lambda_0(t)\) will have same distribution \(\text{Gamma}(cd\Lambda_0^*(t), c)\) because \(d\Lambda_0^*(t)=rdt\) and \(r\) is a constant.

\[\begin{equation} dN_i(t) \sim \text{Poisson}(I_i(t)dt), i=1,2,3... \end{equation}\] \[\begin{equation} I_i(t)dt=Y_i(t)exp(\beta_1X_1+\beta_2X_2+ \beta_3X_3)d\Lambda_0(t) \end{equation}\] \[\begin{equation} d\Lambda_0(t) \sim \text{Gamma}(cd\Lambda_0^*(t), c) \end{equation}\] \[\begin{equation} d\Lambda_0^*(t) = rdt \end{equation}\]

Since some patients do not have blood test results for all three bio-markers, we handled missing values in longitudinal data with multiple imputation. We treated missing value as random variables and built a multivariate normal random intercept and slope model to link the observed data and unknown parameters. As shown in Eqn \(5\sim7\), \(i,j\) represents patient \(i\) and \(j\)th observations within person, respectively. Variable \(t_{ij}\) is the time of the observation and variables \(X_{1ij},X_{2ij},X_{3ij}\) are the blood test results for three bio-markers at time \(t_{ij}\). Fix effects \(\alpha_1,\alpha_2,\alpha_3\), residual variances \(\epsilon_{1ij},\epsilon_{2ij}\) and marginal variance \(\epsilon_{3ij}\) are assumed normal distributed independently. Random intercept \(\beta_{intercept.i.1},\beta_{intercept.i.2},\beta_{intercept.i.3}\) and random slope \(\beta_{slope.i.1}, \beta_{slope.i.2}, \beta_{slope.i.3}\) are are assumed to follow the multivariate normal distribution, which have an 6*6 inverse Wishart prior \(D\) (Eqn 8). Then we used a Bayesian method to estimate the unknown random variables, provided that those values are part of the joint model.

\[\begin{equation} X_{1ij} = \alpha_1 X_{2ij}+\alpha_2X_{3ij}+\beta_{intercept.i.1}+\beta_{slope.i.1}*t_{ij}+\epsilon_{1ij} \end{equation}\] \[\begin{equation} X_{2ij} = \alpha_3 X_{3ij}+\beta_{intercept.i.2}+\beta_{slope.i.2}*t_{ij}+\epsilon_{2ij} \end{equation}\] \[\begin{equation} X_{3ij} = \beta_{intercept.i.3}+\beta_{slope.i.3}*t_{ij}+\epsilon_{3ij} \end{equation}\] \[\begin{equation} (\beta_{intercept.i.1},\beta_{intercept.i.2},\beta_{intercept.i.3},\beta_{slope.i.1}, \beta_{slope.i.2}, \beta_{slope.i.3}) \sim N(0, D) \end{equation}\]

After setting proper priors, we can use R software with `R2jags`

package to conduct Bayesian analysis and estimate posteriors.