Introduction
It is common to have multiple outcomes of interest in a single study. Usually, these outcomes are analysed separately, ignoring the correlation between them. In high dimensional setting, one would expect that a carefully designed multivariate approach would be a more efficient alternative to individual analyses of each outcome [Rothman et al.,2010]. For jointly modeling multiple correlated outcomes in GWAS, we have developed a statisitcal method in this paper. Here we provide same demonstrations for our method, which is implemented in the R package vimco.
Assume we have measured \(q\) responses \(y_1, \dots, y_q\) and the same set of \(p\) predictors on each individual \(x_1,\dots, x_p\). Each response has its own regression model: \[y_j = \mathbf X\beta_j + \varepsilon_j, ~ j=1, \dots, p,\] where \(\beta_j\) is the regression vector for the \(j\)th response. We can formulate the multivariate multiple regression model as follows: \[\mathbf Y = \mathbf X \mathbf B + \mathbf E,\] where \(\mathrm E(\mathbf E)=0\), and \(\mathrm {Var}(\mathbf E)=\Sigma\).
Simulate high dimensional data
You can simulate a simple example with the following R code:
AR = function(rho, p) {
outer(1:p, 1:p, FUN = function(x, y) rho^(abs(x - y)))
}
library(mvtnorm)
n = 300 # sample size
p = 400 # dimension of predictors
K = 5 # dimension of outcomes
set.seed(20132014)
X = rmvnorm(n, mean=rep(0, p))
sigma.beta = rep(1, K)
bet = matrix(0, nrow = p, ncol = K)
lambda = 0.15 # proportion of non-zero entries in matrix \beta
eta = rbinom(p, 1, lambda)
alpha = 1
gam = matrix(rbinom(p*K, 1, alpha), ncol=K)
for (k in 1:K){
bet[, k] = rnorm(p, mean = 0, sd = sigma.beta[k]) * gam[,k] * eta
}
sigma = AR(0.8, K)
lp = X %*% bet
sigma.e = diag(sqrt(diag(var(lp)))) %*% sigma %*% diag(sqrt(diag(var(lp))))
err = rmvnorm(n, rep(0, K), sigma.e)
Y = lp + err
print(round(cor(Y),2))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00 0.39 0.25 0.11 0.20
## [2,] 0.39 1.00 0.35 0.29 0.25
## [3,] 0.25 0.35 1.00 0.44 0.23
## [4,] 0.11 0.29 0.44 1.00 0.42
## [5,] 0.20 0.25 0.23 0.42 1.00
Fit each outcome seperately
library(vimco)
fit_Ind = emInd(X, Y)
Fit all outcomes jointly
# initial values
p = ncol(X)
mu0 = fit_Ind$mu
sigb0 = fit_Ind$sigb
Theta0 = matrix(0, nrow=ncol(Y), ncol=ncol(Y))
diag(Theta0) = 1/c(fit_Ind$sige)
Lambda0 = rep(1, p)
Alpha0 = fit_Ind$Alpha
lambda0 = 1
#fit_Mul = emMultiple(X, Y)
fit_Mul = emMultiple(X, Y, mu0, sigb0, Theta0, Lambda0, Alpha0, TRUE)
Compare results
plot(bet, fit_Ind$Beta, xlab = "ture beta", ylab = "estimate", col=2)
points(bet, fit_Mul$Beta, col=3)
abline(0, 1, lwd=2, col=1)
legend("topleft", c("jointly", "seperately"), pch=21, col=3:2)
Fit VIMCO for GWAS data with multiple traits
To handle GWAS data that is in a PLINK format (incudes sim.bed, sim.bim, sim.fam) directly, vimco package provides a function vimco::vimco()
, check its help file for more details. Here we provide a simple examples. The following 3 PLINK files are already provided in our packages:
- sim.bed
- sim.bim
- sim.fam
path <- system.file(package = "vimco")
setwd(path)
stringname <- "sim"
tmp <- vimco(stringname, nPheno = 4, fit = TRUE)