Abstract
In genome-wide association studies (GWAS), joint analysis of multiple phenotypes could have increased statistical power over analyzing each phenotype individually to identify genetic variants that are associated with complex diseases. With this motivation, several statistical methods that jointly analyze multiple phenotypes have been developed, such as O’Brien’s method, Trait-based Association Test that uses Extended Simes procedure (TATES), multivariate analysis of variance (MANOVA), and joint model of multiple phenotypes (MultiPhen). However, the performance of these methods under a wide range of scenarios is not consistent: one test may be powerful in some situations, but not in the others. Thus, one challenge in joint analysis of multiple phenotypes is to construct a test that could maintain good performance across different scenarios. In this article, we develop a novel statistical method to test associations between a genetic variant and Multiple Phenotypes based on cross-validation Prediction Error (MultP-PE). Extensive simulations are conducted to evaluate the type I error rates and to compare the power performance of MultP-PE with various existing methods. The simulation studies show that MultP-PE controls type I error rates very well and has consistently higher power than the tests we compared in all simulation scenarios. We conclude with the recommendation for the use of MultP-PE for its good performance in association studies with multiple phenotypes.
Introduction
Traditionally, genome-wide association studies (GWAS) have been performed on individual phenotypes. In spite of the success of GWAS in identifying thousands of associations between genetic variants and complex diseases, the identified variants explain only a small proportion of the phenotypic variation. In the study of a complex disease, several correlated phenotypes are usually measured for a disorder or its risk factors1. Therefore, by jointly analyzing multiple correlated phenotypes, we may increase statistical power to detect causal variants with weak genetic effects on complex diseases.
One way to use multiple phenotypes in association studies is to analyze each phenotype separately with a standard univariate association test and then aggregate the results. This approach loses power because of the multiple-testing penalty1,2 and because it ignores the correlation structure among phenotypes3,4. Thus, multiple-phenotype association studies that use multiple phenotypes simultaneously have become popular.
Several methods to detect association using multiple phenotypes simultaneously have been introduced in recent years. For example, the O’Brien method (OB) was proposed to combine the test statistics obtained from the association test for each individual phenotype5. OB is the most powerful test when the genetic effects are homogeneous, but it loses power when genetic effects are heterogeneous, especially when they have opposite directions1,6. van der Sluis et al.7 proposed a trait-based association test using an extended Simes procedure (TATES) that conducts an association test for each phenotype and then combines the univariate p-values while correcting for the correlation between p-values. Canonical correlation analysis (CCA) finds the linear combination of phenotypes that explains the largest amount of correlation between a genetic variant and the phenotypes8. One could also use multivariate analysis of variance (MANOVA) in regression to study multiple phenotypes9. MANOVA is equivalent to CCA when CCA is applied to a single variant10. MultiPhen, proposed by O’Reilly et al.2, can be used to detect the association between one variant and multiple phenotypes by reversing response and predictors via a proportional odds regression model. When a small number of phenotypes are included, MultiPhen and MANOVA perform similarly6,11. MANOVA and CCA require the assumption of normality of the phenotypes, while MultiPhen does not have inflated type I error rates for non-normal phenotypes2. Some variable reduction methods have also been proposed to test for the association between a genetic variant and a linear combination of multiple phenotypes rather than the original phenotypes12,13,14. For example, the principal component of phenotypes (PCP), which maximizes the phenotypic variation, is the most popular dimension reduction method13.
Based on PCP, Klei et al.12 developed the principal component of heritability (PCH) by maximizing the heritability among all linear combinations of phenotypes. Recently, Turley et al.15 introduced the Multi-Trait Analysis of GWAS (MTAG) for joint analysis of multiple phenotypes. MTAG can be applied to GWAS summary statistics from an arbitrary number of phenotypes without access to individual-level data.
Although there are many proposed methods for joint analysis of multiple phenotypes, the performance of these methods under a wide range of scenarios is not consistent6: one test may be powerful in some situations, but not in the others. Thus, one challenge in multiple phenotype analysis is to construct a test that could maintain good performance across different scenarios. In this article, we develop a novel statistical method to test the association between a genetic variant and Multiple Phenotypes based on cross-validation Prediction Error (MultP-PE). Extensive simulation studies are conducted to evaluate the type I error rates and to compare the power performance of MultP-PE with various existing methods. Our simulation studies show that MultP-PE controls the type I error rates very well and has consistently higher power than other methods we compared in all simulation scenarios.
Method
We consider a sample with n unrelated individuals. Each individual has K (potentially correlated) phenotypes and has been genotyped at a variant of interest. Let yik denote the kth phenotype value of the ith individual and xi denote the genotype score of the ith individual, where xi∈{0, 1, 2} is the number of minor alleles that the ith individual carries. We model the relationship between the multiple phenotypes and the genetic variant using an inverse linear regression model, in which the genotype at the variant of interest is the response variable and the multiple phenotypes are predictors. That is,
\({x}_{i}={\beta }_{0}+{\beta }_{1}{y}_{i1}+\cdots +{\beta }_{K}{y}_{iK}+{\varepsilon }_{i},\quad i=1,2,\ldots ,n.\quad (1)\)
We are not the first to use an ordinal variable as the response variable in a linear model. To correct for population stratification, Price et al.16 used a qualitative phenotype or genotypes as response variables in linear models. To adjust for the effects of covariates in rare variant association studies, Sha et al.17 also used a qualitative phenotype or genotypes as response variables in linear models. To test the association between the K phenotypes and the variant, we test the null hypothesis H0:β1 = ··· = βK = 0 under model (1).
Let yi = (1,yi1, …, yiK)T and β = (β0, β1, …, βK)T, then the regression model in equation (1) can be written as \({x}_{i}={y}_{i}^{T}\beta +{\varepsilon }_{i},i=1,2,\ldots ,n\). The ordinary least squares estimate of β is \(\hat{\beta }={({Y}^{T}Y)}^{-1}{Y}^{T}x\), where Y = (y1, …, yn)T and x = (x1, …, xn)T. When multiple phenotypes are highly correlated, the rank of the matrix Y may be less than K + 1; the inverse of YTY then does not exist, and the ordinary least squares estimate of β is not unique18. Since multiple phenotypes in a GWAS are usually highly correlated, we propose to use Ridge regression19,20,21,22,23,24. Ridge regression penalizes the size of the regression coefficients. The Ridge regression estimator of β is defined as the value of β that minimizes
\(\sum _{i=1}^{n}{({x}_{i}-{y}_{i}^{T}\beta )}^{2}+\lambda {\beta }^{T}\beta ,\)
where λ (λ ≥ 0) is a tuning parameter. The solution to the Ridge regression is given by \({\hat{\beta }}_{\lambda }={({Y}^{T}Y+\lambda I)}^{-1}{Y}^{T}x\). Here the estimator of β depends on λ and we use the subscript λ to indicate that the estimator of β is a function of λ.
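As an illustration of why the Ridge solution stays well defined when phenotypes are collinear, the following is a minimal NumPy sketch, not the authors' R implementation. The function name `ridge_estimate`, the toy data, and the choice λ = 1 are assumptions made for this example only; the formula is the closed-form solution quoted above, in which the intercept is penalized together with the other coefficients.

```python
import numpy as np

def ridge_estimate(Y, x, lam):
    """Return (Y^T Y + lam * I)^{-1} Y^T x, the Ridge solution quoted in the text."""
    p = Y.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(Y.T @ Y + lam * np.eye(p), Y.T @ x)

rng = np.random.default_rng(0)
n, K = 200, 5
phen = rng.normal(size=(n, K))
# Append an exact copy of the first phenotype so that Y^T Y is singular.
phen = np.column_stack([phen, phen[:, 0]])
Y = np.column_stack([np.ones(n), phen])          # design matrix with intercept column
x = rng.binomial(2, 0.3, size=n).astype(float)   # toy genotype scores in {0, 1, 2}

rank_Y = np.linalg.matrix_rank(Y)                # smaller than the number of columns
beta_ridge = ridge_estimate(Y, x, lam=1.0)       # still uniquely defined for lam > 0
```

In this toy setting the least squares estimate is not unique because one phenotype is duplicated, whereas the Ridge estimate exists for any λ > 0 and assigns equal coefficients to the two identical columns.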
Based on Ridge regression, we propose to use the leave-one-out cross-validation (LOOCV) prediction error under model (1) as a test statistic. Let \({\hat{x}}_{-i}^{\lambda }\) denote the LOOCV predicted value (leaving the ith individual out) of xi under model (1) with parameter λ in Ridge regression. Then, the statistic can be written as \({T}_{\lambda }=\sum _{i=1}^{n}{({x}_{i}-{\hat{x}}_{-i}^{\lambda })}^{2}\). Note that Tλ is the LOOCV prediction error; thus low values of Tλ imply significance. Let pλ denote the p-value of Tλ (see the next paragraph on how to calculate pλ). We define the test statistic of Multiple Phenotypes based on Prediction Error (MultP-PE) as
\({T}_{MultP-PE}={{\rm{\min }}}_{\lambda }{p}_{\lambda }.\quad (2)\)
We propose a grid search to evaluate the minimization in equation (2). We choose grid points in the interval [0, ∞), \(0\le {\lambda }_{1} < \cdot \cdot \cdot < {\lambda }_{M-1} < {\lambda }_{M} < \infty \). Then, \({T}_{MultP-PE}={{\rm{\min }}}_{\lambda }{p}_{\lambda }={\min }_{1\le m\le M}{p}_{{\lambda }_{m}}\). We use a permutation procedure to evaluate the p-value of TMultP−PE. Intuitively, we need to use two layers of permutations to estimate \({p}_{{\lambda }_{m}}\) and the overall p-value for the test statistic TMultP−PE. For microarray data analysis, Ge et al.25 proposed that one layer of permutation can be used to estimate p-values. We use the permutation procedure of Ge et al. to estimate \({p}_{{\lambda }_{m}}\) and the overall p-value for the test statistic TMultP−PE. In each permutation, we randomly shuffle the genotypes at the variant. Suppose that we perform B permutations. Let \({T}_{{\lambda }_{m}}^{(b)}\) denote the value of \({T}_{{\lambda }_{m}}\) based on the bth permuted data for b = 0, 1, …, B and m = 1, …, M, and \({p}_{{\lambda }_{m}}^{(b)}\) denote the p-value of \({T}_{{\lambda }_{m}}^{(b)}\), where b = 0 represents the original data. Then, we can estimate \({p}_{{\lambda }_{m}}^{(b)}\) using \({p}_{{\lambda }_{m}}^{(b)}=\frac{\#\{d:{T}_{{\lambda }_{m}}^{(d)} < {T}_{{\lambda }_{m}}^{(b)}\,{\rm{for}}\,d=1,\ldots ,B\}}{B}\). Let \({T}_{MultP-PE}^{(b)}={\min }_{1\le m\le M}{p}_{{\lambda }_{m}}^{(b)}\) denote the value of TMultP−PE based on the bth permuted data; then the p-value of TMultP−PE is given by
\(\frac{\#\{b:{T}_{MultP-PE}^{(b)} < {T}_{MultP-PE}^{(0)}\,{\rm{for}}\,b=1,\ldots ,B\}}{B}.\quad (3)\)
To apply MultP-PE to GWAS with hundreds of thousands of SNPs, we also propose an algorithm that can perform the permutation procedure described above more efficiently in the following section.
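The one-layer permutation procedure described in this section can be sketched as follows. This is a hedged NumPy illustration rather than the authors' code: the function `multp_pe_pvalue`, the helper `design`, the toy data, and the three-point λ grid are hypothetical, and the final p-value is assumed to use the same strict-inequality counting as the per-λ formula. The LOOCV errors are computed with the closed-form identity given in the fast-algorithm section.

```python
import numpy as np

def multp_pe_pvalue(Y, x, lambdas, B=1000, seed=0):
    """One-layer permutation p-value for the minimum-p statistic over a lambda grid."""
    rng = np.random.default_rng(seed)
    lambdas = np.asarray(lambdas, dtype=float)

    # Quantities that depend only on the phenotypes and the grid.
    U, d, _ = np.linalg.svd(Y, full_matrices=False)
    C = d[:, None] ** 2 / (d[:, None] ** 2 + lambdas)   # shrinkage factors, one column per lambda
    H = (U ** 2) @ C                                    # leverages, one column per lambda

    def loocv_errors(g):
        fitted = U @ (C * (U.T @ g)[:, None])
        return (((g[:, None] - fitted) / (1.0 - H)) ** 2).sum(axis=0)

    # Row 0: observed genotypes; rows 1..B: shuffled genotypes.
    T = np.empty((B + 1, lambdas.size))
    T[0] = loocv_errors(x)
    for b in range(1, B + 1):
        T[b] = loocv_errors(rng.permutation(x))

    # p[b, m] = #{d in 1..B : T[d, m] < T[b, m]} / B; small T is evidence of association.
    p = np.empty_like(T)
    for m in range(lambdas.size):
        ref = np.sort(T[1:, m])
        p[:, m] = np.searchsorted(ref, T[:, m], side="left") / B

    t_min = p.min(axis=1)                         # minimum over lambda for b = 0..B
    return float(np.mean(t_min[1:] < t_min[0]))   # assumed strict "<", as in the per-lambda formula

def design(phen):
    """Prepend the intercept column to an n x K phenotype matrix."""
    return np.column_stack([np.ones(len(phen)), phen])

rng = np.random.default_rng(1)
n, K = 200, 5
x = rng.binomial(2, 0.3, size=n).astype(float)
phen_null = rng.normal(size=(n, K))          # phenotypes unrelated to the genotype
phen_alt = phen_null + 0.8 * x[:, None]      # every phenotype shifted by the genotype
grid = [1.0, 10.0, 50.0]                     # illustrative grid, not the paper's

p_null = multp_pe_pvalue(design(phen_null), x, grid, B=200)
p_alt = multp_pe_pvalue(design(phen_alt), x, grid, B=200)
```

The same B permuted data sets serve both to estimate each per-λ p-value and to build the null distribution of the minimum, which is what removes the need for a second, nested layer of permutations.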
A Fast Algorithm for the Permutation Procedure
We use the notation of the above section and let \({A}_{\lambda }={({Y}^{T}Y+\lambda I)}^{-1}\), \({h}_{i}^{\lambda }={y}_{i}^{T}{A}_{\lambda }{y}_{i}\), \(\,{h}_{\lambda }=({h}_{1}^{\lambda },\ldots ,{h}_{n}^{\lambda })\), and \({\hat{\beta }}_{\lambda }={A}_{\lambda }{Y}^{T}x\). Then, the Ridge predicted value of xi is \({\hat{x}}_{i}^{\lambda }={y}_{i}^{T}{\hat{\beta }}_{\lambda }\) and \({\hat{x}}_{\lambda }={({\hat{x}}_{1}^{\lambda },\ldots ,{\hat{x}}_{n}^{\lambda })}^{T}=Y{({Y}^{T}Y+\lambda I)}^{-1}{Y}^{T}x\). We can show that the LOOCV prediction error in Ridge regression has a closed-form formula24,26, that is, \({x}_{i}-{\hat{x}}_{-i}^{\lambda }=({x}_{i}-{\hat{x}}_{i}^{\lambda })/(1-{h}_{i}^{\lambda })\). Note that for two matrices or vectors A and B, we use A*B and \(\frac{A}{B}\) to denote the element-wise operations; for a matrix C, we use colSum(C) to denote the sums of the columns of matrix C. We assume n ≥ K + 1. We perform a singular value decomposition of Y, that is, Y = UDV, where U is an n × (K + 1) matrix with orthonormal columns, D is a (K + 1) × (K + 1) diagonal matrix with non-negative real numbers on the diagonal, and V is a (K + 1) × (K + 1) orthogonal matrix. Let D = diag(d1, …, dK + 1). Then, \({\hat{x}}_{\lambda }=U{C}_{\lambda }{U}^{T}x\), where Cλ = diag(cλ,1, …, cλ,K + 1) and \({c}_{\lambda ,j}={d}_{j}^{2}/({d}_{j}^{2}+\lambda )\) for j = 1, …, K + 1. Let \({c}_{\lambda }={({c}_{\lambda ,1},\ldots ,{c}_{\lambda ,K+1})}^{T}\) and let x(K) = UTx be a (K + 1)-dimensional vector. Then, \({\hat{x}}_{\lambda }=U{C}_{\lambda }{x}^{(K)}=U({c}_{\lambda }\ast {x}^{(K)})\) and hλ = diag(UCλUT). For \(0\le {\lambda }_{1} < \ldots < {\lambda }_{M} < \infty \), let \(C=({c}_{{\lambda }_{1}},\ldots ,{c}_{{\lambda }_{M}})\) and \(H=({h}_{{\lambda }_{1}},\ldots ,{h}_{{\lambda }_{M}})\). Then, \((\,{\hat{x}}_{{\lambda }_{1}},\ldots ,\,{\hat{x}}_{{\lambda }_{M}})=U(C\ast {x}^{(K)})=U({c}_{{\lambda }_{1}}\ast {x}^{(K)},\ldots ,{c}_{{\lambda }_{M}}\ast {x}^{(K)})\).
If we denote \(Q=\frac{(x-{\hat{x}}_{{\lambda }_{1}}\,,\ldots ,\,x-{\hat{x}}_{{\lambda }_{M}})}{1-H}\), then \(({T}_{{\lambda }_{1}},\ldots ,{T}_{{\lambda }_{M}})=colSum(Q\ast Q)\). Note that C, U, and H depend only on the phenotypes and λ1, …, λM. Thus, C, U, and H do not change across permutations. For a GWAS, C, U, and H also do not change across SNPs. For 1,000 permutations on one SNP, our fast algorithm is about 20 times faster than the original algorithm (the original algorithm calculates Tλ by \({T}_{\lambda }=\sum _{i=1}^{n}{({x}_{i}-{\hat{x}}_{-i}^{\lambda })}^{2}\)). To perform a GWAS with hundreds of thousands of SNPs, we can use the approach suggested in Zhu et al.14, that is, we can first select SNPs that show evidence of association based on a small number of permutations (e.g. 1,000), and then use a large number of permutations to test the selected SNPs. For example, in our real data analysis with 630,860 SNPs, we first performed 1,000 permutations and selected SNPs with p-value ≤ 0.005; we then performed \({10}^{8}\) permutations on the selected SNPs only, because SNPs with p-value > 0.005 are not significantly associated with the phenotypes.
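The matrix identities of this section can be checked numerically. The sketch below is an illustration in NumPy under stated assumptions, not the authors' R code: the function names, the toy data, and the λ grid are hypothetical. It computes all Tλ from a single thin SVD of Y and compares them with a naive version that refits the Ridge model n times per λ.

```python
import numpy as np

def loocv_errors_fast(Y, x, lambdas):
    """T_lambda for every lambda in the grid, using one thin SVD of Y."""
    lambdas = np.asarray(lambdas, dtype=float)
    U, d, _ = np.linalg.svd(Y, full_matrices=False)      # U: n x (K+1), d: K+1 singular values
    C = d[:, None] ** 2 / (d[:, None] ** 2 + lambdas)    # column m holds c_{lambda_m}
    H = (U ** 2) @ C                                     # column m holds diag(U C_lambda U^T)
    x_K = U.T @ x                                        # x^(K) = U^T x
    fitted = U @ (C * x_K[:, None])                      # Ridge fitted values, one column per lambda
    Q = (x[:, None] - fitted) / (1.0 - H)                # LOOCV residuals
    return (Q * Q).sum(axis=0)                           # colSum(Q * Q)

def loocv_errors_naive(Y, x, lambdas):
    """Same quantity, obtained by leaving each individual out and refitting."""
    n, p = Y.shape
    out = []
    for lam in lambdas:
        total = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            Yi, xi = Y[keep], x[keep]
            beta = np.linalg.solve(Yi.T @ Yi + lam * np.eye(p), Yi.T @ xi)
            total += (x[i] - Y[i] @ beta) ** 2
        out.append(total)
    return np.array(out)

rng = np.random.default_rng(2)
n, K = 60, 4
Y = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
x = rng.binomial(2, 0.3, size=n).astype(float)
grid = [0.5, 5.0, 50.0]                                  # illustrative grid, not the paper's

T_fast = loocv_errors_fast(Y, x, grid)
T_naive = loocv_errors_naive(Y, x, grid)
```

Because U, C, and H depend only on the phenotypes and the grid, a real implementation would compute them once and reuse them for every permutation and every SNP; only `U.T @ x` and the element-wise products need to be recomputed.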
Although we use a permutation procedure to calculate the p-value of MultP-PE, with our fast algorithm a typical GWAS can be completed in less than one day. In our real data analysis on COPD in the following section, we performed a GWAS with 5,430 individuals across 630,860 SNPs and seven phenotypes. We completed the analysis in 10 hours on a single node with an Intel Xeon E5-2680v3 processor.
In the above section, we describe MultP-PE without considering covariates. If covariates need to be considered, we can incorporate them in MultP-PE using the following approach. Suppose that there are G covariates in total and let (zi1, …, ziG)T denote the covariates for the ith individual. We can adjust each of the phenotypes for the covariates by applying the linear regression model \({y}_{ik}={a}_{0k}+{a}_{1k}{z}_{i1}+\ldots +{a}_{Gk}{z}_{iG}+{\varepsilon }_{ik}\), for i = 1, 2, …, n, k = 1, 2, …, K, and use the residual of yik to replace yik in MultP-PE. In our real data analysis, we used this approach to incorporate four covariates. This approach has been used in the literature. For example, Sha et al.17 and Zhu et al.14 also used the same approach to adjust phenotypes for the covariates.
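The covariate adjustment described above amounts to replacing each phenotype with its least squares residual. A minimal NumPy sketch follows, assuming toy data; the function name `residualize` and the simulated covariates and coefficients are hypothetical and only mirror the dimensions of the real data analysis (four covariates, seven phenotypes).

```python
import numpy as np

def residualize(values, covariates):
    """Residuals from regressing each column of `values` on an intercept plus the covariates."""
    Z = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(Z, values, rcond=None)
    return values - Z @ coef

rng = np.random.default_rng(4)
n, G, K = 300, 4, 7
covs = rng.normal(size=(n, G))                    # toy covariates z_i1, ..., z_iG
effects = rng.normal(size=(G, K))                 # toy covariate effects on each phenotype
phen = covs @ effects + rng.normal(size=(n, K))   # phenotypes that depend on the covariates

phen_adj = residualize(phen, covs)                # use these in place of the raw phenotypes
```

By construction, the adjusted phenotypes have mean zero and are uncorrelated with every covariate in the sample.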
In association studies of unrelated individuals, it is well known that population stratification can seriously confound association results27. Several methods have been developed to control for population stratification, for example, the Genomic Control (GC) approach28,29, the Principal Component (PC) approach16,30,31,32, and the Mixed Linear Model (MLM) approach33,34. Like most association tests for unrelated individuals, MultP-PE is subject to bias due to population stratification. To make MultP-PE robust to population stratification, we can use the PC approach. Let ci1, …, ciL denote the top L PCs of the genotypes at a set of genomic markers for the ith individual. We can use the residuals of the regression model \({x}_{i}=\alpha +{\beta }_{1}{c}_{i1}+\cdot \cdot \cdot +{\beta }_{L}{c}_{iL}+{\varepsilon }_{i}\) to replace xi and use the residuals of the regression model \({y}_{ik}={\alpha }_{k}+{\beta }_{1k}{c}_{i1}+\cdot \cdot \cdot +{\beta }_{Lk}{c}_{iL}+{\varepsilon }_{ik}\) to replace yik for k = 1, 2, …, K in MultP-PE to adjust for population stratification.
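One possible way to carry out the PC adjustment is sketched below in NumPy. The function names, the two-subpopulation toy data, and the choice L = 2 are assumptions for illustration; the paper does not specify how the PCs are computed. The sketch uses the fact that the left singular vectors of the column-centered marker matrix are orthonormal and orthogonal to the constant vector, so the residuals of a regression on an intercept plus the PCs reduce to centering followed by a projection.

```python
import numpy as np

def top_pcs(G, L):
    """Unit-norm top-L principal component directions of an n x S marker genotype matrix."""
    Gc = G - G.mean(axis=0)                        # center each marker
    U, s, _ = np.linalg.svd(Gc, full_matrices=False)
    # PC scores are U * s; rescaling columns does not change regression residuals.
    return U[:, :L]

def adjust_for_pcs(v, pcs):
    """Residuals of regressing v (vector or matrix) on an intercept plus the PCs."""
    v = np.asarray(v, dtype=float)
    centered = v - v.mean(axis=0)
    return centered - pcs @ (pcs.T @ centered)

rng = np.random.default_rng(3)
n, S, L = 200, 50, 2
pop = np.repeat([0, 1], n // 2)                    # two toy subpopulations
marker_freq = np.where(pop[:, None] == 0, 0.2, 0.5) * np.ones((n, S))
G = rng.binomial(2, marker_freq).astype(float)     # marker genotypes used to build PCs
x = rng.binomial(2, np.where(pop == 0, 0.2, 0.5)).astype(float)   # tested variant
phen = rng.normal(size=(n, 3)) + pop[:, None]      # phenotypes shifted by subpopulation

pcs = top_pcs(G, L)
x_adj = adjust_for_pcs(x, pcs)                     # replaces x_i in MultP-PE
phen_adj = adjust_for_pcs(phen, pcs)               # replaces y_ik in MultP-PE
```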
Comparison of Methods
We evaluate the performance of the proposed test MultP-PE by comparing it with five most commonly used methods for association studies using multiple phenotypes. These five methods include the O’Brien’s method (OB)5, Trait-based Association Test that uses Extended Simes procedure (TATES)7, Optimal weight method (OW)6, Multivariate analysis of variance (MANOVA)9, and Joint model of multiple phenotypes (MultiPhen)2.
Simulation Study
In simulation studies, we evaluate the type I error rates of MultP-PE by generating data sets with three different sample sizes: 500, 1,000, and 2,000. For power comparison, we compare the powers of different methods using simulated data sets with 1,000 unrelated individuals.
For genotype data, we generate the genotype at a genetic variant according to the minor allele frequency (MAF), assuming Hardy-Weinberg Equilibrium (HWE). For each individual, we generate K phenotypes using models similar to those used in Zhu et al.14 and Wang et al.35. The K phenotypes are generated from the following model
\(y=\varphi x+c\gamma \omega +\sqrt{1-{c}^{2}}\,\varepsilon ,\quad (4)\)
where y = (y1, …, yK)T; ϕ = (ϕ1, …, ϕK)T are the genetic effects of the variant on the K phenotypes; x is the genotypic score at the variant; c is a constant; γ is a K × R matrix; ω = (ω1, …, ωR)T is a vector of R factors with \(\omega ={({\omega }_{1},\ldots ,{\omega }_{R})}^{T}\sim MVN(0,\Sigma )\), \(\Sigma =\rho A+(1-\rho )I\), ρ is the correlation between factors, A is the R × R matrix with all elements equal to 1, and I is the identity matrix; ε = (ε1, …, εK)T is a vector of residuals, ε1, …, εK are independent, and \({\varepsilon }_{k}\sim N(0,1)\) for k = 1, …, K. Based on equation (4), we consider the following four models, in which the within-factor correlation is \({c}^{2}\) and the between-factor correlation is \(\rho {c}^{2}\).
Model 1
There is only one factor, and the genotype affects all phenotypes with different effect sizes. That is, R = 1, ϕ = β(1, 2, …, K)T, and γ = (1, …, 1)T.
Model 2
There are two factors, and the genotype affects one of them. That is, R = 2, \(\varphi ={(0,\ldots ,0,\mathop{\underbrace{\beta ,\ldots ,\beta }}\limits_{K/2})}^{T}\), and γ = Bdiag(D1, D2), where \({D}_{i}={(\mathop{\underbrace{1,\ldots ,1}}\limits_{K/2})}^{T}\) for i = 1, 2 and Bdiag means block diagonal.
Model 3
There are five factors, and the genotype affects two of them. That is, R = 5, \(\varphi ={({\beta }_{11},\ldots ,{\beta }_{1k},{\beta }_{21},\ldots ,{\beta }_{2k},{\beta }_{31},\ldots ,{\beta }_{3k},{\beta }_{41},\ldots ,{\beta }_{4k},{\beta }_{51},\ldots ,{\beta }_{5k})}^{T}\), and γ = Bdiag(D1, D2, D3, D4, D5), where \({D}_{i}={(\mathop{\underbrace{1,\ldots ,1}}\limits_{K/5})}^{T}\) for i = 1, …, 5; k = K/5; \({\beta }_{11}=\cdot \cdot \cdot ={\beta }_{1k}={\beta }_{21}=\cdot \cdot \cdot ={\beta }_{2k}={\beta }_{31}=\cdot \cdot \cdot ={\beta }_{3k}=0\); β41 = ··· = β4k = −β; and \(({\beta }_{51},\ldots ,{\beta }_{5k})=\frac{2\beta }{k+1}(1,\ldots ,k)\).
Model 4
There are five factors, and the genotype affects four of them. That is, R = 5, \(\varphi ={({\beta }_{11},\ldots ,{\beta }_{1k},{\beta }_{21},\ldots ,{\beta }_{2k},{\beta }_{31},\ldots ,{\beta }_{3k},{\beta }_{41},\ldots ,{\beta }_{4k},{\beta }_{51},\ldots ,{\beta }_{5k})}^{T}\), and γ = Bdiag(D1, D2, D3, D4, D5), where \({D}_{i}={(\mathop{\underbrace{1,\ldots ,1}}\limits_{K/5})}^{T}\) for i = 1, …, 5; k = K/5; β11 = ··· = β1k = 0; β21 = ··· = β2k = β; β31 = ··· = β3k = −β; \(({\beta }_{41},\ldots ,{\beta }_{4k})=-\,\frac{2\beta }{k+1}(1,\ldots ,k)\); and \(({\beta }_{51},\ldots ,{\beta }_{5k})=\frac{2\beta }{k+1}(1,\ldots ,k)\).
For the type I error rates, we set β = 0, so that the genetic variant has no effect on any phenotype. For power comparisons, we consider different values of β. To evaluate type I error rates and power, we set MAF = 0.3, the between-factor correlation to 0.14, and the within-factor correlation to 0.25. In the following simulation studies and real data analysis, we use eight different values of λ (M = 8) and set \(\mathrm{log}\,\lambda =0,\,1,\,2,\,3,\,3.5,\,3.8,\,4,\,4.5\).
The R code for implementing MultP-PE and for simulating data under the four models is available at Dr. Shuanglin Zhang’s homepage http://www.math.mtu.edu/shuzhang/software.html.
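The data-generating scheme can be sketched in NumPy as follows. This is an illustration, not the authors' R code: it assumes the phenotype model has the form y = ϕx + cγω + √(1 − c²)ε, which is the form consistent with a within-factor correlation of c² and a between-factor correlation of ρc². The function `simulate`, the effect size β = 0.05, and the use of Model 2 as the example are choices made here for illustration. With the stated settings, c² = 0.25 and ρc² = 0.14, so ρ = 0.56.

```python
import numpy as np

def simulate(n, phi, gamma, c2, rho, maf, rng):
    """Draw genotypes under HWE and K phenotypes from the assumed factor model."""
    K, R = gamma.shape
    # Under HWE the minor-allele count is Binomial(2, MAF).
    x = rng.binomial(2, maf, size=n).astype(float)
    Sigma = rho * np.ones((R, R)) + (1.0 - rho) * np.eye(R)
    omega = rng.multivariate_normal(np.zeros(R), Sigma, size=n)   # n x R factors
    eps = rng.normal(size=(n, K))                                 # independent N(0, 1) residuals
    y = np.outer(x, phi) + np.sqrt(c2) * omega @ gamma.T + np.sqrt(1.0 - c2) * eps
    return x, y

# Model 2: two factors, the genotype affects the phenotypes of the second factor only.
K, R = 20, 2
beta = 0.05                                             # hypothetical effect size
gamma = np.kron(np.eye(R), np.ones((K // R, 1)))        # K x R block-diagonal loading matrix
phi = np.concatenate([np.zeros(K // 2), np.full(K // 2, beta)])

c2 = 0.25                                               # within-factor correlation
rho = 0.14 / c2                                         # so that rho * c2 = 0.14
x, y = simulate(1000, phi, gamma, c2, rho, maf=0.3, rng=np.random.default_rng(5))
```

Setting `phi` to a zero vector gives data under the null hypothesis, as used for the type I error evaluation.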
Results
To evaluate the type I error rates of MultP-PE, we consider different significance levels (0.01 and 0.05), different sample sizes (500, 1,000, and 2,000), and different numbers of phenotypes (10, 20, and 40). We use 1,000 permutations to calculate the p-values of MultP-PE and use 10,000 replicated samples to evaluate the type I error rates of MultP-PE. For 10,000 replicated samples, the 95% confidence intervals (CIs) for the estimated type I error rates with nominal levels 0.05 and 0.01 are (0.04562, 0.05438) and (0.00804, 0.01196), respectively. We summarize the estimated type I error rates of the proposed test in Table 1. This table shows that only one type I error rate is outside the corresponding 95% CI (it is very close to the upper bound of the CI), which indicates that the proposed method is valid.
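The text does not state how these intervals were computed. As a hedged check, the quoted bounds agree to five decimal places with α ± 1.96·√(α/n) for n = 10,000; this is an inference from the numbers, not a formula given by the authors. The more familiar binomial normal approximation, α ± 1.96·√(α(1 − α)/n), gives slightly narrower intervals. Both are computed in the short Python sketch below; the function names are hypothetical.

```python
import math

def interval_quoted_style(alpha, n_rep, z=1.96):
    """alpha +/- z * sqrt(alpha / n): matches the bounds quoted in the text (inferred)."""
    half = z * math.sqrt(alpha / n_rep)
    return alpha - half, alpha + half

def interval_binomial(alpha, n_rep, z=1.96):
    """alpha +/- z * sqrt(alpha * (1 - alpha) / n): usual binomial normal approximation."""
    half = z * math.sqrt(alpha * (1.0 - alpha) / n_rep)
    return alpha - half, alpha + half

n_rep = 10_000
quoted_style = {a: interval_quoted_style(a, n_rep) for a in (0.05, 0.01)}
binomial = {a: interval_binomial(a, n_rep) for a in (0.05, 0.01)}
```

Either interval can be used to judge whether an estimated type I error rate in Table 1 is compatible with its nominal level.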
In power comparisons, we calculate the p-values of MultP-PE using 1,000 permutations and the p-values of MultiPhen, OW, TATES, MANOVA, and OB using their asymptotic distributions. We evaluate the powers of all six tests using 1,000 replicated samples at a significance level of 0.05. Figures 1 and 2 show the powers of the six methods as a function of the effect size β with K = 20 and 40, respectively. As shown in these two figures: (1) MultP-PE is the most powerful test, and its power is much higher than that of the second most powerful test; (2) as the effect size β increases, the powers of all tests increase as well, and as the number of phenotypes K increases from 20 to 40, the advantage of MultP-PE over the other five tests becomes more pronounced; (3) MultiPhen, OW, and MANOVA have similar powers under all four models, a conclusion also reached in published papers2,6,7; (4) OB is comparable to MultiPhen, OW, and MANOVA in models 1 and 2, but has almost no power when the genetic effects have different directions (models 3 and 4); (5) TATES is more powerful than MultiPhen, OW, and MANOVA in model 2, but is less powerful than MultiPhen, OW, and MANOVA in models 3 and 4.
Power comparisons of the six methods as a function of the within-factor correlation, \({c}^{2}\), with K = 20 and 40 are given in Figs 3 and 4, respectively. As shown in these two figures: (1) the patterns of the power performance are similar to those in Figs 1 and 2; (2) as the within-factor correlation increases, the powers of all six tests either increase or decrease, depending on the model setting, a pattern also reported by Zhu et al.6; (3) OB is the least powerful test except under model 2 with the within-factor correlation > 0.2.
Power comparisons of the six methods as a function of the between-factor correlation, \(\rho {c}^{2}\), with K = 20 and 40 are given in Figs S1 and S2, respectively. As shown in these two figures: (1) the patterns of the power performance are similar to those in Figs 1 and 2; (2) as the between-factor correlation increases, the powers of all six tests increase, except under model 1; (3) MultP-PE is the most powerful test, while OB is the least powerful test except under model 2 with the between-factor correlation = 0.1.
In summary, MultP-PE is consistently the most powerful test among the tests we compared under all simulation scenarios.
Real Data Analysis
Chronic obstructive pulmonary disease (COPD) is a term for progressive, life-threatening lung diseases that cause breathlessness and serious illness, including emphysema, chronic bronchitis, refractory asthma, and some forms of bronchiectasis. The World Health Organization36 reported a global prevalence of 251 million cases of COPD in 2016 and estimated that COPD caused 3.17 million deaths in 2015. The COPDGene study aims to find inherited or genetic factors that are associated with COPD. The COPDGene dataset includes 10,192 participants, of whom 3,408 are African-Americans (AA) and 6,784 are Non-Hispanic Whites (NHW). Following Liang et al.37, we considered Age, Sex, BMI, and Pack-Years as four covariates and selected seven quantitative COPD-related phenotypes (FEV1, Emphysema, Emphysema Distribution, Gas Trapping, Airway Wall Area, Exacerbation frequency, and Six-minute walk distance) in the following data analysis.
We deleted individuals and genotypes with missing data. After excluding missing data, a set of 5,430 NHW across 630,860 SNPs was used in the analysis. Then we adjusted the phenotypes for the covariates by applying a linear regression14,17. We regressed each phenotype on the four covariates, replaced original phenotypes with the residuals of the regression, and applied each of the six tests to detect the association between the covariates-adjusted phenotypes (residuals) and each SNP.
We used the genome-wide significance level \(5\times {10}^{-8}\) to identify SNPs that are significantly associated with the seven COPD-related phenotypes. In total, 14 SNPs were identified by at least one method (Table 2). All of the 14 SNPs had been reported to be associated with COPD by previous studies38,39,40,41,42,43,44,45,46,47,48,49,50. As shown in Table 2, MultiPhen identified 14 SNPs; OW, MANOVA, and MultP-PE identified 13 SNPs; TATES identified 9 SNPs; and OB did not identify any SNPs. The number of SNPs identified by MultP-PE was comparable to the largest number of SNPs identified by the other tests, and the COPD analysis results were consistent with our simulation results. We also performed individual phenotype analysis on each of the seven phenotypes. Table S1 gives the adjusted p-values (Bonferroni correction for multiple testing) for testing each of the seven phenotypes at the 14 significant SNPs. Table S1 shows that, among the 14 SNPs, only nine are significantly associated with Emphysema Distribution at the genome-wide significance level. The number of SNPs identified by individual phenotype analysis is the same as that identified by TATES and is smaller than the number identified by the four multiple-phenotype analyses (OW, MANOVA, MultiPhen, and MultP-PE), which shows that the simultaneous analysis of multiple phenotypes can increase power compared with single phenotype analysis.
Discussion
For complex diseases in GWAS, the association between a genetic variant and each phenotype is usually weak. Analyzing multiple disease-related phenotypes could increase statistical power to identify the association between genetic variants and complex diseases. In this article, we developed a novel statistical method, MultP-PE, to test the association between a genetic variant and multiple phenotypes based on cross-validation prediction error. We showed that MultP-PE controls type I error rates very well and has consistently higher power than other methods we compared among all the simulation scenarios. Overall, MultP-PE is the most powerful test and has much higher power than the second most powerful test; OW, MANOVA, and MultiPhen have very similar performance; OB loses power dramatically when genetic effects have opposite directions on phenotypes; TATES is more powerful when the genetic effect only works on a portion of phenotypes. In real data analysis, MultP-PE identified 13 out of 14 significant SNPs, which is comparable to MultiPhen (14 out of 14).
References
Yang, Q., Wu, H., Guo, C. Y. & Fox, C. S. Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genet Epidemiol 34(5), 444–454 (2010).
O’Reilly, P. F. et al. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS One 7(5), e34861 (2012).
Wang, Y. et al. Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genet Epidemiol 39(4), 259–275 (2015).
Yang, J. J., Li, J., Williams, L. K. & Buu, A. An efficient genome-wide association test for multivariate phenotypes based on the Fisher combination function. BMC Bioinform 17(1), 1 (2016).
O’Brien, P. C. Procedures for comparing samples with multiple endpoints. Biometrics 40, 1079–1087 (1984).
Zhu, H., Zhang, S. & Sha, Q. Power Comparisons of Methods for Joint Association Analysis of Multiple Phenotypes. Hum Hered 80(3), 144–52 (2016).
van der Sluis, S., Posthuma, D. & Dolan, C. V. TATES: Efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet 9(1), e1003235 (2013).
Ferreira, M. A. & Purcell, S. M. A multivariate test of association. Bioinformatics 25(1), 132–133 (2009).
Cole, D. A., Maxwell, S. E., Avrey, R. & Salas, E. How the power of MANOVA can both increase and decrease as a function of the intercorrelations among the dependent variables. Psychol Bull 115(3), 465 (1994).
Galesloot, T. E., van Steen, K., Kiemeney, L. A. L. M., Janss, L. L. & Vermeulen, S. H. A comparison of multivariate genome-wide association methods. PLoS One 9, e95923 (2014).
Aschard, H. et al. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. Am J Hum Genet 94(5), 662–676 (2014).
Klei, L., Luca, D., Devlin, B. & Roeder, K. Pleiotropy and principal components of heritability combine to increase power for association analysis. Genet Epidemiol 32(1), 9–19 (2008).
Wang, K. & Abbott, D. A principal components regression approach to multilocus genetic association studies. Genet Epidemiol 32, 108–118 (2008).
Zhu, H., Zhang, S. & Sha, Q. A novel method to test associations between a weighted combination of phenotypes and genetic variants. PLoS ONE 13(1), e0190788, https://doi.org/10.1371/journal.pone.0190788 (2018).
Turley, P. et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet 50, 229–37 (2018).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–9 (2006).
Sha, Q., Wang, X., Wang, X. & Zhang, S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet Epidemiol 36, 561–571 (2012).
Draper, N. R. & Smith, H. Applied Regression Analysis, (John Wiley & Sons, 2014).
Cule, E. & De Iorio, M. Ridge regression in prediction problems: automatic choice of the ridge parameter. Genet Epidemiol 37(7), 704–14, https://doi.org/10.1002/gepi.21750 (2013).
Cule, E., Vineis, P. & De Iorio, M. Significance testing in ridge regression for genetic data. BMC Bioinformatics 12, 372, https://doi.org/10.1186/1471-2105-12-372 (2011).
Halawa, A. & El Bassiouni, M. Tests of regression coefficients under ridge regression models. J Stat Comput and Simul 65(1–4), 341–56 (2000).
Hoerl, A. E., Kennard, R. W. & Baldwin, K. F. Ridge regression: some simulations. Commun Stat Theory Methods 4(2), 105–23 (1975).
Malo, N., Libiger, O. & Schork, N. J. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet 82(2), 375–85, https://doi.org/10.1016/j.ajhg.2007.10.012 (2008).
Yang, X., Wang, S., Zhang, S. & Sha, Q. Detecting association of rare and common variants based on cross-validation prediction error. Genet Epidemiol 41(3), 233–243 (2017).
Ge, Y., Dudoit, S. & Speed, T. P. Resampling-based multiple testing for microarray data analysis. Test 12(1), 1–77 (2003).
James, G., Witten, D., Hastie, T. & Tibshirani, R. An introduction to statistical learning, (Springer, 2013).
Lander, E. S. & Schork, N. J. Genetic dissection of complex traits. Science 265, 2037–48 (1994).
Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
Reich, D. E. & Goldstein, D. B. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol 20, 4–16 (2001).
Chen, H. S., Zhu, X., Zhao, H. & Zhang, S. Qualitative semi-parametric test for genetic associations in case-control designs under structured populations. Ann Hum Genet 67, 250–64 (2003).
Zhang, S., Zhu, X. & Zhao, H. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol 24, 44–56 (2003).
Zhu, X., Zhang, S., Zhao, H. & Cooper, R. S. Association mapping, using a mixture model for complex traits. Genet Epidemiol 23, 181–96 (2002).
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355–60 (2010).
Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348–54 (2010).
Wang, Z., Sha, Q. & Zhang, S. Joint Analysis of Multiple Traits Using “Optimal” Maximum Heritability Test. PLoS One 11(3), e0150975 (2016).
Chronic obstructive pulmonary disease (COPD). WHO. Retrieved from http://www.who.int/mediacentre/factsheets/fs315/en/ (Nov. 2017).
Liang, X. et al. An Adaptive Fisher’s Combination Method for Joint Analysis of Multiple Phenotypes in Association Studies. Sci Rep 6, 34323, https://doi.org/10.1038/srep34323 (2016).
Brehm, J. M. et al. Identification of FGF7 as a novel susceptibility locus for chronic obstructive pulmonary disease. Thorax 66(12), 1085–1090 (2011).
Cui, K., Ge, X. & Ma, H. Four SNPs in the CHRNA3/5 alpha-neuronal nicotinic acetylcholine receptor subunit locus are associated with COPD risk based on meta-analyses. PLoS One 9(7), e102324 (2014).
Du, Y., Xue, Y. & Xiao, W. Association of IREB2 gene rs2568494 polymorphism with risk of chronic obstructive pulmonary disease: a meta-analysis. Med Sci Monit 22, 177 (2016).
Cho, M. H. et al. Variants in FAM13A are associated with chronic obstructive pulmonary disease. Nat Genet 42(3), 200–202 (2010).
Hancock, D. B. et al. Meta-analyses of genome-wide association studies identify multiple loci associated with pulmonary function. Nat Genet 42(1), 45–52 (2010).
Lutz, S. M. et al. A genome-wide association study identifies risk loci for spirometric measures among smokers of European and African ancestry. BMC Genet 16(1), 1 (2015).
Li, X. et al. Importance of hedgehog interacting protein and other lung function genes in asthma. J Allergy Clin Immunol 127(6), 1457–1465 (2011).
Pillai, S. G. et al. A genome-wide association study in chronic obstructive pulmonary disease (COPD): identification of two major susceptibility loci. PLoS Genet 5(3), e1000421 (2009).
Wilk, J. B. et al. A genome-wide association study of pulmonary function measures in the Framingham Heart Study. PLoS Genet 5(3), e1000429 (2009).
Wilk, J. B. et al. Genome-wide association studies identify CHRNA5/3 and HTR4 in the development of airflow obstruction. Am J Respir Crit Care Med 186(7), 622–632 (2012).
Young, R. P. et al. Chromosome 4q31 locus in COPD is also associated with lung cancer. Eur Respir J 36(6), 1375–1382 (2010).
Zhang, J., Summah, H., Zhu, Y. G. & Qu, J. M. Nicotinic acetylcholine receptor variants associated with susceptibility to chronic obstructive pulmonary disease: a meta-analysis. Respir Res 12(1), 1 (2011).
Zhu, A. Z. et al. Association of CHRNA5-A3-B4 SNP rs2036527 with smoking cessation therapy response in African-American smokers. Clin Pharmacol Ther 96(2), 256–265 (2014).
Acknowledgements
Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R15HG008209. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This research used data generated by the COPDGene study, which was supported by National Institutes of Health (NIH) grants U01HL089856 and U01HL089897. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer Ingelheim, Novartis, and Sunovion. Superior, a high-performance computing infrastructure at Michigan Technological University, was used in obtaining results presented in this publication.
Author information
Contributions
S.Z. and Q.S. designed the research; X.Y. and S.Z. performed the statistical analysis; and X.Y., S.Z. and Q.S. wrote the manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, X., Zhang, S. & Sha, Q. Joint Analysis of Multiple Phenotypes in Association Studies based on Cross-Validation Prediction Error. Sci Rep 9, 1073 (2019). https://doi.org/10.1038/s41598-018-37538-y