Joint Analysis of Multiple Phenotypes in Association Studies based on Cross-Validation Prediction Error

Yang, Xinlan; Zhang, Shuanglin; Sha, Qiuying

doi:10.1038/s41598-018-37538-y

Article
Open access
Published: 31 January 2019

Volume 9, article number 1073, (2019)

Scientific Reports

Xinlan Yang¹,
Shuanglin Zhang¹ &
Qiuying Sha¹

Abstract

In genome-wide association studies (GWAS), joint analysis of multiple phenotypes could have increased statistical power over analyzing each phenotype individually to identify genetic variants that are associated with complex diseases. With this motivation, several statistical methods that jointly analyze multiple phenotypes have been developed, such as O’Brien’s method, Trait-based Association Test that uses Extended Simes procedure (TATES), multivariate analysis of variance (MANOVA), and joint model of multiple phenotypes (MultiPhen). However, the performance of these methods under a wide range of scenarios is not consistent: one test may be powerful in some situations, but not in the others. Thus, one challenge in joint analysis of multiple phenotypes is to construct a test that could maintain good performance across different scenarios. In this article, we develop a novel statistical method to test associations between a genetic variant and Multiple Phenotypes based on cross-validation Prediction Error (MultP-PE). Extensive simulations are conducted to evaluate the type I error rates and to compare the power performance of MultP-PE with various existing methods. The simulation studies show that MultP-PE controls type I error rates very well and has consistently higher power than the tests we compared in all simulation scenarios. We conclude with the recommendation for the use of MultP-PE for its good performance in association studies with multiple phenotypes.

Introduction

Traditionally, genome-wide association studies (GWAS) have been performed on individual phenotypes. In spite of the success of GWAS in identifying thousands of associations between genetic variants and complex diseases, the identified variants contribute only a small proportion of the phenotypic variation. In the study of a complex disease, several correlated phenotypes are usually measured for a disorder or its risk factors¹; therefore, by jointly analyzing multiple correlated phenotypes, we may increase statistical power to detect causal variants with weak genetic effects on complex diseases.

One way to use multiple phenotypes in association studies is to analyze each phenotype separately with a standard univariate association test and then aggregate the results. This approach loses power because of the multiple-testing penalty^1,2 and because it ignores the correlation structure among phenotypes^3,4. Thus, multiple-phenotype association analysis that uses all phenotypes simultaneously has become popular.

Several methods to detect association using multiple phenotypes simultaneously have been introduced in recent years. For example, O’Brien’s method (OB) was proposed to combine the test statistics obtained from the association test for each individual phenotype⁵. OB is the most powerful test when the genetic effects are homogeneous but loses power when genetic effects are heterogeneous, especially when they have opposite directions^1,6. van der Sluis et al.⁷ proposed a trait-based association test using an extended Simes procedure (TATES) that conducts an association test for each phenotype and then combines the univariate p-values while correcting for the correlation between p-values. Canonical correlation analysis (CCA) finds the linear combination of phenotypes that explains the largest amount of correlation between a genetic variant and the phenotypes⁸. One could also use multivariate analysis of variance (MANOVA) in regression to study multiple phenotypes⁹. MANOVA is equivalent to CCA when canonical correlation analysis is applied to a single variant¹⁰. MultiPhen, proposed by O’Reilly et al.², can be used to detect the association between one variant and multiple phenotypes by reversing response and predictors via a proportional odds regression model. When a small number of phenotypes are included, MultiPhen and MANOVA lead to similar performance^6,11. MANOVA and CCA require the assumption of normality of multiple phenotypes, while MultiPhen has no inflated type I error rates on non-normal phenotypes². Some other variable reduction methods have also been proposed to test for the association between a genetic variant and a linear combination of multiple phenotypes rather than the original phenotypes^12,13,14. For example, principal component of phenotypes (PCP), which maximizes the phenotype variation, is the most popular dimension reduction method¹³. Based on PCP, Klei et al.¹² developed principal component of heritability (PCH) by maximizing the heritability among all linear combinations of phenotypes. Recently, Turley et al.¹⁵ introduced the Multi-Trait Analysis of GWAS (MTAG) for joint analysis of multiple phenotypes. MTAG can be applied to GWAS summary statistics from an arbitrary number of phenotypes without access to individual-level data.

Although there are many proposed methods for joint analysis of multiple phenotypes, the performance of these methods under a wide range of scenarios is not consistent⁶: one test may be powerful in some situations, but not in the others. Thus, one challenge in multiple phenotype analysis is to construct a test that could maintain good performance across different scenarios. In this article, we develop a novel statistical method to test the association between a genetic variant and Multiple Phenotypes based on cross-validation Prediction Error (MultP-PE). Extensive simulation studies are conducted to evaluate the type I error rates and to compare the power performance of MultP-PE with various existing methods. Our simulation studies show that MultP-PE controls the type I error rates very well and has consistently higher power than other methods we compared in all simulation scenarios.

Method

We consider a sample with n unrelated individuals. Each individual has K (potentially correlated) phenotypes and has been genotyped at a variant of interest. Let y_ik denote the k^th phenotype value of the i^th individual and x_i denote the genotype score of the i^th individual, where x_i∈{0, 1, 2} is the number of minor alleles that the i^th individual carries. We model the relationship between the multiple phenotypes and the genetic variant using an inverse linear regression model, in which the genotype at the variant of interest is the response variable and the multiple phenotypes are predictors. That is,

$${x}_{i}={\beta }_{0}+{\beta }_{1}{y}_{i1}+\ldots +{\beta }_{K}{y}_{iK}+{\varepsilon }_{i}.$$

(1)

We are not the first to use an ordinal variable as the response variable in a linear model. To correct for population stratification, Price et al.¹⁶ used a qualitative phenotype or genotypes as response variables in linear models. To adjust for the effects of covariates in rare variant association studies, Sha et al.¹⁷ also used a qualitative phenotype or genotypes as response variables in linear models. To test the association between the K phenotypes and the variant, we test the null hypothesis H₀:β₁ = ··· = β_K = 0 under model (1).

Let y_i = (1, y_i1, …, y_iK)^T and β = (β₀, β₁, …, β_K)^T. Then the regression model in equation (1) can be written as ${x}_{i}={y}_{i}^{T}\beta +{\varepsilon }_{i},i=1,2,\ldots ,n$. The ordinary least squares estimate of β is $\hat{\beta }={({Y}^{T}Y)}^{-1}{Y}^{T}x$, where Y = (y₁, …, y_n)^T and x = (x₁, …, x_n)^T. When multiple phenotypes are highly correlated, the rank of the matrix Y may be less than K + 1, in which case the inverse of Y^TY does not exist and the ordinary least squares estimate of β is not unique¹⁸. Since multiple phenotypes in a GWAS are usually highly correlated, we propose to use Ridge regression^{19,20,21,22,23,24}. Ridge regression penalizes the size of the regression coefficients. The Ridge regression estimator of β is defined as the value of β that minimizes

$$\sum _{i}{({x}_{i}-{y}_{i}^{T}\beta )}^{2}+\lambda \sum _{j}{\beta }_{j}^{2},$$

where λ (λ ≥ 0) is a tuning parameter. The solution to the Ridge regression is given by ${\hat{\beta }}_{\lambda }={({Y}^{T}Y+\lambda I)}^{-1}{Y}^{T}x$. Here the estimator of β depends on λ and we use the subscript λ to indicate that the estimator of β is a function of λ.
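The closed-form Ridge solution can be illustrated with a minimal Python/NumPy sketch. This is not the authors' R implementation: the function name `ridge_estimate`, the toy data, and the two λ values are assumptions made for the example. Following the formula in the text literally, the penalty λI is applied to every coefficient, including the intercept.

```python
import numpy as np

def ridge_estimate(Y, x, lam):
    """Return (Y^T Y + lam * I)^{-1} Y^T x, where Y already contains a column of ones."""
    p = Y.shape[1]
    # Solve the penalized normal equations instead of forming the inverse explicitly.
    return np.linalg.solve(Y.T @ Y + lam * np.eye(p), Y.T @ x)

# Toy data (hypothetical): n individuals, K phenotypes, two of them nearly collinear.
rng = np.random.default_rng(0)
n, K = 200, 4
phen = rng.normal(size=(n, K))
phen[:, 3] = phen[:, 2] + 1e-3 * rng.normal(size=n)
Y = np.column_stack([np.ones(n), phen])          # n x (K + 1) design matrix
x = rng.binomial(2, 0.3, size=n).astype(float)   # genotype scores in {0, 1, 2}

beta_small = ridge_estimate(Y, x, lam=1e-8)      # almost unpenalized fit
beta_ridge = ridge_estimate(Y, x, lam=10.0)      # shrunken coefficients
print(np.linalg.norm(beta_small), np.linalg.norm(beta_ridge))
```

With highly correlated phenotypes the nearly unpenalized coefficients tend to be large and unstable, while a larger λ shrinks them; the norm of the estimate never increases as λ grows.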

Based on Ridge regression, we propose to use the leave-one-out cross validation (LOOCV) prediction error under model (1) as a test statistic. Let ${\hat{x}}_{-i}^{\lambda }$ denote the LOOCV predicted value (leave the i^th individual out) of x_i under model (1) with parameter λ in Ridge regression. Then, the statistic can be written as ${T}_{\lambda }=\sum _{i=1}^{n}{({x}_{i}-{\hat{x}}_{-i}^{\lambda })}^{2}$. Note that T_λ is the LOOCV prediction error, thus low values of T_λ would imply significance. Let p_λ denote the p-value of T_λ (see next paragraph on how to calculate p_λ). We define the test statistic of Multiple Phenotypes based on Prediction Error (MultP-PE) as
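To make the definition of T_λ concrete, the sketch below computes it in the most direct way: each individual is left out in turn, the Ridge model is refitted on the remaining n − 1 individuals, and the squared prediction errors are summed. The function name `loocv_statistic_naive` and the toy data are assumptions for illustration only; this is a slow reference computation, not the authors' code.

```python
import numpy as np

def loocv_statistic_naive(Y, x, lam):
    """T_lambda = sum_i (x_i - xhat_{-i})^2, refitting Ridge with individual i left out."""
    n, p = Y.shape
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        Yi, xi = Y[keep], x[keep]
        beta = np.linalg.solve(Yi.T @ Yi + lam * np.eye(p), Yi.T @ xi)
        total += (x[i] - Y[i] @ beta) ** 2
    return total

# Toy data (hypothetical): one phenotype depends strongly on the genotype.
rng = np.random.default_rng(1)
n, K = 150, 3
x = rng.binomial(2, 0.3, size=n).astype(float)
phen = rng.normal(size=(n, K))
phen[:, 0] = x + 0.5 * rng.normal(size=n)
Y = np.column_stack([np.ones(n), phen])

T_assoc = loocv_statistic_naive(Y, x, lam=1.0)
T_null = loocv_statistic_naive(Y, rng.permutation(x), lam=1.0)  # association broken
print(T_assoc, T_null)
```

In this toy setting the statistic on the original data is smaller than on the shuffled genotypes, which is the sense in which low values of T_λ indicate association.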

$${T}_{MultP-PE}={\min }_{\lambda }{p}_{\lambda }.$$

(2)

We propose to use a grid search in equation (2) to evaluate the minimization. We choose a grid of values in the interval [0, ∞), $0\le {\lambda }_{1} < \cdots < {\lambda }_{M-1} < {\lambda }_{M} < \infty $. Then, ${T}_{MultP-PE}={\min }_{\lambda }{p}_{\lambda }={\min }_{1\le m\le M}{p}_{{\lambda }_{m}}$. We use a permutation procedure to evaluate the p-value of T_MultP−PE. Intuitively, two layers of permutations would be needed, one to estimate ${p}_{{\lambda }_{m}}$ and one to estimate the overall p-value of the test statistic T_MultP−PE. For microarray data analysis, however, Ge et al.²⁵ proposed that one layer of permutation can be used to estimate p-values. We use the permutation procedure of Ge et al. to estimate both ${p}_{{\lambda }_{m}}$ and the overall p-value of T_MultP−PE. In each permutation, we randomly shuffle the genotypes at the variant. Suppose that we perform B permutations. Let ${T}_{{\lambda }_{m}}^{(b)}$ denote the value of ${T}_{{\lambda }_{m}}$ based on the b^th permuted data for b = 0, 1, …, B and m = 1, …, M, and let ${p}_{{\lambda }_{m}}^{(b)}$ denote the p-value of ${T}_{{\lambda }_{m}}^{(b)}$, where b = 0 represents the original data. Then, we can estimate ${p}_{{\lambda }_{m}}^{(b)}$ using ${p}_{{\lambda }_{m}}^{(b)}=\frac{\#\{d:{T}_{{\lambda }_{m}}^{(d)} < {T}_{{\lambda }_{m}}^{(b)}\,{\rm{for}}\,d=1,\ldots ,B\}}{B}$. Let ${T}_{MultP-PE}^{(b)}={\min }_{1\le m\le M}{p}_{{\lambda }_{m}}^{(b)}$ denote the value of T_MultP−PE based on the b^th permuted data; then the p-value of T_MultP−PE is given by

$$\frac{\#\{{T}_{MultP-PE\,}^{(b)}:{T}_{MultP-PE}^{(b)} < {T}_{MultP-PE}^{(0)}\,{\rm{for}}\,b=1,2,\ldots ,B\}}{B}$$

(3)
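The one-layer permutation scheme can be sketched as follows, under the assumption that the statistics have already been collected in a (B + 1) × M array whose row 0 holds the original data and rows 1, …, B hold the permuted data. The function name `multp_pe_pvalue` and the tiny input array are hypothetical; the strict inequalities mirror the formulas in the text.

```python
import numpy as np

def multp_pe_pvalue(T):
    """Overall p-value of min_m p_{lambda_m}, reusing one set of B permutations."""
    T = np.asarray(T, dtype=float)
    B = T.shape[0] - 1
    p = np.empty_like(T)
    for m in range(T.shape[1]):
        perm_sorted = np.sort(T[1:, m])
        # Number of permuted statistics strictly smaller than each T^{(b)}_{lambda_m}.
        p[:, m] = np.searchsorted(perm_sorted, T[:, m], side="left") / B
    t_min = p.min(axis=1)   # T_MultP-PE^{(b)} for b = 0, 1, ..., B
    # Proportion of permuted minima strictly below the observed minimum, as in (3).
    return np.count_nonzero(t_min[1:] < t_min[0]) / B

# Tiny illustration with B = 4 permutations and M = 2 values of lambda.
T_small = np.array([[1.0, 5.0],    # original data: smallest prediction errors
                    [2.0, 6.0],
                    [3.0, 7.0],
                    [4.0, 8.0],
                    [5.0, 9.0]])
print(multp_pe_pvalue(T_small))
```

Because the same B permuted data sets supply both the per-λ p-values and the null distribution of their minimum, no nested permutation loop is needed.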

To apply MultP-PE to GWAS with hundreds of thousands of SNPs, we also propose an algorithm that can perform the permutation procedure described above more efficiently in the following section.

A Fast Algorithm for the Permutation Procedure

We use the notation of the above section and let A_λ = (Y^TY + λI)⁻¹, ${h}_{i}^{\lambda }={y}_{i}^{T}{A}_{\lambda }{y}_{i}$, ${h}_{\lambda }=({h}_{1}^{\lambda },\ldots ,{h}_{n}^{\lambda })$, and ${\hat{\beta }}_{\lambda }={A}_{\lambda }{Y}^{T}x$. Then, the Ridge predicted value of x_i is ${\hat{x}}_{i}^{\lambda }={y}_{i}^{T}{\hat{\beta }}_{\lambda }$ and ${\hat{x}}_{\lambda }={({\hat{x}}_{1}^{\lambda },\ldots ,{\hat{x}}_{n}^{\lambda })}^{T}=Y{({Y}^{T}Y+\lambda I)}^{-1}{Y}^{T}x$. We can show that the LOOCV prediction error in Ridge regression has a closed-form formula^24,26, that is, ${x}_{i}-{\hat{x}}_{-i}^{\lambda }=({x}_{i}-{\hat{x}}_{i}^{\lambda })/(1-{h}_{i}^{\lambda })$. Note that for two matrices or vectors A and B, we use A*B and $\frac{A}{B}$ to denote the element-wise operations; for a matrix C, we use colSum(C) to denote the sums of the columns of matrix C. We assume n ≥ K + 1. We perform singular value decomposition of Y, that is, Y = UDV, where U is an n × (K + 1) matrix with orthonormal columns, D is a (K + 1) × (K + 1) diagonal matrix with non-negative real numbers on the diagonal, and V is a (K + 1) × (K + 1) orthogonal matrix. Let $D={\rm{diag}}({d}_{1},\ldots ,{d}_{K+1})$. Then, ${\hat{x}}_{\lambda }=U{C}_{\lambda }{U}^{T}x$, where ${C}_{\lambda }={\rm{diag}}({c}_{\lambda ,1},\ldots ,{c}_{\lambda ,K+1})$ and ${c}_{\lambda ,j}={d}_{j}^{2}/({d}_{j}^{2}+\lambda )$ for j = 1, …, K + 1. Let ${c}_{\lambda }={({c}_{\lambda ,1},\ldots ,{c}_{\lambda ,K+1})}^{T}$ and x^(K) = U^Tx be a K + 1 dimensional vector. Then, ${\hat{x}}_{\lambda }=U{C}_{\lambda }{x}^{(K)}=U({c}_{\lambda }\ast {x}^{(K)})$ and h_λ = diag(UC_λU^T). For $0\le {\lambda }_{1} < \ldots < {\lambda }_{M} < \infty $, let $C=({c}_{{\lambda }_{1}},\ldots ,{c}_{{\lambda }_{M}})$ and $H=({h}_{{\lambda }_{1}},\ldots ,{h}_{{\lambda }_{M}})$. Then, $({\hat{x}}_{{\lambda }_{1}},\ldots ,{\hat{x}}_{{\lambda }_{M}})=U(C\ast {x}^{(K)})=U({c}_{{\lambda }_{1}}\ast {x}^{(K)},\ldots ,{c}_{{\lambda }_{M}}\ast {x}^{(K)})$.
If we denote $Q=\frac{(x-{\hat{x}}_{{\lambda }_{1}}\,,\ldots ,\,x-{\hat{x}}_{{\lambda }_{M}})}{1-H}$, then $({T}_{{\lambda }_{1}},\ldots ,{T}_{{\lambda }_{M}})=colSum(Q\ast Q)$. Note that C, U, and H only depend on phenotypes and λ₁, …, λ_M. Thus, C, U, and H do not change in each permutation. For a GWAS, C, U, and H also do not change at different SNPs. For 1,000 permutations on one SNP, our fast algorithm is about 20 times faster than the original algorithm (the original algorithm calculates T_λ by ${T}_{\lambda }=\sum _{i=1}^{n}{({x}_{i}-{\hat{x}}_{-i}^{\lambda })}^{2}$). To perform a GWAS with hundreds of thousands of SNPs, we can use the same approach as was suggested in Zhu et al.¹⁴, that is, we can first select SNPs that show evidence of association based on a small number of permutations (e.g. 1,000), then use a large number of permutations to test the selected SNPs. For example, in our real data analysis with 630,860 SNPs, we first performed 1,000 permutations and selected SNPs with p-value ≤ 0.005, then we performed 10⁸ permutations on the selected SNPs because SNPs with p-value > 0.005 are not significantly associated with phenotypes.
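The fast algorithm might be sketched in Python/NumPy as below; the authors' own implementation is in R and is not reproduced here. The names `precompute` and `loocv_statistics`, the toy data, and the choice of 200 permutations are assumptions for illustration. The λ grid uses the values of log λ given later in the article, and reading that logarithm as the natural logarithm is also an assumption.

```python
import numpy as np

def precompute(Y, lambdas):
    """Quantities that depend only on the phenotypes and the lambda grid."""
    U, d, _ = np.linalg.svd(Y, full_matrices=False)   # U: n x (K+1), d: singular values
    lambdas = np.asarray(lambdas, dtype=float)
    C = d[:, None] ** 2 / (d[:, None] ** 2 + lambdas[None, :])   # (K+1) x M, entries c_{lambda,j}
    H = (U ** 2) @ C                                  # n x M, column m is diag(U C_m U^T)
    return U, C, H

def loocv_statistics(x, U, C, H):
    """Return (T_{lambda_1}, ..., T_{lambda_M}) for one genotype vector x."""
    xK = U.T @ x                                      # x^(K) = U^T x
    fitted = U @ (C * xK[:, None])                    # n x M matrix of Ridge fitted values
    Q = (x[:, None] - fitted) / (1.0 - H)             # closed-form leave-one-out residuals
    return (Q * Q).sum(axis=0)                        # colSum(Q * Q)

# Toy data (hypothetical) and the lambda grid (natural log assumed).
rng = np.random.default_rng(2)
n, K = 120, 5
Y = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
x = rng.binomial(2, 0.3, size=n).astype(float)
lambdas = np.exp([0, 1, 2, 3, 3.5, 3.8, 4, 4.5])

U, C, H = precompute(Y, lambdas)                      # done once, reused for every permutation
B = 200
T = np.empty((B + 1, len(lambdas)))
T[0] = loocv_statistics(x, U, C, H)                   # row 0: original data
for b in range(1, B + 1):
    T[b] = loocv_statistics(rng.permutation(x), U, C, H)
print(T[0])
```

Each permutation then costs only a few matrix products with U, because the decomposition of Y and the matrices C and H are computed a single time.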

Although we use a permutation procedure to calculate the p-value of MultP-PE, our fast algorithm allows a typical GWAS to be completed in less than one day. In our real data analysis on COPD in the following section, we performed a GWAS with 5,430 individuals across 630,860 SNPs and seven phenotypes. We completed the analysis in 10 hours on a single node with an Intel Xeon E5-2680v3.

In the above section, we describe MultP-PE without considering covariates. If covariates need to be considered, we can incorporate them in MultP-PE using the following approach. Suppose that there are a total of G covariates we would like to consider and let (z_i1, …, z_iG)^T denote the covariates for the i^th individual. We can adjust each of the phenotypes for the covariates by applying the linear regression model ${y}_{ik}={a}_{0k}+{a}_{1k}{z}_{i1}+\ldots +{a}_{Gk}{z}_{iG}+{\varepsilon }_{ik}$, for i = 1, 2, …, n, k = 1, 2, …, K, and use the residual of y_ik to replace y_ik in MultP-PE. In our real data analysis, we used this approach to incorporate four covariates. This approach has been used in the literature. For example, Sha et al.¹⁷ and Zhu et al.¹⁴ also used the same approach to adjust phenotypes for the covariates.
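A minimal sketch of this covariate adjustment, assuming ordinary least squares with an intercept, is given below. The helper name `residualize` and the simulated covariates and phenotypes are hypothetical; the dimensions (four covariates, seven phenotypes) merely echo the real data analysis.

```python
import numpy as np

def residualize(values, covariates):
    """Residuals from regressing each column of `values` on an intercept plus `covariates`."""
    Z = np.column_stack([np.ones(covariates.shape[0]), covariates])
    coef, *_ = np.linalg.lstsq(Z, values, rcond=None)   # least squares fit for every column
    return values - Z @ coef

# Toy data (hypothetical): G = 4 covariates, K = 7 phenotypes.
rng = np.random.default_rng(3)
n, G, K = 300, 4, 7
covariates = rng.normal(size=(n, G))
effects = rng.normal(size=(G, K))
phenotypes = covariates @ effects + rng.normal(size=(n, K))

adjusted = residualize(phenotypes, covariates)   # these residuals replace y_ik in MultP-PE
print(adjusted.shape)
```

The same helper could be applied to the principal-component adjustment described in the next paragraph, by passing the top PCs in place of the covariates and residualizing both the genotype vector and the phenotypes.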

In association studies for unrelated individuals, it is well known that population stratification can seriously confound association results²⁷. Several methods have been developed to control for population stratification, for example, the Genomic Control (GC) approach^28,29, the Principal Component (PC) approach^16,30,31,32, and the Mixed Linear Model (MLM) approach^33,34. Similar to most association tests for unrelated individuals, MultP-PE is subject to bias due to population stratification. To make MultP-PE robust to population stratification, we can use the PC approach. Let c_i1, …, c_iL denote the top L PCs of the genotypes at a set of genomic markers for the i^th individual. We can use the residuals of the regression model ${x}_{i}=\alpha +{\beta }_{1}{c}_{i1}+\cdots +{\beta }_{L}{c}_{iL}+{\varepsilon }_{i}$ to replace x_i and use the residuals of the regression model ${y}_{ik}={\alpha }_{k}+{\beta }_{1k}{c}_{i1}+\cdots +{\beta }_{Lk}{c}_{iL}+{\varepsilon }_{ik}$ to replace y_ik for k = 1, 2, …, K in MultP-PE to adjust for population stratification.

Comparison of Methods

We evaluate the performance of the proposed test MultP-PE by comparing it with five most commonly used methods for association studies using multiple phenotypes. These five methods include the O’Brien’s method (OB)⁵, Trait-based Association Test that uses Extended Simes procedure (TATES)⁷, Optimal weight method (OW)⁶, Multivariate analysis of variance (MANOVA)⁹, and Joint model of multiple phenotypes (MultiPhen)².

Simulation Study

In simulation studies, we evaluate type I error rates of MultP-PE by generating data sets with three different sample sizes: 500, 1,000, and 2,000. For power comparison, we compare the powers of different methods using simulated data sets with 1,000 unrelated individuals.

For genotype data, we generate the genotype at a genetic variant according to the minor allele frequency (MAF), assuming Hardy-Weinberg Equilibrium (HWE). For each individual, we generate K phenotypes using models similar to those used in Zhu et al.¹⁴ and Wang et al.³⁵. The K phenotypes are generated from the following model

$$y=\varphi x+c\gamma \omega +\sqrt{1-{c}^{2}}\times \varepsilon $$

(4)

where y = (y₁, …, y_K)^T; ϕ = (ϕ₁, …, ϕ_K) are the genetic effects of the variant on the K phenotypes; x is the genotypic score at the variant; c is a constant number; γ is a K × R matrix; ω = (ω₁, …, ω_R)^T is a vector of factors with R elements and $\omega ={({\omega }_{1},\ldots ,{\omega }_{R})}^{T}\sim MVN(0,\Sigma )$, $\Sigma =\rho A+(1-\rho )I$, ρ is the correlation between factors, A is a matrix with elements of 1, and I is the identity matrix; ε = (ε₁, …, ε_K)^T is a vector of residuals, ε₁, …, ε_K are independent, and ${\varepsilon }_{k}\sim N(0,1)$ for k = 1, …, K. Based on equation (4), we consider the following four models in which the within-factor correlation is c² and the between-factor correlation is ρc².

Model 1

There is only one factor and genotypes impact on all phenotypes with different effect sizes. That is, R = 1, ϕ = β(1, 2, …, K)^T, and γ = (1, …, 1)^T.

Model 2

There are two factors and genotypes impact on one factor. That is, R = 2, $\varphi ={(0,\ldots ,0,\mathop{\underbrace{\beta ,\ldots ,\beta }}\limits_{K/2})}^{T}$, and γ = Bdiag(D₁, D₂), where ${D}_{i}={(\mathop{\underbrace{1,\ldots ,1}}\limits_{K/2})}^{T}$ for i = 1, 2 and Bdiag means block diagonal.

Model 3

There are five factors and genotypes impact on two factors. That is, R = 5, $\varphi ={({\beta }_{11},\ldots ,{\beta }_{1k},{\beta }_{21},\ldots ,{\beta }_{2k},{\beta }_{31},\ldots ,{\beta }_{3k},{\beta }_{41},\ldots ,{\beta }_{4k},{\beta }_{51},\ldots ,{\beta }_{5k})}^{T}$, and γ = Bdiag(D₁, D₂, D₃, D₄, D₅), where ${D}_{i}={(\mathop{\underbrace{1,\ldots ,1}}\limits_{K/5})}^{T}$ for i = 1, …, 5; k = K/5; ${\beta }_{11}=\cdot \cdot \cdot ={\beta }_{1k}={\beta }_{21}=\cdot \cdot \cdot ={\beta }_{2k}={\beta }_{31}=\cdot \cdot \cdot ={\beta }_{3k}=0$; β₄₁ = ··· = β_4k = −β; and $({\beta }_{51},\ldots ,{\beta }_{5k})=\frac{2\beta }{k+1}(1,\ldots ,k)$.

Model 4

There are five factors and genotypes impact on four factors. That is, R = 5, $\varphi ={({\beta }_{11},\ldots ,{\beta }_{1k},{\beta }_{21},\ldots ,{\beta }_{2k},{\beta }_{31},\ldots ,{\beta }_{3k},{\beta }_{41},\ldots ,{\beta }_{4k},{\beta }_{51},\ldots ,{\beta }_{5k})}^{T}$, and γ = Bdiag(D₁, D₂, D₃, D₄, D₅), where ${D}_{i}={(\mathop{\underbrace{1,\ldots ,1}}\limits_{K/5})}^{T}$ for i = 1, …, 5; k = K/5; β₁₁ = ··· = β_1k = 0; β₂₁ = ··· = β_2k = β; β₃₁ = ··· = β_3k = −β; $({\beta }_{41},\ldots ,{\beta }_{4k})=-\,\frac{2\beta }{k+1}(1,\ldots ,k)$; and $({\beta }_{51},\ldots ,{\beta }_{5k})=\frac{2\beta }{k+1}(1,\ldots ,k)$.

For the type I error rates, we set β = 0 to indicate that the genetic variant has no effect on any phenotype. For power comparisons, we consider different values of β. To evaluate type I error rate and power, we set MAF = 0.3, the between-factor correlation to 0.14, and the within-factor correlation to 0.25. In the following simulation studies and real data analysis, we use eight different values of λ (M = 8) and set $\mathrm{log}\,\lambda =0,\,1,\,2,\,3,\,3.5,\,3.8,\,4,\,4.5$.
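Data generation under equation (4) can be sketched as follows, using model 2 with the settings just stated: c² = 0.25 and ρc² = 0.14, hence c = 0.5 and ρ = 0.56. The function name `simulate`, the seed, and the large sample size used to check the correlations are assumptions for illustration; the authors' R code is not reproduced here.

```python
import numpy as np

def simulate(n, phi, gamma, c, rho, maf, rng):
    """Draw genotype scores under HWE and K phenotypes from y = phi*x + c*gamma*omega + sqrt(1-c^2)*eps."""
    K, R = gamma.shape
    x = rng.binomial(2, maf, size=n).astype(float)            # number of minor alleles
    Sigma = rho * np.ones((R, R)) + (1.0 - rho) * np.eye(R)    # Sigma = rho*A + (1 - rho)*I
    omega = rng.multivariate_normal(np.zeros(R), Sigma, size=n)  # n x R factor scores
    eps = rng.normal(size=(n, K))                              # independent N(0, 1) residuals
    y = np.outer(x, phi) + c * omega @ gamma.T + np.sqrt(1.0 - c ** 2) * eps
    return x, y

# Model 2: two factors, the variant affects only the phenotypes of the second factor.
K, beta = 20, 0.0                                   # beta = 0 corresponds to the null hypothesis
phi = np.concatenate([np.zeros(K // 2), np.full(K // 2, beta)])
gamma = np.kron(np.eye(2), np.ones((K // 2, 1)))    # K x 2 block-diagonal loading matrix
c, rho = 0.5, 0.56

rng = np.random.default_rng(4)
n = 20000                                           # large n only to check the correlations
x, y = simulate(n, phi, gamma, c, rho, maf=0.3, rng=rng)

corr = np.corrcoef(y, rowvar=False)
half = K // 2
within = (corr[:half, :half].sum() - half) / (half * (half - 1))   # mean off-diagonal, factor 1
between = corr[:half, half:].mean()
print(round(within, 3), round(between, 3))
```

Under the null, the empirical within-factor and between-factor correlations should be close to 0.25 and 0.14, which is a quick sanity check that c and ρ were chosen consistently with the text.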

The R code for implementing MultP-PE and for simulating data under the four models is available at Dr. Shuanglin Zhang’s homepage http://www.math.mtu.edu/shuzhang/software.html.

Results

To evaluate the type I error rates of MultP-PE, we consider different significance levels (0.01 and 0.05), different sample sizes (500, 1,000, and 2,000), and different numbers of phenotypes (10, 20, and 40). We use 1,000 permutations to calculate the p-values of MultP-PE and use 10,000 replicated samples to evaluate type I error rates of MultP-PE. For 10,000 replicated samples, the 95% confidence intervals (CIs) for the estimated type I error rates with nominal levels 0.05 and 0.01 are (0.04562, 0.05438) and (0.00804, 0.01196), respectively. We summarize the estimated type I error rates of the proposed test in Table 1. The table shows that only one type I error rate falls outside the corresponding 95% CI (and it is very close to the upper bound of the CI), which indicates that the proposed method is valid.
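For orientation, the usual normal-approximation interval for a proportion estimated from 10,000 replicates can be computed as below. The function name `type1_ci` and the use of z = 1.96 are assumptions; the article does not state how its intervals were constructed, and this approximation reproduces the quoted bounds only to within about 0.0001.

```python
import math

def type1_ci(alpha, n_rep, z=1.96):
    """Normal-approximation 95% interval for an estimated type I error rate."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_rep)
    return alpha - half_width, alpha + half_width

lo_05, hi_05 = type1_ci(0.05, 10_000)
lo_01, hi_01 = type1_ci(0.01, 10_000)
print(round(lo_05, 5), round(hi_05, 5))
print(round(lo_01, 5), round(hi_01, 5))
```

An estimated type I error rate inside such an interval is compatible with the nominal level given the Monte Carlo error of 10,000 replicates.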

Table 1 Estimated type I error rates for the MultP-PE method under four models.

In power comparisons, we calculate the p-values of MultP-PE using 1,000 permutations and the p-values of MultiPhen, OW, TATES, MANOVA, and OB using their asymptotic distributions. We evaluate the powers of all six tests using 1,000 replicated samples at a significance level of 0.05. Figures 1 and 2 show the powers of the six methods as a function of the effect size β with K = 20 and 40, respectively. As shown in these two figures: (1) MultP-PE is the most powerful test, and its power is much higher than that of the second most powerful test; (2) as the effect size β increases, the powers of all tests increase as well, and as the number of phenotypes K increases from 20 to 40, the advantage of MultP-PE over the other five tests becomes larger; (3) MultiPhen, OW, and MANOVA have similar powers under all four models, a conclusion also reached in published papers^2,6,7; (4) OB is comparable to MultiPhen, OW, and MANOVA in models 1 and 2, but has almost no power when the genetic effects have different directions (models 3 and 4); (5) TATES is more powerful than MultiPhen, OW, and MANOVA in model 2, but is less powerful than MultiPhen, OW, and MANOVA in models 3 and 4.

Power comparisons of the six methods as a function of the within-factor correlation, c², with K = 20 and 40 are given in Figs 3 and 4, respectively. As shown in these two figures: (1) the patterns of the power performance are similar to those in Figs 1 and 2; (2) as the within-factor correlation increases, the powers of all six tests show either an increasing or a decreasing trend, depending on the model setting, a pattern also reported by Zhu et al.⁶; (3) OB is the least powerful test except under model 2 with the within-factor correlation > 0.2.

Power comparisons of the six methods as a function of the between-factor correlation, c²ρ, with K = 20 and 40 are given in Figs S1 and S2, respectively. As shown in these two figures: (1) the patterns of the power performance are similar to those in Figs 1 and 2; (2) as the between-factor correlation increases, the powers of all six tests show an increasing trend except under model 1; (3) MultP-PE is the most powerful test, while OB is the least powerful test except under model 2 with the between-factor correlation = 0.1.

In summary, MultP-PE is consistently the most powerful test among the tests we compared under all simulation scenarios.

Real Data Analysis

Chronic obstructive pulmonary disease (COPD) is a term for progressive, life-threatening lung diseases that cause breathlessness and serious illness, including emphysema, chronic bronchitis, refractory asthma, and some forms of bronchiectasis. A global prevalence of 251 million cases of COPD was reported in 2016, and it is estimated that COPD caused 3.17 million deaths in 2015³⁶. The COPDGene study aims to find inherited or genetic factors that are associated with COPD. The COPDGene dataset includes 10,192 participants, of whom 3,408 are African-Americans (AA) and 6,784 are Non-Hispanic Whites (NHW). As in Liang et al.³⁷, we considered Age, Sex, BMI, and Pack-Years as four covariates and selected seven quantitative COPD-related phenotypes (FEV1, Emphysema, Emphysema Distribution, Gas Trapping, Airway Wall Area, Exacerbation frequency, and Six-minute walk distance) in the following data analysis.

We deleted individuals and genotypes with missing data. After excluding missing data, a set of 5,430 NHW across 630,860 SNPs was used in the analysis. Then we adjusted the phenotypes for the covariates by applying a linear regression^14,17. We regressed each phenotype on the four covariates, replaced original phenotypes with the residuals of the regression, and applied each of the six tests to detect the association between the covariates-adjusted phenotypes (residuals) and each SNP.

We used the genome-wide significance level 5 × 10⁻⁸ to identify SNPs that are significantly associated with the seven COPD-related phenotypes. A total of 14 SNPs were identified by at least one method (Table 2). All of the 14 SNPs had been reported to be associated with COPD by previous studies^{38,39,40,41,42,43,44,45,46,47,48,49,50}. As shown in Table 2, MultiPhen identified 14 SNPs; OW, MANOVA, and MultP-PE identified 13 SNPs; TATES identified 9 SNPs; and OB did not identify any SNPs. The number of SNPs identified by MultP-PE was comparable to the largest number of SNPs identified by other tests, and the COPD analysis results were consistent with our simulation results. We also performed individual phenotype analysis on each of the seven phenotypes. Table S1 gives the adjusted p-values (Bonferroni correction for multiple testing) for testing each of the seven phenotypes on the 14 significant SNPs. Table S1 shows that, among the 14 SNPs, only nine are significantly associated with Emphysema Distribution at the genome-wide significance level. The number of SNPs identified by individual phenotype analysis is the same as that identified by TATES and is smaller than the number identified by the four multiple-phenotype analyses (OW, MANOVA, MultiPhen, and MultP-PE), which shows that the simultaneous analysis of multiple phenotypes can increase power compared with single phenotype analysis.

Table 2 Significant SNPs and the corresponding p-values in the analysis of COPDGene.

Discussion

For complex diseases in GWAS, the association between a genetic variant and each phenotype is usually weak. Analyzing multiple disease-related phenotypes could increase statistical power to identify the association between genetic variants and complex diseases. In this article, we developed a novel statistical method, MultP-PE, to test the association between a genetic variant and multiple phenotypes based on cross-validation prediction error. We showed that MultP-PE controls type I error rates very well and has consistently higher power than other methods we compared among all the simulation scenarios. Overall, MultP-PE is the most powerful test and has much higher power than the second most powerful test; OW, MANOVA, and MultiPhen have very similar performance; OB loses power dramatically when genetic effects have opposite directions on phenotypes; TATES is more powerful when the genetic effect only works on a portion of phenotypes. In real data analysis, MultP-PE identified 13 out of 14 significant SNPs, which is comparable to MultiPhen (14 out of 14).

References

Yang, Q., Wu, H., Guo, C. Y. & Fox, C. S. Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genet Epidemiol 34(5), 444–454 (2010).
O’Reilly, P. F. et al. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS One 7(5), e34861 (2012).
Wang, Y. et al. Pleiotropy analysis of quantitative traits at gene level by multivariate functional linear models. Genet Epidemiol 39(4), 259–275 (2015).
Yang, J. J., Li, J., Williams, L. K. & Buu, A. An efficient genome-wide association test for multivariate phenotypes based on the Fisher combination function. BMC Bioinform 17(1), 1 (2016).
O’Brien, P. C. Procedures for comparing samples with multiple endpoints. Biometrics 40, 1079–1087 (1984).
Zhu, H., Zhang, S. & Sha, Q. Power Comparisons of Methods for Joint Association Analysis of Multiple Phenotypes. Hum Hered 80(3), 144–52 (2016).
van der Sluis, S., Posthuma, D. & Dolan, C. V. TATES: Efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS Genet 9(1), e1003235 (2013).
Ferreira, M. A. & Purcell, S. M. A multivariate test of association. Bioinformatics 25(1), 132–133 (2009).
Cole, D. A., Maxwell, S. E., Arvey, R. & Salas, E. How the power of MANOVA can both increase and decrease as a function of the intercorrelations among the dependent variables. Psychol Bull 115(3), 465 (1994).
Galesloot, T. E., van Steen, K., Kiemeney, L. A. L. M., Janss, L. L. & Vermeulen, S. H. A comparison of multivariate genome-wide association methods. PLoS One 9, e95923 (2014).
Aschard, H. et al. Maximizing the power of principal-component analysis of correlated phenotypes in genome-wide association studies. Am J Hum Genet 94(5), 662–676 (2014).
Klei, L., Luca, D., Devlin, B. & Roeder, K. Pleiotropy and principal components of heritability combine to increase power for association analysis. Genet Epidemiol 32(1), 9–19 (2008).
Wang, K. & Abbott, D. A principal components regression approach to multilocus genetic association studies. Genet Epidemiol 32, 108–118 (2008).
Zhu, H., Zhang, S. & Sha, Q. A novel method to test associations between a weighted combination of phenotypes and genetic variants. PLoS ONE 13(1), e0190788, https://doi.org/10.1371/journal.pone.0190788 (2018).
Turley, P. et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet 50, 229–37 (2018).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–9 (2006).
Sha, Q., Wang, X., Wang, X. & Zhang, S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet Epidemiol 36, 561–571 (2012).
Draper, N. R. & Smith, H. Applied Regression Analysis, (John Wiley & Sons, 2014).
Cule, E. & De Iorio, M. Ridge regression in prediction problems: automatic choice of the ridge parameter. Genet Epidemiol 37(7), 704–14, https://doi.org/10.1002/gepi.21750 (2013).
Cule, E., Vineis, P. & De Iorio, M. Significance testing in ridge regression for genetic data. BMC Bioinformatics 12, 372, https://doi.org/10.1186/1471-2105-12-372 (2011).
Halawa, A. & El Bassiouni, M. Tests of regression coefficients under ridge regression models. J Stat Comput Simul 65(1–4), 341–56 (2000).
Hoerl, A. E., Kennard, R. W. & Baldwin, K. F. Ridge regression: some simulations. Commun Stat Theory Methods 4(2), 105–23 (1975).
Malo, N., Libiger, O. & Schork, N. J. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet 82(2), 375–85, https://doi.org/10.1016/j.ajhg.2007.10.012 (2008).
Yang, X., Wang, S., Zhang, S. & Sha, Q. Detecting association of rare and common variants based on cross-validation prediction error. Genet Epidemiol 41(3), 233–243 (2017).
Ge, Y., Dudoit, S. & Speed, T. P. Resampling-based multiple testing for microarray data analysis. Test 12(1), 1–77 (2003).
James, G., Witten, D., Hastie, T. & Tibshirani, R. An introduction to statistical learning, (Springer, 2013).
Lander, E. S. & Schork, N. J. Genetic dissection of complex traits. Science 265, 2037–48 (1994).
Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
Reich, D. E. & Goldstein, D. B. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol 20, 4–16 (2001).
Chen, H. S., Zhu, X., Zhao, H. & Zhang, S. Qualitative semi-parametric test for genetic associations in case-control designs under structured populations. Ann Hum Genet 67, 250–64 (2003).
Zhang, S., Zhu, X. & Zhao, H. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol 24, 44–56 (2003).
Zhu, X., Zhang, S., Zhao, H. & Cooper, R. S. Association mapping, using a mixture model for complex traits. Genet Epidemiol 23, 181–96 (2002).
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355–60 (2010).
Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348–54 (2010).
Wang, Z., Sha, Q. & Zhang, S. Joint Analysis of Multiple Traits Using “Optimal” Maximum Heritability Test. PLoS One 11(3), e0150975 (2016).
Chronic obstructive pulmonary disease (COPD). WHO. Retrieved from http://www.who.int/mediacentre/factsheets/fs315/en/ (Nov. 2017).
Liang, X. et al. An Adaptive Fisher’s Combination Method for Joint Analysis of Multiple Phenotypes in Association Studies. Sci Rep 6, 34323, https://doi.org/10.1038/srep34323 (2016).
Brehm, J. M. et al. Identification of FGF7 as a novel susceptibility locus for chronic obstructive pulmonary disease. Thorax 66(12), 1085–1090 (2011).
Cui, K., Ge, X. & Ma, H. Four SNPs in the CHRNA3/5 alpha-neuronal nicotinic acetylcholine receptor subunit locus are associated with COPD risk based on meta-analyses. PLoS One 9(7), e102324 (2014).
Du, Y., Xue, Y. & Xiao, W. Association of IREB2 gene rs2568494 polymorphism with risk of chronic obstructive pulmonary disease: a meta-analysis. Med Sci Monit 22, 177 (2016).
Cho, M. H. et al. Variants in FAM13A are associated with chronic obstructive pulmonary disease. Nat Genet 42(3), 200–202 (2010).
Hancock, D. B. et al. Meta-analyses of genome-wide association studies identify multiple loci associated with pulmonary function. Nat Genet 42(1), 45–52 (2010).
Lutz, S. M. et al. A genome-wide association study identifies risk loci for spirometric measures among smokers of European and African ancestry. BMC Genet 16(1), 1 (2015).
Li, X. et al. Importance of hedgehog interacting protein and other lung function genes in asthma. J Allergy Clin Immunol 127(6), 1457–1465 (2011).
Pillai, S. G. et al. A genome-wide association study in chronic obstructive pulmonary disease (COPD): identification of two major susceptibility loci. PLoS Genet 5(3), e1000421 (2009).
Wilk, J. B. et al. A genome-wide association study of pulmonary function measures in the Framingham Heart Study. PLoS Genet 5(3), e1000429 (2009).
Wilk, J. B. et al. Genome-wide association studies identify CHRNA5/3 and HTR4 in the development of airflow obstruction. Am J Respir Crit Care Med 186(7), 622–632 (2012).
Young, R. P. et al. Chromosome 4q31 locus in COPD is also associated with lung cancer. Eur Respir J 36(6), 1375–1382 (2010).
Zhang, J., Summah, H., Zhu, Y. G. & Qu, J. M. Nicotinic acetylcholine receptor variants associated with susceptibility to chronic obstructive pulmonary disease: a meta-analysis. Respir Res 12(1), 1 (2011).
Zhu, A. Z. et al. Association of CHRNA5-A3-B4 SNP rs2036527 with smoking cessation therapy response in African-American smokers. Clin Pharmacol Ther 96(2), 256–265 (2014).

Acknowledgements

Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R15HG008209. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This research used data generated by the COPDGene study, which was supported by National Institutes of Health (NIH) grants U01HL089856 and U01HL089897. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer Ingelheim, Novartis, and Sunovion. Superior, a high-performance computing infrastructure at Michigan Technological University, was used in obtaining results presented in this publication.

Author information

Authors and Affiliations

Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan, United States of America
Xinlan Yang, Shuanglin Zhang & Qiuying Sha

Contributions

S.Z. and Q.S. designed research, X.Y. and S.Z. performed statistical analysis, and X.Y., S.Z. and Q.S. wrote the manuscript.

Corresponding author

Correspondence to Qiuying Sha.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yang, X., Zhang, S. & Sha, Q. Joint Analysis of Multiple Phenotypes in Association Studies based on Cross-Validation Prediction Error. Sci Rep 9, 1073 (2019). https://doi.org/10.1038/s41598-018-37538-y

Received: 05 April 2018
Accepted: 19 November 2018
Published: 31 January 2019
DOI: https://doi.org/10.1038/s41598-018-37538-y
Springer Nature Limited