Genetic association analysis for common variants in the Genetic Analysis Workshop 18 data: a Dirichlet regression approach

Espin-Garcia, Osvaldo; Shen, Xiaowei; Qiu, Xin; Brhane, Yonathan; Liu, Geoffrey; Xu, Wei

doi:10.1186/1753-6561-8-S1-S70

Proceedings
Open access
Published: 17 June 2014

Volume 8, article number S70, (2014)

Osvaldo Espin-Garcia^1,2,
Xiaowei Shen^1,2,
Xin Qiu^1,
Yonathan Brhane^3,
Geoffrey Liu^4,5 &
…
Wei Xu^1,5

Abstract

We propose a genetic association analysis using Dirichlet regression to analyze the Genetic Analysis Workshop 18 data. Clinical variables, arranged in a longitudinal data structure, are used to fit a multistate transition model whose transition probabilities serve as the response in the proposed analysis. Furthermore, a gene-based association analysis via penalized regression is implemented using the markers that we previously identified at the single-nucleotide polymorphism level via nonpenalized Dirichlet regression.

Background

Genetic association analyses have had tremendous success in recent years; however, most of these analyses were based on binary or continuous responses. We instead propose a multivariate response vector that gives the probabilities of transitions to predefined hypertensive states. This enables us to reflect the inherent uncertainty in whether a patient will move to a given state.

An important feature of our approach is the incorporation of prehypertension as an intermediate state. As Winegarden argues, prehypertension blood pressure in young patients helps predict the development of hypertension [1].

Methods

Definition of response

We defined a response summarizing the phenotype information into a vector that will be used in a genetic association analysis. The response is defined as a 3-dimensional vector of probabilities $y = (y_{1}, y_{2}, y_{3})$, $\sum_{j = 1}^{3} y_{j} = 1$, such that each element measures the probability of a transition to a blood pressure level (normotensive, prehypertensive, or hypertensive) given the previous level.

The analysis was done without the knowledge of the underlying simulation model and we used the real phenotype data only.

Data quality control

Data quality control was performed in PLINK [2]. We only considered the data from chromosome 3 for analysis. We required a call rate of at least 95% for each individual and for each marker, and applied a Hardy-Weinberg equilibrium test at a significance level of 1 × 10⁻⁶. Markers with a minor allele frequency of at least 5% were retained for analysis. Additionally, all time points at which an individual had at least 1 missing clinical variable were excluded.
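The marker-level filters can be illustrated with a minimal Python sketch. It is not the PLINK procedure used in the analysis: the function name `marker_qc`, the 0/1/2 genotype coding with `None` for a missing call, and the reading of the marker threshold as a minimum call rate of 95% are all assumptions made for illustration, and the Hardy-Weinberg test is omitted.

```python
def marker_qc(genotypes, min_call_rate=0.95, min_maf=0.05):
    """Return (call_rate, maf, keep) for one marker.

    genotypes: minor-allele counts (0, 1, or 2) per individual,
    with None for a missing call (hypothetical coding).
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if not called:
        return call_rate, 0.0, False
    # Allele frequency: each called individual carries 2 alleles.
    freq = sum(called) / (2 * len(called))
    maf = min(freq, 1.0 - freq)
    keep = call_rate >= min_call_rate and maf >= min_maf
    return call_rate, maf, keep
```

A marker is kept only when both thresholds are met, so a common variant with too many missing calls is still dropped.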

Multistate transition model

We describe hypertension, our trait of interest, using a 3-state model based on recorded blood pressure levels for each individual at each examination. The states are defined as follows: normal blood pressure (state 1) when the systolic blood pressure is less than 120 mm Hg and diastolic blood pressure is less than 80 mm Hg; prehypertension (state 2) when the blood pressure level is not in state 1, the systolic blood pressure is less than 140 mm Hg, and the diastolic blood pressure is less than 90 mm Hg; and hypertension (state 3) for all other cases. Also, if a patient used antihypertensive medication, the state assigned at that examination is hypertension (state 3) regardless of the recorded blood pressure levels. Once the states are defined, we consider a multistate transition model; it is important to note that all 9 transitions are possible.
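The state definitions above can be written as a short Python sketch. The function and argument names (`classify_state`, `sbp`, `dbp`, `on_medication`, `transitions`) are hypothetical and chosen only for illustration; the thresholds are those stated in the text.

```python
def classify_state(sbp, dbp, on_medication=False):
    """Map one examination to state 1 (normal), 2 (prehypertension), or 3 (hypertension).

    sbp, dbp: systolic and diastolic blood pressure in mm Hg.
    """
    if on_medication:
        return 3  # medication use overrides the recorded blood pressure
    if sbp < 120 and dbp < 80:
        return 1
    if sbp < 140 and dbp < 90:
        return 2
    return 3


def transitions(states):
    """Pair each state with the state at the next examination."""
    return list(zip(states[:-1], states[1:]))
```

For example, a subject with states 1, 2, 2, 3 over four examinations contributes the transitions (1, 2), (2, 2), and (2, 3).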

Our interest in transition models lies in estimating the transition probabilities as defined in Kalbfleisch and Lawless [3], which are given by

$$P (S_{i} (t) = j \mid S_{i} (t - 1) = l, x_{i} (t - 1)) = y_{i l j} (t), \quad l, j \in \{1, 2, 3\}$$

where $\{S_{i} (r), r = 1, 2, \dots\}$ and $\{x_{i} (r) = (x_{i 1} (r), \dots, x_{i p} (r)), r = 1, 2, \dots\}$ denote the observed state and the covariates for subject i at the $r^{th}$ examination respectively.

This model takes advantage of the longitudinal data structure and the definition of the response follows naturally. To estimate the transition probabilities, we fit a multinomial regression model, based on covariates (gender, smoking status and age) and the state at the previous examination.

To get expressions for $y_{i l} = (y_{i l 1}, y_{i l 2}, y_{i l 3}), l = 1, 2, 3,$ we consider a generalized logit model of the form

$$\log (y_{i l j} / y_{i l l}) = z_{i l} γ_{l j}, \quad j = 1, 2, 3, \; j \neq l$$

In addition, $1 = y_{i l l} + \sum_{j = 1, j \neq l}^{3} y_{i l j} = y_{i l l} (1 + \sum_{j = 1, j \neq l}^{3} \exp (z_{i l} γ_{l j}))$, where $z_{i l} = (x_{i} (l), \text{time})$ is the observed vector of covariates for subject $i$ plus a categorical variable denoting the effect of examination time in the model (and possible interactions), and $γ_{l j}$ is the vector of coefficients for the corresponding multinomial regression model.

Thus, a transition probability matrix (TPM) is defined for each individual as follows

$$TPM_{i} = \begin{pmatrix}
\frac{1}{1 + \sum_{j = 2}^{3} \exp (z_{i 1} γ_{1 j})} & \frac{\exp (z_{i 1} γ_{12})}{1 + \sum_{j = 2}^{3} \exp (z_{i 1} γ_{1 j})} & \frac{\exp (z_{i 1} γ_{13})}{1 + \sum_{j = 2}^{3} \exp (z_{i 1} γ_{1 j})} \\
\frac{\exp (z_{i 2} γ_{21})}{1 + \sum_{j = 1, j \neq 2}^{3} \exp (z_{i 2} γ_{2 j})} & \frac{1}{1 + \sum_{j = 1, j \neq 2}^{3} \exp (z_{i 2} γ_{2 j})} & \frac{\exp (z_{i 2} γ_{23})}{1 + \sum_{j = 1, j \neq 2}^{3} \exp (z_{i 2} γ_{2 j})} \\
\frac{\exp (z_{i 3} γ_{31})}{1 + \sum_{j = 1}^{2} \exp (z_{i 3} γ_{3 j})} & \frac{\exp (z_{i 3} γ_{32})}{1 + \sum_{j = 1}^{2} \exp (z_{i 3} γ_{3 j})} & \frac{1}{1 + \sum_{j = 1}^{2} \exp (z_{i 3} γ_{3 j})}
\end{pmatrix}$$

Therefore, the response for subject $i$ is a row taken from $TPM_{i}$ and is determined by conditioning on the patient's last available observed state and covariates.
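As an illustration of how one row of the transition probability matrix follows from the generalized logit model, the sketch below computes the row for a given previous state, using that state as the reference category. The function name `transition_row` and the dictionary layout of the coefficients are assumptions made for this example; in the analysis the coefficients come from the fitted multinomial regression.

```python
import math


def transition_row(l, z, gamma):
    """Row l of the transition probability matrix for one subject.

    l:     previous state (1, 2, or 3), used as the reference category
    z:     covariate vector z_il as a list of floats
    gamma: dict mapping (l, j), j != l, to a coefficient list of the same length as z
    """
    scores = {}
    for j in (1, 2, 3):
        if j == l:
            scores[j] = 0.0  # reference category contributes exp(0) = 1
        else:
            scores[j] = sum(zk * gk for zk, gk in zip(z, gamma[(l, j)]))
    denom = sum(math.exp(s) for s in scores.values())
    return [math.exp(scores[j]) / denom for j in (1, 2, 3)]
```

Each row sums to 1 by construction, so it can be used directly as the 3-dimensional response vector.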

Dirichlet regression

Once the response is modeled, our objective is to determine whether there is an association between it and the genotypes. We assess this association using Dirichlet regression [4], which suits a response made up of probabilities that sum to 1. The approach handles the proposed response tractably and allows a more comprehensive understanding, and interpretation, of the genetic effect on the expression of hypertension. For instance, a signal detected for a marker would suggest an association between the marker and the transitions among blood pressure states jointly, rather than with a single blood pressure level. In this sense, the Dirichlet approach is more informative because it describes the plausibility of each defined state.

To relate the genetic information and the defined response under a Dirichlet regression approach, the likelihood given each individual's vector of covariates, $s_{i}$ , is

$$L = \prod_{i = 1}^{n} \left\{ Γ (Λ (s_{i})) \prod_{j = 1}^{3} \frac{y_{i j}^{λ_{j} (s_{i}) - 1}}{Γ (λ_{j} (s_{i}))} \right\}$$

where $λ_{j} (s_{i}) = λ_{i j} > 0, Λ (s_{i}) = Λ_{i} = \sum_{j = 1}^{3} λ_{j} (s_{i})$ and $Γ (\cdot)$ is the gamma function.

The parameters, $λ_{j} (s_{i})$ , are defined in terms of a linear predictor using a logarithm link,

$$\log (λ_{j} (s_{i})) = \log (λ_{i j}) = \sum_{m = 1}^{M} β_{j m} s_{i m} = s_{i} β_{j}, \quad j = 1, 2, 3$$

where $M$ is the number of covariates included in the model and $β_{j}$ is the vector of regression coefficients that explains the effects (on the log scale) of the covariates on the $j^{th}$ component.
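The likelihood and the log link can be combined into a small log-likelihood function. The sketch below is only an illustration in plain Python, not the R implementation used for the analysis; the name `dirichlet_loglik` and the list-based data layout are assumptions.

```python
import math


def dirichlet_loglik(Y, S, beta):
    """Dirichlet log-likelihood under a log link.

    Y:    list of n response vectors (y_i1, y_i2, y_i3), each strictly positive and summing to 1
    S:    list of n covariate vectors s_i of length M
    beta: list of 3 coefficient vectors beta_j of length M
    """
    total = 0.0
    for y, s in zip(Y, S):
        # lambda_ij = exp(s_i beta_j), so every lambda_ij is positive
        lam = [math.exp(sum(b * x for b, x in zip(beta_j, s))) for beta_j in beta]
        total += math.lgamma(sum(lam))
        for y_j, lam_j in zip(y, lam):
            total += (lam_j - 1.0) * math.log(y_j) - math.lgamma(lam_j)
    return total
```

With all coefficients equal to 0, every $λ_{i j}$ equals 1 and the density is uniform on the simplex, which gives a convenient sanity check: each observation then contributes $\log Γ (3) = \log 2$.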

Considering the above, 2 models are analyzed:

Model 1 (M1): $\log (λ_{i j}) = α_{j}^{M1} + β_{j}^{M1} g_{i}^{k}$ (base model)

Model 2 (M2): $\log (λ_{i j}) = α_{j}^{M2} + β_{j}^{M2} g_{i}^{k} + FAM_{i} δ_{j}^{M2}$ (adjusted model)

Here $g_{i}^{k}$ represents the number of copies of the minor allele at the $k^{th}$ single-nucleotide polymorphism (SNP) for the $i^{th}$ individual under an additive genetic model; $FAM_{i}$ is the $i^{th}$ row of the contrast matrix for the pedigree number, treated as a categorical variable; and $θ_{j}^{h} = (α_{j}^{h}, β_{j}^{h}, (δ_{j}^{h})^{t})^{t}$ is the vector of regression coefficients on the $j^{th}$ component for model $h$. (Note that $δ_{j}^{M1} = 0$.)

Our interest in these models lies in the potential genetic effect of each marker on the proposed response. To assess this, Wald statistics were used to test the null hypothesis of no association between each SNP and the response, $H_{0} : β = 0$ (vs. $H_{A} : \text{not } H_{0}$), where $β = (β_{1}, β_{2}, β_{3})$.
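The text does not spell out the form of the Wald statistic. A common choice, assumed here, is $W = \hat{β}^{t} \hat{V}^{-1} \hat{β}$, compared with a chi-square distribution with 3 degrees of freedom (one per component of $β$). The sketch below implements that assumption with the standard library only; `solve_linear` and `wald_test` are hypothetical helper names.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def wald_test(beta_hat, cov):
    """Wald statistic and p value for H0: beta = 0 with 3 coefficients.

    beta_hat: estimated (beta_1, beta_2, beta_3)
    cov:      their estimated 3 x 3 covariance matrix
    """
    w = sum(b * v for b, v in zip(beta_hat, solve_linear(cov, beta_hat)))
    # Closed-form upper tail of the chi-square distribution with 3 df.
    p = math.erfc(math.sqrt(w / 2.0)) + math.sqrt(2.0 * w / math.pi) * math.exp(-w / 2.0)
    return w, p
```

The closed-form tail probability holds only for 3 degrees of freedom; a model with a different number of response components would need a general chi-square routine.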

Gene-based association

Once we identify significant SNPs through the genetic association analysis described above, we proceed to the analysis at the gene level using a penalized regression. This penalized method includes all the markers in a gene simultaneously and aims to select the SNPs most strongly associated with the response. The analysis is done on candidate genes that contain at least 1 marker already identified as significant. Variable selection on the SNPs is assessed via a penalized likelihood of the form

$$pl (η; Y, G, c, κ) = l (η; Y, G) - c κ \sum_{l = 1}^{p} \sqrt{k} \, ∥η_{\cdot l}∥_{2} - (1 - c) κ \sum_{l = 1}^{p} ∥η_{\cdot l}∥_{1}$$

where $l (η; Y, G)$ represents the log-likelihood of a Dirichlet-distributed sample with response matrix $Y = (y_{1}^{t}, \dots, y_{n}^{t})^{t}$; $G = (g_{1}^{t}, \dots, g_{n}^{t})^{t}$ (or $G : FAM$ for M2) is the design matrix, with $g_{i} = (1, g_{i}^{1}, \dots, g_{i}^{p})$; $p$ denotes the number of markers considered for variable selection; $k$ is the number of states; $η$ is the vector of regression coefficients; $c$ and $κ$ are tuning parameters for the penalized regression; and $∥η_{\cdot l}∥_{2} = (\sum_{j = 1}^{k} η_{l j}^{2})^{1 / 2}$ and $∥η_{\cdot l}∥_{1} = \sum_{j = 1}^{k} |η_{l j}|$ are the penalty norms. It is important to note that when $c = 1$ only the $L_{2}$-norm (group) penalty remains, which we refer to as the ridge penalty, whereas when $c = 0$ we have a lasso penalty. We implement the variable selection for the penalized Dirichlet regression using R code provided on the webpage of the Statistical Genetics and Genomics Laboratory at the University of Pennsylvania [5].
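To make the role of $c$ concrete, the following sketch evaluates only the penalty term that is subtracted from the log-likelihood. It is an illustration in plain Python rather than the R code cited above; the function name `penalty` and the layout of the coefficients (one row of $k$ values per SNP, intercepts excluded) are assumptions.

```python
import math


def penalty(eta, c, kappa):
    """Penalty subtracted from the log-likelihood.

    eta:   list of p rows, one per SNP, each holding that SNP's k coefficients
    c:     mixing weight between the L2-norm term (c = 1) and the L1-norm term (c = 0)
    kappa: overall penalty strength
    """
    k = len(eta[0])
    group_term = sum(math.sqrt(k) * math.sqrt(sum(v * v for v in row)) for row in eta)
    l1_term = sum(sum(abs(v) for v in row) for row in eta)
    return c * kappa * group_term + (1.0 - c) * kappa * l1_term
```

A SNP whose coefficients are all 0 adds nothing to either term, which is how the penalty can remove a marker from all three components of the response at once.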

Results

Data quality results

The Genetic Analysis Workshop 18 (GAW18) data consists of 855 individuals with genotype and phenotype information. As a result of missing data, transition probabilities are estimated for only 835 individuals. Of these, 43 are removed because of low call rate. The overall call rate for the remaining 792 individuals is 99.82%.

The Genome Wide Association Study (GWAS) data includes 65,519 SNPs for chromosome 3, of which 59 are excluded because it is not possible to reliably obtain position information for these markers. The remaining 65,460 SNPs are considered for data quality control. Because of a low genotyping rate, 114 markers are removed; none are excluded by the Hardy-Weinberg equilibrium test; and 13,011 markers are removed because of low minor allele frequency. The remaining 52,357 markers are considered for analysis.

Analysis results

The parameter estimates for the transition models are obtained using R [6]. To test the effect of examination time, likelihood ratio tests are performed in which the null model considers only the available clinical variables. Table 1 presents the final transition models.

Table 1 Selected transition models

After the response is estimated, models M1 and M2 are fit using R [7] for each available SNP. Figure 1 displays the Manhattan plots for the p values that result from testing the null hypotheses of no association between the markers and the response. The graphs show that only 1 marker under M2 is significant at the standard significance level for GWAS (5 × 10⁻⁸). Interestingly, the same marker is the most significant marker under M1, although it is not significant at the standard threshold. This suggests that the adjustment for family incorporated in M2 accounts for the family structure in the data. Also, the proposed methodology demonstrates consistency in that the same marker proves to be the most significant under both models. Table 2 summarizes these findings.

Table 2 Association analysis results

Once significant markers are identified, a gene-level association analysis is performed using the penalized regression described above for different levels of $c$ (0, 0.3, 0.5, 0.7, and 1). The analysis is conducted using both the GWAS and the dosage imputed genotypes (GENO) information as the explanatory variables. Figure 2 shows the penalized regression results for the gene containing the significant SNP (rs12492830) for $c = 0.5$ only. This level of $c$ gives a blended penalty function that weights the ridge and lasso penalties equally. Table 3 shows the results for different levels of $c$ under M2 for gene PCCB.

Table 3 Comparison of penalized regression under different levels of $c$

Discussion

The present work implements a multistate transition model that conveniently accommodates the longitudinal data structure. Whether the information contained by the available clinical variables is sufficient for predicting the hypertensive state is debatable, however.

Although the adjusted model (M2) is an improvement over the base model (M1), neither model accounts for correlation between individuals or for heteroscedasticity. One possible way to overcome this is to incorporate a latent variable into the model. Such an extension follows.

Model 3 (M3): $\log (λ_{i j}) = α_{j}^{M3} + β_{j}^{M3} g_{i}^{k} + u_{i}$, where $u_{i}$ is the $i^{th}$ element of a vector $u$ that follows a $MVN (0, K)$ distribution; here $K$ is twice the estimated kinship matrix. In this case, however, the estimation of the parameters of interest, $β_{j}$, is not straightforward. Further research on this methodology is warranted.

With respect to the penalized regression, a cross-validation method could be implemented to avoid an arbitrary selection of $c$.

Conclusions

We propose a methodology that conveniently uses the longitudinal data structure to define a probabilistic outcome, which, we believe, describes hypertension in a more suitable way. Dirichlet regression provides an interesting approach that, along with other more common responses, can be successfully used in genetic association analysis. Our model finds a statistically significant marker at the standard significance level for GWAS, which is noteworthy, considering that it is often difficult to find association. Moreover, when the penalized method is used on the GENO data, we are able to find significant markers in addition to those already found using GWAS data.

References

1. Winegarden CR: From "Prehypertension" to Hypertension? Additional Evidence. Ann Epidemiol. 2005, 15: 720-725. 10.1016/j.annepidem.2005.02.010.
2. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
3. Kalbfleisch JD, Lawless JF: The analysis of panel data under a Markov assumption. J Am Statist Assoc. 1985, 80: 863-871. 10.1080/01621459.1985.10478195.
4. Campbell G, Mosimann JE: Multivariate methods for proportional shape. ASA Proceedings of the Section on Statistical Graphics. 1987, Washington, DC, 10-17.
5. Chen J, Li H: Variable selection for Dirichlet-multinomial regression for identifying covariates that are associated with microbiomes. Ann Appl Stat. 2013, 7: 418-442. 10.1214/12-AOAS592.
6. Venables WN, Ripley BD: Modern Applied Statistics with S. 4th edition. 2002, New York: Springer.
7. Maier MJ: DirichletReg: Dirichlet Regression in R. Ver. 0.4-0, [http://dirichletreg.r-forge.r-project.org]

Acknowledgements

Daniel Merico (TCAG/Sick Kids) provided gene annotation files (personal communication, 2012). OEG thanks Ian Jones for his valuable support in the manuscript review. Thanks to the reviewers for their comments and suggestions. The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.

This article has been published as part of BMC Proceedings Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

Author information

Authors and Affiliations

Department of Biostatistics, Princess Margaret Cancer Centre, 610 University Ave, Toronto, ON, M5G 2M9, Canada
Osvaldo Espin-Garcia, Xiaowei Shen, Xin Qiu & Wei Xu
Department of Statistics and Actuarial Science, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
Osvaldo Espin-Garcia & Xiaowei Shen
Samuel Lunenfeld Research Institute, Mount Sinai Hospital, 60 Murray Street, Toronto, ON, M5T 3L9, Canada
Yonathan Brhane
Ontario Cancer Institute/Princess Margaret Cancer Centre, 610 University Ave, Toronto, ON, M5G 2M9, Canada
Geoffrey Liu
Dalla Lana School of Public Health, University of Toronto, 155 College St, Toronto, ON, M5T 3M7, Canada
Geoffrey Liu & Wei Xu

Corresponding author

Correspondence to Wei Xu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

OEG and WX designed the overall study; OEG conceived the study, conducted statistical analyses, and wrote the manuscript; XS, XQ, and YB helped develop the study; GL revised the clinical aspects of the study. All authors read and approved the final manuscript.

About this article

Cite this article

Espin-Garcia, O., Shen, X., Qiu, X. et al. Genetic association analysis for common variants in the Genetic Analysis Workshop 18 data: a Dirichlet regression approach. BMC Proc 8 (Suppl 1), S70 (2014). https://doi.org/10.1186/1753-6561-8-S1-S70
