Abstract
Abstract
Microbiome studies generate multivariate compositional responses, such as taxa counts, which are strictly non-negative, bounded, reside within a simplex, and are subject to a unit-sum constraint. In the presence of covariates (which can be moderate to high dimensional), they are popularly modeled via the Dirichlet-Multinomial (D-M) regression framework. In this paper, we consider a Bayesian approach for estimation and inference under a D-M compositional framework, and present a comparative evaluation of some state-of-the-art continuous shrinkage priors for efficient variable selection, to identify the most significant associations between the available covariates and taxonomic abundance. Specifically, we compare the performances of the horseshoe and horseshoe+ priors (with the benchmark Bayesian lasso), utilizing Hamiltonian Monte Carlo techniques for posterior sampling and for generating posterior credible intervals. Our simulation studies using synthetic data demonstrate excellent recovery and estimation accuracy in the sparse parameter regime by the continuous shrinkage priors. We further illustrate our method via application to motivating oral microbiome data generated from the NYC-HANES study. An RStan implementation of our method is available on GitHub: https://github.com/dattahub/compshrink.
1 Introduction
Human microbiome studies (Kuczynski et al. 2012) reveal microbial compositions to be linked to several human diseases, such as periodontal disease (PD; Di Stefano et al. 2022), auto-immune diseases (De Luca and Shoenfeld 2019), and cancer (Kandalai et al. 2023). A key question in this context is finding associations between taxonomic abundances of microbiomes and environmental, clinical, and/or sociodemographic predictors, while taking into account the over-dispersion and sparsity in existing associations.
Existing statistical approaches for modeling oral microbiome outcomes (our current focus) mostly rely on a negative binomial (NB) regression model (Beghini et al. 2019; Renson et al. 2019) in conjunction with a differential abundance analysis, using popular R packages such as edgeR (Robinson et al. 2010) and DESeq2 (Love et al. 2014), with a false discovery rate (FDR) control mechanism to correct for multiplicity. While these approaches offer valuable insights on social inequities reflected in the oral microbiota, and identify differentially abundant operational taxonomic units (OTUs), they do not abide by the compositional framework of the responses (Gloor et al. 2017). Furthermore, they do not consider an appropriate shrinkage mechanism (on the model covariates) which can automatically adapt to sparsity, and thereby select the most important covariates (from an extended list). Existing shrinkage-based approaches include \(\ell _1\)-penalized regression on linear log-contrast models (Lin et al. 2014), or spike-and-slab priors on regression coefficients for Bayesian variable selection (Wadsworth et al. 2017). While these methods offer an ‘integrative’ view of assessing the microbiome-environment association, each of them comes with associated challenges in a fully Bayesian set-up. The convex \(\ell _1\)-penalties work well if one only considers the point estimate corresponding to the posterior mode or MAP estimator; however, their Bayesian analogs (e.g. a Laplace prior for the Bayesian Lasso) lead to sub-optimal uncertainty quantification. In that vein, Castillo et al. (2015) pointed out that for such priors, the entire posterior distribution contracts at a suboptimal rate, unlike the posterior mode, and this phenomenon is driven by the insufficiently heavy tails of the Laplace prior.
The spike-and-slab priors, on the other hand, often carry a heavy computational cost unless one carefully avoids the combinatorial search through the huge model space. More importantly, the idealized dichotomy of 0’s and 1’s is often deemed artificial in most applications as effect sizes are usually not exact zeroes, and relaxing this restriction leads to better performance as showcased by the continuous shrinkage priors, also called the ‘global–local’ (G-L) shrinkage priors (Bhadra et al. 2021).
Motivated by the 2013-14 NYC-HANES-II (Waldron 2023) oral microbiome data, henceforth NYC-HANES, our objective in this paper is to quantify a precise and pragmatic relationship between the oral microbiome outcomes and a host of available sociodemographic factors. Specifically, we cast our microbiome regression framework into a Dirichlet-Multinomial (D-M) specification (Chen and Li 2013) under a Bayesian paradigm, and explore the performance of the G-L priors, specifically the horseshoe (Carvalho et al. 2010) and horseshoe+ (Bhadra et al. 2017) priors, in terms of inducing sparsity for efficient variable shrinkage and selection, and related uncertainty assessments. The entire computational framework is powered by Hamiltonian Monte Carlo (HMC) dynamics (Betancourt et al. 2017) implemented via RStan. HMC is a Markov chain Monte Carlo (MCMC) method which adopts physical system dynamics (instead of the proposal distribution of a usual random-walk sampler) to attain the future states of the underlying Markov chain, thereby allowing a more efficient exploration of the target distribution and faster convergence. A key contribution is also to produce a set of reproducible code and software on GitHub, thus promoting the generalizability of our methodology to other datasets of similar architecture.
The rest of the manuscript is organized as follows. After an introduction to the G-L shrinkage priors, our D-M regression framework is presented in Sect. 2. Appendix B outlines the RStan-based implementation of the proposed HMC estimation scheme. The methodology is illustrated via application to the motivating NYC-HANES data in Sect. 3. The comparative efficiency of the proposed variable selection methods, and their estimation accuracy, are explored via synthetic data in Sect. 4. Finally, some concluding remarks are relegated to Sect. 5, while Appendices A and B present additional results from the motivating data analysis and the RStan implementation of our modeling, respectively.
2 Statistical Model
We begin with an introduction to the G-L shrinkage priors in Sect. 2.1, and then outline our D-M hierarchical model in Sect. 2.2.
2.1 Global–Local Shrinkage
The key idea behind the G-L shrinkage (Polson and Scott 2011) for high-dimensional regression is to model the parameter of interest \(\varvec{\beta }\) with a prior consisting of two scale parameters: a local shrinkage parameter that helps in identifying the signals, and a global shrinkage parameter that adapts to the overall sparsity in the data. In the simplest setting, the G-L shrinkage prior is specified as follows:
\[
\beta_i \mid \lambda_i, \tau \sim \mathcal{N}(0, \lambda_i^2 \tau^2), \qquad \lambda_i \sim f, \qquad \tau \sim g. \tag{2.1}
\]
The hyper-priors are usually taken to be heavy-tailed distributions that provide robustness to large signals. The parameters \(\lambda _i\) and \(\tau\) are called the local and global shrinkage parameters, as they help identify the signals and capture sparsity, respectively. The most popular member of this family is the horseshoe estimator (Carvalho et al. 2010), which results from taking both f and g in (2.1) to be half-Cauchy. The horseshoe has inspired several continuous shrinkage priors. For example, to sharpen the signal recovery property of the horseshoe prior, Bhadra et al. (2017) proposed the horseshoe+ prior by using the novel idea of ‘Jacobian shrinkage’. The key here is an extra Jacobian term introduced in the representation on the shrinkage scale, which exhibits fundamentally different behavior in separating signals from noise, and allows efficient signal detection in ultra-sparse cases. There is a wealth of theoretical and practical optimality results for the class of G-L priors. We list some of them next.
(i) For the sparse normal means model, G-L priors attain the Bayes oracle risk for the multiple testing rule under a 0-1 additive loss (Ghosh et al. 2016; Datta and Ghosh 2013), and they also attain the asymptotic minimax rate for parameter estimation over ‘nearly-black’ parameter spaces (van der Pas et al. 2016), while providing excellent uncertainty quantification (van der Pas et al. 2017), at least for parameters that are either close to, or away from, zero.
(ii) Bhadra et al. (2016) resolve Efron’s marginalization paradoxes in Bayesian inference by showing that the horseshoe prior can act as the default prior for non-linear functions of a high-dimensional parameter \(\varvec{\beta }\). The key insight here is that the regularly varying tails of the horseshoe prior lead to robustness in the presence of non-linearity.
(iii) Bhadra et al. (2019) prove that the horseshoe prior can outperform key competitors such as ridge regression in terms of finite-sample prediction risk.
These priors induce sparsity but avoid the computational bottleneck of searching over an exponentially growing model space, which is often cited as the main obstacle for implementing the spike-and-slab class of priors on large parameter spaces. Inspired by the success of the horseshoe, many G-L priors have emerged in recent times, focusing on the sparse normal means and regression problems. Some of the popular G-L priors include the Normal Exponential Gamma (Griffin and Brown 2010), the generalized double Pareto (GDP) (Armagan et al. 2013), the three-parameter beta (Armagan et al. 2011), the Dirichlet-Laplace (Bhattacharya et al. 2015), and the more recent spike-and-slab Lasso (Ročková and George 2016), horseshoe+ (Bhadra et al. 2017) and R2-D2 (Zhang et al. 2016) priors. Figure 1 displays the functional form of some of these prior densities near the origin and in their tails. As is evident from the picture, a common feature of the G-L priors is a peak near zero and heavy tails. The peak helps in adapting to sparsity, while the heavy tails provide robustness by leaving large signals relatively unshrunk.
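The ‘horseshoe’ name itself comes from the implied shrinkage weights: with the global scale fixed at \(\tau = 1\), the weight \(\kappa _i = 1/(1+\lambda _i^2)\) follows a Beta(1/2, 1/2) distribution under a half-Cauchy local scale, whose U-shaped (horseshoe-shaped) density piles mass near 0 (signals left unshrunk) and 1 (noise shrunk to zero). A minimal Monte Carlo sketch (illustrative only, not part of our RStan code) verifies this behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Local scales: lambda ~ half-Cauchy(0, 1), i.e. |standard Cauchy|
lam = np.abs(rng.standard_cauchy(size=200_000))

# Shrinkage weight kappa = 1 / (1 + lambda^2); follows Beta(1/2, 1/2)
kappa = 1.0 / (1.0 + lam**2)

# The density is U-shaped: mass piles up near 0 and near 1,
# with comparatively little mass in the middle.
near_0 = np.mean(kappa < 0.1)                      # theoretically ~ 0.205
near_1 = np.mean(kappa > 0.9)                      # ~ 0.205 by symmetry
middle = np.mean((kappa > 0.45) & (kappa < 0.55))  # ~ 0.064

print(near_0, near_1, middle)
assert near_0 > middle and near_1 > middle
```

The two end intervals each carry roughly three times the mass of an equally wide interval in the middle, which is exactly the ‘shrink noise fully, leave signals alone’ behavior described above.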
It is well-known that Bayesian methods based on convex penalties, such as the Bayesian Lasso, inherit the problems of the associated regularization method (Castillo et al. 2015; Polson and Scott 2010). For example, the Bayesian lasso, based on double exponential priors, leads to non-vanishing bias for large signals owing to its lack of regularly varying tails. Castillo et al. (2015) argue that for the Bayesian lasso, the full posterior distribution does not contract at the optimal rate, leading to unsatisfactory uncertainty quantification. On the other hand, the gold-standard Bayesian spike-and-slab priors work well for small to moderate dimensions, but suffer from the computational burden of searching over a combinatorial model space, untenable for huge data dimensions. The G-L shrinkage priors (Polson and Scott 2010, 2012) are a pragmatic alternative offering an optimal solution: continuous scale mixture priors that can simultaneously shrink small noise terms to zero and leave large observations intact, while allowing fast computing strategies (e.g. Bhattacharya et al. 2016; Johndrow et al. 2020). G-L shrinkage priors are now recognized widely as the state-of-the-art Bayesian tool for shrinkage and selection, owing to their efficiency and rich theoretical support.
Our central contribution in this paper is to extend and examine the inferential capacity of these G-L shrinkage priors in terms of variable selection under a compositional regression framework, where the likelihood is characterized by the D-M distribution. We now outline the hierarchical Bayesian D-M regression framework.
2.2 Dirichlet-Multinomial Hierarchical Model with Shrinkage Prior
Consider high-throughput sequencing data, where the total sequencing depth for each sample or location, and the frequency of each species or taxon, are known, along with a covariate vector of moderate to high dimension. Let \(\textbf{y}_{i} = (y_{i1}, y_{i2}, \ldots , y_{iN})\), \(1 \le i \le M\), be the vector of counts representing the abundance of different species in the \(i^{th}\) sample, i.e., \(y_{ij}\) is the frequency of the \(j^{th}\) species in the \(i^{th}\) sample. Let \(\textbf{X}= (X_1, X_2, \ldots , X_p)\) be the \(M \times p\) matrix of covariates. We assume the count vector for each patient/sample follows a multinomial distribution with species distribution \(\varvec{\pi }_i\) for the \(i^{th}\) sample. The weights \(\varvec{\pi }_i\) satisfy the unit-sum constraint and further follow an appropriate distribution over the N-dimensional simplex, such as the Dirichlet distribution.
To initiate a simple modeling framework, consider the conjugate Dirichlet prior on the compositional parameter \(\varvec{\pi }_i\), i.e. each \(\varvec{\pi }_i\) is given a Dirichlet prior with hyperparameter \(\textbf{a}= (a_1, \ldots , a_N)\), where the \(a_j\)’s are strictly positive shape parameters. This D-M framework is advantageous as it allows us to marginalize out the \(\varvec{\pi }_i\) parameters to yield the D-M likelihood, thereby offering more flexibility for count data. The D-M probability mass function with shape parameter \(\textbf{a}\) is given by:
\[
f(\textbf{y}_i \mid \textbf{a}) = \frac{y_{i+}!}{\prod_{j=1}^{N} y_{ij}!}\, \frac{\Gamma(a_+)}{\Gamma(y_{i+} + a_+)} \prod_{j=1}^{N} \frac{\Gamma(y_{ij} + a_j)}{\Gamma(a_j)}, \qquad y_{i+} = \sum_{j=1}^{N} y_{ij}, \quad a_+ = \sum_{j=1}^{N} a_j.
\]
A well-known advantage of marginalizing out \(\varvec{\pi }_i\) is that the integrated D-M model has a larger variance term to account for over-dispersion, a typical feature of most count data. Our hierarchical model is thus given by:
\[
\textbf{y}_i \mid \varvec{\pi}_i \sim \text{Multinomial}(y_{i+}, \varvec{\pi}_i), \qquad \varvec{\pi}_i \sim \text{Dir}(\textbf{a}), \qquad 1 \le i \le M,
\]
where, Dir denotes the Dirichlet distribution, and \(\textbf{a}\) is the shape parameter for the Dirichlet distribution. Now, covariates can be incorporated into the setup via a log-linear link on the D-M shape parameter \(\textbf{a}\), such that:
\[
a_j(\textbf{x}_i) = \exp \Big( \beta_{0j} + \sum_{l=1}^{p} \beta_{lj}\, X_{il} \Big), \qquad j = 1, \ldots, N,
\]
where, \(\beta _{0j}\) is the intercept parameter, and \(\beta _{lj}\) is the parameter corresponding to the covariate \(X_{il}\). The parameters are given Normal priors, with various hyper-priors, such as \(\pi _{\lambda }(\cdot )\) and \(\pi _{\tau }(\cdot )\), that induce varying shrinkage on the regression coefficients. In particular, the popular horseshoe prior (Carvalho et al. 2010) corresponds to:
\[
\beta_{lj} \mid \lambda_{lj}, \tau \sim \mathcal{N}(0, \lambda_{lj}^2 \tau^2), \qquad \lambda_{lj} \sim \mathcal{C}a^+(0, 1), \qquad \tau \sim \mathcal{C}a^+(0, 1),
\]
where, \(\mathcal {C}a^+(.,.)\) denotes the half-Cauchy density. The Bayesian Lasso corresponds to a double exponential prior (Carlin and Polson 1991; Park and Casella 2008; Hans 2009):
where, Exp(.) and \(\text{ IG }(.,.)\) denote the exponential and inverse-gamma densities, respectively. Finally, for the horseshoe+ prior (Bhadra et al. 2017), one needs a product-Cauchy prior, with the prior on \(\lambda _{lj}\) taking the form:
\[
p(\lambda_{lj} \mid \tau) = \frac{4}{\pi^2 \tau}\, \frac{\log (\lambda_{lj}/\tau)}{(\lambda_{lj}/\tau)^2 - 1},
\]
which is the same as adding another layer of hierarchy:
\[
\lambda_{lj} \mid \eta_{lj} \sim \mathcal{C}a^+(0, \tau\, \eta_{lj}), \qquad \eta_{lj} \sim \mathcal{C}a^+(0, 1).
\]
3 Application: NYC-HANES Data
3.1 Data Description
The 2013-14 New York City Health and Nutrition Examination Survey, henceforth NYC-HANES-II (Thorpe et al. 2015), is a population-based survey of 1,575 non-institutionalized adults residing in New York City. The NYC-HANES followed the design of the National Health and Nutrition Examination Survey (NHANES; NHANES 2017) conducted by the United States Centers for Disease Control and Prevention, and employed a tri-level cluster household probability sampling approach to select participants from the target population of all non-institutionalized adults aged \(\ge\) 18 years. NYC HANES-II is the second health examination survey conducted on NYC residents, with the goal of assessing the effects of the health policies implemented since the first NYC-HANES survey in 2004, as well as major changes in population health. Analyses of sub-samples (Beghini et al. 2019; Renson et al. 2019) revealed that the oral microbiomes of people with varying smoking habits, diets, and oral health behaviors differed, suggesting that the oral microbiome could be a factor in health disparities. In particular, an exploration of the association between tobacco exposure and the oral microbiome (Beghini et al. 2019) in a sub-sample of 297 individuals revealed an impaired balance of oxygen-utilizing bacteria leading to negative health outcomes. A follow-up study (Renson et al. 2019) established differential abundance of operational taxonomic units (OTUs) for sociodemographic variables, such as race/ethnicity, socio-economic status (SES), marital status and others. These results shed more light on the role of microbiota as a mediator/confounder in association studies of social environment and health outcomes, although the mechanism of such differential abundance is not yet known.
For our analysis, we analyze a sub-sample of the full dataset by filtering out the rows with missing values and pregnant individuals, resulting in 264 samples and 11 sociodemographic variables (with multiple categories in several variables). The dataset and R code are available in the R package nychanesmicrobiome (Waldron 2023). We focus on the following sociodemographic variables: (a) Age (denoted as SPAGE), (b) BMI, (c) Gender (denoted as GENDER; Male, Female), (d) Race (denoted as RACE; White, Hispanic, Black, and Asian), (e) Education categories (denoted as EDU4CAT; less than high school diploma, high school graduate/GED, some college or bachelor degree, college graduate or more), (f) Income category (denoted as INCOMECAT; \(\le\) 20k, 20k-45k, and \(\ge\) 45k), (g) self-reported smoking status (denoted as smokingstatus; Cigarette, Alternative, SecondHand, and Former), (h) Diabetes, based on FPG, HbA1C, or self-report (denoted as DBTS-NEW; Yes/No), (i) Answer to the survey question, “Does respondent think respondent has gum disease?” (denoted by OHQ3; Yes/No), (j) Answer to the survey question, “Has SP ever been told by a doctor or other health professional that SP had hypertension, also called high blood pressure?” (denoted by BPQ2; Yes/No), and (k) Answer to the survey question, “Have you taken any aspirin, steroids, or other non-steroidal anti-inflammatory medicine in the past week?” (denoted by US-NSAID; Yes/No). Our response variable is the read count at the genus level, limited to genera present in \(>20\%\) of samples, resulting in 64 distinct OTUs. The taxonomic composition of the present subsample is similar to what is reported in previous papers (Beghini et al. 2019); in particular, the dominant genera are Streptococcus and Prevotella, with other commonly found genera such as Rothia, Veillonella and Gemella. Figure 2 presents the taxonomic composition heatmap of the top 5% OTUs in the data, while Fig. 3 shows the Spearman’s rank correlation heatmap between the sociodemographic variables and all 64 genera present. This correlation plot suggests quantifiable (marginal) relationships between the variables and the OTUs.
3.2 Model Fitting and Findings
We now fit the D-M hierarchical shrinkage model to the NYC-HANES data with three candidate priors: horseshoe (Carvalho et al. 2010), horseshoe+ (Bhadra et al. 2017) and the Bayesian Lasso (Park and Casella 2008). We use the popular Stan software to implement posterior sampling, as described in Appendix B. For selecting non-zero associations, we check whether the 95% credible interval for each \(\beta _{lj}\) contains zero, although other methods could be used, such as the 2-means strategy of Li and Pati (2017). As expected, the horseshoe and horseshoe+ lead to sparser association recovery than the Bayesian Lasso (or Laplace); the numbers of \(\beta\)’s selected by the above criterion for the three priors were horseshoe: 51, horseshoe+: 56 and Bayesian Lasso: 79. Figure 4 displays the selected \(\beta _{lj}\)’s obtained from the competing methods; these can be interpreted as the recovered associations between the sociodemographic variables and the genera in the NYC-HANES data. Figure 4 also shows both the sparsity patterns as well as the commonality of the performances of the shrinkage priors. As noted earlier, the Bayesian Lasso tends to select more associations due to its less aggressive shrinkage near zero, as opposed to the horseshoe-type priors with a spike at the origin.
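The interval-based selection rule can be sketched as follows (a toy illustration with a hypothetical helper and fake posterior draws, not our actual RStan output): a coefficient is flagged when its equal-tailed 95% credible interval excludes zero.

```python
import numpy as np

def select_by_credible_interval(beta_draws, level=0.95):
    """Flag coefficients whose equal-tailed credible interval excludes zero.

    beta_draws: array of shape (n_posterior_draws, n_coefficients).
    Returns a boolean vector: True = selected (interval excludes 0).
    """
    alpha = 1.0 - level
    lo = np.quantile(beta_draws, alpha / 2, axis=0)
    hi = np.quantile(beta_draws, 1 - alpha / 2, axis=0)
    return (lo > 0) | (hi < 0)

# Toy 'posterior draws': one clear signal, one coefficient straddling zero
rng = np.random.default_rng(1)
draws = np.column_stack([
    rng.normal(2.0, 0.1, size=4000),   # concentrated away from zero
    rng.normal(0.0, 0.5, size=4000),   # straddles zero
])
print(select_by_credible_interval(draws))  # first selected, second not
```

The same rule applies coefficient-wise to the \(p \times q\) matrix of D-M regression coefficients after flattening the posterior draws.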
We now interpret the selected genera corresponding to the smoking status variable. Table 1 presents the selected genera when ‘smokingstatus = cigarette’ (with non-smokers as baseline). We observe that Prevotella, Haemophilus, and Neisseria are among the selected genera; these were also identified in earlier work (Beghini et al. 2019). In smokers, a reduction in Bacteroides and Proteobacteria was observed, accompanied by a decrease in genera also identified by this model, for example, Prevotella, Haemophilus, and Neisseria, aligning with findings from prior studies (Charlson et al. 2010; Morris and Tang 2011; Wu et al. 2016). Morris and Tang (2011) compared microbial composition in the lung and oral cavity for 64 individuals, and noted that while the microbial communities of the lung and oral cavity resemble each other, there exist notable differences; in particular, the distribution of Neisseria, Gemella and Porphyromonas differed in the oral cavity of smokers versus non-smokers. Notably, the diminished presence of Proteobacteria appears to be significant, given its association with PD in affected individuals compared to their healthy counterparts (Griffen et al. 2012). Charlson et al. (2010) reported that smoking leads to a simultaneous reduction in commensal (harmless) microbes (e.g. Prevotella) and an increase or enrichment of potential pathogens like Streptococcus and Haemophilus. The reduced abundance of normal oral microbiota, such as Fusobacterium and Neisseria, in smokers compared to non-smokers, as reported by Charlson et al. (2010), aligns with the negative associations reported by our integrated D-M model (see Table 1). Additional results on the microbial genera selected by the three candidate priors corresponding to the BMI and Gender predictors are presented in Appendix A.
3.3 Multivariate Posterior Predictive Model Assessment
Posterior predictive p-values (Meng 1994; Gelman et al. 1996, 2014) are often used as the primary diagnostic tool for model assessment under the Bayesian paradigm. The underlying intuition here is that if the model fits the data well, then the original observations will be ‘similar’ to the replicates from the posterior distribution under the model. More formally, if \(p(y^{rep} \mid y)\) denotes the posterior predictive distribution, given by \(p(y^{rep} \mid y) = \int p(y^{rep} \mid \Theta ) p(\Theta \mid y) d\Theta\), then the posterior predictive p-value (PPP) for a test statistic \(T(\cdot)\) is given by \(P(T(y^{rep}) > T(y) \mid y)\), where the probability is computed with respect to the posterior predictive distribution \(p(y^{rep} \mid y)\). A PPP value close to 0 or 1 is indicative of a poor fit.
However, in this paper, the response variable \(\textbf{y}\) lies in a K-dimensional simplex \(\mathcal{S}_K\), and a suitable metric for model checking that accounts for multivariate outcomes is needed. We adopt a natural extension of posterior predictive model checking to multivariate outcomes, proposed by Crespi and Boscardin (2009), using dissimilarity measures. Here, the primary idea is to calculate pairwise distances between the observed data and the posterior predictive replicates (say \(d^{obs}_{j} = d(\textbf{y}^{obs}, \textbf{y}_j^{rep})\), for \(j = 1, \ldots , J\)), as well as pairwise distances between the posterior predictive replicates (say \(d^{rep}_{ij} = d(\textbf{y}_i^{rep}, \textbf{y}_j^{rep})\), for \(i \ne j\); \(i, j = 1, \ldots , J\)), producing two sets of dissimilarity measures. Then, if the model captures the multivariate \(\textbf{y}\) well, the values \(d^{rep}_{ij}\) and \(d^{obs}_j\) should be stochastically similar.
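A minimal sketch of this dissimilarity-based check follows, using Bray-Curtis as an example distance for count vectors (the function names are ours, and the choice of distance is an illustrative assumption, not prescribed by Crespi and Boscardin 2009):

```python
import numpy as np

def ppc_dissimilarities(y_obs, y_reps, dist):
    """d_obs[j] = d(y_obs, y_rep_j); d_rep = all pairwise d(y_rep_i, y_rep_j), i<j."""
    J = len(y_reps)
    d_obs = np.array([dist(y_obs, y_reps[j]) for j in range(J)])
    d_rep = np.array([dist(y_reps[i], y_reps[j])
                      for i in range(J) for j in range(i + 1, J)])
    return d_obs, d_rep

def bray_curtis(u, v):
    # A common dissimilarity for count/compositional vectors, in [0, 1].
    return np.abs(u - v).sum() / (u + v).sum()

# Toy check: replicates drawn around the observed composition
rng = np.random.default_rng(2)
y_obs = rng.multinomial(1000, [0.5, 0.3, 0.2])
y_reps = [rng.multinomial(1000, [0.5, 0.3, 0.2]) for _ in range(30)]
d_obs, d_rep = ppc_dissimilarities(y_obs, y_reps, bray_curtis)
# Under a well-fitting model the two sets of distances overlap closely.
print(d_obs.mean(), d_rep.mean())
```

In practice, \(\textbf{y}_j^{rep}\) would be draws from the fitted D-M posterior predictive distribution rather than the fixed multinomial used in this toy example, and the two empirical densities would be overlaid as in Fig. 5.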
Figure 5 presents the posterior density plots for the observed and replicate dissimilarity measures for the three candidate priors: horseshoe, horseshoe+ and Bayesian Lasso (or, Laplace) in the D-M model. The plots reveal that the degree of overlap is much smaller for the Laplace prior compared to the two other shrinkage priors, indicating better fit for the latter methods.
4 Simulation Studies
In this section, we evaluate the finite sample performances of our proposed method in terms of the accuracy of variable selection, as well as comparisons between the horseshoe, horseshoe+, and Bayesian lasso prior assumptions via simulation studies with synthetic data.
4.1 Scheme I(a): Evaluation of Finite-Sample Performance and Estimation Accuracy
To evaluate the variable selection performance of model (2.3)–(2.4), we generate the true \(\beta\) from a mixture of a uniform and a point mass at zero, given by \(\beta _{jl} \sim \pi ~U(L,U) + (1-\pi )~ \delta _{\{0\}}\). We draw X from a multivariate Normal, such that \(X \sim \mathcal {N}(0, \Sigma )\), with \(\Sigma _{ij} = \rho ^{|i-j|}\) and \(\rho = 0.4\), as in Wadsworth et al. (2017). The count matrix Y is generated from a multinomial distribution, i.e., \(Y_i \sim \text {Multinomial}(y_{i+}, \varvec{\pi }_i)\), where the row-total \(y_{i+}\) follows a discrete uniform distribution \(\text {Unif}(\{1000, 5000\})\). The compositions \(\varvec{\pi }_i\) follow a Dirichlet distribution with an over-dispersion parameter \(\psi\), given by \(\varvec{\pi }_i \sim \text {Dir}\left( \frac{\psi }{1-\psi } (a_{i1}, \ldots , a_{iJ}) \right)\), where the shape parameters are linked to the predictor variables as \((a_{i1}, \ldots , a_{iJ}) = \text {softmax}(\eta _i)\), with \(\text {softmax}(\textbf{z}) = \exp (\textbf{z})/\sum (\exp (\textbf{z}))\) and \(\eta = X\beta\).
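The generative scheme above can be sketched as follows; this is a simplified illustration with our own variable names and illustrative values for \(\pi\), L, U, and \(\psi\) (the published RStan scripts should be treated as the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 50, 20, 20          # samples, predictors, taxa
pi0, L, U, psi, rho = 0.0625, 0.5, 1.0, 0.5, 0.4   # illustrative choices

# Sparse true coefficients: mixture of U(L, U) and a point mass at zero
mask = rng.random((p, q)) < pi0
beta = np.where(mask, rng.uniform(L, U, size=(p, q)), 0.0)

# AR(1)-correlated design: Sigma_ij = rho^|i - j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Link linear predictor to Dirichlet shapes via row-wise softmax
eta = X @ beta
a = np.exp(eta - eta.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)

# Draw over-dispersed compositions, then multinomial counts
Y = np.empty((n, q), dtype=int)
for i in range(n):
    prob = rng.dirichlet(psi / (1 - psi) * a[i])
    Y[i] = rng.multinomial(rng.choice([1000, 5000]), prob)

print(Y.shape)
```

The subtraction of the row maximum inside the softmax is a standard numerical-stability device and does not change the resulting probabilities.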
We first examine the performance of the horseshoe prior vis-à-vis the horseshoe+ and Bayesian Lasso priors. For this simulation study, a single model fit incorporates four chains with 2000 iterations each, where the first 1000 iterations of each chain are discarded as warm-up, leaving 4000 post-warm-up samples in total. We observed that a smaller sample size and fewer chains make no qualitative difference to the posterior estimates. The number of observations, number of predictors and number of taxa are fixed at \(n = 50\), \(p = 20\) and \(q = 20\) respectively, leading to the number of estimable parameters in the \(\varvec{\beta }\) matrix being \(p \times q = 400\) (which is more than the number of observations). We also set the number of relevant predictors (\(p_0\)) and the number of relevant taxa (\(q_0\)) both at 5, i.e. the proportion of non-zero parameters in \(\varvec{\beta }\) is \(p_0q_0/pq = 0.0625\). This set-up is chosen to induce sparsity in the generating model. Figure 6 (left panel) suggests that we can recover the true non-zero associations from the data, and the posterior inclusion probability concentrates at higher values for the true associations. The two red dots indicate false discoveries, as expected, by the Bayesian lasso. For the same data, we compare the estimation accuracy of the candidate variable selection methods via 95% posterior credible intervals. This was done following the suggestions in van der Pas et al. (2016), Wei (2017) and Tadesse and Vannucci (2019), where variable selection is performed by checking whether the posterior credible interval contains zero at a nominal level. As shown in Fig. 6 (right panel), the Bayesian Lasso misses two of the true non-zero \(\beta\)’s (corresponding to the two red dots in the left panel of Fig. 6), but correctly shrinks all null \(\beta\)’s to zero.
This is expected, since the Bayesian Lasso does not shrink to zero as strongly as horseshoe or horseshoe+, because of its lack of large prior mass at zero (Polson and Scott 2010).
4.2 Scheme I(b): Evaluation of Variable Selection Strategies
As suggested by a referee, we now compare the 95% credible interval (CrI) approach for variable selection with the two-means (2M) algorithm as used by Bhattacharya et al. (2015). We follow the simulation set-up of Scheme I above, but under a weaker correlation \(\rho = 0.1\). Figure 7 shows the variable selection performance for the 95% CrI approach alongside the 2M approach. The 2M approach performs worse than the CrI approach for all priors but the Bayesian Lasso. In particular, out of the 25 true non-zero entries in \(\varvec{\beta }\), the horseshoe prior selects the maximum number (23) of non-zero associations using the 95% CrI approach, as shown in Table 2. It is interesting to note that none of the candidate priors results in any false discoveries.
While there is no consensus over a single established method for variable selection with continuous shrinkage priors, several approaches exist to identify the final set of relevant variables. These include:
(i) Credible intervals (at a pre-defined nominal level) covering zero, as suggested by van der Pas et al. (2017) or Tadesse and Vannucci (2019). This approach is more conservative (fewer discoveries) and ideal for minimizing false positives. We have used this approach for variable selection.
(ii) Thresholding shrinkage factors, as in Tang and Chen (2018). This is limited to scenarios where the number of variables (p) is less than the number of samples (n), and, to our knowledge, is only defined for linear regression and the sequence model.
(iii) Decoupling shrinkage and selection (DSS; Hahn and Lopes 2014). This method aims to create a sparse posterior mean that explains most of the predictive variability. However, its focus on prediction might not be optimal for problems estimating regression coefficients with correlated regressors.
(iv) Penalized credible regions (Zhang et al. 2016). This approach seeks the sparsest model within a specific credible region, viz., the \(100\times (1-\alpha )\)% joint elliptical credible region.
(v) Two-means clustering, used in Bhattacharya et al. (2015) and further investigated by Li and Pati (2017). This approach proceeds via post-processing of the posterior samples: a posterior distribution of the number of signals is first obtained by clustering the signal and noise coefficients, which eventually leads to estimating the signals from the posterior median. While this remains an attractive method due to its generalist and tuning-free nature, it may not always be reliable for consistent selection, as shown in our case above.
Additionally, if a sparse point estimator or variable selection is the primary goal, using the posterior mode estimator with the horseshoe prior could be a more suitable strategy as it leads to exact zeros for selected variables. For joint posterior mode calculation with horseshoe-like priors, an approximate algorithm was proposed in Bhadra et al. (2021). A thorough comparison of all the variable selection strategies for a D-M model would be an interesting future study.
4.3 Scheme II: Comparison Between Shrinkage Priors
Next, we compare the performance of the three candidate shrinkage priors, viz. horseshoe, horseshoe+ and Bayesian Lasso. There is now a large and growing list of continuous shrinkage priors for inducing sparsity in parameter estimates, and we chose these priors as they offer state-of-the-art solutions without the need for tuning any hyper-parameters. We also excluded spike-and-slab priors, or their variants or mixtures, from this comparison, as our goal is producing posterior samples in reasonable time, and not just obtaining a point estimate using an EM-type algorithm.
The simulation design here mimics Scheme I, but under a weak dependence assumption between the predictors, i.e. the correlation coefficient is now fixed at \(\rho = 0.1\). We compare the estimation and classification accuracies for the three candidate priors over 50 replicated datasets, where the accuracy of an estimator is measured as \(\Vert \hat{\varvec{\beta }} - \varvec{\beta }_{0}\Vert _{F}/\Vert \varvec{\beta }_0\Vert _{F}\), with \(\Vert \textbf{x}\Vert _F\) denoting the Frobenius norm of \(\textbf{x}\). Figure 8 presents boxplots of estimation (left panel) and misclassification (right panel) errors corresponding to the three shrinkage priors, while Table 3 presents summary statistics (mean, median and standard deviation) corresponding to Fig. 8. We observe that the horseshoe+ beats both the horseshoe and the Bayesian Lasso in terms of estimation (its mean and median errors, although close to those of the horseshoe, are the lowest, and much lower than those of the Bayesian Lasso), while the horseshoe outperforms the others (having the lowest mean and median error estimates) with regard to misclassification error. This appears to be an outcome of the fact that the horseshoe induces a non-convex penalty on the sparse parameter vector, unlike the Bayesian Lasso (Bhadra et al. 2021), and this non-convexity protects against both a low signal-to-noise ratio and predictor dependence (Mazumder et al. 2012).
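The two error metrics used above can be computed as in the following sketch (with a small hypothetical estimate \(\hat{\varvec{\beta }}\) for illustration):

```python
import numpy as np

def relative_frobenius_error(beta_hat, beta_0):
    """|| beta_hat - beta_0 ||_F / || beta_0 ||_F"""
    return np.linalg.norm(beta_hat - beta_0) / np.linalg.norm(beta_0)

def misclassification_rate(selected, beta_0):
    """Fraction of coefficients whose zero/non-zero status is misjudged."""
    return np.mean(selected != (beta_0 != 0))

# Hypothetical truth and estimate for a 2 x 2 coefficient matrix
beta_0 = np.array([[1.0, 0.0], [0.0, -2.0]])
beta_hat = np.array([[0.9, 0.1], [0.0, -1.8]])
print(relative_frobenius_error(beta_hat, beta_0))     # ~ 0.110
print(misclassification_rate(beta_hat != 0, beta_0))  # 0.25 (one false positive)
```

In the actual study, `selected` would come from the 95% CrI rule applied to the posterior draws, rather than from thresholding a point estimate as in this toy example.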
5 Conclusions
In this paper, we presented a Bayesian inferential approach to the D-M compositional regression model with the horseshoe, horseshoe+ and Bayesian Lasso prior choices for efficient variable selection, with an illustration via application to the NYC-Hanes II oral microbiome data. We also performed a simulation study to compare the relative performances of the three priors in terms of recovery of the true signals, and observed that both the horseshoe and horseshoe+ priors outperform the Bayesian Lasso. While a theoretical investigation is beyond the scope of this paper, we plan to take it up in a future endeavor. Our conjecture is that the heavy tails of global–local shrinkage priors, coupled with their spike at zero, are responsible for the superior performance relative to the Bayesian Lasso.
There have been some attempts to generalize the Dirichlet distribution to yield a more flexible and rich parametric family containing the simple Dirichlet. For example, Connor and Mosimann (1969) introduced the generalized Dirichlet (GD) distribution, which yields a more flexible covariance structure while maintaining conjugacy, thereby making it more practical and useful (see Kotz et al. 2000, pp. 520–521, and Wong 1998). The GD density is as follows:
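For concreteness, the Connor–Mosimann GD density can be written in one standard notation (our reconstruction; the paper's display (5.1) may use slightly different symbols). For \((\pi_1, \ldots, \pi_K)\) on the simplex, with \(\pi_K = 1 - \sum_{l=1}^{K-1}\pi_l\) and \(B(\cdot,\cdot)\) the beta function,

```latex
f(\pi_1,\ldots,\pi_{K-1})
  \;=\; \prod_{j=1}^{K-1} \frac{1}{B(a_j,b_j)}\,
        \pi_j^{\,a_j-1}\Bigl(1-\textstyle\sum_{l=1}^{j}\pi_l\Bigr)^{c_j},
\qquad
c_j =
\begin{cases}
  b_j - a_{j+1} - b_{j+1}, & j < K-1,\\[2pt]
  b_{K-1} - 1, & j = K-1.
\end{cases}
```

One can check that under the condition \(b_{j-1} = a_j + b_j\), every \(c_j\) with \(j < K-1\) vanishes, and the density collapses to an ordinary Dirichlet.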
The Dirichlet distribution arises as a special case of the GD distribution if \(b_{j-1} = a_j + b_j\); in particular, the symmetric Dirichlet density \(\text {Dir}(\alpha /K, \ldots , \alpha /K)\) results if \(a_j = \alpha /K\) and \(b_j = \alpha (1-j/K)\). The GD distribution has a more general covariance structure than the Dirichlet, while maintaining the Dirichlet's attractive properties, such as conjugacy to the multinomial likelihood and complete neutrality (Connor and Mosimann 1969). An alternative parametric form to (5.1), proposed by Wong (1998), is as follows:
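A sketch of Wong's (1998) alternative parametrization, the form referred to as (5.2) (our notation, which may differ from the paper's display), writes the GD kernel directly in terms of \((\alpha_i, \gamma_i)\):

```latex
f(x_1,\ldots,x_k) \;\propto\;
\prod_{i=1}^{k} x_i^{\,\alpha_i-1}
\Bigl(1-\textstyle\sum_{l=1}^{i} x_l\Bigr)^{\gamma_i},
\qquad
\gamma_i = b_i - a_{i+1} - b_{i+1} \;\;(i<k),
\quad
\gamma_k = b_k - 1,
```

so the exponents \(\gamma_i\) measure the departure from the ordinary Dirichlet kernel \(\prod_i x_i^{\alpha_i - 1}\).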
From (5.2), it is evident that setting \(\gamma _i = 0\) reduces the Connor–Mosimann construction (5.1) to the Dirichlet distribution.
A natural extension of (2.3) – (2.4) is to use the GD prior to model the simplex-valued \(\varvec{\pi }_i\). Under the GD prior, our hierarchical model would become:
Then, we incorporate the covariates into the GD model using a log-linear regression approach via the log-shape parameter. Our hierarchical model is thus given as:
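A plausible sketch of this hierarchy (our notation; the paper's exact display may differ) replaces the single Dirichlet shape vector with the two GD shape vectors and links each to the covariates on the log scale:

```latex
\mathbf{y}_i \mid \boldsymbol{\pi}_i \sim \text{Multinomial}(n_i, \boldsymbol{\pi}_i),
\qquad
\boldsymbol{\pi}_i \sim \text{GD}(\mathbf{a}_i, \mathbf{b}_i),
\\[4pt]
\log a_{ij} = \mathbf{x}_i^{\top}\boldsymbol{\beta}^{(a)}_{j},
\qquad
\log b_{ij} = \mathbf{x}_i^{\top}\boldsymbol{\beta}^{(b)}_{j},
\qquad j = 1,\ldots,K-1,
```

with the chosen global–local shrinkage prior placed on both coefficient matrices \(\varvec{\beta }^{(a)}\) and \(\varvec{\beta }^{(b)}\).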
A potential issue with the GD modeling framework is over-parametrization: the GD distribution has almost twice as many parameters as a typical Dirichlet distribution. However, there exist other extensions of the GD regression framework, such as the zero-inflated GD (ZIGD; Tang and Chen 2018) regression, which has been proposed as a flexible alternative for handling excess ‘structural’ zeros among the multivariate taxa counts in compositional data regression. Under the Bayesian paradigm, exploring and comparing the efficiency of the family of continuous shrinkage priors (against the usual spike-and-slab alternatives) under the GD and ZIGD frameworks are credible extensions. Also, in microbiome studies, the relative abundance of species can vary with time, so considering the effect of time within a longitudinal compositional regression framework seems worthwhile; however, such dynamic modeling will require additional modifications to the Bayesian variable selection strategy proposed here. These will be pursued elsewhere.
References
Armagan A, Clyde M, Dunson DB (2011) Generalized beta mixtures of Gaussians. Adv Neural Inform Proc Syst 24:523–531
Armagan A, Dunson DB, Lee J (2013) Generalized double Pareto shrinkage. Stat Sin 23(1):119–143
Beghini F, Renson A, Zolnik CP, Geistlinger L, Usyk M, Moody TU, Thorpe L, Dowd JB, Burk R, Segata N et al (2019) Tobacco exposure associated with oral microbiota oxygen utilization in the new york city health and nutrition examination study. Ann Epidemiol 34:18–25
Betancourt M, Byrne S, Livingstone S, Girolami M (2017) The geometric foundations of Hamiltonian Monte Carlo. Bernoulli 23(4A):2257–2298. https://doi.org/10.3150/16-BEJ810
Bhadra A, Datta J, Polson NG, Willard B (2016) Default bayesian analysis with global-local shrinkage priors. Biometrika 103(4):955–969
Bhadra A, Datta J, Polson NG, Willard B (2017) The horseshoe+ estimator of ultra-sparse signals. Bayesian Anal 12(4):1105–1131
Bhadra A, Datta J, Li Y, Polson NG, Willard BT (2019) Prediction risk for the horseshoe regression. J Mach Learn Res 20(78):1–39
Bhadra A, Datta J, Polson NG, Willard BT (2021) The Horseshoe-like regularization for feature subset selection. Sankhya B 83(1):185–214
Bhattacharya A, Pati D, Pillai NS, Dunson DB (2015) Dirichlet-Laplace priors for optimal shrinkage. J Am Statist Assoc 110:1479–1490
Bhattacharya A, Chakraborty A, Mallick BK (2016) Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika 103(4):985–991
Carlin BP, Polson NG (1991) Inference for nonconjugate bayesian models using the gibbs sampler. Canad J Stat 19(4):399–405
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker MA, Guo J, Li P, Riddell A (2017) Stan: a probabilistic programming language. J Statist Softw 76:17
Carvalho CM, Polson NG, Scott JG (2010) The horseshoe estimator for sparse signals. Biometrika 97:465–480
Castillo I, Schmidt-Hieber J, van der Vaart A (2015) Bayesian linear regression with sparse priors. Ann Statist 43(5):1986–2018
Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, Sinha R, Hwang J, Bushman FD, Collman RG (2010) Disordered microbial communities in the upper respiratory tract of cigarette smokers. PloS one 5(12):15216
Chen J, Li H (2013) Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. Ann Appl Stat 7(1):418–442
Connor RJ, Mosimann JE (1969) Concepts of independence for proportions with a generalization of the dirichlet distribution. J Am Stat Assoc 64(325):194–206
Crespi CM, Boscardin WJ (2009) Bayesian model checking for multivariate outcome data. Computat Statist Data Anal 53(11):3765–3772
Datta J, Ghosh JK (2013) Asymptotic properties of Bayes risk for the horseshoe prior. Bayes Anal 8(1):111–132
De Luca F, Shoenfeld Y (2019) The microbiome in autoimmune diseases. Clin Experim Immunol 195(1):74–85
Di Stefano M, Polizzi A, Santonocito S, Romano A, Lombardi T, Isola G (2022) Impact of oral microbiome in periodontal health and periodontitis: a critical review on prevention and treatment. Int J Mol Sci 23(9):5142
Gelman A, Meng XL, Stern H (1996) Posterior predictive assessment of model fitness via realized discrepancies. Stat Sin 1:733–760
Gelman A, Hwang J, Vehtari A (2014) Understanding predictive information criteria for bayesian models. Statist Comput 24(6):997–1016
Ghosh P, Tang X, Ghosh M, Chakrabarti A (2016) Asymptotic properties of Bayes risk of a general class of shrinkage priors in multiple hypothesis testing under sparsity. Bayes Anal 11(3):753–796
Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ (2017) Microbiome datasets are compositional: and this is not optional. Front Microbiol 8:2224
Griffen AL, Beall CJ, Campbell JH, Firestone ND, Kumar PS, Yang ZK, Podar M, Leys EJ (2012) Distinct and complex bacterial profiles in human periodontitis and health revealed by 16s pyrosequencing. ISME J 6(6):1176–1185
Griffin JE, Brown PJ (2010) Inference with normal-gamma prior distributions in regression problems. Bayes Anal 5(1):171–188
Hahn PR, Lopes H (2014) Shrinkage priors for linear instrumental variable models with many instruments. arXiv preprint arXiv:1408.0462
Hans C (2009) Bayesian lasso regression. Biometrika 96(4):835–845
Johndrow J, Orenstein P, Bhattacharya A (2020) Scalable approximate mcmc algorithms for the horseshoe prior. J Mach Learn Res 21(73):1–61
Kandalai S, Li H, Zhang N, Peng H, Zheng Q (2023) The human microbiome and cancer: a diagnostic and therapeutic perspective. Cancer Biol Therapy 24(1):2240084
Kotz S, Balakrishnan N, Johnson NL (2000) Continuous Multivariate Distributions. Vol. 1, 2nd edn. Wiley Series in Probability and Statistics: Applied Probability and Statistics, p. 722. Wiley-Interscience, New York, NY. https://doi.org/10.1002/0471722065 . Models and applications
Kuczynski J, Lauber CL, Walters WA, Parfrey LW, Clemente JC, Gevers D, Knight R (2012) Experimental and analytical tools for studying the human microbiome. Nature Rev Genet 13(1):47–58
Li H, Pati D (2017) Variable selection using shrinkage priors. Computat Stat Data Anal 107:107–119
Lin W, Shi P, Feng R, Li H (2014) Variable selection in regression with compositional covariates. Biometrika 101(4):785–797
Love M, Anders S, Huber W (2014) Differential analysis of count data-the deseq2 package. Genome Biol 15(550):1–54
Mazumder R, Friedman JH, Hastie T (2012) SparseNet: coordinate descent with nonconvex penalties. J Am Statist Assoc 106:1125–1138
Meng X-L et al (1994) Posterior predictive \(p\)-values. Ann Stat 22(3):1142–1160
Morris C, Tang R et al (2011) Estimating random effects via adjustment for density maximization. Statist Sci 26(2):271–287
Neal RM et al (2011) Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2(11):2
NHANES - about the national health and nutrition examination survey (2017) Accessed: July 23, 2018. https://www.cdc.gov/nchs/nhanes/about_nhanes.htm
Papaspiliopoulos O, Roberts GO, Sköld M (2007) A general framework for the parametrization of hierarchical models. Statist Sci 1:59–73
Park T, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103(482):681–686
Polson NG, Scott JG (2010) Large-scale simultaneous testing with hypergeometric inverted-beta priors. arXiv preprint arXiv:1010.5223
Polson NG, Scott JG (2011) Shrink globally, act locally: sparse bayesian regularization and prediction. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian statistics 9. Oxford University Press, Oxford, UK, pp 501–538
Polson NG, Scott JG (2012) Local shrinkage rules, lévy processes and regularized regression. J Royal Statist Soc Ser B (Stat Methodol) 74(2):287–311
Renson A, Jones HE, Beghini F, Segata N, Zolnik CP, Usyk M, Moody TU, Thorpe L, Burk R, Waldron L et al (2019) Sociodemographic variation in the oral microbiome. Ann Epidemiol 35:73–80
Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140
Ročková V, George EI (2016) The spike-and-slab lasso. J Am Stat Assoc (just-accepted)
Tadesse MG, Vannucci M (2019) Handbook of Bayesian variable selection. Chapman and Hall/CRC, Boca Raton, FL
Tang Z-Z, Chen G (2018) Zero-inflated generalized dirichlet multinomial regression model for microbiome compositional data analysis. Biostatistics
Thorpe LE, Greene C, Freeman A, Snell E, Rodriguez-Lopez JS, Frankel M, Punsalang A Jr, Chernov C, Lurie E, Friedman M et al (2015) Rationale, design and respondent characteristics of the 2013–2014 new york city health and nutrition examination survey (nyc hanes 2013–2014). Prevent Med Rep 2:580–585
van der Pas S, Szabó B, van der Vaart A (2016) How many needles in the haystack? Adaptive inference and uncertainty quantification for the horseshoe. arXiv:1607.01892
van der Pas S, Szabó B, van der Vaart A (2017) Adaptive posterior contraction rates for the horseshoe. arXiv:1702.03698
Wadsworth WD, Argiento R, Guindani M, Galloway-Pena J, Shelburne SA, Vannucci M (2017) An integrative bayesian dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC Bioinform 18(1):1–12
Waldron ARFBL (2023) Nychanesmicrobiome: analysis of the NYC-HANES Microbiome Specimens. R package version 0.1.2. http://waldronlab.io/nychanesmicrobiome/
Wei R (2017) Bayesian variable selection using continuous shrinkage priors for nonparametric models and non-gaussian data. PhD thesis, North Carolina State University
Wong T-T (1998) Generalized Dirichlet distribution in Bayesian analysis. Appl Math Comput 97(2–3):165–181
Wu J, Peters BA, Dominianni C, Zhang Y, Pei Z, Yang L, Ma Y, Purdue MP, Jacobs EJ, Gapstur SM et al (2016) Cigarette smoking and the oral microbiome in a large study of American adults. ISME J 10(10):2435–2446
Zhang Y, Reich BJ, Bondell HD (2016) High Dimensional Linear Regression via the R2-D2 Shrinkage Prior. arXiv preprint arXiv:1609.00046
Acknowledgements
Bandyopadhyay acknowledges partial research support from Grants R21DE031879 and R01DE031134 awarded by the United States National Institutes of Health.
Funding
Foundation for the National Institutes of Health (R21DE031879; R01DE031134).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
NYC-Hanes Data Application: Additional Results
Here, we present the selected clinically important associations for two other important predictors, viz. Body-Mass Index (BMI) and gender. As Table 4 shows, Prevotella and Streptococcus were among the genera selected for association with BMI under all three candidate priors. On the other hand, for ‘gender = female’, Porphyromonas was the only genus selected across all priors, while Rothia was selected under both the horseshoe and horseshoe+. Streptococcus and Prevotella were also among the most abundant genera, as reported by Renson et al. (2019).
RStan Implementation
stan (Carpenter et al. 2017) is an efficient probabilistic programming language for specifying and fitting complex statistical models under a Bayesian paradigm. One of stan's key strengths lies in its ability to handle both small and large datasets efficiently, enabling users to specify and fit a wide range of statistical models. stan uses the No-U-Turn Sampler (NUTS) for HMC sampling, a state-of-the-art algorithm that often outperforms traditional methods, such as the Gibbs sampler or Metropolis–Hastings, for complex posteriors (Neal 2011) in terms of speed and accuracy. stan also supports a variety of other computational tools, such as variational Bayes and expectation propagation, by providing easy access to log-densities and their gradients, Hessians and related quantities. With a growing community of users, extensive documentation, and interfaces to most programming languages (such as rstan and pystan), stan has become a popular choice for researchers, data scientists, and statisticians who seek a powerful and user-friendly tool for tackling challenging statistical problems using probabilistic programming. Here, we provide an outline for implementing the D-M model in stan in the spirit of open-source programming. The r and stan implementations for the other candidate shrinkage priors are available at the GitHub link: https://github.com/dattahub/compshrink.
Horseshoe prior: We first show the stan implementation for the Bayesian hierarchical integrated D-M model, as specified in equations (2.3) – (2.4), with the horseshoe prior for variable selection. The stan program below is divided into four main blocks: data, parameters, transformed parameters, and model. An optional first block, functions, defines user-defined functions; here, we define the marginalized D-M distribution, obtained by integrating out \(\varvec{\pi }\).
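A minimal, self-contained sketch of such a program is given below. This is our illustration of the structure just described, not the authors' exact listing: the function name `dirichlet_multinomial_lpmf`, the weakly informative intercept prior, and the log-link for the shape parameters are our choices.

```stan
functions {
  // Marginalized D-M log-pmf: the multinomial probabilities pi are
  // integrated out against a Dirichlet(alpha) prior.
  real dirichlet_multinomial_lpmf(array[] int y, vector alpha) {
    real alpha_plus = sum(alpha);
    return lgamma(alpha_plus) - lgamma(alpha_plus + sum(y))
           + sum(lgamma(alpha + to_vector(y))) - sum(lgamma(alpha));
  }
}
data {
  int<lower=1> N;              // number of samples
  int<lower=2> K;              // number of taxa
  int<lower=1> P;              // number of covariates
  array[N, K] int<lower=0> Y;  // taxa counts
  matrix[N, P] X;              // covariate matrix
}
parameters {
  vector[K] beta0;                  // taxon-specific intercepts
  matrix[P, K] beta;                // regression coefficients
  matrix<lower=0>[P, K] lambda;     // local shrinkage scales
  real<lower=0> tau;                // global shrinkage scale
}
model {
  tau ~ cauchy(0, 1);                   // half-Cauchy via the <lower=0> constraint
  to_vector(lambda) ~ cauchy(0, 1);     // half-Cauchy local scales
  beta0 ~ normal(0, 10);
  for (p in 1:P)
    beta[p] ~ normal(0, tau * lambda[p]);   // horseshoe: beta | lambda, tau
  for (i in 1:N)
    Y[i] ~ dirichlet_multinomial(exp(beta0 + beta' * X[i]'));
}
```

The log-link \(\alpha _{ik} = \exp (\beta _{0k} + \textbf{x}_i^\top \varvec{\beta }_k)\) keeps the D-M shape parameters strictly positive.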
Horseshoe+ prior: The horseshoe+ prior implementation in stan is very similar to the horseshoe implementation above, and we have omitted the data chunk as it is identical to the one above. We use a non-centered parametrization (Papaspiliopoulos et al. 2007) via the optional transformed parameters block to reduce correlation between lower-level parameters and to increase efficiency when the sample size is relatively small compared to the model dimension. To do this, we define \(\tilde{\lambda }\) (lambda_tilde) as the parameter with a \(\mathcal {C}a^+(0,1)\) prior and define \(\lambda\) (lambda) as \(\lambda = \tilde{\lambda } \eta\).
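In code, this amounts to the following fragments (our sketch, not the authors' exact listing; the functions and data blocks are identical to those of the horseshoe program, and names other than `lambda_tilde` follow our own conventions):

```stan
parameters {
  vector[K] beta0;
  matrix[P, K] z;                       // standardized coefficients
  matrix<lower=0>[P, K] lambda_tilde;   // C+(0,1) local scale
  matrix<lower=0>[P, K] eta;            // extra C+(0,1) layer (the "+")
  real<lower=0> tau;                    // global scale
}
transformed parameters {
  // Non-centered parametrization:
  // lambda = lambda_tilde * eta (elementwise), beta = tau * lambda .* z
  matrix<lower=0>[P, K] lambda = lambda_tilde .* eta;
  matrix[P, K] beta = tau * (lambda .* z);
}
model {
  to_vector(z) ~ std_normal();
  to_vector(lambda_tilde) ~ cauchy(0, 1);
  to_vector(eta) ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);
  beta0 ~ normal(0, 10);
  for (i in 1:N)
    Y[i] ~ dirichlet_multinomial(exp(beta0 + beta' * X[i]'));
}
```

Because `z` carries a standard normal prior, the funnel-shaped dependence between coefficients and their scales is moved out of the sampled parameters, which typically improves NUTS mixing.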
Bayesian Lasso or Laplace prior: Finally, we present the stan code for the Bayesian Lasso (Park and Casella 2008; Hans 2009), without the data chunk. Similar to the other shrinkage priors used here, we use a non-centered parametrization.
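A fragment consistent with this description is sketched below (our illustration, not the authors' exact listing): the Laplace prior is represented as a scale mixture of normals with exponential mixing, as in Park and Casella (2008), and the variable names (e.g. `xi` for the penalty parameter) are ours.

```stan
parameters {
  vector[K] beta0;
  matrix[P, K] z;               // standardized coefficients
  matrix<lower=0>[P, K] tau2;   // local variances (exponential mixing)
  real<lower=0> xi;             // lasso penalty parameter
}
transformed parameters {
  // Non-centered: beta | tau2 ~ N(0, tau2); marginally Laplace with rate xi
  matrix[P, K] beta = z .* sqrt(tau2);
}
model {
  to_vector(z) ~ std_normal();
  to_vector(tau2) ~ exponential(square(xi) / 2);
  xi ~ cauchy(0, 1);
  beta0 ~ normal(0, 10);
  for (i in 1:N)
    Y[i] ~ dirichlet_multinomial(exp(beta0 + beta' * X[i]'));
}
```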
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Datta, J., Bandyopadhyay, D. Bayesian Variable Shrinkage and Selection in Compositional Data Regression: Application to Oral Microbiome. J Indian Soc Probab Stat (2024). https://doi.org/10.1007/s41096-024-00194-9