Abstract
In Bayesian inference, we are usually interested in the numerical approximation of integrals that are posterior expectations or marginal likelihoods (a.k.a., Bayesian evidence). In this paper, we focus on the computation of the posterior expectation of a function \(f(\textbf{x})\). We consider a target-aware scenario where \(f(\textbf{x})\) is known in advance and can be exploited in order to improve the estimation of the posterior expectation. In this scenario, the task can be reduced to performing several independent marginal likelihood estimation tasks. The idea of using a path of tempered posterior distributions has been widely applied in the literature for the computation of marginal likelihoods. Thermodynamic integration, path sampling and annealing importance sampling are well-known examples of algorithms belonging to this family of methods. In this work, we introduce a generalized thermodynamic integration (GTI) scheme which is able to perform target-aware Bayesian inference, i.e., GTI can approximate the posterior expectation of a given function. Several scenarios of application of GTI are discussed and different numerical simulations are provided.
1 Introduction
Bayesian methods have become very popular in many domains of science and engineering in recent years, as they allow for obtaining estimates of parameters of interest as well as comparing competing models in a principled way (Robert and Casella 2004; Luengo et al. 2020). The Bayesian quantities can generally be expressed as integrals involving the posterior density. They can be divided into two main categories: posterior expectations and marginal likelihoods (useful for model selection purposes).
Generally, computational methods are required for the approximation of these integrals, e.g., Monte Carlo algorithms such as Markov chain Monte Carlo (MCMC) and importance sampling (IS) (Robert and Casella 2004; Luengo et al. 2020; Rainforth et al. 2020). Typically, practitioners apply an MCMC or IS algorithm to approximate the posterior density \(\bar{\pi }(\textbf{x})\) by a set of samples, which is used in turn to estimate posterior expectations \(\mathbb {E}_{\bar{\pi }}[f(\textbf{x})]\) of some function \(f(\textbf{x})\). Although this is a sensible strategy when \(f(\textbf{x})\) is not known in advance and/or we are interested in computing several posterior expectations with respect to different functions, this strategy is suboptimal when the target function \(f(\textbf{x})\) is known in advance, since it is completely agnostic to \(f(\textbf{x})\). Incorporating knowledge of \(f(\textbf{x})\) for the estimation of \(\mathbb {E}_{\bar{\pi }}[f(\textbf{x})]\) is known as target-aware Bayesian inference or TABI (Rainforth et al. 2020). TABI proposes to break down the estimation of the posterior expectation into several independent estimation tasks. Specifically, TABI requires estimating three marginal likelihoods (or normalizing constants) independently, and then recombining the estimates in order to form the approximation of the posterior expectation. The target function \(f(\textbf{x})\) features in two out of the three marginal likelihoods that have to be estimated. Hence, the TABI framework provides a means of improving the estimation of a posterior expectation by explicitly making use of \(f(\textbf{x})\) and leveraging the use of algorithms for marginal likelihood computation.
The computation of the marginal likelihoods is particularly complicated, especially with MCMC outputs (Newton and Raftery 1994; Llorente et al. 2020a, 2021a). IS techniques are the most popular for this task. The basic IS algorithm provides a straightforward estimation of the marginal likelihood. However, designing a good proposal pdf that approximates the target density is not easy (Llorente et al. 2020a). For this reason, sophisticated and powerful schemes have been specifically designed (Llorente et al. 2020a; Friel and Wyse 2012). The most powerful techniques involve the idea of the so-called tempering of the posterior (Neal 2001; Lartillot and Philippe 2006; Friel and Pettitt 2008). The tempering effect is commonly employed in order to foster the exploration and improve the efficiency of MCMC chains (Neal 1996; Martino et al. 2021). State-of-the-art methods for computing marginal likelihoods consider tempered transitions (i.e., sequences of tempered distributions), such as annealed IS (An-IS) (Neal 2001), sequential Monte Carlo (SMC) (Moral et al. 2006), thermodynamic integration (TI), a.k.a., path sampling (PS) or “power posteriors” (PP) in the statistics literature (Lartillot and Philippe 2006; Friel and Pettitt 2008; Gelman and Meng 1998), and stepping stones (SS) sampling (Xie et al. 2010). An-IS is a special case of the SMC framework, PP is a special case of TI/PS, and SS sampling presents similar features to An-IS and PP. For more details, see (Llorente et al. 2020a). It is worth mentioning that TI was introduced in the physics literature for computing free-energy differences (Frenkel 1986; Gelman and Meng 1998).
In this work, we extend the TI method, introducing the generalized thermodynamic integration (GTI) technique, for computing posterior expectations of a function \(f(\textbf{x})\). In this sense, GTI is a target-aware algorithm that incorporates information about \(f(\textbf{x})\) within the marginal likelihood estimation technique TI. The extension of TI for the computation of \(\mathbb {E}_{\bar{\pi }}\left[ f(\textbf{x})\right]\) is not straightforward, since it requires building a continuous path between densities with possibly different supports. In the case of a geometric path (which is the default choice in practice Friel and Pettitt 2008; Lartillot and Philippe 2006), the generalization of TI needs a careful look at the supports of the negative and positive parts of \(f(\textbf{x})\). We discuss the application of GTI for the computation of posterior expectations of a generic real-valued function \(f(\textbf{x})\), and also describe the case of a vector-valued function \(\textbf{f}(\textbf{x})\). The benefits of GTI are clearly shown by illustrative numerical simulations.
The structure of the paper is the following. In Sect. 2, we introduce the Bayesian inference setting and describe the thermodynamic method for the computation of the marginal likelihood. In Sect. 3, we introduce the GTI procedure. More specifically, we discuss first the case when \(f(\textbf{x})\) is strictly positive or negative in Sect. 3.2, and then consider the general case of a real-valued \(f(\textbf{x})\) in Sect. 3.3. In Sect. 4, we discuss some computational details of the approach, and the application of GTI for vector-valued functions \(\textbf{f}(\textbf{x})\). We show the benefits of GTI in two numerical experiments in Sect. 5. Finally, Sect. 6 contains the conclusions.
2 Background
2.1 Bayesian inference
In many real-world applications, the goal is to infer a parameter of interest given a set of data (Robert and Casella 2004). Let us denote the parameter of interest by \(\textbf{x}\in \mathcal {X}\subseteq \mathbb {R}^{D}\), and let \(\textbf{y}\in \mathbb {R}^{d_y}\) be the observed data. In a Bayesian analysis, all the statistical information is contained in the posterior distribution, which is given by
where \(\ell (\textbf{y}|\textbf{x})\) is the likelihood function, \(g(\textbf{x})\) is the prior pdf, and \(Z(\textbf{y})\) is the Bayesian model evidence (a.k.a. marginal likelihood). Generally, \(Z(\textbf{y})\) is unknown, so we are only able to evaluate the unnormalized target function, \(\pi (\textbf{x})=\ell (\textbf{y}|\textbf{x}) g(\textbf{x})\). The analytical computation of the posterior density \(\bar{\pi }(\textbf{x}) \propto \pi (\textbf{x})\) is often unfeasible, hence numerical approximations are needed. The interest lies in the approximation of integrals of the form
where \(f(\textbf{x})\) is some integrable function, and
The quantity Z is called marginal likelihood (a.k.a., Bayesian evidence) and is useful for model selection purposes (Llorente et al. 2020a). Generally, I and Z are analytically intractable and we need to resort to numerical algorithms such as Markov chain Monte Carlo (MCMC) and importance sampling (IS) algorithms. In this work, we consider that \(f(\textbf{x})\) is known in advance, and we aim at exploiting it in order to apply thermodynamic integration for computing the posterior expectation I, namely, to perform target-aware Bayesian inference (TABI).
2.2 Computation of marginal likelihoods for parameter estimation: The TABI framework
The focus of this work is on parameter estimation, namely, we are interested in the computation of the posterior expectation in Eq. (2) of some function \(f(\textbf{x})\). Recently, the authors in Rainforth et al. (2020) proposed a framework called target-aware Bayesian inference (TABI) that aims at improving the Monte Carlo estimation of I when the target \(f(\textbf{x})\) is known in advance. The TABI framework is based on decomposing I into several terms and estimating them separately, leveraging the information in \(f(\textbf{x})\). Let \(f_+(\textbf{x}) = \max \{0,f(\textbf{x})\}\) and \(f_-(\textbf{x}) = \max \{0,-f(\textbf{x})\}\), so \(f(\textbf{x}) = f_+(\textbf{x}) - f_-(\textbf{x})\). Hence, TABI rewrites the posterior expectation I as
where \(c_+ = \int f_+(\textbf{x})\pi (\textbf{x})d\textbf{x}\) and \(c_- = \int f_-(\textbf{x})\pi (\textbf{x})d\textbf{x}\). Note that \(c_+\), \(c_-\) and Z are integrals of non-negative functions, namely, they are marginal likelihoods (or normalizing constants). The three unnormalized densities of interest hence are \(\pi (\textbf{x})\), \(f_+(\textbf{x})\pi (\textbf{x})\) and \(f_-(\textbf{x})\pi (\textbf{x})\). Note that two out of the three (unnormalized) densities incorporate information about \(f(\textbf{x})\). The general TABI estimator is then
where \(\widehat{c}_+\), \(\widehat{c}_-\) and \(\widehat{Z}\) are estimates obtained independently. These estimates can be obtained by any marginal likelihood estimation method. The original TABI framework is motivated in the IS context. This is due to the fact that marginal likelihoods (i.e., integrals of non-negative functions) can be estimated arbitrarily well with IS (Llorente et al. 2020a, 2021a; Rainforth et al. 2020). Namely, using the optimal proposals the estimates \(\widehat{c}_+\), \(\widehat{c}_-\) and \(\widehat{Z}\) coincide with the exact values regardless of the sample size. Note that the direct estimation of I via MCMC or IS cannot produce zero-variance estimators for a finite sample size (Robert and Casella 2004).
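To fix ideas, the recombination in Eq. (5) can be sketched in a few lines of Python on a toy one-dimensional problem (all choices here are illustrative, not part of the original TABI setup): the unnormalized posterior is \(\pi(x)=e^{-x^2/2}\), the target function is \(f(x)=x\), and each marginal likelihood is estimated independently by simple IS with a Gaussian proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setting: unnormalized posterior pi(x) = exp(-x^2/2) (so Z = sqrt(2*pi))
# and target function f(x) = x, hence the true posterior expectation is I = 0.
log_pi = lambda x: -0.5 * x**2
f = lambda x: x
f_plus = lambda x: np.maximum(0.0, f(x))
f_minus = lambda x: np.maximum(0.0, -f(x))

def is_estimate(integrand, n=50_000, scale=2.0):
    """Simple IS estimate of int integrand(x) dx with a N(0, scale^2) proposal."""
    x = rng.normal(0.0, scale, size=n)
    log_q = -0.5 * (x / scale)**2 - np.log(scale * np.sqrt(2 * np.pi))
    return np.mean(integrand(x) * np.exp(-log_q))

# Three *independent* marginal likelihood estimates, as prescribed by TABI.
Z_hat = is_estimate(lambda x: np.exp(log_pi(x)))
c_plus_hat = is_estimate(lambda x: f_plus(x) * np.exp(log_pi(x)))
c_minus_hat = is_estimate(lambda x: f_minus(x) * np.exp(log_pi(x)))

# TABI recombination of the three estimates, as in Eq. (5).
I_hat = (c_plus_hat - c_minus_hat) / Z_hat
```

In this toy case \(c_+=c_-=1\) and \(Z=\sqrt{2\pi}\), so the recombined estimate should be close to the true value \(I=0\).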
The TABI framework improves the estimation of I by converting the initial task into that of computing three marginal likelihoods, \(c_+\), \(c_-\) and Z. In Rainforth et al. (2020), the authors test the application of two popular marginal likelihood estimators within TABI, namely, annealed IS (AnIS) Neal (1996) and nested sampling (NS) Skilling (2006), resulting in the target-aware algorithms called target-aware AnIS (TAAnIS) and target-aware NS (TANS). The use of AnIS for independently computing \(c_+\), \(c_-\) and Z represents an improvement over IS. Although the IS estimation of \(c_+\), \(c_-\) and Z can have virtually zero variance, this is only true when we employ the optimal proposals. In general, the performance of IS depends on how ‘close’ the proposal pdf is to the target density whose normalizing constant we aim to estimate. It can be shown that the variance of IS scales with the Pearson divergence between target and proposal (Llorente et al. 2020a). When this distance is large, it is more efficient to sample from another proposal that is ‘in between’, i.e., an ‘intermediate’ density. This is the motivation behind many state-of-the-art marginal likelihood estimation methods that employ a sequence of densities bridging an easy-to-work-with proposal and the target density (Llorente et al. 2020a). In this work, we introduce thermodynamic integration (TI) for performing target-aware inference, hence enabling the computation of the posterior expectation of a function \(f(\textbf{x})\). TI is a powerful marginal likelihood estimation technique that also leverages the use of a sequence of distributions, but has several advantages over other methods based on tempered transitions, such as improved stability thanks to working in logarithmic scale and applying deterministic quadrature (Friel and Pettitt 2008; Friel and Wyse 2012). TI for computing marginal likelihoods is reviewed in the next section. Then, in Sect. 3, we introduce generalized TI (GTI) for the computation of posterior expectations, which is based on rewriting I as the difference of two ratios of normalizing constants.
2.3 Thermodynamic integration for estimating Z
Thermodynamic integration (TI) is a powerful technique that has been proposed in the literature for computing ratios of constants (Frenkel 1986; Gelman and Meng 1998; Lartillot and Philippe 2006). Here, for simplicity, we focus on the approximation of just one constant, the marginal likelihood Z. More precisely, TI produces an estimation of \(\log Z\). Let us consider a family of (generally unnormalized) densities
such that \(\pi (\textbf{x}|0)=g(\textbf{x})\) is the prior and \(\pi (\textbf{x}|1)=\pi (\textbf{x})\) is the unnormalized posterior distribution. An example is the so-called geometric path \(\pi (\textbf{x}|\beta ) = g(\textbf{x})^{1-\beta }\pi (\textbf{x})^\beta\), with \(\beta \in [0,1]\) (Neal 1993). The corresponding normalized densities in the family are denoted as
Then, the main TI identity is (Llorente et al. 2020a)
where the expectation is with respect to (w.r.t.) \(\bar{\pi }(\textbf{x}|\beta ) = \frac{\pi (\textbf{x}|\beta )}{c(\beta )}\).
TI estimator Using an ordered sequence of discrete values \(\{\beta _i\}_{i=1}^N\) (e.g. \(\beta _i\)’s uniformly in [0, 1]), one can approximate the integral in Eq. (8) via quadrature w.r.t. \(\beta\), and then approximate the inner expectation with a Monte Carlo estimator using samples from \(\bar{\pi }(\textbf{x}|\beta _i)\) for \(i=1,\dots ,N\). Namely, defining \(U(\textbf{x}) = \frac{\partial \log \pi (\textbf{x}|\beta )}{\partial \beta }\) and \(E(\beta ) =\mathbb {E}_{\bar{\pi }(\textbf{x}|\beta )}\left[ U(\textbf{x})\right]\), the resulting estimator of Eq. (8) is given by
where
Note that we used the simplest quadrature rule in Eq. (9), but others can be used, such as the trapezoidal or Simpson’s rules (Friel and Pettitt 2008; Lartillot and Philippe 2006).
The power posteriors (PP) method Let us consider the specific case of a geometric path between prior \(g(\textbf{x})\) and unnormalized posterior \(\pi (\textbf{x})\),
where we have used \(\pi (\textbf{x})=\ell (\textbf{y}|\textbf{x}) g(\textbf{x})\). Note that, in this scenario,
Hence, the identity in Eq. (8) can be also written as
The power posteriors (PP) method is a special case of TI which considers (a) the geometric path and (b) the trapezoidal quadrature rule for integrating w.r.t. the variable \(\beta\) (Friel and Pettitt 2008). Namely, letting \(\beta _1=0< \cdots < \beta _N = 1\) denote a fixed temperature schedule, an approximation of Eq. (14) can be obtained via the trapezoidal rule
where the expectations are generally substituted with MCMC estimates as in Eq. (10). TI and PP are popular methods for computing marginal likelihoods (even in high-dimensional spaces) due to their reliability. Theoretical properties are studied in Gelman and Meng (1998), Calderhead and Girolami (2009), and empirical validation is provided in several works, e.g., (Friel and Pettitt 2008; Lartillot and Philippe 2006). Different extensions and improvements on the method have also been proposed (Oates et al. 2016; Friel et al. 2014; Calderhead and Girolami 2009).
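As a concrete illustration of the PP estimator, the following Python sketch computes \(\log Z\) for a conjugate one-dimensional Gaussian model (an assumed toy setting, chosen so that each power posterior is Gaussian and can be sampled exactly instead of via MCMC):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: prior g(x) = N(0,1), likelihood l(y|x) = N(y|x,1),
# one observation y = 1.  The power posterior pi(x|beta) ∝ g(x) l(y|x)^beta
# is Gaussian with precision 1 + beta and mean beta*y/(1 + beta), so we can
# sample it directly here; in general, MCMC would be used at each beta.
y = 1.0
log_lik = lambda x: -0.5 * (y - x)**2 - 0.5 * np.log(2 * np.pi)

N, M = 50, 20_000                  # number of temperatures, samples per temperature
betas = np.linspace(0.0, 1.0, N)
E = np.empty(N)
for i, b in enumerate(betas):
    prec, mean = 1.0 + b, b * y / (1.0 + b)
    x = rng.normal(mean, 1.0 / np.sqrt(prec), size=M)
    E[i] = np.mean(log_lik(x))     # Monte Carlo estimate of E(beta)

# Trapezoidal rule over the temperature grid, as in Eq. (15).
log_Z_hat = np.sum((E[1:] + E[:-1]) / 2 * np.diff(betas))

# For this conjugate model, log Z = log N(y|0,2) is available in closed form.
log_Z_true = -0.5 * np.log(2 * np.pi * 2.0) - y**2 / 4.0
```

Since \(\log Z\) is known here, the combined quadrature and Monte Carlo error can be checked directly.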
Remark 1
Note that, in order to ensure that the integrand in Eq. (14) is finite, so that the estimator in Eq. (15) can be applied, we need that (a) \(\ell (\textbf{y}|\textbf{x})\) is strictly positive everywhere, or (b) \(\ell (\textbf{y}|\textbf{x})=0\) only whenever \(g(\textbf{x})=0\) (i.e., they have the same support).
Goal We have seen that the TI method has been proposed for computing \(\log Z\) (or log-ratios of constants). Our goal is to extend the TI scheme in order to perform target-aware Bayesian inference. Namely, we generalize the idea of these methods (thermodynamic integration, power posteriors, etc.) to the computation of posterior expectations for a given \(f(\textbf{x})\).
3 Generalized TI (GTI) for Bayesian inference
In this section, we extend the TI method for computing the posterior expectation of a given \(f(\textbf{x})\). As in TABI, the basic idea, as we show below, is the formulation of I in terms of ratios of normalizing constants. First, we consider the case \(f(\textbf{x})> 0\) for all \(\textbf{x}\), and then the case of a generic real-valued \(f(\textbf{x})\).
3.1 General approach
In order to apply TI, we need to formulate the posterior expectation I as a ratio of two constants. Since \(f(\textbf{x})\) can be positive or negative, let us consider the positive and negative parts, \(f_+(\textbf{x}) = \max (0,f(\textbf{x}))\) and \(f_-(\textbf{x}) = \max (0,-f(\textbf{x}))\), such that \(f(\textbf{x}) = f_+(\textbf{x}) - f_-(\textbf{x})\), where \(f_+(\textbf{x})\) and \(f_-(\textbf{x})\) are non-negative functions. Similarly to Eq. (4), we rewrite the integral I in terms of ratios of constants,
where \(c_+= \int _{\mathcal {X}} \varphi _+(\textbf{x})d\textbf{x}\) and \(c_-= \int _{\mathcal {X}} \varphi _-(\textbf{x})d\textbf{x}\) are, respectively, the normalizing constants of \(\varphi _+(\textbf{x}) = f_+(\textbf{x})\pi (\textbf{x})\) and \(\varphi _-(\textbf{x}) = f_-(\textbf{x})\pi (\textbf{x})\).
Proposed scheme In the case of a generic \(f(\textbf{x})\), denoting \(\eta _+ = \log \frac{c_+}{Z}\) and \(\eta _- = \log \frac{c_-}{Z}\), we propose to obtain estimates of these quantities using thermodynamic integration. Then, we can obtain the final estimator as
In the next section, we give details on how to compute \(\widehat{\eta }_+\), \(\widehat{\eta }_-\) by using a generalized TI method.
Remark 2
Note that in Eq. (16) we express I as the difference of two ratios, and we propose GTI to estimate them directly as per Eq. (17). Hence, differently from Eq. (5), we do not aim at estimating each constant separately. This amounts to bridging the posterior with the function-scaled posterior, as we show below.
3.2 GTI for strictly positive or strictly negative \(f(\textbf{x})\)
Let us consider the scenario where \(f(\textbf{x})>0\) for all \(\textbf{x}\in \mathcal {X}\). In this scenario, we can set
Note that, with respect to Eq. (17), we only consider the first term. We link the unnormalized pdfs \(\pi (\textbf{x})\) and \(\varphi _+(\textbf{x})=f_+(\textbf{x})\pi (\textbf{x})\) with a geometric path, by defining
Hence, we have \(\bar{\varphi }_+(\textbf{x}|0)=\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }_+(\textbf{x}|1) = \frac{1}{c_+}f_+(\textbf{x})\pi (\textbf{x})\) where \(c_+=\int _{\mathcal {X}} f_+(\textbf{x})\pi (\textbf{x}) d\textbf{x}\). Eq. (8) thus becomes
Letting \(\beta _1=0< \cdots < \beta _N = 1\) denote a fixed temperature schedule, the estimator (using the Trapezoidal rule) is thus
where we use MCMC estimates for the terms
for \(i=1,\dots ,N\). The case of a strictly negative \(f(\textbf{x})\), i.e., \(f_-(\textbf{x})=-f(\textbf{x})\), is equivalent.
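The steps above can be sketched in Python for an assumed toy case where the tempered densities are available in closed form: \(\pi(x)=e^{-x^2/2}\) and \(f(x)=e^{x}\), so that \(\varphi_+(x|\beta)\propto f(x)^\beta \pi(x)\) is a \(\mathcal{N}(\beta,1)\) density and the true value is \(\eta_+=\log \mathbb{E}_{\bar\pi}[e^x]=1/2\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy case with strictly positive f: pi(x) = exp(-x^2/2) (unnormalized
# N(0,1)) and f(x) = exp(x).  The geometric path phi_+(x|beta) ∝ f(x)^beta pi(x)
# is N(beta, 1), so we sample it exactly here; MCMC would be used in general.
log_f = lambda x: x                  # log f(x) for this toy choice

N, M = 50, 20_000
betas = np.linspace(0.0, 1.0, N)
E = np.empty(N)
for i, b in enumerate(betas):
    x = rng.normal(b, 1.0, size=M)   # samples from phi_+(x|beta_i)
    E[i] = np.mean(log_f(x))         # estimate of E(beta) = E_{phi_+}[log f]

# Trapezoidal quadrature over beta, as in the GTI estimator of Eq. (20).
eta_plus_hat = np.sum((E[1:] + E[:-1]) / 2 * np.diff(betas))
I_hat = np.exp(eta_plus_hat)         # estimate of E_pi[f] = c_+ / Z
```

The estimate \(\widehat{I}=\exp(\widehat{\eta}_+)\) should be close to the true value \(e^{1/2}\approx 1.6487\).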
Function \(f(\textbf{x})\) with zeros of null measure So far, we have considered strictly positive or strictly negative \(f(\textbf{x})\). This case can be extended to a positive (or negative) \(f(\textbf{x})\) with zeros in a set of null measure. Indeed, note that the identity in Eq. (19) requires that \(\mathbb {E}_{\bar{\varphi }(\textbf{x}|\beta )}[\log f(\textbf{x})]<\infty\) for all \(\beta \in [0,1]\). If the zeros of \(f(\textbf{x})\) have null measure and the improper integral converges, the procedure above is also suitable. Table 1 summarizes the generalized TI (GTI) steps for a strictly positive \(f(\textbf{x})\). We discuss other scenarios in the next section.
3.3 GTI for generic \(f(\textbf{x})\)
Using the results from the previous section, we now apply GTI to a generic real-valued function \(f(\textbf{x})\), namely, one that can take positive and negative values, as well as have zero-valued regions of non-null measure. Here, we wish to connect the posterior \(\pi (\textbf{x})\) with \(f_+(\textbf{x})\pi (\textbf{x})\) and \(f_-(\textbf{x})\pi (\textbf{x})\) via two continuous paths. However, a requirement for the validity of the approach is that \(\pi (\textbf{x})\) is zero whenever \(f_+(\textbf{x})\pi (\textbf{x})\) or \(f_-(\textbf{x})\pi (\textbf{x})\) is zero, which does not hold in general, since \(f(\textbf{x})\) can have a smaller support than \(\pi (\textbf{x})\). This fact requires the computation of correction factors to preserve the validity of the approach. More details can be found in Appendix B. Therefore, we need to define the unnormalized restricted posterior densities
where \(\mathbbm {1}_{\mathcal {X}_+}(\textbf{x})\) is the indicator function over the set \(\mathcal {X}_+ = \{\textbf{x}\in \mathcal {X}: f_+(\textbf{x})>0\}\) and \(\mathbbm {1}_{\mathcal {X}_-}(\textbf{x})\) is the indicator function over the set \(\mathcal {X}_- = \{\textbf{x}\in \mathcal {X}: f_-(\textbf{x})>0\}\). The idea is to connect with a path \(\pi _+(\textbf{x})\) and \(f_+(\textbf{x})\pi (\textbf{x})\), and \(\pi _-(\textbf{x})\) with \(f_-(\textbf{x})\pi (\textbf{x})\), by the densities
Note that it is equivalent to write \(f_{\pm }(\textbf{x})\pi _{\pm }(\textbf{x}) = f_{\pm }(\textbf{x})\pi (\textbf{x})\), since \(\pi _{\pm }(\textbf{x}) = \pi (\textbf{x})\) whenever \(f_{\pm }(\textbf{x})>0\), and they only differ when \(f_{\pm }(\textbf{x}) = 0\), in which case we also have \(f_{\pm }(\textbf{x})\pi _{\pm }(\textbf{x}) = f_{\pm }(\textbf{x})\pi (\textbf{x}) = 0\). Defining also
and recalling
the idea is to apply TI separately for approximating \(\eta ^\text {res}_+=\log \frac{c_+}{Z_+}\) and \(\eta ^\text {res}_-=\log \frac{c_-}{Z_-}\), where the superscript res indicates that we consider the restricted normalizing constants \(Z_+\) and \(Z_-\). Hence, two correction factors \(R_+\) and \(R_{-}\) are also required, in order to obtain \(R_+\exp \left( {\eta ^\text {res}_+}\right) =\frac{c_+}{Z}\) and \(R_{-}\exp \left( {\eta ^\text {res}_-}\right) =\frac{c_-}{Z}\). Below, we also show how to estimate the correction factors at a final stage and combine them with the estimates of \(\eta ^\text {res}_+\) and \(\eta ^\text {res}_-\). We can approximate the quantities
using the estimators
where
When comparing the estimators in Eqs. (27)–(28) with respect to the GTI estimator in Eq. (20), the only difference is that the expectation at \(\beta =0\) is approximated by using samples from the restricted posteriors, \(\pi _+(\textbf{x})\) and \(\pi _-(\textbf{x})\), instead of the posterior \(\pi (\textbf{x})\). To obtain an approximation of the true quantities of interest \(\eta _+\), \(\eta _-\) (instead of \(\eta ^\text {res}_+\) and \(\eta ^\text {res}_-\)), we compute two correction factors from a single set of K samples from \(\bar{\pi }(\textbf{x})\) as follows
where \(\mathbbm {1}_{\mathcal {X}_+}(\textbf{x}_i)=1\) if \(f_+(\textbf{x}_i)>0\), \(\mathbbm {1}_{\mathcal {X}_-}(\textbf{x}_i)=1\) if \(f_-(\textbf{x}_i)>0\), and both zero otherwise. The final estimator of I is
including the two correction factors. Table 2 provides all the details of GTI in this scenario.
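The correction-and-recombination step can be sketched in an assumed toy setting (posterior \(\mathcal{N}(0,1)\) and \(f(x)=x\)), where the restricted log-ratios \(\eta^\text{res}_\pm\) are known in closed form and only the correction factors are estimated from posterior samples:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy setting: posterior N(0,1) and f(x) = x, so X_+ = {x > 0},
# X_- = {x < 0}, and the true answer is I = 0.  Here the restricted
# log-ratios are known in closed form, eta^res_± = log E[|x| | x in X_±]
# = log sqrt(2/pi); in practice they come from the GTI estimators in
# Eqs. (27)-(28).
eta_res_plus = np.log(np.sqrt(2 / np.pi))
eta_res_minus = np.log(np.sqrt(2 / np.pi))

# Correction factors from a single set of K posterior samples:
# R_± estimates Z_± / Z, i.e., the posterior probability mass of X_±.
K = 100_000
x = rng.normal(0.0, 1.0, size=K)
R_plus = np.mean(x > 0)
R_minus = np.mean(x < 0)

# Final recombination including both correction factors.
I_hat = R_plus * np.exp(eta_res_plus) - R_minus * np.exp(eta_res_minus)
```

Here \(R_+\approx R_-\approx 1/2\), so the two terms nearly cancel and the final estimate is close to the true value \(I=0\).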
Remark 3
Standard TI as special case of GTI: Note that the GTI scheme contains TI as a special case if we set \(f(\textbf{x})=\ell (\textbf{y}|\textbf{x})\) (i.e., the likelihood function) and let the prior \(g(\textbf{x})\) play the role of \(\pi (\textbf{x})\). Since the likelihood \(\ell (\textbf{y}|\textbf{x})\) is non-negative, we have \(\eta _-=-\infty\) (then, \(\exp \left( \eta _-\right) =0\)), hence we only have to consider the estimation of \(\eta _+\). Moreover, if \(\ell (\textbf{y}|\textbf{x})\) is strictly positive, we do not need to compute the correction factor.
Remark 4
The GTI procedure, described above, also allows the application of the standard TI for computing marginal likelihoods when the likelihood function is not strictly positive, by applying a correction factor in the same fashion (in this case, considering a restricted prior pdf).
4 Computational considerations and other extensions
In this section, we discuss computational details, different scenarios and further extensions, that are listed below.
4.1 Acceleration schemes
In order to apply GTI, the user must set N and M, so that the total number of samples/evaluations of \(f(\textbf{x})\) in Table 1 is \(E=NM\). The evaluations of \(f(\textbf{x})\) in Table 2 are \(E=2NM + K\). We can reduce the cost of the algorithm in Table 2 to \(E=NM + K\) with an acceleration scheme. Instead of running separate MCMC algorithms for \(\bar{\varphi }_+(\textbf{x}|\beta ) \propto f_+(\textbf{x})^\beta \pi _+(\textbf{x})\) and \(\bar{\varphi }_-(\textbf{x}|\beta ) \propto f_-(\textbf{x})^\beta \pi _-(\textbf{x})\), we use a single run targeting
We can obtain two sets of MCMC samples, one from \(\bar{\varphi }_+(\textbf{x}|\beta )\) and one from \(\bar{\varphi }_-(\textbf{x}|\beta )\), by splitting the samples into two groups: those with a positive value of \(f(\textbf{x})\), and those with a negative value of \(f(\textbf{x})\), respectively. The procedure can be repeated until obtaining the desired number of samples from each density, \(\bar{\varphi }_+(\textbf{x}|\beta )\) and \(\bar{\varphi }_-(\textbf{x}|\beta )\).
Moreover, note that in Table 2 we need to draw samples from \(\pi _+(\textbf{x})\), \(\pi _-(\textbf{x})\) and \(\pi (\textbf{x})\). Instead of sampling each one separately, we can use the following procedure. Obtain a set of samples from \(\pi (\textbf{x})\) and then apply rejection sampling (i.e. discard samples with \(f_\pm (\textbf{x})=0\)) in order to obtain samples from \(\pi _\pm (\textbf{x})\). Combining this idea with the acceleration scheme above reduces the cost of Table 2 to \(E=MN\).
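A minimal sketch of this sample-recycling idea, with an assumed toy posterior \(\mathcal{N}(0,1)\) and \(f(x)=x\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy case: posterior N(0,1), f(x) = x.  Draw once from pi(x)
# (here by exact sampling; one MCMC run in general) and split/discard by
# the sign of f to obtain samples from the restricted posteriors.
n = 50_000
x = rng.normal(0.0, 1.0, size=n)
f = x

x_plus = x[f > 0]    # samples from pi_+(x), i.e., pi(x) restricted to f_+ > 0
x_minus = x[f < 0]   # samples from pi_-(x), i.e., pi(x) restricted to f_- > 0
```

A single set of draws from \(\pi(\textbf{x})\) thus yields samples from \(\pi_+(\textbf{x})\) and \(\pi_-(\textbf{x})\) at no extra evaluation cost.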
4.2 Parallelization
Note that steps 1 and 2 in Table 1 and Table 2 are amenable to parallelization. In other words, those steps need not be performed sequentially but can be done using embarrassingly parallel MCMC chains (i.e. with no communication among N, or 2N, workers). Only step 3 requires communicating to a central node and combining the estimates. With this procedure, the number of evaluations E is the same but the computation time is reduced by a factor of \(\frac{1}{N}\) (or \(\frac{1}{2N}\)). On the other hand, population MCMC techniques can be used, but parallelization speedups are lower since communication among workers occurs every so often, in order to foster the exploration of the chains (Martino et al. 2016; Calderhead and Girolami 2009).
4.3 Vector-valued functions \({f}(\textbf{x})\)
In Bayesian inference, one is often interested in computing moments of the posterior, i.e.,
In this case \(\textbf{I}\) is a vector and \(\textbf{f}(\textbf{x})=\textbf{x}^{\alpha }\). When \(\alpha =1\), \(\textbf{I}\) represents the minimum mean square error (MMSE) estimator. More generally, we can have a vector-valued function,
hence the integral of interest is a vector \(\textbf{I}=[I_1,\dots ,I_{d_f}]^\top\) where \(I_i = \int _\mathcal {X}f_i(\textbf{x})\bar{\pi }(\textbf{x})d\textbf{x}\). In this scenario, we need to apply the GTI scheme to each component of \(\textbf{I}\) separately, obtaining estimates \(\widehat{I}_i\) of the form in Eq. (33).
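The componentwise application can be sketched by reusing the strictly positive case of Sect. 3.2 in an assumed toy setting where \(\bar\pi=\mathcal{N}(0,1)\) and \(f_j(x)=e^{a_j x}\), so each tempered density is Gaussian and the true values \(I_j=e^{a_j^2/2}\) are known:

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed toy vector-valued case: pi = N(0,1) and f(x) = (exp(a_1 x), exp(a_2 x)),
# each component strictly positive.  Each path phi_j(x|beta) ∝ exp(a_j beta x - x^2/2)
# is N(a_j beta, 1), so we sample it exactly; MCMC would be used in general.
a = np.array([1.0, 2.0])
N, M = 50, 20_000
betas = np.linspace(0.0, 1.0, N)

I_hat = np.empty(len(a))
for j, aj in enumerate(a):           # one scalar GTI run per component
    E = np.empty(N)
    for i, b in enumerate(betas):
        x = rng.normal(aj * b, 1.0, size=M)
        E[i] = np.mean(aj * x)       # E(beta) for log f_j(x) = a_j * x
    eta_j = np.sum((E[1:] + E[:-1]) / 2 * np.diff(betas))   # trapezoidal rule
    I_hat[j] = np.exp(eta_j)
```

The resulting vector estimate should be close to \((e^{1/2}, e^{2})\).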
4.4 TI within the TABI framework: TATI
We have seen that we can apply GTI to compute the posterior expectation of a generic \(f(\textbf{x})\), which can be positive, negative and have zero-valued regions. To do this, we connected \(\pi _+(\textbf{x})\) and \(\pi _-(\textbf{x})\) with tempered paths to \(f_+(\textbf{x})\pi (\textbf{x})\) and \(f_-(\textbf{x})\pi (\textbf{x})\), respectively, and then applied correction factors.
An alternative procedure is to use the TABI identity in Eq. (4), rather than Eq. (16), and use reference distributions for computing separately \(c_+\), \(c_-\) and Z in Eq. (16). This target-aware TI (TATI) differs from GTI in that we need to apply TI three times, bridging three reference distributions to the target densities \(f_+(\textbf{x})\pi (\textbf{x})\), \(f_-(\textbf{x})\pi (\textbf{x})\) and \(\pi (\textbf{x})\). Let us define as
three unnormalized reference densities with normalizing constants,
Then, the idea is to apply TI for obtaining estimates of \(\log \frac{c_+}{Z^\text {ref}_1}\), \(\log \frac{c_-}{Z^\text {ref}_2}\) and \(\log \frac{Z}{Z^\text {ref}_3}\). A requirement is that \(p^\text {ref}_1(\textbf{x})\) is zero where \(f_+(\textbf{x})\pi (\textbf{x})\) is zero, \(p^\text {ref}_2(\textbf{x})\) is zero where \(f_-(\textbf{x})\pi (\textbf{x})\) is zero, and \(p^\text {ref}_3(\textbf{x})\) is zero where \(\pi (\textbf{x})\) is zero. Namely, we need to be able to build a continuous path between the reference distributions and the corresponding unnormalized pdf of interest. With this procedure, we do not need to apply correction factors, but we just need to apply the algorithm in Table 1 three times. The performance of TATI is expected to be better than GTI if we are able to choose three reference distributions that are ‘closer’ to the corresponding target densities, than what \(\pi (\textbf{x})\) is to \(f_+(\textbf{x})\pi (\textbf{x})\) or \(f_-(\textbf{x})\pi (\textbf{x})\) (Llorente et al. 2020a). For instance, we can obtain the reference pdfs by building nonparametric approximations to each target density (Llorente et al. 2021b).
5 Numerical experiments
In this section, we illustrate the performance of the proposed scheme in two numerical experiments, which consider different kinds of densities \(\bar{\pi }\) with different features and dimensions, as well as different functions \(f(\textbf{x})\). In the first example, \(f(\textbf{x})\) is strictly positive, so we apply the algorithm described in Table 1. In the second example, we consider \(f(\textbf{x})\) to have zero-valued regions, and hence we apply the algorithm in Table 2. Note that we consider the same setup as in Rainforth et al. (2020) in order to compare with respect to instances of TABI algorithms.
5.1 First numerical analysis
Let us consider the following Gaussian model (Rainforth et al. 2020)
where D is the dimensionality, \(\textbf{I}_D\) is the identity matrix, \(\textbf{0}_D\) and \(\textbf{1}_D\) are D-vectors containing only zeros or ones respectively, and y is a scalar value that represents the radial distance of the observation \(\textbf{y}=-\frac{y}{\sqrt{D}}\textbf{1}_D\) to the origin. We are interested in the estimation of \(I=\int _{\mathcal {X}} f(\textbf{x})\bar{\pi }(\textbf{x})d\textbf{x}\). Thus, this problem consists of computing the posterior predictive density, under the above model, at the point \(\frac{y}{\sqrt{D}}\textbf{1}_D\). In this toy example, the posterior and the function-scaled posteriors can be obtained in closed-form, that is,
and
The ground-truth is known, and can be written as a Gaussian density evaluated at \(\frac{y}{\sqrt{D}}\textbf{1}_D\), more specifically, \(I = \mathcal {N}\left( \frac{y}{\sqrt{D}}\textbf{1}_D \Big | -\frac{1}{2}\frac{y}{\sqrt{D}}\textbf{1}_D, \textbf{I}_D \right)\).
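For reference, the closed-form ground truth above reduces to a simple expression that can be evaluated directly (the squared distance between the evaluation point \(\frac{y}{\sqrt{D}}\textbf{1}_D\) and the Gaussian mean \(-\frac{1}{2}\frac{y}{\sqrt{D}}\textbf{1}_D\) is \(\frac{9}{4}y^2\)):

```python
import numpy as np

# Ground truth I = N( (y/sqrt(D)) 1_D | -(1/2)(y/sqrt(D)) 1_D, I_D ).
# The squared Mahalanobis distance is ||(3/2)(y/sqrt(D)) 1_D||^2 = (9/4) y^2,
# so log I = -(D/2) log(2*pi) - (9/8) y^2.
def log_ground_truth(y, D):
    return -0.5 * D * np.log(2 * np.pi) - (9.0 / 8.0) * y**2
```

This quantity decreases rapidly with both y and D, which is consistent with the posterior \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}|1)\) moving further apart as y grows.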
We test the values \(y \in \{2,3.5,5\}\) and \(D \in \{10,25,50\}\). Note that, as we increase y, the posterior \(\bar{\pi }(\textbf{x})\) and the density \(\bar{\varphi }(\textbf{x}|1)\propto f(\textbf{x})\pi (\textbf{x})\) become further apart.
5.1.1 Comparison with other target-aware approaches
We aim to compare GTI with an MCMC baseline and other target-aware algorithms that make use of \(f(\textbf{x})\). Specifically, we compare against two extreme cases of self-normalized IS (SNIS) estimators \(\widehat{I}_{\text {SNIS}} =\frac{1}{\sum _{j=1}^{M_\text {tot}}w_j} \sum _{i=1}^{M_\text {tot}}w_i f(\textbf{x}_i)\), where \(\textbf{x}_i\sim q(\textbf{x})\) and \(w_i = \frac{\pi (\textbf{x}_i)}{q(\textbf{x}_i)}\) is the IS weight. Namely, (1) SNIS using samples from the posterior (SNIS1), i.e., \(q(\textbf{x}) = \bar{\pi }(\textbf{x})\) (hence, SNIS1 coincides with MCMC), and (2) SNIS using samples from \(q(\textbf{x}) = \bar{\varphi }(\textbf{x}|1)\propto f(\textbf{x})\pi (\textbf{x})\) (SNIS2), which corresponds to setting \(\beta =1\) in Eq. (37). These choices are optimal for estimating, respectively, the denominator and the numerator of the right hand side of Eq. (2) (Robert and Casella 2004). Note that SNIS2 can be considered as a first “primitive” target-aware algorithm, since it employs samples from \(\bar{\varphi }(\textbf{x}|1) \propto f(\textbf{x})\pi (\textbf{x})\).
A second target-aware approach can be obtained by recycling the samples generated in SNIS1 and SNIS2 (that is, from \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}|1)\)); it is called (3) bridge sampling (BS) (Llorente et al. 2020a). This estimator can be viewed as using the mixture of \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}|1)\) as the proposal pdf. More details about (optimal) BS can be found in Appendix A. Finally, we also aim to compare against the target-aware versions of two popular marginal likelihood estimators, namely, (4) target-aware annealed IS (TAAnIS) and (5) target-aware nested sampling (TANS), which also make use of \(f(\textbf{x})\) (Rainforth et al. 2020).
In order to keep the comparisons fair, we consider the same number of likelihood evaluations E in all of the methods. Note that evaluating the likelihood is usually the most costly step in many real-world scenarios. Hence, in SNIS1 and SNIS2 we draw \(M_\text {tot} = E\) samples from \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}|1)\), respectively, via MCMC; BS employs \(\frac{E}{2}\) samples from \(\bar{\pi }(\textbf{x})\) and \(\frac{E}{2}\) from \(\bar{\varphi }(\textbf{x}|1)\). For TAAnIS and TANS we use the same parameters as in Rainforth et al. (2020). Namely, for TAAnIS we employ \(N=200\) intermediate distributions, with \(n_\text {MCMC}= 5\) iterations of the Metropolis-Hastings (MH) algorithm, which allows for a total number of particles \(n_\text {par} = \lfloor \frac{E}{(N-1)n_\text {MCMC} + N-1} \rfloor\), where half of the particles are used to estimate the numerator, and the other half to estimate the denominator on the right-hand side of Eq. (2). For TANS, we employ \(n_\text {MCMC}= 20\) iterations of MH and \(n_\text {par} = \lfloor \frac{E}{1 + \lambda n_\text {MCMC}} \rfloor\) particles, where \(\lambda = 250\) and \(T = \lambda n_\text {par}\) iterations. Again, TANS employs one half of the particles for estimating the numerator and the other half for the denominator. Finally, in GTI we also set \(N=200\), hence we draw \(M = \lfloor \frac{E}{N}\rfloor\) samples from each \(\bar{\varphi }(\textbf{x}|\beta _i),\ i=1,\dots ,N\). Note that we set the same number of intermediate distributions in GTI and TAAnIS; however, the paths are not identical, since TAAnIS aims at bridging the prior with \(\pi (\textbf{x})\) and \(\varphi (\textbf{x}|1)\), while GTI directly bridges \(\pi (\textbf{x})\) with \(\varphi (\textbf{x}|1)\).
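The budget accounting above can be made concrete. The short sketch below (our own bookkeeping, not the authors' code) reproduces the sample counts implied by the largest budget used in the experiments, \(E=10^7\):

```python
# Particle/sample counts implied by a common budget of E likelihood evaluations.
E = 10**7

# TAAnIS: N = 200 intermediate distributions, 5 MH steps each.
N, n_mcmc = 200, 5
n_par_taanis = E // ((N - 1) * n_mcmc + N - 1)   # floor(1e7 / 1194) = 8375 particles

# TANS: lambda = 250, with 20 MH steps per nested-sampling replacement.
lam, n_mcmc_ns = 250, 20
n_par_tans = E // (1 + lam * n_mcmc_ns)          # floor(1e7 / 5001) = 1999 particles

# GTI: M samples from each of the N tempered distributions.
M_gti = E // N                                   # 50000 samples per distribution
```

The point of equalizing E rather than the number of samples is that each method spends its budget differently: TAAnIS and TANS pay for MH transitions per particle, while GTI pays per tempered chain.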
All the iterations of the MH algorithm use a Gaussian random-walk proposal with covariance matrix equal to \(\Sigma = 0.1225 \textbf{I}\), \(\Sigma = 0.04 \textbf{I}\) and \(\Sigma = 0.01 \textbf{I}\), for \(D=10,25,50\), respectively. Following Rainforth et al. (2020), for TANS we use instead \(\Sigma = \textbf{I}\), \(\Sigma = 0.09 \textbf{I}\) and \(\Sigma = 0.01 \textbf{I}\). For choosing \(\beta _i\) in TAAnIS and GTI, we use the powered fraction schedule, \(\beta _i = \left( \frac{i-1}{N-1}\right) ^5\) for \(i=1,\dots ,N\) (Friel and Pettitt 2008; Xie et al. 2010).
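The powered fraction schedule concentrates inverse temperatures near \(\beta=0\), where the tempered path typically changes fastest. A one-line sketch (the function name is ours):

```python
import numpy as np

def powered_fraction_schedule(N, power=5):
    """beta_i = ((i-1)/(N-1))**power for i = 1, ..., N (Friel and Pettitt 2008)."""
    i = np.arange(1, N + 1)
    return ((i - 1) / (N - 1)) ** power

betas = powered_fraction_schedule(200)   # betas[0] == 0.0, betas[-1] == 1.0
```

With \(N=200\), roughly half of the nodes fall below \(\beta \approx 0.03\), clustering the quadrature effort near the prior end of the path.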
5.1.2 Results
The results are given in Fig. 1, which shows, for each pair (y, D), the median relative square error, along with the 25% and 75% quantiles (over 100 simulations), versus the total number of likelihood evaluations E, up to \(E=10^7\). We see that GTI is the first or second best overall, in terms of relative square error, for all (y, D). In fact, the performance of GTI seems rather insensitive to increases in the dimension D and in y. We see that, for low dimension and when the distance between \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}|1)\) is small (i.e., \(y=2\) and \(D=10\)), the target-aware algorithms do not produce large gains with respect to the MCMC baseline (SNIS1). On the contrary, for \(y=3.5,5\) (second and third rows), we see that the target-aware algorithms GTI, TAAnIS and BS outperform the MCMC baseline. This performance gain with larger y is expected, since larger y corresponds to a larger mismatch between \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}|1)\propto f(\textbf{x})\pi (\textbf{x})\), which is the scenario for which the target-aware approaches are well suited. Comparing the target-aware algorithms, we see that TAAnIS also performs as well as our GTI in low dimensions (\(D=10\)), but it breaks down as we increase the dimension, being outperformed by TANS for \(D=25,50\), which confirms the results of Rainforth et al. (2020), where TANS is preferable to TAAnIS in high dimensions. It is worth noticing the very good performance of BS, given its simplicity and the fact that it can be computed at almost no extra cost once we have computed SNIS1 and SNIS2. Indeed, its performance matches that of GTI, and it actually outperforms GTI when the separation is not too high. This is also expected since, when \(y=2\), both \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}|1)\) are good pdfs for estimating both the numerator and the denominator of the right-hand side of Eq. (2).
In this sense, having only one “bridge” is better than having \(N=200\) intermediate distributions. However, GTI outperforms BS when \(y=3.5,5\), especially when the dimension is high.
In summary, our proposed GTI is able to produce good estimates in the range of values of (y, D) considered. The performance gains with respect to an MCMC baseline are higher when the discrepancy between \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}|1)\) is large. Compared to other target-aware approaches, GTI produces better estimates (especially in high dimension) and is also able to perform well when the discrepancy is low, matching the performance of BS, which is a simpler and more direct target-aware algorithm.
5.2 Second numerical analysis
We consider the following two-dimensional banana-shaped density (a benchmark example; Rainforth et al. 2020; Martino and Read 2013; Haario et al. 2001),
where \(\mathbbm {1}\left( \textbf{x}\in \mathcal {B}\right)\) is the prior, where \(\mathcal {B} = \{\textbf{x}:\ -25< x_1< 25,\ -40< x_2 <20 \}\), and the function is
We compare GTI using \(N\in \{10,50, 100\}\) against TAAnIS and TANS, in the estimation of \(\mathbb {E}_{\bar{\pi }}[f(\textbf{x})]\) allowing a budget of \(E=10^6\). We also consider a baseline MCMC chain targeting \(\pi (\textbf{x})\) with the same number of likelihood evaluations.
The main difference with respect to the previous experiment is that \(f(\textbf{x})\) here has a zero-valued region, so, in order to apply GTI, we need to use the algorithm in Table 2. Hence, for GTI, we run \(N+1\) chains for \(M=\frac{E}{N+1}\) iterations each. Each of the first N chains targets a different tempered distribution \(f(\textbf{x})^{\beta _i}\pi (\textbf{x})\), and the last chain is used to compute the correction factor. It is also important to notice that TAAnIS, which also uses a geometric path to bridge \(f(\textbf{x})\pi (\textbf{x})\) and the prior, requires the computation of a correction factor as well, accounting for the fact that it connects a prior restricted to the region where \(f(\textbf{x})\ne 0\). This amounts to multiplying the final estimate returned by TAAnIS by a factor of \(\frac{1}{2}\) (Fig. 2).
All the MCMC algorithms use a Gaussian random-walk proposal with covariance \(\Sigma =3\textbf{I}_2\). The budget of likelihood evaluations is \(E=10^6\), for all the compared schemes. We use again the powered fraction schedule: \(\beta _i = \left( \frac{i-1}{N-1}\right) ^5\) for \(i=1,\dots ,N\).
Results The results are shown in Table 3. We show the median relative square error of the methods over 100 independent simulations. For the sample size considered, GTI performs better than the MCMC baseline and the other target-aware algorithms. TAAnIS performs slightly better than the MCMC baseline, while TANS completely fails at estimating the posterior expectation in this example. For \(N=100\), the performance gains of GTI over MCMC are of almost one order of magnitude. However, note that GTI with the choice \(N=10\) is worse than the MCMC baseline due to the discretization error, i.e., there are not enough quadrature nodes, so the estimation in Eq. (20) has considerable bias. In that situation, increasing the sample size would not translate into a significant performance gain. This contrasts with TAAnIS, where increasing N produces only a small improvement in the final estimate, since TAAnIS is unbiased regardless of the choice of N.
6 Conclusions
We have extended the powerful thermodynamic integration technique to perform target-aware Bayesian inference. Namely, GTI allows the computation of posterior expectations of real-valued functions \(f(\textbf{x})\), as well as vector-valued functions \(\textbf{f}(\textbf{x})\). GTI contains the standard TI as a special case. Even for the estimation of the marginal likelihood, this work provides a way to extend the application of standard TI, avoiding the assumption of strictly positive likelihood functions (see Remarks 1-3). Several computational considerations and variants are discussed. The advantages of GTI over other target-aware algorithms are shown in different numerical comparisons. As a future research line, we plan to study new continuous paths for linking densities with different supports, avoiding the need for correction terms. Alternatively, as discussed at the end of Sect. 4, another approach would be to design suitable approximations of \(\varphi _+(\textbf{x})\), \(\varphi _-(\textbf{x})\) and \(\pi (\textbf{x})\) using, e.g., regression techniques (Llorente et al. 2021b, 2020b).
Notes
From Eq. (12), we can write \(\log \pi (\textbf{x}|\beta )= \log g(\textbf{x})+\beta \log \ell (\textbf{y}|\textbf{x})\). Hence, \(\frac{\partial \log \pi (\textbf{x}|\beta )}{\partial \beta }=\log \ell (\textbf{y}|\textbf{x})\).
In order to obtain samples from \(\pi _\pm (\textbf{x})\), we just need to consider \(\pi _\pm (\textbf{x})\) as the target density, instead of \(\pi (\textbf{x})\), in the MCMC steps. A similar alternative procedure is to apply rejection sampling, discarding the samples from \(\pi (\textbf{x})\) such that \(f_\pm (\textbf{x})=0\).
References
Calderhead B, Girolami M (2009) Estimating Bayes factors via thermodynamic integration and population MCMC. Comput Stat Data Anal 53(12):4028–4045
Frenkel D (1986) Free-energy computation and first-order phase transitions
Friel N, Pettitt AN (2008) Marginal likelihood estimation via power posteriors. J R Stat Soc Ser B (Stat Methodol) 70(3):589–607
Friel N, Wyse J (2012) Estimating the evidence—a review. Stat Neerl 66(3):288–308
Friel N, Hurn M, Wyse J (2014) Improving power posterior estimation of statistical evidence. Stat Comput 24(5):709–723
Gelman A, Meng XL (1998) Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat Sci 1998:163–185
Haario H, Saksman E, Tamminen J (2001) An adaptive Metropolis algorithm. Bernoulli 7(2):223–242
Lartillot N, Philippe H (2006) Computing Bayes factors using thermodynamic integration. Syst Biol 55(2):195–207
Llorente F, Martino L, Elvira V, Delgado D, López-Santiago J (2020b) Adaptive quadrature schemes for Bayesian inference via active learning. IEEE Access 8:208462–208483
Llorente F, Martino L, Delgado-Gómez D, Camps-Valls G (2021b) Deep importance sampling based on regression for model inversion and emulation. Digit Signal Process 116:103104
Llorente F, Martino L, Delgado D, Lopez-Santiago J (2020a) Marginal likelihood computation for model selection and hypothesis testing: an extensive review. arXiv:2005.08334
Llorente F, Martino L, Delgado D, López-Santiago J (2021a) On the computation of marginal likelihood via MCMC for model selection and hypothesis testing. In: 28th European signal processing conference (EUSIPCO), pp 2373–2377
Luengo D, Martino L, Bugallo M, Elvira V, Särkkä S (2020) A survey of Monte Carlo methods for parameter estimation. EURASIP J Adv Signal Process 25:1–62
Martino L, Read J (2013) On the flexibility of the design of multiple try Metropolis schemes. Comput Stat 28(6):2797–2823
Martino L, Elvira V, Luengo D, Corander J, Louzada F (2016) Orthogonal parallel MCMC methods for sampling and optimization. Digit Signal Process 58:64–84
Martino L, Llorente F, Curbelo E, López-Santiago J, Míguez J (2021) Automatic tempered posterior distributions for Bayesian inversion problems. Mathematics 9(7):784
Del Moral P, Doucet A, Jasra A (2006) Sequential Monte Carlo samplers. J R Stat Soc Ser B (Stat Methodol) 68(3):411–436
Neal RM (1993) Probabilistic inference using Markov chain Monte Carlo methods. Department of Computer Science, University of Toronto, Toronto
Neal RM (1996) Sampling from multimodal distributions using tempered transitions. Stat Comput 6(4):353–366
Neal RM (2001) Annealed importance sampling. Stat Comput 11(2):125–139
Newton MA, Raftery AE (1994) Approximate Bayesian inference with the weighted likelihood bootstrap. J R Stat Soc Ser B (Methodol) 56(1):3–26
Oates CJ, Papamarkou T, Girolami M (2016) The controlled thermodynamic integral for Bayesian model evidence evaluation. J Am Stat Assoc 111(514):634–645
Rainforth T, Goliński A, Wood F, Zaidi S (2020) Target-aware Bayesian inference: how to beat optimal conventional estimators. J Mach Learn Res 21(88):3428–3481
Robert CP, Casella G (2004) Monte Carlo statistical methods. Springer, Berlin
Skilling J (2006) Nested sampling for general Bayesian computation. Bayesian Anal 1(4):833–859
Xie W, Lewis PO, Fan Y, Kuo L, Chen MH (2010) Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst Biol 60(2):150–160
Acknowledgements
The work was partially supported by the Young Researchers R&D Project, ref. num. F861 (AUTO-BA-GRAPH), funded by Community of Madrid and Rey Juan Carlos University, by Agencia Estatal de Investigación AEI (project SP-GRAPH, ref. num. PID2019-105032GB-I00), and by the Spanish government via grant FPU19/00815.
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Appendices
1.1 Appendix A: Bridge sampling
The estimator tested in Sect. 5.1 is an instance of bridge sampling. Bridge sampling (BS) is an importance sampling approach for computing the ratio of normalizing constants of two unnormalized pdfs using samples from both densities (Llorente et al. 2021a). Here, the two unnormalized pdfs of interest are \(\pi (\textbf{x})\) and \(\varphi (\textbf{x})=f(\textbf{x})\pi (\textbf{x})\), and the ratio \(\frac{c}{Z}\) corresponds to the posterior expectation of interest, namely, \(I = \frac{c}{Z}\). Hence, BS can be viewed as a target-aware approach. In order to implement the optimal bridge sampling estimator, an iterative scheme is required.
Let \(\{\textbf{x}_i\}_{i=1}^{N_1},\ \{\textbf{z}_i\}_{i=1}^{N_2}\) denote sets of MCMC samples from \(\bar{\pi }(\textbf{x})\) and \(\bar{\varphi }(\textbf{x}) = \frac{\varphi (\textbf{x})}{c}\), respectively. Let \(\widehat{I}^{(0)}\) be an initial estimate of I; the optimal BS estimator is then computed by refining this estimate through the following loop. For \(t=1,\dots ,T\):
In the experiments, just a couple of iterations were needed for \(\widehat{I}^{(t)}\) to converge. As initial estimate, we take
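For concreteness, a minimal sketch of the iteration is given below. We assume the standard Meng-Wong form of the optimal bridge estimator, specialized to our setting, where the ratio of the two unnormalized densities is \(\varphi(\textbf{x})/\pi(\textbf{x}) = f(\textbf{x})\); the function names are ours.

```python
import numpy as np

def optimal_bridge(f_pi, f_phi, r0=1.0, T=20):
    """Iterative optimal bridge sampling estimate of I = c/Z.

    f_pi  : f evaluated at samples x_i from the posterior (pi-bar)
    f_phi : f evaluated at samples z_j from the function-scaled posterior (phi-bar)
    """
    n1, n2 = len(f_pi), len(f_phi)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    r = r0
    for _ in range(T):
        num = np.mean(f_phi / (s1 * f_phi + s2 * r))
        den = np.mean(1.0 / (s1 * f_pi + s2 * r))
        r = num / den
    return r
```

A quick consistency check: if \(f(\textbf{x}) \equiv k\) is constant, then \(I = k\) exactly, and the iteration returns k after a single step regardless of the initial value.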
1.2 Appendix B: TI when \(f(\textbf{x})\) has zero-valued regions
We have seen that the case of \(f(\textbf{x})\) being strictly positive (or strictly negative) is apt for thermodynamic integration since \(\mathbb {E}_{\bar{\varphi }(\textbf{x}|\beta )}[\log f(\textbf{x})]<\infty\) for all \(\beta\), so the integral in the r.h.s. of Eq. (19) is finite. In other words, the integrand function, \(\mathbb {E}_{\bar{\varphi }(\textbf{x}|\beta )}[\log f(\textbf{x})]\), is continuous and bounded for \(\beta \in [0,1]\). Now, we discuss the case where this does not hold; for instance, when \(f(\textbf{x})\) has zero-valued regions within the support \(\mathcal {X}\). In that case, the integrand of Eq. (19) diverges at \(\beta =0\),
because the integral will sum over regions where \(\log f(\textbf{x}) = -\infty\). Note that the integrand remains finite for every \(\beta >0\), since the expectations, \(\mathbb {E}_{\bar{\varphi }(\textbf{x}|\beta )}[\log f(\textbf{x})]\), are w.r.t. densities that take into account \(f(\textbf{x})\), so the effective support is \(\mathcal {X}\backslash \{\textbf{x}:\ f(\textbf{x})=0\}\).
Improper integral In this case, the integral in Eq. (19) is thus improper and has to be rewritten as the limit \(\lim _{\beta _0\rightarrow 0^+}\int _{\beta _0}^{1}\mathbb {E}_{\bar{\varphi }(\textbf{x}|\beta )}[\log f(\textbf{x})]\,d\beta\). If this limit exists, the integral is convergent and it is safe to apply quadrature (Riemann sums) to calculate it, taking a very small \(\beta _0\).
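As a simple illustration (a toy integrand of ours, not the paper's), consider \(\int_0^1 \log \beta \, d\beta = -1\), which is improper at \(\beta = 0\) in exactly the same way. Trapezoidal quadrature on a powered-fraction grid that starts at a very small \(\beta_0 > 0\) recovers the limit accurately:

```python
import numpy as np

# Powered-fraction nodes clustered near 0, starting at beta_0 = (1/N)**5 > 0.
N = 1000
betas = (np.arange(1, N + 1) / N) ** 5    # beta_0 = 1e-15, ..., beta_N = 1
vals = np.log(betas)                      # integrand, finite on (0, 1]

# Composite trapezoidal rule for int_0^1 log(beta) d(beta) = -1.
approx = np.sum(np.diff(betas) * (vals[1:] + vals[:-1]) / 2.0)
```

The clustering of nodes near zero is what controls the discretization error where the integrand varies fastest; with \(N=1000\) nodes here, the error is well below \(10^{-3}\).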
Behavior near \(\beta =0\) Paying attention to the behavior of \(\mathbb {E}_{\bar{\varphi }(\textbf{x}|\beta )}[\log f(\textbf{x})]\) near \(\beta =0\), note that it does not diverge to \(-\infty\) as \(\beta\) approaches 0; on the contrary, its value is bounded from below as \(\beta \rightarrow 0^+\). Consider, for an infinitesimal \(\epsilon\), the integral
where \(p_\epsilon (\textbf{x}) \propto f(\textbf{x})^\epsilon \pi (\textbf{x})\) coincides with \(\bar{\pi }(\textbf{x})\) in \(\mathcal {X}\backslash \{\textbf{x}: f(\textbf{x})=0\}\), and is different only in that \(p_\epsilon (\textbf{x}) =0\) whenever \(f(\textbf{x})=0\), that is,
This integral effectively corresponds to
where \(\mathcal {X}_0 = \mathcal {X}\backslash \{\textbf{x}: f(\textbf{x})=0\}\), that is, the expectation is w.r.t. \(\bar{\pi }_\text {res}(\textbf{x})\), the posterior restricted to regions where \(f(\textbf{x})>0\). Hence, we can summarize this as follows:
In summary, the integrand has a jump at \(\beta =0\) since
Then, by using Eq. (45), we are actually estimating
instead of the integral of interest \(I=\frac{c}{Z}\). We need to apply a correction factor to our estimator as follows
where the last term can be approximated from a posterior sample as follows:
where \(\mathbbm {1}_{\mathcal {X}_0}(\textbf{x})\) is the indicator function in \(\mathcal {X}_0\), i.e., \(\mathbbm {1}_{\mathcal {X}_0}(\textbf{x}_i)=1\) if \(f(\textbf{x}_i)>0\) and zero otherwise.
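The correction term is straightforward to approximate in practice. A minimal sketch (our own naming, where `gti_estimate_restricted` stands for the estimate obtained via Eq. (45)) multiplies it by the Monte Carlo estimate of \(\mathbb{E}_{\bar{\pi}}[\mathbbm{1}_{\mathcal{X}_0}(\textbf{x})]\):

```python
import numpy as np

def corrected_estimate(gti_estimate_restricted, f_at_posterior_samples):
    """Multiply the restricted GTI estimate by the posterior mass of {f > 0}."""
    frac = np.mean(np.asarray(f_at_posterior_samples) > 0)  # MC estimate of E[1_{X0}]
    return gti_estimate_restricted * frac
```

For example, if half of the posterior samples fall in the region where \(f(\textbf{x}) > 0\), the restricted estimate is simply halved, which mirrors the fixed \(\frac{1}{2}\) correction applied to TAAnIS in Sect. 5.2.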
Llorente, F., Martino, L. & Delgado, D. Target-aware Bayesian inference via generalized thermodynamic integration. Comput Stat 38, 2097–2119 (2023). https://doi.org/10.1007/s00180-023-01358-0