Regularized Stein Variational Gradient Flow

He, Ye; Balasubramanian, Krishnakumar; Sriperumbudur, Bharath K.; Lu, Jianfeng

doi:10.1007/s10208-024-09663-w

Regularized Stein Variational Gradient Flow

Open access
Published: 17 July 2024

(2024)
Cite this article

Download PDF

You have full access to this open access article

Foundations of Computational Mathematics Aims and scope Submit manuscript

Regularized Stein Variational Gradient Flow

Download PDF

Ye He¹,
Krishnakumar Balasubramanian²,
Bharath K. Sriperumbudur³ &
…
Jianfeng Lu⁴

449 Accesses
Explore all metrics

Abstract

The stein variational gradient descent (SVGD) algorithm is a deterministic particle method for sampling. However, a mean-field analysis reveals that the gradient flow corresponding to the SVGD algorithm (i.e., the Stein Variational Gradient Flow) only provides a constant-order approximation to the Wasserstein gradient flow corresponding to the KL-divergence minimization. In this work, we propose the Regularized Stein Variational Gradient Flow, which interpolates between the Stein Variational Gradient Flow and the Wasserstein gradient flow. We establish various theoretical properties of the Regularized Stein Variational Gradient Flow (and its time-discretization) including convergence to equilibrium, existence and uniqueness of weak solutions, and stability of the solutions. We provide preliminary numerical evidence of the improved performance offered by the regularization.

Convergent numerical approximation of the stochastic total variation flow

Article Open access 29 March 2020

A variational finite volume scheme for Wasserstein gradient flows

Article 08 October 2020

Variational approach to coarse-graining of generalized gradient flows

Article Open access 28 June 2017

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Given a potential function $V:{\mathbb {R}}^d \rightarrow {\mathbb {R}}$, the sampling problem involves generating samples from the density

$$\begin{aligned} \pi (x) {:}{=}Z^{-1}e^{-V(x)},~\qquad \text {with}\qquad Z{:}{=}\int e^{-V(x)}dx \end{aligned}$$

(1)

being the normalization constant, which is typically assumed to be unknown or hard to compute. The task of sampling arises in several fields of applied mathematics, including Bayesian statistics and machine learning in the context of numerical integration. There are two widely-used approaches for sampling: (i) diffusion-based randomized algorithms, which are based on discretizations of certain diffusion processes, and (ii) particle-based deterministic algorithms, which are discretizations of certain approximate gradient flows. A central idea connecting the two approaches is the seminal work by [23] which provided a variational interpretation of the Langevin diffusion as the Wasserstein gradient flow (WGF),

$$\begin{aligned} \partial _t \mu _t = \nabla \cdot \left( \mu _t ~\nabla _{W_2}F(\mu _t) \right) = \nabla \cdot \left( \mu _t ~\nabla \log \frac{\mu _t}{\pi } \right) , \end{aligned}$$

(2)

where the term $\nabla _{W_2}F(\mu _t)= \nabla \log \frac{\mu _t}{\pi } $ could be interpreted as the Wasserstein gradient^{Footnote 1} of the relative entropy functional (also called as the Kullback–Leibler divergence), defined by

$$\begin{aligned} F(\mu _t) = \text {KL}(\mu _t|\pi ){:}{=}\int _{\mathbb {R}^d} \log \frac{\mu _t(x)}{\pi (x)} \mu _t(x) dx, \end{aligned}$$

evaluated at $\mu _t$. This leads to the idea that sampling could be viewed as optimization on the space of densities/measures, a viewpoint that has provided a deeper understanding of the sampling problem [44, 50].

There are several merits and disadvantages to both the randomized and deterministic discretizations of the (approximate) WGF. First, note that obtaining exact space-time discretization of the WGF in (2) is not possible. Indeed, due to the presence of the diffusion term, when initialized with a N-particle based empirical measure, the particles do not remain as particles for any time $t >0$. Hence, on the one hand, randomized discretizations like the Langevin Monte Carlo algorithm, are used as implementable space-time discretizations of the WGF. On the other hand, motivated by applications where the randomness in the discretization is undesirable, in the applied mathematics literature, other discretizations of approximate WGF were developed. Such methods are predominantly based on using mollifiers and we refer the reader to [7, 14, 16, 34, 36] for a partial list and to [10], for a comprehensive overview.

Recently, in the machine learning community, the stein variational gradient descent [25, 27] was proposed as another deterministic discretization of approximate WGF, and has gathered significant attention due to applications in reinforcement learning [29], graphical modeling [48], measure quantization [51], and other fields of machine learning and applied mathematics [12, 12, 24, 47]. Due to the use of the reproducing kernels, the stein variational gradient descent (SVGD) algorithm provides a space-time discretization of the following approximate Wasserstein gradient flow (which we refer to as the Stein Variational Gradient Flow (SVGF) for simplicity)

$$\begin{aligned} \partial _t \mu _t = \nabla \cdot \left( \mu _t ~ {\mathcal {T}}_{k,\mu _t}\nabla \log \frac{\mu _t}{\pi } \right) , \end{aligned}$$

(3)

where ${\mathcal {T}}_{k,\mu }: L^d_2(\mu ) \rightarrow L^d_2(\mu )$^{Footnote 2} is the integral operator defined as ${\mathcal {T}}_{k,\mu }f(x) = \int k(x,y) f(y) \mu (y) dy $ for any function $f\in L^d_2(\mu )$, and for a kernel $k:{\mathbb {R}}^d\times {\mathbb {R}}^d \rightarrow {\mathbb {R}}$; see, for example [30]. Hence, SVGD (which is based on the SVGF), in this context, while being deterministic only provides a discretization of a constant-order approximation to the Wasserstein gradient flow due to the presence of the kernel integral operator. Indeed, if $\text {supp}(\mu _t)={\mathbb {R}}^d$ and k is a bounded continuous translation invariant characteristic kernel [41] on ${\mathbb {R}}^d$ (e.g., Gaussian, Laplacian kernels), then

$$\begin{aligned} \Vert {\mathcal {T}}_{k,\mu _t}-I\Vert _{\text {op}}&=\sup \{\Vert {\mathcal {T}}_{k,\mu _t}f-f\Vert _{L^d_2(\mu _t)}:\Vert f\Vert _{L^d_2(\mu _t)}=1\}\ge \Vert {\mathcal {T}}_{k,\mu _t}{\textbf{1}}-{\textbf{1}}\Vert _{L^d_2(\mu _t)} \\&\ge \Vert 1- \smallint k(\cdot ,x)\mu _t(x)\,dx\Vert _{L_2(\mu _t)}>0, \end{aligned}$$

where ${\textbf{1}}=(1,{\ldots },1)^\intercal \in \mathbb {R}^d$. This shows that the order of the error is crucially dependent on the choice of the kernel k.

To overcome the above issue with the SVGF, in this work, we propose the Regularized Stein Variational Gradient Flow (R-SVGF). To motivate the proposed flow, we first note that the Wasserstein gradient $\nabla \log (\mu _t/\pi )$ lives in $L^d_2(\mu _t)$, while the kernelized Wasserstein gradient ${\mathcal {T}}_{k,\mu _t}\nabla \log (\mu _t/\pi )$ morally lives in the reproducing kernel Hilbert space^{Footnote 3}${\mathcal {H}}^d_k \subset L^d_2(\mu _t)$. If $\nabla \log (\mu _t/\pi )\in \overline{\text {Ran}({\mathcal {T}}_{k,\mu _t})}$, then it is easy to verify that

$$\begin{aligned} \Vert ((1- \nu ){\mathcal {T}}_{k,\mu _t}+\nu I)^{-1}{\mathcal {T}}_{k,\mu _t}\nabla \log (\mu _t/\pi )- \nabla \log (\mu _t/\pi )\Vert _{L_2^d(\mu _t)}\rightarrow 0,\ \text {as}\ \nu \rightarrow 0. \end{aligned}$$

Additionally, if $\nabla \log (\mu _t/\pi ) $ is sufficiently smooth, i.e., there exists $\gamma \in \left( 0,\frac{1}{2}\right] $ such that $\nabla \log (\mu _t/\pi ) = {\mathcal {T}}^\gamma _{k,\mu _t} h$, for some $h\in L_2^d(\mu _t)$ (see, for example, [15]), then

$$\begin{aligned}{} & {} \Vert ((1- \nu ){\mathcal {T}}_{k,\mu _t}+\nu I)^{-1}{\mathcal {T}}_{k,\mu _t}\nabla \log (\mu _t/\pi )- \nabla \log (\mu _t/\pi )\Vert _{L_2^d(\mu _t)}=O(\nu ^{2\gamma }),\ \text {as}\ \nu \rightarrow 0. \end{aligned}$$

In other words, $ ((1-\nu ){\mathcal {T}}_{k,\mu _t}+\nu I)^{-1}{\mathcal {T}}_{k,\mu _t}\nabla \log (\mu _t/\pi )$ is a good approximation to $\nabla \log (\mu _t/\pi )$ for small $\nu $. With this motivation, we propose the following R-SVGF given by

$$\begin{aligned} \partial _t \mu _t = \nabla \cdot \left( \mu _t ~ \left( (1-\nu ){\mathcal {T}}_{k,\mu _t}+\nu I \right) ^{-1} {\mathcal {T}}_{k,\mu _t}\left( \nabla \log \frac{\mu _t}{\pi } \right) \right) , \end{aligned}$$

(4)

for some regularization parameter $\nu \in (0,1]$, where R-SVGF arbitrarily approximates the WGF as $\nu \rightarrow 0$. It is important to note that in the case of $\gamma =1/2$, we have $\nabla \log (\mu _t/\pi ) \in {\mathcal {H}}_k^d$, yet, (3) suffers from the drawback of providing only a constant-order approximation to (2).

1.1 Summary of Contributions

Our contributions in this work are as follows:

(1)
We propose the Regularized SVGF (R-SVGF) that interpolates between the Wasserstein gradient flow and the SVGF. The advantage of the proposed flow is that one could obtain an implementable space-time discretization as long as the regularization parameter is bounded away from zero. The main intuition behind the proposed flow is to pick an appropriately small regularization parameter so that we could arbitrarily approximate the WGF (Theorems 1 and 3).
(2)
For the R-SVGF, we provide rates of convergence to the equilibrium density in two cases: (i) in the Fisher Information metric under no further assumptions on the target $\pi \in {\mathcal {P}}({\mathbb {R}}^d)$ (Theorem 2) and (ii) in the KL-divergence metric under an LSI assumption on the target (Theorem 4). We also establish similar results for the time-discretized R-SVGF (Theorems 5 and 6).
(3)
We characterize the existence and uniqueness (Theorem 7), and stability (Theorem 8) of solutions to the R-SVGF.
(4)
We provide preliminary numerical experiments demonstrating the advantage of the space-time discretization of the R-SVGF, which we call the Regularized stein variational gradient descent (R-SVGD) algorithm, over the standard SVGD algorithm.

1.2 Organization

The rest of the paper is organized as follows. In Sect. 1.3, we introduce the notations used in the rest of the paper. In Sect. 2, we provide the preliminaries on reproducing kernel Hilbert spaces required for our work. In Sect. 3, we introduce the R-SVGF, along with the notion of regularized Stein-Fisher information, required for our analysis. Due to the technical nature of the proofs, we postpone the results on existence and uniqueness of the R-SVGF, and related stability results respectively to Sects. 5 and 6. In Sect. 4, we provide convergence results on the R-SVGF and its time-discretized version. We conclude in Sect. 7 with a space-time discretization which provides a practically implementable algorithm, and provide preliminary empirical results.

1.3 Notations

We use the following notations throughout this work:

For a matrix, $\Vert \cdot \Vert _2$ denotes the matrix 2-norm (spectral norm) and $\Vert \cdot \Vert _{HS}$ denotes the Hilbert-Schmidt norm which is defined as $\Vert A \Vert _{HS}^2=\sum _{i,j=1}^d |a_{ij}|^2$ for any matrix $A=(a_{ij})_{i,j\in [d]}$.
The term id denotes the $d\times d$ identity matrix. $I_d$ corresponds to the identity operator in the RKHS. I corresponds to the identity operator in $L_2(\mu )$.
$\left( L_\infty (\mathbb {R}^d),\left\Vert \cdot \right\Vert _{L_\infty (\mathbb {R}^d)}\right) $ denotes the space of all essentially bounded measurable functions on $\mathbb {R}^d$ with $\left\Vert f \right\Vert _{L_\infty (\mathbb {R}^d)}:= \inf \{ C: |f(x)|\le C \ \text {for almost every }x\in \mathbb {R}^d \} $ for any $f\in L_\infty (\mathbb {R}^d)$.
$\mathcal {C}(\mathbb {R}^d;\mathbb {R}^d)$ is the space of all $\mathbb {R}^d$-valued continuous functions on $\mathbb {R}^d$.
$\mathcal {C}_0^\infty ([0,\infty )\times \mathbb {R}^d)$ is the space of all $\mathbb {R}$-valued measurable functions on $[0,\infty )\times \mathbb {R}^d$ that vanish at infinity, i.e., for any $f\in \mathcal {C}_0^\infty ([0,\infty )\times \mathbb {R}^d)$, $f(t,x)\rightarrow 0$ as $t\rightarrow \infty $ and $f(t,x)\rightarrow 0$ as $x\rightarrow \infty $.
For any function space $\mathcal {H}$ on $\mathbb {R}^d$, $\mathcal {C}([0,T];\mathcal {H})$ is the space of functions f such that for any fixed $t\in [0,T]$, $f(t,\cdot )\in \mathcal {H}$ and for any fixed $x\in \mathbb {R}^d$, $f(\cdot ,x)$ is a continuous function on [0, T]. $\mathcal {C}^1([0,T];\mathcal {H})$ is the space of functions f such that for any fixed $t\in [0,T]$, $f(t,\cdot )\in \mathcal {H}$ and for any fixed $x\in \mathbb {R}^d$, $f(\cdot ,x)$ is a continuous function with a continuous first order derivative on [0, T].
Let $\mathcal {P}(\mathbb {R}^d)$ denote the space of all probability densities (with respect to the Lebesgue measure) on $\mathbb {R}^d$ that are twice continuously differentiable with full supports on $\mathbb {R}^d$ and finite second moments.
For any $\mu _1,\mu _2\in \mathcal {P}(\mathbb {R}^d)$ and any $p\ge 1$, $\mathcal {W}_p(\mu _1,\mu _2)$ denotes the Wasserstein-p distance between $\mu _1$ and $\mu _2$ defined as $\mathcal {W}_p(\mu _1,\mu _2){:}{=}\inf _{X\sim \mu _1,Y\sim \mu _2} \mathbb {E}[|X-Y|^p]^{\frac{1}{p}}$.
For any $\mu \in \mathcal {P}(\mathbb {R}^d)$, $\left( L_2(\mu ), \left\Vert \cdot \right\Vert _{L_2(\mu )} \right) $ is the space of all $\mu $-square integrable measurable function on $\mathbb {R}^d$ with $\left\Vert f\right\Vert _{L_2(\mu )}^2:=\int _{\mathbb {R}^d} |f(x)|^2 \mu (x) dx$.
For any function space ${\mathcal {H}}$, we say $f\in \mathcal {H}^d$ if $f=[f_1,\cdots ,f_d]^\intercal $ such that $f_i\in \mathcal {H}$ for all $i\in [d]$. If the function space $(\mathcal {H},\left\Vert \cdot \right\Vert _{\mathcal {H}})$ is further a Hilbert space, $(\mathcal {H}^d, \left\Vert \cdot \right\Vert _{\mathcal {H}^d})$ denotes the Hilbert space of all vector valued functions whose entries are in $(\mathcal {H},\left\Vert \cdot \right\Vert _{\mathcal {H}})$ with $\left\Vert f \right\Vert _{\mathcal {H}^d}^2{:}{=}\sum _{i=1}^d \left\Vert f_i \right\Vert _{\mathcal {H}}^2$ for any $f=[f_1,\cdots ,f_d]^\intercal \in \mathcal {H}^d$.
Let $\left( {\mathcal {H}},\left\Vert \cdot \right\Vert _{\mathcal {H}}\right) $ and $\left( {\mathcal {G}},\left\Vert \cdot \right\Vert _{\mathcal {G}}\right) $ be two function spaces. For an operator $A:{\mathcal {H}} \rightarrow {\mathcal {G}} $, we denote the adjoint operator of A by $A^*$. We denote the operator norm by $\Vert A\Vert _{{\mathcal {H}} \rightarrow {\mathcal {G}}}$, which is defined as $\left\Vert A \right\Vert _{\mathcal {H}\rightarrow \mathcal {G}}:=\sup _{\left\Vert u \right\Vert _{\mathcal {H}}\le 1} \left\Vert A u \right\Vert _{\mathcal {G}}$. When we don’t emphasize the spaces, we denote the operator norm of A by $\left\Vert A \right\Vert _{\text {op}}$ for simplicity.
Let $\left( {\mathcal {H}},\Vert \cdot \Vert _{{\mathcal {H}}} \right) $ and $\left( {\mathcal {G}},\Vert \cdot \Vert _{{\mathcal {G}}} \right) $ be two Hilbert spaces. For an operator $A:{\mathcal {H}}\rightarrow {\mathcal {G}}$, we denote the Hilbert-Schmidt norm by $\Vert A \Vert _{HS}$ which is defined as $\Vert A \Vert _{HS}^2:=\sum _{i\in I} \Vert A e_i\Vert _{{{\mathcal {G}}}}^2 $ where $\{e_i\}_{i\in I}$ is an orthonormal basis of ${\mathcal {H}}$. We denote the nuclear norm by $\Vert A\Vert _{nuc}$ which is defined as $\left\Vert A \right\Vert _{nuc}:=\sum _{i\in I} \langle (A ^* A)^{\frac{1}{2}}e_i, e_i \rangle _{\mathcal {H}} $ where $\{e_i\}_{i\in I}$ is an orthonormal basis of ${\mathcal {H}}$.
For a smooth function $f:\mathbb {R}^d\times \mathbb {R}^d\rightarrow \mathbb {R}$, $\nabla _1 f $ denotes the gradient of f in the first variable and $\nabla _2 f $ denotes the gradient of f in the second variable.
For a map $\phi : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d$, $\phi _i$ denotes the i-th component of the function value and $\nabla \phi $ denotes the Jacobian, i.e., $(\nabla \phi )_{ij} = \partial _j\phi _i $ for all $i,j\in [d]$.
$T_{\#}\rho $ represents the push-forward of the density $\rho $ under a map T.
$\langle \cdot , \cdot \rangle _{\mathcal {H}}$ denotes inner-product in the Hilbert space $\mathcal {H}$. $\langle \cdot , \cdot \rangle $ denotes inner-product in the Euclidean space $\mathbb {R}^d$.

2 Preliminaries on Reproducing Kernel Hilbert Space

In this section, we introduce some properties of RKHS which would be used later in the formulation and analysis of R-SVGF. We refer the reader to [6, 33, 42] for the basics of RKHS. We let $\mathcal {H}_k$ be a separable RKHS over $\mathbb {R}^d$ with the reproducing kernel $k:\mathbb {R}^d\times \mathbb {R}^d\rightarrow \mathbb {R}_{>0}$ and with $\Vert \cdot \Vert _{\mathcal {H}_k}$ denoting the associated RKHS norm.

We make the following assumption on the kernel function k throughout the paper.

Assumption A1

The kernel function $k:\mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}$ is symmetric, strictly positive definite and continuous.

Additional assumptions on the kernel will be introduced when they are required. The following results are essentially based on [42, Lemma 4.23, and Theorems 4.26 and 4.27].

Proposition 1

[ [42]] The following holds.

(i)
The kernel function k is bounded if and only if every $f\in \mathcal {H}_k$ is bounded. Moreover, the inclusion $\iota _d:\mathcal {H}_k\rightarrow L_\infty (\mathbb {R}^d)$ is continuous and $\left\Vert \iota _d \right\Vert _{\mathcal {H}_k\rightarrow L_\infty (\mathbb {R}^d)}=\left\Vert k \right\Vert _\infty $, where $\left\Vert k \right\Vert _\infty :=\sup _{x\in \mathbb {R}^d} \sqrt{k(x,x)} $.
(ii)
Let $\mu $ be a $\sigma $-finite measure on $\mathbb {R}^d$ and k be a measurable kernel. Assume that
$$\begin{aligned} \left\Vert k \right\Vert _{L_2(\mu )}:=\left( \int _{\mathbb {R}^d} {k(x,x)} d\mu (x) \right) ^{\frac{1}{2}}<\infty . \end{aligned}$$
Then $\mathcal {H}_k$ consists of 2-integrable functions and the inclusion $\iota _{k,\mu }: \mathcal {H}_k\rightarrow L_2(\mu )$ is continuous with $\left\Vert \iota _{k,\mu }\right\Vert _{\mathcal {H}_k\rightarrow L_2(\mu ) }\le \left\Vert k \right\Vert _{L_2(\mu )} $. Moreover, the adjoint of this inclusion is the operator $\iota _{k,\mu }^*:L_2(\mu )\rightarrow \mathcal {H}_k $ defined by
$$\begin{aligned} \iota _{k,\mu }^* g(x):=\int _{\mathbb {R}^d} k(x,y) g(y) d\mu (y) ,\qquad g\in L_2(\mu ),\ x\in \mathbb {R}^d. \end{aligned}$$
(iii)
Under the assumptions in (ii), $\mathcal {H}_k$ is dense in $L_2(\mu )$ if and only if $\iota _{k,\mu }^*:L_2(\mu )\rightarrow \mathcal {H}_k$ is injective. Alternatively, $\iota _{k,\mu }^*:L_2(\mu )\rightarrow \mathcal {H}_k$ has a dense image if and only if $\iota _{k,\mu }:\mathcal {H}_k\rightarrow L_2(\mu )$ is injective.
(iv)
Under the assumptions in (ii), $\iota _{k,\mu }:\mathcal {H}_k\rightarrow L_2(\mu )$ is a Hilbert-Schmidt operator with $\left\Vert \iota _{k,\mu } \right\Vert _{HS}=\left\Vert k \right\Vert _{L_2(\mu )}$. Moreover, the integral operator $\mathcal {T}_{k,\mu }=\iota _{k,\mu }\iota _{k,\mu }^*:L_2(\mu )\rightarrow L_2(\mu )$ is compact, positive, self-adjoint, and nuclear with $\left\Vert \mathcal {T}_{k,\mu } \right\Vert _{nuc}=\left\Vert \iota _{k,\mu } \right\Vert _{HS}=\left\Vert k \right\Vert _{L_2(\mu )}$.

When $f\in \mathcal {H}_k^d$ with $f=[f_1,\cdots , f_d]^\intercal $ and $g\in \mathcal {H}_k$, we define $\langle f,g \rangle _{\mathcal {H}_k}$ as a vector in $\mathbb {R}^d$ and $\left( \langle f,g \rangle _{\mathcal {H}_k} \right) _i=\langle f_i,g \rangle _{\mathcal {H}_k}$ for all $i\in [d]$. When $f\in L_2^d(\mu )$ with $f=[f_1,\cdots , f_d]^\intercal $ and $g\in L_2(\mu )$, we define $\langle f,g \rangle _{L_2(\mu )}$ as a vector in $\mathbb {R}^d$ and $\left( \langle f,g \rangle _{L_2(\mu )} \right) _i=\langle f_i,g \rangle _{L_2(\mu )}$ for all $i\in [d]$. Note also that $\mathcal {T}_{k,\mu }=\iota _{k,\mu }\iota ^*_{k,\mu }$ and $\text {Ran}({\mathcal {T}_{k,\mu }^{1/2}})={\mathcal {H}}_k\subset L_2(\mu )$. We refer the interested reader to [15] for more details.

Finally, we remark that by letting $(\lambda _i,e_i)_{i=1}^\infty $ be the set of eigenvalues and eigenfunctions of the operator ${\mathcal {T}_{k,\mu }}$ where $\lambda _1\ge \lambda _2\ge \cdots >0$ and $(e_i)_{i=1}^\infty $ form an orthonormal system in $\text {Ran}(\mathcal {T}_{k,\mu })$, we have the following spectral representation that, for all $f\in L_2(\mu )$,

$$\begin{aligned} \begin{aligned} \mathcal {T}_{k,\mu }f=\sum _{i=1}^\infty \lambda _i \left\langle f, e_i \right\rangle _{L_2(\mu )} e_i. \end{aligned} \end{aligned}$$

(5)

Computing the spectral representation, in general for any given $\mu $ and kernel k is a non-trivial task. Results are only known on a case-by-case basis; see, for example, [2, 8, 31, 39]. However, we use the decomposition only in our analysis. For the purely practical algorithm that we describe eventually in Sect. 7, we do not need to know the decomposition explicitly.

Remark 1

Strictly speaking, the above notation implicitly assumes that the operator ${\mathcal {T}}_{k,\mu }$ has a trivial null space, in which case $\overline{\text {Ran}({\mathcal {T}}_{k,\mu })} \equiv L_2(\mu )$ and hence the eigenfunctions $(e_i)_{i=1}^\infty $ form an orthonormal basis to $L_2(\mu )$. However, our analysis does not require this condition on ${\mathcal {T}}_{k,\mu }$. In particular, if ${\mathcal {T}}_{k,\mu }$ has a non-trivial null-space, then $\overline{\text {Ran}({\mathcal {T}}_{k,\mu })} \subset L_2(\mu )$. In this case, our analysis still holds true. For example, with a slight abuse of notation, if we let $e_i$, for certain values of i, also denote the basis of the null-space of ${\mathcal {T}}_{k,\mu }$, conclusions similar to our results hold.

3 Regularized SVGF

We now introduce the formulation of the Regularized-SVGF and discuss its connection with the SVGF and the WGF. Recall that in the mean-field limit, the SVGF in (3) only provides a constant order approximation to the WGF in (2), due to the presence of the operator ${\mathcal {T}}_{k,\mu }$. As the operator ${\mathcal {T}}_{k,\mu }$ is not invertible, we seek to obtain a regularized inverse so that we end up with the following Regularized-SVGF, as in (4), for some regularization parameter $\nu \in (0,1]$. Note in particular that, as $\nu \rightarrow 0$, the Regularized-SVGF gets arbitrarily close to the WGF. Our goal in this section is to derive the above-mentioned R-SVGF from first principles.

The central operator required in our formulation is the following Stein operator, which is defined for all $p\in \mathcal {P}(\mathbb {R}^d)$, and for all smooth maps $\phi :\mathbb {R}^d\rightarrow \mathbb {R}^d$, as

$$\begin{aligned} \mathcal {A}_p\phi (x)= \phi (x) \otimes \nabla \log p(x)+\nabla \phi (x), \end{aligned}$$

where $\otimes $ denotes the outer-product. Generally speaking, the Stein operator can also be defined for some densities that are outside of $\mathcal {P}(\mathbb {R}^d)$. We restrict to $p\in \mathcal {P}(\mathbb {R}^d)$ in the paper for the purpose of our analysis. Now, the Wasserstein gradient flow in (2) could be thought of as follows. Consider moving a particle $x \sim \rho $ (for some $\rho \in {\mathcal {P}}({\mathbb {R}}^d)$) based on the mapping $x \mapsto T(x){:}{=}x+h\phi (x)$, where $h>0$ is a step-size parameter, and $\phi $ is a vector-field chosen so that the KL-divergence between the pushforward of $\rho $ according to T, denoted as $T_{\#}\rho $, and the target density $\pi $ in minimal. Liu and Wang [27, Theorem 3.1], showed that

$$\begin{aligned} \nabla _h \textsc {KL}(T_{\#}\rho |\pi )|_{h=0}=-\mathbb {E}_{x\sim \rho }[\text {trace}(\mathcal {A}_\pi \phi (x))]. \end{aligned}$$

We also refer to [23] for an earlier version of the same result. Based on this observation, if we try to find the vector-field $\phi $ in the unit-ball of $L_2^d(\rho )$ that maximizes the quantity $[\mathbb {E}_{x\sim \rho }[\text {trace}(\mathcal {A}_\pi \phi (x) )]]^2$, a straight-forward calculation based on integration-by-parts, results in the optimal $\phi $ being the Wasserstein gradient $\nabla \log \frac{\rho }{\pi }$. To have a practical implementation, [27] considered maximizing $[\mathbb {E}_{x\sim \rho }[\text {trace}(\mathcal {A}_\pi \phi (x) )]]^2$ over the unit-ball in the RKHS $\mathcal {H}_k^d$, which results in the optimal vector-field being equal to ${\mathcal {T}}_{k,\rho }\nabla \log \frac{\rho }{\pi }$, and correspondingly results in the SVGF in (3).

In this work, we propose to find the vector field $\phi $ that maximizes $\left[ \mathbb {E}_{x\sim \rho } \left[ \text {trace}(\mathcal {A}_\pi \phi (x))\right] \right] ^2$ over the unit-ball with respect to an interpolated norm between $L_2^d(\rho )$ and $\mathcal {H}_k^d$. Specifically, the interpolation norm that we consider is of the form $\nu \left\Vert \cdot \right\Vert _{\mathcal {H}_k^d}^2+(1-\nu )\left\Vert \cdot \right\Vert _{L_2^d(\rho )}^2$, for some regularization parameter $\nu \in (0,1]$, which trades-off between $\Vert \cdot \Vert ^2_{{\mathcal {H}}^d_k}$ and $\Vert \cdot \Vert ^2_{L^d_2(\rho )}$. We also remark here that a similar idea has been leveraged in the context of RKHS-based statistical hypothesis testing [5]. Formally, for $\rho , \pi \in \mathcal {P}(\mathbb {R}^d)$, we consider the following optimization problem.

$$\begin{aligned} \begin{aligned} S(\rho , \pi ):=\max _{\phi \in \mathcal {H}_k^d} \left\{ \left[ \mathbb {E}_{x\sim {\rho }} \left[ \text {trace}(\mathcal {A}_\pi \phi (x))\right] \right] ^2 \qquad \text {such that} \quad \nu \left\Vert \phi \right\Vert _{\mathcal {H}_k^d}^2+(1-\nu )\left\Vert \phi \right\Vert _{L_2^d(\rho )}^2\le 1 \right\} . \end{aligned} \end{aligned}$$

For any $\rho \in \mathcal {P}(\mathbb {R}^d)$, the optimal vector field, $\phi $ that minimizes $\textsc {KL}(T_{\#}\rho |\pi )$ can be described via the following result. To make the statements more consistent with the analysis (and proofs), we will denote $\mathcal {T}_{k,\cdot }$ explicitly as $\iota _{k,\cdot }\iota ^*_{k,\cdot }$ in the rest of the paper. Before we proceed, we also recall that the Fisher information between two densities $\rho , \pi \in {\mathcal {P}}({\mathbb {R}}^d)$, is defined as

$$\begin{aligned} I(\rho |\pi ){:}{=}\left\Vert \nabla \log \frac{\rho }{\pi } \right\Vert _{L_2(\rho )}^2 =\sum _{i=1}^\infty \left| \left\langle \nabla \log \frac{\rho }{\pi }, e_i \right\rangle _{L_2(\rho )}\right| ^2, \end{aligned}$$

(6)

with $(e_i)_{i=1}^\infty $ being an orthonormal basis to $L_2(\rho )$.

Proposition 2

Let $T(x)=x+h\phi (x)$ and $T_{\#}\rho (z)$ be the density of $z=T(x)$ for any $\rho \in {\mathcal {P}}({\mathbb {R}}^d)$. For $\nu \in (0,1]$, define

$$\begin{aligned} \mathcal {B}:=\{ \phi \in \mathcal {H}_k^d: \nu \left\Vert \phi \right\Vert _{\mathcal {H}_k^d}^2+(1-\nu )\left\Vert \phi \right\Vert _{L_2^d(\rho )}^2 \le 1 \}. \end{aligned}$$

Let $\pi \propto e^{-V}$ and $\pi \in \mathcal {P}(\mathbb {R}^d)$. If $I(\rho |\pi )<\infty $ and k satisfies Assumption A1 with $\int k(x,x)\rho (x)dx<\infty $, then the direction of steepest descent in $\mathcal {B}$ that maximizes

$$\begin{aligned} -\nabla _h KL (T_{\#}\rho |\pi )|_{h=0} \end{aligned}$$

is given by

$$\begin{aligned} \begin{aligned} \phi _{\rho ,\pi }^*(\cdot )\propto \left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-1}\mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )], \end{aligned} \end{aligned}$$

where $\iota _{k,\rho }:\mathcal {H}_k^d\rightarrow L_2^d(\rho )$ is the inclusion operator and $\iota _{k,\rho }^*$ is its adjoint, as in Proposition 1. Furthermore, under the optimal vector field $\phi _{\rho ,\pi }^*$, we have $-\nabla _h KL (T_{\#}\rho |\pi )|_{h=0}=S(\rho ,\pi )$.

Proof

First note that according to [27, Theorem 3.1], we have

$$\begin{aligned} \nabla _h \textsc {KL}(T_{\#}\rho |\pi )|_{h=0}=-\mathbb {E}_{x\sim {\rho }}[\text {trace}(\mathcal {A}_\pi \phi (x) )]. \end{aligned}$$

Therefore, we have

$$\begin{aligned} \phi _{\rho ,\pi }^*=\underset{\phi \in \mathcal {H}_k^d}{\arg \max } \left\{ [\mathbb {E}_{x\sim \rho } [\text {trace}(\mathcal {A}_\pi \phi (x))] ]^2 \qquad \text {such that} \quad \nu \left\Vert \phi \right\Vert _{\mathcal {H}_k^d}^2+(1-\nu )\left\Vert \phi \right\Vert _{L_2^d(\rho )}^2\le 1 \right\} . \end{aligned}$$

Next, observe that we have

$$\begin{aligned} \mathbb {E}_{x\sim \rho } [\text {trace}(\mathcal {A}_{\pi }\phi (x))]&=\sum _{i=1}^d \mathbb {E}_{x\sim \rho } [-\partial _i V(x)\phi _i(x)+\partial _i \phi _i(x)]\\&=\sum _{i=1}^d \mathbb {E}_{x\sim \rho } [-\partial _i V(x)\langle \phi _i, k(x,\cdot )\rangle _{\mathcal {H}_k} + \langle \phi _i, \partial _i k(x,\cdot )\rangle _{\mathcal {H}_k} ] \\&=\langle \phi , \mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )] \rangle _{\mathcal {H}_k^d}. \end{aligned}$$

The term $\mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )]$ is finite for all $x\in \mathbb {R}^d$ due to the following inequality:

$$\begin{aligned} |\mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )]|&=|\iota _{k,\rho }^*\nabla \log \frac{\rho }{\pi } (x)|\\&= \Big | \int k(x,y) \nabla \log \frac{\rho (y)}{\pi (y)} \rho (y)dx \Big |\\&\le I(\rho |\pi )^{\frac{1}{2}} \sqrt{k(x,x)}\Big (\int k(y,y)\rho (y)dy\Big )^{\frac{1}{2}}<\infty , \end{aligned}$$

where the first inequality follows from the Cauchy-Schwartz inequality and the reproducing property of the RKHS. The second inequality follows from conditions on k and $\rho $. Meanwhile, the constraint can be written as

$$\begin{aligned} \nu \left\Vert \phi \right\Vert _{\mathcal {H}_k^d}^2+(1-\nu )\left\Vert \phi \right\Vert _{L_2^d(\rho )}^2&=\nu \langle \phi ,\phi \rangle _{\mathcal {H}_k^d}+(1-\nu )\langle \iota _{k,\rho }\phi , \iota _{k,\rho }\phi \rangle _{L_2^d(\rho )} \\&=\langle \left( \nu I_d+(1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }\right) \phi ,\phi \rangle _{\mathcal {H}_k^d}\\&=\left\Vert \left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{\frac{1}{2}} \phi \right\Vert _{\mathcal {H}_k^d}^2, \end{aligned}$$

where $I_d:\mathcal {H}_k\rightarrow \mathcal {H}_k$ is the identity operator. Now, note that $\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{\frac{1}{2}}$ is well-defined since $\iota _{k,\rho }^*\iota _{k,\rho }:\mathcal {H}_k^d\rightarrow \mathcal {H}_k^d$ is positive, compact and self-adjoint. Therefore, based on the above display, the constraint $\{ \phi \in \mathcal {H}_k^d: \nu \left\Vert \phi \right\Vert _{\mathcal {H}_k^d}^2+(1-\nu )\left\Vert \phi \right\Vert _{L_2^d(\rho )}^2\le 1 \}$ is equivalent to

$$\begin{aligned} \{ \phi \in \mathcal {H}_{k}^d: \psi =\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d\right) ^{\frac{1}{2}}\phi \quad and\quad \left\Vert \psi \right\Vert _{\mathcal {H}_k^d}\le 1 \}. \end{aligned}$$

Since the spectrum of $\iota _{k,\rho }^*\iota _{k,\rho }$ is positive and $\nu \in (0,1]$, $(1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d$ is invertible. For all $\phi \in \mathcal {H}_k^d$, there exists a unique $\psi \in \mathcal {H}_k^d$ such that $\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-\frac{1}{2}}\psi =\phi $. Applying this fact along with the equivalent form of the constraint, we have

$$\begin{aligned} \mathbb {E}_{x\sim \rho } (\text {trace}(\mathcal {A}_{\pi }\phi (x)))&=\left\langle \left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d\right) ^{-\frac{1}{2}}\psi , \mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )]\right\rangle _{\mathcal {H}_k^d} \\&=\left\langle \psi , \left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-\frac{1}{2}} \mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )] \right\rangle _{\mathcal {H}_k^d}\\&\le \left\Vert \left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-\frac{1}{2}} \mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )] \right\Vert _{\mathcal {H}_k^d} \end{aligned}$$

where the second identity follows from the fact that $\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-\frac{1}{2}}$ is self-adjoint and the upper bound in the last inequality is achieved when

$$\begin{aligned} \psi ^*\propto \left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-\frac{1}{2}} \mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )], \end{aligned}$$

and the result hence follows. $\square $

With the optimal-vector field as derived above, we consider the following mean-field partial differential equation (PDE) as the R-SVGF:

$$\begin{aligned} \begin{aligned}&\partial _t \rho _t=\nabla \cdot \left( \rho _t\ \iota _{k,\rho _t}\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d \right) ^{-1}\iota _{k,\rho _t}^*\left( \nabla \log \frac{\rho _t}{\pi }\right) \right) . \end{aligned} \end{aligned}$$

(7)

It is important to notice that the R-SVGF interpolates between the SVGF and the WGF. However, the regime of interest for us is when $\nu \rightarrow 0$, as we get arbitrarily close to the WGF. We quantify this statement precisely in the later sections. On the other hand, when $\nu \rightarrow 1$, the R-SVGF becomes the SVGF.

Remark 2

We now make the following remarks about the above result.

(i)
We can alternatively write $\phi _{\rho ,\pi }^*$ from Proposition 2 as
$$\begin{aligned} \phi _{\rho ,\pi }^* \propto -\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-1} \iota _{k,\rho }^* \left( \nabla \log \frac{\rho }{\pi } \right) , \end{aligned}$$
since we have
$$\begin{aligned} \mathbb {E}_{x\sim \rho } [-\nabla V(x)k(x,\cdot )+\nabla k(x,\cdot )] =&\int _{\mathbb {R}^d} k(\cdot , x)\left( -\nabla V(x)-\frac{\nabla \rho (x)}{\rho (x)}\right) \rho (x)dx\\ =&-\iota _{k,\rho }^*\left( \nabla \log \frac{\rho }{\pi }\right) . \end{aligned}$$
(ii)
The operator in (7) has an equivalent expression, as we discuss below. First, we claim that
$$\begin{aligned} \iota _{k,\rho }\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d\right) ^{-1}\iota _{k,\rho }^*=\left( (1-\nu )\iota _{k,\rho }\iota _{k,\rho }^*+\nu I \right) ^{-1}\iota _{k,\rho }\iota _{k,\rho }^*. \end{aligned}$$
To see that, we start with the trivial identity in the first line below and proceed as
$$\begin{aligned}&\left( (1-\nu )\iota _{k,\rho }\iota _{k,\rho }^*+\nu I \right) \iota _{k,\rho }=\iota _{k,\rho }\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ,\\ \implies&\iota _{k,\rho }=\left( (1-\nu )\iota _{k,\rho }\iota _{k,\rho }^*+\nu I \right) ^{-1}\iota _{k,\rho }\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) \\ \implies&\iota _{k,\rho }\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-1}=\left( (1-\nu )\iota _{k,\rho }\iota _{k,\rho }^*+\nu I \right) ^{-1}\iota _{k,\rho } \\ \implies&\iota _{k,\rho }\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-1}\iota _{k,\rho }^*=\left( (1-\nu )\iota _{k,\rho }\iota _{k,\rho }^*+\nu I \right) ^{-1}\iota _{k,\rho }\iota _{k,\rho }^*. \end{aligned}$$
According to this observation, (7) can also be written in the following form
$$\begin{aligned} \begin{aligned} \partial _t \rho _t&=\nabla \cdot \left( \rho _t\ \left( (1-\nu )\iota _{k,\rho _t}\iota _{k,\rho _t}^*+\nu I \right) ^{-1}\iota _{k,\rho _t}\iota _{k,\rho _t}^*\left( \nabla \log \frac{\rho _t}{\pi }\right) \right) \\&= \nabla \cdot \left( \rho _t ~ \left( (1-\nu ){\mathcal {T}}_{k,\rho _t}+\nu I \right) ^{-1} {\mathcal {T}}_{k,\rho _t}\left( \nabla \log \frac{\rho _t}{\pi } \right) \right) ,\\ \end{aligned} \end{aligned}$$
thereby providing the R-SVGF introduced in (4) in Sect. 1.
(iii)
Particle-based spatial discretization. We now describe the spatial discretization of the R-SVGF. Based on the results in Proposition 2 and (ii) in Remark 2, we obtain the following ODE system:
$$\begin{aligned} \left\{ \begin{aligned} \frac{d x_i(t)}{dt}&=-\left( (1-\nu )\iota _{k,\rho _t^N}^*\iota _{k,\rho _t^N}+\nu I_d \right) ^{-1}\\ {}&\left( \frac{1}{N}\sum _{j=1}^N -\nabla _2 k\left( x_i(t),x_j(t)\right) +k\left( x_i(t),x_j(t)\right) \nabla V(x_j(t)) \right) \\ x_i(0)&=x_i^0\in \mathbb {R}^d,\quad i=1,2,\ldots ,N \end{aligned} \right. , \end{aligned}$$
where $\{x_i(t)\}_{i=1}^N$ is the set of N particles. $\rho _t^N=\frac{1}{N}\sum _{j=1}^N \delta _{x_j(t)}$ is the empirical distribution at time t, provides a N-particle spatial discretization of the R-SVGF.
(iv)
Time discretization. We also have the following time-discretization of the R-SVGF. Let $\{h_{n}\}_{n=1}^\infty $ be the sequence of time step-size. We denote the density at the n-th iterate by $\rho ^n$ for all integers $n \ge 1$. Then the time discretization of the R-SVGF can be written as
$$\begin{aligned} \begin{aligned} \rho ^{n+1}=\left( id-h_{n+1} \mathcal {D}_{\nu _{n+1},\rho ^n} \nabla \log \frac{\rho ^n}{\pi }\right) _{\#\rho ^n}, \end{aligned} \end{aligned}$$
(8)
where $\mathcal {D}_{\nu _n,\rho ^n}=\left( (1-\nu _n)\iota _{k,\rho ^n}\iota ^*_{k,\rho ^n}+\nu _n I_d \right) ^{-1}\iota _{k,\rho ^n}\iota _{k,\rho ^n}^*$.
(v)
The parameter $\nu $ can also be made to be dependent on t or n; in fact, in our analysis, we pick a time-varying regularization parameter.

4 Convergence Results in Continuous and Discrete Time

Our goal in this section is to derive convergence guarantees for the R-SVGF. Before we proceed, we introduce the notion of Regularized Stein-Fisher information (or Regularized Kernel Stein Discrepancy).

4.1 Regularized Stein-Fisher Information and its Properties

Related to the Fisher information defined in (6), several works, for example [19, 24, 37], used the notion of Stein-Fisher Information to understand the convergence properties of the SVGD algorithm. The Stein-Fisher information was introduced in [13, 22, 26] under the name Kernel Stein Discrepancy. It is defined as

$$\begin{aligned} I_{Stein}(\rho |\pi ):&=\left\Vert \iota ^*_{k,\rho }\nabla \log \frac{\rho }{\pi } \right\Vert _{{\mathcal {H}}^d_k}^2=\left\langle \nabla \log \frac{\rho }{\pi }, \iota _{k,\rho }\iota _{k,\rho }^* \nabla \log \frac{\rho }{\pi } \right\rangle _{L^d_2(\rho )}\\&=\sum _{i=1}^\infty \lambda _i \left| \left\langle \nabla \log \frac{\rho }{\pi }, e_i \right\rangle _{L_2(\rho )}\right| ^2, \end{aligned}$$

where $(\lambda _i,e_i)_{i=1}^\infty $ are the set of eigenvalues and eigenvectors of the operator $\iota _{k,\rho }\iota _{k,\rho }^*$, with $\lambda _1\ge \lambda _2\ge \cdots >0$. A drawback of the Stein-Fisher information is that it is a weaker metric, for example in comparison to the Fisher information metric in (6); see [21, 22, 40]. Indeed, comparing (6) and the above definition for the Stein-Fisher information, it is immediately clear that the Stein-Fisher information is severely restrictive, in particular when the eigenvalues of the chosen RKHS decay fast. To counter this effect, we introduce the following regularized Stein-Fisher information and show that when the regularization parameter is chosen appropriately, the regularized Stein-Fisher information upper and lower bounds the Fisher information.

Definition 1

[Regularized Stein-Fisher Information] For any ${\rho \in \mathcal {P}(\mathbb {R}^d)}$, the regularized Stein Fisher information from $\rho $ to $\pi $, denoted as $I_{\nu ,Stein}(\rho |\pi )$, is defined as

$$\begin{aligned} I_{\nu ,Stein}(\rho |\pi ):=\left\langle \iota _{k,\rho }^* \nabla \log \frac{\rho }{\pi }, \left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) ^{-1} \iota _{k,\rho }^* \nabla \log \frac{\rho }{\pi }\right\rangle _{\mathcal {H}_k^d}. \end{aligned}$$

(9)

The regularized Stein Fisher information in (9) is well-defined because the operator

$$\begin{aligned} (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d: \mathcal {H}_k^d\rightarrow \mathcal {H}_k^d \end{aligned}$$

is positive and for any $f\in \mathcal {H}_k^d$, $\left( (1-\nu )\iota _{k,\rho }^*\iota _{k,\rho }+\nu I_d \right) f =0$ if and only if $f=0$.

Remark 3

The regularized Stein Fisher information has the following alternative representation:

$$\begin{aligned} I_{\nu ,Stein}(\rho |\pi )=\sum _{i=1}^\infty \frac{\lambda _i}{(1-\nu )\lambda _i+\nu } \left| \left\langle \nabla \log \frac{\rho }{\pi }, e_i \right\rangle _{L_2(\rho )}\right| ^2. \end{aligned}$$

(10)

For $\nu >0$, with the fact that $\lambda _i$ decreases to zero as $i\rightarrow \infty $, the regularized Stein Fisher information and the Stein Fisher information both encode the spectral decay information of $\iota _{k,\rho }\iota _{k,\rho }^*$. However, note that the regularized Stein Fisher information tends to the Fisher information as $\nu \rightarrow 0$. Hypothetically speaking, if $\nu $ is set to zero, then the regularized Stein Fisher information actually becomes the Fisher information. In our analysis, we will take advantage of the relation between the regularized Stein Fisher information and the Fisher information, while studying the convergence properties of R-SVGF under Log-Sobolev inequality assumptions on the target $\pi $. A precise relation between the regularized Stein Fisher information and the Fisher information is stated in the following result. Before stating the result, we introduce the following notation for convenience. For $\gamma \in (0,\frac{1}{2}]$, we denote the pre-image of $\nabla \log \frac{\rho }{\pi }\in L^d_2(\rho )$ under $(\iota _{k,\rho }\iota _{k,\rho }^*)^{\gamma }$ as

$$\begin{aligned} {\mathfrak {I}}(\rho , \gamma ):=(\iota _{k,\rho }\iota _{k,\rho }^*)^{-\gamma }\nabla \log \frac{\rho }{\pi }. \end{aligned}$$

Note that $\Vert {\mathfrak {I}}(\rho , \gamma )\Vert _{L^d_2(\rho )}$ is finite if and only if $\nabla \log \frac{\rho }{\pi }\in \text {Ran}((\iota _{k,\rho }\iota _{k,\rho }^*)^{\gamma })$.

Proposition 3

[Equivalence relation between $I(\rho |\pi )$ and $I_{\nu ,Stein}(\rho |\pi )$] Let $\rho ,\pi \in \mathcal {P}(\mathbb {R}^d)$ and suppose there exists $\gamma \in (0,\frac{1}{2}]$ such that $\left\Vert {\mathfrak {I}}(\rho , \gamma ) \right\Vert _{L_2^d(\rho )}<\infty .$ If the regularization parameter is chosen to satisfy the following condition,

$$\begin{aligned} \frac{\nu }{1-\nu }\le \left( \frac{I(\rho |\pi )}{2\left\Vert {\mathfrak {I}}(\rho , \gamma ) \right\Vert _{L_2^d(\rho )}^2} \right) ^{\frac{1}{2\gamma }}, \end{aligned}$$

(11)

then we have that

$$\begin{aligned} \frac{1}{2}(1-\nu )^{-1} I(\rho |\pi )\le I_{\nu ,Stein}(\rho |\pi )\le (1-\nu )^{-1}I(\rho |\pi ). \end{aligned}$$

Proof of Proposition 3

According to (10), we have

$$\begin{aligned} I_{\nu ,Stein}(\rho |\pi )&=\sum _{i=1}^\infty \frac{\lambda _i}{(1-\nu )\lambda _i+\nu } \left| \left\langle \nabla \log \frac{\rho }{\pi }, e_i \right\rangle _{L_2(\rho )}\right| ^2 \\&\le (1-\nu )^{-1}\sum _{i=1}^\infty \left| \left\langle \nabla \log \frac{\rho }{\pi }, e_i \right\rangle _{L_2(\rho )}\right| ^2\le (1-\nu )^{-1}I(\mu |\pi ). \end{aligned}$$

On the other hand, since $\left\Vert {\mathfrak {I}}(\rho , \gamma ) \right\Vert _{L_2^d(\rho )}<\infty $ for some $\gamma \in (0,\frac{1}{2}]$, there exists $h={\mathfrak {I}}(\rho , \gamma )\in L_2^d(\rho )$ such that

$$\begin{aligned} \nabla \log \frac{\rho }{\pi }=(\iota _{k,\rho }\iota _{k,\rho }^*)^{\gamma }h. \end{aligned}$$

Therefore

$$\begin{aligned} (1-\nu )^{-1}I(\rho |\pi )-I_{\nu ,Stein}(\rho |\pi )&=\sum _{i=1}^\infty \frac{(1-\nu )^{-1}\nu }{(1-\nu )\lambda _i+\nu } \left| \langle (\iota _{k,\rho }\iota _{k,\rho }^*)^{\gamma }h,e_i \rangle _{L_2(\rho )}\right| ^2\nonumber \\&=\sum _{i=1}^\infty \frac{(1-\nu )^{-1}\nu \lambda _i^{2\gamma }}{(1-\nu )\lambda _i+\nu } |\langle h,e_i \rangle _{L_2(\rho )}|^2\nonumber \\&\le (1-\nu )^{-1-2\gamma }\nu ^{2\gamma } \left\Vert {\mathfrak {I}}(\rho , \gamma ) \right\Vert _{L_2^d(\rho )}^2\nonumber \\&\le \frac{1}{2}(1-\nu )^{-1}I(\rho |\pi ), \end{aligned}$$

(12)

where the second to last inequality follows from the fact that

$$\begin{aligned} \sup _i \left( \frac{(1-\nu )^{-1}\nu \lambda _i^{2\gamma }}{(1-\nu )\lambda _i+\nu }\right)&=(1-\nu )^{-1-2\gamma }\nu ^{2\gamma }\sup _i \left( \frac{(1-\nu )\lambda _i}{(1-\nu )\lambda _i+\nu }\right) ^{2\gamma } \left( \frac{\nu }{(1-\nu )\lambda _i+\nu }\right) ^{1-2\gamma }\\&\le (1-\nu )^{-1{-2\gamma }}\nu ^{2\gamma }, \end{aligned}$$

and the last inequality follows from the condition in (11).

4.2 Convergence Results for R-SVGF

4.2.1 Relationship Between R-SVGF and WGF

Assuming the existence and uniqueness of the R-SVGF (see Sect. 5) and WGF (see [23] for sufficient conditions) we now provide the relationship between the R-SVGF and the WGF in various metrics. We first start with the relationship in the Fisher information metric, without any stringent assumptions on the target density (thereby allowing for multi-modal and complex densities that arise in practice). Note that the Fisher information metric corresponds to the first-order stationarity metric for the WGF, obtained by minimizing the KL divergence. This metric has been recently proposed as a meaningful metric to consider in the case of sampling from general non-log-concave densities in [4]. Note in particular under mild conditions on q (e.g., connected support) that having the Fisher information $I(p|q) = 0$ implies $p\equiv q$. However, even when $I(p|q) \le \epsilon $, for some $\epsilon >0$, we have that the modes of the two densities are well-aligned, as argued in [4]. In order to state our next result, we denote by $(\lambda _{i,t},e_{i,t})_{i=1}^\infty $, the set of eigenvalues and eigenvectors of the operator $\iota _{k,\rho _t}\iota _{k,\rho _t}^*$ for any $t\ge 0$, with $\lambda _{1,t}\ge \lambda _{2,t}\ge \cdots >0$.

Theorem 1

[Relation to the WGF in Relative Fisher Information] Let $(\rho _t)$ be the solution to (7) and $(\mu _t)$ be the solution to the WGF, i.e.,

$$\begin{aligned} \left\{ \begin{aligned}&\partial _t \mu = \nabla \cdot \left( \mu \nabla \log \frac{\mu }{\pi } \right) ,\\&\mu (0,\cdot )=\mu _0(\cdot ). \end{aligned} \right. \end{aligned}$$

(13)

For any $t>0$, suppose there exists $\gamma _t\in (0,\frac{1}{2}]$ such that $\left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}<\infty .$ Then, for any initial density $\mu _0 \in {\mathcal {P}}({\mathbb {R}}^d)$, and for any $T\in (0,\infty )$, we have

$$\begin{aligned} \int _0^T I(\rho _t|\mu _t) dt \le \frac{4}{3} \textsc {KL}(\rho _0|\mu _0) +\frac{4}{3} \int _0^T {(\lambda _{1,t}\vee 1)^2}\nu ^{2\gamma _t}(1-\nu )^{-2\gamma _t} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 dt, \end{aligned}$$

(14)

where $\lambda _{1,t}$ is the largest eigenvalue of $\iota _{k,\rho _t}\iota ^*_{k,\rho _t}$ for all $t\ge 0$.

Proof of Theorem 1

First note that we have the following upper bound on $ \frac{d}{dt}\textsc {KL}(\rho _t|\mu _t)$:

$$\begin{aligned}&\frac{d}{dt}\textsc {KL}(\rho _t|\mu _t)\\&\quad =\frac{d}{dt} \int _{\mathbb {R}^d} \log \frac{\rho _t(x)}{\mu _t(x)} \rho _t(x)dx \\&\quad =\int _{\mathbb {R}^d} \partial _t \rho _t(x) \log \frac{\rho _t(x)}{\mu _t(x)} dx+\int _{\mathbb {R}^d} \left( \frac{\partial _t \rho _t(x)}{\mu _t(x)}+\rho _t(x)\partial _t\left( \frac{1}{\mu _t(x)}\right) \right) \frac{\mu _t(x)}{\rho _t(x)}\rho _t(x)dx \\&\quad =\int _{\mathbb {R}^d} \partial _t \rho _t(x) \log \frac{\rho _t(x)}{\mu _t(x)} dx+\int _{\mathbb {R}^d} \partial _t \rho _t(x) dx -\int _{\mathbb {R}^d} \partial _t \mu _t(x) \frac{\rho _t(x)}{\mu _t(x)} dx \\&\quad =-\int _{\mathbb {R}^d} \left\langle \iota _{k,\rho _t}\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^* \nabla \log \frac{\rho _t(x)}{\pi (x)},\nabla \log \frac{\rho _t(x)}{\mu _t(x)} \right\rangle \rho _t(x)dx \\&\qquad + 0+\int _{\mathbb {R}^d} \left\langle \mu _t(x)\nabla \log \frac{\mu _t(x)}{\pi (x)}, \nabla \left( \frac{\rho _t(x)}{\mu _t(x)} \right) \right\rangle dx \\&\quad = -\int _{\mathbb {R}^d} \left\langle \iota _{k,\rho _t}\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^* \nabla \log \frac{\rho _t(x)}{\pi (x)},\nabla \log \frac{\rho _t(x)}{\mu _t(x)} \right\rangle \rho _t(x)dx \\&\qquad + \int _{\mathbb {R}^d} \left\langle \nabla \log \frac{\rho _t(x)}{\mu _t(x)},\nabla \log \frac{\mu _t(x)}{\pi (x)} \right\rangle \rho _t(x)dx \\&\quad =-\int _{\mathbb {R}^d} \left\langle \nabla \log \frac{\rho _t(x)}{\mu _t(x)}, \nabla \log \frac{\rho _t(x)}{\pi (x)}-\nabla \log \frac{\mu _t(x)}{\pi (x)} \right\rangle \rho _t(x)dx\\&\qquad - \int _{\mathbb {R}^d} \left\langle \left( \iota _{k,\rho _t}\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^*-I\right) \nabla \log \frac{\rho _t(x)}{\pi (x)},\nabla \log \frac{\rho _t(x)}{\mu _t(x)} \right\rangle \rho _t(x)dx \\&\quad \le -\int _{\mathbb {R}^d} \left| \nabla \log \frac{\rho _t(x)}{\mu _t(x)} \right| ^2 \rho _t(x)dx\\&\qquad +\frac{1}{4}\int _{\mathbb {R}^d} \left| \nabla \log \frac{\rho _t(x)}{\mu _t(x)} \right| ^2 \rho _t(x)dx\\&\qquad +\left\Vert \left( \iota _{k,\rho _t}\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^*-I\right) \nabla \log \frac{\rho _t}{\pi } \right\Vert _{L_2^d(\rho _t)}^2 \\&\quad =-\frac{3}{4} I(\rho _t|\mu _t)+\sum _{i=1}^\infty \frac{(1-\lambda _{i,t})^2\nu ^2}{\left( (1-\nu )\lambda _{i,t}+\nu \right) ^2} \left| \left\langle \nabla \log \frac{\rho _t}{\pi },e_{i,t} \right\rangle _{L_2(\rho _t)}\right| ^2. \end{aligned}$$

In the above calculation, the fourth equality follows by integration-by-parts. The inequality follows by Young’s inequality for the inner product (i.e., $\langle p,q \rangle \le \frac{1}{2}c|p|^2+\frac{1}{2c}|q|^2$ for any $p,q\in \mathbb {R}^d$) and the last equality follows from the proof of Proposition 3. Since $\nabla \log \frac{\rho _t}{\pi }=(\iota _{k,\rho _t}\iota _{k,\rho _t}^*)^{\gamma _t} h_t$ for some $\gamma _t\in (0,1/2]$ with $h_t:={\mathfrak {I}}(\rho _t, \gamma _t)\in L_2^d(\rho _t)$, we obtain

$$\begin{aligned} \frac{d}{dt}\textsc {KL}(\rho _t|\mu _t)&\le -\frac{3}{4}I(\rho _t|\mu _t)+ \left( \max _{i} (1-\lambda _{i,t})^2 \right) \left( \max _i \frac{\lambda _{i,t} ^{\gamma _t} \nu }{{(1-\nu )}\lambda _{i,t}+\nu }\right) ^2 \left\Vert h_t \right\Vert _{L_2^d(\rho _t)}^2 \\&\le -\frac{3}{4}I(\rho _t|\mu _t)+ {(\lambda _{1,t}\vee 1)^2} \nu ^{2\gamma _t}{(1-\nu )^{-2\gamma _t}} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 , \end{aligned}$$

where the last inequality follows from the facts that

$$\begin{aligned} \max _i (1-\lambda _{i,t})^2\le (\max _i \lambda _{i,t}\vee 1)^2\quad \text {and}\quad \frac{\lambda _{i,t}^{\gamma _t}\nu }{(1-\nu )\lambda _{i,t}+\nu }\le \nu ^{\gamma _t}(1-\nu )^{-\gamma _t}. \end{aligned}$$

Integrating from $t=0$ to $t=T$, we get

$$\begin{aligned}&\textsc {KL}(\rho _T|\mu _T)-\textsc {KL}(\rho _0|\mu _0)\le -\frac{3}{4} \int _0^T I(\rho _t|\mu _t) dt \\&+ \int _0^T {(\lambda _{1,t}\vee 1)^2}\nu ^{2\gamma _t}{(1-\nu )^{-2\gamma _t}} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 dt. \end{aligned}$$

Since KL-divergence is non-negative, (14) is proved.

Remark 4

The sequence $\{\lambda _{1,t}\}_{t\ge 0}$ in Theorem 1 is the largest eigenvalue of $\iota _{k,\rho _t}\iota _{k,\rho _t}^*$ for all $t\ge 0$. Note that it depends on both the kernel k and the solution $(\rho _t)$ to (7). If the kernel function k is assumed to be uniformly bounded, according to Proposition 1, $\lambda _{1,t}^2$ is uniformly upper bounded by $\sup _{x\in \mathbb {R}^d} k(x,x) $ for all $t\ge 0$. For specific choices of kernel function, initial and target distributions, it is required to track $\lambda _{1,t}$ for all $t\ge 0$. We will illustrate this further with an example in Remark 10.

Remark 5

The above result shows that as long as $\rho _0 = \mu _0$, i.e., both the WGF and the R-SVGF are initialized with the same density, and $\nu $ is chosen such that

$$\begin{aligned} T^{-1}\int _0^T {(\lambda _{1,t}\vee 1)^2}\nu ^{2\gamma _t}(1-\nu )^{-2\gamma _t} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 dt \rightarrow 0, \end{aligned}$$

the averaged Fisher information along the path tends to zero. This shows the benefit of regularizing the SVGF – it enables one to closely approximate the WGF with appropriate choice of the regularization parameters.

4.2.2 Convergence to Equilibrium Along the Fisher Information

We now provide results on the convergence to equilibrium along the Fisher information for the R-SVGF. We re-emphasize here that our result provided below holds as long as the target $\pi \in \mathcal {P}(\mathbb {R}^d)$, without additional structural assumptions (via, say, functional inequalities).

Theorem 2

[Convergence of Fisher information] Let $(\rho _t)$ be the solution to (7). For any $t>0$, suppose there exists $\gamma _t\in (0,\frac{1}{2}]$ such that $\left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}<\infty .$ Then

$$\begin{aligned} \int _0^\infty I(\rho _t|\pi ) dt\le (1-\nu )\textsc {KL}(\rho _0|\pi )+\int _0^\infty \nu ^{2\gamma _t}(1-\nu )^{-2\gamma _t} \left\Vert {\mathfrak {I}}({\rho }_t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 dt, \end{aligned}$$

Furthermore, if $\int _0^\infty \nu ^{2\gamma _t}(1-\nu )^{-2\gamma _t} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 dt<\infty $, then we get $I({\bar{\rho }_t}|\pi )\rightarrow 0$ as $t\rightarrow \infty $, where $\bar{\rho }_t{:}{=}\frac{1}{t}\int _0^t \rho _s ds$ is the averaged density of $(\rho _s)_{0\le s\le t}$.

Before proving the above theorem, we introduce a few intermediate results.

Proposition 4

[Decay of the KL-divergence] For the solution $(\rho _t)_{t\ge 0}$ to the PDE (7), it holds that

$$\begin{aligned} \frac{d}{dt}\textsc {KL}(\rho _t|\pi )=-I_{\nu ,Stein}(\rho _t|\pi ), \end{aligned}$$

(15)

and consequently

$$\begin{aligned} \frac{d}{dt}\textsc {KL}(\rho _t|\pi )\le 0. \end{aligned}$$

(16)

Proof of Proposition 4

Note that

$$\begin{aligned} \frac{d}{dt}\textsc {KL}(\rho _t|\pi )&=\frac{d}{dt} \int _{\mathbb {R}^d} \rho _t \log \frac{\rho _t}{\pi } dx \\&=\int _{\mathbb {R}^d} \partial _t \rho _t \log \frac{\rho _t}{\pi } dx+\int _{\mathbb {R}^d} \partial _t \rho _t dx \\&=-\int _{\mathbb {R}^d} \left\langle \nabla \log \frac{\rho _t}{\pi }(x) , \iota _{k,\rho _t}\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^*\left( \nabla \log \frac{\rho _t}{\pi }\right) (x) \right\rangle \rho _t(x) dx+0\\&=-\left\langle \nabla \log \frac{\rho _t}{\pi }, \iota _{k,\rho _t}\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^*\left( \nabla \log \frac{\rho _t}{\pi }\right) \right\rangle _{L_2^d(\rho _t)}\\&=-\left\langle \iota _{k,\rho _t}^*\nabla \log \frac{\rho _t}{\pi } , \left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^* \left( \nabla \log \frac{\rho _t}{\pi }\right) \right\rangle _{\mathcal {H}_k^d}. \end{aligned}$$

It suffices to show that for all $\nu > 0$, $\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}$ is a positive operator from $\mathcal {H}_k^d$ to $\mathcal {H}_k^d$. By the definition of $\iota _{k,\rho _t}$, for any $f\in \mathcal {H}_k^d$ with $\left\Vert f \right\Vert _{\mathcal {H}_k^d}=1$,

$$\begin{aligned} \langle f, \left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) f \rangle _{\mathcal {H}_k^d}&=(1-\nu )\langle \iota _{k,\rho _t}f, \iota _{k,\rho _t}f \rangle _{L_2^d(\rho _t)}+\nu \left\Vert f \right\Vert _{\mathcal {H}_k^d}^2 \\&=(1-\nu )\left\Vert \iota _{k,\rho _t} f \right\Vert _{L_2^d(\rho _t)}^2+\nu > 0 \end{aligned}$$

for all $\nu > 0$. Therefore, $ (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d$ is a positive operator from $\mathcal {H}_k^d$ to $\mathcal {H}_k^d$. So is the operator $\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}$. Hence, we have (16). The claim in (15) follows directly from (16), (5) and Definition 1.

We now provide the proof of Theorem 2. For the proof, we recall that we use $(\lambda _{i,t},e_{i,t})_{i=1}^\infty $ to denote the set of eigenvalues and eigenvectors of the operator $\iota _{k,\rho _t}\iota _{k,\rho _t}^*$ for any $t\ge 0$, with $\lambda _{1,t}\ge \lambda _{2,t}\ge \cdots >0$.

Proof of Theorem 2

From Proposition 4 and (12), we know that

$$\begin{aligned} \frac{d}{dt}\textsc {KL}(\rho _t|\pi )&=-(1-\nu )^{-1}I(\rho _t|\pi )+\sum _{i=1}^\infty \frac{(1-\nu )^{-1}\nu }{(1-\nu )\lambda _{i,t}+\nu } \left| \left\langle \nabla \log \frac{\rho _t}{\pi }, e_{i,t} \right\rangle _{L_2(\rho _{t})}\right| ^2, \end{aligned}$$

where $\nabla \log \frac{\rho _t}{\pi }=(\iota _{k,\rho _t}\iota _{k,\rho _t}^*)^{\gamma _t}h_t$ for some $\gamma _t\in (0,\frac{1}{2}]$ with $h_t:= {\mathfrak {I}}(\rho _t, \gamma _t)\in L_2^d(\rho _t)$. Therefore, we have

$$\begin{aligned}&\frac{d}{dt}\textsc {KL}(\rho _t|\pi )\\&\quad =-(1-\nu )^{-1}I(\rho _t|\pi )+\sum _{i=1}^\infty \frac{(1-\nu )^{-1}\nu }{(1-\nu )\lambda _{i,t}+\nu } \left| \left\langle (\iota _{k,\rho _t}\iota _{k,\rho _t}^*)^{\gamma _t}h_t , e_{i,t} \right\rangle _{L_2(\rho _t)}\right| ^2 \\&\quad =-(1-\nu )^{-1}I(\rho _t|\pi )+\sum _{i=1}^\infty \frac{(1-\nu )^{-1}\nu \lambda _{i,t}^{2\gamma _t}}{(1-\nu )\lambda _{i,t}+\nu } \left| \left\langle h_t , e_{i,t} \right\rangle _{L_2(\rho _t)}\right| ^2\\&\quad =\sum _{i=1}^\infty (1-\nu )^{-1-2\gamma _t}\nu ^{2\gamma _t}\left( \frac{(1-\nu )\lambda _{i,t}}{(1-\nu )\lambda _{i,t} +\nu }\right) ^{2\gamma _t}\left( \frac{\nu }{(1-\nu )\lambda _{i,t}+\nu }\right) ^{1-2\gamma _t} \left| \left\langle h_t , e_{i,t} \right\rangle _{L_2(\rho _t)}\right| ^2\\&\qquad -(1-\nu )^{-1}I(\rho _t|\pi )\\&\quad \le -(1-\nu )^{-1}I(\rho _t|\pi )+(1-\nu )^{-1-2\gamma _t}\nu ^{2\gamma _t} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2. \end{aligned}$$

The result follows by integrating over t and noting that the KL-divergence is non-negative. Now, with $\rho _t$ denoting the solution to (9), we have that $I(\rho _t|\pi )$ is non-negative and continuous in t. The claim of convergence follows from the convexity of $\rho \mapsto I(\rho |\mu )$.

4.2.3 Convergence in KL-Divergence Under LSI

While the previous result was provided for any the target density $\pi \in {\mathcal {P}}({\mathbb {R}}^d)$, in this section, we provide improved convergence results of the R-SVGF under the assumption that the $\pi $ further satisfies the Log-Sobolev Inequality. Recall that we say that $\pi $ satisfies the Log-Sobolev inequality with constant $\lambda >0$ if for all $\mu \in \mathcal {P}(\mathbb {R}^d)$:

$$\begin{aligned} \textsc {KL}(\mu |\pi )\le \frac{1}{2\lambda } I(\mu |\pi ). \end{aligned}$$

Our first result below is a stronger version of the result in Theorem 1, under the assumption that the target $\pi $ satisfies LSI and Assumption 1 on the initialization of the WGF.

Assumption 1

The initial density $\mu _0$ is chosen so that the solution $(\mu _t)$ to (13) also satisfies LSI with parameter $\lambda $, for all $t>0$.

Under the stronger assumption that the target density $\pi $ is strongly log-concave, following the arguments in [45, Theorem 8], it is easy to show that Assumption 1 is satisfied as long as $\mu _0$ is chosen such that it satisfies LSI. We conjecture that the same holds true even when the target density satisfies LSI and additional mild smoothness assumptions (i.e., LSI is preserved along the trajectory as long as the initial density $\mu _0$ satisfies LSI, presumably with additional milder assumptions). However, a proof of this conjecture has eluded us thus far.

Theorem 3

[Relation to the WGF under LSI] Assume $\pi $ satisfies the log-Sobolev inequality with parameter $\lambda $. Let $(\rho _t)$ be the solution to (7). Let $(\mu _t)$ be the solution to the WGF, defined in (13), with $\mu _0$ satisfying Assumption 1. For any $t>0$, suppose there exists $\gamma _t\in (0,\frac{1}{2}]$ such that $ \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}<\infty .$ Then, for any $T\in (0,\infty )$, we have

$$\begin{aligned}&\textsc {KL}(\rho _T|\mu _T)\le e^{-3\lambda T/2 }\textsc {KL}(\rho _0|\mu _0)\nonumber \\&\quad + \int _0^T {(\lambda _{1,t}\vee 1)^2}\nu ^{2\gamma _t}{(1-\nu )^{-2\gamma _t}} e^{-3\lambda (T-t)/2} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 dt, \end{aligned}$$

(17)

where $\lambda _{1,t}$ is the largest eigenvalue of $\iota _{k,\rho _t}\iota ^*_{k,\rho _t}$ for all $t\ge 0$.

Proof of Theorem 3

Following the same arguments as in the proof of Theorem 1, we obtain that for any $t>0$,

$$\begin{aligned} \frac{d}{dt}\textsc {KL}(\rho _t|\mu _t)&\le -\frac{3}{4} I(\rho _t|\mu _t)+\sum _{i=1}^\infty \frac{(1-\lambda _i)^2\nu ^2}{\left( (1-\nu )\lambda _i+\nu \right) ^2} \left| \left\langle \nabla \log \frac{\rho _t}{\pi },e_i \right\rangle _{L_2(\rho _t)}\right| ^2\\&\le -\frac{3}{4}I(\rho _t|\mu _t)+ {(\lambda _{1,t}\vee 1)^2}\nu ^{2\gamma _t}{(1-\nu )^{-2\gamma _t}} \nu ^{2\gamma _t} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2. \end{aligned}$$

Hence, under Assumption 1 we obtain

$$\begin{aligned} \frac{d}{dt}\textsc {KL}(\rho _t|\mu _t)\le -\frac{3\lambda }{2}\textsc {KL}(\rho _t|\mu _t)+ {(\lambda _{1,t}\vee 1)^2}\nu ^{2\gamma _t}{(1-\nu )^{-2\gamma _t}} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2. \end{aligned}$$

Finally, (17) follows from the Gronwall’s inequality.

Our second result is a stronger version of the result in Theorem 2, under the assumption that the target distribution $\pi $ satisfies LSI. We remark that convergence to equilibrium of the related WGF under various functional inequalities is a well-studied topic. We refer the interested reader to [3] for a detailed overview.

Theorem 4

[Decay of KL-divergence under LSI] Assume that $\pi $ satisfies the log-Sobolev inequality with $\lambda >0$. Let $(\rho _t)$ be the solution to (7). For any $t>0$, suppose there exists $\gamma _t\in (0,\frac{1}{2}]$ such that $ \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}<\infty .$ Then, for any $T\in (0,\infty )$, we have

$$\begin{aligned} \textsc {KL}(\rho _T|\pi )&\le e^{-2(1-\nu )^{-1}\lambda T}\textsc {KL}(\rho _0|\pi )\\&\quad + \int _0^T \nu ^{2\gamma _t}(1-\nu )^{-2\gamma _t-1} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 e^{2(1-\nu )^{-1}\lambda (t-T)}dt. \end{aligned}$$

Proof of Theorem 4

From the proof of Theorem 2, we have

$$\begin{aligned} \frac{d}{dt}\textsc {KL}(\rho _t|\pi )&\le -(1-\nu )^{-1}I(\rho _t|\pi )+(1-\nu )^{-1-2\gamma _t}\nu ^{2\gamma _t} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2\\&\le -2(1-\nu )^{-1}\lambda \textsc {KL}(\rho _t|\pi )+(1-\nu )^{-1-2\gamma _t}\nu ^{2\gamma _t} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2, \end{aligned}$$

where the last inequality follows the log-Sobolev inequality. The final statement follows from Gronwall’s inequality.

Remark 6

[Exponential Decay of KL-divergence] Yet another way to state the above result is via the introducing the following regularized Stein-LSI, similar to the introduction of Stein-LSI in [19]. However, the introduction of Stein-LSI is quite restrictive in the sense that it couples assumptions on the target and the chosen RKHS. This makes verifying the conditions more delicate. To counter this effect, we now introduce the notion of Regularized Stein-LSI. We say that $\pi \in \mathcal {P}(\mathbb {R}^d)$ satisfies the regularized Stein log-Sobolev inequality with constant $\lambda >0$ if for all $\mu \in \mathcal {P}(\mathbb {R}^d)$:

$$\begin{aligned} \textsc {KL}(\mu |\pi )\le \frac{1}{2\lambda } I_{\nu ,Stein}(\mu |\pi ). \end{aligned}$$

(18)

An advantage of the above condition is that, as $\nu \rightarrow 0$ the regularized Stein-LSI inequality becomes equivalent to the standard LSI inequality. Under the condition that the target density $\pi $ satisfies (18), and letting $(\rho _t)$ be the solution to (7), it holds that

$$\begin{aligned} \textsc {KL}(\rho _t|\pi ) \le e^{-2\lambda t} \textsc {KL}(\rho _0|\pi ). \end{aligned}$$

(19)

The proof of (19) follows immediately from Proposition 4 and (18).

4.3 Convergence Results for Time-Discretized R-SVGF

In this section, we analyze the convergence properties of the time-discretized R-SVGF in (8). To do so, we require the following additional assumptions.

Assumption A2

The following conditions hold:

(1)
There exists a constant $B>0$ such that $\left\Vert \nabla _1 k(x,\cdot ) \right\Vert _{\mathcal {H}_k^d}\le B$ for all $x\in \mathbb {R}^d$.
(2)
The potential function $V:\mathbb {R}^d\rightarrow \mathbb {R}$ is twice continuously differentiable and gradient Lipschitz with parameter L.
(3)
Along the time discretization (8)), $I(\rho ^n|\pi )<\infty $ for all fixed $n\ge 0$.

The smoothness assumptions in points (1) and (2) of Assumption A2 are commonly required in analyzing any discrete-time algorithms, albeit deterministic [24, 37] or randomized [4, 11, 45]. While it could be relaxed (see, for example, [43]), in general it is impossible to completely avoid them as in the case of analyzing the corresponding flows. Before stating our results, we also introduce some convenient notations. We let

$$\begin{aligned} {\mathfrak {S}}_n:=\left( \sup _{i} \frac{{\lambda _i^{(n)}}^{1+2\gamma _n}}{\left( (1-\nu _{n+1}){\lambda _i^{(n)}}+\nu _{n+1}\right) ^2}\right) \quad ~\text {and}\quad R_n:=\left\Vert {\mathfrak {I}}(\rho ^n, \gamma _n) \right\Vert _{L_2^d(\rho ^n)}, \end{aligned}$$

(20)

where the sequence $\{\lambda _i^{(n)}\}_{i\ge 1}$ corresponds to the positive eigenvalues of the operator $\iota _{k,\rho ^n}\iota _{k,\rho ^n}^*$ in the order of decreasing values for all $n\ge 0$.

Theorem 5

[Convergence in Fisher Divergence] Suppose Assumption A2 holds. Let $(\rho ^n)$ be the time discretization of the R-SVGF described in (8) with initial condition $\rho ^0=\rho _0$ such that $\textsc {KL}(\rho _0|\pi )\le R$. For each n, suppose that $\nu _{n+1}$ and the step-size $h_{n+1}$ are chosen such that,

$$\begin{aligned} \begin{aligned} h_{n+1}&< \min \left\{ \frac{1-\nu _{n+1}}{L}, \frac{\alpha -1}{ \alpha B R_n \sqrt{ {\mathfrak {S}}_n}} \right\} , \end{aligned} \end{aligned}$$

(21)

where $\alpha \in (1,2]$ is some constant, and suppose that there exists $\gamma _n\in (0,\frac{1}{2}]$, such that ${\mathfrak {I}}(\rho ^n, \gamma _n) \in L_2^d(\rho ^n)$. Then,

$$\begin{aligned} \sum _{n=0}^\infty \frac{h_{n+1}}{2(1-\nu _{n+1})}I(\rho ^n|\pi )\le \sum _{n=0}^\infty \nu _{n+1}^{2\gamma _n}(1-\nu _{n+1})^{-2\gamma _n-1} h_{n+1} \left( 1+\frac{1}{2}\nu _{n+1}^{-1}\alpha ^2 B^2 h_{n+1} \right) R_n^2 +R.\nonumber \\ \end{aligned}$$

(22)

Before proving Theorem 5, we first prove the following intermediate result. For the proofs, we let $(\lambda _i^{(n)},e_i^{(n)})_{i=1}^\infty $to denote the set of eigenvalues and eigenvectors of the operator $\iota _{k,\rho ^n}\iota _{k,\rho ^n}^*$, with $\lambda _1^{(n)}\ge \lambda _2^{(n)}\ge \cdots >0$.

Lemma 1

For each $n\ge 1$, define $g=\mathcal {D}_{\nu _{n+1},\rho ^n}\nabla \log \frac{\rho ^n}{\pi }$. Under the conditions in Theorem 5, we have that, for any $x\in \mathbb {R}^d$ and $t\in [0,h_{n+1}]$,

$$\begin{aligned} \left\Vert \nabla g(x) \right\Vert _{HS}^2 \le B^2 R_n^2 {\mathfrak {S}}_n ~~~\text {and}~~~\left\Vert (id-t \nabla g(x))^{-1} \right\Vert _2 \le \alpha . \end{aligned}$$

Proof of Lemma 1

Since for each n, there exists $\gamma _n\in (0,1/2]$ and a function $ h={\mathfrak {I}}(\rho ^n, \gamma _n) \in L_2^d(\rho ^n)$ such that $(\iota _{k,\rho ^n}\iota _{k,\rho ^n}^*)^{2\gamma _n} h_j=\partial _j \log \frac{\rho ^n}{\pi } $, where $h_j$ is the j-th component of the function value of h, we have

$$\begin{aligned} \left\Vert \nabla g(x) \right\Vert _{HS}^2&=\sum _{j,l=1}^d \left| \frac{\partial g_j(x) }{\partial x_l}\right| ^2\\&=\sum _{j,l=1}^d \left( \sum _{i=1}^\infty \frac{\lambda _i^{(n)}}{(1-\nu _{n+1})\lambda _i^{(n)}+\nu _{n+1}} \left\langle \partial _j \log \frac{\rho ^n}{\pi },e_i^{(n)} \right\rangle _{L_2(\rho ^n)}\partial _l e_i^{(n)}(x) \right) ^2 \\&=\sum _{j,l=1}^d \left( \sum _{i=1}^\infty \frac{{\lambda _i^{(n)}}^{1+\gamma _n}}{(1-\nu _{n+1})\lambda _i^{(n)}+\nu _{n+1}} \left\langle h_j,e_i^{(n)} \right\rangle _{L_2(\rho ^n)}\partial _l e_i^{(n)}(x) \right) ^2\\&\le \sum _{j,l=1}^d \left( \sum _{i=1}^\infty \left\langle h_j,e_i^{(n)} \right\rangle _{L_2(\rho ^n)}^2 \right) \left( \sum _{i=1}^\infty \frac{{\lambda _i^{(n)}}^{2+2\gamma _n}}{\left( (1-\nu _{n+1}){\lambda _i^{(n)}}+\nu _{n+1}\right) ^2}\left| \partial _l e_i^{(n)}(x) \right| ^2 \right) \\&=\left( \sum _{i=1}^\infty \left| \left\langle h,e_i^{(n)} \right\rangle _{L_2(\rho ^n)} \right| ^2 \right) \left( \sum _{i=1}^\infty \frac{{\lambda _i^{(n)}}^{2+2\gamma _n}}{\left( (1-\nu _{n+1}){\lambda _i^{(n)}}+\nu _{n+1}\right) ^2}\left| \nabla e_i^{(n)}(x) \right| ^2 \right) \\&\le \sup _{i} \left( \frac{{\lambda _i^{(n)}}^{1+2\gamma _n}}{\left( (1-\nu _{n+1}){\lambda _i^{(n)}}+\nu _{n+1}\right) ^2} \right) \left\Vert \nabla _1 k(x,\cdot ) \right\Vert _{\mathcal {H}_k^d}^2 R_n^2 \\&\le B^2 R_n^2 \sup _{i} \left( \frac{{\lambda _i^{(n)}}^{1+2\gamma _n}}{\left( (1-\nu _{n+1}){\lambda _i^{(n)}}+\nu _{n+1}\right) ^2} \right) . \end{aligned}$$

In the above, the first inequality follows from Cauchy-Schwartz inequality, the second inequality follows from the fact that

$$\begin{aligned} \sum _{i=1}^\infty {\lambda _i^{(n)}}\left| \nabla e_i^{(n)}(x)\right| ^2=\sum _{i=1}^\infty \langle \nabla _1 k(x,\cdot ), {\sqrt{\lambda _i^{(n)}}}e_i^{(n)} \rangle _{\mathcal {H}_k^d}^2 =\left\Vert \nabla _1 k(x,\cdot ) \right\Vert _{\mathcal {H}_k^d}^2, \end{aligned}$$

and the last inequality follows from Assumption A2. Meanwhile, since $\left\Vert \nabla g(x) \right\Vert _{2}\le \left\Vert \nabla g(x) \right\Vert _{HS}$ for all $x\in \mathbb {R}^d$, for every $t\in [0,h_{n+1}]$, we have

$$\begin{aligned} \left\Vert (id-t \nabla g(x))^{-1} \right\Vert _{2}&\le \sum _{m=0}^\infty \left\Vert t \nabla g(x) \right\Vert _{2}^m \le \sum _{m=0}^\infty \left\Vert t \nabla g(x) \right\Vert _{HS}^m \\&\le \sum _{m=0}^\infty \left( h_{n+1} B R_n \sup _{i}\left( \frac{{\lambda _i^{(n)}}^{1+2\gamma _n}}{\left( (1-\nu _{n+1}){\lambda _i^{(n)}}+\nu _{n+1}\right) ^2} \right) ^{\frac{1}{2}} \right) ^m\\&\le \sum _{m=0}^\infty \left( \frac{\alpha -1}{\alpha }\right) ^m =\alpha . \end{aligned}$$

where the last inequality follows from (21).

Proof of Theorem 5

We start from studying the single step along (8). In the following analysis, for each $n\ge 1$, we denote $g=\mathcal {D}_{\nu _{n+1},\rho ^n}\nabla \log \frac{\rho ^n}{\pi }$, $\phi _t(x)=x-t g(x)$ for all $x\in \mathbb {R}^d$, $t\in [0,h_{n+1}]$ and $\tilde{\rho }_t=(\phi _t)_{\#}\rho ^n$. Therefore, we have

$$\begin{aligned} \rho ^n=\tilde{\rho }_0 \quad \text {and} \quad \rho ^{n+1}=({\phi _{h_{n+1}}})_{\#}\rho ^n=\tilde{\rho }_{h_{n+1}}. \end{aligned}$$

The following analysis is motivated by [37, Proposition 3.1]. According to [46, Theorem 5.34], the velocity field ruling the evolution of $\tilde{\rho }_t$ is $\omega _t\in L_2^d(\tilde{\rho }_t)$ and $\omega _t(x)=-g(\phi _t^{-1}(x))$. Define $\psi (t)=\textsc {KL}(\tilde{\rho }_t|\pi )$. According to the chain rule in [46, Section 8.2],

$$\begin{aligned} \psi '(t)&= \left\langle \nabla _{W_2} \textsc {KL}(\tilde{\rho }_t|\pi ), \omega _t \right\rangle _{L_2^d(\tilde{\rho }_t)},\\ \psi ''(t)&=\left\langle \omega _t, \text {Hess}_{\textsc {KL}(\cdot |\pi )}(\tilde{\rho }_t)\omega _t \right\rangle _{L_2^d(\tilde{\rho }_t)}. \end{aligned}$$

where $\text {Hess}_{\textsc {KL}(\cdot |\pi )}(\tilde{\rho }_t)$ is the Wasserstein Hessian of $\textsc {KL}(\cdot |\pi )$ at $\tilde{\rho }_t$. For any $\mu \in \mathcal {P}(\mathbb {R}^d)$ and any v in the Wasserstein tangent space at $\mu $, the Wasserstein Hessian is given by [24],

$$\begin{aligned} \left\langle v, \text {Hess}_{\textsc {KL}(\cdot |\pi )}(\mu )v \right\rangle _{L_2^d(\mu )}&=\left\langle v, \nabla ^2 V v \right\rangle _{L_2^d(\mu )}+\mathbb {E}_{\mu }[\left\Vert \nabla v(X) \right\Vert _{HS}^2]. \end{aligned}$$

Therefore we can expand the difference in KL-divergence between the two consecutive iterations as

$$\begin{aligned}&\psi (h_{n+1})-\psi (0)\nonumber \\&\quad =\psi '(0)h_{n+1}+\int _0^{h_{n+1}} (h_{n+1}-t)\psi ''(t)dt \nonumber \\&\quad =-h_{n+1}\left\langle \nabla _{W_2} \textsc {KL}(\rho ^n |\pi ), g \right\rangle _{L_2^d(\rho ^n)}\nonumber \\ {}&\qquad +\int _0^{h_{n+1}}(h_{n+1}-t)\langle \omega _t, \text {Hess}_{\textsc {KL}(\cdot |\pi )}(\tilde{\rho }_t)\omega _t \rangle _{L_2^d(\tilde{\rho }_t)} dt. \end{aligned}$$

(23)

The first term on the right-hand side of (23) can be studied via the spectrum of the operator $\iota _{k,\rho ^n}\iota _{k,\rho ^n}^*$.

$$\begin{aligned}&-h_{n+1}\left\langle \nabla _{W_2} \textsc {KL}(\rho ^n |\pi ), g \right\rangle _{L_2^d(\rho ^n)}\nonumber \\&\quad =-h_{n+1}\left\langle \nabla \log \frac{\rho ^n}{\pi }, \left( (1-\nu _{n+1})\iota _{k,\rho ^n}^* \iota _{k,\rho ^n}+\nu _{n+1} I_d\right) ^{-1}\iota _{k,\rho ^n}^* \nabla \log \frac{\rho ^n}{\pi } \right\rangle _{L_2^d(\rho ^n)} \nonumber \\&\quad =-h_{n+1}\sum _{i=1}^\infty \frac{\lambda _i^{(n)}}{(1-\nu _{n+1})\lambda _i^{(n)}+\nu _{n+1}}\left| \left\langle \nabla \log \frac{\rho ^n}{\pi }, e_i^{(n)} \right\rangle _{L_2(\rho ^n)}\right| ^2 \nonumber \\&\quad =-h_{n+1} I_{\nu _{n+1},Stein}(\rho ^n|\pi ). \end{aligned}$$

Since $\tilde{\rho }_t=(\phi _t)_{\#}\rho ^n$, for any function h we have $\mathbb {E}_{X\sim \tilde{\rho }_t}\left[ h(X) \right] =\mathbb {E}_{Y\sim \rho {^n}}\left[ h(\phi _t(Y)) \right] $. Hence, for the second term on the right side of (23), we obtain

$$\begin{aligned}&\left\langle \omega _t, \text {Hess}_{\textsc {KL}(\cdot |\pi )}(\tilde{\rho }_t)\omega _t \right\rangle _{L_2^d(\tilde{\rho }_t)} =\left\langle \omega _t, \nabla ^2 V \omega _t \right\rangle _{L_2^d(\tilde{\rho }_t)}+\mathbb {E}_{\tilde{\rho }_t}[\left\Vert \nabla \omega _t(x) \right\Vert _{HS}^2] \\&=\left\langle g(\phi _t^{-1}), \nabla ^2 V g(\phi _t^{-1}) \right\rangle _{L_2^d(\tilde{\rho }_t)}+\mathbb {E}_{\rho ^n} [ \left\Vert \nabla \omega _t \circ \phi _t(x) \right\Vert _{HS}^2 ] \\&=\mathbb {E}_{\rho ^n}\left[ g(x)^T \nabla V^2(\phi _t(x)) g(x)\right] +\mathbb {E}_{\rho ^n} \left[ \left\Vert \nabla g(x) (\nabla \phi _t(x))^{-1} \right\Vert _{HS}^2 \right] \\&\le L\left\Vert g \right\Vert _{L_2^d(\rho ^n)}^2+\mathbb {E}_{\rho ^n} \left[ \left\Vert \nabla g(x) (\nabla \phi _t(x))^{-1} \right\Vert _{HS}^2 \right] , \end{aligned}$$

where the last inequality follows from Assumption A2-(2). Therefore, we obtain

$$\begin{aligned} \textsc {KL}(\rho ^{n+1}|\pi )-\textsc {KL}(\rho ^n|\pi )&\le -h_{n+1}I_{\nu _{n+1},Stein}(\rho ^n|\pi )+\frac{Lh_{n+1}^2}{2}\left\Vert g \right\Vert _{L_2^d(\rho ^n)}^2\\&\quad \quad +\frac{h_{n+1}^2}{2}\max _{t\in [0,h_{n+1}]}\mathbb {E}_{\rho ^n} \left[ \left\Vert \nabla g(x) (\nabla \phi _t(x))^{-1} \right\Vert _{HS}^2 \right] , \end{aligned}$$

where

$$\begin{aligned} \left\Vert g \right\Vert _{L_2^d(\rho ^n)}^2&=\left\Vert \mathcal {D}_{\nu _{n+1},\rho ^n} \nabla \log \frac{\rho ^n}{\pi } \right\Vert _{L_2^d(\rho ^n)}^2 \\&=\left\Vert \left( (1-\nu _{n+1})\iota _{k,\rho ^n}\iota _{k,\rho ^n}^*+\nu _{n+1}I_d\right) ^{-1}\iota _{k,\rho ^n}\iota _{k,\rho ^n}^* \nabla \log \frac{\rho ^n}{\pi } \right\Vert _{L_2^d(\rho ^n)}^2 \\&=\sum _{i=1}^d \left( \frac{{\lambda _i^{(n)}}}{(1-\nu _{n+1})\lambda _i^{(n)}+\nu _{n+1}} \right) ^2 \left| \left\langle \nabla \log \frac{\rho ^n}{\pi }, e_i^{(n)} \right\rangle _{L_2(\rho ^n)} \right| ^2\\&\le (1-\nu _{n+1})^{-2}I(\rho _n|\pi ), \end{aligned}$$

with $(\lambda _i^{(n)}, e_i^{(n)})_{i=1}^\infty $ being the sequence of eigenvalues and eigenvectors to the operator $\iota _{k,\rho ^n}\iota _{k,\rho ^n}^*$ such that $\lambda _1^{(n)}\ge \cdots \ge \lambda _i^{(n)}\ge \cdots >0$ and $(e_i^{(n)})_{i=1}^\infty $ is an orthonormal basis of $L_2(\rho ^n)$. According to Lemma 1 and Assumption A2,

$$\begin{aligned}&\left\Vert \nabla g(x) \right\Vert _{HS}^2\\&\quad \le \sup _{i} \left( \frac{{\lambda _i^{(n)}}^{1+2\gamma _n}}{\left( (1-\nu _{n+1})\lambda _i^{(n)}+\nu _{n+1}\right) ^2}\right) B^2 R_n^2 \\&\quad \le \sup _{i} \left( \frac{\nu _{n+1}^{2\gamma _n-1}}{(1-\nu _{n+1})^{2\gamma _n+1}} \left( \frac{(1-\nu _{n+1})\lambda _i^{(n)}}{(1-\nu _{n+1})\lambda _i^{(n)}+\nu _{n+1}}\right) ^{1+2\gamma _n}\right. \\&\qquad \left. \left( \frac{\nu _{n+1}}{(1-\nu _{n+1})\lambda _i^{(n)}+\nu _{n+1}}\right) ^{1-2\gamma _n} \right) B^2 R_n^2\\&\quad \le \nu _{n+1}^{2\gamma _n-1}(1-\nu _{n+1})^{-2\gamma _n-1} B^2 R_n^2, \end{aligned}$$

and furthermore according to Lemma 1, $\left\Vert (id-t \nabla g(x))^{-1}\right\Vert _2^2\le \alpha ^2$. Therefore, we get

$$\begin{aligned} \textsc {KL}(\rho ^{n+1}|\pi )-\textsc {KL}(\rho ^n|\pi )&\le -h_{n+1}I_{\nu _{n+1},Stein}(\rho ^n|\pi )+\frac{L h_{n+1}^2(1-\nu _{n+1})^{-2}}{2}I(\rho ^n|\pi )\\&\quad +\frac{1}{2}\alpha ^2B^2\nu _{n+1}^{2\gamma _n-1}(1-\nu _{n+1})^{-2\gamma _n-1}R_n^2 h_{n+1}^2\\&\le -h_{n+1}(1-\nu _{n+1})^{-1} \left( 1-\frac{L h_{n+1}(1-\nu _{n+1})^{-1}}{2} \right) I(\rho ^n|\pi )\\&\quad +h_{n+1} \nu _{n+1}^{2\gamma _n}(1-\nu _{n+1})^{-2\gamma _n-1} R_n^2\left( 1+\frac{1}{2}h_{n+1} \nu _{n+1}^{-1} \alpha ^2 B^2 \right) \\&\le -\frac{1}{2}h_{n+1}(1-\nu _{n+1})^{-1} I(\rho ^n|\pi )\\&\quad +h_{n+1} \nu _{n+1}^{2\gamma _n}(1-\nu _{n+1})^{-2\gamma _n-1} R_n^2\left( 1+\frac{1}{2}h_{n+1} \nu _{n+1}^{-1} \alpha ^2 B^2 \right) , \end{aligned}$$

where the last inequality follows from (21) and the second inequality follows from the fact that

$$\begin{aligned} I(\rho ^n|\pi )-{(1-\nu _{n+1})}I_{\nu _{n+1},Stein}(\rho ^n|\pi )\le \nu _{n+1}^{2\gamma _n}(1-\nu _{n+1})^{-2\gamma _n} R_n^2, \end{aligned}$$

which is proved in Proposition 3. Lastly, summing over n and we obtain

$$\begin{aligned} \sum _{n=0}^\infty \frac{h_{n+1}}{2(1-\nu _{n+1})} I(\rho ^n|\pi )&\le \sum _{n=0}^\infty \left( \textsc {KL}(\rho ^n|\pi )-\textsc {KL}(\rho ^{n+1}) \right) \\&\quad +\sum _{n=0}^\infty h_{n+1} \nu _{n+1}^{2\gamma _n}(1-\nu _{n+1})^{-2\gamma _n-1} R_n^2\left( 1+\frac{1}{2}h_{n+1} \nu _{n+1}^{-1} \alpha ^2 B^2 \right) \\&\le \textsc {KL}(\rho ^0|\pi )+\sum _{n=0}^\infty h_{n+1} \nu _{n+1}^{2\gamma _n}(1-\nu _{n+1})^{-2\gamma _n-1} \\&\qquad R_n^2\left( 1+\frac{1}{2}h_{n+1} \nu _{n+1}^{-1} \alpha ^2 B^2 \right) , \end{aligned}$$

where the last inequality follows from the fact that KL divergence is non-negative. Therefore, (22) is proved.

Remark 7

We emphasize that the above result does not make any assumptions on the target density $\pi $, except for $\pi \in \mathcal {P}(\mathbb {R}^d)$ and the Lipschitz gradient assumption. In particular, it holds for multi-modal densities. However, the metric of convergence is the weaker Fisher information metric.

We now provide a stronger result under the LSI assumption.

Theorem 6

Suppose Assumption A2 holds and $\pi $ satisfies the log-Sobolev inequality with parameter $\lambda $. Let $(\rho ^n)$ be as described in (8) with initial condition $\rho ^0=\rho _0$ such that $\textsc {KL}(\rho _0|\pi )\le R$. Assume the regularization parameter and the step-size parameters are chosen such that for all $n\ge 0$, they satisfy

$$\begin{aligned} \begin{aligned} h_{n+1}\le \min \left\{ \frac{1-\nu _{n+1}}{L}, \frac{\alpha -1}{ \alpha B R_n \sqrt{ {\mathfrak {S}}_n}},{\frac{2\nu _{n+1}}{\alpha ^2B^2}} , \frac{2(1-\nu _{n+1})}{\lambda } \right\} , \\ ~~~~ \frac{\nu _{n+1}}{1-\nu _{n+1}}\le \left( \frac{I(\rho ^n|\pi )}{2R_n^2} \right) ^{\frac{1}{2\gamma _n}}, \end{aligned} \end{aligned}$$

(24)

where $\alpha \in (1,2]$ is a constant, $\gamma _n\in (0,\frac{1}{2}]$, and ${\mathfrak {I}}(\rho ^n, \gamma _n) \in L_2^d(\rho ^n)$. Then, for all $n\ge 1$,

$$\begin{aligned} \textsc {KL}(\rho ^n|\pi ) \le R~ \prod _{i=1}^n \left( 1-\frac{1}{2}\lambda (1-\nu _i)^{-1}h_{i}\right) . \end{aligned}$$

(25)

Proof of Theorem 6

From the proof of Theorem 5, we can bound the difference in KL-divergence between two consecutive iterations by

$$\begin{aligned}&\textsc {KL}(\rho ^{n+1}|\pi )-\textsc {KL}(\rho ^n|\pi )\\&\quad \le -\frac{1}{2}h_{n+1}(1-\nu _{n+1})^{-1} I(\rho ^n|\pi ) +h_{n+1} \nu _{n+1}^{2\gamma _n}(1-\nu _{n+1})^{-2\gamma _n-1} R_n^2\left( 1+\frac{1}{2}h_{n+1} \nu _{n+1}^{-1} \alpha ^2 B^2 \right) \\&\quad =-\frac{1}{4}h_{n+1}(1-\nu _{n+1})^{-1}I(\rho ^n|\pi ) \left( 2-\frac{\nu _{n+1}^{2\gamma _n}}{(1-\nu _{n+1})^{2\gamma _n}}\frac{R_n^2 \left( 1+\frac{1}{2}h_{n+1} \nu _{n+1}^{-1} \alpha ^2 B^2 \right) }{I(\rho ^n|\pi )} \right) \\&\quad \le -\frac{1}{4}h_{n+1}(1-\nu _{n+1})^{-1}I(\rho ^n|\pi ) \left( 2-\frac{\nu _{n+1}^{2\gamma _n}}{(1-\nu _{n+1})^{2\gamma _n}}\frac{{2}R_n^2 }{I(\rho ^n|\pi )} \right) \\&\quad \le -\frac{1}{4}h_{n+1}(1-\nu _{n+1})^{-1}I(\rho ^n|\pi ), \end{aligned}$$

where the second inequality follows from the fact that $\frac{1}{2}h_{n+1} \nu _{n+1}^{-1} \alpha ^2 B^2\le 1$, and the last inequality follows from (24).

Last, since $\pi $ satisfies the log-Sobolev inequality with parameter $\lambda $, we get

$$\begin{aligned} \textsc {KL}(\rho ^{n+1}|\pi )\le \left( 1-\frac{1}{2}\lambda (1-\nu _{n+1})^{-1} h_{n+1}\right) \textsc {KL}(\rho ^n|\pi ), \end{aligned}$$

and (25) follows from the above recursive inequality.

Remark 8

[Choice of $\{\nu _n\}_{n\ge 1}$] From (24) and (25), we observe that there is a trade-off in terms of $\{\nu _n\}_{n\ge 1}$, as smaller $\{\nu _n\}_{n\ge 1}$ will result in slower convergence of the time-discretized R-SVGF in KL-divergence. Indeed, if $\{\nu _n\}_{n\ge 1}$ are chosen to be small, (24) requires the step-size $\{h_n\}_{n\ge 1}$ to be small as well. And, as shown in (25), $\textsc {KL}(\rho _n|\pi )$ decays at a slower rate when $\{h_n\}_{n\ge 1}$ are smaller. We also refer to Remark 10 below and Remark 12 for further trade-offs with respect to the parameter $\{\nu _n\}_{n\ge 1}$.

Remark 9

According to (25), to reach an $\epsilon $-accuracy in KL-divergence, we need the number of iterations to be at least $n_\epsilon $ such that $\prod _{i=1}^{n_\epsilon } \left( 1-\frac{1}{2}\lambda (1-\nu _i)^{-1} h_i\right) R\le \epsilon $. With the fact that $\log (1-x)<-x$ for all $x\in (0,1)$, we get $n_\epsilon $ satisfies

$$\begin{aligned} \sum _{i=1}^{n_\epsilon } (1-\nu _i)^{-1} h_i\ge \frac{2}{\lambda } \log \left( \frac{R}{\epsilon }\right) . \end{aligned}$$

Under (24), if we can choose the time step sizes $(h_i)_{i=1}^\infty $ to be a constant $h>0$, then we have $n_\epsilon = O(\log (R/\epsilon ))$. For comparison, in Table 1, we provide the iteration complexity results for different methods, to obtain $\text {KL}(\rho _n|\pi ) \le \epsilon $, under the assumption that the target $\pi $ satisfies LSI.

We also emphasize that prior results on the analysis of time-discretization of the SVGF under functional inequality assumptions are established only in the weaker Stein-Fisher information metric [24, 37]. Our results above are established for the $\textsc {KL}$-divergence and is more in line with similar results established for other randomized Monte Carlo algorithms [4, 11, 45]. We end this section with the following remark on an illustrative example.

Table 1 The results for SVGF and Regularized SVGF are strictly non-algorithmic, as the stated results require that the initial density supplied to the method must have finite KL divergence to $\pi $

Full size table

Remark 10

(An illustrative example) Consider sampling from a Gaussian target $\pi =\mathcal {N}(0,Q)$, where Q is strictly positive definite and $k(x,y)=\langle x,y \rangle +1$, the linear kernel. This model has recently been studied in [28] motivated by connections to Gaussian variational inference. Note that the above target $\pi $ satisfies LSI with $\lambda $ being the minimal eigenvalue of Q. Based on the results established in [28], our results on the R-SVGF and the time-discretized R-SVGF can be interpreted as follows. Before we proceed, we point out a subtle fact. To establish existence and uniqueness later in Sect. 5 we assume that the kernel is bounded, which is not satisfied by the linear kernel. However, based on an explicit calculation, existence and uniqueness results were shown for the linear kernel in [28].

R-SVGF: If $\rho _0=\mathcal {N}(0,\Sigma _0)$, then note that we have for all $t\ge 0$, $\rho _t=\mathcal {N}(0,\Sigma _t)$, $\gamma _t={1}/{2}$, $\lambda _{1,t}\le 1+\left\Vert \Sigma _t \right\Vert _F^2$ and $\left\Vert \mathcal {J}(\rho _t,\gamma _t) \right\Vert _{L_2(\rho _t)}=\left\Vert Q^{-1}-\Sigma _t^{-1} \right\Vert _F $. It follows from [28, Equation 9] that

$$\begin{aligned} \frac{\textrm{d}}{\textrm{d}t}\Sigma _t&=2\big ( (1-\nu )\Sigma _t+\nu I_d \big )^{-1}\Sigma _t-\big ( (1-\nu )\Sigma _t+\nu I_d \big )^{-1}\Sigma _t^2 Q^{-1}\\&\quad -Q^{-1}\big ( (1-\nu )\Sigma _t+\nu I_d \big )^{-1}\Sigma _t^2. \end{aligned}$$

Further assuming that $\Sigma _0 Q=Q \Sigma _0$, there exists an orthonormal matrix P such that we have $Q=P^\intercal \text {diag}\{q_1,\cdots ,q_d\} P$ and $\Sigma _t=P^\intercal \text {diag}\{\sigma _1(t),\cdots ,\sigma _d(t)\} P$ with $q_1\ge q_2\ge \cdots \ge q_d$. According to [28, page 45], we also have that $\lambda =q_d$, $\left\Vert \Sigma _t-Q \right\Vert =O\big (\exp (-\frac{2t}{(1-\nu )q_1+\nu } )\big )$ and

$$\begin{aligned} \lambda _{1,t}&\le 1+ O\bigg (\sum _{i=1}^d \big (q_i+\exp (-\frac{2t}{(1-\nu )q_i+\nu })\big )^2 \bigg ), \\ \left\Vert \mathcal {J}(\rho _t,\gamma _t) \right\Vert _{L_2(\rho _t)}^2&=O\left( \sum _{i=1}^d \sigma _i(0)^{-\frac{2\nu }{(1-\nu ) q_i+\nu }}q_i^{-4+\frac{2\nu }{(1-\nu )q_i+\nu }}(\sigma _i(0)-q_i)^2e^{-\frac{4t}{(1-\nu )q_i+\nu }} \right) . \end{aligned}$$

Our convergence result in Theorem 4, hence translates as

$$\begin{aligned} \textsc {KL}(\rho _T|\pi )&\le e^{-2(1-\nu )^{-1}\lambda T}\textsc {KL}(\rho _0|\pi )\\&\quad +\int _0^T \nu (1-\nu )^{-2} \left\Vert {\mathfrak {I}}(\rho _t, \gamma _t) \right\Vert _{L_2^d(\rho _t)}^2 e^{2(1-\nu )^{-1}\lambda (t-T)}dt\\&= e^{-2(1-\nu )^{-1}q_d T}\textsc {KL}(\rho _0|\pi )\\&\quad + O\left( \nu (1-\nu )^{-2} T \exp \left( -\min \left\{ \frac{4}{(1-\nu )q_1+\nu },\frac{2q_d}{1-\nu } \right\} T \right) \right) , \end{aligned}$$

for $T\in (0,\infty )$. The first term in the above bound indicates the exponential convergence of the R-SVGF and the convergence rate is proportional to $1/(1-\nu )$. The second term characterizes the bias between $\rho _T$ and $\pi $. For any fixed values of T, the bias term vanishes as $\nu \rightarrow 0$. Therefore, in order to obtain an optimal upper bound on $\textsc {KL}(\rho _T|\pi )$, there is a trade-off between choosing large $\nu $ (i.e., $\nu \rightarrow 1$) and small $\nu $ (i.e., $\nu \rightarrow 0$), which depends on T and eigenvalues of $\Sigma _0$ and Q.

Time-discretized R-SVGF: Note that if $\rho ^0=\mathcal {N}(0,S_0)$, then for all $n\ge 0$, $\rho ^n=\mathcal {N}(0,S_n)$, $\gamma _n={1}/{2}$. $R_n=\left\Vert \mathcal {J}(\rho ^n,\gamma _n) \right\Vert _{L_2(\rho ^n)}=\left\Vert Q^{-1}-S_n^{-1} \right\Vert _F $ and $I(\rho ^n|\pi )=\text {trace}(S_n(Q^{-1}-S_n^{-1})^\intercal (Q^{-1}-S_n^{-1}))$, where $(S_n)_{n\ge 0}$ is updated as:

$$\begin{aligned} S_{n+1}-S_n&=h_{n+1}\big ( (1-\nu _{n+1})S_n+\nu _{n+1} I_d \big )^{-1}(S_n^{-1}-Q^{-1})S_n^2 \nonumber \\&\quad h_{n+1}S_n\big ( (1-\nu _{n+1})S_n+\nu _{n+1} I_d \big )^{-1}(S_n^{-1}-Q^{-1}) S_n \nonumber \\&\quad +h_{n+1}^2 \big ( (1-\nu _{n+1})S_n+\nu _{n+1} I_d \big )^{-1} (S_n^{-1}-Q^{-1}) S_n^3 \,\nonumber \\&\quad \times (S_n^{-1}-Q^{-1}) \big ( (1-\nu _{n+1})S_n+\nu _{n+1} I_d \big )^{-1} . \end{aligned}$$

(26)

Since $\nabla _1 k(x,y)= y$, the RKHS norm of $\nabla _1 k(x,y)$ is obtained as follows

$$\begin{aligned} \left\Vert \nabla _1 k(x,y) \right\Vert _{\mathcal {H}_k^d}^2 = \sum _{i=1}^d \langle y_i, y_i \rangle _{\mathcal {H}_k} = \sum _{i=1}^d \partial _i \partial _{i+d} k(y,y) =d . \end{aligned}$$

Therefore, we have $\gamma _n=1/2$, $B=\sqrt{d}$ and ${\mathfrak {S}}_n\le (1-\nu _{n+1})^{-2}$ for all $n\ge 0$. Now, the convergence result in Theorem 6 translates as:

$$\begin{aligned} \textsc {KL}(\rho ^n|\pi )\le \textsc {KL}(\rho ^0|\pi ) \Pi _{i=1}^n \Big ( 1-\frac{q_d}{2(1-\nu _i)} h_i \Big ), \end{aligned}$$

where for all $i\ge 1$,

$$\begin{aligned}&\frac{\nu _i}{1-\nu _i}\le \frac{\text {trace}(S_{i-1}(Q^{-1}-S_{i-1}^{-1})^\intercal (Q^{-1}-S_{i-1}^{-1}))}{2\left\Vert Q^{-1}-S_{i-1}^{-1} \right\Vert _F^2},\\ h_i&\le \min \left\{ \frac{1-\nu _i}{q_1},\frac{2(1-\nu _i)}{q_d} , \frac{2(1-\nu _i)}{ \sqrt{d} \left\Vert Q^{-1}-S_{i-1}^{-1} \right\Vert _F },{\frac{\nu _i}{2d}} \right\} . \end{aligned}$$

We now show that such a choice of $\{\nu _i,h_i\}_{i\ge 1}$ exists. Assuming $QS_0=S_0Q$, there exists an orthonormal matrix P such that $Q=P^\intercal \text {diag}\{q_1,\cdots ,q_d\} P$ and $S_i=P^\intercal \text {diag}\{\sigma _{1,i},\cdots ,\sigma _{d,i}\} P$ with $q_1\ge q_2\ge \cdots \ge q_d$. Then, (26) implies that for any $k= 1,\cdots ,d $,

$$\begin{aligned} q_k^{-1}\sigma _{k,i+1}= \big ( 1+ h_{i+1} ( (1-\nu _{i+1})\sigma _{k,i}+\nu _{i+1})^{-1}(1-q_k^{-1}\sigma _{k,i}) \big )^2 q_k^{-1}\sigma _{k,i}. \end{aligned}$$

Now, we pick

$$\begin{aligned} h_{i+1}= \delta \min _k \big ((1-\nu _{i+1})\sigma _{k,i}+\nu _{i+1}\big ) ,\quad \tfrac{\nu _i}{1-\nu _i}= \tfrac{\text {trace}(S_{i-1}(Q^{-1}-S_{i-1}^{-1})^\intercal (Q^{-1}-S_{i-1}^{-1}))}{2\left\Vert Q^{-1}-S_{i-1}^{-1} \right\Vert _F^2}, \end{aligned}$$

(27)

with $\delta \in (0,1/2)$. According to [28, Theorem 3.10], if $q_k^{-1}\sigma _{k,0}\in [u_\delta , 1/3+1/(3\delta )]$ for all $k=1,\cdots ,d$, we have $\left\Vert S_i-Q \right\Vert _2\le e^{-i\delta } \left\Vert S_0-Q \right\Vert $. Here $u_\delta $ is the smaller root of $f_\delta '(u)=1-\delta $ with $f_\delta (x)=(1+\delta (1-x))^2x$. According to the convergence of $S_i$ and the fact that

$$\begin{aligned} \left\Vert S_{i-1} \right\Vert _2 \left\Vert Q^{-1}-S_{i-1}^{-1} \right\Vert _F^2{} & {} \ge \text {trace}(S_{i-1}(Q^{-1}-S_{i-1}^{-1})^\intercal (Q^{-1}-S_{i-1}^{-1}))\\{} & {} \ge \left\Vert S_{i-1}^{-1} \right\Vert _2^{-1} \left\Vert Q^{-1}-S_{i-1}^{-1} \right\Vert _F^2, \end{aligned}$$

we have that $\{\nu _i\}_{i\ge 1}$ and $\{1-\nu _i\}_{i\ge 1}$ have uniform positive lower bounds, i.e., $\inf _i \nu _i>0$ and $\inf _i 1-\nu _i>0$. Therefore, we can further choose

$$\begin{aligned} \delta = \inf _i \left\{ \big ( (1-\nu _i) \Vert S_{i-1}^{-1}\Vert _2^{-1}+\nu _i \big )^{-1}\min \{ \frac{1-\nu _i}{q_1},\frac{2(1-\nu _i)}{q_d}, \frac{2(1-\nu _i)}{ \sqrt{d} \left\Vert Q^{-1}-S_{i-1}^{-1} \right\Vert _F },{\frac{\nu _i}{2d}} \} \right\} >0, \end{aligned}$$

where the positivity of the third term follows from the fact that

$$\begin{aligned} \left\Vert Q^{-1}-S_{i-1}^{-1} \right\Vert _F\le \left\Vert Q^{-1} \right\Vert _2 \left\Vert S_{i-1}^{-1} \right\Vert _2 \left\Vert Q-S_{i-1} \right\Vert _F. \end{aligned}$$

Therefore, we have shown that there exists $\delta \in (0,1/2)$ such that $\{\nu _i,h_i\}_{i\ge 1}$ defined in (27) satisfies the assumptions in Theorem 6 and $\inf _i h_i>0$.

Last, we discuss the iteration complexity under the choice of $\{\nu _i,h_i\}_{i\ge 1}$ in (27). According to Remark 9, $\textsc {KL}(\rho ^n|\pi )\le \epsilon $ if n satisfies

$$\begin{aligned} \sum _{i=1}^n \frac{h_i}{1-\nu _i} \ge \frac{2}{q_d} \log \frac{\textsc {KL}(\rho ^0|\pi )}{\epsilon }. \end{aligned}$$

Since $\inf _i h_i>0$ and $\inf _i 1-\nu _i>0$, we obtain $n=O(\log ({\textsc {KL}(\rho ^0|\pi )}/{\epsilon }))$.

5 Existence and Uniqueness

The existence and uniqueness of the SVGF was studied in [30]. Motivated by their approach, in this section we study the existence and uniqueness of solutions to (7) under appropriate assumptions. Our main difficulty is in handling the non-linear operator $\left( (1-\nu ){\mathcal {T}}_{k,\mu }+ \nu I\right) ^{-1} {\mathcal {T}}_{k,\mu }$ in the R-SVGF.

We first introduce the definition of weak solutions to (7). We restrict the initial conditions in the probability measure space $\mathcal {P}_V$ which is defined as

$$\begin{aligned} \mathcal {P}_V:=\left\{ \rho \in \mathcal {P}~~:~~\left\Vert \rho \right\Vert _{\mathcal {P}_V}:=\int _{\mathbb {R}^d} (1+V(x)) d\rho (x)<\infty \right\} , \end{aligned}$$

where, in this section, with a slight overload of notations, we use $\mathcal {P}$ to denote the set of all probability measures on $\mathbb {R}^d$. We emphasize here that that $\mathcal {P}_V$ is a space of probability measures because the weak solutions do not necessarily have densities even if the target measure and the initial measure has a density; see [30] for additional details. We say that a measure-valued function $\rho \in \mathcal {C}([0,\infty ),\mathcal {P}_V)$ is a weak solution to (7) with initial condition $\rho _0\in \mathcal {P}_V$ if

$$\begin{aligned} \sup _{t\in [0,T]} \left\Vert \rho _t \right\Vert _{\mathcal {P}_V}<\infty ,\ \forall \ T>0, \end{aligned}$$

and

$$\begin{aligned} \int _0^\infty \int _{\mathbb {R}^d} (\partial _t \phi (t,x)+\nabla \phi (t,x) \cdot U[\rho _t](x))d\rho _t(x)dt+\int _{\mathbb {R}^d} \phi (0,x)d\rho _0(x)=0, \end{aligned}$$

for all $\phi \in \mathcal {C}_0^\infty ([0,\infty )\times \mathbb {R}^d)$ and $U[\rho ]:=-\left( (1-\nu )\iota _{k,\rho }\iota _{k,\rho }^*+\nu I\right) ^{-1}\iota _{k,\rho }\iota _{k,\rho }^*(\nabla \log \frac{\rho }{\pi }) $.

In order to study the existence of weak solutions, we consider the characteristic flow (see, for example, [32] and [30, Definition 3.1]) induced by (7), which is written as

$$\begin{aligned} \left\{ \begin{aligned}&\frac{d}{dt}\Phi (t,x,\rho _0)=-\mathcal {D}_{\nu ,\rho _t}\nabla \log \frac{\rho _t}{\pi }(\Phi (t,x,\rho _0)),\\&\rho _t=(\Phi (t,\cdot ,\rho _0))_{\#} \rho _0,\\&\Phi (0,x,\rho _0)=x, \end{aligned} \right. \end{aligned}$$

(28)

where $\mathcal {D}_{\nu ,\rho _t}=\left( (1-\nu )\iota _{k,\rho _t}\iota _{k,\rho _t}^*+\nu I\right) ^{-1}\iota _{k,\rho _t}\iota _{k,\rho _t}^*$ for all $t>0$. Here, the expression $\rho _t=\Phi (t,\cdot ,\rho _0)_{\#}\rho _0$ means that the measure $\rho _t$ is the push-forward measure of $\rho _0$ under the map $x\rightarrow \Phi (t,x,\rho _0)$. We think of $\{ X(t,\cdot ,\rho _0)\}_{t\ge 0,\rho _0}$ as a family of maps from $\mathbb {R}^d$ to $\mathbb {R}^d$ parameterized by t and $\rho _0$. The existence and uniqueness of the weak solutions of (7) is equivalent to the existence and uniqueness of solutions to (28). In Theorem 7, we first prove that the mean field characteristic flow in (28) is well-defined. To do so, we also require the following additional assumptions on the kernel and the potential functions.

Assumption K1

The kernel $k:\mathbb {R}^d\times \mathbb {R}^d\rightarrow \mathbb {R}$ is symmetric, positive definite and fourth continuously differentiable in both variables with bounded derivatives up to fourth order. More explicitly, we assume

(1)
$\left\Vert k \right\Vert _\infty :=\sup _{x\in \mathbb {R}^d} \sqrt{k(x,x)}<\infty $
(2)
$\left\Vert \nabla k \right\Vert _\infty :=\sup _{x,y\in \mathbb {R}^d} |\nabla _1 k(x,y)|=\sup _{x,y\in \mathbb {R}^d}|\nabla _2 k(x,y)|<\infty $
(3)
$\left\Vert \nabla _1\cdot \nabla _2 k \right\Vert _{\infty }:=\sup _{x,y\in \mathbb {R}^d} |\nabla _x\cdot \nabla _y k(x,y)|<\infty $.
(4)
$\left\Vert \nabla ^2 k \right\Vert _\infty :=\sup _{x,y\in \mathbb {R}^d} \left\Vert \nabla _x^2 k(x,y) \right\Vert _2<\infty $.
(5)
$\left\Vert \nabla _1\nabla _2 k(x,y) \right\Vert _\infty :=\sup _{x,y\in \mathbb {R}^d} \left\Vert \nabla _x\nabla _y k(x,y) \right\Vert _2<\infty $.
(6)
$\left\Vert \nabla ^2(\nabla _1\cdot \nabla _2 k) \right\Vert _\infty :=\sup _{x,y\in \mathbb {R}^d} \left\Vert \nabla _x^2(\nabla _x\cdot \nabla _y k(x,y)) \right\Vert _2<\infty $.
(7)
$\left\Vert \nabla _1\nabla _2(\nabla _1\cdot \nabla _2 k) \right\Vert _\infty :=\sup _{x,y\in \mathbb {R}^d} \left\Vert \nabla _x\nabla _y (\nabla _x\cdot \nabla _y k(x,y)) \right\Vert _2<\infty $.
(8)
$\left\Vert \nabla _1^2 \cdot \nabla _2^2 k \right\Vert _\infty :=\sup _{x,y\in \mathbb {R}^d} \sum _{i,j=1}^d |\partial _{x_i}\partial _{x_j}\partial _{y_i}\partial _{y_j}k(x,y)|<\infty $.

We emphasize here that [30] required that the kernel is radial for their analysis. However, our analysis does not require this assumption. A classical example of a kernel satisfying the above conditions is the Gaussian kernel.

Assumption V1

The potential function $V:\mathbb {R}^d\rightarrow \mathbb {R}$ satisfies

(1)
$V\in \mathcal {C}^2(\mathbb {R}^d)$, $V\ge 0$ and $V(x)\rightarrow +\infty $ as $|x|\rightarrow +\infty $.
(2)
For any $\alpha ,\beta >0$, there exists a constant $C_{\alpha ,\beta }>0$ such that if $|y|\le \alpha |x|+\beta $, then
$$\begin{aligned} (1+|x|)(|\nabla V(y)|+\left\Vert \nabla ^2 V(y)\right\Vert _2)\le C_{\alpha ,\beta }(1+V(x)). \end{aligned}$$
(3)
V is gradient Lipschitz with parameter $L_V$, i.e., for all $x\in \mathbb {R}^d$, $\left\Vert \nabla ^2 V (x)\right\Vert _2\le L_V$.

To present our result, we define the set of functions

$$\begin{aligned} Y:=\left\{ u\in \mathcal {C}(\mathbb {R}^d;\mathbb {R}^d) | \sup _{x\in \mathbb {R}^d} |u(x)-x|<\infty \right\} , \end{aligned}$$

which is a complete metric space with the uniform metric $d_Y(u,v)=\sup _{x\in \mathbb {R}^d} |u(x)-v(x)|$.

Theorem 7

Let k satisfy Assumption K1, V satisfy Assumption V1 and $\rho _0\in \mathcal {P}_V$.

(i)
For any $T>0$, there exists a unique solution $\Phi (\cdot ,\cdot ,\rho _0)\in \mathcal {C}^1([0,T];Y)$ to (28). Moreover, the measure $\rho _t=\Phi (t,\cdot ,\rho _0)_{\#}\rho _0$ satisfies
$$\begin{aligned} \left\Vert \rho _t\right\Vert _{\mathcal {P}_V}\le \left\Vert \rho _0 \right\Vert _{\mathcal {P}_V} \exp (C_{1,0}\nu ^{-1/2} \left\Vert k \right\Vert _\infty \textsc {KL}(\rho _0|\pi )^{1/2}t^{1/2}). \end{aligned}$$
(ii)
For any $\rho _0\in \mathcal {P}_V$, there is a unique $\rho \in \mathcal {C}([0,\infty ); \mathcal {P}_V )$ which is a weak solution to (7). Moreover, for all $t\ge 0$,
$$\begin{aligned} \left\Vert \rho _t\right\Vert _{\mathcal {P}_V}\le \left\Vert \rho _0 \right\Vert _{\mathcal {P}_V} \exp (C_{1,0}\nu ^{-1/2} \left\Vert k \right\Vert _\infty \textsc {KL}(\rho _0|\pi )^{1/2}t^{1/2}). \end{aligned}$$

Remark 11

In Theorem 7, we introduce an upper bound to the $\mathcal {P}_V$-norm of the solution to (7) for any $\nu \in (0,1]$. A similar result is established for the case of SVGF, i.e., when $\nu =1$ in [30, Theorem 2.4]. In comparison to [30, Theorem 2.4], our result requires that the initial KL-divergence to the target is bounded. Furthermore, if we set $\nu =1$ in our result, we do not end up recovering their result. When $\nu =1$, there is an explicit integral formula to $\mathcal {D}_{1,\rho _t}\nabla \log \frac{\rho _t}{\pi }$ which is leveraged in [30] for their proof. For $\nu \in (0,1)$, due to the absence of an explicit representation, we get the result in Theorem 7 by carefully analyzing the quantity $\mathcal {D}_{\nu ,\rho _t}(\nabla \log \frac{\rho _t}{\pi })$ along with the decay of KL-divergence property introduced in Proposition 4.

Proof of Theorem 7

Our proof leverages the approach of [30, Theorem 3.2] for the case of SVGF. In comparison to [30], we handle various difficulties arising with the non-linear operator in R-SVGF. We first prove claim (i) based on the following two steps. Claim (ii) follows directly from claim (i) and [46, Theorem 5.34].

Step 1 (Local well-posedness): Fix $r>0$ and define

$$\begin{aligned} Y_r:=\left\{ u\in Y | \sup _{x\in \mathbb {R}^d} |u(x)-x|\le r \right\} . \end{aligned}$$

(29)

We will prove that there exists $T_0>0$ such that (28) has a unique solution $\Phi (t,x,\rho _0)$ in the set $S_r=\mathcal {C}([0,T_0];Y_r)$ which is a complete metric space with metric

$$\begin{aligned} d_S\left( u,v\right) =\sup _{t\in [0,T_0]} d_Y \left( u(t,\cdot ),v(t,\cdot )\right) . \end{aligned}$$

The integral formulation of (28) is

$$\begin{aligned} \Phi (t,x,\rho _0)&=x-\int _0^t \mathcal {D}_{\nu ,\rho _s} \nabla \log \frac{\rho _s}{\pi }(\Phi (s,x,\rho _0)) ds \nonumber \\&=x+\int _0^t \big ((1-\nu )\iota _{k,\rho _s}\iota _{k,\rho _s}^*+\nu I\big )^{-1}\mathbb {E}_{y\sim \rho _s} \nonumber \\&\quad [-\nabla V(y)k(y,\cdot )+\nabla k(y,\cdot )] (\Phi (s,x,\rho _0))ds. \end{aligned}$$

(30)

Let us define the operator $\mathcal {F}:u(t,\cdot )\mapsto \mathcal {F}(u)(t,\cdot )$ by

$$\begin{aligned} \mathcal {F}(u)(t,x)&=x+\int _0^t \big ((1-\nu )\iota _{k,\rho _{u,s}}\iota _{k,\rho _{u,s}}^*+\nu I\big )^{-1}\mathbb {E}_{y\sim \rho _{u,s}} \nonumber \\&\quad [-\nabla V(y)k(y,\cdot )+\nabla k(y,\cdot )] (u(s,x))ds, \end{aligned}$$

(31)

where $\rho _{u,t}=\left( u(t,\cdot ) \right) _{\#}\rho _0$. For the simplicity of notation, we will denote the map defined in (31) by $\mathcal {F}(u)(t,x)=x-\int _0^t \mathcal {D}_{\nu ,\rho _{u,s}}\nabla \log \frac{\rho _{u,s}}{\pi }(u(s,x))ds$ for any $u\in S_r$. We now show that $\mathcal {F}$ is a contraction from $S_r$ to $S_r$ and thus has a unique fixed point. First, we show that $\mathcal {F}$ maps $S_r$ into $S_r$ for some $T_0>0$. For any $u\in S_r$, checking that $(t,x)\mapsto \mathcal {F}(u)(t,x)$ is continuous is straightforward. We need to show that $\sup _{x\in \mathbb {R}^d} |\mathcal {F}(u)(t,x)-x|\le r $ for any $u\in S_r$.

Note that there is an equivalent representation for $\mathcal {D}_{\nu ,\rho _{u,s}}$:

$$\begin{aligned} \mathcal {D}_{\nu ,\rho _{u,s}}=\iota _{k,\rho _{u,s}}\left( (1-\nu )\iota _{k,\rho _{u,s}}^*\iota _{k,\rho _{u,s}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{u,s}}^*. \end{aligned}$$

We then analyze the operators $\iota _{k,\rho _{u,s}}$ and $\left( (1-\nu )\iota _{k,\rho _{u,s}}^*\iota _{k,\rho _{u,s}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{u,s}}^*$ respectively. Since $\left\Vert k \right\Vert _{\infty }<\infty $, according to Proposition 1, $\iota _{k,\rho _{u,s}}$ is the inclusion operator from $\mathcal {H}_k^d$ to $L_\infty ^d(\mathbb {R}^d)$. The corresponding operator norm, denoted as $\left\Vert \iota _{k,\rho _{u,s}} \right\Vert _{\mathcal {H}_k^d\rightarrow L_\infty ^d}$ can be bounded in the following way:

$$\begin{aligned} \left\Vert \iota _{k,\rho _{u,s}} \right\Vert _{\mathcal {H}_k^d\rightarrow L_\infty ^d}&:= \sup _{\left\Vert f \right\Vert _{\mathcal {H}_k^d}=1} \sup _{x\in \mathbb {R}^d} |f(x)| \nonumber \\&= \sup _{\left\Vert f \right\Vert _{\mathcal {H}_k^d}=1} \sup _{x\in \mathbb {R}^d} |\langle k(x,\cdot ), f \rangle _{\mathcal {H}_k}| \nonumber \\&\le \sup _{x\in \mathbb {R}^d} \sqrt{k(x,x)}:=\left\Vert k \right\Vert _\infty . \end{aligned}$$

(32)

Meanwhile, let $(\lambda _i,e_i)_{i=1}^\infty $ be the spectrum of $\iota _{k,\rho _{u,s}}\iota _{k,\rho _{u,s}}^*$ with $(e_i)_{i=1}^\infty $ being an orthonormal basis of $L^d_2(\rho _{u,s}) \equiv \overline{\text {Ran}(\iota _{k,\rho _{u,s}}\iota _{k,\rho _{u,s}}^*)}$. According to Proposition 1, $(\sqrt{\lambda _i}e_i)_{i=1}^\infty $ is an orthonormal basis of $\mathcal {H}^d_k$; see also Remark 1. Hence, we have

$$\begin{aligned}&\left\Vert \left( (1-\nu )\iota _{k,\rho _{u,s}}^*\iota _{k,\rho _{u,s}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{u,s}}^*\nabla \log \frac{\rho _{u,s}}{\pi } \right\Vert _{\mathcal {H}_k^d}^2 \nonumber \\&\quad \le \left\Vert \left( (1-\nu )\iota _{k,\rho _{u,s}}^*\iota _{k,\rho _{u,s}}+\nu I_d\right) ^{-\frac{1}{2}} \right\Vert _{\mathcal {H}_k^d\rightarrow \mathcal {H}_k^d}^2 \nonumber \\&\qquad \left\Vert \left( (1-\nu )\iota _{k,\rho _{u,s}}^*\iota _{k,\rho _{u,s}}+\nu I_d\right) ^{-\frac{1}{2}}\iota _{k,\rho _{u,s}}^*\nabla \log \frac{\rho _{u,s}}{\pi } \right\Vert _{\mathcal {H}_k^d}^2 \nonumber \\&\quad \le \nu ^{-1}I_{\nu ,Stein}(\rho _{u,s}|\pi ). \end{aligned}$$

(33)

where the last inequality follows from (9) and the fact that $(1-\nu )\iota _{k,\rho _{u,s}}^*\iota _{k,\rho _{u,s}}$ is positive. With (32) and (33), we get the following uniform bound on $|\mathcal {D}_{\nu ,\rho _{u,s}}{\nabla }\log \frac{\rho _{u,s}}{\pi }(x)|$ for all $x\in \mathbb {R}^d$,

$$\begin{aligned} \left| \mathcal {D}_{\nu ,\rho _{u,s}}\nabla \log \frac{\rho _{u,s}}{\pi }(x)\right|&\le \left\Vert \mathcal {D}_{\nu ,\rho _{u,s}}\nabla \log \frac{\rho _{u,s}}{\pi } \right\Vert _{L_\infty ^d}\\&\le \left\Vert \iota _{k,\rho _{u,s}} \right\Vert _{\mathcal {H}_k^d\rightarrow L_\infty ^d} \\&\quad \left\Vert \left( (1-\nu )\iota _{k,\rho _{u,s}}^*\iota _{k,\rho _{u,s}}+\nu I_d\right) ^{-1} \iota _{k,\rho _{u,s}}^*\nabla \log \frac{\rho _{u,s}}{\pi } \right\Vert _{\mathcal {H}_k^d}\\&\le \nu ^{-\frac{1}{2}} \left\Vert k \right\Vert _\infty I_{\nu ,Stein}(\rho _{u,s}|\pi )^{\frac{1}{2}}. \end{aligned}$$

Therefore, for all $t\in [0,T]$ and all $ x\in \mathbb {R}^d$:

$$\begin{aligned} |\mathcal {F}(u)(t,x)-x|\le \nu ^{-\frac{1}{2}} \left\Vert k \right\Vert _\infty \int _0^T I_{\nu ,Stein}(\rho _{u,s}|\pi )^{\frac{1}{2}} ds. \end{aligned}$$

(34)

According to Lemma 2, there exists $T_0>0$ such that for all $u\in S_r$,

$$\begin{aligned} \int _0^{T_0}I_{\nu ,Stein}(\rho _{u,t}|\pi )^{\frac{1}{2}}dt < \nu ^{1/2} \left\Vert k \right\Vert _\infty ^{-1} r , \end{aligned}$$

which along with (34) implies $|\mathcal {F}(u)(t,x)-x|\le r$ for all $u\in S_r$.

Next we show that $\mathcal {F}$ is a contraction on $S_r$. Our goal is to show that there exists $T_0>0$ and $C\in (0,1)$ such that for any $u,v\in S_r$,

$$\begin{aligned} \sup _{t\in [0,T_0]}\sup _{x\in \mathbb {R}^d} \left| \mathcal {F}(u)(t,x)-\mathcal {F}(v)(t,x)\right| < C \sup _{t\in [0,T_0]}\sup _{x\in \mathbb {R}^d} |u(t,x)-v(t,x)|. \end{aligned}$$

Let $\rho _{u,t}=(u(t,\cdot ))_{\#}\rho _0$, $\rho _{v,t}=(v(t,\cdot ))_{\#}\rho _0$, we have that

$$\begin{aligned}&\left| \mathcal {F}(u)(t,x)-\mathcal {F}(v)(t,x)\right| \\ {}&\quad =\left| \int _0^t \mathcal {D}_{\nu ,\rho _{u,s}}\nabla \log \frac{\rho _{u,s}}{\pi }\left( u(s,x)\right) -\mathcal {D}_{\nu ,\rho _{v,s}}\nabla \log \frac{\rho _{v,s}}{\pi }\left( v(s,x)\right) ds\right| \\&\quad \le \left| \int _0^t \mathcal {D}_{\nu ,\rho _{u,s}}\nabla \log \frac{\rho _{u,s}}{\pi }(u(s,x))-\mathcal {D}_{\nu ,\rho _{v,s}}\nabla \log \frac{\rho _{v,s}}{\pi }(u(s,x)) ds \right| \\&\qquad +\left| \int _0^t \mathcal {D}_{\nu ,\rho _{v,s}}\nabla \log \frac{\rho _{v,s}}{\pi }(u(s,x))-\mathcal {D}_{\nu ,\rho _{2,s}}\nabla \log \frac{\rho _{2,s}}{\pi }(v(s,x))ds \right| \\&\quad \le d_S(u,v)\int _0^{T_0} C_1(t) dt+\int _0^{T_0} L(t) dt \sup _{t\in [0,T_0]}\sup _{x\in \mathbb {R}^d} |u(t,x)-v(t,x)|\\&\quad = d_S(u,v) \int _0^{T_0} C_1(t)+L(t)\ dt, \end{aligned}$$

where the second inequality follows from Lemma 3 and Lemma 4. Furthermore, according to (38) and (39), there exists $T_0>0$ such that

$$\begin{aligned} \int _0^{T_0} C_1(t)+L(t)\ dt < 1. \end{aligned}$$

Therefore we have proved that there exists $T_0>0$ such that $\mathcal {F}$ is a contraction from $S_r$ into $S_r$. According to the contraction theorem, $\mathcal {F}$ has a unique fixed point $\Phi (\cdot ,\cdot ,\rho _0)\in S_r$ which solves (28). Defining $\rho _t=(\Phi (t,\cdot ,\rho _0))_{\#}\rho _0$, one sees that $\Phi (t,x,\rho _0)$ solves (28) in the time interval $[0,T_0]$.

Step 2 (Extension of local solution): According to (37) and (39), we can extend the local solution beyond time $T_0$ as long as the quantity

$$\begin{aligned} \left\Vert \rho _t \right\Vert _{\mathcal {P}_V} =\int _{\mathbb {R}^d} (1+V(\Phi (t,x,\rho _0))) d\rho _0(x) \end{aligned}$$

remains finite. Next we establish a bound on this quantity showing that the local solution can be extended to any $t>0$.

$$\begin{aligned}&\partial _t \int _{\mathbb {R}^d} \left( 1+V(\Phi (t,x,\rho _0)) \right) \rho _0(dx) \\&=-\int _{\mathbb {R}^d} \left\langle \nabla V( \Phi (t,x,\rho _0) ) , \mathcal {D}_{\nu ,\rho _t}\nabla \log \frac{\rho _t}{\pi }(\Phi (t,x,\rho _0)) \right\rangle d\rho _0(x) \\&=-\left\langle \nabla V, \iota _{k,\rho _t}\left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^* \nabla \log \frac{\rho _t}{\pi } \right\rangle _{L_2^d(\rho _t)} \\&=-\left\langle \iota _{k,\rho _t}^* \nabla V, \left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^* \nabla \log \frac{\rho _t}{\pi } \right\rangle _{\mathcal {H}_k^d} \\&\le \left\Vert \iota _{k,\rho _t}^* \nabla V \right\Vert _{\mathcal {H}_k^d} \left\Vert \left( (1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d\right) ^{-1}\iota _{k,\rho _t}^* \nabla \log \frac{\rho _t}{\pi } \right\Vert _{\mathcal {H}_k^d}, \end{aligned}$$

where

$$\begin{aligned}&\left\Vert \iota _{k,\rho _t}^* \nabla V \right\Vert _{\mathcal {H}_k^d}^2=\left\langle \nabla V, \iota _{k,\rho }\iota _{k,\rho _t}^*\nabla V \right\rangle _{L_2^d(\rho _t)}\\&=\int _{\mathbb {R}^d} \int _{\mathbb {R}^d} k(y,z) \left\langle \nabla V(y) ,\nabla V(z) \right\rangle d\rho _t(y) d\rho _t(z)\\&=\int _{\mathbb {R}^d}\int _{\mathbb {R}^d} k(\Phi (t,y,\rho _0),\Phi (t,z,\rho _0)) \left\langle \nabla V(\Phi (t,y,\rho _0)), \nabla V(\Phi (t,z,\rho _0)) \right\rangle d\rho _0(y) d\rho _0(z) \\&\le \left\Vert k \right\Vert _\infty ^2 \left( \int _{\mathbb {R}^d} |{\nabla V(\Phi (t,y,\rho _0))}| d\rho _0(y) \right) ^2 \\&\le \left\Vert k \right\Vert _\infty ^2 C_{1,0}^2 \left\Vert \rho _t \right\Vert _{\mathcal {P}_V}^2, \end{aligned}$$

where the last inequality follows from Assumption V1. Therefore,

$$\begin{aligned} \partial _t \left\Vert \rho _t \right\Vert _{\mathcal {P}_V}&\le C_{1,0}\left\Vert k \right\Vert _\infty \left\Vert \rho _t \right\Vert _{\mathcal {P}_V} \left\Vert ((1-\nu )\iota _{k,\rho _t}^*\iota _{k,\rho _t}+\nu I_d)^{-1}\iota _{k,\rho _t}^* \nabla \log \frac{\rho _t}{\pi } \right\Vert _{\mathcal {H}_k^d} \\&\le C_{1,0} \left\Vert k \right\Vert _\infty \nu ^{-\frac{1}{2}} I_{\nu ,Stein}(\rho _t|\pi )^{\frac{1}{2}} \left\Vert \rho _t \right\Vert _{\mathcal {P}_V}, \end{aligned}$$

where the last inequality follows from (33). It follows from Gronwall’s inequality that

$$\begin{aligned} \left\Vert \mathcal {\rho }_t \right\Vert _{\mathcal {P}_V}&\le \left\Vert \rho _0 \right\Vert _{\mathcal {P}_V} \exp \left( C_{1,0} \nu ^{-\frac{1}{2}} \left\Vert k \right\Vert _\infty \int _0^t I_{\nu ,Stein}(\rho _s|\pi )^{\frac{1}{2}} ds \right) \nonumber \\&\le \left\Vert \rho _0 \right\Vert _{\mathcal {P}_V} \exp \left( C_{1,0} \nu ^{-\frac{1}{2}} \left\Vert k \right\Vert _\infty \sqrt{t \int _0^t I_{\nu ,Stein}(\rho _s|\pi ) ds} \right) \nonumber \\&\le \left\Vert \rho _0 \right\Vert _{\mathcal {P}_V} \exp \left( C_{1,0} \nu ^{-\frac{1}{2}} \left\Vert k \right\Vert _\infty \sqrt{t \textsc {KL}(\rho _0|\pi ) } \right) , \end{aligned}$$

(35)

where the second inequality follows from Jensen’s inequality and the last inequality follows from (15). With this bound, we can iterate the argument to extend the local solution defined on $[0,T_0]\times \mathbb {R}^d$ to all of $[0,\infty )\times \mathbb {R}^d$, so that (35) holds for all $t>0$. Finally, $\Phi (\cdot ,x,\rho _0)$ has continuous first order derivative due to the integral formulation in (30), Assumption K1 and Assumption V1. The proof is thus complete.

Lemma 2

Let $\rho _0\in \mathcal {P}_V$ and suppose the assumptions in Theorem 7 hold. Then, for any $\epsilon >0$, there exists a constant $T>0$ such that for all $u\in S_r$ and $t\in [0,T]$, with $\rho _{u,t}=u(t,\cdot )_{\#}\rho _0$, we have

$$\begin{aligned} \int _0^T I_{\nu ,Stein}(\rho _{u,t}|\pi )^{\frac{1}{2}}dt < \epsilon . \end{aligned}$$

(36)

Proof of Lemma 2

According to Proposition 2, the regularized kernelized Stein discrepancy can be written as

$$\begin{aligned} S(\rho _{u,t}|\pi )^2&= \left( \mathbb {E}_{x\sim \rho _{u,t}} \left[ \text {trace}(\mathcal {A}_\pi \phi _{\rho _{u,t},\pi }^*(x)) \right] \right) ^2\\&= \left\Vert \left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I_d \right) ^{-\frac{1}{2}}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}^2 \\&=\left\langle \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }, \left( (1-\nu ) \iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I_d \right) ^{-1}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi } \right\rangle _{\mathcal {H}_k^d}\\&=I_{\nu ,Stein}(\rho _{u,t},\pi ). \end{aligned}$$

Meanwhile, since $\rho _{u,t}=u(t,\cdot )_{\#}\rho _0$ with $u\in S_r$, for any $\mathbb {R}^d$-valued random vector X, $u(t,X)\sim \rho _{u,t}$ and $|u(t,X)-X|\le r$ almost surely. Therefore,

$$\begin{aligned} {\mathcal {W}_1}(\rho _{u,t},\pi )&\le {\mathcal {W}_1}(\rho _0,\rho _{u,t})+{\mathcal {W}_1}(\rho _0,\pi )\\&=\inf _{X\sim \rho _0,Y\sim \rho _{u,t}} \mathbb {E}\left[ |X-Y| \right] +{\mathcal {W}_1}(\rho _0,\pi ) \le r+{\mathcal {W}_1}(\rho _0,\pi ), \end{aligned}$$

where $\mathcal {W}_1$ is the Wasserstein-1 distance defined in Sect. 1.3. Next, we upper bound the regularized kernelized Stein discrepancy by the Wasserstein-1 distance. According to [22, Lemma 18], for any general vector field $\phi \in \mathcal {H}_k^d$, we have

$$\begin{aligned} \left| \mathbb {E}_{x\sim \rho _{u,t}} \left[ \text {trace}(\mathcal {A}_\pi \phi (x)) \right] \right|&\le \left( M_0(\phi )M_1(\nabla V)+M_2(\phi )d\right) {\mathcal {W}_1}(\rho _{u,t},\pi ) \\&\quad \quad +\sqrt{2M_0(\phi )M_1(\phi )\mathbb {E}_{x\sim \pi }\left[ |\nabla V(x)|^2\right] {\mathcal {W}_1}(\rho _{u,t},\pi )}, \end{aligned}$$

where for any $g:\mathbb {R}^d\rightarrow \mathbb {R}^d$ and $g\in C^1(\mathbb {R}^d)$,

$$\begin{aligned} M_0(g):=\sup _{x\in \mathbb {R}^d} |g(x)|, \quad M_1(g):=\sup _{x\ne y} \frac{|g(x)-g(y)|}{|x-y|}, \quad M_2(g):=\sup _{x\ne y} \frac{\left\Vert \nabla g(x)-\nabla g(y) \right\Vert _2}{|x-y|}. \end{aligned}$$

For any $\phi \in \mathcal {H}_k^d$ and $\phi =[\phi _1,\cdots ,\phi _d]^T$, according to [42, Lemma 4.34],

$$\begin{aligned} \sup _{x\in \mathbb {R}^d} |D^\alpha \phi _i(x)|=\sup _{x\in \mathbb {R}^d} \left| D^\alpha \left\langle \phi _i,k(x,\cdot ) \right\rangle _{\mathcal {H}_k} \right| \le \left\Vert \phi _i \right\Vert _{\mathcal {H}_k^d} \sup _{x\in \mathbb {R}^d} |D_1^\alpha D_2^\alpha k(x,x) |^{\frac{1}{2}}. \end{aligned}$$

Therefore,

$$\begin{aligned} M_0(\phi )&=\sup _{x\in \mathbb {R}^d} \sqrt{\sum _{i=1}^d \phi _i(x)^2}\le \sqrt{ \sum _{i=1} \left\Vert \phi _i \right\Vert _{\mathcal {H}_k}^2 \sup _{x\in \mathbb {R}^d} k(x,x) } =\left\Vert k \right\Vert _\infty \left\Vert \phi \right\Vert _{\mathcal {H}_k^d},\\&\\ M_1(\phi )&= \sup _{x\ne y} \frac{\sqrt{\sum _{i=1}^d\left( \phi _i(x)-\phi _i(y)\right) ^2}}{|x-y|}\le \sqrt{ \sum _{i=1}^d \sup _{x\in \mathbb {R}^d} |\nabla \phi _i(x)|^2 } \\&\le \sqrt{ \sum _{i=1}^d\sum _{j=1}^d \left\Vert \phi _i \right\Vert _{\mathcal {H}_k}^2 \sup _{x\in \mathbb {R}^d} D_1^{e_j}D_2^{e_j} k(x,x) } = \left( \sup _{x\in \mathbb {R}^d} \text {trace} \left( \nabla _1 \nabla _2 k(x,x) \right) \right) ^{\frac{1}{2}}\left\Vert \phi \right\Vert _{\mathcal {H}_k^d} \\&\le \left\Vert \nabla _1\cdot \nabla _2 k \right\Vert _\infty ^{\frac{1}{2}} \left\Vert \phi \right\Vert _{\mathcal {H}_k^d},\\&\\ M_2(\phi )&=\sup _{x\ne y} \frac{\left\Vert \nabla \phi (x)-\nabla \phi (y) \right\Vert _2}{|x-y|}\le \sup _{x\ne y} \frac{\left\Vert \nabla \phi (x)-\nabla \phi (y) \right\Vert _F}{|x-y|}\\&\le \sqrt{\sum _{i,j=1}^d \sup _{x\ne y} \frac{|\partial _j \phi _i(x)-\partial _j \phi _i(y)|^2}{|x-y|^2} } \\&\le \sqrt{\sum _{i,j,l=1}^d \sup _{x\in \mathbb {R}^d} D_1^{e_j+e_l}D_2^{e_j+e_l} k(x,x) \left\Vert \phi _i \right\Vert _{\mathcal {H}_k}^2 } \le \left\Vert \nabla _1^2 \cdot \nabla _2^2 k\right\Vert _\infty ^{\frac{1}{2}} \left\Vert \phi \right\Vert _{\mathcal {H}_k^d}. \end{aligned}$$

According to Assumption V1, $M_1(\nabla V)=L_V$ and $\mathbb {E}_{x\sim \pi }\left[ |\nabla V(x)|^2 \right] \le C_{1,0} \left\Vert \pi \right\Vert _{\mathcal {P}_V} $. Therefore,

$$\begin{aligned} \left| \mathbb {E}_{x\sim \rho _{u,t}} \left[ \text {trace}(\mathcal {A}_\pi \phi (x)) \right] \right|&\le \left( \left\Vert k \right\Vert _\infty L_V+\left\Vert \nabla _1^2\cdot \nabla _2^2 k \right\Vert _\infty ^{\frac{1}{2}}d \right) \left\Vert \phi \right\Vert _{\mathcal {H}_k^d} ({\mathcal {W}_1}(\rho _0,\pi )+r)\\&\quad +\sqrt{ 2\left\Vert k \right\Vert _\infty \left\Vert \nabla _1\cdot \nabla _2 k \right\Vert _\infty ^{\frac{1}{2}}C_{1,0}\left\Vert \pi \right\Vert _{\mathcal {P}_V}({\mathcal {W}_1}(\rho _0,\pi )+r) } \left\Vert \phi \right\Vert _{\mathcal {H}_k^d}. \end{aligned}$$

Note that $\phi _{k,\rho _{u,t}}^*$ satisfies that $\nu \left\Vert \phi _{k,\rho _{u,t}}^* \right\Vert _{\mathcal {H}_k^d}^2+(1-\nu )\left\Vert \phi _{k,\rho _{u,t}}^* \right\Vert _{L^d_2(\rho _{u,t})^d}\le 1$. Therefore,

$$\begin{aligned} \left\Vert \phi _{k,\rho _{u,t}}^* \right\Vert _{\mathcal {H}_k^d}\le \nu ^{-1/2}, \end{aligned}$$

and

$$\begin{aligned} I_{\nu ,Stein}(\rho _{u,t}|\pi )^{\frac{1}{2}}&=S(\rho _{u,t},\pi ) =\left| \mathbb {E}_{x\sim \rho _{u,t}} \left[ \text {trace}(\mathcal {A}_\pi \phi _{k,\rho _{u,t}}^*(x)) \right] \right| \\&\le \nu ^{-\frac{1}{2}} \left( \left\Vert k \right\Vert _\infty L_V+\left\Vert \nabla _1^2\cdot \nabla _2^2 k \right\Vert _\infty ^{\frac{1}{2}}d \right) ({\mathcal {W}_1}(\rho _0,\pi )+r)\\&\quad +\nu ^{-\frac{1}{2}}\sqrt{ 2\left\Vert k \right\Vert _\infty \left\Vert \nabla _1\cdot \nabla _2 k \right\Vert _\infty ^{\frac{1}{2}}C_{1,0}\left\Vert \pi \right\Vert _{\mathcal {P}_V}({\mathcal {W}_1}(\rho _0,\pi )+r) }. \end{aligned}$$

Since the upper bound is independent of the choice of $u(t,\cdot )\in S_r$ and the time variable t, for any $\epsilon >0$, we can choose a small enough T such that (36) holds.

Lemma 3

Under the assumptions in Theorem 7, let $S_r=\mathcal {C}([0,T];Y_r)$ with $Y_r$ defined in (29). Then for any $t\in [0,T]$, there exists $L(t)>0$ such that for any $u\in S_r$, for all $x,y\in \mathbb {R}^d$ and $t\in [0,T]$,

$$\begin{aligned} \left| \mathcal {D}_{\nu ,\rho _{u,t}}\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\mathcal {D}_{\nu ,\rho _{u,t}}\nabla \log \frac{\rho _{u,t}}{\pi }(y)\right| \le L(t)|x-y| , \end{aligned}$$

(37)

where for all $t\in [0,T]$, $\rho _{u,t}=(u(t,\cdot ))_{\#}\rho _0$ and

$$\begin{aligned} L(t)=\nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} I_{\nu ,Stein}(\rho _{u,t}|\pi )^{\frac{1}{2}}. \end{aligned}$$

(38)

Proof of Lemma 3

Since $\mathcal {D}_{\nu ,\rho _{u,t}}=\iota _{k,\rho _{u,t}}\left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{u,t}}^*$ and $\iota _{k,\rho _{u,t}}$ is the inclusion operator,

$$\begin{aligned}&\left| \mathcal {D}_{\nu ,\rho _{u,t}}\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\mathcal {D}_{\nu ,\rho _{u,t}}\nabla \log \frac{\rho _{u,t}}{\pi }(y)\right| \\&\quad =\left| \left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x)\right. \\&\qquad \left. -\left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(y)\right| \\&\quad =\left| \left\langle k(x,\cdot )-k(y,\cdot ),\left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi } \right\rangle _{\mathcal {H}_k}\right| \\&\quad \le \left\Vert k(x,\cdot )-k(y,\cdot ) \right\Vert _{\mathcal {H}_k} \left\Vert \left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}\\&\quad \le \nu ^{-\frac{1}{2}}I_{\nu ,Stein}(\rho _{u,t}|\pi )^{\frac{1}{2}} \left\Vert k(x,\cdot )-k(y,\cdot ) \right\Vert _{\mathcal {H}_k^d}, \end{aligned}$$

where the second identity follows from the reproducing property and the last inequality follows from (33). Furthermore, we can write

$$\begin{aligned} \left\Vert k(x,\cdot )-k(y,\cdot ) \right\Vert _{\mathcal {H}_k^d}^2&=k(x,x)-2k(x,y)+k(y,y)\\&\le \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) |x-y|^2, \end{aligned}$$

where the first identity follows from the RKHS property and the second identity follows from Taylor expansion and Assumption K1. Therefore, (37) holds with L(t) defined in (38).

Lemma 4

Under the assumptions in Theorem 7, let $S_r=\mathcal {C}([0,T];Y_r)$ with $Y_r$ defined in (29). Then for any $t\in [0,T]$, there exists $C_1(t)>0$ such that for any $u,v\in S_r$,

$$\begin{aligned} \sup _{x\in \mathbb {R}^d } \left| \mathcal {D}_{\nu ,\rho _{u,t}}\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\mathcal {D}_{\nu ,\rho _{v,t}}\nabla \log \frac{\rho _{v,t}}{\pi }(x) \right| \le C_1(t) d_S(u,v), \end{aligned}$$

(39)

where for all $t\in [0,T]$, $\rho _{u,t}=(u(t,\cdot ))_{\#}\rho _0$, $\rho _{v,t}=(v(t,\cdot ))_{\#}\rho _0$ and

$$\begin{aligned}&C_1(t)= 2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\nonumber \\&\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} \left\Vert \nabla k \right\Vert _\infty I_{\nu ,Stein}(\rho _{u,t}|\pi )^{\frac{1}{2}}+\nu ^{-1}L_r \left\Vert k \right\Vert _\infty , \end{aligned}$$

(40)

with $L_r$ being defined in (44).

Proof of Lemma 4

With the facts that $\mathcal {D}_{\nu ,\mu }=\iota _{k,\mu }(\iota _{k,\mu }^*\iota _{k,\mu }+\nu I)^{-1}\iota _{k,\mu }^*$ and $\iota _{k,\mu }$ is the inclusion operator we get,

$$\begin{aligned}&\left| \mathcal {D}_{\nu ,\rho _{u,t}}\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\mathcal {D}_{\nu ,\rho _{v,t}}\nabla \log \frac{\rho _{v,t}}{\pi }(x) \right| \\&\quad = \left| \left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I\right) ^{-1}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x)\right. \\&\qquad \left. -\left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1}\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi }(x)\right| \\&\quad \le \left| \left( \left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I\right) ^{-1}-\left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1} \right) \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x) \right| \\&\qquad +\left| \left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1} \left( \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi }(x) \right) \right| . \end{aligned}$$

We then turn to study the two terms in the upper bound separately.

First term: Note that, we have

$$\begin{aligned}&\left| \left( \left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I\right) ^{-1}-\left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1} \right) \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x) \right| \nonumber \\&\quad =\left| \iota _{k,\rho _{u,t}}\left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1}\left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}-(1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}\right) \right. \nonumber \\&\qquad \left. \times \left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I\right) ^{-1}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x)\right| \nonumber \\&\quad \le \left\Vert \iota _{k,\rho _{u,t}} \right\Vert _{\mathcal {H}_k^d\rightarrow L_\infty ^d} \left\Vert \left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1} \right\Vert _{\mathcal {H}_k^d\rightarrow \mathcal {H}_k^d} \nonumber \\&\qquad (1-\nu )\left\Vert \iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}-\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}} \right\Vert _{\mathcal {H}_k^d\rightarrow \mathcal {H}_k^d} \nonumber \\&\qquad \times \left\Vert \left( (1-\nu )\iota _{k,\rho _{u,t}}^*\iota _{k,\rho _{u,t}}+\nu I\right) ^{-1}\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi } \right\Vert _{\mathcal {H}_k^d} \nonumber \\&\quad \le \left\Vert k \right\Vert _\infty \nu ^{-1}(1-\nu ) \nu ^{-\frac{1}{2}} I_{\nu ,Stein}(\rho _{u,t}|\pi )^{\frac{1}{2}} \nonumber \\&\qquad \times \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \left\langle \int _{\mathbb {R}^d} k(\cdot ,x)\phi (x)(d\rho _{u,t}(x)-d\rho _{v,t}(x)),\right\rangle .\nonumber \\&\qquad \left\langle .\int _{\mathbb {R}^d} k(\cdot ,y)\phi (y)(d\rho _{u,t}(y)-d\rho _{v,t}(y)) \right\rangle _{\mathcal {H}_k^d}^{\frac{1}{2}} \nonumber \\&\quad \le 2 \nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 3\left\Vert \nabla ^2 k \right\Vert _\infty +2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty \right) ^{\frac{1}{2}} I_{\nu ,Stein}(\rho _{u,t}|\pi )^{\frac{1}{2}} d_{S}(u,v). \end{aligned}$$

(41)

As we are bounding the function value by its $L_\infty ^d$ norm, the second step allows the function to be in the space of $L_\infty ^d$, without which we think of the function as belonging to the RKHS. The second inequality follows from (32) and (33). The last inequality follows from the fact that

$$\begin{aligned}&\left\langle \int _{\mathbb {R}^d} k(\cdot ,x)\phi (x)(d\rho _{u,t}(x)-d\rho _{v,t}(x)), \int _{\mathbb {R}^d} k(\cdot ,y)\phi (y)(d\rho _{u,t}(y)-d\rho _{v,t}(y)) \right\rangle _{\mathcal {H}_k^d}^{\frac{1}{2}}\\&=\bigg ( \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^d}\int _{\mathbb {R}^d} \bigg \langle k\left( u(t,x),\cdot \right) \phi \left( u(t,x)\right) -k\left( v(t,x),\cdot \right) \phi \left( v(t,x)\right) ,\\&\qquad k\left( u(t,y),\cdot \right) \phi \left( u(t,y)\right) -k\left( v(t,y),\cdot \right) \phi \left( v(t,y)\right) \bigg \rangle _{\mathcal {H}_k^d} d\rho _0(x)d\rho _0(y) \bigg )^{\frac{1}{2}}\\&\le \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^d} \left\Vert k\left( u(t,x),\cdot \right) \phi \left( u(t,x)\right) -k\left( v(t,x),\cdot \right) \phi \left( v(t,x)\right) \right\Vert _{\mathcal {H}_k^d} d\rho _0(x) \\&\le \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^d} \left\Vert \left( k\left( u(t,x),\cdot \right) -k\left( v(t,x),\cdot \right) \right) \phi \left( u(t,x)\right) \right\Vert _{\mathcal {H}_k^d}d\rho _0(x) \\&\qquad +\sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^d}\left\Vert k\left( v(t,x),\cdot \right) \left( \phi \left( u(t,x)\right) -\phi \left( v(t,x)\right) \right) \right\Vert _{\mathcal {H}_k^d} d\rho _0(x)\\&=\sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^d} \left\Vert \left( k\left( u(t,x),\cdot \right) -k\left( v(t,x),\cdot \right) \right) \right\Vert _{\mathcal {H}_k}\left| \langle \phi (\cdot ),k\left( u\left( t,x\right) ,\cdot \right) \rangle _{\mathcal {H}_k}\right| d\rho _0(x)\\&\qquad + \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^d}\left\Vert k\left( v(t,x),\cdot \right) \right\Vert _{\mathcal {H}_k}\left| \langle \phi (\cdot ),k\left( u(t,x),\cdot \right) -k\left( v(t,x),\cdot \right) \rangle _{\mathcal {H}_k}\right| d\rho _0(x)\\&\le \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^d} \left\Vert \left( k\left( u(t,x),\cdot \right) -k\left( v(t,x),\cdot \right) \right) \right\Vert _{\mathcal {H}_k}\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}\left\Vert k\left( u\left( t,x\right) ,\cdot \right) \right\Vert _{\mathcal {H}_k}d\rho _0(x)\\&\qquad + \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^d}\left\Vert k\left( v(t,x),\cdot \right) \right\Vert _{\mathcal {H}_k}\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}\left\Vert k\left( u(t,x),\cdot \right) -k\left( v(t,x),\cdot \right) \right\Vert _{\mathcal {H}_k} d\rho _0(x)\\&= \sup _{x\in \mathbb {R}^d} \left( \sqrt{k\left( u(t,x),u(t,x)\right) +k\left( v(t,x),v(t,x)\right) -2k\left( u(t,x),v(t,x)\right) } \right. \\&\qquad \times \left. \left( \sqrt{k\left( u(t,x),u(t,x)\right) }+\sqrt{k\left( v(t,x),v(t,x)\right) }\right) \right) \\&\le 2\left\Vert k \right\Vert _\infty \left( 3\left\Vert \nabla ^2 k \right\Vert _\infty +2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty \right) ^{\frac{1}{2}} d_{S}(u,v), \end{aligned}$$

where the first identity follows from the definitions of $\rho _{u,t}$ and $\rho _{v,t}$ and change of variable. The second inequality holds due to the symmetry in x and y. The second identity follows from the reproducing property of the RKHS. The last identity follows from the fact that $\left\Vert k(x,\cdot ) \right\Vert _{\mathcal {H}_k}=\sqrt{k(x,x)}$ for all x and the last inequality follows from Assumption K1 and Taylor expansion on both variables in k up to second order.

Second term: Note that we have

$$\begin{aligned}&\left| \left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1} \left( \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi }(x) \right) \right| \\&=\left| \iota _{k,\rho _{v,t}}\left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1} \left( \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi }(x) \right) \right| \\&\le \left\Vert \iota _{k,\rho _{v,t}} \right\Vert _{\mathcal {H}_k^d\rightarrow L_\infty ^d} \left\Vert \left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1} \right\Vert _{\mathcal {H}_k^d\rightarrow \mathcal {H}_k^d} \\&\qquad \left\Vert \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}\\&\le \left\Vert k \right\Vert _\infty \nu ^{-1} \left\Vert \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}, \end{aligned}$$

where the last inequality follows from (32) and for all $x\in \mathbb {R}^d$,

$$\begin{aligned}&\iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi }(x) \\&=\int _{\mathbb {R}^d} k(x,y)\nabla \log \frac{\rho _{u,t}}{\pi }(y)d\rho _{u,t}(y)-\int _{\mathbb {R}^d} k(x,y)\nabla \log \frac{\rho _{v,t}}{\pi }d\rho _{v,t}(y) \\&= \int _{\mathbb {R}^d} \left( k(x,y)\nabla V(y)-\nabla _2 k(x,y) \right) d\rho _{u,t}(y)-\int _{\mathbb {R}^d} \left( k(x,y)\nabla V(y)-\nabla _2 k(x,y) \right) d\rho _{v,t}(y) \\&=\int _{\mathbb {R}^d} \left( k(x,u(t,y))\nabla V(u(t,y))-k(x,v(t,y))\nabla V(v(t,y))\right) d\rho _0(y)\\&\qquad - \int _{\mathbb {R}^d}\left( \nabla _2 k(x,u(t,y))-\nabla _2 k(x,v(t,y)) \right) d\rho _0(y). \end{aligned}$$

Therefore, we have

$$\begin{aligned}&\left\Vert \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi } \right\Vert _{\mathcal {H}_k^d} \\&\le \int _{\mathbb {R}^d} \Bigg (\left\Vert k(\cdot ,u(t,y))\nabla V(u(t,y))-k(\cdot ,v(t,y))\nabla V(v(t,y)) \right\Vert _{\mathcal {H}_k^d} \\&\qquad +\left\Vert \nabla _2 k(\cdot ,u(t,y))-\nabla _2 k(\cdot ,v(t,y)) \right\Vert _{\mathcal {H}_k^d}\Bigg ) d\rho _0(y). \end{aligned}$$

For simplicity, in the following calculations, we denote u(t, y) and v(t, y) as u and v respectively. We will bound $\left\Vert k(\cdot ,u)\nabla V(u)-k(\cdot ,v)\nabla V(v) \right\Vert _{\mathcal {H}_k^d}$ and $\left\Vert \nabla _2 k(\cdot ,u)-\nabla _2 k(\cdot ,v) \right\Vert _{\mathcal {H}_k^d}$ respectively. Note that we have

$$\begin{aligned}&\left\Vert k(\cdot ,u)\nabla V(u)-k(\cdot ,v)\nabla V(v) \right\Vert _{\mathcal {H}_k^d}^2\\&\quad = \left\langle k(\cdot ,u)\nabla V(u)-k(\cdot ,v)\nabla V(v), k(\cdot ,u)\nabla V(u)-k(\cdot ,v)\nabla V(v) \right\rangle _{\mathcal {H}_k^d}\\&\quad =|\nabla V(u)|^2 k(u,u)-2\left\langle \nabla V(u) , \nabla V(v) \right\rangle k(u,v)+|\nabla V(v)|^2 k(v,v)\\&\quad \le \left| \left\langle \nabla V(u)-\nabla V(v) , \nabla V(u)k(u,u)-\nabla V(v)k(v,v) \right\rangle \right| \\&\qquad +\left| \left\langle \nabla V(u) , \nabla V(v)\right\rangle \left( k(u,u)+k(v,v)-2k(u,v) \right) \right| , \end{aligned}$$

where

$$\begin{aligned}&\left| \left\langle \nabla V(u)-\nabla V(v) , \nabla V(u)k(u,u)-\nabla V(v)k(v,v) \right\rangle \right| \\&\le \left| \nabla V(u)-\nabla V(v)\right| ^2 k(u,u)+\left| \nabla V(u)-\nabla V(v)\right| \left| \nabla V(v)\right| \left| k(u,u)-k(v,v) \right| \\&\le C_{1,r}^2 (1+V(y))^2 d_{S_T}(u,v)^2 k(u,u) +C_{1,r}^2(1+V(y))^2 d_{S}(u,v)\left| k(u,u)-k(v,v)\right| \\&\le C_{1,r}^2 (1+V(y))^2 d_{S}(u,v)^2 \left\Vert k \right\Vert _\infty ^2+2C_{1,r}^2 \left\Vert \nabla k \right\Vert _\infty (1+V(y))^2 d_{S}(u,v)^2. \end{aligned}$$

The second inequality follows from Assumption V1 and the last inequality follows from Assumption K1 and Taylor expansion on both variables in k up to first order. And, we also have

$$\begin{aligned}&\left| \left\langle \nabla V(u) , \nabla V(v)\right\rangle \left( k(u,u)+k(v,v)-2k(u,v) \right) \right| \\&\le C_{1,r}^2(1+V(y))^2 \left| k(u,u)+k(v,v)-2k(u,v)\right| \\&\le C_{1,r}^2(1+V(y))^2\left( 3\left\Vert \nabla ^2 k \right\Vert _\infty +2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty \right) d_{S}(u,v)^2, \end{aligned}$$

where the first inequality follows from Assumption V1 and the last inequality follows from Assumption K1 and Taylor expansion on both variables in k up to second order. With the above two inequalities, we have

$$\begin{aligned}&\left\Vert k(\cdot ,u(t,y))\nabla V(u(t,y))-k(\cdot ,v(t,y))\nabla V(v(t,y)) \right\Vert _{\mathcal {H}_k^d} \nonumber \\&\quad \le C_{1,r}\left( \left\Vert k \right\Vert _\infty +2{\left\Vert \nabla k \right\Vert _\infty }^{\frac{1}{2}}+3{\left\Vert \nabla ^2 k \right\Vert _\infty }^{\frac{1}{2}}+2{\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty }^{\frac{1}{2}} \right) d_{S}(u,v) (1+V(y)) . \end{aligned}$$

(42)

Observe that, for all $x,y\in \mathbb {R}^d$

$$\begin{aligned} \langle \nabla _2 k(\cdot ,x), \nabla _2 k(\cdot , y) \rangle _{\mathcal {H}_k^d}&=\nabla _1\cdot \nabla _2 \langle k(\cdot ,x),k(\cdot ,y) \rangle _{\mathcal {H}_k^d}\\&=\nabla _1\cdot \nabla _2 k(x,y). \end{aligned}$$

If we denote the function $\nabla _1\cdot \nabla _2 k=D_{1,2}k$ where $D_{1,2}k$ is symmetric since k is symmetric, we get

$$\begin{aligned} \left\Vert \nabla _2 k(\cdot ,u)-\nabla _2 k(\cdot ,v) \right\Vert _{\mathcal {H}_k^d}^2&=D_{1,2}k(u,u)+ D_{1,2}k(v,v)-2D_{1,2}k(u,v) \\&\le \left( 2\left\Vert \nabla ^2(D_{1,2}k) \right\Vert _\infty +\left\Vert \nabla _1\nabla _2(D_{1,2}k) \right\Vert _\infty \right) d_{S}(u,v)^2. \end{aligned}$$

where the inequality follows from Taylor expansion on both variables of $D_{1,2}k$. Therefore,

$$\begin{aligned} \left\Vert \nabla _2 k(\cdot ,u)-\nabla _2 k(\cdot ,v) \right\Vert _{\mathcal {H}_k^d}\le \left( 2\left\Vert \nabla ^2(D_{1,2}k) \right\Vert _\infty ^{\frac{1}{2}}+\left\Vert \nabla _1\nabla _2(D_{1,2}k) \right\Vert _\infty ^{\frac{1}{2}} \right) d_{S}(u,v). \end{aligned}$$

(43)

According to (42) and (43), we get

$$\begin{aligned} \left\Vert \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}\le L_r d_s(u,v) \end{aligned}$$

with

$$\begin{aligned} \begin{aligned} L_r&= C_{1,r}\left\Vert \rho _0 \right\Vert _{\mathcal {P}_V}\left( \left\Vert k \right\Vert _\infty +2{\left\Vert \nabla k \right\Vert _\infty }^{\frac{1}{2}}+3{\left\Vert \nabla ^2 k \right\Vert _\infty }^{\frac{1}{2}}+2{\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty }^{\frac{1}{2}}\right) \\&\quad \qquad + 2{\left\Vert \nabla ^2(D_{1,2}k) \right\Vert _\infty }^{\frac{1}{2}}+{\left\Vert \nabla _1\nabla _2(D_{1,2}k) \right\Vert _\infty }^{\frac{1}{2}}. \end{aligned} \end{aligned}$$

(44)

Therefore, the second term is bounded as

$$\begin{aligned}&\left| \left( (1-\nu )\iota _{k,\rho _{v,t}}^*\iota _{k,\rho _{v,t}}+\nu I\right) ^{-1} \left( \iota _{k,\rho _{u,t}}^*\nabla \log \frac{\rho _{u,t}}{\pi }(x)-\iota _{k,\rho _{v,t}}^*\nabla \log \frac{\rho _{v,t}}{\pi }(x) \right) \right| \nonumber \\&\quad \le \nu ^{-1} L_r \left\Vert k \right\Vert _\infty d_S(u,v). \end{aligned}$$

(45)

The Lemma is proved based on (41) and (45).

6 Stability

In this section, we prove a stability estimate for the weak solutions to (7). To do this, we introduce a space of probability measures on $\mathbb {R}^d$ and assumptions on V as follows,

$$\begin{aligned} \mathcal {P}_p:=\left\{ \rho \in \mathcal {P}:\left\Vert \rho \right\Vert _{\mathcal {P}_p}:=\int _{\mathbb {R}^d} |x|^p \rho (x)dx<\infty \right\} , \end{aligned}$$

where $\mathcal {P}$ denotes the set of all probability measures on $\mathbb {R}^d$.

Assumption V2

In addition to Assumption V1, there exists a constant $C_V>0$ and $q>1$ such that $|\nabla V(x)|^q \le C_V (1+V(x))$ for all $x\in \mathbb {R}^d$ and $\sup _{\theta \in [0,1]} \left\Vert \nabla ^2 V (\theta x+(1-\theta )y)\right\Vert _2^q \le C_V (1+V(x)+V(y)) $.

Theorem 8

Let V satisfy Assumption V2 with $q\in (1,\infty )$ and k satisfy Assumption K1. Let p be the conjugate of q, i.e., $p^{-1}+q^{-1}=1$. Let $\rho _1,\rho _2\in \mathcal {P}_p$ be two initial probability measures satisfying $\left\Vert \rho _i \right\Vert _{\mathcal {P}_p}\le R$ for some constant $R>0$ and $i=1,2$. Let $\rho _{1,t}$ and $\rho _{2,t}$ be the associated weak solution to (7). Then given any $T>0$, there exists a constant $C>0$ depending on $k,V,q,\nu ,\rho _1,\rho _2$ such that,

$$\begin{aligned} \sup _{t\in [0,T]} \mathcal {W}_p(\rho _{1,t},\rho _{2,t})\le C\mathcal {W}_p(\rho _1,\rho _2). \end{aligned}$$

More explicitly, the constant C is given by

$$\begin{aligned} C=&\exp \bigg ( \nu ^{-1}\left\Vert k \right\Vert _\infty C(T,k,V,\nu ,\rho _1,\rho _2,q)T++ \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\nonumber \\&\sqrt{ \textsc {KL}(\rho _2|\pi ) } T^{\frac{1}{2}} \nonumber \\&\qquad \qquad +2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k+3\left\Vert \nabla ^2 k \right\Vert _\infty \right\Vert _\infty \right) ^{\frac{1}{2}}\sqrt{\textsc {KL}(\rho _1|\pi )} T^{\frac{1}{2}} \bigg ), \end{aligned}$$

(46)

where $C(T,k,V,\nu ,\rho _1,\rho _2,q)$ is defined in (56).

Before we prove Theorem 8, we introduce the following corollary to Theorem 8.

Corollary 1

Suppose the assumptions in Theorem 8 hold. Let $\rho _1 \in \mathcal {P}_p$ and $\rho _1^N$ be the empirical measure formed from N samples from $\rho _1$. Then for any $T>0$,

$$\begin{aligned} \sup _{t\in [0,T]} \mathcal {W}_p(\rho _{1,t},\rho _{1,t}^N)\rightarrow 0, \qquad \qquad \text {as }N\rightarrow \infty , \end{aligned}$$

where $\rho _{1,t}$ and $\rho _{1,t}^N$ are the unique weak solutions to (7) with initial conditions $\rho _1$ and $\rho _1^N$ respectively.

Since $\mathcal {W}_p(\rho _1^N,\rho _1)\rightarrow 0$ as $N\rightarrow \infty $, Corollary 1 follows directly from Theorem 8. Non-asymptotic bounds on $\sup _{t\in [0,T]}\mathcal {W}_p(\rho _{1,t},\rho _{1,t}^N)$ that are non-uniform in time T (i.e., for $T<\infty $) in terms of N can also be obtained based on Theorem 8 and the convergence of empirical measures in $\mathcal {W}_p$ [20, 49]. We leave that to the interested reader.

Remark 12

If we focus on the dependency on $\nu $ and T in (46), we have

$$\begin{aligned} C\le C'\exp \left( \nu ^{-1}T\exp \left( C' \nu ^{-\frac{1}{2}}T^{\frac{1}{2}} \right) +\nu ^{-\frac{3}{2}}(1-\nu )T^{\frac{1}{2}}+\nu ^{-\frac{1}{2}}T^{\frac{1}{2}} \right) \end{aligned}$$

where $C'$ is a constant independent of $\nu $ and T. In particular, the stability constant blows up when $\nu \rightarrow 0$, i.e. the smaller is $\nu $, the larger is the $\mathcal {W}_p$ distance between $\rho _{1,t}$ and $\rho _{1,t}^N$ in Corollary 1. Therefore, if we choose smaller $\nu $ in the finite-particle algorithm, the R-SVGD, we will need more particles (larger N) in order to get the comparable iteration complexities.

The proof of Theorem 8 is inspired by that of [30, Theorem 2.7] which in turn is motivated by the Dobrushin’s coupling argument (see, for example, [18] and [32, Theorem 1.4.1]). In the following proof, we mainly highlight the parts of our proof that are different from the proof of [30, Theorem 2.7].

Proof (Proof of Theorem 8)

[Proof of Theorem 8] First, under Assumption V2, there exists a constant $C_0>0$ such that $V(x)\le C_0(1+|x|^p)$ for all $x\in \mathbb {R}^d$. Therefore, $\mathcal {P}_p\subset \mathcal {P}_V$ and $\left\Vert \rho _i \right\Vert _{\mathcal {P}_V}\le C(R)<\infty $ for $i=1,2$. By Theorem 7, the weak solutions take the form

$$\begin{aligned} \rho _{i,t}=\left( \Phi (t,\cdot ,\rho _i) \right) _{{\#}\rho _i},\quad i=1,2 \end{aligned}$$

where $\Phi (\cdot ,\cdot ,\rho _i)$ solves (28) with $\rho _0=\rho _i$. Let $\mathbf {\rho }^{1,2}$ be a coupling measure between $\rho _1$ and $\rho _2$. Notice that, according to conditions in Theorem 8, $p>1$. Define $\phi (x)=\frac{1}{p}|x|^p$ and observe that $p>1$. We have that $\phi $ is continuously differentiable on $\mathbb {R}^d$ with

$$\begin{aligned} ~|\nabla \phi (x)|= |x|^{p-1},\quad \text {for all}\quad x \in \mathbb {R}^d. \end{aligned}$$

To see this, note that as $p>1$, for any $x\ne 0$, $\nabla \phi (x)=|x|^{p-2}x$ and $|\nabla \phi (x)|=|x|^{p-1}$. At $x=0$, by definition, we have $\nabla \phi (0)=0$. Since $p>1$, $\nabla \phi (x)$ is continuous. Then, we start from estimating the derivative of $\phi $ in the time variable, for which we have

$$\begin{aligned}&\partial _t {\phi }(\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2))\\&\quad =-\nabla {\phi }(\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2))\\&\qquad \left( \mathcal {D}_{\nu ,\rho _{1,t}}\nabla \log \frac{\rho _{1,t}}{\pi }(\Phi (t,x_1,\rho _1))-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(\Phi (t,x_2,\rho _2)) \right) . \end{aligned}$$

The next step is to estimate

$$\begin{aligned} \left| \mathcal {D}_{\nu ,\rho _{1,t}}\nabla \log \frac{\rho _{1,t}}{\pi }(\Phi (t,x_1,\rho _1))-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(\Phi (t,x_2,\rho _2))\right| . \end{aligned}$$

Note that

$$\begin{aligned} \mathcal {D}_{\nu ,\rho _{1,t}}\nabla \log \frac{\rho _{1,t}}{\pi }(\Phi (t,x_1,\rho _1))-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(\Phi (t,x_2,\rho _2)):=I_1+I_2, \end{aligned}$$

where

$$\begin{aligned} I_1&:=\mathcal {D}_{\nu ,\rho _{1,t}}\nabla \log \frac{\rho _{1,t}}{\pi }(\Phi (t,x_1,\rho _1))-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(\Phi (t,x_1,\rho _1)),\\ I_2&:=\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(\Phi (t,x_1,\rho _1))-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(\Phi (t,x_2,\rho _2)). \end{aligned}$$

According to Proposition 5, we have

$$\begin{aligned}&\left| -\nabla {\phi }(\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)) \cdot I_1\right| \\&\quad \le 2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\\&\qquad I_{\nu ,Stein}(\rho _{1,t}|\pi )^{\frac{1}{2}}|\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)|^{p-1}\\&\qquad \times \left( \int _{\mathbb {R}^d} |\Phi (t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|\mathbf {\rho }^{1,2}(dy_1,dy_2)\right) \\&\qquad +\nu ^{-1}\left\Vert k \right\Vert _\infty C(t,k,V,\nu ,\rho _1,\rho _2,q) |\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)|^{p-1} \\&\qquad \times \left( \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi (t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p} \end{aligned}$$

and

$$\begin{aligned}&\left| -\nabla {\phi }(\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)) \cdot I_2\right| \\&\quad \le \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} {I_{\nu ,Stein}(\rho _{2,t}|\pi )}^{\frac{1}{2}}|\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)|^p. \end{aligned}$$

Now, defining

$$\begin{aligned} D_p(\mathbf {\rho }^{1,2})(s):=\left( \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi (s,y_1,\rho _1)-\Phi (s,y_2,\rho _2)|^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p}, \end{aligned}$$

we have, for any $t\in [0,T]$ that

$$\begin{aligned}&{\phi }(\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)) \\&\quad \le \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\\&\qquad \int _0^t{I_{\nu ,Stein}(\rho _{2,s}|\pi )}^{\frac{1}{2}}|\Phi (s,x_1,\rho _1)-\Phi (s,x_2,\rho _2)|^p ds\\&\qquad +{\phi }(x_1-x_2)+\nu ^{-1}\left\Vert k \right\Vert _\infty C(T,k,V,\nu ,\rho _1,\rho _2,q) \\&\qquad \int _0^t |\Phi (s,x_1,\rho _1)-\Phi (s,x_2,\rho _2)|^{p-1}D_p(\mathbf {\rho }^{1,2})(s)ds\\&\qquad +2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\\&\qquad \times \int _0^t I_{\nu ,Stein}(\rho _{1,s}|\pi )^{\frac{1}{2}}|\Phi (s,x_1,\rho _1)-\Phi (s,x_2,\rho _2)|^{p-1}D_p(\mathbf {\rho }^{1,2})(s)ds. \end{aligned}$$

Integrating the above inequality w.r.t. the coupling $\mathbf {\rho }^{1,2}$, and using the fact that

$$\begin{aligned} \int _{\mathbb {R}^d\times \mathbb {R}^d} |\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)|^{p-1}\mathbf {\rho }^{1,2}(dx_1,dx_2)\le D_p(\mathbf {\rho }^{1,2})({t})^{p-1}, \end{aligned}$$

we get

$$\begin{aligned}&D_p(\mathbf {\rho }^{1,2})(t)^p \le D_p(\mathbf {\rho }^{1,2})(0)^p+\nu ^{-1}\left\Vert k \right\Vert _\infty C(T,k,V,\nu ,\rho _1,\rho _2,q) \int _0^t D_p(\mathbf {\rho }^{1,2})(s)^p ds\\&+ 2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} \int _0^t{I_{\nu ,Stein}(\rho _{1,s}|\pi )}^{\frac{1}{2}} D_p(\mathbf {\rho }^{1,2})(s)^p ds\\&\quad + \left( \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\right) \int _0^t{I_{\nu ,Stein}(\rho _{2,s}|\pi )}^{\frac{1}{2}} D_p(\mathbf {\rho }^{1,2})(s)^p ds. \end{aligned}$$

By using Gronwall’s inequality, we further obtain

$$\begin{aligned}&D_p(\mathbf {\rho }^{1,2})(t)^p\\&\le D_p(\mathbf {\rho }^{1,2})(0)^p \exp \left( \nu ^{-1}\left\Vert k \right\Vert _\infty C(T,k,V,\nu ,\rho _1,\rho _2,q) t\right) \\&\quad \times \exp \left( 2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} \int _0^t{I_{\nu ,Stein}(\rho _{1,s}|\pi )}^{\frac{1}{2}}ds \right) \\&\quad \times \exp \left( \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} \int _0^t{I_{\nu ,Stein}(\rho _{2,s}|\pi )}^{\frac{1}{2}}ds \right) \\&\le D_p(\mathbf {\rho }^{1,2})(0)^p \exp \Bigg ( \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\sqrt{ \textsc {KL}(\rho _2|\pi ) } t^{\frac{1}{2}}\\&\quad +\nu ^{-1}\left\Vert k \right\Vert C(T,k,V,\nu ,\rho _1,\rho _2,q)t\\&\quad +2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} \sqrt{\textsc {KL}(\rho _1|\pi )}t^{\frac{1}{2}} \Bigg ). \end{aligned}$$

Hence, we obtain

$$\begin{aligned}&\mathcal {W}_p^p(\rho _{1,t},\rho _{2,t})\\&\quad =\inf _{\pi \in \Gamma (\rho _{1,t},\rho _{2,t})} \int _{\mathbb {R}^d\times \mathbb {R}^d} |x_1-x_2|^p \pi (dx_1,dx_2) \le \inf _{\mathbf {\rho }^{1,2}\in \Gamma (\rho _{1},\rho _{2})} D_p(\mathbf {\rho }^{1,2})(t)^p\\&\quad \le \exp \Bigg ( \nu ^{-1}\left\Vert k \right\Vert _\infty C(T,k,V,\nu ,\rho _1,\rho _2,q)t\\&\qquad + \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\sqrt{ \textsc {KL}(\rho _2|\pi ) } t^{\frac{1}{2}}\\&\qquad +2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\sqrt{\textsc {KL}(\rho _1|\pi )}t^{\frac{1}{2}} \Bigg )\\&\qquad \inf _{\mathbf {\rho }^{1,2}\in \Gamma (\rho _{1},\rho _{2})} D_p(\mathbf {\rho }^{1,2})(0)^p\\&\quad =\exp \Bigg ( \nu ^{-1}\left\Vert k \right\Vert _\infty C(T,k,V,\nu ,\rho _1,\rho _2,q)t\\&\qquad + \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\sqrt{ \textsc {KL}(\rho _2|\pi ) } t^{\frac{1}{2}}\\&\qquad +2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\sqrt{\textsc {KL}(\rho _1|\pi )}t^{\frac{1}{2}} \Bigg )\mathcal {W}_p^p(\rho _1,\rho _2), \end{aligned}$$

yielding the result. $\square $

Note that in Lemma 4, we studied perturbation bounds for (28) under two different push-forward maps. In the next result, with the existence and uniqueness of solutions proved in Theorem 7, we study perturbation results for (7) with two different initial conditions.

Proposition 5

Under the assumptions of Theorem 8, Let $\mathbf {\rho }^{1,2}$ be a probability measure on $\mathbb {R}^{2d}$ with marginals $\rho _1$ and $\rho _2$. Then for any $x,y\in \mathbb {R}^d$, we have

$$\begin{aligned}&\left| \mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }({x})-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }({y})\right| \nonumber \\&\quad \le \nu ^{-\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} {I_{\nu ,Stein}(\rho _{2,t}|\pi )}^{\frac{1}{2}}\left| {x-y}\right| , \end{aligned}$$

(47)

$$\begin{aligned}&\text {and}\nonumber \\&\left| \mathcal {D}_{\nu ,\rho _{1,t}}\nabla \log \frac{\rho _{1,t}}{\pi }({x})-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }({x})\right| \nonumber \\&\quad \le 2\nu ^{-\frac{3}{2}}(1-\nu )\left\Vert k \right\Vert _\infty ^2\left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} I_{\nu ,Stein}(\rho _{1,t}|\pi )^{\frac{1}{2}} \nonumber \\&\qquad \times \int _{\mathbb {R}^d} |\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)|\mathbf {\rho }^{1,2}(dx_1,dx_2) \nonumber \\&\qquad +\nu ^{-1}\left\Vert k\right\Vert _\infty C(t,k,V,\nu ,\rho _1,\rho _2,q)\nonumber \\&\qquad \left( \int _{\mathbb {R}^d\times \mathbb {R}^d}\left| \Phi (t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)\right| ^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p}, \end{aligned}$$

(48)

where $C(t,k,V,\nu ,\rho _1,\rho _2,q)$ is given in (56).

Proof of Proposition 5

First, we prove (47). For any $x,y\in \mathbb {R}^d$,

$$\begin{aligned}&\left| \mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(x)-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(y)\right| \\&=\left| \left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi }(x)\right. \\&\qquad -\left. \left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi }(y)\right| \\&=\left| \left\langle k(x,\cdot )-k(y,\cdot ),\left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi }(\cdot ) \right\rangle _{\mathcal {H}_k}\right| \\&\le \left\Vert k(x,\cdot )-k(y,\cdot ) \right\Vert _{\mathcal {H}_k} \left\Vert \left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I_d\right) ^{-1}\iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}\\&\le \nu ^{-\frac{1}{2}}I_{\nu ,Stein}(\rho _{2,t}|\pi )^{\frac{1}{2}} \left\Vert k(x,\cdot )-k(y,\cdot ) \right\Vert _{\mathcal {H}_k^d}\\&\le \nu ^{-\frac{1}{2}}I_{\nu ,Stein}(\rho _{2,t}|\pi )^{\frac{1}{2}} \left( 2\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}} |x-y|, \end{aligned}$$

where the second inequality follows from (33) and the last inequality follows from the reproducing property and Taylor expansion. The claim in (47) then follows from the above inequality.

To prove (48), first we have for any $x\in \mathbb {R}^d$,

$$\begin{aligned}&\left| \mathcal {D}_{\nu ,\rho _{1,t}}\nabla \log \frac{\rho _{1,t}}{\pi }(x)-\mathcal {D}_{\nu ,\rho _{2,t}}\nabla \log \frac{\rho _{2,t}}{\pi }(x)\right| \nonumber \\&\le \left| \left( \left( (1-\nu )\iota _{k,\rho _{1,t}}^*\iota _{k,\rho _{1,t}}+\nu I_d\right) ^{-1}-\left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I_d\right) ^{-1} \right) \iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }(x) \right| \nonumber \\&\qquad +\left| \left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I_d\right) ^{-1} \left( \iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }(x)-\iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi }(x) \right) \right| . \end{aligned}$$

(49)

Next we bound the two terms on the right-hand side of (49).

First term: we have

$$\begin{aligned}&\left| \left( \left( (1- \nu )\iota _{k,\rho _{1,t}}^*\iota _{k,\rho _{1,t}}+\nu I_d\right) ^{-1}- \left( (1- \nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I_d\right) ^{-1} \right) \iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }(x) \right| \\&=\left| \iota _{k,\rho _{1,t}}\left( (1- \nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I\right) ^{-1}\left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}-(1- \nu ) \iota _{k,\rho _{1,t}}^*\iota _{k,\rho _{1,t}}\right) \right. \\&\quad \left. \times \left( (1- \nu )\iota _{k,\rho _{1,t}}^*\iota _{k,\rho _{1,t}}+\nu I\right) ^{-1}\iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }(x)\right| \\&\le \left\Vert \iota _{k,\rho _{1,t}} \right\Vert _{\mathcal {H}_k^d\rightarrow L_\infty ^d} \left\Vert \left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I\right) ^{-1} \right\Vert _{\mathcal {H}_k^d\rightarrow \mathcal {H}_k^d} \\&\quad (1- \nu )\left\Vert \iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}- \iota _{k,\rho _{1,t}}^*\iota _{k,\rho _{1,t}} \right\Vert _{\mathcal {H}_k^d\rightarrow \mathcal {H}_k^d} \\&\quad \times \left\Vert \left( (1- \nu )\iota _{k,\rho _{1,t}}^*\iota _{k,\rho _{1,t}}+\nu I\right) ^{-1}\iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi } \right\Vert _{\mathcal {H}_k^d} \\&\le \left\Vert k \right\Vert _\infty \nu ^{- \frac{3}{2}}(1- \nu ) I_{\nu ,Stein}(\rho _{1,t}|\pi )^{\frac{1}{2}} \\&\quad \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \left\langle \int _{\mathbb {R}^d} k(\cdot ,x)\phi (x)(d\rho _{1,t}(x)-d\rho _{2,t}(x)), \int _{\mathbb {R}^d} k(\cdot ,y)\phi (y)(d\rho _{1,t}(y)-d\rho _{2,t}(y)) \right\rangle _{\mathcal {H}_k^d}^{\frac{1}{2}}, \end{aligned}$$

and

$$\begin{aligned}&\sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \left\langle \int _{\mathbb {R}^d} k(\cdot ,x)\phi (x)(d\rho _{1,t}(x)-d\rho _{2,t}(x)), \right\rangle .\\&\left\langle .\int _{\mathbb {R}^d} k(\cdot ,y)\phi (y)(d\rho _{1,t}(y)-d\rho _{2,t}(y)) \right\rangle _{\mathcal {H}_k^d}^{\frac{1}{2}}\\&=\bigg ( \sup _{\left\Vert \phi \right\Vert _{\mathcal {H}_k^d}=1} \int _{\mathbb {R}^{d}\times \mathbb {R}^d}\int _{\mathbb {R}^{d}\times \mathbb {R}^d} \bigg \langle k\left( \Phi (t,x_1,\rho _1),\cdot \right) \phi \left( \Phi (t,x_1,\rho _1)\right) \\&\quad -k\left( \Phi (t,x_2,\rho _2),\cdot \right) \phi \left( \Phi (t,x_2,\rho _2)\right) , \\&k\left( \Phi (t,y_1,\rho _1),\cdot \right) \phi \left( \Phi (t,y_1,\rho _1)\right) -k\left( \Phi (t,y_2,\rho _2),\cdot \right) \phi \left( \Phi (t,y_2,\rho _2)\right) \bigg \rangle _{\mathcal {H}_k^d}\\&\quad \varvec{\rho }^{1,2}(dx_1,dx_2)\varvec{\rho }^{1,2}(dy_1,y_2) \bigg )^{\frac{1}{2}}\\&\le \int _{\mathbb {R}^d}\int _{\mathbb {R}^d} \left\Vert k\left( \Phi (t,x_1,\rho _1),\cdot \right) \phi \left( \Phi (t,x_1,\rho _1)\right) {-}k\left( \Phi (t,x_2,\rho _2),\cdot \right) \phi \left( \Phi (t,x_2,\rho _2)\right) \right\Vert _{\mathcal {H}_k^d} {\rho }^{1,2}(dx_1,dx_2) \\&\le 2\left\Vert k \right\Vert _\infty \left( 2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty +3\left\Vert \nabla ^2 k \right\Vert _\infty \right) ^{\frac{1}{2}}\\&\quad \int _{\mathbb {R}^d\times \mathbb {R}^d} |\Phi (t,x_1,\rho _1)-\Phi (t,x_2,\rho _2)|\varvec{\rho }^{1,2}(dx_1,dx_2). \end{aligned}$$

Second term: we have

$$\begin{aligned}&\left| \left( (1- \nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I\right) ^{-1} \left( \iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }(x)- \iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi }(x) \right) \right| \\&=\left| \iota _{k,\rho _{2,t}}\left( (1- \nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I\right) ^{-1} \left( \iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }(x)- \iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi }(x) \right) \right| \\&\le \left\Vert \iota _{k,\rho _{2,t}} \right\Vert _{\mathcal {H}_k^d\rightarrow L_\infty ^d} \left\Vert \left( (1-\nu )\iota _{k,\rho _{2,t}}^*\iota _{k,\rho _{2,t}}+\nu I\right) ^{-1} \right\Vert _{\mathcal {H}_k^d\rightarrow \mathcal {H}_k^d} \\&\quad \left\Vert \iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }- \iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}\\&\le \left\Vert k \right\Vert _\infty \nu ^{-1} \left\Vert \iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }- \iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}. \end{aligned}$$

For the factor $\left\Vert \iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }- \iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi } \right\Vert _{\mathcal {H}_k^d}$, notice that

$$\begin{aligned}&\iota _{k,\rho _{1,t}}^*\nabla \log \frac{\rho _{1,t}}{\pi }(x)- \iota _{k,\rho _{2,t}}^*\nabla \log \frac{\rho _{2,t}}{\pi }(x) \\&=\int _{\mathbb {R}^d} k(x,y)\nabla \log \frac{\rho _{1,t}}{\pi }(y)d\rho _{1,t}(y)- \int _{\mathbb {R}^d} k(x,y)\nabla \log \frac{\rho _{2,t}}{\pi }d\rho _{2,t}(y) \\&= \int _{\mathbb {R}^d} \left( k(x,y)\nabla V(y)- \nabla _2 k(x,y) \right) d\rho _{1,t}(y)\\&\qquad - \int _{\mathbb {R}^d} \left( k(x,y)\nabla V(y)- \nabla _2 k(x,y) \right) d\rho _{2,t}(y) \\&=\int _{\mathbb {R}^d\times \mathbb {R}^d} \left( k(x,\Phi (t,y_1,\rho _1))\nabla V(\Phi (t,y_1,\rho _1))\right. \\&\qquad \left. -k(x,\Phi (t,y_2,\rho _2))\nabla V(\Phi (t,y_2,\rho _2))\right) d\mathbf {\rho }^{1,2}(dy_1,dy_2) \\&\qquad - \int _{\mathbb {R}^d\times \mathbb {R}^d} \left( \nabla _2 k(x,\Phi (t,y_1,\rho _1))- \nabla _2 k(x,\Phi (t,y_2,\rho _2)) \right) d\mathbf {\rho }^{1,2}(dy_1,dy_2), \end{aligned}$$

and we get

$$\begin{aligned}&\left\Vert \iota _{k,{\rho _{1,t}}}^{*}\nabla \log \frac{{\rho _{1,t}}}{\pi }- \iota _{k,{\rho _{2,t}}}^*\nabla \log \frac{{\rho _{2,t}}}{\pi } \right\Vert _{\mathcal {H}_k^d}\nonumber \\&\le \int _{\mathbb {R}^d\times \mathbb {R}^d} \left\Vert k(\cdot ,\Phi (t,y_1,\rho _1))\nabla V(\Phi (t,y_1,\rho _1))-k(\cdot ,\Phi (t,y_2,\rho _2))\nabla V(\Phi (t,y_2,\rho _2))\right\Vert _{\mathcal {H}_k^d} d\mathbf {\rho }^{1,2}(dy_1,dy_2) \nonumber \\&\quad + \int _{\mathbb {R}^d\times \mathbb {R}^d} \left\Vert \nabla _2 k(\cdot ,\Phi (t,y_1,\rho _1))- \nabla _2 k(\cdot ,\Phi (t,y_2,\rho _2)) \right\Vert _{\mathcal {H}_k^d} d{\rho }^{1,2}(dy_1,dy_2). \end{aligned}$$

(50)

For simplicity, we denote $\Phi (t,y_1,\rho _1),\Phi (t,y_2,\rho _2)$ as $\Phi _1$ and $\Phi _2$ respectively in the following calculations. We will bound the two integrals in (50) separately.

For the first integral in (50), we have

$$\begin{aligned}&\left\Vert k(\cdot ,\Phi _1)\nabla V(\Phi _1)-k(\cdot ,\Phi _2)\nabla V(\Phi _2) \right\Vert _{\mathcal {H}_k^d}^2 \\&= \left\langle k(\cdot ,\Phi _1)\nabla V(\Phi _1)-k(\cdot ,\Phi _2)\nabla V(\Phi _2), k(\cdot ,\Phi _1)\nabla V(\Phi _1)-k(\cdot ,\Phi _2)\nabla V(\Phi _2) \right\rangle _{\mathcal {H}_k^d}\\&=|\nabla V(\Phi _1)|^2 k(\Phi _1,\Phi _1)-2\left\langle \nabla V(\Phi _1) , \nabla V(\Phi _2) \right\rangle k(\Phi _1,\Phi _2)+|\nabla V(\Phi _2)|^2 k(\Phi _2,\Phi _2)\\&{\le |}\left\langle \nabla V(\Phi _1)-\nabla V(\Phi _2) , \nabla V(\Phi _1)k(\Phi _1,\Phi _1)-\nabla V(\Phi _2)k(\Phi _2,\Phi _2) \right\rangle {|}\\&\quad +{|}\left\langle \nabla V(\Phi _1), \nabla V(\Phi _2)\right\rangle \left( k(\Phi _1,\Phi _1)+k(\Phi _2,\Phi _2)-2k(\Phi _1,\Phi _2) \right) {|}, \end{aligned}$$

where

$$\begin{aligned}&\left| \left\langle \nabla V(\Phi _1)-\nabla V(\Phi _2) , \nabla V(\Phi _1)k(\Phi _1,\Phi _1)-\nabla V(\Phi _2)k(\Phi _2,\Phi _2) \right\rangle \right| \nonumber \\&\le |\nabla V(\Phi _1)-\nabla V(\Phi _2)|^2 k(\Phi _1,\Phi _1)+|\nabla V(\Phi _1)\nonumber \\&\quad -\nabla V(\Phi _2)||\nabla V(\Phi _2)||k(\Phi _1,\Phi _1)-k(\Phi _2,\Phi _2) |\nonumber \\&\le \sup _{\theta \in [0,1]}\left\Vert \nabla ^2 ( \theta \Phi _1+(1-\theta )\Phi _2 )\right\Vert _2^2 |\Phi _1-\Phi _2|^2 \left\Vert k \right\Vert _\infty ^2\nonumber \\&\quad +\sup _{\theta \in [0,1]}\left\Vert \nabla ^2 ( \theta \Phi _1+(1-\theta )\Phi _2 )\right\Vert _2 |\Phi _1-\Phi _2| {|\nabla V(\Phi _2)|} |k(\Phi _1,\Phi _1)-k(\Phi _2,\Phi _2) |\nonumber \\&\le C_V^{2/q}(1+V(\Phi _1)+V(\Phi _2))^{2/q} \left\Vert k \right\Vert _\infty ^2 |\Phi _1-\Phi _2|^2\nonumber \\&\quad + C_V^{1/q} (1+V(\Phi _1)+V(\Phi _2))^{1/q} C_V^{{1/q}}(1+V(\Phi _2))^{1/q} \left\Vert \nabla k \right\Vert _\infty |\Phi _1-\Phi _2|^2 \nonumber \\&\le C_V^{2/q} \left( \left\Vert k \right\Vert _\infty ^2+\left\Vert \nabla k \right\Vert _\infty \right) (1+V(\Phi _1)+V(\Phi _2))^{2/q} |\Phi _1-\Phi _2|^2. \end{aligned}$$

(51)

The third inequality follows from Assumption V2 and the last inequality follows from Assumption K1 and Taylor expansion on both variables in k up to first order. Furthermore, we have

$$\begin{aligned}&\left| \left\langle \nabla V(\Phi _1), \nabla V(\Phi _2)\right\rangle \left( k(\Phi _1,\Phi _1)+k(\Phi _2,\Phi _2)-2k(\Phi _1,\Phi _2) \right) \right| \nonumber \\&\le C_V^{2/q} (1+V(\Phi _1))^{1/q}(1+V(\Phi _2))^{1/q} |k(\Phi _1,\Phi _1)+k(\Phi _2,\Phi _2)-2k(\Phi _1,\Phi _2)|\nonumber \\&\le C_V^{2/q} (1+V(\Phi _1))^{1/q}(1+V(\Phi _2))^{1/q}\left( 3\left\Vert \nabla ^2 k \right\Vert _\infty +2\left\Vert \nabla _1\nabla _2 k\right\Vert _\infty \right) |\Phi _1-\Phi _2|^2, \end{aligned}$$

(52)

where the first inequality follows from Assumption V2 and the last inequality follows from Assumption K1 and Taylor expansion on both variables in k up to second order.

From (51) and (52), we get

$$\begin{aligned}&\int _{\mathbb {R}^d\times \mathbb {R}^d} \left\Vert k(\cdot ,\Phi (t,y_1,\rho _1))\nabla V(\Phi (t,y_1,\rho _1))-k(\cdot ,\Phi (t,y_2,\rho _2))\nabla V(\Phi (t,y_2,\rho _2))\right\Vert _{\mathcal {H}_k^d} d\mathbf {\rho }^{1,2}(dy_1,dy_2) \nonumber \\&\le C_V^{1/q}\left( \left\Vert k \right\Vert _\infty +{\left\Vert \nabla k \right\Vert _\infty }^{\frac{1}{2}}+3{\left\Vert \nabla ^2 k \right\Vert _\infty }^{\frac{1}{2}}+2{\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty }^{\frac{1}{2}} \right) \nonumber \\&\qquad \times \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi _1(t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|(1+V(\Phi (t,y_1,\rho _1))\nonumber \\&\qquad +V(\Phi (t,y_2,\rho _2)))^{1/q}\mathbf {\rho }^{1,2}(dy_1,dy_2) \nonumber \\&\le C_V^{1/q}\left( \left\Vert k \right\Vert _\infty +{\left\Vert \nabla k \right\Vert _\infty }^{\frac{1}{2}}+3{\left\Vert \nabla ^2 k \right\Vert _\infty }^{\frac{1}{2}}+2{\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty }^{\frac{1}{2}} \right) \nonumber \\&\qquad \times \left( \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi _1(t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p} \nonumber \\&\qquad \times \left( \int _{\mathbb {R}^d\times \mathbb {R}^d} {\big (}1+V(\Phi (t,y_1,\rho _1))+V(\Phi (t,y_2,\rho _2)) {\big )} \mathbf {\rho }^{1,2}(dy_1,dy_2)\right) ^{1/q} \nonumber \\&\le \left( \left\Vert \rho _1 \right\Vert _{\mathcal {P}_V} \exp (C_{1,0}\nu ^{-1/2}\left\Vert k \right\Vert _\infty \sqrt{t \textsc {KL}(\rho _1|\pi )} )\right. \nonumber \\&\qquad \left. +\left\Vert \rho _2 \right\Vert _{\mathcal {P}_V} \exp (C_{1,0}\nu ^{-1/2}\left\Vert k \right\Vert _\infty \sqrt{t \textsc {KL}(\rho _2|\pi ) )} \right) ^{1/q} \nonumber \\&\qquad \times C_V^{1/q}\left( \left\Vert k \right\Vert _\infty +{\left\Vert \nabla k \right\Vert _\infty }^{\frac{1}{2}}+3{\left\Vert \nabla ^2 k \right\Vert _\infty }^{\frac{1}{2}}+2{\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty }^{\frac{1}{2}} \right) \nonumber \\&\qquad \times \left( \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi _1(t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p} \nonumber \\&\le 3C_V^{1/q}\left( \left\Vert k \right\Vert _\infty +{\left\Vert \nabla k \right\Vert _\infty }^{\frac{1}{2}}+{\left\Vert \nabla ^2 k \right\Vert _\infty }^{\frac{1}{2}}+{\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty }^{\frac{1}{2}} \right) \left( \left\Vert \rho _1 \right\Vert _{\mathcal {P}_V}+\left\Vert \rho _2 \right\Vert _{\mathcal {P}_V} \right) ^{1/q} \nonumber \\&\qquad \times \exp (C_{1,0}\nu ^{-1/2}q^{-1}\left\Vert k \right\Vert _\infty \sqrt{t (\textsc {KL}(\rho _1|\pi )+\textsc {KL}(\rho _2|\pi )})) \nonumber \\&\qquad \times \left( \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi _1(t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p} \nonumber \\&:=C_1(k,V) \left( \left\Vert \rho _1 \right\Vert _{\mathcal {P}_V}+\left\Vert \rho _2 \right\Vert _{\mathcal {P}_V} \right) ^{\frac{1}{q}}\exp \left( D_1(k,\nu ,q) \left( \textsc {KL}(\rho _1|\pi )+\textsc {KL}(\rho _2|\pi )\right) ^{\frac{1}{2}}t^{\frac{1}{2}} \right) \nonumber \\&\qquad \times \left( \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi _1(t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p}. \end{aligned}$$

(53)

where the third inequality follows from (35).

For the second integral in (50), denoting the function $\nabla _1\cdot \nabla _2 k=D_{1,2}k$, we first notice that $D_{1,2}k$ is symmetric since k is symmetric. According to the above identity, we get

$$\begin{aligned} \left\Vert \nabla _2 k(\cdot ,\Phi _1)-\nabla _2 k(\cdot ,\Phi _2) \right\Vert _{\mathcal {H}_k^d}^2&=D_{1,2}k(\Phi _1,\Phi _1)+ D_{1,2}k(\Phi _2,\Phi _2)-2D_{1,2}k(\Phi _1,\Phi _2) . \end{aligned}$$

Applying Taylor’s series expansion on both variables of $D_{1,2}k$, we get

$$\begin{aligned} \left\Vert \nabla _2 k(\cdot ,\Phi _1)-\nabla _2 k(\cdot ,\Phi _2) \right\Vert _{\mathcal {H}_k^d}^2\le \left( 2\left\Vert \nabla ^2(D_{1,2}k) \right\Vert _\infty +\left\Vert \nabla _1\nabla _2(D_{1,2}k) \right\Vert _\infty \right) |\Phi _1-\Phi _2|^2 . \end{aligned}$$

Therefore we obtain

$$\begin{aligned}&\int _{\mathbb {R}^d}\left\Vert \nabla _2 k(\cdot ,\Phi (t,y_1,\rho _1))-\nabla _2 k(\cdot ,\Phi (t,y_2,\rho _2)) \right\Vert _{\mathcal {H}_k^d} d\mathbf {\rho }^{1,2}(dy_1,dy_2) \nonumber \\&\quad \le \left( 2{\left\Vert \nabla ^2(D_{1,2}k) \right\Vert _\infty }^{\frac{1}{2}}+{\left\Vert \nabla _1\nabla _2(D_{1,2}k) \right\Vert _\infty }^{\frac{1}{2}} \right) \nonumber \\&\qquad \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi (t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)| \mathbf {\rho }^{1,2}(dy_1,dy_2). \end{aligned}$$

(54)

Based on (50), (53) and (54), we then get

$$\begin{aligned}&\left\Vert \iota _{k,{\rho _{1,t}}}^*\nabla \log \frac{{\rho _{1,t}}}{\pi }-\iota _{k,{\rho _{2,t}}}^*\nabla \log \frac{{\rho _{2,t}}}{\pi } \right\Vert _{\mathcal {H}_k^d}\nonumber \\&\le C_1(k,V) \left( \left\Vert \rho _1 \right\Vert _{\mathcal {P}_V}+\left\Vert \rho _2 \right\Vert _{\mathcal {P}_V} \right) ^{\frac{1}{q}}\exp \left( D_1(k,\nu ,q) \left( \textsc {KL}(\rho _1|\pi )+\textsc {KL}(\rho _2|\pi )\right) ^{\frac{1}{2}}t^{\frac{1}{2}} \right) \nonumber \\&\quad \left( \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi _1(t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p} \nonumber \\&\quad + \left( 2{\left\Vert \nabla ^2(D_{1,2}k) \right\Vert _\infty }^{\frac{1}{2}}+{\left\Vert \nabla _1\nabla _2(D_{1,2}k) \right\Vert _\infty }^{\frac{1}{2}} \right) \nonumber \\&\quad \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi (t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)| \mathbf {\rho }^{1,2}(dy_1,dy_2)\nonumber \\&\le C(t,k,V,\nu ,\rho _1,\rho _2,q)\left( \int _{\mathbb {R}^d\times \mathbb {R}^d}|\Phi _1(t,y_1,\rho _1)-\Phi (t,y_2,\rho _2)|^p \mathbf {\rho }^{1,2}(dy_1,dy_2) \right) ^{1/p}, \end{aligned}$$

(55)

with

$$\begin{aligned}&C(t,k,V,\nu ,\rho _1,\rho _2,q)\nonumber \\&=C_1(k,V) \left( \left\Vert \rho _1 \right\Vert _{\mathcal {P}_V}+\left\Vert \rho _2 \right\Vert _{\mathcal {P}_V} \right) ^{\frac{1}{q}}\nonumber \\&\quad \exp \left( D_1(k,\nu ,q) \left( \textsc {KL}(\rho _1|\pi )+\textsc {KL}(\rho _2|\pi )\right) ^{\frac{1}{2}}t^{\frac{1}{2}} \right) +C_2(k), \end{aligned}$$

(56)

where

$$\begin{aligned} C_1(k,V)&= 3C_V^{1/q}\left( \left\Vert k \right\Vert _\infty +{\left\Vert \nabla k \right\Vert _\infty }^{\frac{1}{2}}+{\left\Vert \nabla ^2 k \right\Vert _\infty }^{\frac{1}{2}}+{\left\Vert \nabla _1\nabla _2 k \right\Vert _\infty }^{\frac{1}{2}} \right) , \\ D_1(k,\nu ,q)&=C_{1,0}\nu ^{-1/2}q^{-1}\left\Vert k \right\Vert _\infty ,\\ C_2(k)&=2{\left\Vert \nabla ^2(\nabla _1\cdot \nabla _2 k) \right\Vert _\infty }^{\frac{1}{2}}+{\left\Vert \nabla _1\nabla _2(\nabla _1\cdot \nabla _2 k) \right\Vert _\infty }^{\frac{1}{2}}. \end{aligned}$$

Therefore (48) follows from our estimations on the First term and Second term.

7 Space-Time Discretization: A Practical Algorithm

In this section, we introduce a practical space-time discretization to the R-SVGF described in (28). In the algorithm, we let positive integers N and n to denote the number of particles and (discrete) iterations. We denote by $(X_n^i)_{i=1}^N$ the position of the N particles at the n-th step. We let ${\bar{X}}_n:=[X_n^1,\ldots ,X_n^N]^T$. For all functions $f:\mathbb {R}^d\rightarrow \mathbb {R}^d$, we define the operator $L_n$ as

$$\begin{aligned} L_n f:=[f(X_n^1),\cdots ,f(X_n^N)]^T. \end{aligned}$$

The positions of the particles are then updated as

$$\begin{aligned} {\bar{X}}_{n+1}&={\bar{X}}_n-h_{n+1}\left( \frac{(1-\nu _{n+1})}{N} K_n +\nu _{n+1} I_N \right) ^{-1} \nonumber \\&\quad \left( \frac{1}{N} K_n (L_n\nabla V) -\frac{1}{N}\sum _{j=1}^N L_n\nabla k(X_n^j,\cdot ) \right) , \end{aligned}$$

(57)

where $(h_n)_{n=1}^\infty $ is the sequence of step-sizes, $I_{N\times N}$ is the $N \times N$ identity matrix and $K_n \in {\mathbb {R}}^{N\times N}$ is the gram matrix defined as $(K_n)_{ij}=k(X_n^i,X_n^j)$ for all $i,j\in [N]$. We call the above algorithm as the Regularized SVGD algorithm. The iterates in (57) follow from Proposition 2 and the finite-sample representations for the operators $\iota _{k,{\hat{\rho }}^n}\iota _{k,{\hat{\rho }}^n}$ where ${\hat{\rho }}^n$ is the empirical measure of the particles at the n-th step, i.e., ${\hat{\rho }}^n=\sum _{i=1}^N \delta _{X_n^i}$.

While the convergence analysis of space-time discretization of the SVGF (i.e., the SVGD algorithm) and the R-SVGD (i.e., the regularized SVGD algorithm) is an interesting and challenging open question, in this section we demonstrate the improved performance of the regularized SVGD algorithm over the SVGD algorithm in some simulation examples. All experiments were done in MacBook Pro (2021 model). Specifically, we consider the simulation setup in [27]: We let the target $\pi : = (1/3) \pi _1 + (2/3) \pi _2$, where $\pi _1\equiv \text {Normal}(-2, 1)$ and $\pi _2 \equiv \text {Normal}(+2, 1)$, and we let the initial distribution to be $\text {Normal}(-10, 1)$. We now focus on numerically computing the expectations of the form ${\mathbb {E}}_{x\sim \pi } [h_i(x)]$, for three cases, $h_1(x):= x$, $h_2(x):= x^2$ and $h_3(x):= \cos (\omega x +b)$, where $\omega \sim \text {Normal}(0,1)$ and $b\sim \text {Uniform}([0,2\pi ])$.

Table 2 Average additional per-iteration wall-clock run-time (in seconds) for R-SVGD over SVGD

Full size table

In Fig. 1, we plot the mean-squared error in estimating the above expectations with the regularized and unregularized SVGD algorithm. Here, the expectation is over the initialization (and over $\omega $ and b for $h_3$). In the top row, we report the logarithm of the mean-squared error versus the number of particles N for a fixed number of iterations (set to 100). The corresponding average per-iteration wall-clock run times are reported in Table 2. In the bottom row, we report the logarithm of the mean-squared error versus the number of iterations for a fixed number of particles (set to 200). For both algorithms, we use the Gaussian kernel $k(u,v)= \exp \left( -\frac{1}{\gamma } \Vert u-v\Vert _2^2\right) $, where the bandwidth parameter $\gamma $ is set using the median heuristic [27]. We use the Adagrad step-size choice for both cases, following [27]. For the choice of the regularization parameter, we report results for various choices of $\nu $. The case of $\nu =1$ corresponds exactly to the SVGD algorithm. We notice that for small values of $\nu $ the regularized SVGD algorithm performs better than the SVGD algorithm.

We now discuss the per-iteration complexity of regularized SVGD and standard SVGD. SVGD requires computing the kernel matrix (which requires ${\mathcal {O}}(N^2)$ operations per-iterations) and the gradient of the potential (i.e., $\nabla V$). As pointed out in [27] in Bayesian inference problems, typically, computing the gradient of the potential might be the main bottleneck. However, this computation could easily be done in parallel due to its finite-sum structure in Bayesian inference problems. We denote by ${\mathcal {O}}(\text {Score})$, the time complexity of computing/evaluating the gradient of the potential for a given target density. On top of the above computations, each iteration of regularized SVGD requires inverting a $N \times N$ matrix (or equivalently, solving a positive-definite linear system of equations), which costs ${\mathcal {O}}(N^3)$ complexity. With standard implementations, the per-iteration complexity regularized SVGD and standard SVGD are hence of order $\max \{{\mathcal {O}}(N^3), {\mathcal {O}}(\text {Score})\}$ and $\max \{{\mathcal {O}}(N^2), {\mathcal {O}}(\text {Score})\}$, respectively.

The matrix being inverted in regularized SVGD is the regularized kernel matrix also arising in other problems like kernel ridge regression. Hence, the rich literature on efficiently inverting kernel matrices (under various structural assumptions) could be leveraged in this context. While we leave a detailed study of speeding up the regularized SVGD algorithm as future work, we conclude here two concrete methods for provably speeding-up practical implementations of the regularized SVGD algorithms (under structural assumptions):

Pre-conditioned CG methods: State-of-the-art methods on designing pre-conditioners for conjugate gradient methods for kernel ridge regression from [17] could be leveraged in the context of regularized SVGD. In particular, such results may help reduce the computational complexity of regularized SVGD to $\max \{{\mathcal {O}}(N^2), {\mathcal {O}}(\text {Score})\}$, matching that of SVGD.
Randomized methods: Randomized algorithms designed in the context of kernel ridge regression, namely Random Fourier Features [35] and sketching [52] could also be leveraged to speed-up practical implementations of regularized SVGD.

Notes

See, for example, [1, 38] for the exact definition.
Defined in Sect. 1.3.
Defined in Sect. 2.

References

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savare. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2005.
Douglas Azevedo and Valdir Antonio Menegatto. Sharp estimates for eigenvalues of integral operators generated by dot product kernels on the sphere. Journal of Approximation Theory, 177:57–68, 2014.
Article MathSciNet Google Scholar
Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and Geometry of Markov Diffusion Operators, volume 103. Springer, 2014.
Krishnakumar Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: First-order stationarity guarantees for Langevin Monte Carlo. In Conference on Learning Theory, pages 2896–2923. PMLR, 2022.
Krishnakumar Balasubramanian, Tong Li, and Ming Yuan. On the optimality of kernel-embedding based goodness-of-fit tests. Journal of Machine Learning Research, 22(1), 2021.
Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.
José Antonio Carrillo, Katy Craig, and Francesco S Patacchini. A blob method for diffusion. Calculus of Variations and Partial Differential Equations, 58(2):1–53, 2019.
Article MathSciNet Google Scholar
Lin Chen and Sheng Xu. Deep neural tangent kernel and Laplace kernel have the same RKHS. In International Conference on Learning Representations, 2020.
Yongxin Chen, Sinho Chewi, Adil Salim, and Andre Wibisono. Improved analysis for a proximal algorithm for sampling. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178, pages 2984–3014, 2022.
Google Scholar
Alina Chertock. A practical guide to deterministic particle methods. In Handbook of Numerical Analysis, volume 18, pages 177–202. Elsevier, 2017.
Sinho Chewi, Murat A Erdogdu, Mufan Li, Ruoqi Shen, and Shunshi Zhang. Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. In Conference on Learning Theory, pages 1–2. PMLR, 2022.
Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, and Philippe Rigollet. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. Advances in Neural Information Processing Systems, 33:2098–2109, 2020.
Google Scholar
Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pages 2606–2615. PMLR, 2016.
Katy Craig and Andrea Bertozzi. A blob method for the aggregation equation. Mathematics of Computation, 85(300):1681–1717, 2016.
Article MathSciNet Google Scholar
Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint, volume 24. Cambridge University Press, 2007.
Pierre Degond and Francisco-José Mustieles. A deterministic approximation of diffusion equations using particles. SIAM Journal on Scientific and Statistical Computing, 11(2):293–310, 1990.
Article MathSciNet Google Scholar
Mateo Díaz, Ethan N Epperly, Zachary Frangella, Joel A Tropp, and Robert J Webber. Robust, randomized preconditioning for kernel ridge regression. arXiv preprintarXiv:2304.12465, 2023.
Roland Dobrushin. Vlasov equations. Functional Analysis and Its Applications, 13(2):115–123, 1979.
Article Google Scholar
Andrew Duncan, Nikolas Nüsken, and Lukasz Szpruch. On the geometry of stein variational gradient descent. Journal of Machine Learning Research, 24:1–39, 2023.
MathSciNet Google Scholar
Nicolas Fournier and Arnaud Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability theory and related fields, 162(3-4):707–738, 2015.
Article MathSciNet Google Scholar
Jackson Gorham, Andrew B Duncan, Sebastian J Vollmer, and Lester Mackey. Measuring sample quality with diffusions. The Annals of Applied Probability, 29(5):2884–2928, 2019.
Article MathSciNet Google Scholar
Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning, pages 1292–1301. PMLR, 2017.
Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
Article MathSciNet Google Scholar
Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic analysis for Stein Variational Gradient Descent. Advances in Neural Information Processing Systems, 33, 2020.
Qiang Liu. Stein Variational Gradient Descent as gradient flow. Advances in Neural Information Processing Systems, 30, 2017.
Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284. PMLR, 2016.
Qiang Liu and Dilin Wang. Stein Variational Gradient Descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29, 2016.
Tianle Liu, Promit Ghosal, Krishnakumar Balasubramanian, and Natesh Pillai. Towards understanding the dynamics of gaussian-stein variational gradient descent. Advances in Neural Information Processing Systems, 36, 2024.
Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. In 33rd Conference on Uncertainty in Artificial Intelligence, UAI 2017, 2017.
Jianfeng Lu, Yulong Lu, and James Nolen. Scaling limit of the Stein Variational Gradient Descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648–671, 2019.
Article MathSciNet Google Scholar
Ha Quang Minh, Partha Niyogi, and Yuan Yao. Mercer’s theorem, feature maps, and smoothing. In International Conference on Computational Learning Theory, pages 154–168. Springer, 2006.
Adrian Muntean, Jens Rademacher, and Antonios Zagaris. Macroscopic and Large Scale Phenomena: Coarse Graining, Mean Field Limits and Ergodicity. Springer, 2016.
Vern Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces, volume 152. Cambridge University Press, 2016.
Pierre-Arnaud Raviart. An analysis of particle methods. In Numerical Methods in Fluid Dynamics, pages 243–324. Springer, 1985.
Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. Advances in neural information processing systems, 30, 2017.
Giovanni Russo. Deterministic diffusion of particles. Communications on Pure and Applied Mathematics, 43(6):697–733, 1990.
Article MathSciNet Google Scholar
Adil Salim, Lukang Sun, and Peter Richtarik. A convergence theory for SVGD in the population limit under Talagrand’s inequality $T_1$. In International Conference on Machine Learning, pages 19139–19152. PMLR, 2022.
Filippo Santambrogio. $\{$Euclidean, metric, and Wasserstein$\}$ gradient flows: An overview. Bulletin of Mathematical Sciences, 7(1):87–154, 2017.
Article MathSciNet Google Scholar
Meyer Scetbon and Zaid Harchaoui. A spectral analysis of dot-product kernels. In International conference on Artificial Intelligence and Statistics, pages 3394–3402. PMLR, 2021.
Carl-Johann Simon-Gabriel, Alessandro Barp, Bernhard Schölkopf, and Lester Mackey. Metrizing weak convergence with maximum mean discrepancies. Journal of Machine Learning Research, 24(184):1–20, 2023.
MathSciNet Google Scholar
Bharath Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561, 2010.
MathSciNet Google Scholar
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
Lukang Sun and Peter Richtárik. A note on the convergence of mirrored Stein Variational Gradient Descent under $(L_0, L_1)$-smoothness condition. arXiv preprintarXiv:2206.09709, 2022.
Nicolas Garcia Trillos and Daniel Sanz-Alonso. The Bayesian update: Variational formulations and gradient flows. Bayesian Analysis, 15(1):29–56, 2020.
MathSciNet Google Scholar
Santosh Vempala and Andre Wibisono. Rapid convergence of the Unadjusted Langevin Algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019.
Cédric Villani. Topics in Optimal Transportation, volume 58. American Mathematical Soc., 2021.
Dilin Wang, Ziyang Tang, Chandrajit Bajaj, and Qiang Liu. Stein Variational Gradient Descent with matrix-valued kernels. Advances in Neural Information Processing Systems, 32, 2019.
Dilin Wang, Zhe Zeng, and Qiang Liu. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pages 5219–5227. PMLR, 2018.
Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.
Article MathSciNet Google Scholar
Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018.
Lantian Xu, Anna Korba, and Dejan Slepc̆ev. Accurate quantization of measures via interacting particle-based optimization. In International Conference on Machine Learning, pages 24576–24595. PMLR, 2022.
Yun Yang, Mert Pilanci, and Martin J Wainwright. Randomized sketches for kernels: Fast and optimal nonparametric regression. Annals of Statistics, pages 991–1023, 2017.

Download references

Acknowledgements

The authors thank the Editors-in-Chief, Associate Editor and two reviewers for detailed comments, which helped to improve the presentation. YH was supported in part by NSF TRIPODS grant CCF-1934568. KB was supported in part by NSF grant DMS-2053918. BKS was supported in part by NSF CAREER Award DMS-1945396. JL was supported in part by NSF via award DMS-2012286. Parts of this work were done when YH, KB and JL visited the Simons Institute for the Theory of Computing as a part of the “Geometric Methods in Optimization and Sampling" program during Fall 2021.

Author information

Authors and Affiliations

School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, GA, 30332, USA
Ye He
Department of Statistics, University of California, Davis, 399 Crocker Lane, One Shields Avenue, Davis, CA, 95616, USA
Krishnakumar Balasubramanian
Department of Statistics, Pennsylvania State University, 314, Thomas Building, University Park, PA, 16802, USA
Bharath K. Sriperumbudur
Mathematics Department, Duke University, Box 90320, 120 Science Dr., Durham, NC, 27708-0320, USA
Jianfeng Lu

Authors

Ye He
View author publications
You can also search for this author in PubMed Google Scholar
Krishnakumar Balasubramanian
View author publications
You can also search for this author in PubMed Google Scholar
Bharath K. Sriperumbudur
View author publications
You can also search for this author in PubMed Google Scholar
Jianfeng Lu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Krishnakumar Balasubramanian.

Additional information

Communicated by Joan Bruna.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

He, Y., Balasubramanian, K., Sriperumbudur, B.K. et al. Regularized Stein Variational Gradient Flow. Found Comput Math (2024). https://doi.org/10.1007/s10208-024-09663-w

Download citation

Received: 15 November 2022
Revised: 12 March 2024
Accepted: 08 May 2024
Published: 17 July 2024
DOI: https://doi.org/10.1007/s10208-024-09663-w

Keywords

Mathematics Subject Classification

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Regularized Stein Variational Gradient Flow

Abstract

Similar content being viewed by others

Convergent numerical approximation of the stochastic total variation flow

A variational finite volume scheme for Wasserstein gradient flows

Variational approach to coarse-graining of generalized gradient flows

1 Introduction

1.1 Summary of Contributions

1.2 Organization

1.3 Notations

2 Preliminaries on Reproducing Kernel Hilbert Space

Assumption A1

Proposition 1

Remark 1

3 Regularized SVGF

Proposition 2

Proof

Remark 2

4 Convergence Results in Continuous and Discrete Time

4.1 Regularized Stein-Fisher Information and its Properties

Definition 1

Remark 3

Proposition 3

Proof of Proposition 3

4.2 Convergence Results for R-SVGF

4.2.1 Relationship Between R-SVGF and WGF

Theorem 1

Proof of Theorem 1

Remark 4

Remark 5

4.2.2 Convergence to Equilibrium Along the Fisher Information

Theorem 2

Proposition 4

Proof of Proposition 4

Proof of Theorem 2

4.2.3 Convergence in KL-Divergence Under LSI

Assumption 1

Theorem 3

Proof of Theorem 3

Theorem 4

Proof of Theorem 4

Remark 6

4.3 Convergence Results for Time-Discretized R-SVGF

Assumption A2

Theorem 5

Lemma 1

Proof of Lemma 1

Proof of Theorem 5

Remark 7

Theorem 6

Proof of Theorem 6

Remark 8

Remark 9

Remark 10

5 Existence and Uniqueness

Assumption K1

Assumption V1

Theorem 7

Remark 11

Proof of Theorem 7

Lemma 2

Proof of Lemma 2

Lemma 3

Proof of Lemma 3

Lemma 4

Proof of Lemma 4

6 Stability

Assumption V2

Theorem 8

Corollary 1

Remark 12

Proof (Proof of Theorem 8)

Proposition 5

Proof of Proposition 5

7 Space-Time Discretization: A Practical Algorithm

Notes

References

Acknowledgements

Author information

Authors and Affiliations