Abstract
This is an expository paper on the theory of gradient flows, and in particular of those PDEs which can be interpreted as gradient flows for the Wasserstein metric on the space of probability measures (a distance induced by optimal transport). The starting point is the Euclidean theory, and then its generalization to metric spaces, according to the work of Ambrosio, Gigli and Savaré. Then comes an independent exposition of the Wasserstein theory, with a short introduction to the optimal transport tools that are needed and to the notion of geodesic convexity, followed by a precise description of the Jordan–Kinderlehrer–Otto scheme and a sketch of proof to obtain its convergence in the easiest cases. A discussion of which equations are gradient flows PDEs and of numerical methods based on these ideas is also provided. The paper ends with a new, theoretical, development, due to Ambrosio, Gigli, Savaré, Kuwada and Ohta: the study of the heat flow in metric measure spaces.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Gradient flows, or steepest descent curves, are a very classical topic in evolution equations: take a functional F defined on a vector space X, and, instead of looking at points x minizing F (which is related to the static equation \(\nabla F(x)=0\)), we look, given an initial point \(x_0\), for a curve starting at \(x_0\) and trying to minimize F as fast as possible (in this case, we will solve equations of the form \(x'(t)=-\nabla F(x(t))\)). As we speak of gradients (which are element of X, and not of \(X'\) as the differential of F should be), it is natural to impose that X is an Hilbert space (so as to identify it with its dual and produce a gradient vector). In the finite-dimensional case, the above equation is very easy to deal with, but also the infinite-dimensional case is not so exotic. Indeed, just think at the evolution equation \(\partial _t u =\Delta u\), which is the evolution variant of the static Laplace equation \(-\Delta u=0\). In this way, the heat equation is the gradient flow, in the \(L^2\) Hilbert space, of the Dirichlet energy \(F(u)=\frac{1}{2}\int |\nabla u|^2\), of which \(-\Delta u\) is the gradient in the appropriate sense (more generally, one could consider equations of the form \(\partial _t u =\delta F/\delta u\), where this notation stands for the first variation of F).
But this is somehow classical... The renovated interest for the notion of gradient flow arrived between the end of the twentieth century and the beginning of the twenty-first century, with the work of Jordan et al. [54] and then of Otto [78], who saw a gradient flow structure in some equations of the form \(\partial _t\varrho -\nabla \cdot (\varrho \mathbf {v})=0\), where the vector field \(\mathbf {v}\) is given by \(\mathbf {v}=\nabla [\delta F/\delta \varrho ]\). This requires to use the space of probabilities \(\varrho \) on a given domain, and to endow it with a non-linear metric structure, derived from the theory of optimal transport. This theory, initiated by Monge [77] in the eighteenth century, then developed by Kantorovich in the ’40s [55], is now well-established (many texts present it, such as [84, 90, 91]) and is intimately connected with PDEs of the form of the continuity equation \(\partial _t\varrho -\nabla \cdot (\varrho \mathbf {v})=0\).
The turning point for the theory of gradient flows and for the interest that researchers in PDEs developed for it was surely the publication of [3]. This celebrated book established a whole theory on the notion of gradient flow in metric spaces, which requires careful definitions because, in the equation \(x'(t)=-\nabla F(x(t))\), neither the term \(x'\) nor \(\nabla F\) make any sense in this framework. For existence and—mainly—uniqueness results, the notion of geodesic convexity (convexity of a functional F defined on a metric space X, when restricted to the geodesic curves of X) plays an important role. Then the theory is particularized in the second half of [3] to the case of the metric space of probability measures endowed with the so-called Wasserstein distance coming from optimal transport, whose differential structure is widely studied in the book. In this framework, the geodesic convexity results that McCann obtained in [72] are crucial to make a bridge from the general to the particular theory.
It is interesting to observe that, finally, the heat equation turns out to be a gradient flow in two different senses: it is the gradient flow of the Dirichlet energy in the \(L^2\) space, but also of the entropy \(\int \varrho \log (\varrho )\) in the Wasserstein space. Both frameworks can be adapted from the particular case of probabilities on a domain \(\Omega \subset \mathbb R^d\) to the more general case of metric measure spaces, and the question whether the two flows coincide, or under which assumptions they do, is natural. It has been recently studied by Ambrosio, Gigli, Savaré and new collaborators (Kuwada and Ohta) in a series of papers [6, 48, 50], and has been the starting point of recent researches on the differential structure of metric measure spaces.
The present survey, which is an extended, updated, and English version of a Bourbaki seminar given by the author in 2013 ([83]; the reader will also remark that most of the extensions are essentially taken from [84]), aims at giving an overview of the whole theory. In particular, among the goals, there is at the same time to introduce the tools for studying metric gradient flows, but also to see how to deal with Wasserstein gradient flows without such a theory. This could be of interest for applied mathematicians, who could be more involved in the specific PDEs that have this gradient flow form, without a desire for full generality; for the same audience, a section has been added about numerical methods inspired from the so-called JKO (Jordan–Kinderlehrerer–Otto) scheme, and one on a list of equations which fall into these framework. De facto, more than half of the survey is devoted to the theory in the Wasserstein spaces.
The paper is organized as follows: after this introduction, In Sect. 2 we describe the theory in the Euclidean case, and present those which are the good definitions which can be translated into a metric setting; Sect. 3 is devoted to the general metric setting, as in the first half of [3], and is quite expository (only the key ingredients to obtain the proofs are sketched); Sect. 4 is the longest one and contains the Wasserstein setting: after an introduction to optimal transport and to the Wasserstein distances, there in an informal presentation of the equations that can be obtained as gradient flows, a presentation of the different tools to prove convergence of the discrete scheme (called JKO scheme) to a solution of the corresponding equation, a discussion of the functionals which have geodesic convexity properties, a short discussion about other equations and functionals which fit the framework and possible variants, and finally a short section about numerics. Last but not least, in Sect. 5 we give a very short presentation of the fascinating topic of heat flows in arbitraty metric measure spaces, with reference to the interesting implications that this has in the differential structure of these spaces.
This survey is meant to be suitable for readers with different backgrounds and interests. In particular, the reader who is mainly interested in gradient flows in the Wasserstein space and in PDE applications can decide to skip Sects. 3, part of 4.5 and 5, which deal on the contrary with key objects for the—very lively at the moment—subject of analysis on metric measure spaces.
This is an expository paper where, of course, no new result is presented. Moreover, the presentation is inspired from different sources (several articles and books by the author, in particular [84], even if Sects. 3 and 5 are not detailed there, but also [82, 83]); yet, some topics are presented under a different point of view, and/or for the first time. In particular, the reader will find new presentation ingredients in some parts of Sects. 2.1 (some classical observations on gradient flows in \(\mathbb R^n\), which are presented via a JKO-compatible approach, yet not usually detailed in optimal-transport-oriented texts), 4.5 (the explicit role of geodesic convexity for uniqueness results), and 4.7 (the discussion about numerical methods).
2 From Euclidean to metric
2.1 Gradient flows in the Euclidean space
Before dealing with gradient flows in general metric spaces, the best way to clarify the situation is to start from the easiest case, i.e. what happens in the Euclidean space \(\mathbb R^n\). Most of what we will say stays true in an arbitrary Hilbert space, but we will stick to the finite-dimensional case for simplicity.
Here, given a function \(F:\mathbb R^n\rightarrow \mathbb R\), smooth enough, and a point \(x_0\in \mathbb R^n\), a gradient flow is just defined as a curve x(t), with starting point at \(t=0\) given by \(x_0\), which moves by choosing at each instant of time the direction which makes the function F decrease as much as possible. More precisely, we consider the solution of the Cauchy Problem
This is a standard Cauchy problem which has a unique solution if \(\nabla F\) is Lipschitz continuous, i.e. if \(F\in C^{1,1}\). We will see that existence and uniqueness can also hold without this strong assumption, thanks to the variational structure of the equation.
A first interesting property is the following, concerning uniqueness and estimates. We will present it in the case where F is convex, which means that it could be non-differentiable, but we can replace the gradient with the subdifferential. More precisely, we can consider instead of (2.1), the following differential inclusion: we look for an absolutely continuous curve \(x:[0,T]\rightarrow \mathbb R^n\) such that
where \(\partial F(x)=\{p\in \mathbb R^n\,:\,F(y)\ge F(x)+p\cdot (y-x)\; \text{ for } \text{ all } y\in \mathbb R^n\}\). We refer to [80] for all the definitions and notions from convex analysis that could be needed in the following, and we recall that, if F is differentiable at x, we have \(\partial F(x)=\{\nabla F(x)\}\) and that F is differentiable at x if and only if \(\partial F\) is a singleton. Also note that \(\partial F(x)\) is always a convex set, and is not empty whenever F is real-valued (or x is in the interior of \(\{x\,:\,F(x)<+\infty \}\)), and we denote by \(\partial ^\circ F(x)\) its element of minimal norm.
Proposition 2.1
Suppose that F is convex and let \(x_1\) and \(x_2\) be two solutions of (2.2). Then \(t\mapsto |x_1(t)-x_2(t)|\) is non-increasing. In particular the Cauchy problem (2.2) has a unique solution.
Proof
Consider \(g(t)=\frac{1}{2}|x_1(t)-x_2(t)|^2\): we have
Using the fact that for every \(x_1,x_2,p_1,p_2\) with \(p_i\in \partial F(x_i)\) we have \((x_1-x_2)\cdot (p_1-p_2)\ge 0,\) we obtain \(g'(t)\le 0\). This proves the first part of the claim. The second is an easy consequence: if we have \(x_1(0)=x_2(0)\), then we find \(x_1(t)=x_2(t)\) for any \(t>0\).\(\square \)
We can also study the case where F is semi-convex, i.e. \(\lambda \)-convex for some \(\lambda \in \mathbb R\). This means that \(x\mapsto F(x)-\frac{\lambda }{2} |x|^2\) is convex: for \(\lambda >0\) this is stronger than convexity, and for \(\lambda <0\) it is weaker. Roughly speaking, \(\lambda \)-convexity corresponds to \(D^2F\ge \lambda I\). The reason for the interest in semi-convex functions lies in the fact that on the one hand, as the reader will see throughout the exposition, the general theory of gradient flows applies very well to this class of functions and that, on the other hand, they are general enough to cover many interesting cases. In particular, on a bounded set, all smooth (\(C^2\) is enough) functions are \(\lambda \)-convex for a suitable \(\lambda <0\).
For \(\lambda \)-convex functions, we can define their subdifferential as follows
This definition is consistent with the above one whenever \(\lambda \ge 0\) (and guarantees \(\partial F(x)\ne \emptyset \) for \(\lambda <0\)). Also, one can check that, setting \(\tilde{F}(x)=F(x)-\frac{\lambda }{2} |x|^2\), this definition coincides with \(\{p\in \mathbb R^n\,:\,p-\lambda x\in \partial \tilde{F}(x)\}\). Again, we define \(\partial ^\circ F\) the element of minimal norm of \(\partial F\).
Remark 2.1
From the same proof of Proposition 2.1, one can also deduce uniqueness and stability estimates in the case where F is \(\lambda \)-convex. Indeed, in this case we obtain \(|x_1(t)-x_2(t)|\le |x_1(0)-x_2(0)|e^{-\lambda t},\) which also proves, if \(\lambda >0\), exponential convergence to the unique minimizer of F. The key point is that, if F is \(\lambda \)-convex it is easy to prove that \(x_1,x_2,p_1,p_2\) with \(p_i\in \partial F(x_i)\) provides
This implies \(g'(t)\le -2\lambda g(t)\) and allows to conclude, by Gronwall’s lemma, \(g(t)\le g(0) e^{-2\lambda t}\). For the exponential convergence, if \(\lambda >0\) then F is coercive and admits a minimizer, which is unique by strict convexity. Let us call it \(\bar{x}\). Take a solution x(t) and compare it to the constant curve \(\bar{x}\), which is a solution since \(0\in \partial F(\bar{x})\). Then we get \(|x_1(t)-\bar{x}|\le e^{-\lambda t}|x_1(0)-\bar{x}|\).
Another well-known fact about the \(\lambda \)-convex case is the fact that the differential inclusion \(x'(t)\in -\partial F(x(t))\) actually becomes, a.e., an equality: \(x'(t)=-\partial ^\circ F(t)\). More precisely, we have the following.
Proposition 2.2
Suppose that F is \(\lambda \)-convex and let x be a solution of (2.2). Then, for all the times \(t_0\) such that both \(t\mapsto x(t)\) and \(t\mapsto F(x(t))\) are differentiable at \(t=t_0\), the subdifferential \(\partial F(x(t_0))\) is contained in a hyperplane orthogonal to \(x'(t_0)\). In particular, we have \(x'(t)=-\partial ^\circ F(x(t))\) for a.e. t.
Proof
Let \(t_0\) be as in the statement, and \(p\in \partial F(x(t_0))\). From the definition of subdifferential, for every t we have
but this inequality becomes an equality for \(t=t_0\). Hence, the quantity
is minimal for \(t=t_0\) and, differentiating in t (which is possible by assumption), we get
Since this is true for every \(p\in \partial F(x(t_0))\), this shows that \(\partial F(x(t_0))\) is contained in a hyperplane of the form \(\{p\,:\,p\cdot x'(t_0)=const\}\).
Whenever \(x'(t_0)\) belongs to \(\partial F(x(t_0))\) (which is true for a.e. \(t_0\)), this shows that \(x'(t_0)\) is the orthogonal projection of 0 onto \(\partial F(x(t_0))\) and onto the hyperplane which contains it, and hence its element of minimal norm. This provides \(x'(t_0)=-\partial ^\circ F(x(t_0))\) for a.e. \(t_0\), as the differentiability of x and of \(F\circ x\) are also true a.e., since x is supposed to be absolutely continuous and F is locally Lipschitz.\(\square \)
Another interesting feature of those particular Cauchy problems which are gradient flows is their discretization in time. Actually, one can fix a small time step parameter \(\tau >0\) and look for a sequence of points \((x^\tau _k)_k\) defined through the iterated scheme, called Minimizing Movement Scheme (see [1, 36]),
The convexity assumption on F is not necessary for this part of the analysis. It is enough to suppose that F is l.s.c. and satisfies some lower bounds, for instance \(F(x)\ge C_1-C_2|x|^2\), to guarantee that a minimizer exists for small \(\tau \). However, it is true that, if F is \(\lambda \)-convex, then these conditions are satisfied, and we also have uniqueness of the minimizer. Indeed, for \(\lambda >0\) we have strict convexity for every \(\tau \), and the sum will be strictly convex for small \(\tau \) if \(\lambda <0\).
The sequence of points \(x^\tau _{k}\) must be considered as the values of the curve x(t) at times \(t=0,\tau ,\dots , k\tau ,\dots \) If we look at the optimality conditions of the recursive minimization we obtain
This last condition, written as
is exactly the discrete-time implicit Euler scheme for \(x'=-\nabla F(x)\)! (note that in the convex non-smooth case this becomes \( \frac{x^\tau _{k+1}-x^\tau _k}{\tau }\in - \partial F(x^\tau _{k+1})\)).
We recall that, given an ODE \(x'(t)=\mathbf {v}(x(t))\) (that we take autonomous for simplicity), with given initial datum \(x(0)=x_0\), Euler schemes are time-discretizations where derivatives are replaced by finite differences. We fix a time step \(\tau >0\) and define a sequence \(x^\tau _k\). The explicit scheme is given by
while the implicit scheme is given by
This means that \(x^\tau _{k+1}\) is selected as a solution of an equation involving \(x^\tau _k\), instead of being explicitly computable from \(x^\tau _k\). The explicit scheme is obviously easier to implement, but enjoys less stability and qualitative properties than the implicit one. Suppose for instance \( \mathbf {v}=-\nabla F\): then the quantity F(x(t)) decreases in t in the continuous solution, which is also the case for the implicit scheme, but not for the explicit one (which represents the iteration of the gradient method for the minimization of F). Note that the same can be done for evolution PDEs, and that solving the heat equation \(\partial _t\varrho =\Delta \varrho _t\) by an explicit scheme is very dangerous: at every step, \(\varrho ^\tau _{k+1}\) would have two degrees of regularity less than \(\varrho ^\tau _{k}\), since it would be obtained through \(\varrho ^\tau _{k+1}=\varrho ^\tau _{k}-\tau \Delta \varrho ^\tau _{k}\).
It is possible to prove that, for \(\tau \rightarrow 0\), the sequence we found, suitably interpolated, converges to the solution of Problem (2.2). We give here below the details of this argument, as it will be the basis of the argument that we will use in Sect. 4.
First, we define two different interpolations of the points \(x^\tau _k\). Let us define two curves \(x^\tau ,\tilde{x}^\tau :[0,T]\rightarrow \mathbb R^n\) as follows: first we define
then we set
Also set
It is easy to see that \(\tilde{x}^\tau \) is a continuous curve, piecewise affine (hence absolutely continuous), satisfying \((\tilde{x}^\tau )'=\mathbf {v}^\tau \). On the contrary, \(x^\tau \) is not continuous, but satisfies by construction \(\mathbf {v}^\tau (t)\in -\partial F(x^\tau (t))\).
The iterated minimization scheme defining \(x^\tau _{k+1}\) provides the estimate
obtained comparing the optimal point \(x^\tau _{k+1}\) to the previous one. If \(F(x_0)<+\infty \) and \(\inf F>-\infty \), summing over k we get
This is valid for every \(\ell \), and we can arrive up to \(\ell =\lfloor T/\tau \rfloor \). Now, note that
This means that we have
and hence \(\tilde{x}^\tau \) is bounded in \(H^1\) and \(\mathbf {v}^\tau \) in \(L^2\). The injection \(H^1\subset C^{0,1/2}\) provides an equicontinuity bound on \(\tilde{x}^\tau \) of the form
This also implies
since \( x^\tau (t)=\tilde{x}^\tau (s)\) for a certain \(s=k\tau \) with \(|s-t|\le \tau \). Finally, we have proven the necessary compactness to obtain the following:
Proposition 2.3
Let \(\tilde{x}^\tau \), \(x^\tau \) and \(\mathbf {v}^\tau \) be constructed as above using the minimizing movement scheme. Suppose \(F(x_0)<+\infty \) and \(\inf F>-\infty \). Then, up to a subsequence \(\tau _j\rightarrow 0\) (still denoted by \(\tau \)), both \(\tilde{x}^\tau \) and \(x^\tau \) converge uniformly to a same curve \(x\in H^1\), and \(\mathbf {v}^\tau \) weakly converges in \(L^2\) to a vector function \(\mathbf {v}\), such that \(x'=\mathbf {v}\) and
-
1.
if F is \(\lambda \)-convex, we have \(\mathbf {v}(t)\in -\partial F(x(t))\) for a.e. t, i.e. x is a solution of (2.2);
-
2.
if F is \(C^1\), we have \(\mathbf {v}(t)= -\nabla F(x(t))\) for all t, i.e. x is a solution of (2.1).
Proof
Thanks to the estimates (2.6) and (2.7) and the fact that the initial point \(\tilde{x}^\tau (0)\) is fixed, we can apply Ascoli–Arzelà’s theorem to \(\tilde{x}^\tau \) and get a uniformly converging subsequence. The estimate (2.8) implies that, on the same subsequence, \(x^\tau \) also converges uniformly to the same limit, that we will call \(x=[0,T]\rightarrow \mathbb R^n\). Then, \(\mathbf {v}^\tau =(\tilde{x}^\tau )'\) and (2.6) allows to guarantee, up to an extra subsequence extraction, the weak convergence \(\mathbf {v}^\tau \rightharpoonup \mathbf {v}\) in \(L^2\). The condition \(x'=\mathbf {v}\) is automatical as a consequence of distributional convergence.
To prove 1), we will fix a point \(y\in \mathbb R^n\) and write
We then multiply by a positive measurable function \(a:[0,T]\rightarrow \mathbb R_+\) and integrate:
We can pass to the limit as \(\tau \rightarrow 0\), using the uniform (hence \(L^2\) strong) convergence \(x^\tau \rightarrow x\) and the weak convergence \(\mathbf {v}^\tau \rightharpoonup \mathbf {v}\). In terms of F, we just need its lower semi-continuity. This provides
From the arbitrariness of a, the inequality
is true for a.e. t (for fixed y). Using y in a dense countable set in the interior of \(\{F<+\infty \}\) (where F is continuous), we get \(\mathbf {v}(t)\in \partial F(x(t))\).
To prove 2), the situation is easier. Indeed we have
The first term in the equality uniformly converges, as a function of t, (since \(\nabla F\) is continuous and \(x^\tau \) lives in a compact set) to \(-\nabla F(x)\), the second weakly converges to \(\mathbf {v}\) and the third to \(x'\). This proves the claim, and the equality is now true for every t as the function \(t\mapsto -\nabla F(x(t))\) is uniformly continuous.\(\square \)
In the above result, we only proved convergence of the curves \(x^\tau \) to the limit curve x, solution of \(x'=-\nabla F(x)\) (or \(-x'\in \partial F(x)\)), but we gave no quantitative order of convergence, and we will not study such an issue in the rest of the survey neither. On the contrary, the book [3] which will be the basis for the metric case, also provides explicit estimates; these estimates are usually of order \(\tau \). An interesting observation, in the Euclidean case, is that if the sequence \(x^\tau _k\) is defined by
then we have
and the convergence is of order \(\tau ^2\). This has been used in the Wasserstein caseFootnote 1 (see Sect. 4 and in particular Sect. 4.7) in [61].
2.2 An introduction to the metric setting
The iterated minimization scheme that we introduced above has another interesting feature: it even suggests how to define solutions for functions F which are only l.s.c., with no gradient at all!
Even more, this discretized formulation can easily be adapted to metric spaces. Indeed, given a metric space (X, d) and a l.s.c. function \(F:X\rightarrow \mathbb R\cup \{+\infty \}\) (under suitable compactness assumptions to guarantee existence of the minimum), one can define
and study the limit as \(\tau \rightarrow 0\). Then, we use the piecewise constant interpolation
and study the limit of \(x^\tau \) as \(\tau \rightarrow 0\).
De Giorgi, in [36], defined what he called Generalized Minimizing Movements Footnote 2:
Definition 2.1
A curve \(x:[0,T]\rightarrow X\) is called Generalized Minimizing Movements (GMM) if there exists a sequence of time steps \(\tau _j\rightarrow 0\) such that the sequence of curves \(x^{\tau _j}\) defined in (2.10) using the iterated solutions of (2.9) uniformly converges to x in [0, T].
The compactness results in the space of curves guaranteeing the existence of GMM are also a consequence of an Hölder estimate that we already saw in the Euclidean case. Yet, in the general case, some arrangements are needed, as we cannot use the piecewise affine interpolation. We will see later that, in case the segments may be replaced by geodesics, a similar estimate can be obtained. However, we can also obtain a Hölder-like estimate from the piecewise constant interpolation.
We start from
and
The Cauchy–Schwartz inequality gives, for \(t<s\), \(t\in [k\tau ,(k+1)\tau [\) and \(s\in [l\tau ,(l+1)\tau [\) (hence \(|l-k|\le \frac{|t-s|}{\tau }+1\)),
This provides a Hölder bound on the curves \(x^\tau \) (with a negligible error of order \(\sqrt{\tau }\) which disappears at the limit \(\tau \rightarrow 0\)), and allows to extract a converging subsequence.
Anyway, if we add some structure to the metric space (X, d), a more similar analysis to the Euclidean case can be performed. This is what happens when we suppose that (X, d) is a geodesic space. This requires a short discussion about curves and geodesics in metric spaces.
Curves and geodesics in metric spaces We recall that a curve \(\omega \) is a continuous function defined on an interval, say [0, 1], and valued in a metric space (X, d). As it is a map between metric spaces, it is meaningful to say whether it is Lipschitz or not, but its velocity \(\omega '(t)\) has no meaning, unless X is a vector space. Surprisingly, it is possible to give a meaning to the modulus of the velocity (i.e. the speed) \(|\omega '|(t)\).
Definition 2.2
If \(\omega :[0,1]\rightarrow X\) is a curve valued in the metric space (X, d) we define the metric derivative of \(\omega \) at time t, denoted by \(|\omega '|(t)\) through
provided this limit exists.
In the spirit of Rademacher Theorem, it is possible to prove (see [9]) that, if \(\omega :[0,1]\rightarrow X\) is Lipschitz continuous, then the metric derivative \(|\omega '|(t)\) exists for a.e. t. Moreover we have, for \(t_0<t_1\),
The same is also true for more general curves, not only Lipschitz continuous.
Definition 2.3
A curve \(\omega :[0,1]\rightarrow X\) is said to be absolutely continuous whenever there exists \(g\in L^1([0,1])\) such that \(d(\omega (t_0),\omega (t_1))\le \int _{t_0}^{t_1}g(s)\,\mathrm{d}s\) for every \(t_0<t_1\). The set of absolutely continuous curves defined on [0, 1] and valued in X is denoted by \(\mathrm {AC}(X)\).
It is well-known that every absolutely continuous curve can be reparametrized in time (through a monotone-increasing reparametrization) and become Lipschitz continuous. The existence of the metric derivative for a.e. t is also true for \(\omega \in \mathrm {AC}(X)\), via this reparametrization.
Given a continuous curve, we can also define its length, and the notion of geodesic curves.
Definition 2.4
For a curve \(\omega :[0,1]\rightarrow X\), let us define
It is easy to see that all curves \(\omega \in \mathrm {AC}(X)\) satisfy \({{\mathrm{Length}}}(\omega )\le \int _0^1 g(t)\mathrm{d}t<+\infty \). Also, we can prove that, for any curve \(\omega \in \mathrm {AC}( X)\), we have
We collect now some more definitions.
Definition 2.5
A curve \(\omega :[0,1]\rightarrow X\) is said to be a geodesic between \(x_0\) and \(x_1\in X\) if \(\omega (0)=x_0\), \(\omega (1)=x_1\) and \({{\mathrm{Length}}}(\omega )=\min \{{{\mathrm{Length}}}(\tilde{\omega })\,:\,\tilde{\omega }(0)=x_0,\,\tilde{\omega }(1)=x_1\}\).
A space (X, d) is said to be a length space if for every x and y we have
A space (X, d) is said to be a geodesic space if for every x and y we have
i.e. if it is a length space and there exist geodesics between arbitrary points.
In a length space, a curve \(\omega :[t_0,t_1]\rightarrow X\) is said to be a constant-speed geodesic between \(\omega (0)\) and \(\omega (1)\in X\) if it satisfies
It is easy to check that a curve with this property is automatically a geodesic, and that the following three facts are equivalent (for arbitrary \( p>1\))
-
1.
\(\omega \) is a constant-speed geodesic defined on \( [t_0,t_1]\) and joining \(x_0\) and \(x_1\),
-
2.
\(\omega \in \mathrm {AC}(X)\) and \(|\omega '|(t)=\frac{d(\omega (t_0),\omega (t_1))}{t_1-t_0}\) a.e.,
-
3.
\(\omega \) solves \(\min \left\{ \int _{t_0}^{t_1} |\omega '|(t)^p\mathrm{d}t\,:\,\omega (t_0)=x_0,\omega (t_1)=x_1\right\} \).
We can now come back to the interpolation of the points obtained through the Minimizing Movement scheme (2.9) and note that, if (X, d) is a geodesic space, then the piecewise affine interpolation that we used in the Euclidean space may be helpfully replaced via a piecewise geodesic interpolation. This means defining a curve \(x^\tau :[0,T]\rightarrow X\) such that \(x^\tau (k\tau )=x^\tau _k\) and such that \(x^\tau \) restricted to any interval \([k\tau ,(k+1)\tau ]\) is a constant-speed geodesic with speed equal to \(d(x^\tau _k,x^\tau _{k+1})/\tau \). Then, the same computations as in the Euclidean case allow to prove an \(H^1\) bound on the curves \(x^\tau \) (i.e. an \(L^2\) bound on the metric derivatives \(|(x^\tau )'|\)) and prove equicontinuity.
The next question is how to characterize the limit curve obtained when \(\tau \rightarrow 0\), and in particular how to express the fact that it is a gradient flow of the function F. Of course, one cannot try to prove the equality \(x'=-\nabla F(x)\), just because neither the left-hand side nor the right-hand side have a meaning in a metric space!
In some particular cases, when the space X, the distance d, and the functional F have some extra structures, it is possible to pass to the limit the optimality conditions of each optimization problem in the discretized scheme, and characterize the limit curves (or the limit curve) x(t). This will be the case in the framework of probability measures, as it will be discussed in Sect. 4, but not in general. Indeed, having some sort of differential structure on the space X is necessary in order to do so. Hence, if we want to develop a general theory for gradient flows in metric spaces, finer tools are needed. In particular, we need to characterize the solutions of \(x'=-\nabla F(x)\) (or \(x'\in -\partial F(x)\)) by only using metric quantities (in particular, avoiding derivatives, gradients, and more generally vectors). The book by Ambrosio–Gigli–Savaré [3], and in particular its first part (the second being devoted to the space of probability measures) exactly aims at doing so.
Hence, what we do here is to present alternative characterizations of gradient flows in the smooth Euclidean case, which can be used as a definition of gradient flow in the metric case, since all the quantities which are involved have their metric counterpart.
The first observation is the following: thanks to the Young inequality, for every curve we have
Here, this inequality is an equality if and only if \(x'(r)=-\nabla F(x(r))\) for a.e. r. Hence, the condition, called EDE (Energy Dissipation Equality)
(or even the simple inequality \(F(x(s)){-}F(x(t))\ge \! \int _s^t \left( \frac{1}{2} |x'(r)|^2 \!+ \frac{1}{2} |\nabla F(x(r))|^2\right) \mathrm{d}r\)) is equivalent to \(x'=-\nabla F(x)\) a.e., and could be taken as a definition of gradient flow.
In the general theory of Gradient Flows [3] in metric spaces, another characterization, different from the EDE, is proposed in order to cope with uniqueness and stability results. It is based on the following observation: if \(F:\mathbb R^d\rightarrow \mathbb R\) is convex, then the inequality
characterizes (by definition) the vectors \(p\in \partial F(x)\) and, if \(F\in C^1\), it is only satisfied for \(p=\nabla F(x)\). Analogously, if F is \(\lambda \)-convex, the inequality that characterizes the gradient is
Hence, we can pick a curve x(t) and a point y and compute
Consequently, imposing
for all y, will be equivalent to \(-x'(t)\in -\partial F(x(t))\). This will provide a second characterization (called EVI, Evolution Variational Inequality) of gradient flows in a metric environment. Indeed, all the terms appearing in the above inequality have a metric counterpart (only squared distances and derivatives w.r.t. time appear). Even if we often forget the dependance on \(\lambda \), it should be noted that the condition EVI should actually be written as EVI\(_\lambda \), since it involves a parameter \(\lambda \), which is a priori arbitrary. Actually, \(\lambda \)-convexity of F is not necessary to define the EVI\(_\lambda \) property, but it will be necessary in order to guarantee the existence of curves which satisfy such a condition. The notion of \(\lambda \)-convexity will hence be crucial also in metric spaces, where it will be rather “\(\lambda \)-geodesic-convexity”.
The role of the EVI condition in the uniqueness and stability of gradient flows is quite easy to guess. Take two curves, that we call x(t) and y(s), and compute
If one wants to estimate \(E(t)=\frac{1}{2}d(x(t),y(t))^2\), summing up the two above inequalities, after a chain-rule argument for the composition of the function of two variables \((t,s)\mapsto \frac{1}{2}d(x(t),y(s))^2\) and of the curve \(t\mapsto (t,t)\), gives
By Gronwall Lemma, this provides uniqueness (when \(x(0)=y(0)\)) and stability.
3 The general theory in metric spaces
3.1 Preliminaries
In order to sketch the general theory in metric spaces, first we need to give (or recall) general definitions for the three main objects that we need in the EDE and EVI properties, characterizing gradient flows: the notion of speed of a curve, that of slope of a function (somehow the modulus of its gradient) and that of (geodesic) convexity.
Metric derivative We already introduced in the previous section the notion of metric derivative: given a curve \(x:[0,T]\rightarrow X\) valued in a metric space, we can define, instead of the velocity \(x'(t)\) as a vector (i.e., with its direction, as we would do in a vector space), the speed (i.e. the modulus, or norm, of \(x'(t)\)) as follows:
provided the limit exists.This is the notion of speed that we will use in metric spaces.
Slope and modulus of the gradient Many definitions of the modulus of the gradient of a function F defined over a metric space are possible. First, we call upper gradient of F every function \(g:X\rightarrow \mathbb R\) such that, for every Lipschitz curve x, we have
If F is Lipschitz continuous, a possible choice is the local Lipschitz constant
another is the descending slope (we will often say just slope), which is a notion more adapted to the minimization of a function than to its maximization, and hence reasonable for lower semi-continuous functions:
(note that the slope vanishes at every local minimum point). In general, it is not true that the slope is an upper gradient, but we will give conditions to guarantee that it is. Later on (Sect. 5) we will see how to define a Sobolev space \(H^1\) on a metric (measure) space, by using suitable relaxations of the modulus of the gradient of F.
Geodesic convexity The third notion to be dealt with is that of convexity. This can only be done in a geodesic metric space. On such a space, we can say that a function is geodesically convex whenever it is convex along geodesics. More precisely, we require that for every pair (x(0), x(1)) there existsFootnote 3 a geodesic x with constant speed connecting these two points and such that
We can also define \(\lambda \)-convex functions as those which satisfy a modified version of the above inequality:
3.2 Existence of a gradient flow
Once fixed these basic ingredients, we can now move on to the notion of gradient flow. A starting approach is, again, the sequential minimization along a discrete scheme, for a fixed time step \(\tau >0\), and then pass to the limit. First, we would like to see in which framework this procedure is well-posed. Let us suppose that the space X and the function F are such that every sub-level set \(\{ F\le c\}\) is compact in X, either for the topology induced by the distance d, or for a weaker topology, such that d is lower semi-continuous w.r.t. it; F is required to be l.s.c. in the same topology. This is the minimal framework to guarantee existence of the minimizers at each step, and to get estimates as in (2.11) providing the existence of a limit curve. It is by the way a quite general situation, as we can see in the case where X is a reflexive Banach space and the distance d is the one induced by the norm: in this case there is no need to restrict to the (very severe) assumption that F is strongly l.s.c., but the weak topology allows to deal with a much wider situation.
We can easily understand that, even if the estimate (2.11) is enough to provide compactness, and thus the existence of a GMM, it will never be enough to characterize the limit curve (indeed, it is satisfied by any discrete evolution where \(x^\tau _{k+1}\) gives a better value than \(x^\tau _k\), without any need for optimality). Hence, we will never obtain either of the two formulations—EDE or EVI—of metric gradient flows.
In order to improve the result, we should exploit how much \(x^\tau _{k+1}\) is better than \(x^\tau _k\). An idea due to De Giorgi allows to obtain the desired result, via a “variational interpolation” between the points \(x^\tau _k\) and \(x^\tau _{k+1}\). In order to do so, once we fix \(x^\tau _k\), for every \(\theta \in ]0,1]\), we consider the problem
and call \(x(\theta )\) any minimizer for this problem, and \(\varphi (\theta )\) the minimal value. It is clear that, for \(\theta \rightarrow 0^+\), we have \(x(\theta )\rightarrow x^\tau _k\) and \(\varphi (\theta )\rightarrow F(x^\tau _k)\), and that, for \(\theta =1\), we get back to the original problem with minimizer \(x^\tau _{k+1}\). Moreover, the function \(\varphi \) is non-increasing and hence a.e. differentiable (actually, we can even prove that it is locally semiconcave). Its derivative \(\varphi '(\theta )\) is given by the derivative of the function \(\theta \mapsto F(x)+\frac{d^2(x,x^\tau _k)}{2\theta \tau }\), computed at the optimal point \(x=x(\theta )\) (the existence of \(\varphi '(\theta )\) implies that this derivative is the same at every minimal point \(x(\theta )\)). Hence we have
which means, by the way, that \(d(x(\theta ),x^\tau _k)^2\) does not depend on the minimizer \(x(\theta )\) for all \(\theta \) such that \(\varphi '(\theta )\) exists. Moreover, the optimality conditions for the minimization problem with \(\theta >0\) easily show that
This can be seen if we consider the minimization of an arbitrary function \(x\mapsto F(x)+cd^2(x,\bar{x})\), for fixed \(c>0\) and \(\bar{x}\), and we consider a competitor y. If x is optimal we have \(F(y)+cd^2(y,\bar{x})\ge F(x)+cd^2(x,\bar{x})\), which implies
We divide by d(y, x), take the positive part and then the \(\limsup \) as \(y\rightarrow x\), and we get \(|\nabla ^- F|(x)\le 2cd(x,\bar{x})\).
We now come back to the function \(\varphi \) and use
(the inequality is due to the possible singular part of the derivative for monotone functions; actually, we can prove that it is an equality by using the local semiconcave behavior, but this is not needed in the following), together with the inequality
that we just proved. Hence, we get an improved version of (2.11):
If we sum up for \(k=0,1,2,\dots \) and then take the limit \(\tau \rightarrow 0\), we can prove, for every GMM x, the inequality
under some suitable assumptions that we must select. In particular, we need lower-semicontinuity of F in order to handle the term \(F(x^\tau _{k+1})\) (which will become F(x(t)) at the limit), but we also need lower-semicontinuity of the slope \(|\nabla ^- F|\) in order to handle the corresponding term.
This inequality does not exactly correspond to EDE: on the one hand we have an inequality, and on the other we just compare instants t and 0 instead of t and s. If we want equality for every pair (t, s), we need to require the slope to be an upper gradient. Indeed, in this case, we have the inequality \(F(x(0))-F(x(t))\le \int _0^t |\nabla ^- F(x(r))||x'|(r)\mathrm{d}r\) and, starting from the usual inequalities, we find that (3.3) is actually an equality. This allows to subtract the equalities for s and t, and get, for \(s<t\):
Magically, it happens that the assumption that F is \(\lambda \)-geodesically convex simplifies everything. Indeed, we have two good points: the slope is automatically l.s.c., and it is automatically an upper gradient. These results are proven in [2, 3]. We just give here the main idea to prove both. This idea is based on a pointwise representation of the slope as a \(\sup \) instead of a \(\limsup \): if F is \(\lambda \)-geodesically convex, then we have
In order to check this, we just need to add a term \(\frac{\lambda }{2} d(x,y)\) inside the positive part of the definition of \(|\nabla ^- F|(x)\), which does not affect the limit as \(y\rightarrow x\) and shows that \(|\nabla ^- F|(x)\) is smaller than this \(\sup \). The opposite inequality is proven by fixing a point y, connecting it to x through a geodesic x(t), and computing the limit along this curve.
This representation as a \(\sup \) allows to prove semicontinuity of the slope.Footnote 4 It is also possible (see [2], for instance) to prove that the slope is an upper gradient.
Let us insist anyway on the fact that the \(\lambda \)-convexity assumption is not natural nor crucial to prove the existence of a gradient flow. On the one hand, functions smooth enough could satisfy the assumptions on the semi-continuity of F and of \(|\nabla ^-F|\) and the fact that \(|\nabla ^-F|\) is an upper gradient independently of convexity; on the other hand the discrete scheme already provides a method, well-posed under much weaker assumptions, to find a limit curve. If the space and the functional allow for it (as it will be the case in the next section), we can hope to characterize this limit curve as the solution of an equation (it will be a PDE in Sect. 4), without passing through the general theory and the EDE condition.
3.3 Uniqueness and contractivity
On the contrary, if we think at the uniqueness proof that we gave in the Euclidean case, it seems that some sort of convexity should be the good assumption in order to prove uniqueness. Here we will only give the main lines of the uniqueness theory in the metric framework: the key point is to use the EVI condition instead of the EDE.
The situation concerning these two different notions of gradient flows (EVI and EDE) in abstract metric spaces has been clarified by Savaré (in an unpublished note, but the proof can also be found in [2]), who showed that
-
All curves which are gradient flows in the EVI sense also satisfy the EDE condition.
-
The EDE condition is not in general enough to guarantee uniqueness of the gradient flow. A simple example: take \(X=\mathbb R^2\) with the \(\ell ^\infty \) distance
$$\begin{aligned} d((x_1,x_2),(y_1,y_2))=|x_1-y_1|\vee |x_2-y_2|, \end{aligned}$$and take \(F(x_1,x_2)=x_1\); we can check that any curve \((x_1(t),x_2(t))\) with \(x_1'(t)=-1\) and \(|x_2'(t)|\le 1\) satisfies EDE.
-
On the other hand, existence of a gradient flow in the EDE sense is quite easy to get, and provable under very mild assumption, as we sketched in Sect. 3.2.
-
The EVI condition is in general too strong in order to get existence (in the example above of the \(\ell ^\infty \) norm, no EVI gradient flow would exist), but always guarantees uniqueness and stability (w.r.t. initial data).
Also, the existence of EVI gradient flows is itself very restricting on the function F: indeed, it is proven in [35] that, if F is such that from every starting point \(x_0\) there exists an EVI\(_\lambda \) gradient flow, then F is necessarily \(\lambda \)-geodesically-convex.
We provide here an idea of the proof of the contractivity (and hence of the uniqueness) of the EVI gradient flows.
Proposition 3.1
If two curves \(x,y:[0,T]\rightarrow X\) satisfy the EVI condition, then we have
and \(d(x(t),y(t))\le e^{-\lambda t}d(x(0),y(0))\).
The second part of the statement is an easy consequence of the first one, by Gronwall Lemma. The first is (formally) obtained by differentiating \(t\mapsto d(x(t),y(t_0))^2\) at \(t=t_0\), then \(s\mapsto d(x(t_0),y(s))^2\) at \(s=t_0\).The EVI condition allows to write
and hence, summing up, and playing with the chain rule for derivatives, we get
If we want a satisfying theory for gradient flows which includes uniqueness, we just need to prove the existence of curves which satisfy the EVI condition, accepting that this will probably require additional assumptions. This can still be done via the discrete scheme, adding a compatibility hypothesis between the function F and the distance d, a condition which involves some sort of convexity. We do not enter the details of the proof, for which we refer to [3], where the convergence to an EVI gradient flow is proven, with explicit error estimates. These a priori estimates allow to prove that we have a Cauchy sequence, and then allow to get rid of the compactness part of the proof (by the way, we could even avoid using compactness so as to prove existence of a minimizer at every time step, using almost-minimizers and the in the Ekeland’s variational principle [39]). Here, we will just present this extra convexity assumption, needed for the existence of EVI gradient flows developed in [3].
This assumption, that we will call C\(^2\)G\(^2\) (Compatible Convexity along Generalized Geodesics) is the following: suppose that, for every pair \((x_0,x_1)\) and every \(y\in X\), there is a curve x(t) connecting \(x(0)=x_0\) to \(x(1)=x_1\), such that
In other words, we require \(\lambda \)-convexity of the function F, but also the 2-convexity of the function \(x\mapsto d^2(x,y)\), along a same curve which is not necessarily the geodesic. This second condition is automatically satisfied, by using the geodesic itself, in the Euclidean space (and in every Hilbert space), since the function \(x\mapsto |x-y|^2\) is quadratic, and its Hessian matrix is 2I at every point. We can also see that it is satisfied in a normed space if and only if the norm is induced by a scalar product. It has been recently pointed out by Gigli that the sharp condition on the space X in order to guarantee existence of EVI gradient flows is that X should be infinitesimally Hilbertian (this will be made precise in Sect. 5).
Here, we just observe that C\(^2\)G\(^2\) implies \((\lambda +\frac{1}{\tau })\)-convexity, along those curves, sometimes called generalized geodesics (consider that these curves also depend on a third point, sort of a base point, typically different from the two points that should be connected), of the functional that we minimize at each time step in the minimizing movement scheme. This provides uniqueness of the minimizer as soon as \(\tau \) is small enough, and allows to perform the desired estimates.
Also, the choice of this C\(^2\)G\(^2\) condition, which is a technical condition whose role is only to prove the existence of an EVI gradient flow, has been done in view of the applications to the case of the Wasserstein spaces, that wil be the object of the next section. Indeed, in these spaces the squared distance is not in general 2-convex along geodesics, but we can find some adapted curves on which it is 2-convex, and many functionals F stay convex on these curves.
We finish this section by mentioning a recent extension to some non-\(\lambda \)-convex functionals. The starting point is the fact that the very use of Gronwall’s lemma to prove uniqueness can be modified by allowing for a weaker condition. Indeed, it is well-known that, whenever a function \(\omega \) satisfies an Osgood condition \(\int _0^1\frac{1}{\omega (s)}\mathrm{d}s=+\infty \), then \(E'\le \omega (E)\) together with \(E(0)=0\) implies \(E(t)=0\) for \(t>0\). This suggests that one could define a variant of the EVI definition for functions which are not \(\lambda \)-convex, but almost, and this is the theory developed in [34]. Such a paper studies the case where F satisfies some sort of \(\omega \)-convexity for a “modulus of convexity” \(\omega \). More precisely, this means
on generalized geodesics \(x_t\) (note that in the case \(\omega (s)=s\) we come back to (3.2)). The function \(\omega \) is required to satisfy an Osgood condition (and some other technical conditions). Then, the EVI condition is replaced by
and this allows to produce a theory with existence and uniqueness results (via a variant of Proposition 3.1). In the Wasserstein spaces (see next section), a typical case of functionals which can fit this theory are functionals involving singular interaction kernels (or solutions to elliptic PDEs, as in the Keller–Segel case) under \(L^\infty \) constraints on the density (using the fact that the gradient \(\nabla u\) of the solution of \(-\Delta u=\varrho \) is not Lipschitz when \(\varrho \in L^\infty \), but is at least log-lipschitz).
4 Gradient flows in the Wasserstein space
One of the most exciting applications (and maybe the only one,Footnote 5 in what concerns applied mathematics) of the theory of gradient flows in metric spaces is that of evolution PDEs in the space of measures. This topic is inspired from the work by Jordan, Kinderlehrer and Otto [54], who had the intuition that the heat and the Fokker–Planck equations have a common variational structure in terms of a particular distance on the probability measures, the so-called Wasserstein distance. Yet, the theory has only become formal and general with the work by Ambrosio, Gigli and Savaré (which does not mean that proofs in [54] were not rigorous, but the intuition on the general structure still needed to be better understood).
The main idea is to endow the space \(\mathcal {P}(\Omega )\) of probability measures on a domain \(\Omega \subset \mathbb R^d\) with a distance, and then deal with gradient flows of suitable functionals on such a metric space. Such a distance arises from optimal transport theory. More details about optimal transport can be found in the books by Villani [90, 91] and in the book on gradient flows by Ambrosio et al. [3]Footnote 6; a recent book by the author of this survey is also available [84].
4.1 Preliminaries on optimal transport
The motivation for the whole subject is the following problem proposed by Monge [77]: given two densities of mass \(f,\,g\ge 0\) on \(\mathbb R^d\), with \(\int f=\int g =1\), find a map \(T:\mathbb R^d\rightarrow \mathbb R^d\) pushing the first one onto the other, i.e. such that
and minimizing the quantity
among all the maps satisfying this condition. This means that we have a collection of particles, distributed with density f on \(\mathbb R^d\), that have to be moved, so that they arrange according to a new distribution, whose density is prescribed and is g. The movement has to be chosen so as to minimize the average displacement. The map T describes the movement, and T(x) represents the destination of the particle originally located at x. The constraint on T precisely accounts for the fact that we need to reconstruct the density g. In the sequel, we will always define, similarly to (4.1), the image measure of a measure \(\mu \) on X (measures will indeed replace the densities f and g in the most general formulation of the problem) through a measurable map \(T:X\rightarrow Y\): it is the measure denoted by \(T_\#\mu \) on Y and characterized by
The problem of Monge has stayed with no solution (does a minimizer exist? how to characterize it?...) till the progress made in the 1940 s with the work by Kantorovich [55]. In the Kantorovich’s framework, the problem has been widely generalized, with very general cost functions c(x, y) instead of the Euclidean distance \(|x-y|\) and more general measures and spaces.
Let us start from the general picture. Consider a metric space X, that we suppose compact for simplicityFootnote 7 and a cost function \(c:X\times X\rightarrow [0,+\infty ]\). For simplicity of the exposition, we will suppose that c is continuous and symmetric: \(c(x,y)=c(y,x)\) (in particular, the target and source space will be the same space X).
The formulation proposed by Kantorovich of the problem raised by Monge is the following: given two probability measures \(\mu ,\nu \in \mathcal {P}(X)\), consider the problem
where \(\Pi (\mu ,\nu )\) is the set of the so-called transport plans, i.e.
where \(\pi _0\) and \(\pi _1\) are the two projections of \(X\times X\) onto its factors. These probability measures over \(X\times X\) are an alternative way to describe the displacement of the particles of \(\mu \): instead of saying, for each x, which is the destination T(x) of the particle originally located at x, we say for each pair (x, y) how many particles go from x to y. It is clear that this description allows for more general movements, since from a single point x particles can a priori move to different destinations y. If multiple destinations really occur, then this movement cannot be described through a map T. It can be easily checked that if \((id, T)_{\#}\mu \) belongs to \(\Pi (\mu ,\nu )\) then T pushes \(\mu \) onto \(\nu \) (i.e. \(T_\#\mu =\nu \)) and the functional takes the form \(\int c(x,T(x))\mathrm{d}\mu (x),\) thus generalizing Monge’s problem.
The minimizers for this problem are called optimal transport plans between \(\mu \) and \(\nu \). Should \(\gamma \) be of the form \((id, T)_{\#}\mu \) for a measurable map \(T:X\rightarrow X\) (i.e. when no splitting of the mass occurs), the map T would be called optimal transport map from \(\mu \) to \(\nu \).
This generalized problem by Kantorovich is much easier to handle than the original one proposed by Monge: for instance in the Monge case we would need existence of at least a map T satisfying the constraints. This is not verified when \(\mu =\delta _0\), if \(\nu \) is not a single Dirac mass. On the contrary, there always exists at least a transport plan in \(\Pi (\mu ,\nu )\) (for instance we always have \(\mu \otimes \nu \in \Pi (\mu ,\nu )\)). Moreover, one can state that \((\textit{KP})\) is the relaxation of the original problem by Monge: if one considers the problem in the same setting, where the competitors are transport plans, but sets the functional at \(+\infty \) on all the plans that are not of the form \((id, T)_{\#}\mu \), then one has a functional on \(\Pi (\mu ,\nu )\) whose relaxation (in the sense of the largest lower-semicontinuous functional smaller than the given one) is the functional in \((\textit{KP})\) (see for instance Section 1.5 in [84]).
Anyway, it is important to note that an easy use of the Direct Method of Calculus of Variations (i.e. taking a minimizing sequence, saying that it is compact in some topology—here it is the weak convergence of probability measures—finding a limit, and proving semicontinuity, or continuity, of the functional we minimize, so that the limit is a minimizer) proves that a minimum does exist. As a consequence, if one is interested in the problem of Monge, the question may become “does this minimizer come from a transport map T?” (note, on the contrary, that directly attacking by compactness and semicontinuity Monge’s formulation is out of reach, because of the non-linearity of the constraint \(T_\#\mu =\nu \), which is not closed under weak convergence).
Since the problem \((\textit{KP})\) is a linear optimization under linear constraints, an important tool will be duality theory, which is typically used for convex problems. We will find a dual problem \((\textit{DP})\) for \((\textit{KP})\) and exploit the relations between dual and primal.
The first thing we will do is finding a formal dual problem, by means of an inf-sup exchange.
First express the constraint \(\gamma \in \Pi (\mu ,\nu )\) in the following way : notice that, if \(\gamma \) is a non-negative measure on \(X\times X\), then we have
Hence, one can remove the constraints on \(\gamma \) by adding the previous sup, since if they are satisfied nothing has been added and if they are not one gets \(+\infty \) and this will be avoided by the minimization. We may look at the problem we get and interchange the inf in \(\gamma \) and the sup in \(\phi ,\psi \):
becomes
Obviously it is not always possible to exchange inf and sup, and the main tools to do this come from convex analysis. We refer to [84], Section 1.6.3 for a simple proof of this fact, or to [90], where the proof is based on Flenchel–Rockefeller duality (see, for instance, [41]). Anyway, we insist that in this case it is true that \(\inf \sup =\sup \inf \).
Afterwards, one can re-write the inf in \(\gamma \) as a constraint on \(\phi \) and \(\psi \), since one has
This leads to the following dual optimization problem: given the two probabilities \(\mu \) and \(\nu \) and the cost function \(c:X\times X\rightarrow [0,+\infty ]\) we consider the problem
This problem does not admit a straightforward existence result, since the class of admissible functions lacks compactness. Yet, we can better understand this problem and find existence once we have introduced the notion of c-transform (a kind of generalization of the well-known Legendre transform).
Definition 4.1
Given a function \(\chi :X\rightarrow \overline{\mathbb R}\) we define its c-transform (or c-conjugate function) by
Moreover, we say that a function \(\psi \) is c-concave if there exists \(\chi \) such that \(\psi =\chi ^c\) and we denote by \(\Psi _c(X)\) the set of c-concave functions.
It is quite easy to realize that, given a pair \((\phi ,\psi )\) in the maximization problem (DP), one can always replace it with \((\phi ,\phi ^c)\), and then with \((\phi ^{cc},\phi ^c)\), and the constraints are preserved and the integrals increased. Actually one could go on but it is possible to prove that \(\phi ^{ccc}=\phi ^c\) for any function \(\phi \). This is the same as saying that \(\psi ^{cc}=\psi \) for any c-concave function \(\psi \), and this perfectly recalls what happens for the Legendre transform of convex functions (which corresponds to the particular case \(c(x,y)=-x\cdot y\)).
A consequence of these considerations is the following well-known result:
Proposition 4.1
We have
where the max on the right hand side is realized. In particular the minimum value of \((\textit{KP})\) is a convex function of \((\mu ,\nu )\), as it is a supremum of linear functionals.
Definition 4.2
The functions \(\phi \) realizing the maximum in (4.4) are called Kantorovich potentials for the transport from \(\mu \) to \(\nu \) (and will be often denoted by the symbol \(\varphi \) instead of \(\phi \)).
Notice that any c-concave function shares the same modulus of continuity of the cost c. Hence, if c is uniformly continuous (which is always the case whenever c is continuous and X is compact), one can get a uniform modulus of continuity for the functions in \(\Psi _c(X)\). This is the reason why one can prove existence for \((\textit{DP})\) (which is the same of the right hand side problem in Proposition 4.1), by applying Ascoli–Arzelà’s Theorem.
We look at two interesting cases. When c(x, y) is equal to the distance d(x, y) on the metric space X, then we can easily see that
and
Another interesting case is the case where \(X=\Omega \subset \mathbb R^d\) and \(c(x,y)=\frac{1}{2}|x-y|^2\). In this case we have
Moreover, if \(X=\mathbb R^d\) this is an equivalence.
A consequence of (4.5) and (4.6) is that, in the case where \(c=d\), Formula 4.4 may be re-written as
We now concentrate on the quadratic case when X is a domain \(\Omega \subset \mathbb R^d\), and look at the existence of optimal transport maps T. From now on, we will use the word domain to denote the closure of a bounded and connected open set.
The main tool is the duality result. If we have equality between the minimum of \((\textit{KP})\) and the maximum of \((\textit{DP})\) and both extremal values are realized, one can consider an optimal transport plan \(\gamma \) and a Kantorovich potential \(\varphi \) and write
The equality on \(\mathrm{{spt}}\gamma \) is a consequence of the inequality which is valid everywhere and of
which implies equality \(\gamma \)-a.e. These functions being continuous, the equality passes to the support of the measure. Once we have that, let us fix a point \((x_0,y_0)\in \mathrm{{spt}}\gamma \). One may deduce from the previous computations that
and, if \(\varphi \) is differentiable at \(x_0\), one gets \(\nabla \varphi (x_0)=x_0-y_0,\) i.e. \(y_0=x_0-\nabla \varphi (x_0)\). This shows that only one unique point \(y_0\) can be such that \((x_0,y_0)\in \mathrm{{spt}}\gamma \), which means that \(\gamma \) is concentrated on a graph. The map T providing this graph is of the form \(x\mapsto x-\nabla \varphi (x)=\nabla u(x)\) (where \(u(x){:=} \frac{x^2}{2}-\varphi (x)\) is a convex function). This shows the following well-known theorem, due to Brenier ([19, 20], see also [45,46,47, 71]). Note that this also gives uniqueness of the optimal transport plan and of the gradient of the Kantorovich potential. The only technical point to let this strategy work is the \(\mu \)-a.e. differentiability of the potential \(\varphi \). Since \(\varphi \) has the same regularity of a convex function, and convex function are locally Lipschitz, it is differentiable Lebesgue-a.e., which allows to prove the following:
Theorem 4.2
Given \(\mu \) and \(\nu \) probability measures on a domain \(\Omega \subset \mathbb R^d\) there exists an optimal transport plan \(\gamma \) for the quadratic cost \(\frac{1}{2}|x-y|^2\). It is unique and of the form \((id, T)_{\#}\mu \), provided \(\mu \) is absolutely continuous. Moreover there exists also at least a Kantorovich potential \(\varphi \), and the gradient \(\nabla \varphi \) is uniquely determined \(\mu \)-a.e. (in particular \(\varphi \) is unique up to additive constants if the density of \(\mu \) is positive a.e. on \(\Omega \)). The optimal transport map T and the potential \(\varphi \) are linked by \(T(x)=x-\nabla \varphi (x)\). Moreover, the optimal map T is equal a.e. to the gradient of a convex function u, given by \(u(x){:=} \frac{x^2}{2}-\varphi (x)\).
Actually, the existence of an optimal transport map is true under weaker assumptions (see [47]): we can replace the condition of being absolutely continuous with the condition “\(\mu (A)=0\) for any \(A\subset \mathbb R^d\) such that \(\mathcal {H}^{d-1}(A)<+\infty \)” or with any condition which ensures that the non-differentiability set of \(\varphi \) is negligible (and convex function are more regular than locally Lipschitz functions).
In Theorem 4.2 only the part concerning the optimal map T is not symmetric in \(\mu \) and \(\nu \): hence the uniqueness of the Kantorovich potential is true even if it \(\nu \) (and not \(\mu \)) has positive density a.e. (since one can retrieve \(\varphi \) from \(\varphi ^c\) and viceversa).
We stress that Theorem 4.2 admits a converse implication and that any gradient of a convex function is indeed optimal between \(\mu \) and its image measure. Moreover, Theorem 4.2 can be translated in very easy terms in the one-dimensional case \(d=1\): given a non-atomic measure \(\mu \) and another measure \(\nu \), both in \(\mathcal {P}(\mathbb R)\), there exists a unique monotone increasing transport map T such that \(T_\#\mu =\nu \), and it is optimal for the quadratic cost.
Finally, the same kind of arguments could be adapted to prove existence and uniqueness of an optimal map for other costs, in particular to costs of the form \(c(x,y)=h(x-y)\) for a strictly convex function \(h:\mathbb R^d\rightarrow \mathbb R\), which includes all the costs of the form \(c(x,y)=|x-y|^p\) with \(p>1\). In the one-dimensional case it even happens that the same monotone increasing map is optimal for every \(p\ge 1\) (and it is the unique optimal map for \(p>1\))!
4.2 The Wasserstein distances
Starting from the values of the problem \((\textit{KP})\) in (4.2) we can define a set of distances over \(\mathcal {P}(X)\).
We mainly consider costs of the form \(c(x,y)=|x-y|^p\) in \(X=\Omega \subset \mathbb R^d\), but the analysis can be adapted to a power of the distance in a more general metric space X. The exponent p will always be taken in \([1,+\infty [\) (we will not discuss the case \(p=\infty \)) in order to take advantage of the properties of the \(L^p\) norms. When \(\Omega \) is unbounded we need to restrict our analysis to the following set of probabilities
In a metric space, we fix an arbitrary point \(x_0\in X\), and set
(the finiteness of this integral does not depend on the choice of \(x_0\)).
The distances that we want to consider are defined in the following way: for any \(p\in [ 1,+\infty [\) set
The quantities that we obtain in this way are called Wasserstein distances.Footnote 8 They are very important in many fields of applications and they seem a natural way to describe distances between equal amounts of mass distributed on a same space.
It is interesting to compare these distances to \(L^p\) distances between densities (a comparison which is meaningful when we consider absolutely continuous measures on \(\mathbb R^d\), for instance). A first observation is the very different behavior of these two classes of distances. We could say that, if \(L^p\) distances can be considered “vertical”, Wasserstein distances are instead “horizontal”. This consideration is very informal, but is quite natural if one associates with every absolutely continuous measure the graph of its density. To compute \(||f-g||_{L^p}\) we need to look, for every point x, at the distance between f(x) and g(x), which corresponds to a vertical displacement between the two graphs, and then integrate this quantity. On the contrary, to compute \(W_p(f,g)\) we need to consider the distance between a point x and a point T(x) (i.e. an horizontal displacement on the graph) and then to integrate this, for a particular pairing between x and T(x) which makes a coupling between f and g.
A first example where we can see the very different behavior of these two ways of computing distances is the following: take two densities f and g supported on [0, 1], and define \(g_h\) as \(g_h(x)=g(x-h)\). As soon as \(|h|>1\), the \(L^p\) distance between f and \(g_h\) equals \((||f||_{L^p}^p+||g||_{L^p}^p)^{1/p}\), and does not depend on the “spatial” information consisting in |h|. On the contrary, the \(W_p\) distance between f and \(g_h\) is of the order of |h| (for \(h\rightarrow \infty \)) and depends much more on the displacement than on the shapes of f and g (Fig. 1).
We now analyze some properties of these distances. Most proofs can be found in [84], Chapter 5, or in [90] or [3].
First we underline that, as a consequence of Hölder (or Jensen) inequalities, the Wasserstein distances are always ordered, i.e. \(W_{p_1}\le W_{p_2}\) if \(p_1\le p_2\). Reversed inequalities are possible only if \(\Omega \) is bounded, and in this case we have, if set \(D={{\mathrm{diam}}}(\Omega )\), for \(p_1\le p_2\),
This automatically guarantees that, if the quantities \(W_p\) are distances, then they induce the same topology, at least when \(\Omega \) is bounded. But first we should check that they are distances...
Proposition 4.3
The quantity \(W_p\) defined above is a distance over \(\mathcal {P}_p(\Omega )\).
Proof
First, let us note that \(W_p\ge 0\). Then, we also remark that \(W_p(\mu ,\nu )=0\) implies that there exists \(\gamma \in \Pi (\mu ,\nu )\) such that \(\int |x-y|^p\,\mathrm{d}\gamma =0\). Such a \(\gamma \in \Pi (\mu ,\nu )\) is concentrated on \(\{x=y\}\). This implies \(\mu =\nu \) since, for any test function \(\phi \), we have
We need now to prove the triangle inequality. We only give a proof in the case \(p>1\), with absolutely continuous measures.
Take three measures \(\mu ,\varrho \) and \(\nu \), and suppose that \(\mu \) and \(\varrho \) are absolutely continuous. Let T be the optimal transport from \(\mu \) to \(\varrho \) and S the optimal one from \(\varrho \) to \(\nu \). Then \(S\circ T\) is an admissible transport from \(\mu \) to \(\nu \), since \((S\circ T)_\#\mu =S_\#(T_\#\mu )=S_\#\varrho =\nu \). We have
Moreover,
and this last quantity equals \(W_p(\varrho ,\nu )\). Moreover, \(||T-id||_{L^p(\mu )}=W_p(\mu ,\varrho )\), whence
This gives the proof when \(\mu ,\varrho \ll \mathcal {L}^d\) and \(p>1\). For the general case, an approximation is needed (but other arguments can also apply, see, for instance, [84], Section 5.1).\(\square \)
We now give, without proofs, two results on the topology induced by \(W_p\) on a general metric space X.
Theorem 4.4
If X is compact, for any \(p\ge 1\) the function \(W_p\) is a distance over \(\mathcal {P}(X)\) and the convergence with respect to this distance is equivalent to the weak convergence of probability measures.
To prove that the convergence according to \(W_p\) is equivalent to weak convergence one first establish this result for \(p=1\), through the use of the duality formula in the form (4.7). Then it is possible to use the inequalities between the distances \(W_p\) (see above) to extend the result to a general p.
The case of a noncompact space X is a little more difficult. As we said, the distance is only defined on a subset of the whole space of probability measures, to avoid infinite values. Set, for a fixed reference point \(x_0\) which can be chosen to be 0 in the Euclidean space,
In this case, the distance \(W_p\) is only defined on \(\mathcal {P}_p(X){:=}\left\{ \mu \in \mathcal {P}(X):\,m_p(\mu ){<}{+}\infty \right\} \). We have
Theorem 4.5
For any \(p\ge 1\) the function \(W_p\) is a distance over \(\mathcal {P}_p(X)\) and, given a measure \(\mu \) and a sequence \((\mu _n)_n\) in \(\mathbb W_p(X)\), the following are equivalent:
-
\(\mu _n\rightarrow \mu \) according to \(W_p\);
-
\(\mu _n\rightharpoonup \mu \) and \(m_p(\mu _n)\rightarrow m_p(\mu )\);
-
\(\int _{X}\phi \,\,\mathrm{d}\mu _n\rightarrow \int _{X}\phi \,\,\mathrm{d}\mu \) for any \(\phi \in C^0(X)\) whose growth is at most of order p (i.e. there exist constants A and B depending on \(\phi \) such that \(|\phi (x)|\le A+Bd(x,x_0)^p\) for any x).
After this short introduction to the metric space \(\mathbb W_p{:=}(\mathcal {P}_p(X),W_p)\) and its topology, we will focus on the Euclidean case, i.e. where the metric space X is a domain \(\Omega \subset \mathbb R^d\), and study the curves valued in \(\mathbb W_p(\Omega )\) in connections with PDEs.
The main point is to identify the absolutely continuous curves in the space \(\mathbb W_p(\Omega )\) with solutions of the continuity equation \(\partial _t\mu _t+\nabla \cdot (\mathbf {v}_t\mu _t)=0\) with \(L^p\) vector fields \(\mathbf {v}_t\). Moreover, we want to connect the \(L^p\) norm of \(\mathbf {v}_t\) with the metric derivative \(|\mu '|(t)\).
We recall that standard considerations from fluid mechanics tell us that the continuity equation above may be interpreted as the equation ruling the evolution of the density \(\mu _t\) of a family of particles initially distributed according to \(\mu _0\) and each of which follows the flow
The main theorem in this setting (originally proven in [3]) relates absolutely continuous curves in \(\mathbb W_p\) with solutions of the continuity equation:
Theorem 4.6
Let \((\mu _t)_{t\in [0,1]}\) be an absolutely continuous curve in \(\mathbb W_p(\Omega )\) (for \(p>1\) and \(\Omega \subset \mathbb R^d\) an open domain). Then for a.e. \(t\in [0,1]\) there exists a vector field \(\mathbf {v}_t\in L^p(\mu _t;\mathbb R^d)\) such that
-
the continuity equation \(\partial _t\mu _t+\nabla \cdot (\mathbf {v}_t\mu _t)=0\) is satisfied in the sense of distributions,
-
for a.e. t we have \(||\mathbf {v}_t||_{L^p(\mu _t)}\le |\mu '|(t)\) (where \(|\mu '|(t)\) denotes the metric derivative at time t of the curve \(t\mapsto \mu _t\), w.r.t. the distance \(W_p\));
Conversely, if \((\mu _t)_{t\in [0,1]}\) is a family of measures in \(\mathcal {P}_p(\Omega )\) and for each t we have a vector field \(\mathbf {v}_t\in L^p(\mu _t;\mathbb R^d)\) with \(\int _0^1||\mathbf {v}_t||_{L^p(\mu _t)}\,\mathrm{d}t<+\infty \) solving \(\partial _t\mu _t+\nabla \cdot (\mathbf {v}_t\mu _t)=0\), then \((\mu _t)_t\) is absolutely continuous in \(\mathbb W_p(\Omega )\) and for a.e. t we have \( |\mu '|(t)\le ||\mathbf {v}_t||_{L^p(\mu _t)}\).
Note that, as a consequence of the second part of the statement, the vector field \(\mathbf {v}_t\) introduced in the first part must a posteriori satisfy \(||\mathbf {v}_t||_{L^p(\mu _t)}= |\mu '|(t)\).
We will not give the proof of this theorem, which is quite involved. The main reference is [3] (but the reader can also find alternative proofs in Chapter 5 of [84], in the case where \(\Omega \) is compact). Yet, if the reader wants an idea of the reason for this theorem to be true, it is possible to start from the case of two time steps: there are two measures \(\mu _t\) and \(\mu _{t+h}\) and there are several ways for moving the particles so as to reconstruct the latter from the former. It is exactly as when we look for a transport. One of these transports is optimal in the sense that it minimizes \(\int |T(x)-x|^p\mathrm{d}\mu _t(x)\) and the value of this integral equals \(W_p^p(\mu _t,\mu _{t+h})\). If we call \(\mathbf {v}_t(x)\) the “discrete velocity of the particle located at x at time t, i.e. \(\mathbf {v}_t(x)=(T(x)-x)/h\), one has \(||\mathbf {v}_t||_{L^p(\mu _t)}=\frac{1}{h} W_p(\mu _t,\mu _{t+h})\). We can easily guess that, at least formally, the result of the previous theorem can be obtained as a limit as \(h\rightarrow 0\).
Once we know about curves in their generality, it is interesting to think about geodesics. The following result is a characterization of geodesics in \(W_p(\Omega )\) when \(\Omega \) is a convex domain in \(\mathbb R^d\). This procedure is also known as McCann’s displacement interpolation.
Theorem 4.7
If \(\Omega \subset \mathbb R^d\) is convex, then all the spaces \(\mathbb W_p(\Omega )\) are length spaces and if \(\mu \) and \(\nu \) belong to \(\mathbb W_p(\Omega )\), and \(\gamma \) is an optimal transport plan from \(\mu \) to \(\nu \) for the cost \(c_p(x,y)=|x-y|^p\), then the curve
where \(\pi _t:\Omega \times \Omega \rightarrow \Omega \) is given by \(\pi _t(x,y)=(1-t)x+ty\), is a constant-speed geodesic from \(\mu \) to \(\nu \). In the case \(p>1\) all the constant-speed geodesics are of this form, and, if \(\mu \) is absolutely continuous, then there is only one geodesic and it has the form
where T is the optimal transport map from \(\mu \) to \(\nu \). In this case, the velocity field \(\mathbf {v}_t\) of the geodesic \(\mu _t\) is given by \(\mathbf {v}_t=(T-id)\circ (T_t)^{-1}\). In particular, for \(t=0\) we have \(\mathbf {v}_0=-\nabla \varphi \) and for \(t=1\) we have \(\mathbf {v}_1=\nabla \psi \), where \(\varphi \) is the Kantorovich potential in the transport from \(\mu \) to \(\nu \) and \(\psi =\varphi ^c\).
The above theorem may be adapted to the case where the Euclidean domain \(\Omega \) is replaced by a Riemannian manifold, and in this case the map \(T_t\) must be defined by using geodesics instead of segments: the point \(T_t(x)\) will be the value at time t of a constant-speed geodesic, parametrized on the interval [0, 1], connecting x to T(x). For the theory of optimal transport on manifolds, we refer to [73].
Using the characterization of constant-speed geodesics as minimizers of a strictly convex kinetic energy, we can also infer the following interesting information.
-
Looking for an optimal transport for the cost \(c(x,y)=|x-y|^p\) is equivalent to looking for a constant-speed geodesic in \(\mathbb W_p\) because from optimal plans we can reconstruct geodesics and from geodesics (via their velocity field) it is possible to reconstruct the optimal transport;
-
constant-speed geodesics may be found by minimizing \(\int _0^1|\mu '|(t)^p\mathrm{d}t\) ;
-
in the case of the Wasserstein spaces, we have \(|\mu '|(t)^p=\int _\Omega |\mathbf {v}_t|^p\,\mathrm{d}\mu _t\), where \(\mathbf {v}\) is a velocity field solving the continuity equation together with \(\mu \) (this field is not unique, but the metric derivative \(|\mu '|(t)\) equals the minimal value of the \(L^p\) norm of all possible fields).
As a consequence of these last considerations, for \(p>1\), solving the kinetic energy minimization problem
selects constant-speed geodesics connecting \(\mu \) to \(\nu \), and hence allows to find the optimal transport between these two measures. This is what is usually called Benamou–Brenier formula (see [11] for the quadratic case, and [21] for a general introduction).
On the other hand, this minimization problem in the variables \((\varrho _t,\mathbf {v}_t)\) has non-linear constraints (due to the product \(\mathbf {v}_t\varrho _t\)) and the functional is non-convex (since \((t,x)\mapsto t|x|^p\) is not convex). Yet, it is possible to transform it into a convex problem. For this, it is sufficient to switch variables, from \((\varrho _t,\mathbf {v}_t)\) into \((\varrho _t,E_t)\) where \(E_t=\mathbf {v}_t\varrho _t\), thus obtaining the following minimization problem
We need to use the properties of the function \(f_p:\mathbb R\times \mathbb R^d\rightarrow \mathbb R\cup \{+\infty \}\), defined through
where \(K_q{:=}\{(a,b)\in \mathbb R\times \mathbb R^d\,:\,a+\frac{1}{q} |b|^q\le 0\}\) and \(q=p/(p-1)\) is the conjugate exponent of p. In particular, \(f_p\) is convex, which makes the above minimization problem convex, and also allows to write what we formally wrote as \(\int _0^1\int _\Omega \frac{|E_t|^p}{\varrho _t^{p-1}}\,\mathrm{d}x\mathrm{d}t\) (an expression which implicitly assumes \(\varrho _t,E_t\ll \mathcal {L}^d\)) into the form
Both the convexity and this last expression will be useful for numerical methods (as it was first done in [11]).
4.3 Minimizing movement schemes in the Wasserstein space and evolution PDEs
Thanks to all the theory which has been described so far, it is natural to study gradient flows in the space \(\mathbb W_2(\Omega )\) (the reason for choosing the exponent \(p=2\) will be clear in a while) and to connect them to PDEs of the form of a continuity equation. The most convenient way to study this is to start from the time-discretized problem, i.e. to consider a sequence of iterated minimization problems:
This iterated minimization scheme is called JKO scheme (after Jordan et al. [54]).
Note that we denote now the measures on \(\Omega \) by the letter \(\varrho \) instead of \(\mu \) or \(\nu \) because we expect them to be absolutely continuous measures with nice (smooth) densities, and we want to study the PDE they solve. The reason to focus on the case \(p=2\) can also be easily understood. Indeed, from the very beginning, i.e. from Sect. 2, we saw that the equation \(x'=-\nabla F(x)\) corresponds to a sequence of minimization problems involving the squaredFootnote 9 distance \(|x-x^\tau _k|^2\), and in the Wasserstein space \(\mathbb W_p\) the distance is defined as the power 1 / p of a transport cost; only in the case \(p=2\) the exponent goes away and we are lead to a minimization problem involving \(F(\varrho )\) and a transport cost of the form
for \(\nu =\varrho ^\tau _k\).
In the particular case of the space \(\mathbb W_2(\Omega )\), which has some additional structure, if compared to arbitrary metric spaces, we would like to give a PDE description of the curves that we obtain as gradient flows, and this will pass through the optimality conditions of the minimization problem (4.10). In order to study these optimality conditions, we introduce the notion of first variation of a functional. This will be done in a very sketchy and formal way (we refer to Chapter 7 in [84] for more details).
Given a functional \(G:\mathcal {P}(\Omega )\rightarrow \mathbb R\) we call \(\frac{\delta G}{\delta \varrho }(\varrho )\), if it exists, the unique (up to additive constants) function such that \(\frac{d}{d\varepsilon }G(\varrho +\varepsilon \chi )_{|\varepsilon =0}=\int \frac{\delta G}{\delta \varrho }(\varrho ) d\chi \) for every perturbation \(\chi \) such that, at least for \(\varepsilon \in [0,\varepsilon _0],\) the measure \(\varrho +\varepsilon \chi \) belongs to \(\mathcal {P}(\Omega )\). The function \(\frac{\delta G}{\delta \varrho }(\varrho )\) is called first variation of the functional G at \(\varrho \). In order to understand this notion, the easiest possibility is to analyze some examples.
The three main classes of examples are the following functionalsFootnote 10
where \(f:\mathbb R\rightarrow \mathbb R\) is a convex superlinear function (and the functional F is set to \(+\infty \) if \(\varrho \) is not absolutely continuous w.r.t. the Lebesgue measure for the semicontinuity of these functionals see, for instance, [24]) and \(V:\Omega \rightarrow \mathbb R\) and \(W:\mathbb R^d\rightarrow \mathbb R\) are regular enough (and W is taken symmetric, i.e. \(W(z)=W(-z)\), for simplicity). In this case it is quite easy to realize that we have
It is clear that the first variation of a functional is a crucial tool to write optimality conditions for variational problems involving such a functional. In order to study the problem (4.10), we need to complete the picture by understanding the first variation of functionals of the form \(\varrho \mapsto \mathcal {T}_c(\varrho ,\nu )\). The result is the following:
Proposition 4.8
Let \(c:\Omega \times \Omega \rightarrow \mathbb R\) be a continuous cost function. Then the functional \(\varrho \mapsto \mathcal {T}_c(\varrho ,\nu )\) is convex, and its subdifferential at \(\varrho _0\) coincides with the set of Kantorovich potentials \(\{\varphi \in C^0(\Omega )\,:\,\int \varphi \,\,\mathrm{d}\varrho _0+\int \varphi ^c\,\,\mathrm{d}\nu =\mathcal {T}_c(\varrho ,\nu )\}\). Moreover, if there is a unique c-concave Kantorovich potential \(\varphi \) from \(\varrho _0\) to \(\nu \) up to additive constants, then we also have \(\frac{\delta \mathcal {T}_c(\cdot ,\nu )}{\delta \varrho }(\varrho _0)=\varphi \).
Even if a complete proof of the above proposition is not totally trivial (and Chapter 7 in [84] only provides it in the case where \(\Omega \) is compact), one can guess why this is true from the following considerations. Start from Proposition 4.1, which provides
This expresses \(\mathcal {T}_c\) as a supremum of linear functionals in \(\varrho \) and shows convexity. Standard considerations from convex analysis allow to identify the subdifferential as the set of functions \(\varphi \) attaining the maximum. An alternative point of view is to consider the functional \(\varrho \mapsto \int \phi \,\mathrm{d}\varrho +\int \phi ^c\,\,\mathrm{d}\nu \) for fixed \(\phi \), in which case the first variation is of course \(\phi \); then it is easy to see that the first variation of the supremum may be obtained (in case of uniqueness) just by selecting the optimal \(\phi \).
Once we know how to compute first variations, we come back to the optimality conditions for the minimization problem (4.10). Which are these optimality conditions? roughly speaking, we should have
(where the reasons for having a constant instead of 0 is the fact that, in the space of probability measures, only zero-mean densities are considered as admissible perturbations, and the first variations are always defined up to additive constants). Note that here \(\varphi \) is the Kantorovich potential associated with the transport from \(\varrho _{k+1}^\tau \) to \(\varrho _k^\tau \) (and not viceversa).
Let us look at the consequences we can get from this optimality condition. Actually, if we combine (4.11) with the fact that the optimal map T satisfies \(T(x)=x-\nabla \varphi (x)\), we get
We denoted by \(-\mathbf {v}\) the ratio \(\frac{T(x)-x}{\tau }\). Why? because, as a ratio between a displacement and a time step, it has the meaning of a velocity, but since it is the displacement associated to the transport from \(\varrho ^\tau _{k+1}\) to \(\varrho ^\tau _{k}\), it is better to view it rather as a backward velocity (which justifies the minus sign).
Since here we have \(\mathbf {v}=-\nabla \big (\frac{\delta F}{\delta \varrho }(\varrho ))\), this suggests that at the limit \(\tau \rightarrow 0\) we will find a solution of
This is a PDE where the velocity field in the continuity equation depends on the density \(\varrho \) itself.
We can see many interesting examples.
First, suppose \(F(\varrho )=\int f(\varrho (x))\mathrm{d}x,\) with \(f(t)=t\log t\). In such a case we have \(f'(t)=\log t +1\) and \(\nabla (f'(\varrho ))=\frac{\nabla \varrho }{\varrho }\): this means that the gradient flow equation associated to the functional F would be the heat equation
Using \(F(\varrho )=\int f(\varrho (x))\,\mathrm{d}x+\int V(x)\,\mathrm{d}\varrho (x)\), we would have the Fokker–Planck Equation
Changing the function f we can obtain other forms of diffusion. For instance, if one uses
the equation we obtain is
When \(m>1\) this equation is called Porous Media Equation (see [89] for a complete monography) and models the diffusion of a fluid into a material whose porosity slows down the diffusion. Among the properties of this equation there is a finite-speed propagation effect, different from linear diffusion (if \(\varrho _0\) is compactly supported the same stays true for \(\varrho _t\), \(t>0\), while this is not the case for the heat equation). Note that linear diffusion can be obtained as a limit \(m\rightarrow 1\) in the following way: in the functional F, we can replace the term \(\frac{1}{m-1}\int \varrho ^m(x)\,\mathrm{d}x\) with \(\frac{1}{m-1}\int (\varrho ^m(x)-\varrho (x))\,\mathrm{d}x\), since they only differ by a constant term (the total mass of \(\varrho \) is fixed). Then, we just observe that we have
which provides the entropy in the limit.
It is also interesting to consider the case \(m<1\): the function \(\varrho ^m-\varrho \) is concave, but the negative coefficient \(1/(m-1)\) finally gives a convex function (which is, unfortunately, not superlinear at infinity, hence more difficult to handle). The PDE that we get as a gradient flow is called Fast diffusion equation, and has opposite properties in terms of diffusion rate than the porous medium one.
In the above equations, it is also possible to get rid of the diffusion part and just study the advection equation
where V is a given potential. This equation is the gradient flow of
It is not difficult to see that the solution of this equation, where particles do not interact with each other, is obtained in the following way: for every initial point x solve the Euclidean gradient flow \(y_x'(t)=-\nabla V(y_x(t))\) with \(y_x(0)=x\), and call \(Y_t\) the map which associates with every initial point x the point \(y_x(t)\). Then we have \(\varrho _t=(Y_t)_\#\varrho _0\).
A more interesting case can be observed when the particles are advected by a potential determined by the superposition of many potentials, each created by one particle. For instance, given a function \(W:\mathbb R^d\rightarrow \mathbb R\), the particle located at x produces a potential \(W(\cdot -x)\) and, globally, the potential is given by \(V(y)=\int W(y-x)\,\mathrm{d}\varrho (x)\), i.e. \(V=W*\varrho \). The equation, if every particle follows \(-\nabla V\) is
where we used \(\nabla (W*\varrho )=(\nabla W)*\varrho \). If W is even (i.e. the interaction between x and y is the same as between y and x), then this is the gradient flow of the functional
When W(z) is an increasing function of |z|, for instance in the quadratic case which gives \(F(\varrho )=\int \!\!\int |x-y|^2\mathrm{d}\varrho (x)\mathrm{d}\varrho (y)\), the equation gives rise to a general aggregation behavior of the particles. When \(t\rightarrow \infty \) one expects \(\varrho _t\rightharpoonup \delta _{x_0}\) (the point \(x_0\) depending on the initial datum \(\varrho _0\): in the quadratic example above it is the barycenter of \(\varrho _0\)). If W is not smooth enough, the aggregation into a unique point can even occur in finite time, see [26].
Most often, the above aggregation energy is studied together with an internal energy and a confining potential energy, using the functional
This gives the following diffusion–advection–interaction equation
(see in particular [27, 28] for physical modelling and convergence results).
We will see in Sect. 4.6 some other variants of these equations. The reader can also look at Section 8.4.2 in [84]. For the moment, we just wanted to show how the iterated variational scheme called JKO is related to the different PDEs. In the next section we will some tools to prove the convergence of this scheme (without full details).
4.4 How to prove convergence of the JKO scheme
Many possible proofs can be built for the convergence of the JKO scheme. In particular, one could follow the general theory developed in [3], i.e. checking all the assumptions to prove existence and uniqueness of an EVI gradient flow for the functional F in the space \(\mathbb W_2(\Omega )\), and then characterizing the velocity field that Theorem 4.6 associates with the curve obtained as a gradient flow. In [3], it is proven, under suitable conditions, that such a vector field \(\mathbf {v}_t\) must belong to what is defined as the Wasserstein sub-differential of the functional F, provided in particular that F is \(\lambda \)-convex. Then, the Wasserstein sub-differential is proven to be of the desired form (i.e. composed only of the gradient of the first variation of F, when F admits a first variation).
This approach has the advantage to use a general theory and to adapt it to the scopes of this particular setting. On the other hand, some considerations seem necessary:
-
the important point when studying these PDEs is that the curves \((\varrho _t)_t\) obtained as a limit are true weak solutions of the continuity equations; from this point of view, the notions of EDE and EVI solutions and the formalism developed in the first part of the book [3] (devoted to the general metric case) are not necessary; if the second part of [3] is exactly concerned with Wasserstein spaces and with the characterization of the limit as \(\tau \rightarrow 0\) as the solution of a PDE, we can say that the whole formalism is sometimes too heavy.
-
after using optimal transport theory to select a suitable distance in the discrete scheme above and a suitable interpolation, the passage to the limit can be done by classical compactness techniques in the space of measures and in functional analysis; on the other hand, there are often some difficulties in handling some non-linear terms, which are not always seen when using the theory of [3] (which is an advantage of this general theory).
-
the \(\lambda \)-convexity assumption is in general not crucial in what concerns existence (but the general theory in [3] has been built under this assumption, as we saw in Sect. 3).
-
as far as uniqueness of the solution is concerned, the natural point of view in the applications would be to prove uniqueness of the weak solution of the equation (or, possibly, to define a more restrictive notion of solution of the PDE for which one can prove uniqueness), and this is a priori very different from the EDE or EVI notions. To this aim, the use of the Wasserstein distance can be very useful, as one can often prove uniqueness by differentiating in time the distance between two solutions, and maybe apply a Gronwall lemma (and this can be done independently of the EVI notion; see for instance the end of Sect. 4.5). On the other hand, uniqueness results are almost never possible without some kind of \(\lambda \)-convexity (or weaker versions of it, as in [34]) of the functional.
Yet, we prefer here to present the ideas to build a more concrete proof, following for instance what is done in Chapter 8 of [84]. For simplicity, we will describe the construction in the model case where F is given by (4.14).
As we said, the final goal is to produce a curve \((\varrho _t)_t\) which is a solution (in the distributional sense on \(\mathbb R^d\), which is the same as imposing suitable no-flux boundary conditions on \(\partial \Omega \)) of the PDE
We will set \(E=\varrho \mathbf {v}\) (the variable E is called momentum, in physics) and require at least that E is a finite vector measure over \(\Omega \times [0,T]\), acting on test functions \(\phi \) via \(\langle E,\phi \rangle {:=}\int dt \int \phi (t,x)\cdot \mathbf {v}_t\,\mathrm{d}\varrho _t\). This condition is equivalent to \(\int _0^T ||\mathbf {v}_t||_{L^1(\varrho _t)}\mathrm{d}t<+\infty \).
We start from the discrete scheme (4.10), which defines a sequence \((\varrho ^{\tau }_k)_k\). We also define a sequence of velocities \(\mathbf {v}^{\tau }_k=(id-\mathrm {T})/\tau \), taking as \(\mathrm {T}\) the optimal transport from \(\varrho ^\tau _k\) to \(\varrho ^{\tau }_{k-1}\). The considerations in the previous sections guarantee that we have
This is proven thanks to the optimality conditions in (4.10) and the only delicate point is to guarantee the uniqueness of the Kantorovich potential in the transport from \(\varrho ^\tau _k\) to \(\varrho ^{\tau }_{k-1}\). In some cases it is possible to prove a priori that the minimizers of \(F(\varrho )+W_2^2(\varrho ,\nu )\) have strictly positive density a.e. whatever is \(\nu \) (this is typically true when \(f'(0)=-\infty \), as it is the case for the entropy functional, and more generally in case of linear or fast diffusion), which provides uniqueness up to additive constants of the potential. In Section. 8.3 of [84] full details are provided for the Fokker–Planck case, and it is indeed proved that the minimizers of the JKO scheme are strictly positive a.e., in this case. When positivity of the minimizer is not available, then one can obtain the same optimality conditions by first approximating \(\nu \) with strictly positive densities, and then passing to the limit.Footnote 11
Then, we build at least two interesting curves in the space of measures:
-
first we can define some piecewise constant curves, i.e. \(\overline{\varrho }^\tau _t{:=}\varrho ^{\tau }_{k+1}\) for \(t\in ]k\tau ,(k+1)\tau ]\); associated with this curve we also define the velocities \(\overline{\mathbf {v}}^{\tau }_t=\mathbf {v}^{\tau }_{k+1}\) for \(t\in ]k\tau ,(k+1)\tau ]\) and the momentum variable \(\overline{E}^\tau =\overline{\varrho }^\tau \overline{\mathbf {v}}^{\tau }\);
-
then, we can also consider the densities \(\widehat{\varrho }^\tau _t\) that interpolate the discrete values \((\varrho ^\tau _k)_k\) along geodesics:
$$\begin{aligned} \widehat{\varrho }^\tau _t = \bigl ( id - (k\tau -t) \mathbf {v}^{\tau }_k \bigr )_{\sharp } \varrho ^\tau _k,\; \text{ for } t\in ](k-1)\tau ,k\tau [; \end{aligned}$$(4.15)the velocities \(\widehat{\mathbf {v}}^{\tau }_t\) are defined so that \((\widehat{\varrho }^\tau , \widehat{\mathbf {v}}^{\tau })\) satisfy the continuity equation, taking
$$\begin{aligned} \widehat{\mathbf {v}}^{\tau }_t=\mathbf {v}^{\tau }_t\circ \bigl ( id - (k\tau -t) \mathbf {v}^{\tau }_k \bigr )^{-1}; \end{aligned}$$moreover, as before, we define: \(\widehat{E}^{\tau }= \widehat{\varrho }^\tau \widehat{\mathbf {v}}^{\tau }\).
After these definitions we look for a priori bounds on the curves and the velocities that we defined. We already know that we have
which is the discrete version of an \(H^1\) estimate in time. As for \(\widehat{\varrho }^\tau _t\), it is an absolutely continuous curve in the Wasserstein space and its velocity on the time interval \([(k-1)\tau ,k\tau ]\) is given by the ratio \(W_2(\varrho ^\tau _{k-1},\varrho ^\tau _k)/\tau \). Hence, the \(L^2\) norm of its velocity on [0, T] is given by
and, thanks to (4.16), it admits a uniform bound independent of \(\tau \). In our case, thanks to results on the continuity equation and the Wasserstein metric, this metric derivative is also equal to \(||\widehat{\mathbf {v}}^{\tau }_t||_{L^2(\widehat{\varrho }^\tau _t)}\). This gives compactness of the curves \(\widehat{\varrho }^\tau \), as well as Hölder estimates (since \(H^1\subset C^{0,1/2}\)). The characterization of the velocities \(\overline{\mathbf {v}}^{\tau }\) and \(\widehat{\mathbf {v}}^{\tau }\) allows to deduce bounds on these vector fields from the bounds on \(W_2(\varrho ^\tau _{k-1},\varrho ^\tau _k)/\tau \).
Considering all these facts, one obtains the following situation (see also [37, 82]):
-
The norm \(\int ||\overline{\mathbf {v}}_t^\tau ||^2_{L^2(\overline{\varrho }^\tau _t)}\mathrm{d}t\) is \(\tau \)-uniformly bounded. This quantity is equal to \(\mathcal {B}_2(\overline{\varrho }^\tau ,\overline{E}^\tau )\).
-
In particular, the bound is valid in \(L^1\) as well, which implies that \(\overline{E}^\tau \) is bounded in the space of measures over \([0,T]\times \Omega \).
-
The very same estimates are true for \(\widehat{\mathbf {v}}^{\tau }\) and \(\widehat{E}^\tau \).
-
The curves \(\widehat{\varrho }^\tau \) are bounded in \(H^1([0,T],\mathbb W_2(\Omega ))\) and hence compact in \(C^0([0,T],\mathbb W_2(\Omega ))\).
-
Up to a subsequence, one has \(\widehat{\varrho }^\tau \rightarrow \varrho \), as \(\tau \rightarrow 0\), uniformly according to the \(W_2\) distance.
-
From the estimate \(W_2(\overline{\varrho }^\tau _t,\widehat{\varrho }^\tau _t)\le C\tau ^{1/2}\) one gets that \(\varrho ^\tau \) converges to the same limit \(\varrho \) in the same sense.
-
If we denote by E a weak limit of \(\widehat{E}^\tau \), since \((\widehat{\varrho }^\tau ,\widehat{E}^\tau )\) solves the continuity equation, by linearity, passing to the weak limit, also \((\varrho ,E)\) solves the same equation.
-
It is possible to prove (see Section 8.3 in [84]) that the weak limits of \(\widehat{E}^\tau \) and \(\overline{E}^\tau \) are the same.
-
From the semicontinuity of \(\mathcal {B}_2\) and the bound \(\mathcal {B}_2(\overline{\varrho }^\tau ,\overline{E}^\tau )\le C\), one gets \(\mathcal {B}_2(\varrho ,E)<+\infty \), which means that E is absolutely continuous w.r.t. \(\varrho \) and has an \(L^2\) density, so that we have for a.e. time t a measure \(E_t\) of the form \(\varrho _t\mathbf {v}_t\).
-
It is only left to prove that one has \(\mathbf {v}_t=-\nabla \left( \frac{\delta F}{\delta \varrho }(\varrho _t)\right) \) \(\varrho _t\)-a.e. and for a.e. t. This means proving
$$\begin{aligned} \overline{E}^\tau =-\overline{\varrho }^\tau \nabla \left( \frac{\delta F}{\delta \varrho }(\overline{\varrho }^\tau )\right) \;\Rightarrow \; E=-\varrho \nabla \left( \frac{\delta F}{\delta \varrho }(\varrho )\right) \end{aligned}$$as a limit as \(\tau \rightarrow 0\). It is crucial in this step to consider the limit of \((\overline{\varrho }^\tau ,\overline{E}^\tau )\) instead of \((\widehat{\varrho }^\tau ,\widehat{E}^\tau )\).
This last step is the most critical one in many cases. It requires passing to the limit (in the sense of distributions) the terms involving \(\frac{\delta F}{\delta \varrho }\) on a sequence \(\varrho _j\rightharpoonup \varrho \) (we don’t care here where this sequence comes from). We can see what happens with the main class of functionals that we introduced so far.
The easiest case is that of the functional \(\mathcal {V}\): we have
This term is linear in \(\varrho \) and \(\varrho _j\rightharpoonup \varrho \) obviously implies \(\varrho _j\nabla V\rightharpoonup \varrho \nabla V\) as soon as \(V\in C^1\) (so that \(\nabla V\) is a continuous function, which is exactly the functional space whose dual is the space of measures). In case \(\nabla V\notin C^0\) it is possible to handle this term as soon as suitable bounds on \(\varrho _j\) provide a better weak convergence (for instance, if the functional F is composed of \(\mathcal {V}\) plus other terms which impose bounds on the \(L^p\) norm of \(\varrho \), then it is enough to suppose \(\nabla V\in L^{p'}\)).
Another easy case is that of the functional \(\mathcal {W}\): we have
If we take a sequence \(\varrho _j\rightharpoonup \varrho \) and a vector test function \(\xi \in C^\infty _c\), then we have
as soon as \(W\in C^1\), because \(\xi \) is fixed and \((\nabla W)*\varrho _j\) uniformly converges to \((\nabla W)*\varrho \) (this is a typical regularization effect of the convolution). As before, if W is less regular, it is possible to compensate this with stronger bounds on \(\varrho _j\).
The most difficult case is that of the functional \(\mathcal {F}\). In the case of the entropy \(\mathcal {F}(\varrho )=\int f(\varrho )\) with \(f(t)=t\log t\) then everything works fine because, again, the corresponding term is linear:
Then, the convergence of this term in the sense of distributions is straightforward when \(\varrho _j\rightarrow \varrho \). By the way, the entropy term \(\mathcal {F}\) is also enough to obtain suitable bounds to handle V or W which are only Lipschitz, as in this case we need to turn the weak convergence \(\varrho _j\rightharpoonup \varrho \) from the sense of measures to the \(L^1\) sense, which is exactly possible because the superlinear bound \(\int \varrho _j\log (\varrho _j)\le C\) provides equi-integrability for \(\varrho _j\).
Yet, for other functions f, there is no more linearity, and we need stronger bounds. For instance, for \(f(t)=t^m/(m-1)\), we have
This means that weak convergence of \(\varrho _j\) is no more enough, but one also needs weak convergence of \(\varrho _j^m\). This could be possible to obtain if one had strong convergence, i.e. stronger bounds (if possible, Sobolev bounds on \(\varrho _j\)).
The main ingredient is indeed an \(H^1\) bound in space, which comes from the fact that we have
Two delicate points arise: first, this is not a full \(H^1\) bound, in the sense that it is only concerned with space derivatives, and not with time derivatives and, second, we presented here an estimate on \(||\varrho ^{m-1/2}||_{H^1}\), but this could not be enough to prove strong convergence for \(\rho ^m\). The first point may be handled via some variants of the so-called Aubin–Lions lemma (see [10]), which interpolates between compactness in space and in time (roughly speaking: in our situation we have an \(L^2\) bound in time with values in a strong space norm, \(H^1\), but also a stronger bound in time, \(H^1\), valued in the \(W_2\) distance, which is weaker as it metrizes weak convergence; this information together can provide strong compactness). For the second point, finer tools on the JKO scheme, together with stronger assumptions on the initial data, can be used. For instance, it is possible to prove \(L^\infty \) bounds on \(\rho \) if \(\rho _0\in L^\infty \) and \(V\in C^2\) (see Section 7.4.1 in [84]), which is enough to transform the \(H^1\) estimates on \(\varrho ^{m-1/2}\) into estimates on \(\varrho ^{m}\). Another possibility is the use of estimates obtained via a technique called flow interchange (where the optimality of a density \(\varrho \) in the JKO scheme is tested against its evolution via the gradient flow of another functional, see [66]), which allows to modify the exponent of \(\varrho \) in (4.18).
4.5 Geodesic convexity in \(\mathbb W_2\)
Even if we insisted on the fact that the most natural approach to gradient flows in \(\mathbb W_2\) relies on the notion of weak solutions to some PDEs and not on the EDE/EVI formulations, surely it could be interesting to check whether the general theory of Sect. 3 could be applied to some model functionals on the space of probabilities, such as \(\mathcal {F}\), \(\mathcal {V}\) or \(\mathcal {W}\). This requires to discuss their \(\lambda \)-convexity, which is also useful because it can provide uniqueness results. As we now know the geodesics in the Wasserstein space, the question of which functionals are geodesically convex is easy to tackle. The notion of geodesic convexity in the \(\mathbb {W}_2\) space, also called displacement convexity, has first been introduced by McCann in [72].
Displacement convexity of \(\mathcal {F}\), \(\mathcal {V}\) and \(\mathcal {W}\) It is not difficult to check that the convexity of V is enough to guarantee geodesic convexity of \(\mathcal {V}\), since
as well as the convexity of W guarantees that of \(\mathcal {W}\):
Similarly, if V or W are \(\lambda \)-convex we get \(\lambda \)-geodesical convexity. Note that in the case of \(\mathcal {V}\) it is easy to see that the \(\lambda \)-convexity of V is also necessary for the \(\lambda \)-geodesical convexity of \(\mathcal {V}\), while the same is not true for W and \(\mathcal {W}\).
The most interesting displacement convexity result is the one for functionals depending on the density. To consider these functionals, we need some technical facts.
The starting point is the computation of the density of an image measure, via standard change-of-variable techniques: if \(T:\Omega \rightarrow \Omega \) is a map smooth enoughFootnote 12 and injective, and \(\det (DT(x))\ne 0\) for a.e. \(x\in \Omega \), if we set \(\varrho _T{:=}T_\#\varrho \), then \(\varrho _T\) is absolutely continuous with density given by
Then, we underline an interesting computation, which can be proven as an exercise.
Lemma 4.9
Let A, B be two \(d\times d\) symmetric and positive-definite matrices. Then \([0,1]\ni t\mapsto g(t){:=} \det ((1-t)A+tB)^{1/d}\) is a concave function
We can now state the main theorem.
Theorem 4.10
Suppose that \(f(0)=0\) and that \(s\mapsto s^{d}f(s^{-d})\) is convex and decreasing. Suppose that \(\Omega \) is convex. Then \(\mathcal {F}\) is geodesically convex in \(\mathbb W_2\).
Proof
Take two measures \(\mu _0\), \(\mu _1\) with \(\mathcal {F}(\mu _0),\mathcal {F}(\mu _1)<+\infty \) (which implies that they are absolutely continuous). There is a unique constant-speed geodesic \(\mu _t\) between them, given by \(\mu _t=(T_t)_\#\mu _0\), where \(T_t=id+t(T-id)\). Note that we have \(T_t(x)=x-t\nabla \varphi (x),\) where \(\varphi \) is such that \(\frac{x^2}{2}-\varphi \) is convex. This implies that \(D^2\varphi \le I\) and \(T_t\) is, for \(t<1\), the gradient of a strictly convex function, hence it is injective. Moreover \(\nabla \varphi \) is countably Lipschitz, and so is \(T_t\). From the formula for the density of the image measure, we know that \(\mu _t\) is absolutely continuous and we can write its density \(\varrho _t\) as \(\varrho _t(y)=\varrho (T_t^{-1}(y))/\det ( \mathrm {I}+tM(T_t^{-1}(y)))\), where \(M=-D^2\varphi \) and \(\varrho \) is the density of \(\mu \), and hence
where we used the change of variables \(y=T_t(x)\) and \(\mathrm{d}y=\det DT_t(x)\,\mathrm{d}x=\det (\mathrm {I}+tM(x))\,\mathrm{d}x\).
Lemma 4.9 applied to \(A=\mathrm I\) and \(B=\mathrm I+M\) implies \(\det (\mathrm {I}+tM(x))=g(t,x)^d\) for a function \(g:[0,1]\times \Omega \), concave in t. The composition of a convex and decreasing function with a concave one gives a convex function, which implies that
is convex. Finally, we proved convexity of \(t\mapsto \mathcal {F}(\mu _t)\).\(\square \)
Remark 4.1
Note that the assumption that \(s\mapsto s^{d}f(s^{-d})\) is convex and decreasing implies that f itself is convex (the reader can check it as an exercise), a property which can be useful to establish, for instance, lower semicontinuity of \(\mathcal {F}\).
Here are some examples of convex functions satisfying the assumptions of Theorem 4.10:
-
the entropy function \(f(t)=t\log t\) satisfies these assumptions: \(s^df(s^{-d})=-d\log s\) is convex and decreasing;
-
all the power functions \(f(t)=t^q\) for \(q>1\) also satisfy the same assumptions, since \(s^df(s^{-d})=s^{-d(q-1)}\) is convex and decreasing;
-
the function \(f(t)=-t^m\) is convex for \(m<1\), and if we compute \(s^df(s^{-d})=-t^{m(1-d)}\) we get a convex and decreasing function as soon as \(1-\frac{1}{d}\le m<1\). Note that in this case f is not superlinear, which requires some attention for the semicontinuity of \(\mathcal {F}\).
Finally, it is interesting to observe that the geodesic convexity of higher-order functionals such as \(\varrho \mapsto \int |\nabla \varrho |^p\) generally fails, or is a very delicate matter, while these functionals are among the most standard examples of convex functionals in the usual sense. See [29] for some examples of first-order geodesically convex functionals (in dimension one).
Convexity on generalized geodesics When studying displacement convexity a bad surprise arrives when considering the functional \(\mu \mapsto W_2^2(\mu ,\nu )\). This functional is not, in general, displacement convex, contrary to the intuition stipulating that squared distances should be nice convex functions.Footnote 13 However, we can see that this fails from the following easy example. Choose \(a,b,c\in \mathbb R^2\) to be the vertices of an equilateral triangle: \(a=(0,0)\), \(b=(1,0)\) and \(c=(1/2,\sqrt{3}/2)\). Set \(\nu =\frac{1}{2}\delta _{a}+\frac{1}{2}\delta _{b}\) and \(\mu _0=\frac{1}{2}\delta _{a}+\frac{1}{2}\delta _{c}\) and \(\mu _0=\frac{1}{2}\delta _{b}+\frac{1}{2}\delta _{c}\). It is not difficult to see that the geodesic between \(\mu _0\) and \(\mu _1\) is given by \(\mu _t=\frac{1}{2}\delta _{(t,0)}+\frac{1}{2}\delta _{c}\) and that
(this is computed by observing that in the transport between \(\mu _t\) and \(\nu \) we can send the atom at (t, 0) to the closest atom of \(\nu \) and the atom at c to the other one, and that we cannot do better, see Fig. 2). But this function is not convex!
The lack of geodesic convexity of this easy functionalFootnote 14 is a problem for many issues, and in particular for the C\(^2\)G\(^2\) condition, and an alternate notion has been proposed, namely that of convexity along generalized geodesics.
Definition 4.3
Given an absolutely continuous probability \(\varrho \in \mathcal {P}(\Omega )\), for every pair \(\mu _0,\mu _1\in \mathcal {P}(\Omega )\) we call generalized geodesic in \(\mathbb W_2(\Omega )\) between \(\mu _0\) and \(\mu _1\) with base \(\varrho \) the curve \(\mu _t=((1-t)T_0+tT_1)_\#\varrho \), where \(T_i\) is the optimal transport map (for the cost \(|x-y|^2\)) from \(\varrho \) to \(\mu _i\) (\(i=0,1\)).
It is clear that \(t\mapsto W_2^2(\mu _t,\varrho )\) satisfies
This provides the desired convexity along the curve \(\mu _t\). Moreover, it is possible, by using similar arguments as those already used in this section, to show that all the functionals \(\mathcal {V}, \mathcal {W}\) and \(\mathcal {F}\) are also convex along generalized geodesics under the same assumptions. For the case of functionals \(\mathcal {V}\) and \(\mathcal {W}\) this is easy, while for the case of the functional \(\mathcal {F}\) we use again Lemma 4.9 (without restricting to the case \(A=\mathrm I\)). We do not develop these proofs here, and we refer to [3] or Chapter 7 in [84] for more details.
Of course, we could wonder whether the assumption C\(^2\)G\(^2\) is satisfied in the Wasserstein space \(\mathbb W_2(\Omega )\) for these functionals \(\mathcal {F}, \mathcal {V}\) and \(\mathcal {W}\): actually, if one looks closer at this questions, it is possible to guess that the very definition of C\(^2\)G\(^2\) has been given on purpose in order to face the case of Wasserstein spaces. Indeed, if we fix \(\nu \in \mathcal {P}(\Omega )\) and take \(\mu _0,\mu _1\) two other probabilities, with \(T_0,T_1\) the optimal transports from \(\nu \) to \(\mu _0\) and \(\mu _1\), respectively, then the curve
connects \(\mu _0\) to \(\mu _1\) and can be used in C\(^2\)G\(^2\).
Displacement convexity and curvature conditions In the proof of the geodesic convexity of the functional \(\mathcal {F}\) we strongly used the Euclidean structure of the geodesics in the Wasserstein space. The key point was the form of the intermediate map \(T_t\): a linear interpolation between the identity map id and the optimal map T, together with the convexity properties of \(t\mapsto \det (I+tA)^{1/d}\). On a Riemannian manifold, this would completely change as the geodesic curves between x and T(x), which are no more segments, could concentrate more or less than what happens in the Euclidean case, depending on the curvature of the manifold (think at the geodesics on a sphere connecting points close to the North Pole to points close to the South Pole: they are much farther from each other at their middle points than at their endpoints). It has been found (see [79]) that the condition of \(\lambda \)-geodesic convexity of the Entropy functional \(\varrho \mapsto \int \varrho \log (\varrho )\mathrm{d}\mathrm {Vol}\) (where \(\varrho \) is absolutely continuous w.r.t. the volume measure \(\mathrm {Vol}\) and densities are computed accordingly) on a manifold characterizes a lower bound on its Ricci curvature:
Proposition 4.11
Let M be a compact manifold of dimension d and \(\mathrm {Vol}\) its volume measure. Let \(\mathcal {E}\) be the entropy functional defined via \(\mathcal {E}(\varrho )=\int \varrho \log \varrho \,\mathrm{d}\mathrm {Vol}\) for all measures \(\varrho \ll \mathrm {Vol}\) (set to \(+\infty \) on non-absolutely continuous measures). Then \(\mathcal {E}\) is \(\lambda \)-geodesically convex in the Wasserstein space \(\mathbb W_2(M)\) if and only if the Ricci curvature \(\mathrm {Ric}_M\) satisfies \(\mathrm {Ric}_M\ge \lambda \). In the case \(\lambda =0\), the same equivalence is true if one replaces the entropy function \(f(t)=t\log t\) with the function \(f_N(t)=-t^{1-1/N}\) with \(N\ge d\).
This fact will be the basis (we will see it in Sect. 5) of a definition based on optimal transport of the notion of Ricci curvature bounds in more abstract spaces, a definition independently proposed in two celebrated papers by Sturm [87, 88] and Lott and Villani [65].
Remark 4.2
The reader who followed the first proofs of this section has surely observed that it is easy, in the space \(\mathbb W_2(\Omega )\), to produce \(\lambda \)-geodesically convex functionals which are not geodesically convex (with \(\lambda <0\), of course), of the form \(\mathcal {V}\) (just take a \(\lambda \)-convex function V which is not convex), but that Theorem 4.10 only provides geodesic convexity (never provides \(\lambda \)-convexity without convexity) for functionals of the form \(\mathcal {F}\): this is indeed specific to the Euclidean case, where the optimal transport has the form \(T(x)=x-\nabla \varphi (x)\); in Riemannian manifolds or other metric measure spaces, this can be different!
Geodesic convexity and uniqueness of gradient flows The fact that \(\lambda \)-convexity is a crucial tool to establish uniqueness and stability results in gradient flows is not only true in the abstract theory of Sect. 3, where we saw that the EVI condition (intimately linked to \(\lambda \)-convexity) provides stability. Indeed, it can also be observed in the concrete case of gradient flows in \(\mathbb W_2(\Omega )\), which take the form of the PDE (4.13). We will see this fact via some formal considerations, starting from an interesting lemma:
Lemma 4.12
Suppose that \(F:\mathbb W_2(\Omega )\rightarrow \mathbb R\cup \{+\infty \}\) is \(\lambda \)-geodesically convex. Then, for every \(\varrho _0,\varrho _1\in \mathcal {P}(\Omega )\) for which the integrals below are well-defined, we have
where \(\varphi \) is the Kantorovich potential in the transport from \(\varrho _0\) to \(\varrho _1\) and \(\psi =\varphi ^c\) is the Kantorovich potential from \(\varrho _1\) to \(\varrho _0\).
Proof
Let \(\varrho _t\) be the geodesic curve in \(\mathbb W_2(\Omega )\) connecting \(\varrho _0\) to \(\varrho _1\) and \(g(t){:=}F(\varrho _t)\). The assumption of \(\lambda \)-convexity means (see the definition in (3.2))
Since the above inequality is an equality for \(t=0,1\), we can differentiate it at these two points thus obtaining
which implies
Then, we compute the derivative of g, formally obtaining
and, using \(\mathbf {v}_0=-\nabla \varphi \) and \(\mathbf {v}_1=\nabla \psi \), we obtain the claim. \(\square \)
With the above lemma in mind, we consider two curves \(\varrho ^0_t\) and \(\varrho ^1_t\), with
where the vector fields \(\mathbf {v}^i_t\) are their respective velocity fields provided by Theorem 4.6. Setting \(d(t){:=}\frac{1}{2} W_2^2(\varrho ^0_t,\varrho ^1_t)\), it is natural to guess that we have
where \(\varphi \) is the Kantorovich potential in the transport from \(\varrho ^0_t\) to \(\varrho ^1_t\) and \(\psi =\varphi ^c\). Indeed, a rigorous proof is provided in [3] or in Section 5.3.5 of [84], but one can guess it from the duality formula
As usual, differentiating an expression written as a max involves the optimal functions \(\phi ,\psi \) in such a max, and the terms \(\partial _t \varrho ^i_t\) have been replaced by \(-\nabla \cdot (\varrho ^i_t \mathbf {v}^i_t)\) as in the proof Lemma 4.12.
When the two curves \(\varrho ^0_t\) and \(\varrho ^1_t\) are solutions of (4.13) we have \(\mathbf {v}^i_t=-\nabla \left( \frac{\delta F}{\delta \varrho }(\varrho ^i_t)\right) \), and Lemma 4.12 allows to obtain the following:
Proposition 4.13
Suppose that \(F:\mathbb W_2(\Omega )\rightarrow \mathbb R\cup \{+\infty \}\) is \(\lambda \)-geodesically convex and that the two curves \(\varrho ^0_t\) and \(\varrho ^1_t\) are solutions of (4.13). Then, setting \(d(t){:=}\frac{1}{2}W_2^2(\varrho ^0_t,\varrho ^1_t)\), we have
This implies uniqueness of the solution of (4.13) for fixed initial datum, stability, and exponential convergence as \(t\rightarrow +\infty \) if \(\lambda >0\).
4.6 Other gradient-flow PDEs and variants
In Sect. 4.3 we saw some examples of evolution PDEs that can be obtained as gradient flows for suitable functionals F in the Wasserstein space of probability measures on a domain \(\Omega \), endowed with the distance \(W_2\). Here we would like to give some more examples and present some variants (the reader can also look at Section 8.4.2 in [84]).
Keller–Segel A model in mathematical biology, first introduced in [56, 57], has received much attention due to its interesting mathematical properties: consider a population \(\varrho \) of bacteria, which move advected by a potential and subject to diffusion. The potential is given by the concentration u of a nutrient, produced by the bacteria themselves, which attracts them by a chemical signal (this is why it is called chemo-attractant, and this kind of phenomenon is also known under the name of chemotaxis). More precisely, the drift is given by the gradient \(\nabla u\), but the distribution of u depends on their density \(\varrho \). The easiest model has linear diffusion and supposes that u is related to \(\varrho \) by the condition \(-\Delta u =\varrho \), with Dirichlet boundary conditions \(u=0\) on \(\partial \Omega \). This gives the system
The parameter \(\alpha \) tunes the attraction intensity of bacteria towards the chemo-attractant.
This system is also studied in the whole space, where (to impose suitable decay at infinity), the function u is defined through the explicit formula
One can see that System (4.20) is the gradient flow of the functional
To check this fact, we only need to compute the first variation of the Dirichlet term \(- \frac{1}{2} \int |\nabla u_\varrho |^2\). Suppose \(\varrho _\varepsilon =\varrho +\varepsilon \chi \) and set \(u_{\varrho +\varepsilon \chi }=u_\varrho +\varepsilon u_\chi \). Then
Note that, due to the wrong sign in the Dirichlet term, the variational problem \(\min F(\varrho )+\frac{W_2^2(\varrho ,\varrho _0)}{2\tau }\) requires some assumptions when proving the existence of minimizers. In particular, it would be possible that the infimum is \(-\infty \), or that the energy is not l.s.c. because of the negative sign. In the case of dimension 2, sophisticated functional inequalities allow to handle the case \(\alpha \le 8\pi \). We refer to [15] and to the references therein for details on the analysis of this equation.
We also remark that the above model, coupling a parabolic equation on \(\varrho \) and an elliptic equation on u, implicitly assumes that the configuration of the chemo-attractant instantaneously follows that of \(\varrho \). More sophisticated models can be expressed in terms of the so-called parabolic–parabolic Keller–Segel equation, in the form
or other variants with different boundary conditions. Equation (4.22) can also be studied as a gradient flow in two variables, using the distance \(W_2\) on \(\varrho \) and an \(L^2\) distance on u; see [16]. In this case, we use
The JKO schemes becomes
(note that we multiplied the \(L^2\) part of the distance times a coefficient \(\alpha \) in order to adjust the coefficients in the parabolic equation on u). At a very formal level, the reader can see that this implies
which gives the first equation in (4.22), and
which gives the second equation in (4.22).
Crowd motion The theory of Wasserstein gradient flows has interestingly been applied to the study of a continuous model of crowd movement under density constraints.
This theory has first been introduced in a microscopic setting (each individual in the crowd being a rigid sphere), but we will not develop this aspect here (the interested reader can refer to the papers by Maury and Venel [69, 70]). We are now interested in the simplest continuous counterpart of this microscopic model. In this case the particles population will be described by a probability density \(\varrho \in \mathcal {P}(\Omega )\), and we are given a so-called “spontaneous” velocity field \(\mathbf {u}\) representing the velocity each particle would follow if the other particles were not present. Yet, the crowd is supposed to be incompressible in the sense that it is subject to a maximal density constraint \(\varrho \le 1\). Then \(\mathbf {u}\) must be adapted so that it belongs to a suitable set of admissible velocities, those which preserve this constraint. (This depends on the sign of the divergence of \(\mathbf {v}\) on the saturated region \(\{\varrho =1\}\)): \(\mathrm {adm}(\varrho )=\big \{ \mathbf {v}:\Omega \rightarrow \mathbb R^d\;:\;\nabla \cdot \mathbf {v}\ge 0\; \text{ on } \{\varrho =1\}\big \}\). The main axiom of the model (presented, for instance, in [67]) is that the density \(\varrho \) will be advected by a vector field \(\mathbf {v}_t\) which is not \(\mathbf {u}\) but its projection, at each time t, on \(\mathrm {adm}(\varrho _t)\). Hence, we look for solutions of the equation
The main difficulty is the fact that the vector field \(\mathbf {v}=P_{\mathrm {adm}(\varrho _t)}\mathbf {u}_t\) is not smooth (since it is only expected to be \(L^2\)), and it has a non-smooth dependance on \(\varrho \) (since passing from a density 1 to a density \(1-\varepsilon \) completely modifies the saturated zone, \(\mathbf {v}\) is very sensitive to small changes in \(\varrho \)).
In [67] these difficulties have been overpassed in the case \(\mathbf {u}=-\nabla V\) (where \(V:\Omega \rightarrow \mathbb R\) is a given Lipschitz function) and the existence of a solution (with numerical simulations) is proven via a gradient flow method. Indeed, (4.23) turns out to be the gradient flow in \(\mathbb W_2\) of the energy
where we define the set \(K=\{\varrho \in \mathcal {P}(\Omega )\,:\,\varrho \le 1\}\). The case \(\mathbf {u}=-\nabla V\) is already quite satisfying for applications, since a typical case is obtained when \(V(x)=\mathrm {dist}(x,\Gamma )\) and \(\Gamma \subset \partial \Omega \) is a part of the boundary playing the role of an exit.
We do not enter into the details of the study of this equation, but we just make a little bit more precise the definitions above. Actually, it is more convenient to give a better description of \(\mathrm {adm}(\varrho )\) by duality:
In this way we characterize \(\mathbf {v}=P_{\mathrm {adm}(\varrho )}(u)\) through
where \(\mathrm {press}(\varrho )\) is the space of “pressure” functions p (used as test functions in the dual definition of \(\mathrm {adm}(\varrho )\)). The two cones \(\nabla \mathrm {press}(\varrho )\) and \(\mathrm {adm}(\varrho )\) are in duality (i.e. one is defined as the set of vectors with negative scalar product with the elements of the other) for the \(L^2\) scalar product. This provides a unique orthogonal decomposition \(\mathbf {u}_t=\mathbf {v}_t+\nabla p_t\), and allows to re-write Eq. (4.23) as
More details can be found in [67, 68, 81]. Variants where the motion of the crowd also undergoes diffusion are also studied, as in [38, 76] (in particular, [76] provides interesting uniqueness results) as well as variants with several species (see [25]).
Dirichlet boundary conditions So far, all the equations that we dealt with were always accompanied by Neumann-type boundary conditions: they are based on the continuity equation, which represents the conservation of mass, and we are describing the motion of a population \(\varrho \) of particles forced to stay inside a given domain \(\Omega \). These boundary conditions exactly impose the fact that particles do not exit the domain. Note that this does not mean that particles on the boundary should stay on the boundary forever (this is what formally happens under the condition \(\mathbf {v}\cdot \mathbf {n}=0\)); the boundary condition here is rather \(\varrho \mathbf {v}\cdot \mathbf {n}=0\): if at a given instant of time the particles enter from \(\partial \Omega \) into the interior, then immediately after there will be (locally) no mass on the boundary, and the condition is not violated, hence. On the contrary, should some mass go from the interior to outside \(\Omega \), then there would be indeed a violation, since there would be some mass \(\varrho >0\) on the boundary with velocity directed outwards.
Anyway, we see that Dirichlet conditions do not find their translation into \(W_2\) gradient flows! To cope with Dirichlet boundary conditions, Figalli and Gigli defined in [42] a sort of modified Wasserstein distance, with a special role played by the boundary \(\partial \Omega \), in order to study the heat equation \(\partial _t\varrho =\Delta \varrho \) with Dirichlet b.c. \(\varrho =1\) on \(\partial \Omega \).
Given two finite positive measures \(\mu ,\nu \in \mathcal {M}_+(\overset{\circ }{\Omega })\) (not necessarily with the same mass), set
Then, define
The index b reminds the special role of the boundary. Informally, this means that the transport from \(\mu \) to \(\nu \) may be done either usually, or by moving some mass to/from \(\partial \Omega \), the transport being free on \(\partial \Omega \).
In [42] the authors prove that \(Wb_2\) is a distance, that the space \(\mathcal {M}_+(\overset{\circ }{\Omega })\) is a geodesic space (even if \(\Omega \) is not convex) and they study the gradient flow, for this distance, of the functional \(F(\varrho )=\int (\varrho \log \varrho -\varrho )\,\mathrm{d}x\) (note that here the term \(-\int \varrho \) is important, because the total mass is not fixed), which corresponds to the heat equation with the particular boundary condition \(\varrho =1\) on \(\partial \Omega \) (the value 1 is the value which minimizes \(t\mapsto f(t)=t\log t-t\)).
We observe that this kind of distances with free transport on the boundary were already present in [17, 18], but for the Wasserstein distance \(W_1\), and with no applications to gradient flows.
Reaction–diffusion equations A class of equations which are widely studied in applied mathematics and in particular mathematical biology are PDEs including both a divergence term and a source term, which injects or removes some mass into the system, thus getting an equation of the form
where \(\mathbf {v}_t\) and \(f_t\) could depend on \(\varrho _t\). When the divergence term is of the form \(\Delta (g(\varrho _t))\), then we face a diffusion behavior, while the source term f, usually depending on \(\varrho \), stands for internal reactions which provide growth or reproduction of the particle population \(\varrho \). This is why these equations are usually called reaction–diffusion equations. Yet, it is clear that they cannot be dealt with with the theory of Wasserstein gradient flows, because of the reaction term, which is not in conservative form (and, by the way, the mass could not be preserved in time).
To consider equations of this form, a new distance has been recently introduced. This very interesting distance has been independently introduced by three different groups around the year 2015, and we refer to [31, 32, 60, 63, 64] for the different presentations and the different considerations which have been investigated so far. We give here only a very short description of this distance; the names chosen for it were of course different according to the different groups, and we decided to stick to the terminology chosen in [44] where it is called Kantorovich–Fisher–Rao distance to take into account all the contributions.
The starting point is the definition of \(W_2^2\) via (4.8) (in the case \(p=2\)), which can be modified as follows
This means that one can transform \(\mu \) into \(\nu \) either by displacing some masses, with velocity \(\mathbf {v}\), and paying for the \(L^2\) norm of \(\mathbf {v}\), or by adding/removing some masses, using a scalar field u, and paying as well its \(L^2\) norm. We do not enter into details of the minimization problem defining \(\mathtt {KFR}\), which could also be turned into a convex one by using the variables \((\varrho , \varrho \mathbf {v}, \varrho u)\), and of its equivalent formulations. We only note that the optimal \(\mathbf {v}\) and the optimal u are linked by \(\mathbf {v}=\nabla u\). Using this fact, one can guess and then prove that the limit as \(\tau \rightarrow 0\) of the curves obtained by interpolating the solutions of the iterated minimization problems
will provide a solution of the PDE
Even if the introduction of the distance \(\mathtt {KFR}\) has the advantage to allow for reaction terms in the PDE, it is clear that the rigidity which connects the reaction and the diffusion terms is a serious draw-back which prevents to consider a wide class of equations. To cope with this, and consider equations of the form
in [44] a splitting scheme is proposed. More precisely, one can define the following iterative algorithm:
where the distance \(\mathtt {FR}\) is defined only using the reaction part:
For this last distance,Footnote 15 which is indeed known as Fisher–Rao, or Hellinger, distance, there is an explicit formula:
where \(\lambda \) is any reference positive measure such that \(\mu ,\nu \ll \lambda \) (for instance \(\lambda =\mu +\nu \); anyway, the functional being one-homogeneous, the result does not depend on the reference measure).
4.7 Numerical methods from the JKO scheme
We present in this section two different numerical methods which have been recently proposed to tackle evolution PDEs which have the form of a gradient flow in \(\mathbb W_2(\Omega )\) via their variational JKO scheme (the reader can consult [58] for an early treatment of these equations via the JKO scheme). We will only be concerned with discretization methods allowing the numerical treatment of one step of the JKO scheme, i.e. solving problems of the form
for suitable \(\nu \) (to be taken equal to \(\varrho ^\tau _k\)) and suitable F (including the \(\tau \) factor). We will not consider the convergence as \(\tau \rightarrow 0\) of the iterations to solutions of the evolution equation.
We will present two methods. One, essentially taken from [13], is based on the Benamou–Brenier formula first introduced in [11] as a numerical tool for optimal transport (also see [12]). This method is well-suited for the case where the energy \(F(\varrho )\) used in the gradient flow is a convex function of \(\varrho \). For instance, it works for functionals of the form \(F(\varrho )=\int f(\varrho (x))\mathrm{d}x+\int V\mathrm{d}\varrho \) and can be used for Fokker–Planck and porous medium equations. The second method is based on methods from semi-discrete optimal transport, essentially developed by Q. Mérigot using computational geometry (see [59, 62, 74] for 3D implementation) and allows to translate the problem into an optimization problem in the class of convex functions; it is well suited for the case where F is geodesically convex, which means that the term \(\int f(\varrho (x))\,\mathrm{d}x\) is only admissible if f satisfies McCann’s condition, the term \(\int V\mathrm{d}\varrho \) needs V to be convex, but interaction terms such as \(\int \int W(x-y)\,\mathrm{d}\varrho (x)\,\mathrm{d}\varrho (y)\) are also allowed, if W is convex.
Augmented Lagrangian methods Let us recall the basis of the Benamou–Brenier method. This amounts to solve the variational problem (4.9) which reads, in the quadratic case, as
where \(K_2=\{(a,b)\in \mathbb R\times \mathbb R^d\;:\;a+\frac{1}{2} |b|^2\le 0\}\). We then use the fact that the continuity equation constraint can also be written as a sup penalization, by adding to the functional
which is 0 in case the constraint is satisfied, and \(+\infty \) if not.
It is more convenient to express everything in the space-time formalism, i.e. by writing \(\nabla _{t,x}\phi \) for \(( \partial _t\phi ,\nabla \phi )\) and using the variable \(\mathfrak m\) for \((\varrho ,E)\) and A for (a, b). We also set \(G(\phi ){:=}\int \phi _1\mathrm{d}\nu -\int \phi _0\mathrm{d}\mu \). Then the problem becomes
where the scalar product is here an \(L^2\) scalar product, but becomes a standard Euclidean scalar product as soon as one discretizes (in time-space). The function \(I_{K_2}\) denotes the indicator function in the sense of convex analysis, i.e. \(+\infty \) if the condition \(A\in K_2\) is not satisfied, and 0 otherwise.
The problem can now be seen as the search for a saddle-point of the Lagrangian
which means that we look for a pair \((\mathfrak m,(A,\phi )) \) (actually, a triple, but A and \(\phi \) play together the role of the second variable) where \(\mathfrak m\) minimizes for fixed \((A,\phi )\) and \((A,\phi )\) maximizes for fixed \(\mathfrak m\). This fits the following framework, where the variables are X and Y and the Lagrangian has the form \(L(X,Y){:=}X\cdot \Lambda Y-H(Y)\). In this case one can use a very smart trick, based on the fact that the saddle points of this Lagrangian are the same of the augmented Lagrangian \(\tilde{L}\) defined as \(\tilde{L}(X,Y){:=}X\cdot \Lambda Y-H(Y)-\frac{r}{2}|\Lambda Y|^2\), whatever the value of the parameter \(r>0\) is (see, for instance, [43]). Indeed, the saddle-point of L are characterized by (we assume all the functions we minimize are convex and all the functions we maximize are concave)
while those of \(\tilde{L}\) are characterized by
which is the same since the first equation implies that the extra term in the second vanishes.
In this case, we obtain a saddle point problem of the form
(where the squared norm in the last term is an \(L^2\) norm in time and space), which is then solved by iteratively repeating three steps: for fixed A and \(\mathfrak m\), finding the optimal \(\phi \) (which amounts to minimizing a quadratic functional in calculus of variations, i.e. solving a Poisson equation in the space-time \([0,1]\times \Omega \), with Neumann boundary conditions, homogeneous on \(\partial \Omega \) and non-homogeneous on \(t=0\) and \(t=1\), due to the term G); then for fixed \(\phi \) and \(\mathfrak m\) find the optimal A (which amounts to a pointwise minimization problem, in this case a projection on the convex set \(K_2\)); finally update \(\mathfrak m\) by going in the direction of the gradient descent, i.e. replacing \(\mathfrak m\) with \(\mathfrak m-r(A-\nabla _{t,x}\phi )\) (it is convenient to choose the parameter of the gradient descent to be equal to that the Augmented Lagrangian).
This is what is done in the case where the initial and final measures are fixed. At every JKO step, one is fixed (say, \(\mu \)), but the other is not, and a penalization on the final \(\varrho _1\) is added, of the form \(\tau F(\varrho _1)\). Inspired from the considerations above, the saddle point below allows to treat the problem
by formulating it as
where we re-inserted the integration signs to underline the difference between integrals in space-time (with \(\mathfrak m, A\) and \(\phi \)) and in space only (with \(\phi _0,\phi _1,\varrho _1,V\) and \(\lambda \)). The role of the variable \(\lambda \) is to be dual to \(\varrho _1\), which allows to express \(f(\varrho _1)\) as \(\sup _\lambda \, \varrho _1\lambda -f^*(\lambda )\).
To find a solution to this saddle-point problem, an iterative procedure is also used, as above. The last two steps are the update via a gradient descent of \(\mathfrak m\) and \(\varrho _1\), and do not require further explications. The first three steps consist in the optimization of \(\phi \) (which requires the solution of a Poisson problem) and in two pointwise minimization problems in order to find A (which requires a projection on \(K_2\)) and \(\lambda \) (the minimization of \(f^*(\lambda )+\frac{r}{2}|\phi _1(x)+\lambda +V(x)|^2-\varrho _1(x)\lambda \), for fixed x).
For the applications to gradient flows, a small time-step \(\tau >0\) has to be fixed, and this scheme has to be done for each k, using \(\mu =\varrho ^\tau _k\) and setting \(\varrho ^\tau _{k+1}\) equal to the optimizer \(\varrho _1\) and the functions f and V must include the scale factor \(\tau \). The time-space \([0,1]\times \Omega \) has to be discretized but the evolution in time is infinitesimal (due to the small time scale \(\tau \)), which allows to choose a very rough time discretization. In practice, the interval [0, 1] is only discretized using less than 10 time steps for each k...
The interested reader can consult [13] for more details, examples and simulations.
Optimization among convex functions It is clear that the optimization problem
can be formulated in terms of transport maps as
Also, it is possible to take advantage of Brenier’s theorem which characterizes optimal transport maps as gradient of convex functions, and recast it as
It is useful to note that in the last formulation the convexity of u is not necessary to be imposed, as it would anyway come up as an optimality condition. On the other hand, very often the functional F involves explicity the density of the image measure \((\nabla u)_\#\mu \) (as it is the case for the typical example \(\mathcal {F}\)), and in this case convexity of u helps in computing this image measure. Indeed, whenever u is convex we can say that the density of \(\varrho {:=}(\nabla u)_\#\mu \) (if \(\mu \) itself is absolutely continuous, with a density that we will denote by \(\varrho _0\)) is given byFootnote 16
Hence, we are facing a calculus of variations problem in the class of convex functions. A great difficulty to attack this class of problems is how to discretize the space of convex functions. The first natural approach would be to approximate them by piecewise linear functions over a fixed mesh. In this case, imposing convexity becomes a local feature, and the number of linear constraints for convexity is proportional to the size of the mesh. Yet, Choné and Le Meur showed in [33] that we cannot approximate in this way all convex functions, but only those satisfying some extra constraints on their Hessian (for instance, those which also have a positive mixed second derivative \(\partial ^2 u/\partial x_i\partial x_j\)). Because of this difficulty, a different approach is needed. For instance, Ekeland and Moreno- Bromberg used the representation of a convex function as a maximum of affine functions [40], but this needed many more linear constraints; Oudet and Mérigot [75] decided to test convexity on a different (and less refined) grid than that where the functions are defined... These methods give somehow satisfactory answers for functionals involving u and \(\nabla u\), but are not able to handle terms involving the Monge–Ampère operator \(\det (D^2 u)\).
The method proposed in [14], that we will roughly present here, does not really use a prescribed mesh. The idea is the following: suppose that \(\mu \) is a discrete measure of atomic type, i.e. of the form \(\sum _j a_j\delta _{x_j}\). A convex function defined on its support \(S{:=}\{x_j\}_j\) will be a function \( u:S\rightarrow \mathbb R\) such that at each point \(x\in S\) the subdifferential
is non-empty. Also, the Monge–Ampère operator will be defined by using the sub-differential, and more precisely the equality
which is valid for smooth convex functions u and arbitrary open sets B. An important point is the fact that whenever f is superlinear, functionals of the form \(\int f(\varrho (x))\mathrm{d}x\) impose, for their finiteness, the positivity of \(\det (D^2 u)\), which will in turn impose that the sub-differential has positive volume, i.e. it is non-empty, and hence convexity...
More precisely, we will minimize over the set of pairs \(( u,P):S\rightarrow \mathbb R\times \mathbb R^d\) where \(P(x)\in \partial u(x)\) for every \(x\in S\). For every such pair (u, P) we weed to define G(u, P) which is meant to be \((\nabla u)_\#\mu \), and define F(G(u, P)) whenever F has the form \(F=\mathcal {F}+\mathcal {V}+\mathcal {W}\). We will simply define
which means that we just use \(P_\#\mu \) instead of \((\nabla u)_\#\mu \). Unfortunately, this choice is not adapted for the functional \(\mathcal {F}\), which requires absolutely continuous measures, and \(P_\#\mu \) is atomic. In this case, instead of concentrating all the mass \(a_j\) contained in the point \(x_j\in S\) on the unique point \(P(x_j)\), we need to spread it, and we will spread it uniformly on the whole subdifferential \(\partial u(x_j)\). This means that we also define a new surrogate of the image measure \((\nabla u)_\#\mu \), called \(G^{ac}( u,P)\) (where the superscript ac stands for absolutely continuous), given by
where \(A_j{:=}\partial u(x_j)\cap \Omega \) (the intersection with \(\Omega \) is done in order to take care of the constraint \(\nabla u\in \Omega \)). Computing \(\mathcal {F}(G^{ac}( u,P))\) gives hence
It is clear that the discretization of \(\mathcal {V}\) and \(\mathcal {W}\) in terms of G(u, P) are convex functions of (u, P) (actually, of P, and the constraint relating P and u is convex) whenever V and W are convex; concerning \(\mathcal {F}\), it is possible to prove, thanks to the concavity properties of the determinant or, equivalently, to the Brunn–Minkowski inequality (see for instance [86]) that \(\mathcal {F}(G^{ac}( u,P))\) is convex in u as soon as f satisfies McCann’s condition. Globally, it is not surprising to see that we face a convex variational problem in terms of u (or of \(\nabla u\)) as soon as F is displacement convex in \(\varrho \) (actually, convexity on generalized geodesics based at \(\mu \) should be the correct notion).
Then, we are lead to study the variational problem
under the constraints \(P(x_j)\in A_j{:=}\partial u(x_j)\cap \Omega \). Note that we should incorporate the scale factor \(\tau \) in the functional F which means that, for practical purposes, convexity in P is guaranteed as soon as V and W have second derivatives which are bounded from below (they are semi-convex) and \(\tau \) is small (the quadratic term coming from \(W_2^2\) will always overwhelm possible concavity of the other terms). The delicate point is how to compute the subdifferentials \(\partial u(x_j)\), and optimize them (i.e. compute derivatives of the relevant quantity w.r.t. u).
This is now possible, and in a very efficient way, thanks to tools from computational geometry. Indeed, in this context, subdifferentials are exactly a particular case of what are called Laguerre cells, which in turn are very similar to Voronoi cells. We remind that, given some points \((x_j)_j\), their Voronoi cells \(V_j\) are defined by
(of course the squares of the distances could be replaced by the distances themselves, but in this way it is evident that the cells \(V_j\) are given by a finite number of linear inequalities, and are thus convex polyhedra; the factors \(\frac{1}{2}\) are also present only for cosmetic reasons). Hence, Voronoi cells are the cells of points which are closer to one given point \(x_j\in S\) than to the others.
In optimal transport a variant of these cells is more useful: given a set of values \(\psi _j\), we look for the cells (called Laguerre cells)
This means that we look at points which are closer to \(x_j\) than to the other points \(x_{j'}\), up to a correctionFootnote 17 given by the values \(\psi _j\). It is not difficult to see that also in this case cells are convex polyhedra. And it is also easy to see that, if \(\varrho \) is an absolutely continuous measure on \(\Omega \) and \(\mu =\sum _j a_j\delta _{x_j}\), then finding an optimal transport map from \(\varrho \) to \(\mu \) is equivalent to finding values \(\psi _j\) such that \(\varrho (W_j)=a_j\) for every j (indeed, in this case, the map sending every point of \(W_j\) to \(x_j\) is optimal, and \(-\psi _j\) is the value of the corresponding Kantorovich potential at the point \(x_j\)). Finally, it can be easily seen that the Laguerre cells corresponding to \(\psi _j{:=} u(x_j)-\frac{1}{2}|x_j|^2\) are nothing but the subdifferentials of u (possibly intersected with \(\Omega \)).
Handling Laguerre cells from the computer point of view has for long been difficult, but it is now state-of-the-art in computational geometry, and it is possible to compute very easily their volumes (incidentally, also find some points P belonging to them, which is useful so as to satisfy the constraints of Problem (4.25)), as well as the derivatives of their volumes (which depend on the measures of each faces) w.r.t. the values \(\psi _j\). For the applications to semi-discreteFootnote 18 optimal transport problems, the results are now very fast (with discretizations with up to 10\(^6\) points in some minutes, in 3D; the reader can have a look at [59, 62, 74] but also to Section 6.4.2 in [84]), and the same tools have been used for the applications to the JKO scheme that we just described.
In order to perform an iterated minimization, it is enough to discretize \(\varrho _0\) with a finite number of Dirac masses located at points \(x_j\), to fix \(\tau >0\) small, then to solve (4.25) with \(\mu =\varrho ^\tau _{k}\) and set \(\varrho ^\tau _{k+1}{:=}G( u,P)\) for the optimal (u, P). Results, proofs of convergence and simulations are in [14].
5 The heat flow in metric measure spaces
In this last section we will give a very sketchy overview of an interesting research topic developed by Ambrosio, Gigli, Savaré and their collaborators, which is in some sense a bridge between
-
the theory of gradient flows in \(\mathbb W_2\), seen from an abstract metric space point of view (which is not the point of view that we underlined the most in the previous section),
-
and the current research topic of analysis and differential calculus in metric measure spaces (see, for instance, [49]).
This part of their work is very ambitious, and really aims at studying analytical and geometrical properties of metric measure spaces; what we will see here is only a starting point.
The topic that we will briefly develop here is concerned with the heat flow, and the main observation is the following: in the Euclidean space \(\mathbb R^d\) (or in a domain \(\Omega \subset \mathbb R^d)\), the heat flow \(\partial _t \varrho =\Delta \varrho \) may be seen as a gradient flow in two different ways:
-
first, it is the gradient flow in the Hilbert space \(L^2(\Omega )\), endowed with the standard \(L^2\) norm, of the functional consisting in the Dirichlet energy \(\mathcal {D}(\varrho )=\int |\nabla \varrho |^2\mathrm{d}x\) (a functional which is set to \(+\infty \) if \(\varrho \notin H^1(\Omega )\)); in this setting, the initial datum \(\varrho _0\) could be any function in \(L^2(\Omega )\), but well-known properties of the heat equation guarantee \(\varrho _0\ge 0\Rightarrow \varrho _t\ge 0\) and, if \(\Omega \) is the whole space, or boundary conditions are Neumann, then \(\int \varrho _0\mathrm{d}x=1\Rightarrow \int \varrho _t\mathrm{d}x=1\); it is thus possible to restrict to probability densities (i.e. positive densities with mass one);
-
then, if we use the functional \(\mathcal {E}\) of the previous section (the entropy defined with \(f(t)=t\log t\)), the heat flow is also a gradient flow in the space \( \mathbb W_2(\Omega )\).
A natural question arises: is the fact that these two flows coincide a general fact? How to analyze this question in a general metric space? In the Euclidean space this is easy: we just write the PDEs corresponding to each of these flows and, as they are the same PDE, for which we know uniqueness results, then the two flows are the same.
First, we realize that the question is not well-posed if the only structure that we consider on the underlining space is that of a metric space. Indeed, we also need a reference measure (a role played by the Lebesgue measure in the Euclidean space). Such a measure is needed in order to define the integral \(\int |\nabla \varrho |^2\mathrm{d}x\), and also the entropy \(\int \varrho \log \varrho \,\mathrm{d}x\). Roughly speaking, we need to define “\(\mathrm{d}x\)”.
Hence, we need to consider metric measure spaces, (X, d, m), where \(m\ge 0\) is a reference measure (usually finite) on the Borel tribe of X. The unexperienced reader should not be surprised: metric measure spaces are currently the new frontier of some branches of geometric analysis, as a natural generalization of Riemannian manifolds. In order not to overburden the reference list, we just refer to the following papers, already present in the bibliography of this survey for other reasons: [2, 4, 6, 7, 30, 48, 50,51,52,53, 65, 85, 87].
5.1 Dirichlet and Cheeger energies in metric measure spaces
In order to attack our question about the comparison of the two flows, we first need to define and study the flow of the Dirichlet energy, and in particular to give a suitable definition of such an energy. This more or less means defining the space \(H^1(X)\) whenever X is a metric measure space (MMS). This is not new, and many authors studied it: we cite in particular [30, 51, 52, 85]). Moreover, in their recent works, [4, 6], Ambrosio, Gigli and Savaré presented some results in this direction, useful for the analysis of the most general case (consider that most of the previous results require a doubling assumption and the existence of a Poincaré inequality, see also [53], and this assumption on (X, d, m) is not required in their papers). One of the first definition of Sobolev spaces on a MMS had been given by Haiłasz, who used the following definition
This property characterizes Sobolev spaces in \(\mathbb R^d\) by choosing
where M[u] denotes the maximal function of u: (the important point here is the classical result in harmonic analysis guaranteeing \(||M[u]||_{L^2}\le C(d)||u||_{L^2}\)) and c is a suitable constant only depending on the dimension d. As this definition is not local, almost all the recent investigations on these topics are rather based on some other ideas, due to Cheeger [30], using the relaxation starting from Lipschitz functions, or to Shanmuganlingam [85], based on the inequality
required to hold on almost all curves, in a suitable sense. In the recent paper [4] the authors present a classification of the various weak notions of modulus of the gradient in a MMS and analyze their equivalence. On the contrary, here we will only choose one unique definition for \(\int |\nabla f|^2\mathrm{d}m\), the one which seems the simplest.
For every Lipschitz function f on X, let us take its local Lipschitz constant \(|\nabla f|\), defined in (3.1), and set \(\mathcal {D}(f){:=}\int |\nabla f|^2(x) \mathrm{d}m\). Then, by relaxation, we define the Cheeger Energy Footnote 19 \(\mathcal {C}(f)\):
We then define the Sobolev space \(H^1(X,d,m)\) as the space of functions such that \(\mathcal {C}(f)<+\infty \). This space will be a Banach space, endowed with the norm \(f\mapsto \sqrt{\mathcal {C}(f)}\) and the function \(f\mapsto \mathcal {C}(f)\) will be convex. We can also define \(-\Delta f\) as the element of minimal norm of the subdifferential \(\partial \mathcal {C}(f)\) (an element belonging to the dual of \(H^1(X,d,m)\)). Beware that, in general, the map \(f\mapsto -\Delta f\) will not be linear (which corresponds to the fact that the norm \(\sqrt{\mathcal {C}(f)}\) is in general not Hilbertian, i.e. it does not come from a scalar product).
Defining the flow of \(\mathcal {C}\) in the Hilbert space \(L^2(X,m)\) is now easy, and fits well the classical case of convex functionals on Hilbert spaces or, more generally, of monotone maximal operators (see [22]). This brings very general existence and uniqueness results.
5.2 A well-posed gradient flow for the entropy
A second step (first developed in [48] and then generalized in [5, 6]) consists in providing existence and uniqueness conditions for the gradient flow of the entropy, w.r.t. the Wasserstein distance \(W_2\). To do so, we consider the functional \(\mathcal {E}\), defined on the set of densities f such that \(\varrho {:=}f\cdot m\) is a probability measure via \(\mathcal {E}(f){:=}\int f\log f \,\mathrm{d}m\) and we look at its gradient flow in \( \mathbb W_2\) in the EDE sense. In order to apply the general theory of Sect. 3, as we cannot use the notion of weak solutions of the continuity equation, it will be natural to suppose that this functional \(\mathcal {E}\) is \(\lambda \)-geodesically convex for some \(\lambda \in \mathbb R\). This means, in the sense of Sturm and Lott–Villani, that the space (X, d, m) is a MMS with Ricci curvature bounded from below. We recall here the corresponding definition, based on the characteristic property already evoked in the Sect. 4.5, which was indeed a theorem (Proposition 4.11) in the smooth case.
Definition 5.1
A metric measure space (X, d, m) is said to have a Ricci curvature bounded from below by a constant \(K\in \mathbb R\) in the sense of Sturm and Lott-Villani if the entropy functional \(\mathcal {E}:\mathcal {P}(X)\rightarrow \mathbb R\cup \{+\infty \}\) defined through
is K-geodesically convex in the space \( \mathbb W_2(X)\). In this case we say that (X, d, m) satisfies the conditionFootnote 20 \(CD(K,\infty )\).
Note that the EVI formulation is not available in this case, as we do not have the geodesic convexity of the squared Wasserstein distance. Moreover, on a general metric space, the use of generalized geodesics is not always possible. This is the reason why we will define the gradient flow of \(\mathcal {E}\) by the EDE condition and not the EVI, but this requires to prove via other methods the uniqueness of such a gradient flow. To do so, Gigli introduced in [48] an interesting strategy to prove uniqueness for EDE flows, which is based on the following proposition.
Proposition 5.1
If \(F:\mathcal {P}(X)\rightarrow \mathbb R\cup \{+\infty \}\) is a strictly convex functional (w.r.t. usual convex combinations \(\mu _s{:=}(1-s)\mu _0+s\mu _1\), which is meaningful in the set of probability measures), such that \(|\nabla ^-F |\) is an upper gradient for F and such that \(|\nabla ^- F|^2\) is convex, then for every initial measure \(\bar{\mu }\) there exists at most one gradient flow \(\mu (t)\) in the EDE sense for the functional F satisfying \(\mu (0)=\bar{\mu }\).
In particular, this applies to the functional \(\mathcal {E}\): the strict convexity is straightforward, and the squared slope can be proven to be convex with the help of the formula (3.4) (it is interesting to observe that, in the Euclidean case, an explicit formula for the slope is known:
whenever \(\varrho =f\cdot \mathcal {L}^d\)).
5.3 Gradient flows comparison
The last point to study (and it is not trivial at all) is the fact that every gradient flow of \(\mathcal {C}\) (w.r.t. the \(L^2\) distance) is also an EDE gradient flow of \(\mathcal {E}\) for the \(W_2\) distance. This one (i.e. \((\mathcal {C},L^2)\Rightarrow (\mathcal {E},W_2)\)) is the simplest direction to consider in this framework, as computations are easier. This is a consequence of the fact that the structure of gradient flows of convex functionals in Hilbert spaces is much more well understood. In order to do so, it is useful to compute and estimate
This computation is based on a strategy essentially developed in [50] and on a lemma by Kuwada. The initial proof, contained in [50], is valid for Alexandroff spaces.Footnote 21 The generalization of the same result to arbitrary MMS satisfying \(CD(K,\infty )\) is done in [6].
Proposition 5.2
If \(f_t\) is a gradient flow of \(\mathcal {C}\) in \(L^2(X,\mathcal {H}^d)\), then we have the following equality with the Fisher information:
Moreover, for every \(\varrho =f\cdot \mathcal {H}^d\in \mathcal {P}(X)\) we have
(where the slope of \(\mathcal {E}\) is computed for the \(W_2\) distanceFootnote 22). Also, if we consider the curve \(\varrho _t=f_t\cdot \mathcal {H}^d\), it happens that \(\varrho _t\) is an AC curve in the space \( \mathbb W_2(X)\) and
These three estimates imply that \(\varrho _t\) is a gradient flow of \(\mathcal {E}\) w.r.t. \(W_2\).
Once this equivalence is established, we can wonder about the properties of this gradient flow. The \(L^2\) distance being Hilbertian, it is easy to see that the C\(^2\)G\(^2\) property is satisfied, and hence this flow also satisfies EVI. On the contrary, it is not evident that the same is true when we consider the same flow as the gradient flow of \(\mathcal {E}\) for the distance \(W_2\). Indeed, we can check that the following three conditions are equivalent (all true or false depending on the space (X, d, m), which is supposed to satisfy \(CD(K,\infty )\); see [7] for the proofs):
-
the unique EDE gradient flow of \(\mathcal {E}\) for \(W_2\) also satisfies EVI;
-
the heat flow (which is at the same time the gradient flow of \(\mathcal {E}\) for \(W_2\) and of \(\mathcal {C}\) for \(L^2\)) depends linearly on the initial datum;
-
(if we suppose that (X, d, m) is a Finsler manifold endowed with its natural distance and its volume measure), X is a Riemannian manifold.
As a consequence, Ambrosio, Gigli and Savaré proposed in [7] a definition of MMS having a Riemanniann ricci curvature bounded from below by requiring both to satisfy the \(CD(K,\infty )\) condition, and the linearity of the heat flow (this double condition is usually written \(RCD(K,\infty )\)). This is the notion of infinitesimally Hilbertian space that we mentioned at the end of Sect. 3.
It is important to observe (but we will not develop this here) that these notions of Ricci bounds (either Riemannian or not) are stable via measured Gromov–Hausdorff convergence (a notion of convergence similar to the Gromov–Hausdorff convergence of metric spaces, but considering the minimal Wasserstein distance between the images of two spaces via isometric embeddings into the same space). This can be surprising at a first sight (curvature bounds are second-order objects, and we are claiming that they are stable via a convergence which essentially sounds like a uniform convergence of the spaces, with a weak convergence of the measures), but not after a simple observation: also the class of convex, or \(\lambda \)-convex, functions is stable under uniform (or even weak) convergence! but, of course, proving this stability is not a trivial fact.
Notes
The attentive reader can observe that, setting \(y{:=}(x+x^\tau _k)/2\), this minimization problem becomes \(\min _y 2F(y)+2|y-x^\tau _k|^2/\tau \). Yet, when acting on a metric space, or simply on a manifold or a bounded domain, there is an extra constraint on y: the point y must be the middle point of a geodesic between \(x^\tau _k\) and a point x (on a sphere, for instance, this means that if \(x^\tau _k\) is the North Pole, then y must lie in the northern hemisphere).
We prefer not to make any distinction here between generalized minimizing movements and minimizing movements.
Warning: this definition is not equivalent to true convexity along the geodesic, since we only compare intermediate instants t to 0 and 1, and not to other intermediate instants; however, in case of uniqueness of geodesics, or if we required the same condition to be true for all geodesics, then we would recover the same condition. Also, let us note that we will only need the existence of geodesics connecting pairs of points where \(F<+\infty \).
Warning: we get here semi-continuity w.r.t. the topology induced by the distance d, which only allows to handle the case where the set \(\{F\le c\}\) are d-compacts.
This is surely exaggerated, as we could think for instance at the theory of geometrical evolutions of shapes and sets, even if it seems that this metric approach has not yet been generalized in this framework.
Lighter versions exist, such as [8], or the recent User’s Guide to Optimal Transport [2], which is a good reference for many topics in this survey, as it deals for one half with optimal transport (even if the title suggests that this is the only topic of the guide), then for one sixth with the general theory of gradient flows (as in our Sect. 3), and finally for one third with metric spaces with curvature bounds (that we will briefly sketch in Sect. 5).
They are named after L. Vaserstein (whose name is sometimes spelled Wasserstein), but this choice is highly debated, in particular in Russia, as the role he played in these distances is very marginal. However, this is now the standard name used in Western countries, probably due to the terminology used in [54, 78], even if other names have been suggested, such as Monge–Kantorovich distances, Kantorovich–Rubinstein... and we will stick to this terminology.
If we change the exponent we could consider
$$\begin{aligned} \min _x\; F(x)+\frac{1}{p} \cdot \frac{|x-x^\tau _k|^p}{\tau ^{p-1}}, \end{aligned}$$but this gives raise to the equation \(x'=-|\nabla F(x)|^{q-2}\nabla F(x),\) where \(q=p/(p-1)\) is the conjugate exponent of p.
Note that in some cases the functionals that we use are actually valued in \(\mathbb R\cup \{+\infty \}\), and we restrict to a suitable class of perturbations \(\chi \) which make the corresponding functional finite.
Note that, for fixed \(\tau >0\), the minimizers usually satisfy strong regularity estimates, and the convergence of the minimizers is strong enough to pass the optimality condition to the limit, obtaining “there exists a Kantorovich potential \(\varphi \) such that \(\delta F/\delta \varrho +\varphi /\tau \) is constant”.
We need at least T to be countably Lipschitz, i.e. \(\Omega \) may be written as a countable union of measurable sets \((\Omega _i)_{i\ge 0}\) with \(\Omega _0\) negligible and Lipschitz continuous for every \(i\ge 1\).
Actually, this is true in normed spaces, but not even true in Riemannian manifolds, as it depends on curvature properties.
By the way, this functional can even be proven to be somehow geodesically concave, as it is shown in [3], Theorem 7.3.2.
Note that this second distance does not require the same mass for \(\mu \) and \(\nu \), while the Wasserstein distance in the other minimization problem requires mass equality, but has to be extended to measures with equal mass, but not necessarily probability measures.
This same formula takes into account the possibility that \(\nabla u\) could be non-injective, i.e. u non-strictly convex, in which case the value of the density could be \(+\infty \) due to the determinant at the denominator which would vanish.
If the points \(x_j\) are the locations of some ice-cream sellers, we can think that \(\psi _j\) is the price of an ice-cream at \(x_j\), and the cells \(W_j\) will represent the regions where customers will decide to go to the seller j, keeping into account both the price and the distance.
One measure being absolutely continuous, the other atomic.
The name has been chosen because Cheeger also gave a definition by relaxation; moreover, the authors did not wish to call it Dirichlet energy, as generally this name is used fro quadratic forms.
The general notation CD(K, N) is used to say that a space has curvature bounded from below by K and dimension bounded from above by N (CD stands for “curvature-dimension”).
These spaces, see [23], are metric spaces where triangles are at least as fat as the triangles of a model comparison manifold with constant curvature equal to K, the comparison being done in terms of the distances from a vertex of a triangle to the points of a geodesic connecting the two other vertices. These spaces can be proven to have always an integer dimension \(d\in \mathbb N\cup \{\infty \}\), and can be considered as MMS whenever \(d<\infty \), by endowing them with their Hausdorff measure \(\mathcal {H}^d\). Note anyway that the comparison manifold with constant curvature can be, anyway, taken of dimension 2, as only triangles appear in the definition.
As in (5.1).
References
Ambrosio, L.: Movimenti minimizzanti. Rend. Accad. Naz. Sci. XL Mem. Mat. Sci. Fis. Natur. 113, 191–246 (1995)
Ambrosio, L., Gigli, N.: A user’s guide to optimal transport. In: Modelling and Optimisation of Flows on Networks. Lecture Notes in Mathematics, vol. 2062, pp. 1–155 Springer-Verlag (2013)
Ambrosio, L., Gigli, N., Savaré, G.: Gradient Flows in Metric Spaces and in the Spaces of Probability Measures. Lectures in Mathematics. ETH Zurich, Birkhäuser (2005)
Ambrosio, L., Gigli, N., Savaré, G.: Density of Lipschitz functions and equivalence of weak gradients in metric measure spaces. Rev. Mat. Iberoam. 29, 969–986 (2013)
Ambrosio, L., Gigli, N., Savaré, G.: Heat flow and calculus on metric measure spaces with Ricci curvature bounded below—the compact case. In: Brezzi, F., Colli Franzone, P., Gianazza, U.P., Gilardi, G. (eds.) Analysis and Numerics of Partial Differential Equations. “INDAM”, vol. 4, pp. 63–116. Springer, Berlin (2013)
Ambrosio, L., Gigli, N., Savaré, G.: Calculus and heat flow in metric measure spaces and applications to spaces with Ricci bounds from below. Inv. Math. 195(2), 289–391 (2014)
Ambrosio, L., Gigli, N., Savaré, G.: Metric measure spaces with Riemannian Ricci curvature bounded from below. Duke Math. J. 163, 1405–1490 (2014)
Ambrosio, L., Savaré, G.: Gradient flows of probability measures. In: Dafermos, C.M., Feireisl, E. (eds.) Handbook of Differential Equations: Evolutionary equations, vol. 3. Elsevier, Amsterdam (2007)
Ambrosio, L., Tilli, P.: Topics on Analysis in Metric Spaces. Oxford Lecture Series in Mathematics and its Applications, vol. 25. Oxford University Press, Oxford (2004)
Aubin, J.-P.: Un théorème de compacité. C. R. Acad. Sci. Paris 256, 5042–5044 (1963)
Benamou, J.-D., Brenier, Y.: A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numer. Math. 84, 375–393 (2000)
Benamou, J.-D., Carlier, G.: Augmented Lagrangian methods for transport optimization, mean field games and degenerate elliptic equations. J. Opt. Theor. Appl. 167(1), 1–26 (2015)
Benamou, J.-D., Carlier, G., Laborde, M.: An augmented Lagrangian approach to Wasserstein gradient flows and applications. ESAIM Proc. 54, 1–17 (2016)
Benamou, J.-D., Carlier, G., Mérigot, Q., Oudet, É.: Discretization of functionals involving the Monge–Ampère operator. Numerische Mathematik 134(3), 611–636 (2016)
Blanchet, A., Calvez, V., Carrillo, J.A.: Convergence of the mass-transport steepest descent scheme for the subcritical Patlak–Keller–Segel model. SIAM J. Numer. Anal. 46(2), 691–721 (2008)
Blanchet, A., Carrillo, J.-A., Kinderlehrer, D., Kowalczyk, M., Laurençot, P., Lisini, S.: A hybrid variational principle for the Keller–Segel system in \({\mathbb{R}}^2\). ESAIM M2AN 49(6), 1553–1576 (2015)
Bouchitté, G., Buttazzo, G.: Characterization of optimal shapes and masses through Monge–Kantorovich equation. J. Eur. Math. Soc. 3(2), 139–168 (2001)
Bouchitté, G., Buttazzo, G., Seppecher, P.: Shape optimization solutions via Monge–Kantorovich equation. C. R. Acad. Sci. Paris Sér. I Math 324(10), 1185–1191 (1997)
Brenier, Y.: Décomposition polaire et réarrangement monotone des champs de vecteurs. C. R. Acad. Sci. Paris Sér. I Math 305(19), 805–808 (1987)
Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions. Commun. Pure Appl. Math. 44, 375–417 (1991)
Brenier, Y.: Extended Monge–Kantorovich theory. In: Optimal transportation and Applications, 91–121, 2003, Lectures given at the C.I.M.E. Summer School held in Martina Franca, Italy, September 2–8 (2001)
Brezis, H.: Opérateurs maximaux monotones et semi-groupes de contractions dans les spaces de Hilbert. North-Holland Mathematics Studies, Amsterdam (1973)
Burago, Yu., Gromov, M., Perelman G.: A. D. Alexandrov spaces with curvatures bounded below. Uspekhi Mat. Nauk47, 3–51 (1992); English translation: Russ. Math. Surv.47, 1–58 (1992)
Buttazzo, G.: Semicontinuity, Relaxation, and Integral Representation in the Calculus of Variations. Longman Scientific and Technical, New York (1989)
Cancès, C., Gallouët, T., Monsaingeon, L.: Incompressible immiscible multiphase flows in porous media: a variational approach https://arxiv.org/abs/1607.04009 (2016) (preprint)
Carrillo, J.-A., Di Francesco, M., Figalli, A., Laurent, T., Slepčev, D.: Global-in-time weak measure solutions and finite-time aggregation for nonlocal interaction equations. Duke Math J. 156(2), 229–271 (2011)
Carrillo, J.-A., McCann, R.J., Villani, C.: Kinetic equilibration rates for granular media and related equations: entropy dissipation and mass transportation estimates. Rev. Mat. Iberoam. 19, 1–48 (2003)
Carrillo, J.-A., McCann, R.J., Villani, C.: Contractions in the 2-Wasserstein length space and thermalization of granular media. Arch. Ration. Mech. Anal. 179, 217–263 (2006)
Carrillo, J.-A., Slepčev, D.: Example of a displacement convex functional of first order. Calc. Var. Partial Differ. Equ. 36(4), 547–564 (2009)
Cheeger, J.: Differentiability of Lipschitz functions on metric measure spaces. Geom. Funct. Anal. 9, 428–517 (1999)
Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.-X.: Unbalanced optimal transport: geometry and Kantorovich formulation. arXiv preprint arXiv:1508.05216 (2015)
Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.-X.: An interpolating distance between optimal transport and Fisher–Rao. arXiv preprint arXiv:1506.06430 (2015)
Choné, P., Le Meur, H.: Non-convergence result for conformal approximation of variational problems subject to a convexity constraint. Numer. Funct. Anal. Optim. 5–6(22), 529–547 (2001)
Craig, K.: Nonconvex gradient flow in the Wasserstein metric and applications to constrained nonlocal interactions, preprint available at arXiv:1512.07255
Daneri, S., Savaré, G.: Eulerian calculus for the displacement convexity in the Wasserstein distance. SIAM J. Math. Anal. 40, 1104–1122 (2008)
De Giorgi, E.: New problems on minimizing movements. In: Baiocchi, C., Lions, J.L. (eds.) Boundary Value Problems for PDE and Applications, pp. 81–98. Masson, Paris (1993)
Di Marino, S., Maury, B., Santambrogio, F.: Measure sweeping processes. J. Convex Anal. 23(1), 567–601 (2016)
Di Marino, S., Mészáros, A.R.: Uniqueness issues for evolution equations with density constraints. Math. Models Methods Appl. Sci. 26(09), 1761–1783 (2016)
Ekeland, I.: On the variational principle. J. Math. Anal. Appl. 47(2), 324–353 (1974)
Ekeland, I., Moreno-Bromberg, S.: An algorithm for computing solutions of variational problems with global convexity constraints. Numerische Mathematik 115(1), 45–69 (2010)
Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. Society for Industrial and Applied Mathematics, Classics in Mathematics, Philadelphia (1999)
Figalli, A., Gigli, N.: A new transportation distance between non-negative measures, with applications to gradients flows with Dirichlet boundary conditions. J. Math. Pures Appl. 94(2), 107–130 (2010)
Fortin, M., Glowinski, R.: Augmented Lagrangian Methods Applications to the Numerical Solution of Boundary-Value Problems. North-Holland, Amsterdam (1983)
Gallouët, T., Monsaingeon, L.: A JKO splitting scheme for Kantorovich-Fisher–Rao gradient flows, preprint available at arXiv:1602.04457
Gangbo, W.: An elementary proof of the polar factorization of vector-valued functions. Arch. Ration. Mech. Anal. 128, 381–399 (1994)
Gangbo, W.: The Monge mass transfer problem and its applications. Contemp. Math. 226, 79–104 (1999)
Gangbo, W., McCann, R.: The geometry of optimal transportation. Acta Math. 177, 113–161 (1996)
Gigli, N.: On the Heat flow on metric measure spaces: existence, uniqueness and stability. Calc. Var. Partial Differ. Equ. 39, 101–120 (2010)
Gigli, N.: Propriétés géométriques et analytiques de certaines structures non lisses. Mémoire HDR, Univ. Nice-Sophia-Antipolis (2011)
Gigli, N., Kuwada, K., Ohta, S.I.: Heat flow on Alexandrov spaces. Commun. Pure Appl. Math. LXVI, 307–331 (2013)
Hajłasz, P.: Sobolev spaces on an arbitrary metric space. Potential Anal. 5, 403–415 (1996)
Hajłasz, P.: Sobolev spaces on metric-measure spaces. Contemp. Math. 338, 173–218 (2003)
Hajłasz, P., Koskela, P.: Sobolev met Poincaré. Mem. Am. Math. Soc. 688, 1–101 (2000)
Jordan, R., Kinderlehrer, D., Otto, F.: The variational formulation of the Fokker–Planck equation. SIAM J. Math. Anal. 29(1), 1–17 (1998)
Kantorovich, L.: On the transfer of masses. Dokl. Acad. Nauk. USSR 37, 7–8 (1942)
Keller, E.F., Segel, L.A.: Initiation of slide mold aggregation viewed as an instability. J. Theor. Biol. 26, 399–415 (1970)
Keller, E.F., Segel, L.A.: Model for chemotaxis. J. Theor. Biol. 30, 225–234 (1971)
Kinderlehrer, D., Walkington, N.J.: Approximation of parabolic equations using the Wasserstein metric. ESAIM Math. Model. Numer. Anal. 33(04), 837–852 (1999)
Kitagawa, J., Mérigot, Q., Thibert, B.: Convergence of a Newton algorithm for semi-discrete optimal transport. J. Eur. Math. Soc. (2016). Accepted 2017
Kondratyev, S., Monsaingeon, L., Vorotnikov, D.: A new optimal transport distance on the space of finite radon measures. arXiv preprint arXiv:1505.07746 (2015)
Legendre, G., Turinici, G.: Second order in time schemes for gradient flows in Wasserstein and geodesic metric spaces https://hal.archives-ouvertes.fr/hal-01317769/document (2016) (preprint)
Lévy, B.: A numerical algorithm for \(L^2\) semi-discrete optimal transport in 3D. ESAIM Math. Model. Numer. Anal 49(6), 1693–1715 (2015)
Liero, M., Mielke, A., Savaré, G.: Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. arXiv preprint arXiv:1508.0794 (2015)
Liero, M., Mielke, A., Savaré, G.: Optimal transport in competition with reaction: the Hellinger–Kantorovich distance and geodesic curves. arXiv preprint arXiv:1509.00068 (2015)
Lott, J., Villani, C.: Ricci curvature for metric-measure spaces via optimal transport. Ann. Math. 169, 903–991 (2009)
Matthes, D., McCann, R.J., Savaré, G.: A family of nonlinear fourth order equations of gradient flow type. Commun. Partial Differ. Equ. 34(10–12), 1352–1397 (2009)
Maury, B., Roudneff-Chupin, A., Santambrogio, F.: A macroscopic crowd motion model of gradient flow type. Math. Models Methods Appl. Sci. 20(10), 1787–1821 (2010)
Maury, B., Roudneff-Chupin, A., Santambrogio, F., Venel, J.: Handling congestion in crowd motion modeling. Net. Het. Media 6(3), 485–519 (2011)
Maury, B., Venel, J.: A mathematical framework for a crowd motion model. C. R. Acad. Sci. Paris, Ser. I 346, 1245–1250 (2008)
Maury, B., Venel, J.: A discrete contact model for crowd motion. ESAIM Math. Model. Numer. Anal. 45(1), 145–168 (2011)
McCann, R.J.: Existence and uniqueness of monotone measure preserving maps. Duke Math. J. 80, 309–323 (1995)
McCann, R.J.: A convexity principle for interacting gases. Adv. Math. 128(1), 153–159 (1997)
McCann, R.J.: Polar factorization of maps on Riemannian manifolds. Geom. Funct. Anal. 11(3), 589–608 (2001)
Mérigot, Q.: A multiscale approach to optimal transport. Comput. Gr. Forum 30(5), 1583–1592 (2011)
Mérigot, Q., Oudet, É.: Handling convexity-like constraints in variational problems. SIAM J. Numer. Anal. 52(5), 2466–2487 (2014)
Mészáros, A.R., Santambrogio, F.: Advection–diffusion equations with density constraints. Anal. PDE 9–3, 615–644 (2016)
Monge, G.: Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris 666–704 (1781)
Otto, F.: The geometry of dissipative evolution equations: the porous medium equation. Commun. Partial Differ. Equ. 26, 101–174 (2011)
von Renesse, M.-K., Sturm, K.-T.: Entropic measure and Wasserstein diffusion. Ann. Probab. 37, 1114–1191 (2009)
Rockafellar, R.T.: Convex Anal. Princeton University Press, Princeton (1970)
Roudneff-Chupin, A.: Modélisation macroscopique de mouvements de foule, PhD Thesis, Université Paris-Sud (2011)
Santambrogio, F.: Gradient flows in Wasserstein spaces and applications to crowd movement, Séminaire Laurent Schwartz no 27, École Polytechnique (2010)
Santambrogio, F.: Flots de gradient dans les espaces métriques et leurs applications (d’après Ambrosio–Gigli–Savaré). In: Proceedings of the Bourbaki Seminar (2013) (in French)
Santambrogio, F.: Optimal Transport for Applied Mathematicians, Progress in Nonlinear Differential Equations and Their Applications no 87, Birkhäuser Basel (2015)
Shanmugalingam, N.: Newtonian spaces: an extension of Sobolev spaces to metric measure spaces. Rev. Mat. Iberoam. 16, 243–279 (2000)
Schneider, R.: Convex Bodies: The Brunn–Minkowski Theory, vol. 44. Cambridge University Press, Cambridge (1993)
Sturm, K.-T.: On the geometry of metric measure spaces. I. Acta Math. 196, 65–131 (2006)
Sturm, K.-T.: On the geometry of metric measure spaces. II. Acta Math. 196, 133–177 (2006)
Vázquez, J.L.: The Porous Medium Equation. Mathematical Theory. Oxford University Press, Oxford (2007)
Villani, C.: Topics in Optimal Transportation. Graduate Studies in Mathematics. AMS, Providence (2003)
Villani, C.: Optimal Transport: Old and New. Springer, Berlin (2008)
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Neil Trudinger.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Santambrogio, F. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bull. Math. Sci. 7, 87–154 (2017). https://doi.org/10.1007/s13373-017-0101-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13373-017-0101-1
Keywords
- Cauchy problem
- Subdifferential
- Analysis in metric spaces
- Optimal transport
- Wasserstein distances
- Heat flow
- Fokker–Planck equation
- Numerical methods
- Contractivity
- Metric measure spaces