Regularization in Reproducing Kernel Hilbert Spaces

Pillonetto, Gianluigi; Chen, Tianshi; Chiuso, Alessandro; De Nicolao, Giuseppe; Ljung, Lennart

doi:10.1007/978-3-030-95860-2_6

Gianluigi Pillonetto¹⁰,
Tianshi Chen¹¹,
Alessandro Chiuso¹⁰,
Giuseppe De Nicolao¹² &
…
Lennart Ljung¹³

Part of the book series: Communications and Control Engineering ((CCE))

5286 Accesses
1 Citations

Abstract

Methods for obtaining a function g in a relationship $y=g(x)$ from observed samples of y and x are the building blocks for black-box estimation. The classical parametric approach discussed in the previous chapters uses a function model that depends on a finite-dimensional vector, like, e.g., a polynomial model. We have seen that an important issue is the model order choice. This chapter describes some regularization approaches which permit to reconcile flexibility of the model class with well-posedness of the solution exploiting an alternative paradigm to traditional parametric estimation. Instead of constraining the unknown function to a specific parametric structure, the function will be searched over a possibly infinite-dimensional functional space. Overfitting and ill-posedness are circumvented by using reproducing kernel Hilbert spaces as hypothesis spaces and related norms as regularizers. Such kernel-based approaches thus permit to cast all the regularized estimators based on quadratic penalties encountered in the previous chapters as special cases of a more general theory.

You have full access to this open access chapter, Download chapter PDF

6.1 Preliminaries

Techniques for reconstructing a function g in a functional relationship $y=g(x)$ from observed samples of y and x are the fundamental building blocks for black-box estimation. As already seen in Chap. 3 when treating linear regression, given a finite set of pairs $(x_i, y_i)$ the aim is to determine a function g having a good prediction capability, i.e., for a new pair (x, y) we would like the prediction g(x) close to y (e.g., in the MSE sense).

The classical parametric approach discussed in Chap. 3 uses a model $g_{\theta }$ that depends on a finite-dimensional vector $\theta $. A very simple example is a polynomial model, treated in Example 3.1, given, e.g., by $g_{\theta }(x)=\theta _1+\theta _2 x +\theta _3x^2$ whose coefficients $\theta _i$ can be estimated by fitting the data via least squares. In this parametric scenario, we have seen that an important issue is the model order choice. In fact, the least squares objective improves as the dimension of $\theta $ increases, eventually leading to data interpolation. But overparametrized models, as a rule, perform poorly when used to predict future output data, even if benign overfitting may sometimes happen, as e.g., described in the context of deep networks [17, 55, 75]. Another drawback related to overparameterization is that the problem may become ill-posed in the sense of Hadamard, i.e., the solution may be non-unique, or ill-conditioned. This means that the estimate may be highly sensitive even to small perturbations of the outputs $y_i$ as, e.g., illustrated in Fig. 1.3 of Sect. 1.2.

This chapter describes some regularization approaches which permit to reconcile flexibility of the model class with well-posedness of the solution exploiting an alternative paradigm to traditional parametric estimation. Instead of constraining the unknown function to a specific parametric structure, g will be searched over a possibly infinite-dimensional functional space. Overfitting and ill-posedness is circumvented by using reproducing kernel Hilbert spaces (RKHSs) as hypothesis spaces and related norms as regularizers. Such norms generalize the quadratic penalties seen in Chap. 3. In this scenario, the estimator is completely defined by a positive definite kernel which has to encode the expected function properties, e.g., the smoothness level. Furthermore we will see that, even when the model class is infinite dimensional, the function estimate turns out a finite linear combination of basis functions computable from the kernel. The estimator also enjoys strong asymptotic properties, permitting (under reasonable assumptions on data generation) to achieve the optimal predictor as the data set size grows to infinity.

The kernel-based approaches described in the following sections thus permit to cast all the regularized estimators based on quadratic penalties encountered in the previous chapters as special cases of a more general theory. In addition, RKHS theory paves the way to the development of other powerful techniques, e.g., for estimation of an infinite number of impulse response coefficients (IIR models estimation), for continuous-time linear system identification and also for nonlinear system identification.

The reader not familiar with functional analysis finds in the first part of the appendix of this chapter a brief overview on the basic results used in the next sections, like, e.g., the concept of linear and bounded functional which is key to define a RKHS.

6.2 Reproducing Kernel Hilbert Spaces

In what follows, we use $\mathscr {X}$ to indicate domains of functions. In machine learning, this set is often referred to as the input space with its generic element $x \in \mathscr {X}$ called input location. Sometimes, $\mathscr {X}$ is assumed to be a compact metric space, e.g., one can think of $\mathscr {X}$ as a closed and bounded set in the familiar space $\mathbb {R}^m$ equipped with the Euclidean norm. In what follows, all the functions are real valued, so that $f: \mathscr {X} \rightarrow \mathbb {R}$.

Reproducing kernel Hilbert spaces We now introduce a class of Hilbert spaces $\mathscr {H}$ which play a fundamental role as hypothesis spaces for function estimation problems. Our goal is to estimate maps which permit to make predictions over the whole $\mathscr {X}$. Thus, a basic requirement is to search for the predictor in a space containing functions which are well-defined pointwise for any $x \in \mathscr {X}$. In particular, we assume that all the pointwise evaluators $g \rightarrow g(x)$ are linear and bounded over $\mathscr {H}$. This means that $\forall x \in \mathscr {X}$ there exists $C_x< \infty $ such that

$$\begin{aligned} |g(x)| \le C_x\Vert g\Vert _{\mathscr {H}}, \quad \forall g \in \mathscr {H}. \end{aligned}$$

(6.1)

The above condition is stronger than requiring $g(x) < \infty \ \forall x$ since $C_x$ can depend on x but not on g. This property already leads to the function spaces of interest. The following definitions are taken from [13].

Definition 6.1

(RKHS, based on [13]) A reproducing kernel Hilbert space (RKHS) over a non-empty set $\mathscr {X}$ is a Hilbert space of functions $g:\mathscr {X} \rightarrow \mathbb {R}$ such that (6.1) holds.

As suggested by the name itself, RKHSs are related to the concept of positive definite kernel [13, 20], a particular function defined over $\mathscr {X}\times \mathscr {X}$. In the literature it is also called positive semidefinite kernel, hence in what follows positive definite kernel and positive semidefinite kernel will define the same mathematical object. This is also specified in the next definition.

Definition 6.2

(Positive definite kernel, Mercer kernel and kernel section, based on [13]) Let $\mathscr {X}$ denote a non-empty set. A symmetric function $K:\mathscr {X}\times \mathscr {X} \rightarrow \mathbb {R}$ is called positive definite kernel or positive semidefinite kernel if, for any finite natural number p, it holds

$$ \sum _{i=1}^{p}\sum _{j=1}^{p}a_ia_j K(x_i,x_j) \ge 0, \quad \forall (x_k,a_k) \in \left( \mathscr {X},\mathbb {R}\right) , \quad k=1,\ldots , p. $$

If strict inequality holds for any set of p distinct input locations $x_k$, i.e.,

$$ \sum _{i=1}^{p}\sum _{j=1}^{p}a_ia_j K(x_i,x_j) > 0, $$

then the kernel is strictly positive definite.

If $\mathscr {X}$ is a metric space and the positive definite kernel is also continuous, then K is said to be a Mercer kernel.

Finally, given a kernel K, the kernel section $K_x$ centred at x is the function $\mathscr {X} \rightarrow \mathbb {R}$ defined by

$$ K_x(y) = K(x,y) \quad \forall y \in \mathscr {X}. $$

Hence, in the sense given above, a positive definite kernel “contains” matrices which are all at least positive semidefinite.

We are now in a position to state a fundamental theorem from [13] here specialized to Mercer kernels which lead to RKHSs containing continuous functions (the proof is reported in Sect. 6.9.2).

Theorem 6.1

(RKHSs induced by Mercer kernels, based on [13]) Let $\mathscr {X}$ be a compact metric space and let $K:\mathscr {X}\times \mathscr {X} \rightarrow \mathbb {R}$ be a Mercer kernel. Then, there exists a unique (up to isometries) Hilbert space $\mathscr {H}$ of functions, called RKHS associated to K, such that

1.
all the kernel sections belong to $\mathscr {H}$, i.e.,
$$\begin{aligned} K_x \in \mathscr {H} \quad \forall x \in \mathscr {X}; \end{aligned}$$
(6.2)
2.
the so-called reproducing property holds, i.e.,
$$\begin{aligned} \langle K_x, g \rangle _{\mathscr {H}} = g(x) \quad \forall (x, g) \in \left( \mathscr {X},\mathscr {H}\right) . \end{aligned}$$
(6.3)

In addition, $\mathscr {H}$ is contained in the space $\mathscr {C}$ of continuous functions.

Remark 6.1

Note that the space $\mathscr {H}$ characterized in Theorem 6.1 is indeed a RKHS according to Definition 6.1. In fact, for any input location x the kernel section $K_x$ belongs to the space and, according to the reproducing property, represents the evaluation functional at x. Then, Theorem 6.27 (Riesz representation theorem), reported in the appendix to this chapter, permits the conclusion that all the pointwise evaluators over $\mathscr {H}$ are linear and bounded.

While Theorem 6.1 establishes a link between Mercer kernels (which enjoy continuity properties) and RKHSs, it is possible also to state a one-to-one correspondence with the entire class of positive definite kernels (not necessarily continuous). In particular, the following result holds.

Theorem 6.2

(Moore–Aronszajn, based on [13]) Let $\mathscr {X}$ be any non-empty set. Then, to every RKHS $\mathscr {H}$ there corresponds a unique positive definite kernel K such that the reproducing property (6.3) holds. Conversely, given a positive definite kernel K, there exists a unique RKHS of real-valued functions defined over $\mathscr {X}$ where (6.2) and (6.3) hold.

The proof can be quite easily obtained using Theorem 6.27 (Riesz representation theorem) and arguments similar to those contained in the proof of Theorem 6.1.

Further notes and RKHSs examples Thus, a RKHS $\mathscr {H}$ can be defined just by specifying a kernel K, also called the reproducing kernel of $\mathscr {H}$. In particular, any RKHS is generated by the kernel sections. More specifically, let $S=\text{ span }( \{ K_x \}_{ x \in \mathscr {X} })$ and define the following norm in S

$$\begin{aligned} \Vert f \Vert _{\mathscr {H}}^2 = \sum _{i=1}^p \sum _{j=1}^p c_i c_j K(x_i,x_j), \end{aligned}$$

(6.4)

where

$$\begin{aligned} f(\cdot ) = \sum _{i=1}^{p} c_i K_{x_i}(\cdot ). \end{aligned}$$

Then, one has

$$ \mathscr {H} = S \ \cup \ \left\{ \text{ all } \text{ the } \text{ limits } \text{ w.r.t. } \Vert \cdot \Vert _{\mathscr {H}} \text{ of } \text{ Cauchy } \text{ sequences } \text{ contained } \text{ in } S \right\} . $$

Summarizing, one has

all the kernel sections $K_x(\cdot )$ belong to the RKHS $\mathscr {H}$ induced by K;
$\mathscr {H}$ contains also all the finite linear combinations of kernel sections along with some particular infinite sums, convergent w.r.t. the norm (6.4);
every $f \in \mathscr {H}$ is thus a linear combination of a possibly infinite number of kernel sections.

Assume for instance $K(x_1,x_2) = \exp \left( - \Vert x_1-x_2\Vert ^2\right) $, which is the so-called Gaussian kernel. Then, all the functions in the corresponding RKHS are sums, or limits of sums, of functions proportional to Gaussians. As further elucidated later on, this means that every function of $\mathscr {H}$ inherits properties such as smoothness and integrability of the kernel, e.g., we have seen in Theorem 6.1 that kernel continuity implies $\mathscr {H} \subset \mathscr {C}$. This fact has an important consequence on modelling: instead of specifying a whole set of basis functions, it suffices to choose a single positive definite kernel that encodes the desired properties of the function to be synthesized.

Example 6.3

(Norm in a two-dimensional RKHS) We introduce a very simple RKHS to illustrate how the kernel K can be seen as a similarity function that establishes the norm (complexity) of a function by comparing function values at different input locations.

When $\mathscr {X}$ has finite cardinality m, the functions are evaluated just on a finite number of input locations. Hence, each function f is in one-to-one correspondence with the m-dimensional vector

$$ \mathbf {f} = \left( \begin{array}{c} f(1) \\ f(2) \\ \vdots \\ f(m) \end{array}\right) . $$

In addition, any kernel is in one-to-one correspondence with one symmetric positive semidefinite matrix $\mathbf {K} \in \mathbb {R}^{m \times m}$ with (i, j)-entry $\mathbf {K}_{ij} = K(i,j)$. Finally, the kernel sections can be seen as the columns of $\mathbf {K}$.

Assume, e.g., $m=2$ with $\mathscr {X}=\{1,2\}$. Then, the functions can be seen as two-dimensional vectors and any kernel K is in one-to-one correspondence with one symmetric positive semidefinite matrix $\mathbf {K} \in \mathbb {R}^{2 \times 2}$. The RKHS $\mathscr {H}$ associated to K is finite-dimensional being spanned just by the two kernel sections $K_1(\cdot )$ and $K_2(\cdot )$ which can be seen as the two columns of $\mathbf {K}$. Hence, the functions f in $\mathscr {H}$ are in one-to-one correspondence with the vectors

$$ \mathbf {f} = \left( \begin{array}{c} f(1) \\ f(2) \end{array}\right) = \mathbf {K} c, \quad c \in \mathbb {R}^2. $$

If $\mathbf {K}$ is full rank, $\mathscr {H}$ covers the whole $\mathbb {R}^2$ and from (6.4) we have

$$ \Vert f \Vert ^2_{\mathscr {H}} = c^T \mathbf {K} c = \mathbf {f}^T \mathbf {K}^{-1} \mathbf {f}. $$

For the sake of simplicity, assume also that $\mathbf {K}_{11}=\mathbf {K}_{22}=1$ so that it must hold $-1<\mathbf {K}_{12}<1$. Then, considering, e.g., the function $f(i)=i$, one has

$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}}= & {} [1 \ \ 2] \ \mathbf {K}^{-1} \ [1 \ \ 2]^T\\= & {} \frac{5-4\mathbf {K}_{12}}{1-\mathbf {K}_{12}^2}, \quad -1<\mathbf {K}_{12}<1. \end{aligned}$$

Figure 6.1 displays $\Vert f \Vert ^2_{\mathscr {H}}$ as a function of $\mathbf {K}_{12}$. One can see that the norm diverges as $|\mathbf {K}_{12}|$ approaches 1.

If, e.g., $\mathbf {K}_{12}=1$ the kernel function becomes constant over $\mathscr {X} \times \mathscr {X}$. Hence, the two kernel sections $K_1(\cdot )$ and $K_2(\cdot )$ coincide, being constant with $K_1(i)=K_2(i)=1$ for $i=1,2$. This means that $\mathbf {K}_{12}=1$ induces a space $\mathscr {H}$ containing only constant functions.^{Footnote 1} This explains why the norm (complexity) of f becomes large if $\mathbf {K}_{12}$ is close to 1: the space becomes less and less “tolerant” of functions with $f(1)\ne f(2)$.

Letting now $f(1)=1$ and $f(2)=a$, the joint effect of $\mathbf {K}_{12}$ and a is explained by the formula

$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {H}}= & {} [1 \ \ a] \ \mathbf {K}^{-1} \ [1 \ \ a]^T\\= & {} \frac{(a-\mathbf {K}_{12})^2}{1-\mathbf {K}_{12}^2}+1, \quad -1<\mathbf {K}_{12}<1. \end{aligned}$$

Note that, thinking now of $\mathbf {K}_{12}$ as fixed, the function with minimal RKHS norm (complexity) is obtained with $a=\mathbf {K}_{12}$ and has a norm equal to one. $\square $

Example 6.4

($\mathscr {L}_2^{\mu }$ and $\ell _2$ ) Let $\mathscr {X}=\mathbb {R}$ and consider the classical Lebesgue space of square summable functions with $\mu $ equal to the Lebesgue measure. Recall that this is a Hilbert space whose elements are equivalence classes of functions measurable w.r.t. Lebesgue: any group of functions which differ only on a set of null measure (e.g., containing only a countable number of input locations) identifies the same vector. Hence, $\mathscr {L}_2^{\mu }$ cannot be a RKHS since pointwise evaluation is not even well defined.

Let instead $\mathscr {X}={\mathbb N}$ (the set of natural numbers) and define the identity kernel

$$\begin{aligned} K(i,j)=\delta _{ij}, \ \ (i,j) \in {\mathbb N}\times {\mathbb N}, \end{aligned}$$

(6.5)

where $\delta _{ij}$ is the Kronecker delta. Clearly, K is symmetric and positive definite according to Definition 6.2 (it can be associated with an identity matrix of infinite size). Hence, it induces unique RKHS $\mathscr {H}$ that contains all the finite combinations of the kernel sections. In particular, any finite sum can be written as $f(\cdot ) = \sum _{i=1}^{m} f_i K_{i}(\cdot )$, where some of the $f_i$ may be null, and corresponds to a sequence with a finite number of non null components. To obtain the entire $\mathscr {H}$, we need also to add all the Cauchy sequences limits w.r.t. the norm (6.4) given by

$$\begin{aligned} \Vert f\Vert _{\mathscr {H}}^2= & {} \left\| \sum _{i=1}^{m} f_i K_{i}(\cdot ) \right\| _{\mathscr {H}}^2 \\= & {} \sum _{i=1}^m \sum _{j=1}^m f_i f_j K(i,j) = \sum _{i=1}^m f_i^2, \end{aligned}$$

which coincides with the classical Euclidean norm of $[f_1 \ldots f_m]$. This allows us to conclude that the associated RKHS is the classical space $\ell _2$ of square summable sequences.

As a finale note, Definition 6.1 easily confirms that $\ell _2$ is a RKHS. In fact, for every $f=[f_1 \ f_2 \ \ldots ] \in \ell _2$ one has

$$ |f_i| \le \sqrt{\sum _i f_i^2} = \Vert f \Vert _2 \quad \forall i, $$

and, recalling (6.1), this shows that all the evaluation functionals $f \rightarrow f_i$ with $ i \in {\mathbb N}$ are bounded. $\square $

Example 6.5

(Sobolev space and the first-order spline kernel) While in the previous example we have seen that $\mathscr {L}_2^{\mu }$ is not a RKHS, consider now the space obtained by integrating the functions in this space. In particular, let $\mathscr {X}=[0,1]$, set $\mu $ to the Lebesgue measure and consider

$$\begin{aligned} \mathscr {H} = \left\{ f \ | \ f(x) = \int _0^x h(y) dy \ \text{ with } \ h \in \mathscr {L}_2^{\mu } \right\} . \end{aligned}$$

One thus has that any f in $\mathscr {H}$ satisfies $f(0)=0$ and is absolutely continuous: its derivative $h=\dot{f}$ is defined almost everywhere and is Lebesgue integrable.

With the inner product given by

$$ \langle f,g \rangle _{\mathscr {H}} = \langle \dot{f}, \dot{g} \rangle _{\mathscr {L}_2^{\mu }}, $$

it is easy to see that $\mathscr {H}$ is a Hilbert space. In fact, $\mathscr {L}_2^{\mu }$ is Hilbert and we have established a one-to-one correspondence between functions in $\mathscr {H}$ and $\mathscr {L}_2^{\mu }$ which preserves inner product. Such $\mathscr {H}$ is an example of Sobolev space [2] since the complexity of a function is measured by the energy of its derivative:

$$ \Vert f \Vert _{\mathscr {H}}^2 = \int _0^1 \dot{f}^2(x) dx. $$

Now, given $x \in [0,1]$, let $\chi _x(\cdot )$ be the indicator function of the set [0, x]. Then, one has

$$\begin{aligned} | f(x) |= & {} \left| \int _0^x \dot{f}(a) da \right| = \left| \langle \chi _x , \dot{f} \rangle _{\mathscr {L}_2^{\mu }} \right| \\\le & {} \Vert \dot{f} \Vert _{\mathscr {L}_2^{\mu }} = \Vert f \Vert _{\mathscr {H}}, \end{aligned}$$

where we have used the Cauchy–Schwarz inequality. Hence, $\mathscr {H}$ is also a RKHS since all the evaluations functionals are bounded. We now prove that its reproducing kernel is the so-called first-order (linear) spline kernel given by

$$\begin{aligned} K(x,y) = \min (x,y). \end{aligned}$$

(6.6)

In fact, every kernel section belongs to $\mathscr {H}$, being piecewise linear with $\dot{K}_x = \chi _x$. Furthermore, (6.6) satisfies the reproducing property since

$$\begin{aligned} \langle f, K_x \rangle _{\mathscr {H}}= & {} \langle \dot{f} , \chi _x \rangle _{\mathscr {L}_2^{\mu }} \\= & {} \int _0^x \dot{f}(y) dy = f(x). \end{aligned}$$

The linear spline kernel and some of its sections are displayed in the top panels of Fig. 6.2. $\square $

6.2.1 Reproducing Kernel Hilbert Spaces Induced by Operations on Kernels $\star $

We report some classical results about RKHSs induced by operations on kernels which can be derived from [13]. The first theorem characterizes the RKHS induced by the sum or product of two kernels.

Theorem 6.6

(RKHS induced by sum or product of two kernels, based on [13]) Let K and G be two positive definite kernels over the same domain $\mathscr {X} \times \mathscr {X}$, associated to the RKHSs $\mathscr {H}$ and $\mathscr {G}$, respectively.

The sum $K+G$, where

$$ [K+G](x,y)=K(x,y)+G(x,y), $$

is the reproducing kernel of the RKHS $\mathscr {R}$ containing functions

$$ f= h +g, \quad (h,g) \in \mathscr {H} \times \mathscr {G} $$

with

$$ \Vert f \Vert ^2_{\mathscr {R}} = \min _{h \in \mathscr {H}, g \in \mathscr {G}} \Vert h \Vert ^2_{\mathscr {H}} + \Vert g \Vert ^2_{\mathscr {G}} \ \text{ s.t. } \ f=h+g. $$

The product KG, where

$$ [KG](x,y)=K(x,y)G(x,y) $$

is instead the reproducing kernel of the RKHS $\mathscr {R}$ containing functions

$$ f= hg, \quad (h,g) \in \mathscr {H} \times \mathscr {G} $$

with

$$ \Vert f \Vert ^2_{\mathscr {R}} = \min _{h \in \mathscr {H}, g \in \mathscr {G}} \Vert h \Vert ^2_{\mathscr {H}}\Vert g \Vert ^2_{\mathscr {G}} \ \text{ s.t. } \ f=hg. $$

The second theorem instead provides the connection between two RKHSs, with the second one obtained from the first one by sampling its kernel.

Theorem 6.7

(RKHS induced by kernel sampling, based on [13]) Let $\mathscr {H}$ be the RKHS induced by the kernel $K: \mathscr {X} \times \mathscr {X} \rightarrow {\mathbb R}$. Let $\mathscr {Y} \subset \mathscr {X}$ and denote with $\mathscr {R}$ the RKHS of functions over $\mathscr {Y}$ induced by the restriction of the kernel K on $\mathscr {Y} \times \mathscr {Y}$. Then, the functions in $\mathscr {R}$ correspond to the functions in $\mathscr {H}$ sampled on $\mathscr {Y}$. One also has

$$\begin{aligned} \Vert f \Vert ^2_{\mathscr {R}} = \min _{g \in \mathscr {H}} \ \Vert g \Vert ^2_{\mathscr {H}} \ \ \text{ s.t. } \ \ g_{\mathscr {Y}}=f, \end{aligned}$$

(6.7)

where $g_{\mathscr {Y}}$ is g sampled on $\mathscr {Y}$.

The following theorem lists some operations which permit to build kernels (and hence RKHSs) from simple building blocks.

Theorem 6.8

(Building kernels from kernels, based on [13]) Let $K_1$ and $K_2$ two positive definite kernels over $\mathscr {X} \times \mathscr {X}$ and $K_3$ a positive definite kernel over $\mathbb {R}^m \times \mathbb {R}^m$. Let also P an $m \times m$ symmetric positive semidefinite matrix and $\mathscr {P}(x)$ a polynomial with positive coefficients. Then, the following functions are positive definite kernels over $\mathscr {X} \times \mathscr {X}$:

$K(x,y)=K_1(x,y) + K_2(x,y)$ (see also Theorem 6.6).
$K(x,y)=aK_1(x,y), \quad a \ge 0$.
$K(x,y)=K_1(x,y)K_2(x,y)$ (see also Theorem 6.6).
$K(x,y)=f(x)f(y), \quad f: \mathscr {X} \rightarrow \mathbb {R}$.
$K(x,y)=K_3(f(x),f(y)), \quad f: \mathscr {X} \rightarrow \mathbb {R}^m$.
$K(x,y)=x^T P y, \quad \mathscr {X}=\mathbb {R}^m$.
$K(x,y)=\mathscr {P}(K_1(x,y))$.
$K(x,y)=\exp (K_1(x,y))$.

6.3 Spectral Representations of Reproducing Kernel Hilbert Spaces

In the previous section we have seen that any RKHS is generated by its kernel sections. We now discuss another representation obtainable when the kernel can be diagonalized as follows

$$\begin{aligned} K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y), \ \ \zeta _i > 0 \ \forall i , \end{aligned}$$

(6.8)

where the set $\mathscr {I}$ is countable. This will lead to new insights on the nature of the RKHSs, generalizing to the infinite-dimensional case the connection between regularization and basis expansion reported in Sect. 5.6.

A simple situation holds when the input space has finite cardinality, e.g., $\mathscr {X}=\{x_1 \ldots x_m\}$. Under this assumption, any positive definite kernel is in one-to-one correspondence with the $m \times m$ matrix $\mathbf {K}$ whose (i, j)-entry is $K(x_i,x_j)$. The representation (6.8) then follows from the spectral theorem applied to $\mathbf {K}$. In fact, if $\zeta _i$ and $v_i$ are, respectively, the eigenvalues and the orthonormal (column) eigenvectors of $\mathbf {K}$, (6.8) can be written as

$$ \mathbf {K} = \sum _{i=1}^m \zeta _i v_i v_i^T, $$

where the functions $\rho _i(\cdot )$ have become the vectors $v_i$. One generalization of this result is described below.

Let $L_K$ be the linear operator defined by the positive definite kernel K as follows:

$$\begin{aligned} L_K[f](\cdot ) = \int _{X} K(\cdot ,x) f(x) d\mu (x). \end{aligned}$$

(6.9)

We also assume that $\mu $ is a $\sigma $-finite and nondegenerate Borel measure on $\mathscr {X}$. Essentially this means that $\mathscr {X}$ is the countable union of measurable sets with finite measure and that $\mu $ “covers” entirely $\mathscr {X}$. The reader can, e.g., consider $\mathscr {X} \subset \mathbb {R}^m$ and think of $\mu $ as the Lebesque measure or any probability measure with $\mu (A)>0$ for any non-empty open set $A \subset \mathscr {X}$. The next classical result goes under the name of Mercer theorem whose formulations trace back to [60].

Theorem 6.9

(Mercer theorem, based on [60]) Let $\mathscr {X}$ be a compact metric space equipped with a nondegenerate and $\sigma $-finite Borel measure $\mu $ and let K be a Mercer kernel on $\mathscr {X} \times \mathscr {X}$. Then, there exists a complete orthonormal basis of $\mathscr {L}_2^{\mu }$ given by a countable number of continuous functions $\{\rho _i\}_{i \in \mathscr {I}}$ satisfying

$$\begin{aligned} L_K[\rho _i] = \zeta _i \rho _i, \quad i \in \mathscr {I}, \quad \zeta _1 \ge \zeta _2 \ge \ \cdots \ \ge 0, \end{aligned}$$

(6.10)

with $\zeta _i >0 \ \forall i$ if K is strictly positive and $\lim _{i \rightarrow \infty } \zeta _i =0$ if the number of eigenvalues is infinite.

One also has

$$\begin{aligned} K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y), \end{aligned}$$

(6.11)

where the convergence of the series is absolute and uniform on $\mathscr {X} \times \mathscr {X}$.

The following result characterizes a RKHS through the eigenfunctions of $L_K$. The proof is reported in Sect. 6.9.3.

Theorem 6.10

(RKHS defined by an orthonormal basis of $\mathscr {L}_2^{\mu }$) Under the same assumption of Theorem 6.9, if the $\rho _i$ and $\zeta _i$ satisfy (6.10), with also $\zeta _i>0 \ \forall i$, one has

$$\begin{aligned} \mathscr {H} = \left\{ f \ \Big | \ f(x) = \sum _{i \in \mathscr {I}} c_i \rho _i(x) \ \text{ s.t. } \ \sum _{i \in \mathscr {I}} \frac{c_i^2}{\zeta _i } < \infty , \right\} . \end{aligned}$$

(6.12)

In addition, if

$$ f = \sum _{i \in \mathscr {I}} a_i \rho _i, \quad g = \sum _{i \in \mathscr {I}} b_i \rho _i, $$

one has

$$\begin{aligned} \langle f,g \rangle _{\mathscr {H}} = \sum _{i \in \mathscr {I}} \frac{a_i b_i}{\zeta _i}, \end{aligned}$$

(6.13)

so that

$$\begin{aligned} \Vert f\Vert _{\mathscr {H}}^2 = \sum _{i \in \mathscr {I}} \frac{a_i^2}{\zeta _i}. \end{aligned}$$

(6.14)

Hence, it also comes that $\{\sqrt{\zeta _i} \rho _i\}_{i \in \mathscr {I}}$ is an orthonormal basis of $\mathscr {H}$.

The representation (6.12) is not unique since the spectral maps, i.e., the functions that associate a kernel with a decomposition of the type (6.8), are not unique. They depend on the chosen measure $\mu $ even if they lead to the same RKHS.

Theorem 6.10 thus shows that any kernel admitting an expansion (6.11) coming from the Mercer theorem induces a separable RKHS, i.e., having a countable basis given by the $\rho _i$. Later on, Theorem 6.13 will show that such result holds under much milder assumptions. In fact, the representation (6.12) can be obtained starting from any diagonalized kernel (6.8) involving generic functions $\rho _i$, e.g., not necessarily independent of each other. One can also remove the compactness hypothesis on the input space, e.g., letting $\mathscr {X}$ be the entire $\mathbb {R}^m$.

Remark 6.2

(Relationship between $\mathscr {H}$ and $\mathscr {L}_2^{\mu }$ ) Theorem 6.10 points out an interesting connection between $\mathscr {H}$ and $\mathscr {L}_2^{\mu }$. Since the functions $\rho _i$ form an orthonormal basis in $\mathscr {L}_2^{\mu }$, one has

$$\begin{aligned} f \in \mathscr {L}_2^{\mu } \ \iff \ f= \sum _{i \in \mathscr {I}} c_i \rho _i \ \text{ with } \ \sum _{i \in \mathscr {I}} \ c_i^2 < \infty \end{aligned}$$

(6.15)

while (6.12) shows that

$$\begin{aligned} f \in \mathscr {H} \ \iff \ f= \sum _{i \in \mathscr {I}} c_i \rho _i \ \text{ with } \ \sum _{i \in \mathscr {I}} \ \frac{c_i^2}{\zeta _i} < \infty . \end{aligned}$$

(6.16)

If $\zeta _i>0 \ \forall i$, one has the set inclusion $\mathscr {H} \subset \mathscr {L}_2^{\mu }$ since the functions in the RKHS, must satisfy a more stringent condition on the expansion coefficients decay (the $\zeta _i$ decay to zero).

In addition, let $L_K^{1/2}$ denote the operator defined as the square root of $L_K$, i.e., for any $f \in \mathscr {L}_2^{\mu }$ with $f= \sum _{i \in \mathscr {I}} c_i \rho _i$, one has

$$\begin{aligned} L_K^{1/2}[f] = \sum _{i \in \mathscr {I}} \sqrt{\zeta _i}c_i \rho _i. \end{aligned}$$

(6.17)

This is a smoothing operator: the function $L_K^{1/2}[f] $ is more regular than f since the expansion coefficients $\sqrt{\zeta _i}c_i $ decrease to zero faster than the $c_i$. In view of (6.15) and (6.16), we obtain

$$\begin{aligned} \mathscr {H} = \left\{ L_K^{1/2}[f] \ \ | \ \ f \in \mathscr {L}_2^{\mu } \right\} , \end{aligned}$$

(6.18)

which shows that the RKHS can be thought of as the output of the linear system $L_K^{1/2}$ fed with the space $\mathscr {L}_2^{\mu }$, i.e., $\mathscr {H} = L_K^{1/2} \mathscr {L}_2^{\mu }$.

Example 6.11

(Spline kernel expansion) In Example 6.5, we have seen that the space of functions on the unit interval satisfying $f(0)=0$ and $\int _0^1 \dot{f}^2(x) dx < \infty $ is the RKHS associated to the first-order spline kernel $\min (x,y)$. We now derive a representation of the type (6.12) for this space setting $\mu $ to the Lebesgue measure. For this purpose, consider the system

$$ \int _0^1 \min (x,y) \rho (y) dy = \zeta \rho (x) . $$

The above equation is equivalent to

$$ \int _0^x y \rho (y) dy + x \int _x^1 \rho (y) dy = \zeta \rho (x), $$

which implies $\rho (0)=0$. Taking the derivative w.r.t. x we also obtain

$$ \int _x^1 \rho (y) dy = \zeta \dot{\rho }(x) $$

that implies $ \dot{\rho }(1)=0$. Differentiating again w.r.t. x gives

$$ -\rho (x) = \zeta \ddot{\rho }(x), $$

whose general solution is

$$ \rho (x) = a \sin (x / \sqrt{\zeta }) + b \cos (x / \sqrt{\zeta }), \quad a,b \in \mathbb {R}. $$

The boundary conditions $\rho (0)=\dot{\rho }(1)=0$ imply $b=0$ and lead to the following possible eigenvalues:

$$ \zeta _i = \frac{1}{( i \pi - \pi /2)^2}, \quad i=1,2,\ldots . $$

The orthonormality condition also implies $a=\sqrt{2}$ so that we obtain

$$ \rho _i(x) = \sqrt{2} \sin \left( i \pi x - \frac{\pi x}{2}\right) , \quad i=1,2,\ldots . $$

This provides the formulation (6.12) of the Sobolev space $\mathscr {H}$. Figure 6.3 plots three eigenfunctions (left panel) and the first 100 eigenvalues $\zeta _i$ (right panel). It is evident that the larger i the larger is the high-frequency content of $\rho _i$ and the RKHS norm of such basis function. In fact, a large value of i corresponds to a small eigenvalue $\zeta _i$ and one has $\Vert \rho _i\Vert ^2_{\mathscr {H}}=1/\zeta _i$. $\square $

Example 6.12

(Translation invariant kernels and Fourier expansion) A translation invariant kernel depends only on the difference of its two arguments. Hence, there exists $h:\mathscr {X} \rightarrow \mathbb {R}$ such that $K(x,y)=h(x-y)$. Assume that $\mathscr {X}=[0,2\pi ]$ and that h can be extended to a continuous, symmetric and periodic function over $\mathbb {R}$. Then, it can be expanded in terms of the following uniformly convergent Fourier series

$$ h(x)= \sum _{i=0}^{\infty } \ \zeta _i \cos ( i x), $$

where $\zeta _0$ accounts for the constant component and we assume $\zeta _i>0 \ \forall i$. We thus obtain the kernel expansion

$$ K(x,y) = \zeta _0 + \sum _{i=1}^{\infty } \ \zeta _{i} \cos (i x) \cos (i y) + \sum _{i=1}^{\infty } \ \zeta _{i} \sin (i x) \sin (i y), $$

in terms of functions which are all orthogonal in $\mathscr {L}_2^{\mu }$. Hence, these kernels induce RKHSs generated by the Fourier basis, with different inner products determined by $\zeta _i$. $\square $

6.3.1 More General Spectral Representation $\star $

Now, assume that the kernel K is available in the form $K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y)$ with $\zeta _i > 0 \ \forall i$, but with functions $\rho _i$ not necessarily orthonormal. More generally, we do not even require that they are independent, e.g., $\rho _1$ could be a linear combination of $\rho _2$ and $\rho _3$. The following result shows that the RKHS associated to K is still generated by the $\rho _i$, but the relationship of the expansion coefficients with $\Vert \cdot \Vert _{\mathscr {H}}$ is more involved than in the previous case.

Theorem 6.13

(RKHS induced by a diagonalized kernel) Let $\mathscr {H}$ be the RKHS induced by $K(x,y) = \sum _{i \in \mathscr {I}} \ \zeta _i \rho _i(x) \rho _i(y)$ with $\zeta _i > 0 \ \forall i$ and the set $\mathscr {I}$ countable. Then, $\mathscr {H}$ is separable and admits the representation

$$\begin{aligned} \mathscr {H} = \left\{ f \ \Big | \ f(x) = \sum _{i \in \mathscr {I}} c_i \rho _i(x) \ \text{ s.t. } \ \sum _{i \in \mathscr {I}} \frac{c_i^2}{\zeta _i } < \infty \right\} \end{aligned}$$

(6.19)

and one has

$$\begin{aligned} \Vert f\Vert _{\mathscr {H}}^2 = \min _{\{c_i\}} \sum _{i \in \mathscr {I}} \frac{c_i^2}{\zeta _i} \ \text{ s.t. } \ f = \sum _{i \in \mathscr {I}} c_i \rho _i. \end{aligned}$$

(6.20)

The proof is reported in Sect. 6.9.4 while an application example is given below.

Example 6.14

Let

$$ K(x,y) = 2 \sin ^2(x) \sin ^2(y) + 2 \cos ^2(x) \cos ^2(y) + 1. $$

Using Theorem 6.13, we obtain that the RKHS $\mathscr {H}$ associated to K is spanned by $\sin ^2(x)$, $\cos ^2(x)$ and the constant function. Now, let $f(x)=1$ and consider the problem of computing $\Vert f\Vert _{\mathscr {H}}^2$. To have a correspondence with (6.8) we can, e.g., fix the notation

$$ \rho _1(x) = \sin ^2(x), \quad \rho _2(x) = \cos ^2(x), \quad \rho _3(x)= 1 $$

and

$$ \zeta _1 = 2, \quad \zeta _2 = 2, \quad \zeta _3=1. $$

Since the functions $\rho _i$ are not independent, many different representation for $f(x)=1$ can be found. In particular, one has

$$ 1= c \rho _1(x) + c \rho _2(x) + (1-c) \rho _3(x) \quad \forall c \in \mathbb {R}, $$

so that

$$ \Vert f \Vert _{\mathscr {H}}^2 = \min _{c} \ \frac{c^2}{2} + \frac{c^2}{2} + (1-c)^2 = \min _{c} \ 2c^2 -2c +1 = \frac{1}{2} $$

with the minimum 1/2 obtained at $c=1/2$. Hence, according to the norm of $\mathscr {H}$, the “minimum energy” representation of $f(x)=1$ is $1/2(\rho _1(x)+ \rho _2(x) + \rho _3(x))$.

$\square $

6.4 Kernel-Based Regularized Estimation

6.4.1 Regularization in Reproducing Kernel Hilbert Spaces and the Representer Theorem

A powerful approach to reconstruct a function $g:\mathscr {X} \rightarrow \mathbb {R}$ from sparse data $\{x_i,y_i\}_{i=1}^N$ consists of minimizing a suitable functional over a RKHS. An important generalization of the estimators based on quadratic penalties, denoted by ReLS-Q in Chap. 3, is defined by

$$\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N}\mathscr {V}_i(y_i,f(x_i))+ \gamma \Vert f\Vert _{\mathscr {H}}^2. \end{aligned}$$

(6.21)

In (6.21), $\mathscr {V}_i$ are loss functions measuring the distance between $y_i$ and $f(x_i)$. They can take only positive values and are assumed convex w.r.t. their second argument $f(x_i)$. As an example, when the quadratic loss is adopted for any i, one obtains

$$ \mathscr {V}_i(y_i,f(x_i)) = (y_i -f(x_i))^2. $$

Then, the norm $\Vert \cdot \Vert _{\mathscr {H}}$ defines the regularizer, e.g., given by the energy of the first-order derivative

$$ \Vert f \Vert _{\mathscr {H}}^2 = \int _0^1 \dot{f}^2(x) dx, $$

which corresponds to the spline norm introduced in Example 6.5. Finally, the positive scalar $\gamma $ is the regularization parameter (already encountered in the previous chapters) which has to balance adherence to experimental data and function regularity. Indeed, the idea underlying (6.21) is that the predictor $\hat{g}$ should be able to describe the data without being too complex according to the RKHS norm. In particular, the scope of the regularizer is to restore the well-posedness of the problem, making the solution depend continuously on the data. It should also include our available information on the unknown function, e.g., the expected smoothness level.

The importance of the RKHSs in the context of regularization methods stems from the following central result, whose first formulation can be found in [52]. It shows that the solutions of the class of variational problems (6.21) admit a finite-dimensional representation, independently of the dimension of $\mathscr {H}$. The proof of an extended version of this result can be found in Sect. 6.9.5.

Theorem 6.15

(Representer theorem, adapted from [104]) Let $\mathscr {H}$ be a RKHS. Then, all the solutions of (6.21) admit the following expression

$$\begin{aligned} \hat{g} = \sum _{i=1}^N \ c_i K_{x_i}, \end{aligned}$$

(6.22)

where the $c_i$ are suitable scalar expansion coefficients.

Thus, as in the traditional linear parametric approach, the optimal function is a linear combination of basis functions. However, a fundamental difference is that their number is now equal to the number of data pairs, and is thus not fixed a priori. In fact, the functions appearing in the expression of the minimizer $\hat{g}$ are just the kernel sections $K_{x_i}$ centred on the input data. The representer theorem also conveys the message that, using estimators of the form (6.21), it is not possible to recover arbitrarily complex functions from a finite amount of data. The solution is always confined to a subspace with dimension equal to the data set size.

Now, let $\mathbf {K} \in \mathbb {R}^{N \times N}$ be the positive semidefinite matrix (called kernel matrix, or Gram matrix) such that $\mathbf {K}_{ij} = K(x_i,x_j)$. The ith row of $\mathbf {K}$ is denoted by $\mathbf {k}_i$. Using this notation, if $g = \sum _{i=1}^N \ c_i K_{x_i}$ then

$$\begin{aligned} g(x_i) = \mathbf {k}_i c \ \ \text{ and } \ \ \Vert g \Vert _{\mathscr {H}}^2 = c^T\mathbf {K}c, \end{aligned}$$

(6.23)

where $c=[c_1,\ldots ,c_N]^T$ and the second equality derives from the reproducing property or, equivalently, from (6.4).

Using the representer theorem, we can plug the expression (6.22) of the optimal $\hat{g}$ into the objective (6.21). Then, exploiting (6.23), the variational problem (6.21) boils down to

$$\begin{aligned} \min _{c \in \mathbb {R}^{N}} \sum _{i=1}^N \mathscr {V}_i (y_i,\mathbf {k}_ic)+ \gamma c^T\mathbf {K}c. \end{aligned}$$

(6.24)

The regularization problem (6.21) has been thus reduced to a finite-dimensional optimization problem whose order N does not depend on the dimension of the original space $\mathscr {H}$. In addition, since each loss function $\mathscr {V}_i$ has been assumed convex, the objective (6.24) is convex overall. How to compute the expansion coefficients now depends on the specific choice of the $\mathscr {V}_i$, as discussed in the next section.

Remark 6.3

(Kernel trick and implicit basis functions encoding) Assume that the kernel admits the expansion $K(x,y) = \sum _{i =1}^{\infty } \ \zeta _i \rho _i(x) \rho _i(y), \ \ \zeta _i > 0$. Then, as discussed in Sect. 6.3, any function in $\mathscr {H}$ has the representation

$$ f=\sum _{i =1}^{\infty } \ a_i \rho _i \ \ \text {with} \ \ \Vert f\Vert _{\mathscr {H}}^2=\sum _{j=1}^{\infty } \frac{a_j^2}{\zeta _j}. $$

Problem (6.21) can then be rewritten using the infinite-dimensional vector $a=[a_1 \ a_2 \ \ldots ]$ as unknown:

$$ \hat{a} =\arg \min _a \ \sum _{i=1}^N \mathscr {V}_i\left( y_i,\sum _{j=1}^\infty a_j \rho _j(x_i)\right) + \gamma \sum _{j=1}^{\infty } \frac{a_j^2}{\zeta _j}, $$

and an equivalent representation of (6.22) becomes $\hat{g}=\sum _{i =1}^{\infty } \ \hat{a}_i \rho _i$. In comparison to this reformulation the use of the kernel and of the representer theorem subsumes modelling and computational advantages. In fact, through K one needs neither to choose the number of basis functions to be used (the kernel can already include in an implicit way an infinite number of basis functions) nor to store any basis function in memory (the representer theorem reduces inference to solving a finite-dimensional optimization problem based on the kernel matrix $\mathbf {K}$). These features are related to what is called the kernel trick in the machine learning literature.

6.4.2 Representer Theorem Using Linear and Bounded Functionals

A more general version of the representer theorem obtained in [52] can be obtained by replacing $f(x_i)$ with $L_i[f]$, where $L_i$ is linear and bounded. In the first part of the following result $\mathscr {H}$ is just required to be Hilbert. In Sect. 6.9.5 we will see how Theorem 6.16 can be further generalized.

Theorem 6.16

(Representer theorem with functionals $L_i$, adapted from [104]) Let $\mathscr {H}$ be a Hilbert space and consider the optimization problem

$$\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N}\mathscr {V}_i(y_i,L_i[f])+ \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}$$

(6.25)

where each $L_i: \mathscr {H} \rightarrow \mathbb {R}$ is linear and bounded. Then, all the solutions of (6.25) admit the following expression

$$\begin{aligned} \hat{g} = \sum _{i=1}^N \ c_i \eta _i, \end{aligned}$$

(6.26)

where the $c_i$ are suitable scalar expansion coefficients and each $\eta _i \in \mathscr {H}$ is the representer of $L_i$, i.e., for any i and $f \in \mathscr {H}$:

$$\begin{aligned} L_i[f]=\langle f,\eta _i \rangle _{\mathscr {H}}. \end{aligned}$$

(6.27)

In particular, if $\mathscr {H}$ is a RKHS with kernel K, each basis function is given by

$$\begin{aligned} \eta _i(x) = L_i[K(\cdot ,x)]. \end{aligned}$$

(6.28)

The existence of $\eta _i$ satisfying (6.27) is ensured by the Riesz representation theorem (Theorem 6.27). One can also prove that in a RKHS a linear functional L is linear and bounded if and only if the function f obtained by applying L to the kernel, i.e., $f(x)=L[K(x,\cdot )] \ \forall x$, belongs to the RKHS.

Note also that Theorem 6.15 is indeed a special case of the last result. In fact, let $\mathscr {H}$ be a RKHS and $L_i[f]=f(x_i) \ \forall i$. Then, each $L_i$ is linear and bounded and each $\eta _i$ becomes the kernel section $K_{x_i}$ according to the reproducing property.

Example 6.17

(Solution using the quadratic loss) Let us adopt a quadratic loss in (6.25), i.e., $\mathscr {V}_i(y_i,L_i[f])=(y_i-L_i[f])^2$. This makes the objective strictly convex so that a unique solution exists. To find it, plugging (6.26) in (6.25) and using also (6.28), the following quadratic problem is obtained

$$\begin{aligned} \Vert Y-Oc \Vert ^2+ \gamma c^T O c \end{aligned}$$

(6.29)

where $Y=[y_1, \ldots ,y_N]^T$, $\Vert \cdot \Vert $ is the Euclidean norm, while the $N \times N$ matrix O has i, j entry given by

$$\begin{aligned} O_{ij}=\langle \eta _i, \eta _j \rangle _{\mathscr {H}} = L_i[L_j[K]]. \end{aligned}$$

(6.30)

The minimizer $\hat{c}$ of (6.29) is unique if O is full rank. Otherwise, all the solutions lead to the same function estimate in view of the (already mentioned) strict convexity of (6.25). In particular, one can always use as optimal expansion coefficients the components of the vector

$$\begin{aligned} \hat{c} = (O+\gamma I_N)^{-1}Y. \end{aligned}$$

(6.31)

In Sect. 6.5.1 this result will be further discussed in the context of the so-called regularization networks, where one comes back to assume $L_i[f]=f(x_i)$. $\square $

6.5 Regularization Networks and Support Vector Machines

The choice of the loss $\mathscr {V}_i$ in (6.21) yields regularization algorithms with different properties. We will illustrate four different cases below.

6.5.1 Regularization Networks

Let us consider the quadratic loss function $\mathscr {V}_i(y_i,f(x_i))= r_i^2$, with the residual $r_i$ defined by $r_i=y_i-f(x_i)$. Such a loss, also depicted in Fig. 6.4 (top left panel), leads to the problem

$$\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N} (y_i-f(x_i))^2+ \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}$$

(6.32)

which is a generalization of the regularized least squares problem encountered in the previous chapters. In particular, it extends the estimator (3.58a) based on quadratic penalty called ReLS-Q in Chap. 3. The estimator (6.32) is known in the literature as regularization network [71] or also kernel ridge regression. The strict convexity of the objective (6.32) ensures that the minimizer $\hat{g}$ not only exists but is also unique (this issue is further discussed in the remark at the end of this subsection).

To find the solution, we can follow the same arguments developed in Example 6.17, just specializing the result to the case $L_i[f]=f(x_i)$. We will see that the matrix O has just to be replaced by the kernel matrix $\mathbf {K}$.

As previously done, let $Y=[y_1, \ldots ,y_N]^T$ and use $\Vert \cdot \Vert $ to indicate the Euclidean norm. Then, the corresponding regularization problem (6.24) becomes

$$\begin{aligned} \min _{c \in \mathbb {R}^{N}} \Vert Y-\mathbf {K}c\Vert ^2 + \gamma c^T\mathbf {K}c, \end{aligned}$$

(6.33)

which is a finite-dimensional ReLS-Q. After simple calculations, one of the optimal solutions^{Footnote 2} is found to be

$$\begin{aligned} \hat{c} = \left( \mathbf {K}+\gamma I_{N}\right) ^{-1}Y, \end{aligned}$$

(6.34)

where $I_{N}$ is the $N \times N$ identity matrix. The estimate from the regularization network is thus available in closed form, given by $\hat{g} = \sum _{i=1}^N \ \hat{c}_i K_{x_i}$ with the optimal coefficient vector $\hat{c}$ solving a linear system of equations.

Remark 6.4

(Regularization network as projection) An interpretation of the regularization network can be also given in terms of a projection. In particular, let $\mathscr {R}$ be the Hilbert space $\mathbb {R}^N \times \mathscr {H}$ (any element is a couple containing a vector v and a function f) with norm defined, for any $v \in \mathbb {R}^N$ and $f \in \mathscr {H}$, by

$$\Vert (v,f)\Vert _{\mathscr {R}}^2 = \Vert v\Vert ^2 + \gamma \Vert f \Vert _{\mathscr {H}}^2, \ \ \gamma >0, \ \ \Vert \cdot \Vert = \text {Euclidean norm}. $$

Let also S be the (closed) subspace given by all the couples (v, f) satisfying the constraint $v=[f(x_1) \ldots f(x_N)]$. Then, if $g=(Y,0)$ where 0 here denotes the null function in $\mathscr {H}$, the projection of g onto S is

$$\begin{aligned} g_S= & {} \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{h \in S} \ \Vert g-h \Vert ^2_\mathscr {R} \\= & {} \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{(\{f(x_i)\}_{i=1}^N,f), \ f \in \mathscr {H}} \ \sum _{i=1}^{N} (y_i-f(x_i))^2+ \gamma \Vert f\Vert _{\mathscr {H}}^2. \end{aligned}$$

It is now immediate to conclude that $g_S$ corresponds to $([\hat{g}(x_1) \ldots \hat{g}(x_n)],\hat{g})$ where $\hat{g}$ is indeed the minimizer (6.32), which must thus be unique in view of Theorem 6.25 (Projection theorem). Note that this interpretation can be extended to all the variational problems (6.21) containing losses defined by a norm induced by an inner product in $\mathbb {R}^N$.

6.5.2 Robust Regression via Huber Loss $\star $

As described in Sect. 3.6.1, a shortcoming of the quadratic loss is its sensitivity to outliers because the influence of large residuals $r_i$ grows quadratically. In presence of outliers, one would better use a loss function that grows linearly. These issues have been widely studied in the field of robust statistics [51], where loss functions such as the Huber’s have been introduced. Recalling (3.115), one has

$$ \mathscr {V}_i(y_i,f(x_i)) = \left\{ \begin{array}{lcl} \frac{r_i^2}{2 }, \quad &{} |r_i|\le \delta \\ \delta \left( |r_i|-\frac{\delta }{2}\right) , \quad &{} |r_i| > \delta \end{array}, \right. $$

where we still have $r_i=y_i-f(x_i)$. The Huber loss function with $\delta =1$ is shown in Fig. 6.4 (top right panel). Notice that it grows linearly and is thus robust to outliers. When $\delta \rightarrow +\infty $, one recovers the quadratic loss. On the other hand, we also have $\lim _{\delta \rightarrow 0^+} \mathscr {V}_i(r)/\delta = |r_i|$ that is the absolute value loss.

6.5.3 Support Vector Regression $\star $

Sometimes, it is desirable to neglect prediction errors, as long as they are below a certain threshold. This can be achieved, e.g., using the Vapnik’s $\varepsilon $-insensitive loss given, for $r_i=y_i-f(x_i)$, by

$$ \mathscr {V}_i(y_i,f(x_i)) = |r_i|_{\varepsilon } = \left\{ \begin{array}{lcl} 0, \quad &{} |r_i| \le \varepsilon \\ |r_i|-\varepsilon , \quad &{} |r_i| > \varepsilon \end{array}. \right. $$

The Vapnik loss with $\varepsilon =0.5$ is shown in Fig. 6.4 (bottom left panel). Notice that it has a null plateau in the interval $[-\varepsilon , \varepsilon ]$ so that any predictor closer than $\varepsilon $ to $y_i$ is seen as a perfect interpolant. The loss then grows linearly, thus ensuring robustness. The regularization problem (6.21) associated with the $\varepsilon $-insensitive loss function turns out

$$\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N} |y_i-f(x_i)|_{\varepsilon }+ \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}$$

(6.35)

and is called Support Vector Regression (SVR), see, e.g., [37]. The SVR solution, given by $\hat{g} = \sum _{i=1}^N \ \hat{c}_i K_{x_i}$ according to the representer theorem, is characterized by sparsity in $\hat{c}$, i.e., some components $\hat{c}_i$ are set to zero. This feature is briefly discussed below.

In the SVR case, obtaining the optimal coefficient vector $\hat{c}$ by (6.24) is not trivial since the loss $| \cdot |_{\varepsilon }$ is not differentiable everywhere. This difficulty can be circumvented by replacing (6.24) with the following equivalent problem obtained considering two additional N-dimensional parameter vectors $\xi $ and $\xi ^*$:

$$\begin{aligned} \min _{c,\xi ,\xi ^*} \ \sum _{i=1}^N (\xi _i + \xi _i^*) + \gamma c^T\mathbf {K}c, \end{aligned}$$

(6.36)

subject to the constraints

$$\begin{aligned}&y_i - \mathbf {k}_ic \le \varepsilon + \xi _i, \quad i=1,\ldots ,N,\\&\mathbf {k}_ic - y_i \le \varepsilon + \xi ^*_i, \quad i=1,\ldots ,N,\\&\xi _i,\xi _i^* \ge 0, \qquad \qquad i=1,\ldots ,N. \end{aligned}$$

To see that its minimizer contains the optimal solution $\hat{c}$ of (6.24), it suffices noticing that (6.36) assigns a linear penalty only when $|y_i - \mathbf {k}_ic| > \varepsilon $.

Problem (6.36) is quadratic subject to linear inequality constraints, hence it is solvable by standard optimization approaches like interior point methods [64, 108]. Calculating the Karush–Kuhn–Tucker conditions, it is possible to show that the condition $|y_i - \mathbf {k}_i\hat{c}| < \varepsilon $ implies $\hat{c}_i=0$. Indexes i for which $\hat{c}_i \ne 0$ instead identify the set of input locations $x_i$ called support vectors.

6.5.4 Support Vector Classification $\star $

The three losses illustrated above were originally proposed for regression problems, with the output y real valued. When the outputs can assume only two values, e.g., 1 and −1, a classification problem arises. Here, the scope of the predictor is just to separate two classes. This problem can be seen as a special case of regression. In particular, even if the output space is binary, consider prediction functions $f: \mathscr {X} \rightarrow \mathbb {R}$ and assume that the input $x_i$ is associated to the class 1 if $f(x_i)\ge 0$ and to the class $-1$ if $f(x_i)<0$. Let the margin on an example $(x_i,y_i)$ be $m_i=y_if(x_i)$. Then, we will see that the value of $m_i$ is a measure of how well we are classifying the available data. One can thus try to maximize the margin but still searching for a function not too complex according to the RKHS norm. In particular, we can exploit (6.21) with a loss that depends on the margin as described below.

The most natural classification loss is the $0-1$ loss defined for any i by

$$ \mathscr {V}_i(y_i,f(x_i)) = \left\{ \begin{array}{lcl} 0, \quad &{} m_i >0 \\ 1, \quad &{} m_i \le 0 \end{array}, \quad m_i=y_if(x_i), \right. $$

and depicted in Fig. 6.4 (bottom right panel, dashed line). Adopting it, the first component of the objective in (6.21) returns the number of misclassifications. However, the $0-1$ loss is not convex and leads to an optimization problem of combinatorial nature.

An alternative is the so-called hinge loss [98] defined by

$$ \mathscr {V}_i(y_i,f(x_i)) = | 1 - y_i f(x_i) |_+ = \left\{ \begin{array}{lcl} 0, \quad &{} m > 1 \\ 1-m, \quad &{} m \le 1 \end{array}, \quad m=y_if(x_i), \right. $$

which thus provides a linear penalty when $m<1$. Figure 6.4 (bottom right panel, solid line) illustrates that it is a convex upper bound on the $0-1$ loss. The problem associated with the hinge loss turns out

$$\begin{aligned} \hat{g} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \sum _{i=1}^{N} |1 - y_i f(x_i)|_+ + \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}$$

(6.37)

and is called support vector classification (SVC).

Like in the SVR case, obtaining the optimal coefficient vector by (6.37) is not trivial since the hinge loss is not differentiable. But one can still resort to an equivalent problem, now obtained considering just an additional parameter vector $\xi $:

$$\begin{aligned} \min _{c,\xi } \ \sum _{i=1}^N \ \xi _i + \gamma c^T\mathbf {K}c, \end{aligned}$$

(6.38)

subject to the constraints

$$\begin{aligned}&y_i (\mathbf {k}_ic) \ge 1 - \xi _i, \quad i=1,\ldots ,N,\\&\xi _i \ge 0, \qquad \qquad \ \ i=1,\ldots ,N. \end{aligned}$$

As in the SVR case, the optimal solution $\hat{c}$ is sparse and indexes i for which $\hat{c}_i \ne 0$ define the support vectors $x_i$.

6.6 Kernels Examples

The reproducing kernel characterizes the hypothesis space $\mathscr {H}$. Together with the loss function, it also completely defines the key estimator (6.21) which exploits the RKHS norm as regularizer. The choice of K has thus a crucial impact on the ability of predicting future output data. Some important RKHSs are discussed below.

6.6.1 Linear Kernels, Regularized Linear Regression and System Identification

We now show that the regularization network (6.32) generalizes the ReLS-Q problem introduced in Chap. 3 which adopts quadratic penalties. The link is provided by the concept of linear kernel.

We start assuming that the input space is $\mathscr {X} = {\mathbb R}^m$. Hence, any input location x corresponds to an m-dimensional (column) vector. If $P \in {\mathbb R}^{m \times m}$ denotes a symmetric and positive semidefinite matrix, a linear kernel is defined as follows

$$ K(y,x) = y^T P x, \quad (x,y) \in \mathbb {R}^m \times \mathbb {R}^m. $$

All the kernel sections are linear functions. Hence, their span defines a finite-dimensional (closed) subspace of linear functions that, in view of Theorem 6.1 (and subsequent discussion) coincides with the whole $\mathscr {H}$. Hence, the RKHS induced by the linear kernel is simply a space of linear functions and, for any $g \in \mathscr {H}$, there exists $a \in {\mathbb R}^m$ such that

$$ g(x)=a^T P x=K_a(x). $$

If P is full rank, letting $\theta := P a$, we also have

$$\begin{aligned} || g ||^2_{\mathscr {H}}= & {} || K_a ||^2_{\mathscr {H}} = \langle K_a, K_a \rangle _{\mathscr {H}} \\= & {} K(a,a) = a^T P a \\= & {} \theta ^T P^{-1} \theta . \end{aligned}$$

Now, let us use such $\mathscr {H}$ in the regularization network (6.32). Without using the representer theorem, we can plug the representation $g(x)=\theta ^T x$ in the regularization problem to obtain $\hat{g}(x)=\hat{\theta }^Tx$ where

$$\begin{aligned} \hat{\theta }= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\theta \in \mathbb {R}^{m}} \ \Vert Y-\varPhi \theta \Vert ^2+ \gamma \theta ^T P^{-1} \theta , \end{aligned}$$

(6.39)

with the ith row of the regression matrix $\varPhi $ equal to $x_i^T$. One can see that (6.39) coincides with ReLS-Q, with the regularization matrix P which defines the linear kernel K and, in turn, the penalty term $\theta ^T P^{-1} \theta $.

We now derive the connection with linear system identification in discrete time. The data set consists of the output measurements $\{y_i\}_{i=1}^N$, collected on the time instants $\{t_i\}_{i=1}^N$, and of the system input u. We can form each input location using past input values as follows

$$\begin{aligned} x_i = [u_{t_i-1} \ u_{t_i-2} \ \ldots \ u_{t_i-m}]^T, \end{aligned}$$

(6.40)

where m is the FIR order and an input delay of one unit has been assumed. Then, if Y collects the noisy outputs, $\hat{\theta }$ becomes the impulse response estimate. This establishes a correspondence between regularized FIR estimation and regularization in RKHS induced by linear kernels.

6.6.1.1 Infinite-Dimensional Extensions $\star $

In place of $\mathscr {X}=\mathbb {R}^m$, let now $\mathscr {X} \subset \mathbb {R}^\infty $, i.e., the input space contains sequences. We can interpret any input location as an infinite-dimensional column vector and use ordinary notation of algebra to handle infinite-dimensional objects. For instance, if $x,y \in \mathscr {X}$ then $x^Ty=\langle x,y \rangle _2$ where $\langle \cdot ,\cdot \rangle _2$ is the inner product in $\ell _2$. Assume we are given a symmetric and infinite-dimensional matrix P such that the linear kernel

$$ K(y,x) = y^T P x $$

is well defined over a subset of $\mathbb {R}^\infty \times \mathbb {R}^\infty $. For example, if P is absolutely summable, i.e., $\sum _{ij} |P_{ij}|<\infty $, the kernel is well defined for any input location $x \in \mathscr {X}$ with $\mathscr {X}=\ell _\infty $. The kernel section centred on x is the infinite-dimensional column vector Px. Following arguments similar to those seen in the finite-dimensional case, one can conclude that the RKHS associated to such K contains linear functions of the form $g(x)=a^T P x$ with $a \in \mathscr {X}$. Roughly speaking, the regularization network (6.32) relying on such hypothesis space is the limit of Problem (6.39) for $m \rightarrow \infty $. To compute the solution, in this case it is necessary to resort to the representer theorem (6.22). One obtains

$$ \hat{g}(x) = \sum _{i=1}^N \ \hat{c}_i K_{x_i}(x) = \hat{\theta }^T x $$

where $\hat{c}$ is defined by (6.34) and

$$ \hat{\theta } := \sum _{i=1}^N \ \hat{c}_i P x_i. $$

The link with linear system identification follows the same reasoning previously developed but $x_i$ now contains an infinite number of past input values, i.e.,

$$ x_i = [u_{t_i-1} \ u_{t_i-2} \ u_{t_i-3} \ldots ]^T. $$

With this correspondence, the regularization network now implements regularized IIR estimation and $\hat{\theta }$ contains the impulse response coefficients estimates. In fact, note that the nature of $x_i$ makes the value $\hat{g}(x_i)$ the convolution between the system input u and $\hat{\theta }$ evaluated at $t_i$ (with one unit input delay).

In a more sophisticated scenario, in place of sequences, the input space $\mathscr {X}$ could contain functions. For instance, $\mathscr {X} \subset \mathscr {P}^c$ where $\mathscr {P}^c$ is the space of piecewise continuous functions on $\mathbb {R}^+$. Thus, each input location corresponds to a continuous function $x:\mathbb {R}^+ \rightarrow \mathbb {R}$. Given a suitable symmetric function $P: \mathbb {R}^+ \times \mathbb {R}^+ \rightarrow \mathbb {R}$, a linear kernel is now defined by

$$ K(y,x) = \int _{\mathbb {R}^+ \times \mathbb {R}^+} \ y(t) P(t,\tau ) x(\tau ) dt d\tau . $$

The corresponding RKHS thus contains linear functionals: any $f \in \mathscr {H}$ maps x (which is a function) into $\mathbb {R}$. The solution of the regularization network (6.32) equipped with such hypothesis space is

$$ \hat{g}(x) = \sum _{i=1}^N \ \hat{c}_i K_{x_i}(x) = \int _{\mathbb {R}^+} \hat{\theta }(\tau ) x(\tau ) d \tau , $$

where $\hat{c}$ is still defined by (6.34) and

$$ \hat{\theta }(\tau ) := \sum _{i=1}^N \ \hat{c}_i \int _{\mathbb {R}^+} \ P(\tau ,t) x_i(t) dt. $$

The connection with linear system identification is obtained by defining

$$ x_i(t) = u(t_i-t), \quad t \ge 0 $$

(if the input u(t) is continuous for $t \ge 0$ and causal, the functions $x_i(t)$ is piecewise continuous, making necessary the assumption $\mathscr {X} \subset \mathscr {P}^c$). In this way, each $g \in \mathscr {H}$ represents a different linear system. Furthermore, the regularization network (6.32) implements regularized system identification in continuous time and $\hat{\theta }$ is the continuous-time impulse response estimate. The class of kernels which include the BIBO stability constraint will be discussed in the next chapter.

6.6.2 Kernels Given by a Finite Number of Basis Functions

Assume we are given an input space $\mathscr {X}$ and m independent functions $\rho _i:\mathscr {X}\rightarrow \mathbb {R}$. Then, we define

$$ K(x,y) = \sum _{i=1}^m \rho _i(x) \rho _i(y). $$

It is easy to verify that K is a positive definite kernel. Recalling Theorem 6.13, the associated RKHS coincides with the m-dimensional space spanned by the basis functions $\rho _i$. Each function in $\mathscr {H}$ has the representation $g(x) = \sum _{i=1}^m \theta _i \rho _i(x)$ and, in view of (6.20) and the independence of the basis functions, one has

$$ \Vert g \Vert _{\mathscr {H}}^2 = \sum _{i=1}^m \ \theta _i^2. $$

Consider now the regularization network (6.32) equipped with such hypothesis space. The solution can be computed without using the representer theorem by plugging in (6.32) the expression of g as a function of $\theta $. Letting $\varPhi \in {\mathbb R}^{N \times m}$ with $\varPhi _{ij} = \rho _j(x_i)$, we obtain $\hat{g} = \sum _{i=1}^m \ \hat{\theta }_i \rho _i$ with

$$\begin{aligned} \hat{\theta } = \arg \min _{\theta \in \mathbb {R}^{m}} \ \Vert Y-\varPhi \theta \Vert ^2+ \gamma \Vert \theta \Vert ^2. \end{aligned}$$

(6.41)

The solution (6.41) coincides with the ridge regression estimate introduced in Sect. 1.2.

6.6.3 Feature Map and Feature Space $\star $

Let $\mathscr {F}$ be a space endowed with an inner product, and assume that a representation of the form

$$\begin{aligned} K(x,y) = \langle \phi (x), \phi (y) \rangle _{\mathscr {F}}, \qquad \phi :\mathscr {X}\rightarrow \mathscr {F}, \end{aligned}$$

(6.42)

is available. Then, it follows immediately that K is a positive definite kernel. In this context, $\phi $ is called a feature map, and $\mathscr {F}$ the feature space. For instance, to have the connection with the kernel discussed in the previous subsection, we can think of $\phi $ as a vector containing m functions. It is defined for any x by

$$ \phi (x)=\left( \begin{array}{c}\rho _1(x) \\ \rho _2(x) \\ \vdots \\ \rho _m(x) \end{array}\right) $$

so that $\mathscr {F}=\mathbb {R}^m$ with the Euclidean inner product. Then, we obtain

$$ K(x,y) = \langle \phi (x), \phi (y) \rangle _{2} = \phi ^T(x) \phi (y) = \sum _{i=1}^m \rho _i(x) \rho _i(y). $$

Now, given any positive definite kernel K, Theorem 6.2 (Moore–Aronszajn theorem) implies the existence of at least one feature map, namely, the RKHS map $\phi _{\mathscr {H}}:\mathscr {X} \rightarrow \mathscr {H}$ such that

$$ \phi _{\mathscr {H}}(x) = K_x, $$

where the representation (6.42) follows immediately from the reproducing property. These arguments show that K is a positive definite kernel iff there exists at least one Hilbert space $\mathscr {F}$ and a map $\phi : \mathscr {X} \rightarrow \mathscr {F}$ such that $K(x,y)=\langle \phi (x), \phi (y) \rangle _{\mathscr {F}}$.

Feature maps and feature spaces are not unique since, by introducing any linear isometry $I:\mathscr {H} \rightarrow \mathscr {F}$, one can obtain a representation in a different space:

$$ K(x,y) = \langle \phi _{\mathscr {H}}(x), \phi _{\mathscr {H}}(y) \rangle _{\mathscr {H}} = \langle I \circ \phi _{\mathscr {H}}(x), I \circ \phi _{\mathscr {H}}(y) \rangle _\mathscr {F}. $$

Now, assume that the kernel admits the decomposition (6.8), i.e.,

$$ K(x,y) = \sum _{i=1}^{\infty } \ \zeta _i \rho _i(x) \rho _i(y)$$

with $\zeta _i > 0 \ \forall i$. Then, a spectral feature map of K is

$$ \phi _{\mu }: \mathscr {X} \rightarrow \ell _2 $$

with

$$ \phi _{\mu }(x) = \{ \sqrt{\zeta _i} \rho _i(x) \}_{i=1}^{\infty }, \ \ x \in \mathscr {X}. $$

In fact, we have

$$ \langle \phi _{\mu }(x), \phi _{\mu }(y) \rangle _2 = \sum _{i=1}^{\infty } \ \zeta _i \rho _i(x) \rho _i(y) = K(x,y). $$

It is worth also pointing out the role of the feature map within the estimation scenario. In many applications, linear functions are not models powerful enough. Kernels define more expressive spaces by (implicitly) mapping the data into a high-dimensional feature space where linear machines can be applied. Then, the use of the estimator (6.21) does not require to know any feature map associated to K: the representer theorem shows that the only information needed to compute the estimate is the kernel matrix, as also discussed in Remark 6.3.

6.6.4 Polynomial Kernels

Another example of kernel is the (inhomogeneous) polynomial kernel [70]. For $x,y \in \mathbb {R}^m$, it is defined by

$$ K(x,y) = \left( \langle x, y \rangle _2 +c\right) ^p, \quad p \in \mathbb {N}, \quad c \ge 0, $$

with $\langle \cdot , \cdot \rangle _2$ to denote the classical Euclidean inner product. As an example, assume $c=1$ and $m=p=2$ with $x=[x_a \ x_b]$ and $y=[y_a \ y_b]$. Then, one obtains the kernel expansion

$$ K(x,y) = 1+ x_a^2y_a^2+x_b^2y_b^2+2x_ax_by_ay_b+2x_ay_a+2x_by_b, $$

of the type (6.8) with the $\rho _i(x_a,x_b)$ given by all the monomials of degree up to 2, i.e., the 6 functions

$$ 1, \ x_a^2, \ x_b^2, \ x_ax_b, \ x_a, \ x_b. $$

More in general, if $c>0$, the polynomial kernel induces a $\left( {\begin{array}{c}m+p\\ p\end{array}}\right) $-dimensional RKHS spanned by all possible monomials up to the pth degree. The number of basis function is thus finite but exponential in p. This simple example is in some sense opposite to that described in Sect. 6.6.2. It shows how a kernel can be used to define implicitly a rich class of basis functions.

6.6.5 Translation Invariant and Radial Basis Kernels

A kernel is said translation invariant if there exists $h:\mathscr {X} \rightarrow \mathbb {R}$ such that $K(x,y)=h(x-y)$. This class has been already encountered in Example 6.12 where its relationship with the Fourier basis (in the case of one-dimensional input space) is illustrated. A general characterization is given below, see also [80].

Theorem 6.18

(Bochner, based on [23]) A positive definite kernel K over $\mathscr {X} = \mathbb {R}^d$ is continuous and of the form $K(x,y)=h(x-y)$ if and only if there exists a probability measure $\mu $ and a positive scalar $\eta $ such that:

$$ K(x,y) = \eta \int _{\mathscr {X}} \cos \left( \langle z, x-y \rangle _2 \right) d\mu (z). $$

Translation invariant kernels include also the class of radial basis kernels (RBF) of the form $K(x,y) = h(\Vert x-y\Vert )$ where $\Vert \cdot \Vert $ is the Euclidean norm [85]. A notable example is the so-called Gaussian kernel:

$$\begin{aligned} K(x,y) = \exp \left( -\frac{\Vert x-y\Vert ^2}{\rho }\right) , \quad \rho > 0, \end{aligned}$$

(6.43)

where $\rho $ denotes the kernel width. This kernel is often used to model functions expected to be somewhat regular. Note however that $\rho $ has an important role in tuning the smoothness level. A low value makes the kernel close to diagonal so that a low norm can be assigned also to rapidly changing functions. On the other hand, as $\rho $ approaches zero, only functions close to be constant are given a low penalty. This is the same phenomenon illustrated in Fig. 6.1.

Another widely adopted kernel, which induces spaces of functions less regular than the Gaussian one, is the Laplacian kernel which uses the Euclidean norm in place of the squared Euclidean norm:

$$\begin{aligned} K(x,y) = \exp \left( -\frac{\Vert x-y\Vert }{\rho } \right) , \quad \rho > 0. \end{aligned}$$

(6.44)

Differently from the kernels described in the first part of Sect. 6.6.1, as well as in Sects. 6.6.2 and 6.6.4, the RKHS associated with any non-constant RBF kernel is infinite dimensional (it cannot be spanned by a finite number of basis functions). The associated RKHS can be shown to be dense in the space of all continuous functions defined on a compact subset $\mathscr {X} \subset \mathbb {R}^m$. This means that every continuous function can be represented in this space with the desired accuracy as measured by the sup-norm $\sup _{x \in \mathscr {X}} |f(x)|$. This property is called universality. This does not imply that the RKHS induced by a universal kernel includes any continuous function. For instance, the Gaussian kernel is universal but it has been proved that it does not contain any polynomial, including the constant function [69].

6.6.6 Spline Kernels

To simplify the exposition, let $\mathscr {X}=[0,1]$ and let also $g^{(j)}$ denote the jth derivative of g, with $g^{(0)}:=g$. Intuitively, in many circumstances an effective regularizer is obtained by penalizing the energy of the pth derivative of g, i.e., employing

$$\begin{aligned} \int _0^1 \left( g^{(p)}(x)\right) ^2 dx. \end{aligned}$$

An interesting question is whether this penalty term can be cast in the RKHS theory. For $p=1$, a positive answer has been given by Example 6.5. Actually, the answer is positive for any integer p. In fact, consider the Sobolev space of functions g whose first $p-1$ derivatives are absolutely continuous and satisfy $g^{(j)}(0)=0$ for $j=0,\ldots ,p-1$. The same arguments developed in Example 6.5 when $p=1$ can be easily generalized to prove that this is a RKHS $\mathscr {H}$ with norm

$$ \Vert g\Vert _{\mathscr {H}}^2=\int _0^1 \left( g^{(p)}(x)\right) ^2 dx. $$

The corresponding kernel is the pth-order spline kernel

$$\begin{aligned} K(x,y) = \int _0^1 G_p(x,u)G_p(y,u)du, \end{aligned}$$

(6.45)

where $G_p$ is the so-called Green’s function given by

$$\begin{aligned} G_p(x,u) = \frac{(x-u)_+^{p-1}}{(p-1)!} , \qquad (u)_+ = \left\{ \begin{array}{cl} u &{} \text{ if } ~ u \ge 0 \\ 0 &{} \text{ otherwise } \end{array} \right. . \end{aligned}$$

(6.46)

Note that the Laplace transform of $G_p(\cdot ,0)$ is $1/s^p$. Hence, the Green’s function is connected with the impulse response of a p-fold integrator. When $p=1$, we recover the linear spline kernel of Example 6.5:

$$\begin{aligned} K(x,y) = \min \{x, y\} \end{aligned}$$

(6.47)

whereas $p=2$ leads to the popular cubic spline kernel [104]:

$$\begin{aligned} K(x,y) = \frac{x y \min \{x, y\}}{2}-\frac{(\min \{x, y\})^3}{6}. \end{aligned}$$

(6.48)

The linear and the cubic spline kernel are displayed in Fig. 6.2.

We can use the spline hypothesis space in the regularization problem (6.21). Then, from the representer theorem one obtains that the estimate $\hat{g}$ is a pth-order smoothing spline with derivatives continuous exactly up to order $2p-2$ (the order’s choice is thus related to the expected function smoothness). This can be seen also from the kernels sections plotted in Fig. 6.2 for p equal to 1 and 2. For $p=2$ the (finite) sum of kernel sections provides the well-known cubic smoothing splines, i.e., piecewise third-order polynomials.

Spline functions enjoy many numerical properties originally studied in the interpolation scenario. In particular, piecewise polynomials circumvent Runge’s phenomenon (large oscillations affecting the reconstructed function) which, e.g., arises when high-order polynomials are employed [81]. Fit convergence rates are discussed, e.g., in [3, 14].

6.6.7 The Bias Space and the Spline Estimator

Bias space As discussed in Sect. 4.5, in a Bayesian setting, in some cases it can be useful to enrich $\mathscr {H}$ with a low-dimensional parametric part, known in the literature as bias space. The bias space typically consists of linear combinations of functions $\{\phi _k\}_{k=1}^m$. For instance, if the unknown function exhibits a linear trend, one may let $m=2$ and $\phi _1(x)=1,\phi _2(x)=x$. Then, one can assume that g is sum of two functions, one in $\mathscr {H} $ and the other one in the bias space. In this way, the function space becomes $\mathscr {H} + \text{ span } \{ \phi _1,\ldots , \phi _m\}$. Using a quadratic loss, the regularization problem is given by

$$\begin{aligned} (\hat{f},\hat{\theta }) = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{\begin{array}{c} f \in \mathscr {H},\\ \theta \in \mathbb {R}^m \end{array} } \sum _{i=1}^{N}\left( y_i-f(x_i)-\sum _{k=1}^{m} \theta _k \phi _k(x_i)\right) ^2+ \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}$$

(6.49)

and the overall function estimate turns out $\hat{g} = \hat{f} + \sum _{k=1}^{m} \hat{\theta }_k\phi _k$. Note that the expansion coefficients in $\theta $ are not subject to any penalty term but a low value for m avoids overfitting. The solution can be computed exploiting an extended version of the representer theorem. In particular, it holds that

$$\begin{aligned} \hat{g} = \sum _{i=1}^{N} \hat{c}_i K_{x_i} + \sum _{k=1}^{m} \hat{\theta }_k\phi _k, \end{aligned}$$

(6.50)

where, assuming that $\varPhi \in {\mathbb R}^{N \times m}$ is full column rank and $\varPhi _{ij} = \phi _j(x_i)$,

$$\begin{aligned} \hat{\theta }&= \left( \varPhi ^T A^{-1} \varPhi \right) ^{-1} \varPhi ^T A^{-1} Y \end{aligned}$$

(6.51a)

$$\begin{aligned} \hat{c}&= A^{-1} \left( Y- \varPhi \hat{\theta }\right) \end{aligned}$$

(6.51b)

$$\begin{aligned} A&= \mathbf {K} + \gamma I_{N}. \end{aligned}$$

(6.51c)

Remark 6.5

(Extended version of the representer theorem) The correctness of formulas (6.51a–6.51c) can be easily verified as follows. Fix $\theta $ to the optimizer $\hat{\theta }$ in the objective present in the rhs of (6.49). Then, we can use the representer theorem with Y replaced by $Y-\varPhi \hat{\theta }$ to obtain $\hat{f} = \sum _{i=1}^{N} \hat{c}_i K_{x_i} $ with

$$ \hat{c} = A^{-1} \left( Y- \varPhi \hat{\theta }\right) $$

with A indeed given by (6.51c). This proves (6.51b). Using the definition of A this also implies

$$ Y-\mathbf {K} \hat{c}= \varPhi \hat{\theta } + \gamma \hat{c}. $$

Now, if we fix f to $\hat{f}$, the optimizer $\hat{\theta }$ is just the least squares estimate of $\theta $ with Y replaced by $Y- \mathbf {K} \hat{c}$. Hence, we obtain

$$ \hat{\theta }= \left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T (Y- \mathbf {K} \hat{c}). $$

Using $Y-\mathbf {K} \hat{c}= \varPhi \hat{\theta } + \gamma \hat{c}$ in the expression for $\hat{\theta }$, we obtain $\left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T \hat{c}=0$. Multiplying the lhs and rhs of (6.51b) by $\left( \varPhi ^T \varPhi \right) ^{-1} \varPhi ^T$ and using this last equality, (6.51a) is finally obtained.

The spline estimator The bias space is useful, e.g., when spline kernels are adopted. In fact, the spline space of order p contains functions all satisfying the constraints $g^{(j)}(0)=0$ for $j=0,\ldots ,p-1$. Then, to cope with nonzero initial conditions, one can enrich such RKHS with polynomials up to order $p-1$. The enriched space is $\mathscr {H} \oplus \text{ span } \{1,x,\ldots ,x^{p-1}\}$, with $\oplus $ denoting a direct sum, and enjoys the universality property mentioned at the end of Sect. 6.6.5. The resulting spline estimator becomes a notable example of (6.49): it solves

$$\begin{aligned} \min _{\begin{array}{c} f \in \mathscr {H},\\ \theta \in \mathbb {R}^p \end{array} } \sum _{i=1}^{N} \left( y_i -f(x_i)-\sum _{k=1}^{p} \theta _k x_i^{k-1} \right) ^2+ \gamma \int _0^1 \left( f^{(p)}(x) \right) ^2 dx, \end{aligned}$$

(6.52)

whose explicit solution is given by (6.50) setting $\phi _k(x)=x^{k-1}$ and $\varPhi _{ij} = x_i^{j-1}$.

We consider a simple numerical example to illustrate the estimator (6.52) and the impact of different choices of $\gamma $ on its performance. The task is the reconstruction of the function $g(x)=e^{\sin (10x)}$, with $x \in [0,1]$, from 100 direct samples corrupted by white and Gaussian noise with standard deviation 0.3. The estimates coming from (6.52) with $p=2$ and three different values of $\gamma $ are displayed in the three panels of Fig. 6.5. The cubic spline estimate plotted in the top left panel is affected by oversmoothing: the too large value of $\gamma $ overweights the norm of f in the objective (6.52), introducing a large bias. Hence, the model is too rigid, unable to describe the data. The top right panel displays the opposite situation obtained adopting a too low value for $\gamma $ which overweights the loss function in (6.52). This leads to a high variance estimator: the model is overly flexible and overfits the measurements. Finally, the estimate in the bottom panel of Fig. 6.5 is obtained using the regularization parameter optimal in the MSE sense. The good trade-off between bias and variance leads to an estimate close to truth. As already pointed out in the previous chapters, the choice of $\gamma $ can thus be interpreted as the counterpart of model order selection in the classical parametric paradigm.

6.7 Asymptotic Properties $\star $

6.7.1 The Regression Function/Optimal Predictor

In what follows, we use $\mu $ to indicate a probability measure on the input space $\mathscr {X}$. For simplicity, we assume that it admits a probability density function (pdf) denoted by ${\mathrm p}_{x}$. The input locations $x_i$ are now seen as random quantities and ${\mathrm p}_{x}$ models the stochastic mechanism through which they are drawn from $\mathscr {X}$. For instance, in the system identification scenario treated in Sect. 6.6.1, each input location contains system input values, e.g., see (6.40). If we assume that the input is a stationary stochastic process, all the $x_i$ indeed follow the same pdf ${\mathrm p}_x$.

Let also $\mathscr {Y}$ indicate the output space. Then, ${\mathrm p}_{yx}$ denotes the joint pdf on $\mathscr {X} \times \mathscr {Y}$ which factorizes into ${\mathrm p}_{y|x}(y|x){\mathrm p}_{x}(x)$. Here, ${\mathrm p}_{y|x}$ is the pdf of the output y conditional on a particular realization x.

Let us now introduce some important quantities function of $\mathscr {X},\mathscr {Y}$ and $p_{yx}$. Given a function f, the least squares error associated to f is defined by

$$\begin{aligned} \mathrm {Err}(f) = \mathscr {E} (y-f(x))^2 = \int _{\mathscr {X} \times \mathscr {Y}} \ (y-f(x))^2 {\mathrm p}_{yx}(y,x) dx dy. \end{aligned}$$

(6.53)

The following result, also discussed in [33], characterizes the minimizer of $\mathrm {Err}(f)$ and has connections with Theorem 4.1.

Theorem 6.19

(The regression function, based on [33]) We have

$$ f_{\rho } = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_f \ \mathrm {Err}(f), $$

where $f_{\rho }$ is the so-called regression function defined by

$$\begin{aligned} f_{\rho }(x) = \int _{\mathscr {Y}} y {\mathrm p}_{y|x}(y|x)dy, \quad x \in \mathscr {X}. \end{aligned}$$

(6.54)

One can see that the regression function does not depend on the marginal density ${\mathrm p}_{x}$ but only on the conditional ${\mathrm p}_{y|x}$. For any given x, it corresponds to the posterior mean (Bayes estimate) of the output y conditional on x. The proof of this fact is easily obtained by first using the following decomposition

$$\begin{aligned} \mathrm {Err}(f)= & {} \int _{\mathscr {X} \times \mathscr {Y}} \ (y-f_{\rho }(x)+f_{\rho }(x)-f(x))^2 {\mathrm p}_{yx}(y,x) dx dy \\= & {} \mathscr {E}(f_{\rho }(x)-f(x))^2 + \mathscr {E}(y-f_{\rho }(x))^2\\+ & {} 2 \int _{\mathscr {X}} (f_{\rho }(x)-f(x)) \underbrace{\left( \int _{\mathscr {Y}} (y-f_{\rho }(x)) {\mathrm p}_{y|x}(y|x) dy \right) }_{0} {\mathrm p}_x(x) dx \\= & {} \mathscr {E}(f_{\rho }(x)-f(x))^2 + \mathscr {E}(y-f_{\rho }(x))^2, \end{aligned}$$

and then noticing that the very last term is independent of f.

Theorem 6.19 shows that $f_{\rho }$ is the best output predictor in the sense that it minimizes the expected quadratic loss (MSE) on a new output drawn from ${\mathrm p}_{yx}$. Now, we will consider a scenario where ${\mathrm p}_{y|x}$ (and possibly also ${\mathrm p}_{x}$) is unknown and only N samples $\{x_i,y_i\}_{i=1}^N$ from ${\mathrm p}_{yx}$ are available. We will study the asymptotic properties (N growing to infinity) of the regularized approaches previously described. The regularization network case is treated in the following subsection.

6.7.2 Regularization Networks: Statistical Consistency

Consider the following regularization network

$$\begin{aligned} \hat{g}_N= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \frac{\sum _{i=1}^{N} (y_i-f(x_i))^2}{N} + \gamma \Vert f\Vert _{\mathscr {H}}^2, \end{aligned}$$

(6.55)

which coincides with (6.32) except for the introduction of the scale factor 1/N in the quadratic loss. We have also stressed the dependence of the estimate on the data set size N. Our goal is to assess whether $\hat{g}_N$ converges to $f_{\rho }$ as $N \rightarrow \infty $ using the norm $\Vert \cdot \Vert _{\mathscr {L}_2^{\mu }}$ defined by the pdf ${\mathrm p}_x$ as follows

$$ \Vert f \Vert ^2_{\mathscr {L}_2^{\mu }} = \int _{\mathscr {X}} f^2(x){\mathrm p}_x(x)dx. $$

First, details on the data generation process are provided.

Data generation assumptions The probability measure $\mu $ on $\mathscr {X}$ is assumed to be Borel non degenerate. As already recalled, this means that realizations from ${\mathrm p}_{x}$ can cover entirely $\mathscr {X}$, without holes. This happens, e.g., when ${\mathrm p}_x(x)>0 \ \forall x \in \mathscr {X}$. The stochastic processes $x_i$ and $y_i$ are jointly stationary, with joint pdf ${\mathrm p}_{yx}$.

The study is not limited to the i.i.d. case. This is important, e.g., in system identification where, as visible in (6.40), input locations contain past input values shifted in time, hence introducing correlation among the $x_i$. Let a, b indicate two integers with $a \le b$. Then, $\mathscr {M}_a^b$ denotes the $\sigma $-algebra generated by $(x_a,y_a),\ldots ,(x_b,y_b)$. The process (x, y) is said to satisfy a strong mixing condition if there exists a sequence of real numbers $\psi _m$ such that, $\forall k,m\ge 1$, one has

$$ |P(A \cap B) - P(A)P(B) | \le \psi _i \quad \forall A \in \mathscr {M}_1^k, B \in \mathscr {M}_{k+i}^\infty $$

with

$$ \lim _{i \rightarrow \infty } \psi _i = 0. $$

Intuitively, if a, b represent different time instants, this means that the random variables tend to become independent as their temporal distance increases.

Assumption 6.20

(Data generation and strong mixing condition) The probability measure $\mu $ on the input space (having pdf ${\mathrm p}_x$) is nondegenerate. In addition, the random variables $x_i$ and $y_i$ form two jointly stationary stochastic processes, with finite moments up to the third order and satisfy a strong mixing condition. Finally, denoting with $\psi _i$ the mixing coefficients, one has

$$\begin{aligned} \sum _{i=1}^\infty \left( |\psi _i|^{1/3} \right) < \infty . \end{aligned}$$

Consistency Result

The following theorem, whose proof is in Sect. 6.9.6, illustrates the convergence in probability of (6.55) to the best output predictor.

Theorem 6.21

(Statistical consistency of the regularization networks) Let $\mathscr {H}$ be a RKHS of functions $f: \mathscr {X} \rightarrow \mathbb {R}$ induced by the Mercer kernel K, with $\mathscr {X}$ a compact metric space. Assume that $f_{\rho } \in \mathscr {H}$ and that Assumption 6.20 holds. In addition, let

$$\begin{aligned} \gamma \propto \frac{1}{N^{\alpha }}, \end{aligned}$$

(6.56)

where $\alpha $ is any scalar in $(0,\frac{1}{2})$. Then, as N goes to infinity, one has

$$\begin{aligned} \Vert \hat{g}_N - f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \longrightarrow _p 0, \end{aligned}$$

(6.57)

where $\longrightarrow _p$ denotes convergence in probability.

The meaning of (6.56) is the following one. The regularizer $\Vert \cdot \Vert _{\mathscr {H}}^2$ in (6.55) restores the well-posedness of the problem by introducing some bias in the estimation process. Intuitively, to have consistency, the amount of regularization should decay to zero as N goes to $\infty $, but not too rapidly in order to keep the variance term under control. This can be obtained making the regularization parameter $\gamma $ go to zero with the rate suggested by (6.56).

6.7.3 Connection with Statistical Learning Theory

We now discuss the class of estimators (6.21) within the framework of statistical learning theory.

Learning problem Let us consider the problem of learning from examples as defined in statistical learning. The starting point is that described in Sect. 6.7.1. There is an unknown probabilistic relationship between the variables x and y described by the joint pdf ${\mathrm p}_{yx}$ on $\mathscr {X} \times \mathscr {Y}$. We are given examples $\{x_i,y_i\}_{i=1}^N$ of this relationship, called training data, which are independently drawn from ${\mathrm p}_{yx}$. The aim of the learning process is to obtain an estimator $\hat{g}_N$ (a map from the training set to a space of functions) able to predict the output y given any $x \in \mathscr {X}$.

Generalization and consistency In the statistical learning scenario, the two fundamental properties of an estimator are generalization and consistency. To introduce them, first we introduce a loss function $\mathscr {V}(y,f(x))$, called risk functional. Then, the mean error associated to a function f is the expected risk given by

$$\begin{aligned} I(f) = \int _{\mathscr {X} \times \mathscr {Y}} \ \mathscr {V}(y,f(x)) {\mathrm p}_{yx}(y,x) dx dy. \end{aligned}$$

(6.58)

Note that, in the quadratic loss case, the expected risk coincides with the error already introduced in (6.53). Given a function f, the empirical risk is instead defined by

$$\begin{aligned} I_N(f) = \frac{1}{N} \sum _{i=1}^N \ \mathscr {V}(y_i,f(x_i)). \end{aligned}$$

(6.59)

Then, we introduce a class of functions forming the hypothesis space $\mathscr {F}$ where the predictor is searched for. The ideal predictor, also called the target function, is given by^{Footnote 3}

$$\begin{aligned} f_{0} = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ I(f). \end{aligned}$$

(6.60)

In general, even when a quadratic loss is chosen, $f_{0}$ does not coincide with the regression function $f_{\rho }$ introduced in (6.54) since $\mathscr {F}$ could not contain $f_{\rho }$.

The concepts of generalization and consistency trace back to [97, 99,100,101]. Below, recall that $\hat{g}_N$ is stochastic since it is function of the training set which contains the random variables $\{x_i,y_i\}_{i=1}^N$.

Definition 6.3

(Generalization and consistency, based on [102]) The estimator $\hat{g}_N$ (uniformly) generalizes if $\forall \varepsilon >0$:

$$\begin{aligned} \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_{yx}} \ \mathbb {P} \left\{ | I_N(\hat{g}_N) - I(\hat{g}_N) | > \varepsilon \right\} =0. \end{aligned}$$

(6.61)

The estimator is instead (universally) consistent if $\forall \varepsilon >0$:

$$\begin{aligned} \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_{yx}} \ \mathbb {P} \left\{ I(\hat{g}_N) > I(f_{0}) + \varepsilon \right\} =0. \end{aligned}$$

(6.62)

From (6.61), one can see that generalization implies that the performance on the training set (the empirical error) must converge to the “true” performance on future outputs (the expected error). The presence of the $\sup _{{\mathrm p}_{yx}}$ is then to indicate that this property must hold uniformly w.r.t. all the possible stochastic mechanisms which generate the data. Consistency, as defined in (6.62), instead requires the expected error of $\hat{g}_N$ to converge to the expected error achieved by the best predictor in $\mathscr {F}$. Note that the reconstruction of $f_{0}$ is not required. The goal is that $\hat{g}_N$ be able to mimic the prediction performance of $f_0$ asymptotically. Key issues in statistical learning theory are the understanding of the conditions on $\hat{g}_N$, the function class $\mathscr {F}$ and the loss $\mathscr {V}$ which ensure such properties.

Empirical Risk Minimization

The most natural technique to determine $f_0$ from data is the empirical risk minimization (ERM) approach where the empirical risk is optimized:

$$\begin{aligned} \hat{g}_N = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ I_N(f) = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {F}} \ \frac{1}{N} \sum _{i=1}^N \ \mathscr {V}(y_i,f(x_i)). \end{aligned}$$

(6.63)

The study of ERM has provided a full characterization of the necessary and sufficient conditions for its generalization and consistency. To introduce them, we first need to provide further details on the data generation assumptions.

Assumption 6.22

(Data generation assumptions) It holds that

the $\{x_i,y_i\}_{i=1}^N$ are i.i.d. and each couple has joint pdf ${\mathrm p}_{yx}$;
the input space $\mathscr {X}$ is a compact set in the Euclidean space;
$y \in \mathscr {Y}$ almost surely with $\mathscr {Y}$ a bounded real set;
the class of functions $\mathscr {F}$ is bounded, e.g., under the sup-norm;
$A \le \mathscr {V}(y,f(x)) \le B$, for $f \in \mathscr {F},y \in \mathscr {Y}$, with A, B finite and independent of f and y. $\square $

Note that, if the first four points hold true, in practice any loss function of interest, such as quadratic, Huber or Vapnik, satisfies the last requirement.

We now introduce the concept of $V_{\gamma }$-dimension [5]. It is a complexity measure which extends the concept of Vapnik–Chervonenkis (VC) dimension originally introduced for the indicator functions.

Definition 6.4

($V_{\gamma }$ -dimension, based on [5]) Let Assumption 6.22 hold. The $V_{\gamma }$-dimension of $\mathscr {V}$ in $\mathscr {F}$, i.e., of the set $\mathscr {V}(y,f(x)), \ f \in \mathscr {F}$, is defined as the maximum number h of vectors $(x_1,y_1),\ldots ,(x_h,y_h)$ that can be separated in all $2^h$ possible way using rules

$$\begin{aligned}&\text {Class 1:} \ \text {if} \ \mathscr {V}(y_i,f(x_i)) \ge s + \gamma ,\\&\text {Class 0:} \ \text {if} \ \mathscr {V}(y_i,f(x_i)) \le s - \gamma \end{aligned}$$

for $f \in \mathscr {F}$ and some $s\ge 0$. If, for any h, it is possible to find h pairs $(x_1,y_1),\ldots ,(x_h,y_h)$ that can be separated in all the $2^h$ possible ways, the $V_{\gamma }$-dimension of $\mathscr {V}$ in $\mathscr {F}$ is infinite.

So, the $V_{\gamma }$-dimension is infinite if, for any data set size, one can always find a function f and a set of points which can be separated by f in any possible way. Note that the required margin to distinguish the classes increases as $\gamma $ augments. This means that the $V_{\gamma }$-dimension is a monotonically decreasing function of $\gamma $.

The following definition deals with the uniform, distribution-free convergence of empirical means to expectations for classes of real-valued functions. It is related to the so-called uniform laws of large numbers.

Definition 6.5

(Uniform Glivenko Cantelli class, based on [5]) Let $\mathscr {G}$ denote a space of functions $\mathscr {Z} \rightarrow \mathscr {R}$, where $\mathscr {R} $ is a bounded real set, and let ${\mathrm p}_z$ denote a generic pdf on $\mathscr {Z}$. Then, $\mathscr {G}$ is said to be a Uniform Glivenko Cantelli (uGC) class^{Footnote 4} if

$$\begin{aligned} \forall \varepsilon>0 \quad \lim _{N \rightarrow \infty } \ \sup _{{\mathrm p}_z} \ \mathbb {P} \left\{ \sup _{g \in \mathscr {G} } \left| \frac{1}{N} \sum _{i=1}^N \ g(z_i) - \int _{\mathscr {X}} g(z)p_z(z)dz \right| > \varepsilon \right\} =0. \end{aligned}$$

It turns out that, under the ERM framework, generalization and consistency are equivalent concepts. Moreover, the finiteness of the $V_{\gamma }$-dimension coincides with the concept of uGC class relative to the adopted losses and turns out the necessary and sufficient condition for generalization and consistency [5]. This is formalized below.

Theorem 6.23

(ERM and $V_{\gamma }$-dimension, based on [5]) Let Assumption 6.22 hold. The following facts are then equivalent:

ERM (uniformly) generalizes.
ERM is (uniformly) consistent.
The $V_{\gamma }$-dimension of $\mathscr {V}$ in $\mathscr {F}$ is finite for any $\gamma >0$.
The class of functions $\mathscr {V}(y,f(x))$ with $f \in \mathscr {F}$ is uGC.

In the last point regarding the uGC class, one can follow Definition 6.5 using the correspondences $\mathscr {Z}=\mathscr {X} \times \mathscr {Y}$, $z=(x,y)$, ${\mathrm p}_{z}={\mathrm p}_{yx}$ and $\mathscr {R}=[A,B]$.

Connection with Regularization in RKHS

The connection between statistical learning theory and the class of kernel-based estimators (6.21) is obtained using as function space $\mathscr {F}$ a ball $\mathscr {B}_r$ in a RKHS $\mathscr {H}$, i.e.,

$$\begin{aligned} \mathscr {F}=\mathscr {B}_r:= \Big \{ f \in \mathscr {H} \ | \ \Vert f \Vert _{\mathscr {H}} \le r \Big \}. \end{aligned}$$

(6.64)

The ERM method (6.63) becomes

$$\begin{aligned} \hat{g}_N = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f} \ \frac{1}{N} \sum _{i=1}^N \ \mathscr {V}(y_i,f(x_i)) \quad \text {s.t.} \ \ \Vert f \Vert _{\mathscr {H}} \le r, \end{aligned}$$

(6.65)

which is an inequality constrained optimization problem. Exploiting the Lagrangian theory, we can find a positive scalar $\gamma $, function of r and of the data set size N, which makes (6.65) equivalent to

$$ \hat{g}_N = \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in \mathscr {H}} \ \frac{1}{N} \sum _{i=1}^N \ \mathscr {V}(y_i,f(x_i)) + \gamma \left( \Vert f\Vert _{\mathscr {H}}^2-r^2\right) , $$

which, apart from constants, coincides with (6.21). The question now is whether (6.65) is consistent in the sense of the statistical learning theory. The answer is positive. In fact, under Assumption 6.22, it can be proved that the class of functions $\mathscr {V}$ in $\mathscr {F}$ is uGC if $\mathscr {F}$ is uGC. In addition, one sufficient (but not necessary) condition for $\mathscr {F}$ to be uGC is that $\mathscr {F}$ be a compact set in the space of continuous functions. The following important result then holds.

Theorem 6.24

(Generalization and consistency of the kernel-based approaches, based on [33, 65]) Let $\mathscr {H}$ be any RKHS induced by a Mercer kernel containing functions $f: \mathscr {X} \rightarrow \mathbb {R}$, with $\mathscr {X}$ a compact metric space. Then, for any r, the ball $\mathscr {B}_r$ is compact in the space of continuous functions equipped with the sup-norm. It then comes that $\mathscr {B}_r$ is uGC and, if Assumption 6.22 holds, the regularized estimator (6.65) generalizes and is consistent.

Theorem 6.24 thus shows that kernel-based approaches permit to exploit flexible infinite-dimensional models with the guarantee that the best prediction performance (achievable inside the chosen class) will be asymptotically reached.

6.8 Further Topics and Advanced Reading

Basic functional analysis principles can be found, e.g., in [59, 79, 112]. The concept of RKHS was developed in 1950 in the seminal works [13, 20]. Classical books on the subject are [6, 82, 84]. RKHSs have been introduced within the machine learning community in [46, 47] leading, in conjunction with Tikhonov regularization theory [21, 96], to the development of many powerful kernel-based algorithms [42, 86].

Extensions of the theory to vector-valued RKHSs is described in [62]. This is connected to the so-called multi-task learning problem [18, 29] which deals with the simultaneous reconstruction of several functions. Here, the key point is that measurements taken on a function (task) may be informative w.r.t. the other ones, see [16, 40, 68, 95] for illustrations of the advantages of this approach. Multi-task learning will be illustrated in Chap. 9 using also a numerical example based on real pharmacokinetics data.

Mercer theorem dates back to [60] which discusses also the connection with integral equations, see also the book [50]. Extensions of the theorem to non compact domains are discussed in [94]. The first version of the representer theorem appears in [52]. It has been then subject of many generalizations which can be found in [11, 36, 83, 103, 110]. Recent works have also extended the classical formulation to the context of vector-valued functions (multi-task learning and collaborative filtering), matrix regularization problems (with penalty given by spectral functions of matrices), matricizations of tensors, see, e.g., [1, 7, 12, 54, 87]. These different types of representer theorems are cast in a general framework in [10].

The term regularization network traces back to [71] where it is illustrated that a particular regularized scheme is equal to a radial basis function network. Support vector regression and classification were introduced in [24, 31, 37, 98], see also the classical book [102]. Robust statistics are described in [51].

The term “kernel trick” was used in [83] while interpretation of kernels as inner products in a feature space was first described in [4]. Sobolev spaces are illustrated, e.g., in [2] while classical works on smoothing splines are [32, 104]. The important spline interpolation properties are described in [3, 14, 22].

Polynomial kernels were used for the first time in [70] while an application to Wiener system identification can be found in [44], as also discussed later on in Chap. 8 devoted to nonlinear system identification. An explicit (spectral) characterization of the RKHS induced by the Gaussian kernel can be found in [91, 92], while the more general case of radial basis kernels is treated in [85]. The concept of universal kernel is discussed, e.g., in [61, 90].

The strong mixing condition is discussed, e.g., in [107] and [34].

The convergence proof for the regularization network relies upon the integral operator approach described in [88] in an i.i.d. setting and its extension to the dependent case developed in [66] in the Wiener system identification context. For other works on statistical consistency and learning rates of regularized least squares in RKHS see, e.g., [48, 93, 105, 109, 111].

Statistical learning theory and the concepts of generalization and consistency, in connection with the uniform law of large numbers, date back to the works of Vapnik and Chervonenkis [97, 99,100,101]. Other related works on convergence of empirical processes are [38, 39, 73]. The concept of $V_{\gamma }$ dimension and its equivalence with the Glivenko–Cantelli class is proved in [5], see also [41] for links with RKHS. Relationships between the concept of stability of estimates (continuous dependence on the data) and generalization/consistency can be found in [63, 72], see also [26] for previous work on this subject. Numerical computation of the regularized estimate (6.21) is discussed in the literature studying the relationship between machine learning and convex optimization [19, 25, 77]. In the regularization network case (quadratic loss), if the data set size N is large, plain application of a solver with computational cost $O(N^3)$ can be highly inefficient. Then, one can use approximate representations of the kernel function [15, 53], based, e.g., on the Nyström method or greedy strategies [89, 106, 113]. One can also exploit the Mercer theorem by just using an mth-order approximation of K given by $\sum _{i=1}^m \zeta _i \rho _i(x) \rho _i(y)$. The solution obtained with this kernel may provide accurate approximations also when $m \ll N$, see [28, 43, 67, 114, 115]. Training of kernel machines can be also accelerated by using randomized low-dimensional feature spaces [74], see also [78] for insights on learning rates.

In the case of generic convex loss (different from the quadratic), one problem is that the objective is not differentiable everywhere. In this circumstance, the powerful interior point (IP) methods [64, 108] can be employed which applies damped Newton iterations to a relaxed version of the Karush–Kuhn–Tucker (KKT) equations for the objective [27]. A statistical and computational framework that allows their broad application to the problem (6.21) for a wide class of piecewise linear quadratic losses can be found in [8, 9]. In practice, IP methods exhibit a relatively fast convergence behaviour. However, as in the quadratic case, a difficulty can arise if N is very large, i.e., it may not be possible to store the entire kernel matrix in memory and this fact can hinder the application of second-order optimization techniques such as the (damped) Newton method. A way to circumvent this problem is given by the so-called decomposition methods where a subset of the coefficients $c_i$, called working set, is selected, and the associated low-dimensional sub-problem is solved. In this way, only the corresponding entries of the output kernel matrix need to be loaded into the memory, e.g., see [30, 56,57,58]. An extreme case of decomposition method is coordinate descent, where the working set contains only one coefficient [35, 45, 49].

Notes

1.
One can then also easily check that the case $\mathbf {K}_{12}=-1$ instead induces a RKHS containing only functions satisfying $f(1)=-f(2)$.
2.
Similarly to what discussed in Example 6.17, if $\mathbf {K}$ is not full rank, the solution of (6.33) is not unique. In fact, the minimizers are the sum of (6.34) and the null space of the kernel matrix. However, all of them lead to the same function estimate $\hat{g}$.
3.
Here, and also when introducing empirical risk minimization (ERM), we assume that all the introduced minimizers exist. If this does not hold true, all the concepts remain valid by resorting to the concept of almost minimizers and almost ERM, with $I(f_0):=\inf _{f \in \mathscr {F}} \ I(f)$.
4.
Sometimes, the class defined by (6.5) in terms of convergence in probability is called weak uGC while almost sure convergence leads to a strong uGC. However, it can be proved that, if Assumption 6.22 holds true and the function class is the composition of the losses with $\mathscr {F}$, the two concepts become equivalent.

References

Abernethy J, Bach F, Evgeniou T, Vert JP (2009) A new approach to collaborative filtering: operator estimation with spectral regularization. J Mach Learn Res 10:803–826
MATH Google Scholar
Adams RA, Fournier J (2003) Sobolev spaces. Academic Press
Google Scholar
Ahlberg JH, Nilson EH (1963) Convergence properties of the spline fit. J Soc Indust Appl Math 11:95–104
Article MathSciNet MATH Google Scholar
Aizerman A, Braverman EM, Rozoner LI (1964) Theoretical foundations of the potential function method in pattern recognition learning. Autom Remote Control 25:821–837
Google Scholar
Alon N, Ben-David S, Cesa-Bianchi N, Haussler D (1997) Scale-sensitive dimensions, uniform convergence, and learnability. J ACM 44(4):615–631
Article MathSciNet MATH Google Scholar
Alpay D (2003) Reproducing kernel Hilbert spaces and applications. Springer
Google Scholar
Amit Y, Fink M, Srebro N, Ullman S (2007) Uncovering shared structures in multiclass classification. In: Proceedings of the 24th international conference on machine learning, ICML ’07, New York, NY, USA. ACM, pp 17–24
Google Scholar
Aravkin A, Burke J, Pillonetto G (2012) Nonsmooth regression and state estimation using piecewise quadratic log-concave densities. In: Proceedings of the 51st IEEE conference on decision and control (CDC 2012)
Google Scholar
Aravkin A, Burke JV, Pillonetto G (2013) Sparse/robust estimation and Kalman smoothing with nonsmooth log-concave densities: modeling, computation, and theory. J Mach Learn Res 14:2689–2728
MathSciNet MATH Google Scholar
Argyriou A, Dinuzzo F (2014) A unifying view of representer theorems. In: Proceedings of the 31th international conference on machine learning, vol 32, pp 748–756
Google Scholar
Argyriou A, Micchelli CA, Pontil M (2009) When is there a representer theorem? vector versus matrix regularizers. J Mach Learn Res 10:2507–2529
MathSciNet MATH Google Scholar
Argyriou A, Micchelli CA, Pontil M (2010) On spectral learning. J Mach Learn Res 11:935–953
MathSciNet MATH Google Scholar
Aronszajn N (1950) Theory of reproducing kernels. Trans Am Math Soc 68:337–404
Article MathSciNet MATH Google Scholar
Atkinson KE (1968) On the order of convergence of natural cubic spline interpolation. SIAM J Numer Anal 5(1):89–101
Article MathSciNet MATH Google Scholar
Bach FR, Jordan MI (2005) Predictive low-rank decomposition for kernel methods. In: Proceedings of the 22nd international conference on Machine learning, ICML ’05, New York, NY, USA. ACM, pp 33–40
Google Scholar
Bakker B, Heskes T (2003) Task clustering and gating for Bayesian multitask learning. J Mach Learn Res 4:83–99
MATH Google Scholar
Bartlett PL, Long PM, Lugosi G, Tsigler A (2020) Benign overfitting in linear regression. PNAS 117:30063–30070
Article MathSciNet Google Scholar
Baxter J (1997) A Bayesian/information theoretic model of learning to learn via multiple task sampling. Mach Learn 28:7–39
Article MATH Google Scholar
Bennett KP, Parrado-Hernandez E (2006) The interplay of optimization and machine learning research. J Mach Learn Res 7:1265–1281
MathSciNet MATH Google Scholar
Bergman S (1950) The kernel function and conformal mapping. Mathematical surveys and monographs. AMS
Google Scholar
Bertero M (1989) Linear inverse and ill-posed problems. Adv Electron Electron Phys 75:1–120
Article Google Scholar
Birkhoff G, De Boor C (1964) Error bounds for spline interpolation. J Math Mech 13:827–835
MathSciNet MATH Google Scholar
Bochner S. Monotone Funktionen, Stieltjessche Integrale, und harmonische Analyse. Math Ann 108:378–410
Google Scholar
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the 5th annual ACM workshop on computational learning theory. ACM Press, pp 144–152
Google Scholar
Bottou L, Chapelle O, DeCoste D, Weston J (eds) (2007) Large scale kernel machines. MIT Press, Cambridge, MA, USA
Google Scholar
Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526
MathSciNet MATH Google Scholar
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press
Google Scholar
Carli FP, Chiuso A, Pillonetto G (2012) Efficient algorithms for large scale linear system identification using stable spline estimators. In: IFAC symposium on system identification
Google Scholar
Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75
Article MathSciNet Google Scholar
Collobert R, Bengio S (2001) SVMTorch: support vector machines for large-scale regression problems. J Mach Learn Res 1:143–160
MathSciNet MATH Google Scholar
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
MATH Google Scholar
Craven P, Wahba G (1979) Smoothing noisy data with spline functions. Numer Math 31:377–403
Article MATH Google Scholar
Cucker F, Smale S (2001) On the mathematical foundations of learning. Bull Am Math Soc 39:1–49
Article MathSciNet MATH Google Scholar
Dehling H, Philipp W (1982) Almost sure invariance principles for weakly dependent vector-valued random variables. Ann Probab 10(3):689–701
Article MathSciNet MATH Google Scholar
Dinuzzo F (2011) Analysis of fixed-point and coordinate descent algorithms for regularized kernel methods. IEEE Trans Neural Netw 22(10):1576–1587
Article Google Scholar
Dinuzzo F, Scholkopf B (2012) The representer theorem for Hilbert spaces: a necessary and sufficient condition. In: Bartlett P, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25, pp 189–196
Google Scholar
Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V (1997) Support vector regression machines. In: Advances in neural information processing systems
Google Scholar
Dudley RM, Giné E, Zinn J (1991) Uniform and universal Glivenko-Cantelli classes. J Theor Probab 4(3):485–510
Article MathSciNet MATH Google Scholar
Dudley RM (1984) École d’Été de Probabilités de Saint-Flour XII - 1982, chapter A course on empirical processes. Springer, Berlin, Heidelberg, pp 1–142
Google Scholar
Evgeniou T, Micchelli CA, Pontil M (2005) Learning multiple tasks with kernel methods. J Mach Learn Res 6:615–637
MathSciNet MATH Google Scholar
Evgeniou T, Pontil M (1999) On the ${V}_\gamma $ dimension for regression in reproducing kernel Hilbert spaces. In: Algorithmic learning theory, 10th international conference, ALT ’99, Tokyo, Japan, Dec 1999, Proceedings, lecture notes in artificial intelligence, vol 1720. Springer, pp 106–117
Google Scholar
Evgeniou T, Pontil M, Poggio T (2000) Regularization networks and support vector machines. Adv Comput Math 13:1–50
Article MathSciNet MATH Google Scholar
Ferrari-Trecate G, Williams CKI, Opper M (1999) Finite-dimensional approximation of Gaussian processes. In: Proceedings of the 1998 conference on advances in neural information processing systems. MIT Press, Cambridge, MA, USA, pp 218–224
Google Scholar
Franz MO, Schölkopf B (2006) A unifying view of Wiener and Volterra theory and polynomial kernel regression. Neural Comput 18:3097–3118
Article MathSciNet MATH Google Scholar
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Article Google Scholar
Girosi F (1998) An equivalence between sparse approximation and support vector machines. Neural Comput 10(6):1455–1480
Article Google Scholar
Girosi F, Jones M, Poggio T (1995) Regularization theory and neural networks architectures. Neural Comput 7(2):219–269
Article Google Scholar
Guo ZC, Zhou DX (2013) Concentration estimates for learning with unbounded sampling. Adv Comput Math 38(1):207–223
Article MathSciNet MATH Google Scholar
Ho CH, Lin CJ (2012) Large-scale linear support vector regression. J Mach Learn Res 13:3323–3348
MathSciNet MATH Google Scholar
Hochstadt H (1973) Integral equations. Wiley
Google Scholar
Huber PJ (1981) Robust statistics. Wiley, New York, NY, USA
Book MATH Google Scholar
Kimeldorf G, Wahba G (1970) A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann Math Stat 41(2):495–502
Article MathSciNet MATH Google Scholar
Kulis B, Sustik M, Dhillon I (2006) Learning low-rank kernel matrices. In: Proceedings of the 23rd international conference on Machine learning, ICML ’06, New York, NY, USA. ACM, pp 505–512
Google Scholar
Lafferty J, Zhu X, Liu Y (2004) Kernel conditional random fields: representation and clique selection. In: Proceedings of the twenty-first international conference on machine learning, ICML ’04, New York, NY, USA. ACM
Google Scholar
Liang T, Rakhlin A (2020) Just interpolate: kernel ridgeless regression can generalize. Ann Stat 48(3):1329–1347
Article MathSciNet MATH Google Scholar
Lin CJ (2001) On the convergence of the decomposition method for support vector machines. IEEE Trans Neural Netw 12(12):1288–1298
Google Scholar
List N, Simon HU (2004) A general convergence theorem for the decomposition method. In: Proceedings of the 17th annual conference on computational learning theory, pp 363–377
Google Scholar
List N, Simon HU (2007) General polynomial time decomposition algorithms. J Mach Learn Res 8:303–321
MathSciNet MATH Google Scholar
Megginson RE (1998) An introduction to Banach space theory. Springer
Google Scholar
Mercer J (1909) Functions of positive and negative type and their connection with the theory of integral equations. Philos Trans R Soc Lond 209(3):415–446
MATH Google Scholar
Micchelli CA, Xu Y, Zhang H (2006) Universal kernels. J Mach Learn Res 7:2651–2667
MathSciNet MATH Google Scholar
Micchelli CA, Pontil M (2005) On learning vector-valued functions. Neural Comput 17(1):177–204
Article MathSciNet MATH Google Scholar
Mukherjee S, Niyogi P, Poggio T, Rifkin R (2006) Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv Comput Math 25(1):161–193
Article MathSciNet MATH Google Scholar
Nemirovskii A, Nesterov Y (1994) Interior-point polynomial algorithms in convex programming, vol 13. SIAM, Philadelphia, PA, USA
MATH Google Scholar
Pillonetto G (2008) Solutions of nonlinear control and estimation problems in reproducing kernel hilbert spaces: existence and numerical determination. Automatica 44(8):2135–2141
Article MathSciNet MATH Google Scholar
Pillonetto G (2013) Consistent identification of Wiener systems: a machine learning viewpoint. Automatica 49(9):2704–2712
Article MathSciNet MATH Google Scholar
Pillonetto G, Bell BM (2007) Bayes and empirical Bayes semi-blind deconvolution using eigenfunctions of a prior covariance. Automatica 43(10):1698–1712
Article MathSciNet MATH Google Scholar
Pillonetto G, Dinuzzo F, De Nicolao G (2010) Bayesian on-line multi-task learning of Gaussian processes. IEEE Trans Pattern Anal Mach Intell 32(2):193–205
Article Google Scholar
Pillonetto G, Quang MH, Chiuso A (2011) A new kernel-based approach for nonlinear system identification. IEEE Trans Autom Control 56(12):2825–2840
Article MathSciNet MATH Google Scholar
Poggio T (1975) On optimal nonlinear associative recall. Biol Cybern 19(4):201–209
Article MathSciNet MATH Google Scholar
Poggio T, Girosi F (1990) Networks for approximation and learning. Proc IEEE 78:1481–1497
Google Scholar
Poggio T, Rifkin R, Mukherjee S, Niyogi P (2004) General conditions for predictivity in learning theory. Nature 428(6981):419–422
Article Google Scholar
Pollard D (1989) Asymptotics via empirical processes. J Stat Sci 4(4):341–354
MathSciNet MATH Google Scholar
Rahimi A, Recht B (2007) Random features for large-scale kernel machines. Advances in neural information processing systems, pp 1177–1184
Google Scholar
Ribeiro AH, Hendriks J, Wills A, Schön TB (2021) Beyond Occam’s razor in system identification: double-descent when modeling dynamics. In: Proceedings of the 19th IFAC symposium on system identification (SYSID), Online, July 2021
Google Scholar
Riesz F (1909) Sur les operations fonctionnelles lineaires. Comptes rendus de l’Academie des Sciences (in French) 149:974–977
MATH Google Scholar
Rockafellar RT (1970) Convex analysis. Princeton Landmarks in Mathematics. Princeton University Press
Google Scholar
Rudi A, Rosasco L (2017) Generalization properties of learning with random features. In: Advances in neural information processing systems, pp 3218–3228
Google Scholar
Rudin W (1987) Real and complex analysis. McGraw-Hill, Singapore
MATH Google Scholar
Rudin W (1990) Fourier analysis on groups. Wiley
Google Scholar
Runge C (1901) Uber empirische funktionen und die interpolation zwischen aquidistanten ordinaten. Zeitschrift für Mathematik und Physik 46:224–243
MATH Google Scholar
Saitoh S (1988) Theory of reproducing kernels and its applications, vol 189. Pitman research notes in mathematics series. Longman Scientific and Technical, Harlow
Google Scholar
Schölkopf B, Herbrich R, Smola AJ (2001) A generalized representer theorem. Neural Netw Comput Learn Theory 81:416–426
Article MathSciNet MATH Google Scholar
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. (Adaptive computation and machine learning). MIT Press
Google Scholar
Scovel C, Hush D, Steinwart I, Theiler J (2010) Radial kernels and their reproducing kernel Hilbert spaces. J Complex 26(6):641–660
Article MathSciNet MATH Google Scholar
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press
Google Scholar
Signoretto M, Tran DQ, De Lathauwer L, Suykens JAK (2014) Learning with tensors: a framework based on convex optimization and spectral regularization. Mach Learn 94(3):303–351
Google Scholar
Smale S, Zhou DX (2007) Learning theory estimates via integral operators and their approximations. Constr Approx 26:153–172
Article MathSciNet MATH Google Scholar
Smola A, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: Proceedings of the seventeenth international conference on machine learning, ICML ’00, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc., pp 911–918
Google Scholar
Sriperumbudur BK, Fukumizu K, Lanckriet G (2011) Universality, characteristic kernels and RKHS embedding of measures. J Mach Learn Res 12:2389–2410
Google Scholar
Steinwart I (2002) On the influence of the kernel on the consistency of support vector machines. J Mach Learn Res 2:67–93
MathSciNet MATH Google Scholar
Steinwart I, Hush D, Scovel C (2006) An explicit description of the reproducing kernel Hilbert space of Gaussian RBF kernels. IEEE Trans Inf Theory 52:4635–4643
Article MathSciNet MATH Google Scholar
Steinwart I, Hush D, Scovel C (2009) Learning from dependent observations. J Multivar Anal 100(1):175–194
Article MathSciNet MATH Google Scholar
Sun H (2005) Mercer theorem for RKHS on noncompact sets. J Complex 21(3):337–349
Article MathSciNet MATH Google Scholar
Thrun S, Pratt L (1997) Learning to learn. Kluwer
Google Scholar
Tikhonov AN, Arsenin VY (1977) Solutions of ill-posed problems. Winston/Wiley, Washington, D.C
MATH Google Scholar
Vapnik V (1982) Estimation of dependences based on empirical data: springer series in statistics (Springer series in statistics). Springer, New York Inc., Secaucus, NJ, USA
Google Scholar
Vapnik V (1997) The nature of statistical learning theory. Springer
Google Scholar
Vapnik V, Chervonenkis A (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab Appl 16(2):264–280
Article MATH Google Scholar
Vapnik V, Chervonenkis A (1981) Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory Probab Appl 26:532–553
Article MATH Google Scholar
Vapnik V, Chervonenkis A (1991) The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognit Image Anal 1(3):283–305
Google Scholar
Vapnik V (1998) Statistical learning theory. Wiley, New York, NY, USA
MATH Google Scholar
De Vito E, Rosasco L, Caponnetto A, Piana M, Verri A (2004) Some properties of regularized kernel methods. J Mach Learn Res 5:1363–1390
MathSciNet MATH Google Scholar
Wahba G (1990) Spline models for observational data. SIAM, Philadelphia
Book MATH Google Scholar
Wang C, Zhou DX (2011) Optimal learning rates for least squares regularized regression with unbounded sampling. J Complex 27(1):55–67
Article MathSciNet MATH Google Scholar
Williams CKI, Seeger M (2000) Using the Nyström method to speed up kernel machines. In: Proceedings of the 2000 conference on advances in neural information processing systems, Cambridge, MA, USA. MIT Press, pp 682–688
Google Scholar
Withers CS (1981) Conditions for linear processes to be strong-mixing. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 57(4):477–480
Article MathSciNet MATH Google Scholar
Wright SJ (1997) Primal-dual interior-point methods. Siam, Englewood Cliffs, N.J., USA
Book MATH Google Scholar
Wu Q, Ying Y, Zhou DX (2006) Learning rates of least-square regularized regression. Found Comput Math 6:171–192
Article MathSciNet MATH Google Scholar
Yu Y, Cheng H, Schuurmans D, Szepesvari C (2013) Characterizing the representer theorem. In: Proceedings of the 30th international conference on machine learning, pp 570–578
Google Scholar
Yuan M, Tony Cai T (2010) A reproducing kernel Hilbert space approach to functional linear regression. Ann Stat 38:3412–3444
Google Scholar
Zeidler E (1995) Applied functional analysis. Springer
Google Scholar
Zhang K, Kwok JT (2010) Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Trans Neural Netw 21(10):1576–1587
Article Google Scholar
Zhu H, Rohwer R (1996) Bayesian regression filters and the issue of priors. Neural Comput Appl 4:130–142
Article Google Scholar
Zhu H, Williams CKI, Rohwer RJ, Morciniec M (1998) Gaussian regression and optimal finite dimensional linear models. In: Neural networks and machine learning. Springer, Berlin
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Information Engineering, University of Padova, Padova, Italy
Gianluigi Pillonetto & Alessandro Chiuso
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Tianshi Chen
Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
Giuseppe De Nicolao
Department of Electrical Engineering, Linköping University, Linköping, Sweden
Lennart Ljung

Authors

Gianluigi Pillonetto
View author publications
You can also search for this author in PubMed Google Scholar
Tianshi Chen
View author publications
You can also search for this author in PubMed Google Scholar
Alessandro Chiuso
View author publications
You can also search for this author in PubMed Google Scholar
Giuseppe De Nicolao
View author publications
You can also search for this author in PubMed Google Scholar
Lennart Ljung
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Gianluigi Pillonetto .

6.9 Appendix

6.1.1 6.9.1 Fundamentals of Functional Analysis

We gather some basic functional analysis definitions and results.

Vector Spaces

We will assume that the reader is familiar with the concept of real vector space V (the field is given by the real numbers). Here, we just recall that this is a set whose elements are called vectors. The space is closed w.r.t. two operations, called addition and scalar multiplication, which satisfy the usual algebraic properties. This means that any linear and finite combination of vectors still falls in V. When the vector space contains functions $g: \mathscr {X} \rightarrow \mathbb {R}$, for any $f,g \in V$ and $\alpha \in \mathbb {R}$ the two operations are defined as follows:

$$ f+g=h \ \ \text{ where } \ \ h(x) = f(x)+g(x) \ \ \forall x \in \mathscr {X} $$

and

$$ \alpha f = h \ \ \text{ where } \ \ h(x) = \alpha f(x) \ \ \forall x \in \mathscr {X}. $$

Inner Products and Norms

An inner product on V is the function

$$ \langle \cdot , \cdot \rangle : V \times V \rightarrow \mathbb {R} $$

which is

1.
linear in the first argument
$$ \langle \alpha v + \beta y , z \rangle = \alpha \langle v , z \rangle + \beta \langle y , z \rangle , \quad v,y,z \in V \quad \alpha ,\beta \in \mathbb {R}; $$
2.
symmetric (and so also linear in the second argument)
$$ \langle v , y \rangle = \langle y , v \rangle ; $$
3.
positive, in the sense that
$$ \langle v , v \rangle \ge 0 \quad \forall v $$
with
$$ \langle v , v \rangle = 0 \ \iff \ v=0, $$
where in the r.h.s. 0 denotes the null vector.

Recall also that a norm on V is the nonnegative function

$$ \Vert \cdot \Vert : V \rightarrow \mathbb {R}^+ $$

which satisfies

1.
absolute homogeneity
$$ \Vert \alpha v \Vert = |\alpha | \Vert v\Vert , \quad v \in V \quad \alpha \in \mathbb {R}; $$
2.
the triangle inequality
$$ \Vert v + y \Vert \le \Vert v \Vert + \Vert y \Vert ; $$
3.
null vector condition
$$ \Vert v \Vert = 0 \ \iff \ v=0. $$

The norm induced by the inner product $\langle \cdot , \cdot \rangle $ is given by

$$ \Vert v\Vert ^2 = \langle v, v \rangle , $$

and it is easy to check that this function indeed satisfies all the three norm axioms listed above. One also has the Cauchy–Schwarz inequality

$$ | \langle v, y \rangle | \le \Vert v\Vert \Vert y\Vert . $$

Finally, recall that both $\langle \cdot , x \rangle $ with $x \in V$ and $\Vert \cdot \Vert $ are examples of continuous functionals $V \rightarrow \mathbb {R}$, i.e., if $\lim _{j \rightarrow \infty } \Vert v-v_j\Vert =0$, then

$$ \lim _{j \rightarrow \infty } \Vert v_j\Vert = \Vert v\Vert , \quad \lim _{j \rightarrow \infty } \langle v_j, x \rangle = \langle v, x \rangle \ \forall x \in V. $$

Hilbert and Banach Spaces

A Hilbert space $\mathscr {H}$ is a vector space equipped with an inner product $\langle \cdot , \cdot \rangle $ which is complete w.r.t. to the norm $\Vert \cdot \Vert $ induced by such inner product. This means that, given any Cauchy sequence, i.e., a sequence of vectors $\{g_j\}_{j=1}^\infty $ such that

$$ \lim _{m,n \rightarrow \infty } \ \Vert g_m-g_n \Vert = 0, $$

there exists $g \in \mathscr {H}$ such that

$$ \lim _{j \rightarrow \infty } \Vert g- g_j \Vert =0. $$

In other words, every Cauchy sequence is convergent. Examples of Hilbert spaces used in this book are

the classical Euclidean space $\mathbb {R}^m$ of vectors $a=[a_1 \ \ldots \ a_m]$ equipped with the classical Euclidean inner product
$$ \langle a, b \rangle _2 = \sum _{i=1}^m a_i b_i $$
sometimes denoted just by $\langle \cdot , \cdot \rangle $ in the book;
the space $\ell _2$ of squared summable real sequences $a=[a_1 \ a_2 \ldots ]$, i.e., such that
$$ \sum _{i=1}^\infty a_i^2 < \infty , $$
equipped with the inner product
$$ \langle a, b \rangle _2 = \sum _{i=1}^\infty a_i b_i; $$
the classical Lebesgue space $\mathscr {L}_2$ of functions (where the measure $\mu $ is here omitted to simplify notation) $g: \mathscr {X} \rightarrow \mathbb {R}$ which are squared summable w.r.t. the measure $\mu $, i.e., such that
$$ \int _{ \mathscr {X}} g^2(x) d\mu (x) < \infty , $$
equipped with the inner product still denoted by $\langle \cdot , \cdot \rangle _2 $ but now given by
$$ \langle g, f \rangle _2 = \int _{ \mathscr {X}} g(x)f(x) d\mu (x). $$

The spaces reported above are also instances of metric spaces where, for every couple of vectors f, g, there is a notion of distance defined by $\Vert f-g\Vert $. Other metric spaces are the Banach spaces. They are normed vector spaces complete w.r.t. the metric induced by their norm. Hence, every Hilbert space is a Banach space but the converse is not true: this happens when $\Vert \cdot \Vert $ does not derive from an inner product. Examples of Banach spaces (whose norm does not derive from an inner product) are

the space $\ell _1$ of absolutely summable real sequences $a=[a_1 \ a_2 \ldots ]$, i.e., such that
$$ \sum _{i=1}^\infty |a_i| < \infty , $$
equipped with the norm
$$ \Vert a \Vert _1 = \sum _{i=1}^\infty |a_i|; $$
the Lebesgue space $\mathscr {L}_1$ of functions $g: \mathscr {X} \rightarrow \mathbb {R}$ absolutely integrable w.r.t. the measure $\mu $, i.e., such that
$$ \int _{\mathscr {X}} |g(x)| d\mu (x) < \infty , $$
equipped with the norm
$$ \Vert g \Vert _1 = \int _{\mathscr {X}} |g(x)| d\mu (x); $$
the space $\ell _{\infty }$ of bounded real sequences $a=[a_1 \ a_2 \ldots ]$, i.e., such that
$$ \sup _i |a_i| < \infty , $$
equipped with the norm
$$ \Vert a \Vert _{\infty } = \sup _i |a_i|; $$
the space $\mathscr {C}$ of continuous functions $g: \mathscr {X} \rightarrow \mathbb {R}$. where $\mathscr {X}$ is a compact set typically in $\mathbb {R}^m$, equipped with the sup-norm (also called uniform norm)
$$ \Vert g \Vert _{\infty } = \max _{x \in \mathscr {X}} |g(x)|; $$
the Lebesgue space $\mathscr {L}_\infty $ of functions $g: \mathscr {X} \rightarrow \mathbb {R}$ which are essentially bounded w.r.t. the measure $\mu $, i.e., for any g there exists M such that
$$ |g(x)| \le M \ \text{ almost } \text{ everywhere } \text{ in } \mathscr {X} \text{ w.r.t. } \text{ the } \text{ measure } \mu , $$
equipped with the norm
$$ \Vert g \Vert _\infty = \inf \left\{ M \ | \ |g(x)| \le M \ \text{ almost } \text{ everywhere } \text{ in } \mathscr {X} \text{ w.r.t. } \text{ the } \text{ measure } \mu \right\} . $$

An infinite-dimensional Hilbert (or Banach) space is said to be separable if it admits a countable basis $\{ \rho _j \}_{j=1}^\infty $, i.e., for any g in the space we can find scalars $c_j$ such that

$$ \lim _{j \rightarrow \infty } \Big \Vert g -\sum _{j=1}^\infty c_j \rho _j \Big \Vert = 0. $$

When such vectors $\{ \rho _j \}$ satisfy also the conditions

$$ \Vert \rho _j\Vert =1 \ \forall j, \quad \langle \rho _j, \rho _i \rangle = 0 \ \ j \ne i, $$

then the basis is said to be orthonormal.

Subspaces, Projections and Compact Sets

A subset S of the vector space V is said to be a subspace if S is itself a vector space with the same addition and multiplication operations defined in V. The symbol

$$\text{ span }( \{ \rho _j \}_{ j \in A })$$

denotes the subspace containing all the finite linear combinations of vectors taken from the (possibly uncountable) family $\{ \rho _j \}_{ j \in A }$.

Given a subspace (or simply a set) S contained in a Hilbert (or Banach) space, we define

$$ \bar{S} = S \ \cup \ \left\{ \text{ all } \text{ the } \text{ limits } \text{ of } \text{ Cauchy } \text{ sequences } \text{ built } \text{ using } \text{ vectors } \text{ in } \text{ S } \right\} . $$

Then, S is said to be closed if

$$ \bar{S} = S. $$

The orthogonal to a subspace S of a Hilbert space is denoted by $S^\perp $ and defined by

$$ S^\perp = \left\{ g \ | \ \langle g,f \rangle = 0 \ \forall f \in S \right\} . $$

It is easy to prove that $S^\perp $ is always a closed subspace.

The following fundamental theorem holds.

Theorem 6.25

(Projection theorem) Let S be a closed subspace of a Hilbert space with norm $\Vert \cdot \Vert _\mathscr {H}$. Then, one has

any $g \in \mathscr {H}$ has a unique decomposition
$$ g = g_S + g_{S^\perp }, \quad g_S \in S, \ g_{S^\perp } \in S^\perp ; $$
$g_S$ is the projection of g onto S, i.e.,
$$ g_S= \displaystyle \mathop {{\text {arg}}\,{\text {min}}}_{f \in S} \ \Vert g-f \Vert _\mathscr {H}; $$
it holds that
$$ \Vert g \Vert _\mathscr {H}^2 = \Vert g_S \Vert _\mathscr {H}^2 + \Vert g_{S^\perp } \Vert _\mathscr {H}^2. $$

A set A contained in a Hilbert (or Banach) space with norm $\Vert \cdot \Vert $ is said to be compact if, given any sequence $\{g_j\}$ of vectors all contained in A, it is possible to extract a subsequence $\{g_{k_j}\}$ convergent in A, i.e., there exists $g \in A$ such that

$$ \lim _{j \rightarrow \infty } \ \Vert g-g_{k_j} \Vert = 0. $$

When the space is finite-dimensional, a set is compact iff it is closed and bounded.

Linear and Bounded Functionals

Given a Hilbert space $\mathscr {H}$ with norm $\Vert \cdot \Vert _{\mathscr {H}}$, a functional $L: \mathscr {H} \rightarrow \mathbb {R}$ is said to be bounded (or, equivalently, continuous) if there exists a positive scalar C such that

$$\begin{aligned} | L[g] | \le C \Vert g\Vert _{\mathscr {H}}, \quad \forall g \in \mathscr {H}. \end{aligned}$$

(6.66)

The following classical theorem holds.

Theorem 6.26

(Closed graph theorem) Let $\mathscr {H}$ be a Hilbert (or Banach) space. Then $L: \mathscr {H} \rightarrow \mathbb {R}$ is linear and bounded if and only if the graph of L, i.e.,

$$ \text {Gr}(L)=\left\{ (f,L[f]) \ \text {with} \ f \in \mathscr {H} \right\} , $$

is a closed set in the product space $\mathscr {H} \times \mathbb {R}$. This means that if $\{ f_i \}_{i=1}^{+\infty }$ is a sequence converging to $f \in \mathscr {H}$ and $\{L[f_i]\}_{i=1}^{+\infty }$ converges to $y \in \mathbb {R}$, then $L[f]=y$.

This other fundamental theorem asserts that every linear and bounded functional over $\mathscr {H}$ is in one-to-one correspondence with a vector in $\mathscr {H}$.

Theorem 6.27

(Riesz representation theorem, based on [76]) Let $\mathscr {H}$ be a Hilbert space and let $L: \mathscr {H} \rightarrow \mathbb {R}$. Then L is linear and bounded if and only there is a unique $f \in \mathscr {H}$ such that

$$\begin{aligned} L[g] = \langle g, f \rangle _{\mathscr {H}}, \quad \forall g \in \mathscr {H}. \end{aligned}$$

(6.67)

6.1.2 6.9.2 Proof of Theorem 6.1

First, we derive two lemmas which are instrumental to the main proof.

Lemma 6.1

Let

$$ S=\text{ span }( \{ K_x \}_{ x \in \mathscr {X} }). $$

If there exists a Hilbert space $\mathscr {H}$ satisfying conditions (6.2) and (6.3), then $\mathscr {H}$ is the closure of S, i.e., $\mathscr {H} = \bar{S}$.

Proof

It comes from condition (6.2) that $\bar{S}$ is a closed subspace which must belong to $\mathscr {H}$. Theorem 6.25 (Projection theorem) then ensures that any function $f \in \mathscr {H}$ can be written as

$$ f = f_{\bar{S}} + f_{\bar{S}^\perp }, \quad f_{\bar{S}} \in \bar{S}, \ f_{\bar{S}^\perp } \in \bar{S}^\perp . $$

As for the component $f_{\bar{S}^\perp }$, using condition (6.3) (reproducing property) we obtain

$$ f_{\bar{S}^\perp }(x) = \langle f_{\bar{S}^\perp } , K_x \rangle _{\mathscr {H}} = 0, \ \forall x. $$

In fact, every kernel section belongs to S and is thus orthogonal to every function in $\bar{S}^\perp $. Hence, $f_{\bar{S}^\perp }$ is the null vector and this concludes the proof. $\square $

Lemma 6.2

Let $S=\text{ span }( \{ K_x \}_{ x \in \mathscr {X} })$ and define

$$\begin{aligned} \Vert f \Vert _{\mathscr {H}}^2 = \sum _{i=1}^m \sum _{j=1}^m c_i c_j K(x_i,x_j), \end{aligned}$$

(6.68)

where f is a generic element in S, hence admitting representation

$$\begin{aligned} f(\cdot ) = \sum _{i=1}^{m} c_i K_{x_i}(\cdot ). \end{aligned}$$

Then, $\Vert \cdot \Vert _{\mathscr {H}}$ is a well-defined norm in S.

Proof

The reader can easily check that absolute homogeneity and the triangle inequality are satisfied by $\Vert \cdot \Vert _{\mathscr {H}}$. We only need to prove the null vector condition, i.e., that for every $f \in S$ one has

$$ \Vert f \Vert _{\mathscr {H}} = 0 \ \iff \ f=0. $$

Now, assume that $\Vert f \Vert _{\mathscr {H}} = 0$ where $f(\cdot ) = \sum _{i=1}^{m} c_i K_{x_i}(\cdot )$. While the coefficients $\{c_i\}_{i=1}^m$ and the input locations $\{x_i\}_{i=1}^m$ are fixed and define f, let also $c_{m+1}$ and $x_{m+1}$ be an arbitrary scalar and input location, respectively. Define $\mathbf {K} \in \mathbb {R}^{m \times m}$ and $\mathbf {K} _+ \in \mathbb {R}^{m+1 \times m+1}$ two matrices with (i, j)-entry given by $K(x_i,x_j)$. Let also $c=[c_1 \ \ldots \ c_m]^T$ and $c_+=[c_1 \ \ldots \ c_m \ c_{m+1}]^T$. Note that $\mathbf {K} c$ is the vector which contains the values of f on the input locations $\{x_i\}_{i=1}^m$.

Since K is positive definite, it holds that

$$ c_+^T \mathbf {K} _+ c_+ \ge 0 \quad \forall \ (c_{m+1},x_{m+1}) \in (\mathbb {R} \times \mathscr {X}). $$

In addition, since by assumption

$$ \Vert f \Vert ^2_{\mathscr {H}} = c^T \mathbf {K} c =0, $$

it comes that the components of the vector $\mathbf {K} c$, which are the values of f on $\{x_i\}_{i=1}^m$, are all null. Now, we show that $f(x)=0$ holds everywhere, also on the generic input location $x_{m+1} \in \mathscr {X}$. In fact, after simple calculations, one obtains

$$\begin{aligned} c_+^T \mathbf {K} _+ c_+= & {} c^T \mathbf {K} c +2 \left[ \sum _{i=1}^m c_i K(x_i,x_{m+1}) \right] c_{m+1} + K(x_{m+1},x_{m+1}) c_{m+1}^2\\= & {} 2 \left[ \sum _{i=1}^m c_i K(x_i,x_{m+1}) \right] c_{m+1} + K(x_{m+1},x_{m+1}) c_{m+1}^2\\= & {} 2 f(x_{m+1}) c_{m+1} + K(x_{m+1},x_{m+1}) c_{m+1}^2. \end{aligned}$$

Now, assume that $f(x_{m+1})>0$. Then, since the last term on the r.h.s. is infinitesimal of order two w.r.t. $c_{m+1}$ we can find a negative value for $c_{m+1}$ sufficiently close to zero such that $c_+^T \mathbf {K} _+ c_+ <0$ which contradicts the fact that K is positive definite. If $f(x_{m+1})<0$ we can instead find a positive value for $c_{m+1}$ sufficiently close to zero such that $c_+^T \mathbf {K} _+ c_+ <0$, which is still a contradiction. Hence, we must have $f(x_{m+1}) = 0$. Since $x_{m+1}$ was arbitrary, we conclude that f must be the null function.

$\square $

We now prove Theorem 6.1. Let $S=\text{ span }( \{ K_x \}_{ x \in \mathscr {X} })$ and, for any $f,g \in S$ having representations

$$\begin{aligned} f(\cdot ) = \sum _{i=1}^{m} c_i K_{x_i}(\cdot ), \quad g(\cdot ) = \sum _{i=1}^{p} d_i K_{y_i}(\cdot ) \end{aligned}$$

define

$$ \langle f,g \rangle _{\mathscr {H}} = \sum _{i=1}^m \sum _{j=1}^p c_i d_j K(x_i,y_j). $$

By Lemma 6.2, it is immediate to check that $\langle \cdot , \cdot \rangle _{\mathscr {H}}$ is a well-defined inner product on S. Then, we now show that the desired Hilbert space is $\mathscr {H} = \bar{S}$, where $\bar{S}$ is the completion of S w.r.t. the norm induced by $\langle \cdot , \cdot \rangle _{\mathscr {H}}$.

Condition (6.2) is trivially satisfied since, by construction, all the kernel sections belong to $\mathscr {H}$.

As for the condition (6.3), we start checking that it holds over S. Introducing the couple of functions in S given by

$$\begin{aligned} f(\cdot ) = \sum _{i=1}^{m} c_i K_{x_i}(\cdot ), \quad g(\cdot ) = K_{x}(\cdot ), \end{aligned}$$

we have

$$ \langle f, K_x \rangle _{\mathscr {H}} = \langle f, g \rangle _{\mathscr {H}} = \sum _{i=1}^m c_i K(x_i,x) = f(x), $$

showing that the reproducing property holds in S. Let us now consider the completion of S. To this aim, let $\{f_j\}$ be a Cauchy sequence with $f_j \in S \ \forall j$. We have

$$\begin{aligned} | f_i(x) - f_j(x)|= & {} | \langle f_i-f_j, K_x \rangle _{\mathscr {H}} | \\\le & {} \Vert f_i -f_j \Vert _{\mathscr {H}} \Vert K_x \Vert _{\mathscr {H}}, \end{aligned}$$

where we have used first the reproducing property (since it holds in S) and then the Cauchy–Schwarz inequality. We have

$$ \Vert K_x \Vert _{\mathscr {H}} = | \sqrt{\langle K_x, K_x \rangle _{\mathscr {H}}} | = \sqrt{K(x,x)} \le q < + \infty , $$

where the scalar q independent of x exists because the kernel K is continuous over the compact $\mathscr {X} \times \mathscr {X}$. Combining the last two inequalities leads to

$$\begin{aligned} | f_i(x) - f_j(x)| \le \sup _{x \in \mathscr {X}} | f_i(x) - f_j(x)| \le q \Vert f_i -f_j \Vert _{\mathscr {H}}, \end{aligned}$$

(6.69)

which shows that the convergence in $\mathscr {H}$ implies also uniform convergence. In other words, if $f_j \rightarrow f$ in $\mathscr {H}$ w.r.t. $\Vert \cdot \Vert _\mathscr {H}$, then $f_j \rightarrow f$ also in the space $\mathscr {C}$ of continuous functions w.r.t. the sup-norm $\Vert \cdot \Vert _{\infty }$. Since $S \subset \mathscr {C}$ and $\mathscr {C}$ is Banach, all the functions in the completion of S are continuous, i.e., $\mathscr {H} \subset \mathscr {C}$. Furthermore, if $f_j \rightarrow f$ in $\mathscr {H}$, one has that for any $x \in \mathscr {X}$

$$\begin{aligned} \lim _{j \rightarrow \infty } \langle f_j, K_x \rangle _\mathscr {H} = \langle f, K_x \rangle _\mathscr {H}, \end{aligned}$$

by the continuity of the inner product. But we also have

$$\begin{aligned} \lim _{j \rightarrow \infty } \langle f_j, K_x \rangle _\mathscr {H} = \lim _{j \rightarrow \infty } f_j(x) =f(x), \end{aligned}$$

since $f_j \in S \ \forall j$, the reproducing property holds in S and convergence in $\mathscr {H}$ implies uniform (and, hence, pointwise) convergence. This shows that $\langle f, K_x \rangle _\mathscr {H} = f(x) \ \forall f \in \mathscr {H}$, i.e., the reproducing property holds over all the space $\mathscr {H}$.

The last point is the unicity of $\mathscr {H}$. For the sake of contradiction, assume that there exists another Hilbert space $\mathscr {G}$ which satisfies conditions (6.2) and (6.3). By Lemma 6.1, we must have $\mathscr {G}=\bar{S}$ where the completion of S is w.r.t. the norm $\Vert \cdot \Vert _\mathscr {G}$ deriving from the inner product $\langle \cdot ,\cdot \rangle _{\mathscr {G}}$. Condition (6.3) holds both in $\mathscr {H}$ and in $\mathscr {G}$, so that we have

$$ \langle K_x,K_s\rangle _{\mathscr {H}} = K(x,s) = \langle K_x,K_s\rangle _{\mathscr {G}}, \ \forall (x,s) \in \mathscr {X} \times \mathscr {X}. $$

Since the functions in S are finite linear combinations of kernel sections, by the linearity of the inner product, the above equality allows to conclude that

$$ \langle f,g \rangle _{\mathscr {H}} = \langle f,g \rangle _{\mathscr {G}}, \ \forall (f,g) \in S \times S. $$

Such an equality, together with the uniqueness of limits, implies that the completion of S w.r.t. $\Vert \cdot \Vert _\mathscr {H}$ coincides with the completion w.r.t. $\Vert \cdot \Vert _\mathscr {G}$. Hence, $\mathscr {H}$ and $\mathscr {G}$ are the same Hilbert space and this completes the proof.

6.1.3 6.9.3 Proof of Theorem 6.10

It is not difficult to see that (6.12) with the inner product (6.13) is a Hilbert space. In addition, using the Mercer theorem, in particular the expansion (6.11), from (6.13) one has

$$\begin{aligned} \Vert K_x \Vert _{\mathscr {H}}^2= & {} \Vert \sum _{i \in \mathscr {I}} \zeta _i \rho _i(x) \rho _i(\cdot ) \Vert _{\mathscr {H}}^2\\= & {} \sum _{i \in \mathscr {I}} \frac{\zeta _i^2 \rho ^2_i(x)}{\zeta _i} = K(x,x) < \infty , \end{aligned}$$

and, for any $f = \sum _{i \in \mathscr {I}} a_i \rho _i$, it also holds that

$$\begin{aligned} \langle K_x, f \rangle _{\mathscr {H}}= & {} \langle \sum _{i \in \mathscr {I}} \zeta _i \rho _i(x) \rho _i(\cdot ), \sum _{i \in \mathscr {I}} a_i \rho _i(\cdot ) \rangle _{\mathscr {H}} \\= & {} \sum _{i \in \mathscr {I}} \frac{\zeta _i \rho _i(x) a_i}{\zeta _i} =f(x). \end{aligned}$$

This shows that every kernel section belongs to $\mathscr {H}$ and the reproducing property holds. Theorem 6.1 then ensures that $\mathscr {H}$ is indeed the RKHS associated to K.

6.1.4 6.9.4 Proof of Theorem 6.13

First, let $\mathscr {H}$ be the RKHS induced by $K(x,y)=\zeta \rho (x) \rho (y)$. Any RKHS is spanned by its kernel sections, hence in this case $\mathscr {H}$ is the one-dimensional subspace generated by $\rho $. By the reproducing property it holds that

$$ \Vert K_x \Vert ^2_{\mathscr {H}} = K(x,x)= \zeta \rho ^2(x). $$

In addition, one has

$$ \Vert K_x \Vert ^2_{\mathscr {H}} = \Vert \zeta \rho (x) \rho \Vert ^2_{\mathscr {H}}= \zeta ^2 \rho ^2(x) \Vert \rho \Vert ^2_{\mathscr {H}}, $$

so that

$$ \Vert \rho \Vert ^2_{\mathscr {H}} = \frac{1}{\zeta }. $$

Now, consider the kernel of interest $K(x,y) = \sum _{i=1}^{\infty } \zeta _i \rho _i(x) \rho _i(y)$ associated with $\mathscr {H}$. Define $K_j(x,y) = \zeta _j \rho _j(x) \rho _j(y)$. with $\Vert \cdot \Vert _{\mathscr {H}_j} $ to denote the norm induced by $K_j$. From the discussion above it holds that

$$\begin{aligned} \Vert \rho _j \Vert ^2_{\mathscr {H}_j} = \frac{1}{\zeta _j}. \end{aligned}$$

(6.70)

Think of $K(x,y) = \sum _{i=1}^{\infty } \zeta _i \rho _i(x) \rho _i(y)$ as the sum of $K_j(x,y)$ and $K_{-j}(x,y) = \sum _{k \ne j}^{\infty } \zeta _k \rho _k(x) \rho _k(y)$. Then, using Theorem 6.6 and (6.70), one has

$$ \Vert \rho _j \Vert ^2_{\mathscr {H}} = \min _{c_j,h} \ \frac{c_j^2}{\zeta _j} + \Vert h \Vert ^2_{\mathscr {H}_{-j}} \ \text{ s.t. } \ \rho _j = c_j \rho _j + h, \ c_j \in \mathbb {R}, \ h \in \mathscr {H}_{-j} $$

where $\mathscr {H}_{-j}$ is the RKHS induced by $K_{-j}$. Evaluating the objective at $(c_j=1,h=0)$, one obtains

$$ \Vert \rho _j \Vert ^2_{\mathscr {H}} \le \frac{1}{\zeta _j}, $$

and this shows that $\rho _j \in \mathscr {H} \ \forall j$.

Now we prove that the functions $\rho _j$ generate all the RKHS $\mathscr {H}$ induced by K. Using Theorem 6.25 (Projection theorem), it comes that for any $f \in \mathscr {H}$ we have

$$ f = g + h \ \ \text{ with } \ \ g \in G, \ h \in G^\perp $$

where G indicates the closure in $\mathscr {H}$ of the subspace generated by all the $\rho _k$. Using the reproducing property, one obtains

$$\begin{aligned} h(x)= & {} <h(\cdot ),K(x,\cdot )>_{\mathscr {H}} \\= & {} <h(\cdot ),\sum _{k=1}^{\infty } \zeta _k \rho _k(x) \rho _k(\cdot )>_{\mathscr {H}} \\= & {} \sum _{k=1}^\infty \zeta _k \rho _k(x) <h(\cdot ), \rho _k(\cdot )>_{\mathscr {H}} = 0 \quad \forall x, \end{aligned}$$

where the last equality exploits the relation $h \perp \rho _k \ \forall k $. This completes the first part of the proof.

As for the RKHS norm characterization, first let $\mathscr {H}_j^\infty $ be the RKHS induced by the kernel $\sum _{k=j}^\infty K_k$ with $h_j$ to denote a generic element of $\mathscr {H}_j^\infty $. Then, given $f \in \mathscr {H}$, using Theorem 6.6 in an iterative fashion, we obtain

$$\begin{aligned} \Vert f \Vert _{\mathscr {H}}^2&= \min _{c_1,h_2} \ \frac{c_1^2}{\zeta _1} + \Vert h_2 \Vert ^2_{\mathscr {H}_2^\infty } \ \text{ s.t. } \ f = c_1 \rho _1 + h_2 \\&= \min _{c_1,c_2,h_3} \ \frac{c_1^2}{\zeta _1} + \frac{c_2^2}{\zeta _2} + \Vert h_3 \Vert ^2_{\mathscr {H}_3^\infty } \ \text{ s.t. } \ f = c_1 \rho _1 + c_2 \rho _2 + h_3 \\&\vdots \\&= \min _{c_1,\ldots ,c_{n-1},h_n} \ \sum _{k=1}^{n-1} \frac{c_k^2}{\zeta _k} + \Vert h_n \Vert ^2_{\mathscr {H}_n^\infty } \ \text{ s.t. } \ f = \sum _{i=1}^{n-1} c_i \rho _i + h_n. \end{aligned}$$

In particular, every equality above is obtained thinking of the kernel $\sum _{k=j}^\infty K_k$ as the sum of $K_j$ and $\sum _{k=j+1}^\infty K_k$. Then, $h_j$ can be decomposed into two parts, i.e., $h_j = c_j\rho _j + h_{j+1}$, with $\Vert \rho _j\Vert ^2_{\mathscr {H}_j} = 1/\zeta _j$ where, as before, $\Vert \cdot \Vert _{\mathscr {H}_j}$ denotes the norm in the one-dimensional RKHS induced by $K_j$. Now, let $\hat{c}_1,\ldots ,\hat{c}_{n-1},\hat{h}_n$ be the minimizers of the last objective (the minimizer can be assumed unique without loss of generality, just to simplify the exposition) and note that $\Vert \hat{h}_n\Vert _{\mathscr {H}_n^\infty }$ must go to zero as $n \rightarrow \infty $. Then, it comes that the sequence $\hat{c}_1,\hat{c}_{2},\ldots $ characterizing the norm $\Vert f \Vert _{\mathscr {H}}^2$ is indeed $\min _{\{c_k\}} \ \sum _{k=1}^\infty \ \frac{c_k^2}{\zeta _k }$ with the $\{c_k\}$ subject to the constraints $\lim _{n \rightarrow \infty } \ \Vert f - \sum _{k=1}^n c_k \rho _k \Vert _{\mathscr {H}} = 0$.

6.1.5 6.9.5 Proofs of Theorems 6.15 and 6.16

We prove the following more general result that embraces as special cases Theorems 6.15 and 6.16.

Theorem 6.28

Let $\mathscr {H}$ be a Hilbert space. Consider the optimization problem

$$\begin{aligned} \min _{f \in \mathscr {H}} \ \varPhi (L_1[f],\ldots ,L_N[f],\Vert f \Vert _{\mathscr {H}}) \end{aligned}$$

(6.71)

and assume that

problem (6.71) admits at least one solution;
each $L_i: \mathscr {H} \rightarrow \mathbb {R}$ is linear and bounded;
the objective $\varPhi $ is strictly increasing w.r.t. its last argument.

Then, all the solutions of (6.71) admit the following expression

$$\begin{aligned} \hat{g} = \sum _{i=1}^N \ c_i \eta _i, \end{aligned}$$

(6.72)

where the $c_i$ are suitable scalar expansion coefficients and each $\eta _i \in \mathscr {H}$ is the representer of $L_i$, i.e.,

$$\begin{aligned} L_i[f]=\langle f,\eta _i \rangle _{\mathscr {H}}, \quad \forall f \in \mathscr {H}, \ \ i=1,\ldots ,N. \end{aligned}$$

In particular, if $\mathscr {H}$ is a RKHS with kernel K, each basis function is given by

$$\begin{aligned} \eta _i(x) = L_i[K(\cdot ,x)]. \end{aligned}$$

To prove the above result, let $\hat{g}$ be a solution of (6.71) and denote with S the (closed) subspace spanned by the N representers $\eta _i$ of the functionals $L_i$, i.e.,

$$ S=\text{ span }\{\eta _1,\ldots ,\eta _N\}. $$

Exploiting Theorem 6.25 (Projection theorem), we can write

$$ \hat{g} = \hat{g}_S + \hat{g}_{S^\perp }, \quad \hat{g}_S \in S, \ \hat{g}_{S^\perp } \in S^\perp . $$

For the sake of contradiction, assume that $\hat{g}_{S^\perp }$ is different from the null function. Then, we have

$$\begin{aligned} \varPhi (L_1[\hat{g} ],\ldots ,L_n[\hat{g} ],\Vert \hat{g} \Vert _{\mathscr {H}})&= \varPhi (\langle \eta _1, \hat{g} \rangle _{\mathscr {H}},\ldots , \langle \eta _N, \hat{g} \rangle _{\mathscr {H}},\Vert \hat{g} \Vert _{\mathscr {H}}) \\&= \varPhi (\langle \eta _1, \hat{g}_S + \hat{g}_{S^\perp } \rangle _{\mathscr {H}},\ldots , \langle \eta _N, \hat{g}_S + \hat{g}_{S^\perp } \rangle _{\mathscr {H}}, \sqrt{\Vert \hat{g}_S \Vert _{\mathscr {H}}^2 + \Vert \hat{g}_{S^\perp } \Vert _{\mathscr {H}}^2 }) \\&= \varPhi (\langle \eta _1, \hat{g}_S \rangle _{\mathscr {H}},\ldots , \langle \eta _N, \hat{g}_S \rangle _{\mathscr {H}}, \sqrt{\Vert \hat{g}_S \Vert _{\mathscr {H}}^2 + \Vert \hat{g}_{S^\perp } \Vert _{\mathscr {H}}^2 }) \\&< \varPhi (\langle \eta _1, \hat{g}_S \rangle _{\mathscr {H}},\ldots , \langle \eta _N, \hat{g}_S \rangle _{\mathscr {H}}, \Vert \hat{g}_S \Vert _{\mathscr {H}}), \end{aligned}$$

where the last equality exploits the fact that each $\eta _i$ is orthogonal to all the functions in $S^\perp $ while the inequality exploits the assumption that $\varPhi $ is strictly increasing w.r.t. its last argument. This contradicts the optimality of $\hat{g}$ and implies that $\hat{g}_{S^\perp }$ must be the null function, hence concluding the first part of the proof.

Finally, to prove (6.28) note that, if $\mathscr {H}$ is a RKHS, one has

$$ \eta _i(x) = \langle \eta _i,K_x \rangle _{\mathscr {H}} = L_i[K(\cdot ,x)], $$

where the first equality comes from the reproducing property, while the second one derives from the fact that $\eta _i$ is the representer of $L_i$.

6.1.6 6.9.6 Proof of Theorem 6.21

Preliminary Lemmas

The first lemma, whose proof can be found in [34], states a bound on the correlation between two random variables assuming values in a Hilbert space.

Lemma 6.3

(based on [34]) Let a and b be zero-mean random variables measurable with respect to the $\sigma $-algebras $\mathscr {M}_1$ and $\mathscr {M}_2$ and with values in the Hilbert space $\mathscr {H}$ having inner product $\langle \cdot , \cdot \rangle _{\mathscr {H}}$. Then, it holds that

$$\begin{aligned} \left| \mathscr {E}[\langle a,b \rangle _{{\mathscr {H}}}] \right| \le 15\root 3 \of {\psi (\mathscr {M}_1,\mathscr {M}_2) \mathscr {E} \Vert a \Vert _{\mathscr {H}} ^3 \mathscr {E} \Vert b \Vert _{\mathscr {H}} ^3 }, \end{aligned}$$

(6.73)

where all the expectations above are assumed to exist and

$$\begin{aligned} \psi (\mathscr {M}_1,\mathscr {M}_2) = \sup _{A\in \mathscr {M}_1,B\in \mathscr {M}_2} | P(A \cap B) - P(A)P( B) |. \end{aligned}$$

As for the second lemma, first it is useful to introduce the following integral operator:

$$\begin{aligned} L_K[f](\cdot ) = \int _{X} K(\cdot ,x) f(x) p_x(x) dx. \end{aligned}$$

Since the assumptions underlying Theorem (6.9) (Mercer Theorem) hold true, there exists a complete orthonormal basis of $\mathscr {L}_2^{\mu }$, denoted by $\{\rho _i\}_{i \in \mathscr {I}}$, which satisfies

$$\begin{aligned} L_K[\rho _i] = \zeta _i \rho _i, \quad i \in \mathscr {I}, \quad \zeta _1 \ge \zeta _2 \ge . \end{aligned}$$

To simplify exposition, hereby we assume $\zeta _i>0 \ \forall i$. Then, for $r>0$, we define the operators $L_K^{-r}$ and $L_K^{r}$ as follows

$$\begin{aligned} L_K^{r}[f]= & {} \sum _{i \in \mathscr {I}} \zeta _i^r c_i \rho _i\end{aligned}$$

(6.74)

$$\begin{aligned} L_K^{-r}[f]= & {} \sum _{i \in \mathscr {I}} \frac{c_i}{\zeta _i^r} \rho _i. \end{aligned}$$

(6.75)

The function $L_K^{-r}[f]$ is less regular than f since its expansion coefficients go to zero more slowly. Instead, $L_K^{r}$ is a smoothing operator since $\zeta _i^r c_i$ goes to zero faster than $c_i$ as i goes to infinity. When $r=1/2$ we recover the operator $L_K^{1/2}$ already defined in (6.17) which satisfies $\mathscr {H} = L_K^{1/2} \mathscr {L}_2^{\mu }$. The following lemma holds.

Lemma 6.4

If $L_K^{-r} f_{\rho } \in \mathscr {L}_2^{\mu }$ for some $0<r \le 1$, letting

$$\begin{aligned} \hat{f} = \arg \min _{f \in \mathscr {H}} \Vert f-f_{\rho } \Vert ^2_{\mathscr {L}_2^{\mu }} + \gamma \Vert f \Vert ^2_{\mathscr {H}}, \end{aligned}$$

(6.76)

one has

$$\begin{aligned} \Vert \hat{f}-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \le \gamma ^{\ r} \Vert L_K^{-r}f_{\rho } \Vert _{\mathscr {L}_2^{\mu }}. \end{aligned}$$

(6.77)

Proof

By assumption, there exists $g \in \mathscr {L}_2^{\mu }$, say $g=\sum _{i \in \mathscr {I}} d_i \rho _i$, such that $f_{\rho }= L_K^{r}g $ so that $f_{\rho }=\sum _{i \in \mathscr {I}} \zeta _i^r d_i \rho _i$. Now, we characterize the solution $\hat{f}$ of (6.76) using $f = \sum _{i \in \mathscr {I}} c_i \rho _i$ and optimizing w.r.t. the $c_i$. The objective becomes

$$ \sum _{i \in \mathscr {I}} (c_i-\zeta _i^rd_i)^2 + \gamma \sum _{i \in \mathscr {I}} \frac{c_i^2}{\zeta _i}, $$

and setting the partial derivatives w.r.t. each $c_i$ to zero, we obtain

$$\begin{aligned} \hat{f} = \sum _{i \in \mathscr {I}} \hat{c}_i \rho _i, \quad \hat{c}_i = \frac{\zeta _i^{r+1}d_i}{\zeta _i+\gamma }. \end{aligned}$$

(6.78)

This implies

$$ \hat{f} - f_{\rho } = - \sum _{i \in \mathscr {I}} \frac{\gamma }{\zeta _i+\gamma } \zeta _i^r d_i \rho _i. $$

If $0<r \le 1$, it follows that

$$\begin{aligned} \Vert \hat{f}-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }}= & {} \left\{ \sum _{i \in \mathscr {I}} \left( \frac{\gamma }{\zeta _i+\gamma } \zeta _i^r d_i \right) ^2 \right\} ^{1/2} \\= & {} \gamma ^{\ r} \left\{ \sum _{i \in \mathscr {I}} \left( \frac{\gamma }{\zeta _i+\gamma }\right) ^{2(1-r)} \left( \frac{\zeta _i}{\zeta _i+\gamma }\right) ^{2r} d_i^2 \right\} ^{1/2} \\\le & {} \gamma ^{\ r} \sum _{i \in \mathscr {I}} \left( \frac{\gamma }{\zeta _i+\gamma }\right) ^{(1-r)} \left( \frac{\zeta _i}{\zeta _i+\gamma }\right) ^{r} |d_i| \\\le & {} \gamma ^{\ r} \left\{ \sum _{i \in \mathscr {I}} d_i^2 \right\} ^{1/2} = \gamma ^{\ r} \Vert g \Vert _{\mathscr {L}_2^{\mu }} = \gamma ^{\ r} \Vert L_K^{-r}f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \end{aligned}$$

and this proves (6.77). $\square $

In the proof of the third lemma reported below, the notation $S_x: \mathscr {H} \rightarrow \mathbb {R}^N$ indicates the sampling operator defined by $S_x f=[f(x_1) \ldots f(x_N)]$. In addition, $S_x^T$ denotes its adjoint, i.e., for any $c \in \mathbb {R}^N$, it satisfies

$$ \langle f, S_x^T c \rangle _{\mathscr {H}} = \langle S_x f,c \rangle = \sum _{i=1}^N \ c_i f(x_i) = \langle f, \sum _{i=1}^N \ c_i K_{x_i} \rangle _{\mathscr {H}}, $$

where $\langle \cdot , \cdot \rangle $ is the Euclidean inner product. Hence, one has

$$ S_x^T c = \sum _{i=1}^N \ c_i K_{x_i} \quad \forall \ c \in \mathbb {R}^N. $$

Lemma 6.5

Define

$$\begin{aligned} \eta _i (\cdot ) = \left[ y_i-\hat{f}(x_i) \right] K(x_i, \cdot ) \end{aligned}$$

(6.79)

with $\hat{f} $ defined by (6.76). Then, if $ \hat{g}_N$ is given by (6.55), one has

$$ \Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {H}} \le \frac{1}{\gamma } \left\| \frac{1}{N} \sum _{i=1}^{N} \left( \eta _i - \mathscr {E}[\eta _i ] \right) \right\| _{\mathscr {H}}. $$

Proof

First, it is useful to derive two useful equalities involving $\hat{f}$ and $\hat{g}_N$. The first one is

$$\begin{aligned} \gamma \hat{f} = L_K (\hat{f}_{\rho } - \hat{f}) = \mathscr {E} \eta _i. \end{aligned}$$

(6.80)

The last equality in (6.80) follows from the definition of $L_k$ and $\eta _i$. The first equality can be obtained using the representation $f_{\rho }=\sum _{i \in \mathscr {I}} d_i \rho _i$, then following the same passages contained in the first part of the previous lemma’s proof to obtain

$$ \hat{f} = \sum _{i \in \mathscr {I}} \frac{\zeta _i }{\zeta _i+\gamma } d_i \rho _i, \quad f_{\rho } - \hat{f} = \sum _{i \in \mathscr {I}} \frac{\gamma }{\zeta _i+\gamma } d_i \rho _i. $$

The second result consists of the following alternative expression for $\hat{g}_N$:

$$\begin{aligned} \hat{g}_N = \left( \frac{S_x^T S_x}{N} + \gamma I\right) ^{-1} \frac{S_x^T}{N} Y, \end{aligned}$$

(6.81)

where I denotes the identity operator. To prove it, we will use the equality $\hat{g}_N=S_x^T \left( \mathbf {K}+N \gamma I_{N}\right) ^{-1}Y$ which derives from the representer theorem and also the fact that, for any vector $c \in \mathbb {R}^N$, it holds that $S_xS_x^Tc=\mathbf {K}c$ with $\mathbf {K}$ the kernel matrix built using $[x_1 \ldots x_N]$. Then, we have

$$\begin{aligned} \left( \frac{S_x^T S_x}{N} + \gamma I \right) \hat{g}_N= & {} \frac{S_x^T}{N} \left( \mathbf {K} \left( \mathbf {K}+N \gamma I_{N}\right) ^{-1} + N \gamma \left( \mathbf {K}+N \gamma I_{N}\right) ^{-1}\right) Y\\= & {} \frac{S_x^T}{N}Y. \end{aligned}$$

Now, it is also useful to obtain a bound on the inverse of the operator $\frac{S_x^T S_x}{N} + \gamma I$. Assume that $v \in \mathscr {H}$ and let u satisfy

$$ \left( \frac{S_x^T S_x}{N} + \gamma I\right) u = v. $$

We take inner products on both sides with u and use the equality $\langle S_x S_x^T u, u \rangle _{\mathscr {H}} =\langle S_x u, S_x u \rangle $ to obtain

$$ \frac{1}{N} \langle S_x u, S_x u \rangle + \gamma \Vert u\Vert ^2_{\mathscr {H}} = \langle v,u \rangle _{\mathscr {H}} \le \Vert v \Vert _{\mathscr {H}} \Vert u \Vert _{\mathscr {H}}. $$

One has

$$ \lambda _x := \inf _{f \in \mathscr {H}} \ \ \frac{\Vert S_x f \Vert }{\sqrt{N} \Vert f \Vert _{\mathscr {H}}} \implies \left( \lambda _x^2 + \gamma \right) \Vert u \Vert ^2_{\mathscr {H}} \le \Vert v \Vert _{\mathscr {H}} \Vert u \Vert _{\mathscr {H}}. $$

Thus, we have shown that

$$\begin{aligned} \left( \frac{S_x^T S_x}{N} + \gamma I\right) u = v \ \implies \ \Vert u \Vert _{\mathscr {H}} \le \frac{ \Vert v \Vert _{\mathscr {H}} }{ \lambda _x^2 + \gamma } \le \frac{1}{ \gamma }\Vert v \Vert _{\mathscr {H}}, \quad \forall v \in \mathscr {H}. \end{aligned}$$

(6.82)

Now, it comes from (6.81) that

$$ \hat{g}_N - \hat{f} = \left( \frac{S_x^T S_x}{N} + \gamma I\right) ^{-1} \left( \frac{S_x^T Y}{N}- \frac{S_x^T S_x \hat{f}}{N}-\gamma \hat{f} \right) . $$

Exploiting the equalities

$$ \frac{S_x^T Y}{N}- \frac{S_x^T S_x \hat{f}}{N} = \frac{1}{N} \sum _{i=1}^{N} \eta _i, \quad \gamma \hat{f} = \mathscr {E} \eta _i, $$

which derive from (6.79) and (6.80), respectively, we obtain

$$ \hat{g}_N - \hat{f} = \left( \frac{S_x^T S_x}{N} + \gamma I\right) ^{-1} \frac{1}{N} \sum _{i=1}^{N} \left( \eta _i - \mathscr {E}[\eta _i ] \right) . $$

The use of (6.82) then completes the proof. $\square $

Proof of Statistical Consistency

Let $\hat{f}$ be defined by (6.76), i.e.,

$$\begin{aligned} \hat{f} = \arg \min _{f \in \mathscr {H}} \Vert f-f_{\rho } \Vert ^2_{\mathscr {L}_2^{\mu }} + \gamma \Vert f \Vert ^2_{\mathscr {H}}. \end{aligned}$$

Then, consider the following error decomposition

$$\begin{aligned} \Vert \hat{g}_N-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \le \Vert \hat{f}-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} + \Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {L}_2^{\mu }}. \end{aligned}$$

(6.83)

The first term $\Vert \hat{f}-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }}$ on the r.h.s. is not stochastic. The assumption $f_{\rho } \in \mathscr {H}$ ensures that $\Vert L_K^{-r} f_{\rho } \Vert _{\mathscr {L}_2^{\mu }}< \infty $ for $0 \le r \le 1/2$. It thus comes from Lemma 6.4 that, at least for $0<r \le 1/2$, it holds that

$$\begin{aligned} \Vert \hat{f}-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \le \gamma ^{\ r} \Vert L_K^{-r} f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} < \infty . \end{aligned}$$

(6.84)

Now, consider the second term $\Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {L}_2^{\mu }}$. Since the input space (the function domain) is compact, and recalling also (6.69), there exists a constant A such that

$$\begin{aligned} \Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {L}_2^{\mu }} \le A \Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {H}}. \end{aligned}$$

(6.85)

To obtain a bound for the r.h.s. involving the RKHS norm, consider the stochastic function

$$ \eta _i (\cdot ) = \left[ y_i-\hat{f}(x_i) \right] K(x_i, \cdot ), $$

already introduced in (6.79). Using the reproducing property, one has

$$\begin{aligned} \Vert \eta _i \Vert _{\mathscr {H}} ^2 = \left[ y_i-\hat{f}(x_i) \right] ^2\,K(x_i, x_i). \end{aligned}$$

(6.86)

The function $\hat{f}$ belongs to $\mathscr {H}$ and is thus continuous on the compact $\mathscr {X}$. In addition, the kernel K is continuous on the compact $\mathscr {X} \times \mathscr {X}$ and the process $\{x_i,y_i\}$ has finite moments up to the third order by assumption. Hence, there exists a constant B independent of i such that

$$\begin{aligned} \mathscr {E}\left[ \Vert \eta _i \Vert _{\mathscr {H}} ^k \right] \le B, \quad k=1,2,3. \end{aligned}$$

(6.87)

We can now come back to $\Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {H}}$. From Lemma 6.5, $\forall \gamma >0$ it holds that

$$\begin{aligned} \Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {H}} \le \frac{1}{\gamma } \left\| \frac{1}{N} \sum _{i=1}^{N} \left( \eta _i - \mathscr {E}[\eta _i ] \right) \right\| _{\mathscr {H}}. \end{aligned}$$

(6.88)

Now, using first Jensen’s inequality and then (6.87), (6.88), Assumption 6.20 and (6.73) in Lemma 6.3 (with a and b replaced by $\eta _i- \mathscr {E}[\eta _i]$ and $\eta _j - \mathscr {E}[\eta _j]$) one obtains constants C and D such that

$$\begin{aligned}&\left( \mathscr {E} \left[ \left\| \frac{1}{N} \sum _{i=1}^{N} \left( \eta _i - \mathscr {E}[\eta _i ] \right) \right\| _{\mathscr {H}} \right] \right) ^2\le \mathscr {E} \left[ \left\| \frac{1}{N} \sum _{i=1}^{N} \left( \eta _i - \mathscr {E}[\eta _i ] \right) \right\| _{\mathscr {H}}^2 \right] \\&\qquad \le \frac{15}{N^2} \sum _{i=1}^{N} \sum _{j=1}^{N} \root 3 \of {|\psi _{| i - j|}|} \left( \mathscr {E}[\Vert \eta -\mathscr {E}[\eta ]\Vert _{\mathscr {H}}^3 ] \right) ^{\frac{2}{3}} \\&\qquad \le \frac{C}{N} \left( \mathscr {E}[(\Vert \eta \Vert _{\mathscr {H}} + \Vert \mathscr {E}[\eta ]\Vert _{\mathscr {H}})^3 ] \right) ^{\frac{2}{3}} \le \frac{D}{N} , \\ \end{aligned}$$

where $\eta $ replaces $\eta _i$ or $\eta _j$ when the expectation is independent of i and j. This latter result, combined with (6.85) and (6.88), leads to the existence of a constant E such that

$$\begin{aligned} \mathscr {E} \Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {L}_2^{\mu }} \le A \mathscr {E} \Vert \hat{g}_N-\hat{f} \Vert _{\mathscr {H}} \le \frac{E}{\gamma \sqrt{N} } \end{aligned}$$

(6.89)

that, combined with (6.83) and (6.84), implies that for any $0<r \le 1/2$

$$\begin{aligned} \mathscr {E} \Vert \hat{g}_N-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} \le \gamma ^{\ r} \Vert L_K^{-r} f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} + \frac{E}{\gamma \sqrt{N} }. \end{aligned}$$

(6.90)

Hence, when $\gamma $ is chosen according to (6.56), $\mathscr {E}\Vert \hat{g}_N-f_{\rho } \Vert _{\mathscr {L}_2^{\mu }} $ converges to zero as N grows to $\infty $. Using the Markov inequality, (6.57) is finally obtained.

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Pillonetto, G., Chen, T., Chiuso, A., De Nicolao, G., Ljung, L. (2022). Regularization in Reproducing Kernel Hilbert Spaces. In: Regularized System Identification. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-95860-2_6

Download citation

DOI: https://doi.org/10.1007/978-3-030-95860-2_6
Published: 14 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-95859-6
Online ISBN: 978-3-030-95860-2
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics

Regularization in Reproducing Kernel Hilbert Spaces

Abstract

6.1 Preliminaries

6.2 Reproducing Kernel Hilbert Spaces

Definition 6.1

Definition 6.2

Theorem 6.1

Remark 6.1

Theorem 6.2

Example 6.3

Example 6.4

Example 6.5

6.2.1 Reproducing Kernel Hilbert Spaces Induced by Operations on Kernels \(\star \)

Theorem 6.6

Theorem 6.7

Theorem 6.8

6.3 Spectral Representations of Reproducing Kernel Hilbert Spaces

Theorem 6.9

Theorem 6.10

Remark 6.2

Example 6.11

Example 6.12

6.3.1 More General Spectral Representation \(\star \)

Theorem 6.13

Example 6.14

6.4 Kernel-Based Regularized Estimation

6.4.1 Regularization in Reproducing Kernel Hilbert Spaces and the Representer Theorem

Theorem 6.15

Remark 6.3

6.4.2 Representer Theorem Using Linear and Bounded Functionals

Theorem 6.16

Example 6.17

6.5 Regularization Networks and Support Vector Machines

6.5.1 Regularization Networks

Remark 6.4

6.5.2 Robust Regression via Huber Loss \(\star \)

6.5.3 Support Vector Regression \(\star \)

6.5.4 Support Vector Classification \(\star \)

6.6 Kernels Examples

6.6.1 Linear Kernels, Regularized Linear Regression and System Identification

6.6.1.1 Infinite-Dimensional Extensions \(\star \)

6.6.2 Kernels Given by a Finite Number of Basis Functions

6.6.3 Feature Map and Feature Space \(\star \)

6.6.4 Polynomial Kernels

6.6.5 Translation Invariant and Radial Basis Kernels

Theorem 6.18

6.6.6 Spline Kernels

6.6.7 The Bias Space and the Spline Estimator

Remark 6.5

6.7 Asymptotic Properties \(\star \)

6.7.1 The Regression Function/Optimal Predictor

Theorem 6.19

6.7.2 Regularization Networks: Statistical Consistency

Assumption 6.20

Theorem 6.21

6.7.3 Connection with Statistical Learning Theory

Definition 6.3

Assumption 6.22

Definition 6.4

Definition 6.5

Theorem 6.23

Theorem 6.24

6.8 Further Topics and Advanced Reading

Notes

References

Author information

Authors and Affiliations

Corresponding author

6.9 Appendix

6.9 Appendix

6.1.1 6.9.1 Fundamentals of Functional Analysis

Theorem 6.25

Theorem 6.26

Theorem 6.27

6.1.2 6.9.2 Proof of Theorem 6.1

Lemma 6.1

Proof

Lemma 6.2

Proof

6.1.3 6.9.3 Proof of Theorem 6.10