7.1 Regularized Linear System Identification in Reproducing Kernel Hilbert Spaces

7.1.1 Discrete-Time Case

We will consider linear discrete-time systems in the form of the so-called output error (OE) models. Data are generated according to the relationship

$$\begin{aligned} y(t)&=G^0(q)u(t)+e(t),\quad t=1,\dots , N, \end{aligned}$$
(7.1)

where y(t), u(t) and \(e(t) \in \mathbb {R}\) are the system output, the known system input and the noise at time instant \(t\in \mathbb N\), respectively. In addition, \(G^0(q)\) is the “true” system that has to be identified from the input–output samples with q being the time shift operator, i.e., \(qu(t)=u(t+1)\). Here, and also in all the remaining parts of the chapter, we assume that e is white noise (all its components are mutually uncorrelated).

In Chap. 2, we have seen that there exist different ways to parametrize \(G^0(q)\). In what follows, we will start our discussions exploiting the simplest impulse response descriptions given by FIR models and then we will consider more general infinite-dimensional models also in continuous time. We will see that there is a common way to estimate them through regularization in the RKHS framework and the representer theorem.

7.1.1.1 FIR Case

The FIR case corresponds to

$$\begin{aligned} y(t)&=G(q,\theta )u(t)+e(t)\nonumber \\&=\sum ^m_{k=1}g_ku(t-k)+ e(t),\quad \theta =[g_1,\dots , g_m]^T, \end{aligned}$$
(7.2)

where m is the FIR order, \(g_1,\ \dots , g_m\) are the FIR coefficients and \(\theta \) is the unknown vector that collects them. Model (7.2) can be rewritten in vector form as follows:

$$\begin{aligned} Y=\varPhi \theta + E, \end{aligned}$$
(7.3)

where

$$ Y=[y(1)\ \dots \ y(N)]^T, \quad E=[e(1)\ \dots \ e(N)]^T $$

and

$$ \varPhi =[\varphi (1)\ \dots \ \varphi (N)]^T $$

with

$$ \varphi ^T(t)=[u(t-1)\ \dots \ u(t-m)]. $$

Instead of describing FIR model estimation directly in the regularized RKHS framework, let us first recall the ReLS method with quadratic penalty term introduced in Chap. 3. It gives the estimate of \(\theta \) by solving the following problem:

$$\begin{aligned} \hat{\theta }&=\mathop {\mathrm {arg\,min}}\limits _\theta \sum _{t=1}^N (y(t)-\sum ^m_{k=1}g_ku(t-k))^2 + \gamma \theta ^TP^{-1}\theta \end{aligned}$$
(7.4a)
$$\begin{aligned}&=\mathop {\mathrm {arg\,min}}\limits _\theta \Vert Y-\varPhi \theta \Vert ^2 + \gamma \theta ^TP^{-1}\theta \end{aligned}$$
(7.4b)
$$\begin{aligned}&= (\varPhi ^T\varPhi +\gamma P^{-1})^{-1}\varPhi ^TY \end{aligned}$$
(7.4c)
$$\begin{aligned}&= P \varPhi ^T (\varPhi P \varPhi ^T + \gamma I_{N} )^{-1} Y, \end{aligned}$$
(7.4d)

where the regularization matrix \(P\in {\mathbb R}^{m\times m}\) is positive semidefinite, assumed invertible for simplicity. The regularization parameter \(\gamma \) is a positive scalar that, as already seen, has to balance adherence to experimental data and strength of regularization.
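To make the closed forms above concrete, here is a minimal Python sketch, with purely illustrative choices (a TC-type regularization matrix P, a random input, an arbitrary noise level and value of \(\gamma\)), that checks numerically that (7.4c) and (7.4d) return the same estimate.

```python
import numpy as np

# Minimal sketch with illustrative values: a TC-type regularization matrix P,
# a random input and synthetic data are used to check numerically that the two
# closed forms (7.4c) and (7.4d) of the ReLS estimate coincide.
rng = np.random.default_rng(0)
m, N, gamma, alpha = 20, 100, 1.0, 0.8

i = np.arange(1, m + 1)
P = alpha ** np.maximum.outer(i, i)            # P_ij = alpha^{max(i,j)}
g_true = 0.5 * alpha ** i                      # a stable "true" FIR vector theta

u = rng.standard_normal(N + m)                 # input samples (including past values)
Phi = np.array([u[t:t + m][::-1] for t in range(N)])   # rows are lagged-input regressors
Y = Phi @ g_true + 0.05 * rng.standard_normal(N)

theta_c = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)   # (7.4c)
theta_d = P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + gamma * np.eye(N), Y)  # (7.4d)
print(np.max(np.abs(theta_c - theta_d)))       # agreement up to round-off
```

Note that (7.4c) requires solving an \(m \times m\) system while (7.4d) requires an \(N \times N\) one, so the cheaper form depends on whether m or N is larger.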

Now we show that (7.4) can be reformulated as a function estimation problem with regularization in the RKHS framework. For this aim, we will see that the key is to use the \(m \times m\) matrix P to define the kernel over the domain \(\{1,2,\dots ,m\} \times \{1,2,\dots ,m\}\). This in turn will define a RKHS of functions \(g: \{1,2,\dots ,m\} \rightarrow \mathbb {R}\). Such functions are connected with the components \(g_i\) of the m-dimensional vector \(\theta \) by the relation \(g(i)=g_i\). So, the functional view is obtained replacing the vector \(\theta \) with the function that maps i into the ith component of \(\theta \).

Let us define a positive semidefinite kernel \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) as follows:

$$\begin{aligned} K(i,j)=P_{ij},\quad i,j\in \mathscr {X}=\{1,2,\dots ,m\}, \end{aligned}$$
(7.5)

where \(P_{ij}\) is the (i, j)th entry of the regularization matrix P. It is obvious that K is positive semidefinite because P is positive semidefinite. Its kernel sections will be denoted by \(K_i\) with \(i=1,\ldots ,m\) and are the columns of P seen as functions mapping \(\mathscr {X}\) into \(\mathbb {R}\).

Now, using the Moore–Aronszajn Theorem, illustrated in Theorem 6.2, the kernel K reported in (7.5) defines a unique RKHS \(\mathscr {H}\) such that \(\langle K_i, g \rangle _{\mathscr {H}} = g(i)\), \(\forall (i, g) \in \mathscr {X}\times \mathscr {H}\). This is the function space where we will search for the estimate of the FIR coefficients. According to the discussion following Theorem 6.2, since there are just m kernel sections \(K_i\) associated to the m columns of P, for any impulse response candidate \(g \in \mathscr {H},\) there exist m scalars \(a_j\) such that

$$\begin{aligned} g(i) = \sum _{j=1}^m a_j K(i,j) =P(i,:)a \end{aligned}$$
(7.6)

where P(i, :) is the ith row of P. Since g(i) is the ith component of \(\theta \), one has

$$ \theta = Pa. $$

By the reproducing property, we also have

$$\begin{aligned} \Vert g\Vert ^2_{\mathscr {H}}&=\langle \sum _{j=1}^m a_j K_j, \sum _{l=1}^m a_l K_l \rangle _{\mathscr {H}} = \sum _{j=1}^m\sum _{l=1}^m a_ja_lK(j,l)\\ {}&= \sum _{j=1}^m\sum _{l=1}^m a_ja_lP_{jl} = a^TPa \end{aligned}$$

and this implies

$$ \Vert g\Vert ^2_{\mathscr {H}} = \theta ^T P^{-1} \theta . $$

As a result, the ReLS method (7.4) can be reformulated as follows:

$$\begin{aligned} \hat{g} = \mathop {\mathrm {arg\,min}}\limits _{g \in \mathscr {H}} \sum _{t=1}^N (y(t) - \sum ^m_{k=1}g(k) u(t-k))^2 + \gamma \Vert g \Vert ^2_{\mathscr {H}} \end{aligned}$$
(7.7)

which is a regularized function estimation problem in the RKHS \(\mathscr {H}\).

In view of the equivalence between (7.4) and (7.7), the FIR function estimate \(\hat{g}\) has the closed-form expression given by (7.4d). The correspondence is established by \(\hat{g}(i)=\hat{\theta }_i\). We will show later that such a closed-form expression can be derived/interpreted by exploiting the representer theorem.

Remark 7.1

\(\star \) Besides (7.7), there is also an alternative way to reformulate the ReLS method (7.4) as a function estimation problem with regularization in the RKHS framework. This has been sketched in the discussions on linear kernels in Sect. 6.6.1. The difference lies in the choice of the function to be estimated and the choice of the corresponding kernel. In particular, in this chapter, we have obtained (7.7) choosing the function and the corresponding kernel to be the FIR g and (7.5), respectively. In contrast, in Sect. 6.6.1, the RKHS is defined by the kernel

$$\begin{aligned} K(x,y)=x^TPy,\quad x,y\in \mathscr {X}={\mathbb R}^m \end{aligned}$$
(7.8)

and contains the linear functions \(x^T\theta \), where the input locations x encapsulate m past input values. So, using (7.8), the corresponding RKHS does not contain impulse responses but functions that represent directly linear systems mapping regressors (built with input values) into outputs.

7.1.1.2 IIR Case

The infinite impulse response (IIR) case corresponds to

$$\begin{aligned} y(t)=G(q,\theta )u(t)+e(t)=\sum ^\infty _{k=1}g_ku(t-k)+ e(t), \quad t=1,\dots ,N \end{aligned}$$
(7.9)

where \(\theta =[g_1,\dots , g_\infty ]^T\). So, the model order m is set to \(\infty \) and we have to handle infinite-dimensional objects. To face the intrinsic ill-posedness of the estimation problem, one could think of introducing an infinite-dimensional regularization matrix P. But the penalty \(\theta ^TP^{-1}\theta \), adopted in (7.4) for the FIR case, would then be undefined. So, the RKHS setting is needed to define regularized IIR estimates. The first step is to choose a positive semidefinite kernel \(K:\mathbb {N}\times \mathbb {N}\rightarrow {\mathbb R}\). Then, let \(\mathscr {H}\) be the RKHS associated with K and \(g \in \mathscr {H}\) be the IIR function with \(g(k)=g_k\) for \(k\in \mathbb {N}\). Finally, the estimate is given by

$$\begin{aligned} \hat{g} = \mathop {\mathrm {arg\,min}}\limits _{g \in \mathscr {H}} \sum _{t=1}^N (y(t) - \sum ^{\infty }_{k=1}g(k)u(t-k))^2 + \gamma \Vert g\Vert ^2_{\mathscr {H}}. \end{aligned}$$
(7.10)

One may wonder whether it is possible to obtain a closed-form expression of the IIR estimate \(\hat{g}\) as in the FIR case. The answer is positive and is given by the following representer theorem. It derives from Theorem 6.16 of the previous chapter applied to the case of quadratic loss functions, as discussed in Example 6.17, which allows one to recover the expansion coefficients of the estimate by just solving a linear system of equations, see (6.29) and (6.31). Before stating the result formally, it is useful to point out the following two facts:

  • in the dynamic systems context treated in this chapter, any functional \(L_i\) appearing in Theorem 6.16 is now applied to discrete-time impulse responses g which live in the RKHS \(\mathscr {H}\). Hence, it represents the discrete-time convolution with the input, i.e., \(L_i\) maps \(g \in \mathscr {H}\) into the system output evaluated at the time instant \(t=i\);

  • from the discussion after Theorem 6.16, recall also that a functional L is linear and bounded in \(\mathscr {H}\) if and only if the function f, defined for any x by \(f(x)=L[K(x,\cdot )]\), belongs to \(\mathscr {H}\). Hence, condition (7.11) reported below is equivalent to assuming that the system input defines linear and bounded functionals over the RKHS induced by K.

Theorem 7.1

(Representer theorem for discrete-time linear system identification, based on [73, 90]). Consider the function estimation problem (7.10). Assume that \(\mathscr {H}\) is the RKHS induced by a positive semidefinite kernel \(K:\mathbb {N}\times \mathbb {N}\rightarrow {\mathbb R}\) and that, for \(t=1,\ldots ,N\), the functions \(\eta _t\) defined by

$$\begin{aligned} \eta _t(i) = \sum ^{\infty }_{k=1}K(i,k)u(t-k), \quad i\in \mathbb {N} \end{aligned}$$
(7.11)

are all well defined in \(\mathscr {H}\). Then, the solution of (7.10) is

$$\begin{aligned} \hat{g}(i) = \sum _{t=1}^N \ \hat{c}_t \eta _t(i),\quad i\in \mathbb {N}, \end{aligned}$$
(7.12)

where \(\hat{c}_t\) is the tth entry of the vector

$$\begin{aligned} \hat{c} = (O+\gamma I_N)^{-1}Y \end{aligned}$$
(7.13)

with \(Y=[y(1),\dots , y(N)]^T\) and with the (t, s)th entry of O given by

$$\begin{aligned} O_{ts} = \sum ^{\infty }_{i=1}\sum ^{\infty }_{k=1}K(i,k)u(t-k)u(s-i), \quad t,s=1,\dots ,N.\end{aligned}$$
(7.14)

Theorem 7.1 discloses an important feature of regularized impulse response estimation in RKHS. The function estimate \(\hat{g}\) has a finite dimensional representation that does not depend on the dimension of the RKHS \(\mathscr {H}\) induced by the kernel but only on the data set size N.
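As a hedged illustration of how Theorem 7.1 can be used in practice, the following Python sketch truncates the infinite sums in (7.11) and (7.14) at a finite horizon T (a reasonable approximation when the kernel sections decay fast, as for the kernel (7.15) used here). The kernel, the input, the "true" system and all numerical values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Sketch of Theorem 7.1 with the infinite sums in (7.11) and (7.14) truncated
# at a horizon T. Kernel, input and data are illustrative choices.
rng = np.random.default_rng(1)
N, T, gamma, alpha = 50, 200, 0.1, 0.9

idx = np.arange(1, T + 1)                                  # lags k = 1,...,T
Kmat = alpha ** np.maximum.outer(idx, idx)                 # K(i,k) = alpha^{max(i,k)}

u = rng.standard_normal(N)                                 # u(0),...,u(N-1); u(t)=0 for t<0
U = np.zeros((N, T))                                       # U[t-1, k-1] = u(t-k)
for t in range(1, N + 1):
    U[t - 1, :t] = u[t - 1::-1]

# eta_t(i) = sum_k K(i,k) u(t-k)  and  O_ts = sum_i eta_t(i) u(s-i)   (7.11), (7.14)
Eta = U @ Kmat                                             # Eta[t-1, i-1] = eta_t(i)
O = Eta @ U.T

g0 = alpha ** idx                                          # "true" IIR (illustrative)
Y = U @ g0 + 0.05 * rng.standard_normal(N)
c_hat = np.linalg.solve(O + gamma * np.eye(N), Y)          # (7.13)
g_hat = Eta.T @ c_hat                                      # (7.12) evaluated on the grid
```

The key point of the theorem is visible in the last two lines: whatever the (possibly infinite) dimension of \(\mathscr {H}\), only an \(N \times N\) linear system must be solved.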

Example 7.2

(Stable spline kernel for IIR estimation) To estimate high-order FIR models, in the previous chapters, we have introduced some regularization matrices related to the DC, TC and stable spline kernels, see (5.40) and (5.41). Consider now the TC kernel, also called first-order stable spline, with support extended to \(\mathbb {N} \times \mathbb {N}\), i.e.,

$$\begin{aligned} K(i,j) =\alpha ^{\max {(i,j)}}, \quad 0< \alpha <1, \quad (i,j) \in \mathbb {N}\times \mathbb {N}. \end{aligned}$$
(7.15)

This kernel induces a RKHS that contains IIR models and can be conveniently adopted in the estimator (7.10). An interesting question is to derive the structure of the induced regularizer \(\Vert g\Vert ^2_{\mathscr {H}}\). One could connect K with the matrix P entering (7.4a), but its inverse is not defined since P is now infinite dimensional. To derive the stable spline norm, it is instead necessary to resort to functional analysis arguments. In particular, in Sect. 7.7.1, it is proved that

$$\begin{aligned} \Vert g \Vert ^2_{\mathscr {H}} = \sum _{t=1}^{\infty } \ \frac{ \left( g_{t+1} -g_t \right) ^2}{(1-\alpha )\alpha ^{t}}, \end{aligned}$$
(7.16)

an expression that clearly reveals how the kernel (7.15) encodes information on smooth exponential decay. When used in (7.10), the resulting IIR estimate balances the data fit (sum of squared residuals) against the energy of the impulse response increments, weighted by coefficients that increase exponentially with time t and thus enforce stability.
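The expression (7.16) can be checked numerically for any g in the span of finitely many kernel sections of (7.15), for which \(\Vert g\Vert ^2_{\mathscr {H}}=a^TGa\) with G the Gram matrix. The sketch below, with arbitrary coefficients and \(\alpha\), truncates the infinite sum at a large horizon where the terms are already negligible.

```python
import numpy as np

# Numerical check of the stable spline norm (7.16): take g as a finite linear
# combination of kernel sections of (7.15), so that ||g||^2 = a' G a with G the
# Gram matrix, and compare with the increment sum in (7.16) truncated at a
# large horizon T. Coefficients and alpha are arbitrary illustrative values.
alpha, n, T = 0.8, 5, 500
a = np.array([1.0, -0.5, 0.3, 0.2, -0.1])               # expansion coefficients

j = np.arange(1, n + 1)
G = alpha ** np.maximum.outer(j, j)                      # Gram matrix K(j,l)
norm_sq_rkhs = a @ G @ a                                 # ||g||^2 via the kernel

t = np.arange(1, T + 2)                                  # g(t) = sum_j a_j K(t,j)
g = (alpha ** np.maximum.outer(t, j)) @ a
increments = np.diff(g)                                  # g_{t+1} - g_t, t = 1,...,T
norm_sq_sum = np.sum(increments ** 2 / ((1 - alpha) * alpha ** t[:-1]))

print(norm_sq_rkhs, norm_sq_sum)                         # the two values match
```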

Let us now consider a simple application of the representer theorem. Assume that the system input is a causal step of unit amplitude, i.e., \(u(t)=1\) for \(t \ge 0\) and \(u(t)=0\) otherwise. The functions (7.11) are given by

$$ \eta _t(i) = \sum ^{\infty }_{k=1}K(i,k)u(t-k), \quad i\in \mathbb {N}. $$

For instance, the first three basis functions are

$$\begin{aligned} \eta _1(i)= & {} \sum ^{\infty }_{k=1}K(i,k)u(1-k) = \alpha ^{\max {(i,1)}} \\ \eta _2(i)= & {} \sum ^{\infty }_{k=1}K(i,k)u(2-k) = \alpha ^{\max {(i,1)}} + \alpha ^{\max {(i,2)}} \\ \eta _3(i)= & {} \sum ^{\infty }_{k=1}K(i,k)u(3-k) = \alpha ^{\max {(i,1)}} + \alpha ^{\max {(i,2)}} + \alpha ^{\max {(i,3)}} \end{aligned}$$

and, in general, one has

$$ \eta _t(i) = \sum _{k=1}^{t} \alpha ^{\max {(i,k)}}. $$

Hence, any \(\eta _t\) is a well-defined function in the RKHS induced by K, being the sum of the first t kernel sections. Then, according to Theorem 7.1, we conclude that the IIR estimate returned by (7.10) is spanned by the functions \(\{\eta _t\}_{t=1}^N\) with coefficients then computable from (7.13).    \(\square \)

Although Theorem 7.1 is stated for the IIR case (7.10), the same result also holds for the FIR case (7.7). The only difference is that the series in (7.11) and (7.14) have to be replaced by finite sums up to the FIR order m. Then, interestingly, one can interpret the regularized FIR estimate (7.4d) in a different way exploiting the representer theorem perspective. In particular, one finds \(O=\varPhi P\varPhi ^T\) while the basis functions \(\{\eta _t\}_{t=1}^N\) are in one-to-one correspondence with the N columns of \(P\varPhi ^T\), each of dimension m.
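A short sketch of this FIR reading, again with illustrative data, is reported below: the representer-theorem construction built from \(O=\varPhi P\varPhi ^T\) and the columns of \(P\varPhi ^T\) reproduces the ReLS solution (7.4c).

```python
import numpy as np

# Sketch of the FIR interpretation of Theorem 7.1, with illustrative data:
# with sums truncated at m, O = Phi P Phi' and the eta_t are the columns of
# P Phi', so the representer-theorem estimate reproduces the ReLS solution.
rng = np.random.default_rng(2)
m, N, gamma, alpha = 15, 60, 0.5, 0.85

i = np.arange(1, m + 1)
P = alpha ** np.maximum.outer(i, i)
u = rng.standard_normal(N + m)
Phi = np.array([u[t:t + m][::-1] for t in range(N)])
Y = Phi @ (alpha ** i) + 0.05 * rng.standard_normal(N)

O = Phi @ P @ Phi.T                                   # output kernel matrix
Eta = P @ Phi.T                                       # t-th column is eta_t
c_hat = np.linalg.solve(O + gamma * np.eye(N), Y)     # expansion coefficients (7.13)
g_rep = Eta @ c_hat                                   # sum_t c_t eta_t
g_rels = np.linalg.solve(Phi.T @ Phi + gamma * np.linalg.inv(P), Phi.T @ Y)  # (7.4c)
print(np.max(np.abs(g_rep - g_rels)))                 # zero up to round-off
```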

7.1.2 Continuous-Time Case

Now, we consider linear continuous-time systems, still focusing on the output error (OE) model structure. The system outputs are collected over N time instants \(t_i\). Hence, the measurement model is

$$\begin{aligned} y(t_i)=\int _0^\infty g^0(\tau )u(t_i-\tau )d\tau + e(t_i),\quad i=1,\dots ,N , \end{aligned}$$
(7.17)

where y(t), u(t) and e(t) are the system output, the known input and the noise at time instant \(t\in {\mathbb R}^+\), respectively, while \(g^0(t), \ t\in {\mathbb R}^+\) is the “true” system impulse response.

Similarly to what was done in the previous section, we will study how to determine from a finite set of input–output data a regularized estimate of the impulse response \(g^0\) in the RKHS framework. The first step is to choose a positive semidefinite kernel \(K:{\mathbb R}^+\times {\mathbb R}^+\rightarrow {\mathbb R}\). It induces the RKHS \(\mathscr {H}\) containing the impulse response candidates \(g \in \mathscr {H}\). Then, the linear model can be estimated by solving the following function estimation problem:

$$\begin{aligned} \hat{g} = \mathop {\mathrm {arg\,min}}\limits _{g \in \mathscr {H}} \sum _{i=1}^N \Big (y(t_i) - \int _0^\infty g(\tau )u(t_i-\tau )d\tau \Big )^2 + \gamma \Vert g\Vert ^2_{\mathscr {H}}. \end{aligned}$$
(7.18)

The closed-form expression of the impulse response estimate \(\hat{g}\) is given by the following representer theorem that again derives from Theorem 6.16 and the same discussion reported before Theorem 7.1. Note just that now any functional \(L_i\) entering Theorem 6.16 is applied to continuous-time impulse responses g in the RKHS \(\mathscr {H}\). Hence, it represents the continuous-time convolution with the input, i.e., \(L_i\) maps \(g \in \mathscr {H}\) into the system output evaluated at the time instant \(t_i\).

Theorem 7.3

(Representer theorem for continuous-time linear system identification, based on [73, 90]) Consider the function estimation problem (7.18). Assume that \(\mathscr {H}\) is the RKHS induced by a positive semidefinite kernel \(K:{\mathbb R}^+\times {\mathbb R}^+\rightarrow {\mathbb R}\) and that, for \(i=1,\ldots ,N\), the functions \(\eta _i\) defined by

$$\begin{aligned} \eta _i(s) = \int ^{\infty }_{0}K(s,\tau )u(t_i-\tau )d\tau , \quad s\in {\mathbb R}^+ \end{aligned}$$
(7.19)

are all well defined in \(\mathscr {H}\). Then, the solution of (7.18) is

$$\begin{aligned} \hat{g}(s) = \sum _{i=1}^N \ \hat{c}_i \eta _i(s), \quad s\in {\mathbb R}^+ \end{aligned}$$
(7.20)

where \(\hat{c}_i\) is the ith entry of the vector

$$\begin{aligned} \hat{c} = (O+\gamma I_N)^{-1}Y \end{aligned}$$
(7.21)

with \(Y=[y(t_1),\dots , y(t_N)]^T\) and the (i, j)th entry of O given by

$$\begin{aligned} O_{ij} = \int ^{\infty }_{0}\int ^{\infty }_{0}K(\tau ,s)u(t_i-s)u(t_j-\tau ) ds d\tau , \quad i,j=1,\dots ,N. \end{aligned}$$
(7.22)
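The theorem can be turned into a simple numerical scheme by approximating the integrals in (7.19) and (7.22) with quadrature on a grid. The sketch below does this for an exponentially decaying kernel \(e^{-\beta \max (s,t)}\) (introduced as (7.24) in the next example); the input, the sampling instants, the "true" impulse response and the grid are all illustrative assumptions.

```python
import numpy as np

# Sketch of Theorem 7.3 with the integrals in (7.19) and (7.22) approximated by
# trapezoidal quadrature on a uniform grid. Kernel, input, sampling instants
# and "true" system are illustrative choices.
rng = np.random.default_rng(3)
beta, gamma, N = 1.0, 0.1, 30
t_meas = np.linspace(0.5, 15.0, N)                    # output sampling instants
tau = np.linspace(0.0, 20.0, 2001)                    # quadrature grid on [0, 20]
h = tau[1] - tau[0]
w = np.full(tau.shape, h); w[0] *= 0.5; w[-1] *= 0.5  # trapezoidal weights

def u(t):                                             # input: unit step at t = 0
    return (np.asarray(t) >= 0).astype(float)

Kmat = np.exp(-beta * np.maximum.outer(tau, tau))     # K(s, tau) on the grid
U = u(t_meas[:, None] - tau[None, :])                 # U[i, l] = u(t_i - tau_l)

g0 = np.exp(-2.0 * tau)                               # "true" impulse response
Y = (U * w) @ g0 + 0.02 * rng.standard_normal(N)      # noisy sampled outputs

Eta = (U * w) @ Kmat                                  # Eta[i, :] = eta_i on the grid (7.19)
O = (U * w) @ Eta.T                                   # O_ij as in (7.22)
c_hat = np.linalg.solve(O + gamma * np.eye(N), Y)     # (7.21)
g_hat = Eta.T @ c_hat                                 # (7.20) evaluated on the grid
```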

Example 7.4

(Stable spline kernel for continuous-time system identification) In Example 6.5, we introduced the first-order spline kernel \(\min (x,y)\) on \([0,1]\times [0,1]\). It describes a RKHS of continuous functions f on the unit interval that satisfy \(f(0)=0\) and whose squared norm is the energy of the first-order derivative, i.e.,

$$\begin{aligned} \int _0^1 \left( \dot{f}(x)\right) ^2 dx. \end{aligned}$$
(7.23)

To describe stable impulse responses g, we instead need a kernel defined over the positive real axis \(\mathbb {R}^+\) that induces the constraint \(g(+\infty )=0\). A simple way to obtain this is to exploit the composition of the spline kernel with an exponential change of coordinates mapping \(\mathbb {R}^+\) into [0, 1]. The resulting kernel is called (continuous-time) first-order stable spline kernel. It is given by

$$\begin{aligned} K(s,t) = \min (e^{-\beta s}, e^{-\beta t})=e^{-\beta \max (s,t)},\quad s,t\in {\mathbb R}^+, \end{aligned}$$
(7.24)

where \(\beta >0\) regulates the change of coordinates and, hence, the impulse response's decay rate. So, \(\beta \) can be seen as a kernel parameter related to the dominant pole of the system.

It is interesting to note the similarity between the kernel (7.15) and the first-order stable spline kernel (7.24). By letting \(\alpha =\exp (-\beta )\), the sampled version of the first-order stable spline kernel (7.24) corresponds exactly to the TC kernel (7.15). The top panels of Fig. 7.1 plot (7.24) together with some of its kernel sections: they are all continuous and exponentially decaying to zero. This kernel also inherits the universality property of the splines. In fact, linear combinations of its kernel sections can approximate any continuous impulse response on every compact subset of \(\mathbb {R}^+\).

Fig. 7.1

First-order (top left) and second-order (bottom left) stable spline kernel with some kernel sections (right panels) obtained with \(\beta =0.5\) and centred on \(0,0.5,1,\ldots ,10\) (bottom)

The relationship with splines also permits one to easily obtain a spectral decomposition of (7.24). In particular, in Example 6.11, we obtained the following expansion of the spline kernel:

$$ \min (x,y)=\sum _{i=1}^{+\infty } \zeta _i \rho _i(x)\rho _i(y) $$

with

$$ \rho _i(x) = \sqrt{2} \sin \left( i \pi x - \frac{\pi x}{2}\right) , \ \ \zeta _i = \frac{1}{( i \pi - \pi /2)^2}, $$

where all the \(\rho _i\) are mutually orthogonal on [0, 1] w.r.t. the Lebesgue measure. In view of the simple connection between spline and stable spline kernels given by exponential time transformations, one easily obtains that the first-order stable spline kernel can be diagonalized as follows:

$$\begin{aligned} e^{-\beta \max (s,t)} = \sum _{i=1}^\infty \zeta _i\phi _i(s)\phi _i(t) \end{aligned}$$
(7.25)

with

$$\begin{aligned} \phi _i(t)=\rho _i(e^{-\beta t}),\ \ \zeta _i = \frac{1}{( i \pi - \pi /2)^2}, \end{aligned}$$
(7.26)

where the \(\phi _i\) are now orthogonal on \([0,+\infty )\) w.r.t. the measure \(\mu \) of density \(\beta e^{-\beta t}\). In Fig. 6.3, we reported the eigenfunctions \(\rho _i\) with \(i=1,2,8\) and the eigenvalues \(\zeta _i\) for the first-order spline kernel (6.47). For comparison, Fig. 7.2 shows the corresponding eigenfunctions \(\phi _i\) of the first-order stable spline kernel (7.24) with \(\beta =1\), together with the \(\zeta _i\). While the eigenvalues are the same, differently from the \(\rho _i\), the eigenfunctions \(\phi _i\) now decay exponentially to zero.

Fig. 7.2

Expansion of the continuous-time first-order stable spline kernel \(e^{-\beta \max (x,y)}\) with \(\beta =1\): eigenfunctions \(\phi _i\) for \(i=1,2,8\) (left panel) and eigenvalues \(\zeta _i\) (right)
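As a quick numerical sanity check of (7.25) and (7.26), the truncated expansion can be compared with the kernel value at a pair of test points; the number of retained terms and the test points below are arbitrary.

```python
import numpy as np

# Check of the expansion (7.25)-(7.26): the truncated sum over the first n
# eigenfunctions/eigenvalues should approach exp(-beta*max(s,t)). The number
# of terms and the test points are arbitrary illustrative choices.
beta, n = 1.0, 5000
i = np.arange(1, n + 1)
zeta = 1.0 / (i * np.pi - np.pi / 2) ** 2

def phi(t):                                           # phi_i(t) = rho_i(exp(-beta*t))
    x = np.exp(-beta * np.asarray(t))[..., None]
    return np.sqrt(2) * np.sin(i * np.pi * x - np.pi * x / 2)

s, t = 0.7, 2.3
approx = np.sum(zeta * phi(s) * phi(t))
exact = np.exp(-beta * max(s, t))
print(exact, approx)                                  # close; the truncation error decays as 1/n
```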

Having obtained one spectral decomposition of (7.24), we can now exploit Theorem 6.10 to obtain the following representation of the RKHS induced by the first-order stable spline kernel:

$$\begin{aligned} \mathscr {H}=\Big \{g \ | \ g(t)=\sum _{i=1}^\infty c_i\phi _i(t), \ t\ge 0, \ \ \sum _{i=1}^\infty \frac{c_i^2}{\zeta _i}<\infty \Big \}, \end{aligned}$$
(7.27)

and the squared norm of g turns out to be

$$\begin{aligned} \Vert g\Vert _{\mathscr {H}}^2 = \sum _{i=1}^\infty \frac{c_i^2}{\zeta _i}. \end{aligned}$$
(7.28)

Now we will exploit the above results to obtain a more useful expression for \(\Vert g\Vert _{\mathscr {H}}^2\). The deep connection between spline and stable spline kernels implies that these two spaces are isometrically isomorphic, i.e., there is a one-to-one correspondence that preserves inner products. In fact, we can associate to any stable spline function g(t) in \(\mathscr {H}\) the spline function f(t) in the space induced by (6.47) such that \(g(t)=f(e^{-\beta t})\). So, \(g(t)=\sum _{i=1}^\infty c_i\phi _i(t)\) implies \(f(t)=\sum _{i=1}^\infty c_i \rho _i(t)\) and the two functions indeed have the same norm \( \sum _{i=1}^\infty \frac{c_i^2}{\zeta _i}\). Now, using (7.23) and (7.28), we obtain

$$\begin{aligned} \Vert g\Vert _{\mathscr {H}}^2 = \int _0^1 \left( \dot{f}(t)\right) ^2 dt = \int _0^{+\infty } \left( \dot{g}(t)\right) ^2 \frac{e^{\beta t}}{\beta } dt. \end{aligned}$$
(7.29)

This expression gives insights into the nature of the stable spline space. Compared to the classical Sobolev space induced by the first-order spline kernel, the norm penalizes the energy of the first-order derivative of g with a weight proportional to \(e^{\beta t}\). Such a norm thus forces all the functions in \(\mathscr {H}\) to be continuous impulse responses decaying to zero at least exponentially. Note also that (7.29) can indeed be seen as the continuous-time counterpart of the norm (7.16) associated with the discrete-time stable spline kernel.
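The identity (7.29) can be verified numerically for g equal to a finite combination of kernel sections of (7.24), for which the left-hand side is available through the Gram matrix and the derivative of g is known in closed form; nodes, coefficients and \(\beta\) below are illustrative.

```python
import numpy as np

# Check of (7.29): for g a finite combination of sections of the stable spline
# kernel (7.24), ||g||^2 computed through the Gram matrix is compared with the
# weighted derivative energy. Nodes, coefficients and beta are illustrative.
beta = 0.7
s = np.array([0.5, 1.5, 3.0])                         # kernel section centres
a = np.array([1.0, -0.4, 0.6])                        # expansion coefficients

G = np.exp(-beta * np.maximum.outer(s, s))            # Gram matrix K(s_j, s_l)
norm_sq_kernel = a @ G @ a

t = np.linspace(0.0, 60.0, 300001)                    # quadrature grid
dg = -beta * np.exp(-beta * t) * ((t[:, None] > s[None, :]) @ a)  # exact derivative of g
integrand = dg ** 2 * np.exp(beta * t) / beta
norm_sq_integral = np.trapz(integrand, t)

print(norm_sq_kernel, norm_sq_integral)               # agree up to the quadrature error
```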

Let us see now how to generalize the kernel (7.24). In Sect. 6.6.6 of the previous chapter, we have introduced the general class of spline kernels. Here, we started our discussion using the first-order (linear) spline kernel \(\min (x,y)\) but we have seen that higher-order models can be useful to reconstruct smoother functions, an important example being the second-order (cubic) spline kernel (6.48). Applying exponential time transformations to the splines, the class of the so-called stable spline kernels is obtained. For instance, from (6.48), one obtains the second-order stable spline kernel

$$\begin{aligned} \frac{e^{-\beta (s+t+\max (s,t))}}{2}-\frac{e^{-3\beta \max (s,t)}}{6}. \end{aligned}$$
(7.30)

The bottom panels of Fig. 7.1 plot (7.30) together with some kernel sections: they decay exponentially to zero and are more regular than those associated with (7.24).    \(\square \)

7.1.3 More General Use of the Representer Theorem for Linear System Identification \(\star \)

Theorems 7.1 and 7.3 are special cases of the more general representer theorem involving function estimation from sparse and noisy data. It was reported as Theorem 6.16 in the previous chapter. Let us briefly recall it. Its starting point was the optimization problem

$$\begin{aligned} \hat{g} = \mathop {\mathrm {arg\,min}}\limits _{g \in \mathscr {H}} \ \sum _{i=1}^{N}\mathscr {V}_i(y_i,L_i[g])+ \gamma \Vert g\Vert _{\mathscr {H}}^2, \end{aligned}$$
(7.31)

where \(\mathscr {V}_i\) is a loss function, e.g., the quadratic loss adopted in this chapter, and each functional \(L_i: \mathscr {H} \rightarrow \mathbb {R}\) is linear and bounded. Then, all the solutions of (7.31) are given by

$$\begin{aligned} \hat{g} = \sum _{i=1}^N \ c_i \eta _i, \end{aligned}$$
(7.32)

where each \(\eta _i \in \mathscr {H}\) is the representer of \(L_i\) given by

$$\begin{aligned} \eta _i(t) = L_i[K(\cdot ,t)]. \end{aligned}$$
(7.33)

How to compute the expansion coefficients \(c_i\) will then depend on the nature of the \(\mathscr {V}_i\), as described in Sect. 6.5.

The estimator (7.31) can be exploited for linear system identification by thinking of g as an impulse response, using, e.g., a stable spline kernel to define \(\mathscr {H}\). The linear functional \(L_i\) is then defined by a convolution and returns the noiseless system output at instant \(t_i\). In particular, in discrete time one has

$$\begin{aligned} L_i[g]=\sum ^{\infty }_{k=1}g(k)u(t_i-k), \quad t_i=1,\ldots ,N \end{aligned}$$
(7.34)

while in continuous time, it holds that

$$\begin{aligned} L_i[g]=\int _0^\infty g(\tau )u(t_i-\tau )d\tau . \end{aligned}$$
(7.35)

When quadratic losses are used, (7.31) becomes the regularization network described in Sect. 6.5.1, whose expansion coefficients are available in closed form. One has \(\hat{c} = (O+\gamma I_N)^{-1}Y\) with the (t, s)th entry of the matrix O given by \(O_{ts} = L_s[L_t[K]]\), as given by (7.14) in discrete time and by (7.22) in continuous time. The use of losses \(\mathscr {V}_i\) different from the quadratic one then also opens the way to the definition of many new algorithms for impulse response estimation. For example, the use of Vapnik’s \(\epsilon \)-insensitive loss described in Sect. 6.5.3 leads to support vector regression for linear system identification. Beyond promoting sparsity in the coefficients \(c_i\), it also makes the estimator robust against outliers since the penalty on large residuals grows linearly. Outliers can be tackled also by adopting the \(\ell _1\) or Huber loss, see Sect. 6.5.2. A general system identification framework that includes all the convex piecewise linear quadratic losses and penalties is, e.g., described in [2].

Interestingly, the estimator (7.31) can also be conveniently adopted for linear system identification by giving g a meaning different from an impulse response. For instance, in system identification there are important IIR models that use Laguerre functions, see e.g., [91, 92], whose z-transform is

$$ \frac{\sqrt{1-\alpha ^2}}{z-\alpha }\Big (\frac{1-\alpha z}{z-\alpha }\Big )^{j-1}, \quad j=1,2,\ldots . $$

They form an orthonormal basis in \(\ell _2\) and some of them are displayed in Fig. 7.3.

Fig. 7.3

Discrete-time Laguerre functions of order \(j=1,2,8\) obtained with \(\alpha =0.99\) (samples are linearly interpolated)

Another option is given by the Kautz basis functions, which also allow one to include information on the presence of system resonances [46]. Using \(\phi _i\) to denote such basis functions, the impulse response model can be written as

$$ f(t) = \sum _{i=1}^{\infty } \ g_i \phi _i(t). $$

A problem is how to determine the coefficients \(g_i\) from data. Classical approaches use truncated expansions \(f = \sum _{i=1}^{d} \ g_i \phi _i\), with model order d estimated using, e.g., Akaike’s criterion, as discussed in Sect. 2.4.3, and then determine the \(g_i\) by least squares. An interesting alternative is to let \(d=+\infty \) and to think that the \(g_i\) define the function g such that \(g(i)=g_i\). One can then estimate the coefficients through (7.31) adopting a kernel, like TC and stable spline, that includes information on the expansion coefficients’ decay to zero. Working in discrete time, the functionals \(L_i\) entering (7.31) are in this case defined by

$$ L_i[g]= \sum _{j=1}^{\infty } \ g_j \sum ^{\infty }_{k=1} \phi _j(k) u(t_i-k), $$

while in continuous time, one has

$$ L_i[g]= \sum _{j=1}^{\infty } \ g_j \int _0^\infty \phi _j(\tau )u(t_i-\tau )d\tau . $$
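The following sketch, working in discrete time, illustrates this construction under several simplifying assumptions: the infinite expansion is truncated at J Laguerre functions, which are generated here by the filter recursions implied by their z-transform, and a TC kernel is placed on the coefficient sequence. All numerical values, the input and the "true" system are illustrative; this is not the only way to implement the idea.

```python
import numpy as np

# Sketch of the basis-expansion use of (7.31): the impulse response is modelled
# as f = sum_j g_j phi_j with phi_j the discrete Laguerre functions, and the
# coefficients g_j are estimated with a TC kernel prior on their decay.
rng = np.random.default_rng(4)
N, T, J = 200, 400, 15                  # data size, time horizon, basis truncation
a_lag, alpha, gamma = 0.9, 0.8, 1.0     # Laguerre pole, TC decay, regularization

def filt(b, aden, x):                   # direct-form difference equation y = (b/aden) x
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        y[n] -= sum(aden[k] * y[n - k] for k in range(1, len(aden)) if n - k >= 0)
    return y

# phi_1 = impulse response of sqrt(1-a^2) z^{-1}/(1 - a z^{-1});
# phi_{j+1} obtained from phi_j through the all-pass (z^{-1} - a)/(1 - a z^{-1}).
delta = np.zeros(T); delta[0] = 1.0
Phi_basis = np.zeros((J, T))
Phi_basis[0] = filt([0.0, np.sqrt(1 - a_lag ** 2)], [1.0, -a_lag], delta)
for j in range(1, J):
    Phi_basis[j] = filt([-a_lag, 1.0], [1.0, -a_lag], Phi_basis[j - 1])

u = rng.standard_normal(N)              # input, zero before time 0
f_true = Phi_basis.T @ (0.5 * 0.6 ** np.arange(1, J + 1))     # "true" impulse response
Y = np.array([np.dot(f_true[:t], u[t - 1::-1]) for t in range(1, N + 1)])
Y += 0.05 * rng.standard_normal(N)

# Psi[i, j] = L_i applied to the j-th basis function (discrete convolution with u)
Psi = np.array([[np.dot(Phi_basis[j, :t], u[t - 1::-1]) for j in range(J)]
                for t in range(1, N + 1)])
jj = np.arange(1, J + 1)
Kcoef = alpha ** np.maximum.outer(jj, jj)            # TC kernel on the coefficients
O = Psi @ Kcoef @ Psi.T
c_hat = np.linalg.solve(O + gamma * np.eye(N), Y)
g_hat = Kcoef @ Psi.T @ c_hat                        # estimated expansion coefficients
f_hat = Phi_basis.T @ g_hat                          # estimated impulse response
```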

7.1.4 Connection with Bayesian Estimation of Gaussian Processes

Similarly to what was discussed in the finite-dimensional setting in Sect. 4.9, the more general regularization in RKHS can also be given a probabilistic interpretation in terms of Bayesian estimation. In this paradigm, the different loss functions correspond to alternative statistical models for the observation noise, while the kernel represents the covariance of the unknown random signal, assumed independent of the noise. In particular, when the loss is quadratic, all the involved distributions are Gaussian.

We now discuss the connection under the linear system identification perspective where the “true” impulse response \(g^0\) is seen as the random signal to estimate. Consider the measurement model

$$\begin{aligned} y(t_i)=L_i[g^0]+ e(t_i), \quad i=1,\dots , N, \end{aligned}$$
(7.36)

where \(L_i\) is a linear functional of the true impulse response \(g^0\) defined by convolution with the system input evaluated at \(t_i\). One has

$$L_i[g^0]=\sum ^{\infty }_{k=1}g^0(k)u(t_i-k)$$

in discrete time and

$$L_i[g^0]=\int _0^\infty g^0(\tau )u(t_i-\tau )d\tau $$

in continuous time. So, the impulse response estimators discussed in this chapter can be compactly written as

$$\begin{aligned} \hat{g} = \mathop {\mathrm {arg\,min}}\limits _{g \in \mathscr {H}} \ \sum _{i=1}^N (y(t_i)-L_i[g])^2 + \gamma \Vert g\Vert _{\mathscr {H}}^2, \end{aligned}$$
(7.37)

where the RKHS \(\mathscr {H}\) contains functions \(g:\mathscr {X}\rightarrow {\mathbb R}\) with \(\mathscr {X}=\mathbb {N}\) in discrete time and \(\mathscr {X}={\mathbb R}^+\) in continuous time.

The following result (whose simple proof is in Sect. 7.7.2) shows that, under Gaussian assumptions on the impulse response and the noise, (7.37) provides the minimum variance estimate of \(g^0\) given the measurements \(Y=[y(t_1),\dots ,y(t_N)]^T\).

Proposition 7.1

Let the following assumptions hold:

  • the impulse response \(g^0\) is a zero-mean Gaussian process on \(\mathscr {X}\). Its covariance function is defined by

    $$ \mathscr {E} (g^0(t)g^0(s)) = \lambda K(t,s), $$

    where \(\lambda \) is a positive scalar and K is a kernel;

  • the e(t) are mutually independent zero-mean Gaussian random variables with variance \(\sigma ^2\). Moreover, they are independent of \(g^0\).

Let \(\mathscr {H}\) be the RKHS induced by K, set \(\gamma =\sigma ^2/\lambda \) and define

$$\begin{aligned} \hat{g} = \arg \min _{g \in \mathscr {H}} \left( \sum _{i=1}^{N} (y(t_i)-L_i[g])^2 + \gamma \Vert g\Vert _{\mathscr {H}}^2 \right) . \end{aligned}$$

Then, \(\hat{g}\) is the minimum variance estimator of \(g^0\) given Y, i.e.,

$$ \mathscr {E} [g^0(t) | Y] = \hat{g}(t) \quad \forall t \in \mathscr {X}. $$
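The proposition can be illustrated numerically in discrete time: the Gaussian posterior mean computed by standard conditioning formulas coincides with the representer-theorem estimate of Theorem 7.1 when \(\gamma =\sigma ^2/\lambda \). In the sketch below, the kernel, the truncation horizon and all numerical values are illustrative.

```python
import numpy as np

# Illustration of Proposition 7.1 in discrete time: E[g0|Y] from Gaussian
# conditioning equals the RKHS estimate with gamma = sigma^2/lambda.
rng = np.random.default_rng(5)
N, T, alpha, lam, sigma = 40, 150, 0.85, 2.0, 0.1
gamma = sigma ** 2 / lam

k = np.arange(1, T + 1)
K = alpha ** np.maximum.outer(k, k)                    # kernel / prior covariance (up to lam)
u = rng.standard_normal(N)
U = np.zeros((N, T))                                   # U[t-1, k-1] = u(t-k)
for t in range(1, N + 1):
    U[t - 1, :t] = u[t - 1::-1]

g0 = np.sqrt(lam) * np.linalg.cholesky(K + 1e-10 * np.eye(T)) @ rng.standard_normal(T)
Y = U @ g0 + sigma * rng.standard_normal(N)

# Bayesian route: E[g0|Y] = Cov(g0, Y) Cov(Y)^{-1} Y
cov_gY = lam * K @ U.T
cov_Y = lam * U @ K @ U.T + sigma ** 2 * np.eye(N)
g_bayes = cov_gY @ np.linalg.solve(cov_Y, Y)

# RKHS route (Theorem 7.1): eta_t and O built from the same K and input
Eta = U @ K                                            # Eta[t-1, i-1] = eta_t(i)
O = U @ K @ U.T
c_hat = np.linalg.solve(O + gamma * np.eye(N), Y)
g_rkhs = Eta.T @ c_hat
print(np.max(np.abs(g_bayes - g_rkhs)))                # zero up to round-off
```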

Remark 7.2

The connection between regularization in RKHS and estimation of Gaussian processes was first pointed out in [51] in the context of spline regression, using quadratic losses, see also [41, 83, 90]. The connection also holds for a wide class of losses \(\mathscr {V}_i\) different from the quadratic one. For instance, in this statistical framework, using the absolute value loss corresponds to Laplacian noise assumptions. The statistical interpretation of the \(\epsilon \)-insensitive loss in terms of Gaussians with mean and variance given by suitable random variables can be found in [79], see also [40, 67]. For all these kinds of noise models, and many others, it can be shown that the RKHS estimate \(\hat{g}\) includes all the possible finite-dimensional maximum a posteriori estimates of \(g^0\), see [3] for details.

Fig. 7.4

The largest space contains all the realizations of a zero-mean Gaussian process of covariance K. The smallest space is the RKHS \(\mathscr {H}\) induced by K, assumed here infinite dimensional. The probability that realizations of f fall in the RKHS is zero. Instead, when the assumptions underlying the representer theorem hold, the realizations of the minimum variance estimator \({\mathscr {E}}[f|Y]\) are contained in \(\mathscr {H}\) with probability one

Remark 7.3

The relation between RKHSs and Gaussian stochastic processes, or more general Gaussian random fields, is stated by Proposition 7.1 in terms of minimum variance estimators. In particular, since the representer theorem ensures that such an estimator is a sum of a finite number of basis functions belonging to \(\mathscr {H}\), it turns out that \(\hat{g}\) belongs to the RKHS induced by the covariance of \(g^0\) with probability one. Now, one may also wonder what happens a priori, before seeing the data. In other words, the question is whether realizations of a zero-mean Gaussian process of covariance K fall in the RKHS induced by K. If the kernel K is associated with an infinite-dimensional \(\mathscr {H}\), the answer is negative with probability one, as graphically illustrated in Fig. 7.4. While deep discussions can be found in [9, 34, 59, 68], here we give just a hint of this fact. Assume that the kernel admits the decomposition

$$ K(s,t)= \sum _{i=1}^M \zeta _i\phi _i(s)\phi _i(t) $$

inducing an M-dimensional RKHS \(\mathscr {H}\). Let the deterministic functions \(\phi _i\) be independent. Then, we know from Theorem 6.13 that, if \(f(t) =\sum _{i=1}^M a_i \phi _i(t)\), then

$$ \Vert f\Vert _{\mathscr {H}}^2 = \sum _{i=1}^M \ \frac{a_i^2}{\zeta _i}. $$

Now, think of K as a covariance and let \(a_i\) be zero-mean Gaussian and independent random variables of variance \(\zeta _i\), i.e.,

$$ a_i \sim \mathscr {N}(0,\zeta _i). $$

Then, the so-called Karhunen–Loève expansion of the Gaussian random field \(f\sim \mathscr {N}(0,K)\), also discussed in Sect. 5.6 to connect regularization and basis expansion in finite dimension, is given by

$$ f(t)=\sum _{i=1}^M a_i \phi _i(t) $$

with M possibly infinite and convergence in quadratic mean. The RKHS norm of f is now a random variable and, since the \(a_i\) are mutually independent with \({\mathscr {E}}a_i^2 = \zeta _i\), one has

$$ {\mathscr {E}}\Vert f\Vert _{\mathscr {H}}^2 = {\mathscr {E}}\sum _{i=1}^M \ \frac{a_i^2}{\zeta _i} = \sum _{i=1}^M \ \frac{{\mathscr {E}}a_i^2}{\zeta _i} = M. $$

So, if the RKHS is infinite dimensional, one has \(M=\infty \) and the expected (squared) RKHS norm of the process f diverges to infinity.

7.1.5 A Numerical Example

Our goal now is to illustrate the influence of the choice of the kernel on the quality of the impulse response estimate, using also the Bayesian interpretation of regularization. The example is a simple linear discrete-time system in the form of (7.1) whose transfer function, written using the z-transform, is \(1/(z(z-0.85))\), so that the data are generated as

$$\begin{aligned} y(t)&=\frac{1}{z(z-0.85)}u(t)+e(t),\quad t=1,\dots , 20. \end{aligned}$$
(7.38)

The system’s impulse response is reported in Fig. 7.5. The disturbances e(t) are independent Gaussian random variables with zero mean and variance \(0.05^2\). For ease of visualization, we let the input u(t) be an impulsive signal, i.e., \(u(0)=1\) and \(u(t)=0\) elsewhere. Thus, the impulse response has to be estimated from 20 direct and noisy impulse response measurements.

We consider a Monte Carlo simulation of 200 runs. At any run, the outputs are obtained by generating mutually independent measurement noises. One data set is shown in Fig. 7.5. For each of the 200 data sets, we use the regularized IIR estimator (7.10). As for the kernel \(K:\mathbb {N}\times \mathbb {N}\rightarrow {\mathbb R}\), we will compare the performance of three kernels: the Gaussian (6.43), the cubic spline (6.48) and the stable spline (7.15), defined, respectively, by

$$ \exp \Big (-\frac{(i-j)^2}{\rho }\Big ), \quad \frac{i j \min \{i, j\}}{2}-\frac{(\min \{i, j\})^3}{6}, \quad \alpha ^{\max (i,j)}. $$

Recall that the Gaussian and the cubic spline kernel are the most used in machine learning to include information on smoothness. The cubic spline estimator could be also complemented with a bias space given, e.g., by a linear function, as described in Sect. 6.6.7. However, one would obtain results very similar to those described in what follows.

Fig. 7.5

The true impulse response (thick line) and one out of the 200 data sets (\(\circ \))

To adopt the estimator (7.10), we need to find a suitable value for the regularization parameter \(\gamma \) and also for the unknown kernel parameters, i.e., the kernel width \(\rho \) in the Gaussian kernel and the stability parameter \(\alpha \) for stable spline. As already done, e.g., in Sect. 1.2 for ridge regression, an oracle-based procedure is adopted to optimally balance bias and variance. The unknown parameters are obtained by maximizing the measure of fit defined as follows:

$$\begin{aligned} 100\left( 1 - \left[ \frac{\sum ^{50}_{k=1}|g_k^0-\hat{g}(k)|^2 }{\sum ^{50}_{k=1}|g_k^0-\bar{g}^0|^2}\right] ^{\frac{1}{2}}\right) ,\ \ \bar{g}^0=\frac{1}{50}\sum ^{50}_{k=1}g^0_k, \end{aligned}$$
(7.39)

where the computation is restricted to the first 50 samples, which is where, in practice, the impulse response differs from zero. This tuning procedure is ideal since it exploits the true function \(g^0\). It is useful here since it removes the uncertainty introduced by the kernel tuning procedure and thus fully reveals the influence of the kernel choice on the quality of the impulse response estimate.
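A heavily simplified version of this experiment is sketched below: a single noise realization instead of 200 Monte Carlo runs, and coarse grids for the oracle search over \(\gamma\) and the kernel parameters. The grids and the random seed are arbitrary; the sketch is only meant to show how the three kernels and the fit (7.39) enter the computations.

```python
import numpy as np

# Simplified sketch of the experiment: the impulse response of (7.38) is
# estimated from direct noisy measurements using (7.10) with the Gaussian,
# cubic spline and stable spline kernels; gamma and the kernel parameters are
# selected on coarse grids by maximizing the oracle fit (7.39).
rng = np.random.default_rng(6)
T, N, sd = 50, 20, 0.05
k = np.arange(1, T + 1)
g0 = 0.85 ** (k - 2) * (k >= 2)                      # impulse response of 1/(z(z-0.85))
Y = g0[:N] + sd * rng.standard_normal(N)             # impulsive input: direct noisy samples

def fit(g_hat):                                      # oracle fit (7.39)
    return 100 * (1 - np.linalg.norm(g0 - g_hat) / np.linalg.norm(g0 - g0.mean()))

def estimate(K, gamma):                              # regularization network on the grid
    return K[:, :N] @ np.linalg.solve(K[:N, :N] + gamma * np.eye(N), Y)

I, J = np.meshgrid(k, k, indexing="ij")
kernels = {
    "gauss": lambda p: np.exp(-(I - J) ** 2 / p),
    "cubic": lambda p: I * J * np.minimum(I, J) / 2 - np.minimum(I, J) ** 3 / 6,  # parameter-free
    "stable": lambda p: p ** np.maximum(I, J),
}
grids = {"gauss": [1, 10, 100], "cubic": [1.0], "stable": [0.5, 0.7, 0.9]}
for name, kern in kernels.items():
    best = max((fit(estimate(kern(p), g)), p, g)
               for p in grids[name] for g in [1e-4, 1e-2, 1, 1e2])
    print(name, round(best[0], 1))                   # best oracle fit for each kernel
```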

Fig. 7.6

True impulse response (thick line) and 200 impulse response estimates obtained using the cubic spline kernel (6.48) (top panel), the Gaussian kernel (6.43) (middle) and the stable spline kernel (bottom). The unknown parameters are estimated by an oracle that maximizes the fit (7.39) for each data set

The impulse response estimates obtained with the cubic spline, the Gaussian and the stable spline kernel are reported in Fig. 7.6. When the cubic spline kernel (6.48) is chosen, the impulse response estimates diverge as time progresses. This result can also be given a Bayesian interpretation where (6.48) becomes the covariance of the stochastic process \(g^0\). Specifically, the cubic spline kernel models the impulse response as the double integration of white noise. So, the impulse response coefficients are correlated, but the prior variance increases in time. For stable systems, variability is instead expected to decay to zero as t progresses. When the Gaussian kernel (6.43) is chosen, the quality of the impulse response estimates improves considerably, but many of them exhibit oscillations and the variance of the impulse response estimator is still large. Bayesian arguments here show that the Gaussian kernel models \(g^0\) as a stationary stochastic process. Smoothness information is encoded but not the fact that one expects the prior variance to decay to zero. Finally, the impulse response estimates returned by the stable spline kernel (7.15) are all very close to the truth. These outcomes are similar to those described, e.g., in Example 5.4 in Sect. 5.5. In particular, even if this example is rather simple, it shows clearly that a straightforward application of standard kernels from the machine learning and smoothing splines literature may give unsatisfactory results. Inclusion of dynamic systems features in the regularizer, like smooth exponential decay, greatly enhances the quality of the impulse response estimates.

7.2 Kernel Tuning

As we have seen in the previous parts of the book, kernels depend on some unknown parameters, the so-called hyperparameters. They can, e.g., include scale factors, the kernel width of the Gaussian kernel or the impulse response's decay rate in the TC and stable spline kernels. In real-world applications, the oracle-based procedure used in the previous section cannot be adopted. The kernels need instead to be tuned from data. Such a procedure is referred to as hyperparameter estimation and is the counterpart of model order selection in the classical paradigm of system identification. It determines model complexity within the new paradigm where system identification is seen as regularized function estimation in RKHSs. This calibration step will thus have a major impact on the model's performance, e.g., in terms of predictive capability on new data. Due to the connection with the ReLS methods in quadratic form, the tuning methods introduced in Chaps. 3 and 4 can be easily applied also in the RKHS framework. In particular, let \(K(\eta )\) denote a kernel, where \(\eta \) is the hyperparameter vector belonging to the set \(\varGamma \). Such a vector could also include other parameters not present in the kernel, e.g., the noise variance \(\sigma ^2\). Some calibration methods to estimate \(\eta \) from data are reported below.

7.2.1 Marginal Likelihood Maximization

The first approach we describe is marginal likelihood maximization (MLM), also called the empirical Bayes method in Sect. 4.4. MLM relies on the Bayesian interpretation of function estimation in RKHS discussed in Sect. 7.1.4. Under the same assumptions stated in Proposition 7.1, \(\eta \) can be estimated by maximum likelihood

$$\begin{aligned} \hat{\eta }= \arg \max _{\eta \in \varGamma } {\mathrm p}(Y|\eta ), \end{aligned}$$
(7.40)

with \({\mathrm p}(Y|\eta )\) obtained by integrating out \(g^0\) from the joint density \({\mathrm p}(Y|g^0){\mathrm p}(g^0|\eta )\), i.e.,

$$\begin{aligned} {\mathrm p}(Y|\eta )=\int {\mathrm p}(Y|g^0){\mathrm p}(g^0|\eta )dg^0. \end{aligned}$$
(7.41)

The probability density \({\mathrm p}(Y|\eta )\) is the marginal likelihood and, hence, (7.40) is called the MLM method.

Computation of (7.41) is especially simple in our case since the measurement model is linear and Gaussian. In fact, in the Bayesian interpretation of regularized linear system identification in RKHS, the impulse response \(g^0\) is a zero-mean Gaussian process with covariance \(\lambda K\), where \(\lambda \) is a positive scale factor. The impulse response is also assumed independent of the noises e(t), which are white and Gaussian with variance \(\sigma ^2\). Recall also the definition of the matrix O, now possibly a function of \(\eta \), reported in (7.14) for the discrete-time case, i.e., when \(\mathscr {X}=\mathbb {N}\), and in (7.22) for the continuous-time case, i.e., when \(\mathscr {X}={\mathbb R}^+\). The matrix \(\lambda O(\eta )\) plays an important role in the MLM method since it corresponds to the covariance matrix of the noise-free output vector \([L_1[g^0],\ \dots ,\ L_N[g^0]]^T\) and is thus often called the output kernel matrix. Then, as also discussed in Sect. 7.7.2, it follows that the vector Y is Gaussian with zero mean, i.e.,

$$ Y \sim \mathscr {N}(0,Z(\eta )), $$

where the covariance matrix \(Z(\eta )\) is given by

$$\begin{aligned} Z(\eta )= \lambda O(\eta ) + \sigma ^2 I_N \end{aligned}$$

with \(I_N\) the \(N \times N\) identity matrix. Here, the vector \(\eta \) could, e.g., contain both \(\lambda \) and \(\sigma ^2\). One then obtains that the empirical Bayes estimate of \(\eta \) in (7.40) becomes

$$\begin{aligned} \hat{\eta }= \arg \min _{\eta \in \varGamma } \ Y^T Z(\eta )^{-1} Y +\log \det (Z(\eta )), \end{aligned}$$
(7.42)

where the objective is proportional to the minus log of the marginal likelihood.

As discussed in Chap. 4, the MLM method embodies the Occam's razor principle, i.e., unnecessarily complex models are automatically penalized, see e.g., [83]. In particular, the Occam's factor arises thanks to the marginalization and manifests itself in the term \(\log \det (Z(\eta ))\) in (7.42). A simple example can be obtained by thinking of the behaviour of the objective for different values of the kernel scale factor \(\lambda \). When \(\lambda \) increases, the model becomes more complex since, under a stochastic viewpoint, the prior variance of the impulse response \(g^0\) increases. In fact, the term \(Y^T Z(\eta )^{-1} Y\), related to the data fit, decreases since the inverse of \(Z(\eta )\) tends to the null matrix (the model has infinite variance and can describe any kind of data). But the Occam's factor increases since \(\det (Z(\eta ))\) grows to infinity. In this way, \(\hat{\eta }\) will balance data fit and model complexity.
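A minimal sketch of how (7.42) can be evaluated and minimized for the TC kernel is given below. The data generation, the log/logistic parametrization used to keep \(\lambda ,\sigma ^2>0\) and \(0<\alpha <1\), and the use of a derivative-free optimizer are illustrative choices, not the only possible implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of kernel tuning by marginal likelihood maximization (7.42) for the
# TC kernel: eta collects the scale factor lambda, the decay rate alpha and
# the noise variance sigma^2. All numerical values are illustrative.
rng = np.random.default_rng(7)
N, T = 80, 120
k = np.arange(1, T + 1)
u = rng.standard_normal(N)
U = np.zeros((N, T))                                  # U[t-1, k-1] = u(t-k)
for t in range(1, N + 1):
    U[t - 1, :t] = u[t - 1::-1]
Y = U @ (0.8 ** k) + 0.1 * rng.standard_normal(N)

def neg_log_marglik(x):
    lam, sig2 = np.exp(x[0]), np.exp(x[2])            # positivity via exponential map
    alpha = 1.0 / (1.0 + np.exp(-x[1]))               # alpha constrained to (0, 1)
    O = U @ (alpha ** np.maximum.outer(k, k)) @ U.T
    Z = lam * O + sig2 * np.eye(N)                    # Z(eta) = lambda*O + sigma^2*I
    L = np.linalg.cholesky(Z)
    w = np.linalg.solve(L, Y)
    return w @ w + 2.0 * np.sum(np.log(np.diag(L)))   # Y'Z^{-1}Y + log det Z(eta)

res = minimize(neg_log_marglik, x0=[0.0, 0.0, np.log(0.01)], method="Nelder-Mead")
alpha_hat = 1.0 / (1.0 + np.exp(-res.x[1]))
sig2_hat = np.exp(res.x[2])
print(alpha_hat, sig2_hat)
```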

Fig. 7.7

True impulse response (red thick line) and impulse response estimates obtained by ridge regression with hyperparameters estimated by an oracle that optimizes the fit (top panel), and by the stable spline kernels of order 1 (middle) and 2 (bottom) with hyperparameters estimated by marginal likelihood maximization

Fig. 7.8

Boxplot of the fits over the 200 data sets achieved by ridge regression with oracle (left) and by the stable spline kernels of order 1 (middle) and 2 (right) with hyperparameters estimated via marginal likelihood maximization

7.2.1.1 Numerical Example

To illustrate the effectiveness of MLM, we revisit the example reported in Sect. 1.2. The problem is to reconstruct the impulse response reported in Fig. 7.7 (red line) from the 1000 input–output data displayed in Fig. 1.2. The system input is low-pass and this makes estimation hard due to ill-conditioning.

We will adopt three kernels. Using \(\delta \) to denote the Kronecker delta, the value K(ij) is defined, respectively, by

$$ \delta _{ij}, \quad \alpha ^{\max (i,j)}, \quad \frac{\alpha ^{i+j+\max (i,j)}}{2}-\frac{\alpha ^{3\max (i,j)}}{6}. $$

The first choice corresponds to ridge regression with the regularizer given by the sum of squared impulse response coefficients. The other two are the first- and second-order stable spline kernel reported in (7.15) and in (7.30), respectively. More specifically, the last kernel corresponds to the discrete-time version of (7.30) with \(\alpha =e^{-\beta }\).

In Fig. 1.5, we reported the ridge regularized estimate with \(\gamma \) chosen by an oracle to maximize the fit. To ease comparison with other approaches, such a figure is also reproduced in the top panel of Fig. 7.7. The reconstruction is not satisfactory since the regularizer does not include information on smoothness and decay. In fact, the Bayesian interpretation reveals that ridge regression describes the impulse response as a realization of white noise, a poor model for stable dynamic systems. This also explains the presence of oscillations in the reconstructed profile.

The middle and bottom panels report the estimates obtained by the stable spline kernels with the noise variance and the hyperparameters \(\gamma ,\alpha \) tuned from data through MLM. Even if no oracle is used, the quality of the impulse response reconstruction greatly increases. This is also confirmed by a Monte Carlo study where 200 data sets are obtained using the same kind of input but generating new independent noise realizations. MATLAB boxplots of the 200 fits for all three estimators are in Fig. 7.8. Here, the median is given by the central mark while the box edges are the 25th and 75th percentiles. Then, the whiskers extend to the most extreme fits not seen as outliers. Finally, the outliers are plotted individually. Average fits are \(73.7\%\) for ridge regression, \(83.9\%\) for first-order and \(90.2\%\) for second-order stable spline.

In this example, one can see that it is preferable to use the second-order stable spline kernel. This is easily explained by the fact that the true impulse response is quite regular so that increasing our expected smoothness improves the performance.

Interestingly, the selection between different kernels, like first- and second-order stable spline, can also be performed automatically by MLM, thus addressing the problem of model comparison described in Sect. 2.6.2. In fact, let s denote an additional hyperparameter that may assume only the values 0 or 1. Then, we can consider the combined kernel

$$ s \alpha ^{\max (i,j)} + (1-s) \Big (\frac{\alpha ^{i+j+\max (i,j)}}{2}-\frac{\alpha ^{3\max (i,j)}}{6}\Big ) $$

and optimize the hyperparameters \(s,\alpha \) and \(\gamma \) by MLM. Clearly, the role of s is to select one of the two kernels, e.g., if the estimate \(\hat{s}\) is 0, then the impulse response estimate will be given by a second-order stable spline. Applying this procedure to our problem, one finds that the second-order stable spline kernel is selected 177 times out of the 200 Monte Carlo runs. The obtained fits are shown in Fig. 7.9; their mean is \(88.8\%\).

Fig. 7.9

Boxplot of the fits over the 200 data sets achieved by a stable spline estimator where, beyond hyperparameters, also the kernel order (1 or 2) is estimated by marginal likelihood maximization

Remark 7.4

Kernel choice via MLM also has connections with selection through the concept of Bayesian model probability discussed in Sect. 4.11, see also [50]. In fact, assume we are given different competitive kernels (covariances) \(K^i\) and, for a while, assume also that all the hyperparameter vectors \(\eta ^i\) are known. We can then interpret each kernel as a different model. We can also assign a priori probabilities that data have been generated by the ith covariance \(K^i\), hence thinking of any model as a random variable itself. If all the kernels are given the same probability, the marginal likelihood computed using \(K^i\) becomes proportional to the posterior probability of the ith model. This permits exploiting the marginal likelihood to select the “best” kernel-based estimate among those generated by the \(K^i\). When hyperparameters are unknown, the marginal likelihoods can be evaluated with each \(\eta ^i\) set to its estimate \(\hat{\eta }^i\). In this case, care is needed since maximized likelihoods define model posterior probabilities that do not account for hyperparameter uncertainty. For example, if the dimension of \(\eta ^i\) changes with i, the risk is to select a kernel that has many parameters and overfits. This problem can be mitigated, e.g., by adopting the criteria described in Sect. 2.4.3. For instance, using BIC, we compute

$$ \hat{i} = \arg \min _i \ -2\log {\mathrm p}(Y| \hat{\eta }^i) + (\dim \eta ^i) \log N, $$

where N is the number of available output measurements and \(\dim \eta ^i\) is the number of hyperparameters contained in the ith model. Note that, when using stable spline kernels as in the above example, the BIC penalty is irrelevant since the first- and second-order stable spline estimators contain the same number of unknown hyperparameters.

7.2.2 Stein’s Unbiased Risk Estimator

The second method is Stein's unbiased risk estimator (SURE), introduced in Sect. 3.5.3.2. The idea of SURE is to minimize an unbiased estimator of the risk, which is the expected in-sample validation error of the model estimate. In what follows, \(g^0\) is no longer stochastic as in the previous subsection but is a deterministic impulse response. Identification data are given by

$$\begin{aligned} y(t_i)=L_i[g^0]+ e(t_i), \quad i=1,\dots , N, \end{aligned}$$

where the \(e(t_i)\) are independent, with zero mean and known variance \(\sigma ^2\), and each \(L_i\) is the linear functional defined by convolutions with the system input evaluated at \(t_i\). One thus has \(L_i[g^0]=\sum ^{\infty }_{k=1}g^0(k)u(t_i-k)\) in discrete time, where the \(t_i\) assume integer values, and \(L_i[g^0]=\int _0^\infty g^0(\tau )u(t_i-\tau )d\tau \) in continuous time. The N independent validation output samples \(y_{v}(t_i)\) are then defined by using the same input that generates the identification data but an independent copy of the noises, i.e.,

$$\begin{aligned} y_{v}(t_i) = L_i[g^0]+ e_{v}(t_i),\quad i=1,\dots ,N. \end{aligned}$$
(7.43)

So, all the 2N random variables \(e_{v}(t_i)\) and \(e(t_i)\) are mutually independent, with zero mean and noise variance \(\sigma ^2\). Consider the impulse response estimator

$$\begin{aligned} \hat{g} = \arg \min _{g \in \mathscr {H}} \left( \sum _{i=1}^{N} (y(t_i)-L_i[g])^2 + \gamma \Vert g\Vert _{\mathscr {H}}^2 \right) \end{aligned}$$

as a function of the hyperparameter vector \(\eta \). The predictions of the \(y_{v}(t_i)\) are then given by \(L_i[\hat{g}]\) and also depend on \(\eta \). The expected in-sample validation error of the model estimate \(\hat{g}\) is then given by the mean prediction error

$$\begin{aligned}&{\text {EVE}_{\text {in}}}(\eta )=\frac{1}{N}\sum _{i=1}^N{\mathscr {E}}(y_{v}(t_i)-L_i[\hat{g}])^2,\end{aligned}$$
(7.44)

where the expectation \({\mathscr {E}}\) is over the random noises \(e_v(t_i)\) and \(e(t_i)\). Note that the result not only depends on \(\eta \) but also on the unknown (deterministic) impulse response \(g^0\). So, we cannot compute the prediction error. However, it is possible to derive an unbiased estimate of it. To obtain this, let \(\hat{Y}(\eta )\) be the (column) vector with components \(L_i[\hat{g}]\). The output kernel matrix \(O(\eta )\), already introduced to describe marginal likelihood maximization, then gives the connection between the vector Y containing the measured outputs \(y(t_i)\) and the predictions. In fact, using the representer theorem to obtain \(\hat{g}\), and hence the \(L_i[\hat{g}]\), one obtains

$$\begin{aligned} \hat{Y}(\eta )= O(\eta )(O(\eta )+ \gamma I_N)^{-1}Y. \end{aligned}$$
(7.45)

Following the same line of discussion developed in Sect. 3.5.3.2 to obtain (3.96), we can derive the following unbiased estimator of (7.44):

$$\begin{aligned} \widehat{\text {EVE}_{\text {in}}}(\eta ) = \frac{1}{N}\Vert Y-\hat{Y}(\eta )\Vert ^2 +2\sigma ^2\frac{\text {dof}(\eta )}{N}, \end{aligned}$$
(7.46)

where \(\text {dof}(\eta )\) are the degrees of freedom of \(\hat{Y}(\eta )\) given by

$$\begin{aligned} \text {dof}(\eta ) = \mathrm {trace}(O(\eta )(O(\eta )+\gamma I_N)^{-1}) \end{aligned}$$
(7.47)

that vary from N to 0 as \(\gamma \) increases from 0 to \(\infty \).

Note that (7.46) is a function only of the N output measurements \(y(t_i)\). We can thus estimate the hyperparameter vector \(\eta \) by minimizing the unbiased estimator \(\widehat{\text {EVE}_{\text {in}}}(\eta )\) of \({\text {EVE}_{\text {in}}}(\eta )\), obtaining

$$\begin{aligned} \hat{\eta }= \mathop {\mathrm {arg\,min}}\limits _{\eta \in \varGamma } \frac{1}{N}\Vert Y-\hat{Y}(\eta )\Vert ^2 +2\sigma ^2\frac{\text {dof}(\eta )}{N}. \end{aligned}$$
(7.48)

The above formula has the same form as the AIC criterion (2.33) computed assuming Gaussian noise of known variance \(\sigma ^2\), except that the dimension m of the model parameter \(\theta \) is now replaced by the degrees of freedom \(\text {dof}(\eta )\).
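A minimal sketch of SURE tuning for the TC kernel with known noise variance is given below; the data, the grids over \(\alpha\) and \(\gamma\), and the noise level are illustrative assumptions.

```python
import numpy as np

# Sketch of SURE tuning (7.46)-(7.48) for the TC kernel with known noise
# variance: gamma and alpha are chosen on grids by minimizing the unbiased
# risk estimate. Data and grids are illustrative choices.
rng = np.random.default_rng(8)
N, T, sigma2 = 80, 120, 0.01
k = np.arange(1, T + 1)
u = rng.standard_normal(N)
U = np.zeros((N, T))                                   # U[t-1, k-1] = u(t-k)
for t in range(1, N + 1):
    U[t - 1, :t] = u[t - 1::-1]
Y = U @ (0.8 ** k) + np.sqrt(sigma2) * rng.standard_normal(N)

def sure(alpha, gamma):
    O = U @ (alpha ** np.maximum.outer(k, k)) @ U.T    # output kernel matrix
    H = O @ np.linalg.inv(O + gamma * np.eye(N))       # influence matrix
    Y_hat = H @ Y
    dof = np.trace(H)                                  # degrees of freedom (7.47)
    return np.sum((Y - Y_hat) ** 2) / N + 2 * sigma2 * dof / N   # objective (7.46)

grid = [(a, g) for a in (0.5, 0.7, 0.8, 0.9, 0.95) for g in (1e-4, 1e-3, 1e-2, 1e-1, 1.0)]
alpha_hat, gamma_hat = min(grid, key=lambda p: sure(*p))
print(alpha_hat, gamma_hat)
```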

7.2.3 Generalized Cross-Validation

The third approach is the generalized cross-validation (GCV) method. As discussed in Sects. 2.6.3 and 3.5.2.3, cross-validation (CV) is a classical way to estimate the expected validation error by efficient reuse of the data, and GCV is closely related to N-fold CV with quadratic losses. To describe it in the RKHS framework, let \(\hat{g}^{k}\) be the solution of the following function estimation problem:

$$\begin{aligned} \hat{g}^k = \mathop {\mathrm {arg\,min}}\limits _{g \in \mathscr {H}} \sum _{i=1,i\ne k}^N (y(t_i) - L_i[g])^2 + \gamma \Vert g \Vert ^2_{\mathscr {H}}. \end{aligned}$$
(7.49)

So, \(\hat{g}^{k}\) is the function estimate when the kth datum \(y(t_k)\) is left out. As also described, e.g., in [90, Chap. 4], the following relation between the prediction error of \(\hat{g}\) and the prediction error of \(\hat{g}^{k}\) holds:

$$\begin{aligned} y(t_k)-L_k[\hat{g}^{k}] = \frac{y(t_k)-L_k[\hat{g}]}{1-H_{kk}(\eta )}, \end{aligned}$$
(7.50)

where \(H_{kk}(\eta )\) is the (k, k)th element of the influence matrix

$$ H(\eta ) = O(\eta )(O(\eta )+ \gamma I_N)^{-1}. $$

Therefore, the validation error of the N-fold CV with quadratic loss function is

$$\begin{aligned} \sum _{k=1}^N \left( y(t_k)-L_k[\hat{g}^{k}]\right) ^2=\sum _{k=1}^N \left( \frac{y(t_k)-L_k[\hat{g}]}{1-H_{kk}(\eta )}\right) ^2. \end{aligned}$$
(7.51)

Minimizing the above quantity as a criterion to estimate the hyperparameter \(\eta \) leads to the predicted residual sum of squares (PRESS) method

$$\begin{aligned} \hat{\eta } = \mathop {\mathrm {arg\,min}}\limits _{\eta \in \varGamma } \sum _{k=1}^N \left( \frac{y(t_k)-L_k[\hat{g}]}{1-H_{kk}(\eta )}\right) ^2. \end{aligned}$$
(7.52)

The above criterion coincides with that derived in (3.80) working in the finite-dimensional setting.

GCV is a variant of (7.52) obtained by replacing each \(H_{kk}(\eta )\), \(k=1,\dots , N\), in (7.52) with their average. One obtains

$$\begin{aligned} \hat{\eta } = \mathop {\mathrm {arg\,min}}\limits _{\eta \in \varGamma } \sum _{k=1}^N \left( \frac{y(t_k)-L_k[\hat{g}]}{1-\mathrm {trace}(H(\eta ))/N}\right) ^2. \end{aligned}$$
(7.53)

In view of (7.45), one has

$$ \hat{Y}(\eta )= H(\eta )Y $$

and, from (7.47), one can see that \(\mathrm {trace}(H(\eta ))\) corresponds to the degrees of freedom \(\text {dof}(\eta )\), i.e.,

$$ \mathrm {trace}(H(\eta )) =\text {dof}(\eta ). $$

So, the GCV (7.53) can be rewritten as follows:

$$\begin{aligned} \hat{\eta } = \mathop {\mathrm {arg\,min}}\limits _{\eta \in \varGamma } \ \frac{\Vert Y-\hat{Y}(\eta )\Vert ^2}{(1-\text {dof}(\eta )/N)^2}. \end{aligned}$$
(7.54)

This corresponds to the criterion (3.82) obtained in the finite-dimensional setting. Unlike SURE, a practical advantage of PRESS and GCV is that they do not require knowledge (or a preliminary estimate) of the noise variance \(\sigma ^2\).
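
As a complement to the previous sketch, the following routine computes the PRESS (7.52) and GCV (7.54) scores from the same ingredients; again, \(O(\eta )\) is assumed to be available and the function name is illustrative.

```python
import numpy as np

def press_and_gcv(Y, O, gamma):
    """PRESS (7.52) and GCV (7.54) scores; no knowledge of sigma^2 is needed.

    Y : (N,) outputs, O : (N, N) output kernel matrix O(eta), gamma : regularization parameter.
    """
    N = len(Y)
    H = O @ np.linalg.inv(O + gamma * np.eye(N))                  # influence matrix H(eta)
    residuals = Y - H @ Y                                         # y(t_k) - L_k[g_hat]
    press = np.sum((residuals / (1.0 - np.diag(H))) ** 2)         # leave-one-out CV (7.52)
    gcv = np.sum(residuals ** 2) / (1.0 - np.trace(H) / N) ** 2   # GCV (7.54)
    return press, gcv
```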

7.3 Theory of Stable Reproducing Kernel Hilbert Spaces

In the numerical experiments reported in this chapter, we have seen that regularized IIR models based, e.g., on TC and stable splines provide much better estimates of stable linear dynamic systems than other popular machine learning choices like the Gaussian kernel. The key was the inclusion, in the identification process, of information on the decay rate of the impulse response. This motivates the study of the class of so-called stable kernels, which enforce the stability constraint on the induced RKHS.

7.3.1 Kernel Stability: Necessary and Sufficient Conditions

The necessary and sufficient condition for a linear system to be bounded-input–bounded-output (BIBO) stable is that its impulse response \(g \in \ell _1\) for the discrete-time case and \(g \in \mathscr {L}_1\) for the continuous-time case. Here, \(\ell _1\) is the space of absolutely summable sequences, while \(\mathscr {L}_1\) contains the absolutely integrable functions on \(\mathbb {R}^+\) (equipped with the classical Lebesgue measure), i.e.,

$$\begin{aligned} \sum _{k=1}^\infty |g_k|<\infty \ \ \forall g\in \ell _1 \ \ \text { and } \ \ \int _0^\infty |g(x)| dx < \infty \ \ \forall g\in \mathscr {L}_1. \end{aligned}$$
(7.55)

Therefore, for regularized identification of stable systems the impulse response should be searched within a RKHS that is a subspace of \(\ell _1\) in discrete time and a subspace of \(\mathscr {L}_1\) in continuous time. This naturally leads to the following definition of stable kernels.

Definition 7.1

(Stable kernel, based on [32, 73]) Let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel and \(\mathscr {H}:\mathscr {X}\rightarrow {\mathbb R}\) be the RKHS induced by K. Then, K is said to be stable if

  • \(\mathscr {H} \subset \ell _1\) for the discrete-time case where \(\mathscr {X}=\mathbb {N}\);

  • \(\mathscr {H} \subset \mathscr {L}_1\) for the continuous-time case where \(\mathscr {X}={\mathbb R}^+\).

If a kernel K is not stable, it is also said to be unstable. Accordingly, the RKHS \(\mathscr {H}\) is said to be stable or unstable if K is stable or unstable.

Given a kernel, the question is now how to assess its stability. A direct use of the above definition is often challenging since it can be difficult to understand which functions belong to the associated RKHS. Stability conditions formulated directly on K would instead be desirable. A first observation is that, since \(\mathscr {H}\) contains all the kernel sections according to Theorem 6.2, all of them must be stable. In discrete time, this means \(K(i,\cdot ) \in \ell _1\) for all i. However, this condition is necessary but not sufficient for stability, a fact which is not so surprising since we have seen in Sect. 6.2 that \(\mathscr {H}\) contains also all the Cauchy limits of linear combinations of kernel sections. For instance, in Example 6.4, we have seen that the identity kernel \(K(i,j)=\delta _{ij}\), connected with ridge regression but here defined over all \({\mathbb N}\times {\mathbb N}\), induces \(\ell _2\). Such a space is not contained in \(\ell _1\). So, the identity kernel is not stable even though each of its kernel sections, containing only one nonzero element, is stable.

The following fundamental result can be found in a more general form in [16] and gives the desired characterization of kernel stability. Perhaps not surprisingly, we will see that the key test spaces are \(\ell _\infty \), which contains the bounded sequences in discrete time, and \(\mathscr {L}_\infty \), which contains the essentially bounded functions in continuous time. The proof is reported in Sect. 7.7.3.

Theorem 7.5

(Necessary and sufficient condition for kernel stability, based on [16, 32, 73]) Let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel with \(\mathscr {X}=\mathbb {N}\) or \(\mathscr {X}={\mathbb R}^+\). Then,

  • one has

    $$\begin{aligned} \mathscr {H}\subset \ell _1 \iff \sum _{s=1}^\infty \left| \sum _{t=1}^\infty K(s,t)l_t\right| <\infty ,\ \forall \ l\in \ell _\infty \end{aligned}$$
    (7.56)

    for the discrete-time case where \(\mathscr {X}=\mathbb {N}\);

  • one has

    $$\begin{aligned} \mathscr {H} \subset \mathscr {L}_1 \iff \int _0^\infty \left| \int _0^\infty K(s,t)l(t)dt\right| ds <\infty ,\ \forall \ l\in \mathscr {L}_\infty \end{aligned}$$
    (7.57)

    for the continuous-time case where \(\mathscr {X}={\mathbb R}^+\).

Figure 7.10 illustrates the meaning of Theorem 7.5 by resorting to a simple system theory argument. In particular, a kernel can be seen as an acausal linear time-varying system. In discrete time it induces the following input–output relationship

$$\begin{aligned} y_i = \sum _{j=1}^\infty K_i(j) u_j, \quad i=1,2, \ldots , \end{aligned}$$
(7.58)

where \(K_i(j)=K(i,j)\), while \(u_i\) and \(y_i\) denote the system input and output at instant i. Then, the RKHS induced by K is stable iff system (7.58) maps every bounded input \(\{u_i\}_{i=1}^\infty \) into a summable output \(\{y_i\}_{i=1}^\infty \). Abusing notation, we can also see K as an infinite-dimensional matrix with (i, j)-entry given by \(K_i(j)\), and u and y as infinite-dimensional column vectors. Then, using ordinary algebra notation to handle these objects, the input–output relationship becomes \(y=Ku\) and the stability condition is

$$ \mathscr {H} \subseteq \ell _1 \ \iff \ Ku \in \ell _1 \ \ \forall u \in \ell _{\infty }. $$
Fig. 7.10

System theoretic interpretation of RKHS stability. The kernel K is associated to an acausal linear system. In discrete time, the input–output relationship is given by \(y_i = \sum _{j=1}^\infty K_i(j) u_j\). Then, K is stable iff every bounded input u is mapped into a summable output y

In Theorem 7.5, it is immediate to see that adding the constraint \(-1 \le l_t \le 1 \ \forall t\) on the test functions has no influence on the stability test. Under this constraint, one has

$$ \left| \sum _{t=1}^\infty K(s,t)l_t\right| \le \sum _{t=1}^\infty | K(s,t) | \ \ \ \text {and} \ \ \ \left| \int _0^\infty K(s,t)l(t)dt\right| \le \int _0^\infty | K(s,t)|dt. $$

The following result is then an immediate corollary of Theorem 7.5 obtained exploiting the above inequalities. It states that absolute summability is a sufficient condition for a kernel to be stable.

Corollary 7.1

(based on [16, 32, 73]) Let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel with \(\mathscr {X}=\mathbb {N}\) or \(\mathscr {X}={\mathbb R}^+\). Then,

  • one has

    $$\begin{aligned} \mathscr {H} \subset \ell _1 \ \Longleftarrow \ \sum _{s=1}^\infty \sum _{t=1}^\infty | K(s,t) | <\infty \end{aligned}$$
    (7.59)

    for the discrete-time case where \(\mathscr {X}=\mathbb {N}\);

  • one has

    $$\begin{aligned} \mathscr {H} \subset \mathscr {L}_1 \ \Longleftarrow \ \int _0^\infty \int _0^\infty | K(s,t) | dtds <\infty \end{aligned}$$
    (7.60)

    for the continuous-time case where \(\mathscr {X}={\mathbb R}^+\).

Finally, consider the class of nonnegative-valued kernels \(K^{ \text{+ }}\), i.e., satisfying \(K(s,t) \ge 0 \ \forall s,t\). If such a kernel is stable, using as test function \(l(t)=1 \ \forall t\), one must have

$$ \left| \sum _{t=1}^\infty K^{ \text{+ }}(s,t)l_t\right| = \sum _{t=1}^\infty K^{ \text{+ }}(s,t) < \infty $$

in discrete time, and

$$ \left| \int _0^\infty K^{ \text{+ }}(s,t)l(t)dt\right| = \int _0^\infty K^{ \text{+ }}(s,t) dt < \infty $$

in continuous time. So, for nonnegative-valued kernels, stability implies (absolute) summability of the kernel. But, since we have seen in Corollary 7.1 that absolute summability implies stability, the following result holds.

Corollary 7.2

(based on [16, 32, 73]) Let \(K^{\text{+ }}:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite and nonnegative-valued kernel with \(\mathscr {X}=\mathbb {N}\) or \(\mathscr {X}={\mathbb R}^+\). Then,

  • one has

    $$\begin{aligned} \mathscr {H} \subset \ell _1 \iff \sum _{s=1}^\infty \sum _{t=1}^\infty K^{ \text{+ }}(s,t) <\infty \end{aligned}$$
    (7.61)

    for the discrete-time case where \(\mathscr {X}=\mathbb {N}\);

  • one has

    $$\begin{aligned} \mathscr {H} \subset \mathscr {L}_1 \iff \int _0^\infty \int _0^\infty K^{ \text{+ }}(s,t) dtds <\infty \end{aligned}$$
    (7.62)

    for the continuous-time case where \(\mathscr {X}={\mathbb R}^+\).

As an example, we can now show that the Gaussian kernel (6.43) defined e.g., over \({\mathbb R}^+ \times {\mathbb R}^+\) is not stable. In fact, it is nonnegative valued and one has

$$ \int _0^\infty \int _0^\infty \exp \left( -(s-t)^2 /\rho \right) ds dt = \infty \ \ \forall \rho . $$

The same holds for the spline kernels (6.45) extended to \(\mathbb {R}^+ \times \mathbb {R}^+\) and also for the translation invariant kernels introduced in Example 6.12, as proved, e.g., in [32] using the Schoenberg representation theorem. Hence, none of these models is suited for stable impulse response estimation.
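
The summability tests of Corollaries 7.1 and 7.2 can also be explored numerically on truncated grids. The following sketch compares partial double sums for the Gaussian kernel and for the TC/stable spline kernel \(\alpha ^{\max (s,t)}\); the truncation sizes and parameter values are arbitrary and only meant to suggest the diverging versus converging behaviour.

```python
import numpy as np

def partial_kernel_sum(kernel, T):
    """Partial double sum sum_{s,t=1}^{T} K(s, t) over a truncated grid."""
    s = np.arange(1, T + 1)
    S, Tt = np.meshgrid(s, s, indexing="ij")
    return kernel(S, Tt).sum()

# Both kernels are nonnegative valued, so Corollary 7.2 applies.
gauss = lambda s, t: np.exp(-((s - t) ** 2) / 10.0)   # Gaussian kernel (6.43), width 10
tc = lambda s, t: 0.9 ** np.maximum(s, t)             # TC / stable spline kernel, alpha = 0.9

for T in (100, 200, 400):
    print(T, partial_kernel_sum(gauss, T), partial_kernel_sum(tc, T))
# The Gaussian partial sums keep growing (roughly linearly in T), consistently with
# instability, while the TC sums settle to a finite value, consistently with (7.61).
```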

Remark 7.5

Any unstable kernel can be made stable simply by truncation. More specifically, let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be an unstable kernel with \(\mathscr {X}=\mathbb {N}\) or \(\mathscr {X}={\mathbb R}^+\). Then, by setting \(K(s,t)=0\) for \(s,t>T\), for any given \(T\in \mathscr {X}\), a stable kernel is obtained. Care should, however, be taken when a FIR model is obtained through this operation. Consider, e.g., the use of the cubic spline or Gaussian kernel in the estimation problem depicted in Fig. 7.6, setting T equal to 20 or 50. Even after truncation, such models would not give good performance: the undue oscillations affecting the estimates in the top and middle panels of Fig. 7.6 would still be present. The reason is that these two kernels do not encode the information that the variability of the impulse response decreases as time progresses, as already discussed using the Bayesian interpretation of regularization.

7.3.2 Inclusions of Reproducing Kernel Hilbert Spaces in More General Lebesgue Spaces \(\star \)

We now discuss the conditions for a RKHS to be contained in the spaces \(\mathscr {L}_p^{\mu }\) equipped with a generic measure \(\mu \). The following analysis will then include both the space \(\mathscr {L}_1\) (considered before with the Lebesgue measure) and \(\ell _1\) as special cases obtained with \(p=1\). First, we need the following definition.

Definition 7.2

(based on [16]) Let \(1 \le p \le \infty \) and \(q=\frac{p}{p-1}\) with the convention \(\frac{p}{p-1}=\infty \) if \(p=1\) and \(\frac{p}{p-1}=1\) if \(p=\infty \). Moreover, let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel. Then, the kernel K is said to be q-bounded if

  1.

    the kernel section \(K_s \in \mathscr {L}_p^{\mu }\) for almost all \(s \in \mathscr {X}\), i.e., for every \(s \in \mathscr {X}\) except on a set of null measure w.r.t. \(\mu \);

  2.

    the function \(s \mapsto \int _{\mathscr {X}} K(s,t)l(t)d\mu (t)\) belongs to \(\mathscr {L}_p^{\mu }\) for all \(l \in \mathscr {L}_{q}^{\mu }\).

The following theorem then gives the necessary and sufficient condition for the q-boundedness of a kernel and is a special case of Proposition 4.2 in [16].

Theorem 7.6

(based on [16]) Let \(K:\mathscr {X}\times \mathscr {X}\rightarrow {\mathbb R}\) be a positive semidefinite kernel with \(\mathscr {H}\) the induced RKHS. Then, \(\mathscr {H}\) is a subspace of \(\mathscr {L}_p^{\mu }\) if and only if K is q-bounded, i.e.,

$$ \mathscr {H} \subset \mathscr {L}_p^{\mu } \iff K \; \text{ is } \; q\text{-bounded. } $$

Theorem 7.6 thus permits one to see whether a RKHS is contained in \(\mathscr {L}_p^{\mu }\) by checking properties of the kernel. Interestingly, setting \(p=1\), which implies \(q=\infty \), and taking \(\mu \), e.g., as the Lebesgue measure, one can see that the concepts of stable and \(\infty \)-bounded kernel are equivalent. Theorem 7.5 is then a special case of Theorem 7.6.

7.4 Further Insights into Stable Reproducing Kernel Hilbert Spaces \(\star \)

In this section, we provide some additional insights into the structure of stable kernels and the associated RKHSs. The analysis is focused on the discrete-time case, where the kernel K can be seen as an infinite-dimensional matrix with (i, j)-entries denoted by \(K_{ij}\). Thus, the function domain is the set of natural numbers \(\mathbb {N}\) and the RKHS contains discrete-time impulse responses of causal systems.

As discussed after (7.58) in connection with Fig. 7.10, the kernel K can also be associated with an acausal linear time-varying system, often called a kernel operator in the literature. It maps the infinite-dimensional input (sequence) u into the infinite-dimensional output Ku whose ith component is \(\sum _{j=1}^\infty K_{ij} u_j\). Two important kernel operators will be considered. The first one maps \(\ell _{\infty }\) into \(\ell _1\) and is key for kernel stability, as pointed out in Theorem 7.5. The second one maps \(\ell _2\) into \(\ell _2\) itself and will be important to discuss spectral decompositions of stable kernels.

7.4.1 Inclusions Between Notable Kernel Classes

To state some relationships between stable kernels and other fundamental classes, we start by introducing some sets of RKHSs. Define

  • the set \(\mathscr {S}_{s}\) that contains all the stable RKHSs;

  • the set \(\mathscr {S}_{1}\) with all the RKHSs induced by absolutely summable kernels, i.e., satisfying

    $$\begin{aligned} \sum _{ij} \ |K_{ij}| < +\infty ; \end{aligned}$$
  • the set \(\mathscr {S}_{ft}\) of RKHSs induced by finite-trace kernels, i.e., satisfying

    $$\begin{aligned} \sum _{i} \ K_{ii} < +\infty ; \end{aligned}$$
  • the set \(\mathscr {S}_{2}\) associated to squared summable kernels, i.e., satisfying

    $$\begin{aligned} \sum _{ij} \ K_{ij}^2 < +\infty . \end{aligned}$$

One has then the following result from [8] (see Sect. 7.7.4 for some details on its proof).

Theorem 7.7

(based on [8]) It holds that

$$\begin{aligned} \mathscr {S}_{1} \subset \mathscr {S}_{s} \subset \mathscr {S}_{ft} \subset \mathscr {S}_{2}. \end{aligned}$$
(7.63)

Figure 7.11 gives a graphical description of Theorem 7.7 in terms of inclusions of kernels classes. Its meaning is further discussed below.

Fig. 7.11

Inclusion properties of some important kernel classes

In Corollary 7.1, we have seen that absolute summability is a sufficient condition for kernel stability. The result \(\mathscr {S}_{1} \subset \mathscr {S}_{s}\) also shows that such an inclusion is strict. Hence, one cannot conclude that a kernel is unstable from the failure of absolute summability alone.

The fact that \(\mathscr {S}_{s} \subset \mathscr {S}_{ft}\) means that the set of finite-trace kernels contains the stable class. This inclusion is strict, hence the trace analysis can be used only to show that a given RKHS is not contained in \(\ell _1\). There are, however, interesting consequences of this fact. Consider all the RKHSs induced by translation invariant kernels

$$ K_{ij}=h(i-j), $$

where h satisfies the positive semidefiniteness constraint. The trace of these kernels is \(\sum _i \ K_{ii}=\sum _i \ h(0)\), which always diverges unless h is the null function. So, all the non-null translation invariant kernels are unstable (as already mentioned after Corollary 7.2). Other instability results also become immediately available. For instance, all the kernels with diagonal elements satisfying \(K_{ii} \propto i^{-\delta }\) are unstable if \(\delta \le 1\).

Finally, the strict inclusion \(\mathscr {S}_{ft} \subset \mathscr {S}_{2}\) shows that the finite-trace test is more powerful than a check of kernel squared summability.

7.4.2 Spectral Decomposition of Stable Kernels

As discussed in Sect. 6.6.3 and in Remark 6.3, kernels can define spaces rich in functions by (implicitly) mapping the space of the regressors into high-dimensional feature spaces where linear estimators can be used. This allows one to implement nonlinear algorithms even without knowing the feature map explicitly, i.e., without exact knowledge of which functions are encoded in the kernel. In particular, in Sect. 6.3, we have seen that if the kernel admits the spectral representation

$$\begin{aligned} K(x,y) = \sum _{i=1}^{\infty } \ \zeta _i \rho _i(x) \rho _i(y), \end{aligned}$$
(7.64)

then the \(\rho _i(x)\) are the basis functions that span the RKHS induced by K. For instance, the basis functions \(\rho _1(x)=1,\rho _2(x)=x,\rho _3(x)=x^2,\ldots \) describe polynomial models which are, e.g., included up to a certain degree in the polynomial kernel discussed in Sect. 6.6.4. Now, we will see that stable kernels always admit an expansion of the type (7.64) with the \(\rho _i\) forming a basis of \(\ell _2\). The number of \(\zeta _i\) different from zero then corresponds to the dimension of the induced RKHS.

Formally, it is now necessary to consider the operator induced by a stable kernel K as a map from \(\ell _2\) into \(\ell _2\) itself. Again, it is useful to see K as an infinite-dimensional matrix so that we can think of Kv as the result of the kernel operator applied to \(v \in \ell _2\). An operator is said to be compact if it maps any bounded sequence \(\{v_i\}\) into a sequence \(\{K v_i\}\) from which a convergent subsequence can be extracted [85, 95]. From Theorem 7.7, we know that any stable kernel K is finite trace and, hence, squared summable. This fact ensures the compactness of the kernel operator, as discussed in [8] and stated below.

Theorem 7.8

(based on [8]) Any operator induced by a stable kernel is self-adjoint, positive semidefinite and compact as a map from \(\ell _2\) into \(\ell _2\) itself.

This result allows us to exploit the spectral theorem [35] to obtain an expansion of K. Recall that spectral decompositions were discussed in Sect. 6.3, where Mercer's theorem was also reported. Mercer's theorem derivations exploit the spectral theorem and, as, e.g., in Theorem 6.9, they typically assume that the kernel domain is compact, see also [86] for discussions and extensions. Indeed, the first formulations consider continuous kernels on compact domains (proving also uniform convergence of the expansion). However, the spectral theorem does not require the domain to be compact and, when applied to discrete-time kernels on \(\mathbb {N} \times \mathbb {N}\), it guarantees pointwise convergence. It thus becomes the natural generalization of the decomposition of a symmetric matrix in terms of eigenvalues and eigenvectors, initially discussed in the finite-dimensional setting in Sect. 5.6 to link regularization and basis expansion. This is summarized in the following proposition, which holds by virtue of Theorem 7.8.

Proposition 7.2

(Representation of stable kernels, based on [8]) Assume that the kernel K is stable. Then, there always exists an orthonormal basis of \(\ell _2\) composed of eigenvectors \(\{\rho _i\}\) of K with corresponding eigenvalues \(\{\zeta _i\}\), i.e.,

$$ K \rho _i = \zeta _i \rho _i, \ \ i=1,2,\ldots . $$

In addition, the kernel admits the following expansion:

$$\begin{aligned} K_{xy} = \sum _{i=1}^{+\infty } \zeta _i \rho _i(x) \rho _i(y), \end{aligned}$$
(7.65)

with \(x,y \in \mathbb {N}\).

While the next subsection will use the above result to discuss the representation of stable RKHSs, some numerical considerations regarding (7.65) are now in order. From an algorithmic viewpoint, many efficient machine learning procedures use truncated Mercer expansions to approximate the kernel, see [42, 52, 75, 93, 96] for discussions on their optimality in a stochastic framework. Applications to system identification can be found in [15], where it is shown that a relatively small number of eigenfunctions (w.r.t. the data set size) can well approximate regularized impulse response estimates. These works trace back to the so-called Nyström method, where an integral equation is replaced by finite-dimensional approximations [5, 6]. However, obtaining the Mercer expansion (7.65) in closed form is often hard. Fortunately, the \(\ell _2\) basis and related eigenvalues of a stable RKHS can be numerically recovered (with arbitrary precision w.r.t. the \(\ell _2\) norm) through a sequence of SVDs applied to truncated kernels [8]. Formally, let \(K^{(d)}\) denote the \(d \times d\) positive semidefinite matrix obtained by retaining only the first d rows and columns of K. Let also \(\rho _i^{(d)}\) and \(\zeta _i^{(d)}\) be, respectively, the eigenvectors of \(K^{(d)}\), seen as elements of \(\ell _2\) with a tail of zeros, and the eigenvalues returned by the SVD of \(K^{(d)}\). Assume, for simplicity, single multiplicity of each \(\zeta _i\). Then, for any i, as d grows to \(\infty \) one has

$$\begin{aligned}&\zeta _i^{(d)} \rightarrow \zeta _i \end{aligned}$$
(7.66a)
$$\begin{aligned}&\Vert \rho _i^{(d)} - \rho _i \Vert _2 \rightarrow 0, \end{aligned}$$
(7.66b)

where \(\Vert \cdot \Vert _2\) is the \(\ell _2\) norm.

In Fig. 7.12, we show some eigenvectors (left panel) and the first 100 eigenvalues (right) of the stable spline kernel \(K_{xy} =\alpha ^{\max {(x,y)}}\) with \(\alpha =0.99\). Results are obtained by applying SVDs to truncated kernels of different sizes and monitoring the convergence of eigenvectors and eigenvalues. The final outcome was obtained with \(d=2000\).

Fig. 7.12

Expansion of the first-order discrete-time stable spline kernel \(K_{xy}=\alpha ^{\max (x,y)}\) with \(\alpha =0.99\): eigenfunctions \(\rho _i(x)\) orthogonal in \(\ell _2\) for \(i=1,2,8\) (left panel, samples are linearly interpolated) and eigenvalues \(\zeta _i\) (right)
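
The computation behind Fig. 7.12 can be reproduced, in simplified form, with the following sketch; the truncation size d and the number of returned eigenpairs are illustrative choices, and a symmetric eigensolver is used since \(K^{(d)}\) is symmetric positive semidefinite, so that its SVD coincides with its eigendecomposition.

```python
import numpy as np

def truncated_stable_spline_eig(alpha=0.99, d=500, n_eig=8):
    """Leading eigenvalues/eigenvectors of the d x d truncated kernel K^(d), cf. (7.66)."""
    x = np.arange(1, d + 1)
    K = alpha ** np.maximum.outer(x, x)          # truncated stable spline kernel
    zeta, rho = np.linalg.eigh(K)                # eigendecomposition (= SVD, K is PSD)
    idx = np.argsort(zeta)[::-1]                 # sort eigenvalues in decreasing order
    return zeta[idx][:n_eig], rho[:, idx[:n_eig]]

# Convergence can be monitored by increasing d and comparing successive estimates,
# as done (with d up to 2000) to produce Fig. 7.12.
zeta, rho = truncated_stable_spline_eig()
print(zeta)   # approximations of the leading eigenvalues zeta_i
```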

7.4.3 Mercer Representations of Stable Reproducing Kernel Hilbert Spaces and of Regularized Estimators

Now we exploit the representations of the RKHSs induced by a diagonalized kernel as discussed in Theorems 6.10 and 6.13 (where compactness of the input space is not even required). In view of Proposition 7.2, assuming for simplicity that all the \(\zeta _i\) are different from zero, one obtains that the RKHS associated with a stable K always admits the representation

$$\begin{aligned} \mathscr {H} = \Big \{ g = \sum _{i=1}^{\infty } a_i \rho _i \ \ \text {s.t.} \ \ \sum _{i=1}^{\infty } \ \frac{a^2_i}{\zeta _i} < +\infty \Big \} , \end{aligned}$$
(7.67)

where the \(\rho _i\) are the eigenvectors of K forming an orthonormal basis of \(\ell _2\). If \(g = \sum _{i=1}^{\infty } a_i \rho _i \), one also has

$$\begin{aligned} \Vert g\Vert _{\mathscr {H}}^2 = \sum _{i=1}^{\infty } \ \frac{a^2_i}{\zeta _i}. \end{aligned}$$
(7.68)

The fact that any stable RKHS is generated by an \(\ell _2\) basis also gives a clear connection with the important impulse response estimators that adopt orthonormal functions, e.g., the Laguerre functions illustrated in Fig. 7.3 [46, 91, 92]. A classical approach used in the literature is to introduce the model \(g=\sum _i a_i \rho _i\) and then to use linear least squares to determine the expansion coefficients \(a_i\). In particular, let \(L_t[g]\) be the system output, i.e., the convolution between the known input and g evaluated at the time instant t. Then, the impulse response estimate is

$$\begin{aligned} \hat{g}&= \sum _{i=1}^d \ \hat{a}_i \rho _i \end{aligned}$$
(7.69a)
$$\begin{aligned} \{\hat{a}_i\}_{i=1}^d&= \mathop {\mathrm {arg\,min}}\limits _{\{a_i\}_{i=1}^d} \ \sum _{t=1}^N \ \left( y(t) - L_t\left[ \sum _{i=1}^d \ a_i \rho _i\right] \right) ^2, \end{aligned}$$
(7.69b)

where d determines model complexity and is typically selected using AIC or cross-validation (CV) as discussed in Chap. 2.

In view of (7.67) and (7.68), the regularized estimator (7.10), equipped with a stable RKHS, is equivalent to

$$\begin{aligned} \hat{f}&= \sum _{i=1}^{\infty } \ \hat{a}_i \rho _i \end{aligned}$$
(7.70a)
$$\begin{aligned} \{\hat{a}_i\}_{i=1}^\infty&= \mathop {\mathrm {arg\,min}}\limits _{\{a_i\}_{i=1}^\infty } \ \sum _{t=1}^N \ \left( y(t) - L_t\left[ \sum _{i=1}^\infty \ a_i \rho _i\right] \right) ^2 + \gamma \sum _{i=1}^{\infty } \ \frac{a^2_i}{\zeta _i}. \end{aligned}$$
(7.70b)

This result is connected with the kernel trick discussed in Remark 6.3 and shows that regularized least squares in a stable (infinite-dimensional) RKHS always models impulse responses using an \(\ell _2\) orthonormal basis, as in the classical works on linear system identification. The key difference between (7.69) and (7.70) is that complexity is no longer controlled by the model order, because d is set to \(\infty \). Complexity instead depends on the regularization parameter \(\gamma \) (and possibly also on other kernel parameters), which balances the data fit and the penalty term. The latter induces stability by using the kernel eigenvalues \(\zeta _i\) to constrain the rate at which the expansion coefficients decay to zero.
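
To make the equivalence concrete, the following sketch implements a truncated version of (7.70): the leading eigenpairs of a stable kernel (computed, e.g., as in the previous sketch) define the basis, and the expansion coefficients are obtained by solving the resulting weighted ridge problem. Variable names and the zero-initial-condition handling of the input are illustrative assumptions.

```python
import numpy as np

def regularized_expansion_estimate(y, u, rho, zeta, gamma):
    """Truncated version of (7.70): impulse response estimate g = sum_i a_i rho_i.

    y : (N,) outputs, u : (N,) inputs (u(t) taken as 0 for t < 1),
    rho : (d, n) matrix whose columns are the first n eigenvectors rho_i (length d),
    zeta : (n,) corresponding eigenvalues, gamma : regularization parameter.
    """
    N, d = len(y), rho.shape[0]
    # Regressor matrix U with U[t-1, k-1] = u(t-k), so that the system output is U g
    U = np.zeros((N, d))
    for t0 in range(N):                     # t0 = t - 1 (0-based time index)
        for k in range(1, d + 1):
            if t0 - k >= 0:
                U[t0, k - 1] = u[t0 - k]
    Phi = U @ rho                           # Phi[t-1, i-1] = L_t[rho_i]
    # min_a ||y - Phi a||^2 + gamma * sum_i a_i^2 / zeta_i
    A = Phi.T @ Phi + gamma * np.diag(1.0 / zeta)
    a_hat = np.linalg.solve(A, Phi.T @ y)
    return rho @ a_hat                      # impulse response estimate (first d coefficients)
```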

7.4.4 Necessary and Sufficient Stability Condition Using Kernel Eigenvectors and Eigenvalues

We have seen that a fruitful way to design a regularized estimator for linear system identification is to introduce a kernel by specifying its entries \(K_{ij}\). This modelling technique translates the expected features of an impulse response into kernel properties, e.g., smooth exponential decay as described by the stable spline, TC and DC kernels. This route exploits the kernel trick, i.e., the implicit encoding of the basis functions. In some circumstances, it could be useful to build a kernel starting from the design of eigenfunctions \(\rho _i\) and eigenvalues \(\zeta _i\). A notable example is given by the (already cited) Laguerre or Kautz functions that belong to the more general class of Takenaka–Malmquist orthogonal basis functions [46]. They can be useful to describe oscillatory behaviour or the presence of fast/slow poles.

Since any stable kernel can be associated with an \(\ell _2\) basis, the following fundamental problem then arises. Given an orthonormal basis \(\{\rho _i\}\) of \(\ell _2\), for example of the Takenaka–Malmquist type, what are the conditions on the eigenvalues \(\zeta _i\) ensuring stability of \(K_{xy} = \sum _{i=1}^{+\infty } \zeta _i \rho _i(x) \rho _i(y)\)? The answer is given by the following result, derived from [8], which reports the necessary and sufficient condition (the proof is given in Sect. 7.7.5).

Theorem 7.9

(RKHS stability using Mercer expansions, based on [8]) Let \(\mathscr {H}\) be the RKHS induced by K with

$$ K_{xy} = \sum _{i=1}^{+\infty } \zeta _i \rho _i(x) \rho _i(y), $$

where the \(\{\rho _i\}\) form an orthonormal basis of \(\ell _2\). Let also

$$\begin{aligned} \mathscr {U}_{\infty } =\Big \{ \ u \in \ell _{\infty }: \ |u(i)|=1, \ \forall i \ge 1 \ \Big \}. \end{aligned}$$

Then, one has

$$\begin{aligned} \mathscr {H} \subset \ell _1 \ \iff \ \sup _{u \in {\mathscr {U}}_{\infty }} \sum _i \zeta _i \langle \rho _i, u \rangle _2^2< +\infty , \end{aligned}$$
(7.71)

where \(\langle \cdot , \cdot \rangle _2 \) is the inner product in \(\ell _2\).

Thus, clearly, there is no stability if a function \(\rho _i\) associated with \(\zeta _i > 0\) does not belong to \(\ell _1\). In fact, one can choose u containing the signs of the components of \(\rho _i\), which leads to \(\langle \rho _i,u \rangle _2= +\infty \). Nothing is instead required of the eigenvectors associated with \(\zeta _i=0\). Theorem 7.9 also permits deriving the following sufficient stability condition.

Corollary 7.3

(based on [8]) Let \(\mathscr {H}\) be the RKHS induced by the kernel \(K_{xy} = \sum _{i=1}^{+\infty } \zeta _i \rho _i(x) \rho _i(y)\) with \(\{\rho _i\}\) an orthonormal basis of \(\ell _2\). Then, it holds that

$$\begin{aligned} \mathscr {H} \subset \ell _1 \ \Longleftarrow \ \sum _i \zeta _i \Vert \rho _i\Vert _1^2 < +\infty . \end{aligned}$$
(7.72)

Furthermore, such a condition also implies kernel absolute summability and, hence, it is not necessary for RKHS stability.

It is easy to exploit the stability condition (7.72) to design models of stable impulse responses starting from an \(\ell _2\) basis. Let us reconsider, e.g., Laguerre or Kautz basis functions \(\{\rho _i\}\) to build the impulse response model

$$ g = \sum _{i=1}^{\infty } \ a_i \rho _i. $$

To exploit (7.70), one has to define stability constraints on the expansion coefficients \(a_i\). This corresponds to defining \(\zeta _i\) in such a way that the regularizer

$$ \sum _{i=1}^{\infty } \ \frac{a^2_i}{\zeta _i} $$

enforces absolute summability of g. Laguerre and Kautz models belong to the Takenaka–Malmquist class of functions \(\rho _i\) that all satisfy

$$ \Vert \rho _i \Vert _1 \le M i, $$

with M a constant independent of i [46]. Then, Corollary 7.3 ensures that the choice

$$ \zeta _i \propto i^{-\nu }, \quad \nu >2 $$

enforces the stability constraint for the entire Takenaka–Malmquist class.

Let us now consider the class of orthonormal basis functions \(\rho _i\) all contained in a ball of \(\ell _1\). Then, the necessary and sufficient stability condition assumes an especially simple form, as the following result shows.

Corollary 7.4

(based on [8]) Let \(\mathscr {H}\) be the RKHS induced by the kernel \(K_{xy} = \sum _{i=1}^{+\infty } \zeta _i \rho _i(x) \rho _i(y)\) with \(\{\rho _i\}\) an orthonormal basis of \(\ell _2\) and \(\Vert \rho _i \Vert _1 \le M < +\infty \) if \(\zeta _i>0\), with M not dependent on i. Then, one has

$$\begin{aligned} \mathscr {H} \subset \ell _1 \iff \sum _i \zeta _i < +\infty . \end{aligned}$$
(7.73)

Finally, Fig. 7.13 graphically illustrates all the stability results obtained here starting from Mercer expansions.

Fig. 7.13

Inclusion properties of some important kernel classes in terms of Mercer expansions. This representation is the dual of that reported in Fig. 7.11 and defines kernel sets through properties of the kernel eigenvectors \(\rho _i\), forming an orthonormal basis in \(\ell _2\), and of the corresponding kernel eigenvalues \(\zeta _i\). The condition \(\sum _i \zeta _i \Vert \rho _i\Vert _1^2 < \infty \) is the most restrictive since it implies kernel absolute summability. The necessary and sufficient condition for stability is \(\sup _{u \in {\mathscr {U}}_{\infty }} \ \sum _i \zeta _i \langle \rho _i, u \rangle _2^2< \infty \). Finally, \(\sum _i \zeta _i < \infty \) and \(\sum _i \zeta _i^2 < \infty \) are exactly the conditions for a kernel to be finite trace and squared summable, respectively

7.5 Minimax Properties of the Stable Spline Estimator \(\star \)

In this section, we will derive non-asymptotic upper bounds on the MSE of the regularized IIR estimator (7.10) valid for all the exponentially stable discrete-time systems whose poles lie inside the complex circle of radius \(\rho \). The obtained bounds can be evaluated before any data are observed. This kind of result gives insight into the so-called sample complexity, i.e., the number of measurements needed to achieve a certain accuracy in impulse response reconstruction. This is an attractive feature even if, since the bounds need to hold for all the models falling in a particular class, they are often quite loose for the particular dynamic system at hand. However, they have considerable theoretical value since they also permit assessing the quality of (7.10) through nonparametric minimax concepts. Such a setting considers the worst case inside an infinite-dimensional class and has been widely studied in nonparametric regression and density estimation [88]. In particular, the obtained bounds will lead to conditions which ensure optimality in order, i.e., the best convergence rate of (7.10) in the minimax sense. We will derive them by considering system inputs given by white noise and using the TC/stable spline kernel (7.15) as regularizer. The important dependence between the convergence rate of (7.10) to the true impulse response, the kernel parameter \(\alpha \) and the stability radius \(\rho \) will be elucidated.

7.5.1 Data Generator and Minimax Optimality

As in the previous part of the chapter, we use \(g^0\) to denote the impulse response of a discrete-time linear system. The measurements are generated as follows:

$$\begin{aligned} y(t) = \sum _{k=1}^{\infty } g^0(k) u_{t-k} + e(t), \end{aligned}$$
(7.74)

where \(g^0(k)\) are the impulse response coefficients. We will always assume that \(g^0\) is a deterministic and exponentially stable impulse response, while the input u and the noise e are stochastic, as specified below.

Assumption 7.10

The impulse response \(g^0\) belongs to the following set:

$$\begin{aligned} \mathscr {S}(\rho , L) = \Big \{ g: |g(k) |\le L\rho ^{k} \Big \}, \ \ 0\le \rho <1. \end{aligned}$$
(7.75)

The system input and the noise are discrete-time stochastic processes. One has that \(\lbrace u(t) \rbrace _{t \in \mathbb {Z}}\) are independent and identically distributed (i.i.d.) zero-mean random variables with

$$\begin{aligned} \;\mathscr {E}[u(t)^2] = \sigma ^2_u, \quad |u(t) |\le C_u < \infty . \end{aligned}$$
(7.76)

Finally, \(\lbrace e(t) \rbrace _{t \in \mathbb {Z}}\) are independent random variables, independent of \(\lbrace u(t) \rbrace _{t \in \mathbb {Z}}\), with

$$\begin{aligned} \mathscr {E}[e(t)] = 0,\; \quad \mathscr {E}[e(t)^2] \le \sigma ^2. \end{aligned}$$
(7.77)

The available measurements are

$$\begin{aligned} \mathcal{D}_T = \{u(1),\ldots ,u(N),y(1),\ldots ,y(N)\}, \end{aligned}$$
(7.78)

where N is the data set size.

The quality of an impulse response estimator \(\hat{g}\), function of \(\mathcal{D}_T\), will be measured by the estimation error \(\mathscr {E} \Vert g^0-\hat{g}\Vert _2\), where \(\Vert \cdot \Vert _2\) is the norm of the space \(\ell _2\) of squared-summable sequences. Note that the expectation is taken w.r.t. the randomness of the system input and the measurement noise. The worst-case error over the family \(\mathscr {S}\) of exponentially stable systems defined in (7.75) will also be considered. In particular, the uniform \(\ell _2\)-risk of \(\hat{g}\) is

$$ \sup _{g \in \mathscr {S}} \ \mathscr {E} \Vert g-\hat{g}\Vert _2. $$

An estimator \(g^{*}\) is then said to be minimax if the following equality holds for any data set size N:

$$ \sup _{g \in \mathscr {S}} \ \mathscr {E} \Vert g-g^{*}\Vert _2 = \inf _{\hat{g}} \ \sup _{g \in \mathscr {S}} \ \mathscr {E} \Vert g-\hat{g}\Vert _2, $$

meaning that \(g^{*}\) minimizes the worst-case error. Building such an estimator is in general very difficult. For this reason, it is often convenient to consider just the asymptotic behaviour by introducing the concept of optimality in order. Specifically, an estimator \(\bar{g}\) is optimal in order if

$$ \sup _{g \in \mathscr {S}} \ \mathscr {E} \Vert g-\bar{g}\Vert _2 \le C_N \sup _{g \in \mathscr {S}} \ \mathscr {E} \Vert g-g^{*}\Vert _2 $$

where \(C_N\) is a function of the data set size satisfying \(\sup _N \ C_N<\infty \) and \(g^{*}\) is minimax. In our linear system identification setting, optimality in order thus ensures that, as N grows to infinity, the convergence rate of \(\bar{g}\) to the true impulse response \(g^0\) cannot be improved by any other system identification procedure in the minimax sense.

7.5.2 Stable Spline Estimator

As anticipated, our study is focused on the following regularized estimator:

$$\begin{aligned} \hat{g} = \mathop {\mathrm {arg\,min}}\limits _{g \in \mathscr {H}} \sum _{t=1}^N (y(t) - \sum ^{\infty }_{k=1}g(k)u(t-k))^2 + \gamma \Vert g\Vert ^2_{\mathscr {H}}, \end{aligned}$$
(7.79)

equipped with the stable spline kernel

$$\begin{aligned} K(i,j) =\alpha ^{\max {(i,j)}}, \quad 0< \alpha <1, \quad (i,j) \in \mathbb {N}. \end{aligned}$$
(7.80)

For future developments, it is important to control the complexity of (7.79) not only through the hyperparameters \(\gamma \) and \(\alpha \) but also through the dimension d of the following subspace:

$$ \mathscr {H}_d = \left\{ g \in \mathscr {H} \ \text{ s.t. } \ g(d+1)=g(d+2)= \dots = 0 \right\} $$

over which optimization of the objective in (7.79) is performed. In particular, we will consider the estimator

$$\begin{aligned} \hat{g}^d = \arg \min _{g \in {\mathscr {H}_d}} \ \sum _{t=1}^N \ \left( y(t) - \sum _{k=1}^{d} g(k) u(t-k) \right) ^2 + \gamma \Vert g \Vert ^2_{\mathscr {H}}, \end{aligned}$$
(7.81)

and will study how N and the choice of \(\gamma ,\alpha ,d\) influence the estimation error and, hence, the convergence rate. This will lead to complexity control rules that are a hybrid of those seen in the classical and in the regularized framework. To this end, we first rewrite (7.81) in terms of regularized FIR estimation by exploiting the structure of the stable spline norm (7.16), which shows that

$$\begin{aligned} g \in \mathscr {H}_d \implies \Vert g \Vert ^2_{\mathscr {H}} = \Bigg (\sum _{t=1}^{d-1} \ \frac{ \left( g(t+1) -g(t) \right) ^2}{(1-\alpha )\alpha ^{t}}\Bigg ) + \frac{g^2(d)}{(1-\alpha )\alpha ^{d}}. \end{aligned}$$
(7.82)

Let us define the matrix

$$\begin{aligned} \small R= \frac{1}{\alpha - \alpha ^2} \begin{bmatrix} 1 &{}-1 &{}0 &{}0 &{}\cdots &{}0\\ -1 &{}1+\frac{1}{\alpha } &{}-\frac{1}{\alpha } &{}0 &{}\cdots &{} 0\\ 0 &{}-\frac{1}{\alpha } &{}\frac{1}{\alpha }+\frac{1}{\alpha ^2} &{}-\frac{1}{\alpha ^2} &{}\cdots &{}0\\ 0 &{}0 &{}\ddots &{}\ddots &{}\ddots &{}\vdots \\ 0 &{}0 &{}\cdots &{}\cdots &{}-\frac{1}{\alpha ^{d-2}} &{}\frac{1}{\alpha ^{d-2}} + \frac{1}{\alpha ^{d-1}} \end{bmatrix} \end{aligned}$$
(7.83)

and the regressors

$$\begin{aligned} \varphi _d(t) = \left( \begin{array}{c} u(t-1) \\ \vdots \\ u(t-d) \end{array}\right) . \end{aligned}$$
(7.84)

Now, one can easily see that the first d components of \(\hat{g}^d\) in (7.81) are contained in the vector

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\theta } \ \sum _{t=1}^N \ \left( y(t) - \varphi _d(t)^T \theta \right) ^2 + \gamma \theta ^T R \theta . \end{aligned}$$
(7.85)

Hence, we obtain

$$\begin{aligned} \hat{g}^d = (\hat{g}(1), \dots , \hat{g}(d), 0, 0, \dots ) \end{aligned}$$
(7.86)

where

$$\begin{aligned} \left( \begin{array}{c} \hat{g}(1) \\ \vdots \\ \hat{g}(d) \end{array}\right) = \Bigg (\frac{1}{N} \sum _{t=1}^N \varphi _d(t) \varphi _d^T(t) + \frac{\gamma }{N} R \Bigg )^{-1} \frac{1}{N} \sum _{t=1}^N \varphi _d(t)y(t). \end{aligned}$$
(7.87)

In real applications, one cannot measure the input at all time instants and our data set \(\mathcal{D}_T\) in (7.78) could contain only the inputs \(u(1),\ldots ,u(N)\). So, differently from what is postulated in the above equations, in practice the regressors are never perfectly known. One solution is simply to replace the unknown input values \(\{u(t)\}_{t<1}\) entering (7.84) with zeros. Even under this model misspecification, all the results introduced in the next sections still hold.
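
A minimal implementation sketch of (7.83)–(7.87) is given below: the regularization matrix R is assembled directly from the norm expression (7.82), unknown input values are set to zero as just discussed, and all function and variable names are illustrative.

```python
import numpy as np

def stable_spline_R(alpha, d):
    """Regularization matrix R of (7.83), assembled from the norm expression (7.82)."""
    R = np.zeros((d, d))
    for t in range(1, d):                    # terms (g(t+1)-g(t))^2 / ((1-alpha) alpha^t)
        w = 1.0 / ((1.0 - alpha) * alpha ** t)
        R[t - 1, t - 1] += w
        R[t, t] += w
        R[t - 1, t] -= w
        R[t, t - 1] -= w
    R[d - 1, d - 1] += 1.0 / ((1.0 - alpha) * alpha ** d)   # term g(d)^2 / ((1-alpha) alpha^d)
    return R

def stable_spline_fir(y, u, d, gamma, alpha):
    """Regularized FIR estimate (7.87): first d coefficients of g^d in (7.86)."""
    N = len(y)
    Phi = np.zeros((N, d))                   # rows are the regressors phi_d(t) of (7.84)
    for t0 in range(N):                      # t0 = t - 1 (0-based time index)
        for k in range(1, d + 1):
            if t0 - k >= 0:
                Phi[t0, k - 1] = u[t0 - k]   # u(t-k); unknown inputs (t-k < 1) set to 0
    R = stable_spline_R(alpha, d)
    A = Phi.T @ Phi / N + (gamma / N) * R
    return np.linalg.solve(A, Phi.T @ y / N)
```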

7.5.3 Bounds on the Estimation Error and Minimax Properties

The following theorem reports non-asymptotic bounds that illustrate the dependence of \(\mathscr {E} \Vert g^0-\hat{g}^d\Vert _2\) on the following three key variables:

  • the FIR order d which determines the truncation error;

  • the parameter \(\alpha \) contained in the matrix R reported in (7.83) that establishes the exponential decay of the estimated impulse response coefficients;

  • the regularization parameter \(\gamma \), which trades off the penalty defined by R and the adherence to experimental data.

In addition, it gives conditions on \(\alpha \) which ensure optimality in order if some conditions on the stability radius \(\rho \) entering (7.75) and on the FIR order d (function of the data set size N) are fulfilled. Below, the notation O(1) indicates an absolute constant, independent of N. Furthermore, given \(x \in \mathbb {R}\), we use \(\left\lfloor {x}\right\rfloor \) to indicate the largest integer not larger than x. The following result then holds.

Theorem 7.11

(based on [74]) Let the FIR order d be defined by the following function of the data set size N:

$$\begin{aligned} d^{*} = \left\lfloor {\frac{\ln (N (1-\alpha ) \sigma _u^2) - \ln (8\gamma )}{\ln (1/\alpha )}}\right\rfloor , \end{aligned}$$
(7.88)

with N large enough to guarantee \(d^{*} \ge 1\).

Then, under Assumption 7.10, the estimator (7.81) satisfies

$$\begin{aligned}&\mathscr {E} \Vert g-\hat{g}^{d^{*}}\Vert _2 \\ \nonumber&\le O(1) \ \left[ \frac{L\rho ^{d^{*}+1}}{(1-\rho )}\left( \sqrt{\frac{d^{*}}{N}}+1\right) + \frac{\sigma }{\sigma _u} \sqrt{\frac{d^{*}}{N}} + \frac{4L\gamma }{1-\alpha }\frac{h_{d^{*}}}{N}\right] , \end{aligned}$$
(7.89)

where

$$\begin{aligned} \quad \qquad h_{d^{*}}= \left\{ \begin{array}{cl} \sqrt{d^{*}} &{} \quad \text{ if } \ \ \alpha =\rho \\ \frac{\rho }{\sqrt{\alpha ^2-\rho ^2}} &{} \quad \text{ if } \ \ \alpha >\rho \\ \frac{\rho }{\sqrt{\rho ^2-\alpha ^2}} \left( \frac{\rho }{\alpha }\right) ^{d^{*}} &{} \quad \text{ if } \ \ \alpha <\rho \end{array} \right. . \end{aligned}$$
(7.90)

Furthermore, if the measurement noise is Gaussian and \(\sqrt{\alpha } \ge \rho \), the stable spline estimator (7.81) is optimal in order.
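
As a small worked example of the quantities entering the theorem, the following sketch evaluates the FIR order \(d^{*}\) of (7.88) and the factor \(h_{d^{*}}\) of (7.90) for illustrative values of N, \(\alpha \), \(\gamma \), \(\sigma _u^2\) and \(\rho \); the function names are not part of the original text.

```python
import math

def d_star(N, alpha, gamma, sigma_u2):
    """FIR order d* of (7.88); N is assumed large enough that d* >= 1."""
    return math.floor((math.log(N * (1.0 - alpha) * sigma_u2) - math.log(8.0 * gamma))
                      / math.log(1.0 / alpha))

def h_factor(d, alpha, rho):
    """The factor h_{d*} of (7.90) appearing in the bound (7.89)."""
    if alpha == rho:
        return math.sqrt(d)
    if alpha > rho:
        return rho / math.sqrt(alpha ** 2 - rho ** 2)
    return (rho / math.sqrt(rho ** 2 - alpha ** 2)) * (rho / alpha) ** d

d = d_star(N=1000, alpha=0.9, gamma=1.0, sigma_u2=1.0)   # gives d* = 23 here
print(d, h_factor(d, alpha=0.9, rho=0.8))                # alpha > rho: optimal-rate regime
```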

To illustrate the meaning of Theorem 7.11, it is first useful to recall a result obtained in [43] that relies on Fano's inequality. It shows that, if a dynamic system is fed with a white input and the measurement noise is Gaussian, the expected \(\ell _2\) error of any impulse response estimator cannot decay to zero faster than \(\sqrt{\frac{\ln N}{N}}\) in a minimax sense.

Theorem 7.12

(based on [43]) Let Assumption 7.10 hold and assume also that the measurement noise is Gaussian. Then, if \(\hat{g}\) is any impulse response estimator built with \(\mathcal{D}_T\), for N sufficiently large one has

$$\begin{aligned} \sup _{g \in \mathscr {S}(\rho , L) }\ \mathscr {E} \Vert \hat{g}-g\Vert _2 \ge O(1) \sqrt{\frac{\ln N}{N}}. \end{aligned}$$
(7.91)

\(\blacksquare \)

To illustrate the convergence rate of the stable spline estimator, first note that the FIR dimension \(d^{*}\) in (7.88) scales logarithmically with N. Apart from irrelevant constants, one in fact has

$$\begin{aligned} d^{*} \sim \frac{\ln (N)}{\ln (1/\alpha )}. \end{aligned}$$
(7.92)

We now consider the three terms on the r.h.s. of (7.89) with \(d=d^{*}\). Since

$$\begin{aligned} \sqrt{\frac{d^{*}}{N}} \sim \sqrt{\frac{\ln N}{N}} \ \ \ \text{ and } \ \ \ \rho ^{d^{*}} \sim N^{-\frac{\ln \rho }{\ln \alpha }}, \end{aligned}$$
(7.93)

the first two terms decay to zero at least as \(\sqrt{\frac{\ln N}{N}}\). Regarding the third one, one has

$$\begin{aligned} \qquad \qquad \frac{h_{d^{*}}}{N} \sim \left\{ \begin{array}{cl} \frac{\sqrt{\ln N}}{N} &{} \quad \text{ if } \ \ \alpha =\rho \\ \frac{1}{N} &{} \quad \text{ if } \ \ \alpha >\rho \\ N^{-\frac{\ln \rho }{\ln \alpha }} &{} \quad \text{ if } \ \ \alpha <\rho \end{array} \right. \end{aligned}$$
(7.94)

and this shows that the optimal convergence rate is obtained if \( \alpha \ge \rho \) but the case \(\alpha <\rho \) can be critical. In particular, combining (7.89) with (7.93) and (7.94), the following considerations arise:

  • the convergence rate of the stable spline estimator (7.81) does not depend on \(\gamma \) but only on the relationship between the kernel parameter \(\alpha \) and the stability radius \(\rho \) defining the class of dynamic systems (7.75);

  • using Theorem 7.12, one can see from (7.94) that if \(\alpha <\rho \) the achievement of the optimal rate is related to the term \(N^{-\frac{\ln \rho }{\ln \alpha }}\) which appears as third term in (7.89). The key condition is

    $$ \frac{\ln \rho }{\ln \alpha } \ge 0.5 \implies \sqrt{\alpha } \ge \rho . $$

    This indeed corresponds to what was stated in the final part of Theorem 7.11: under Gaussian noise the stable spline estimator is optimal in order if \(\sqrt{\alpha }\) is an upper bound on the stability radius \(\rho \).

Relationships (7.93) and (7.94) also clarify what happens when the kernel encodes a too fast exponential decay rate, i.e., when \(\sqrt{\alpha } <\rho \). In this case, the error goes to zero as \(N^{-\frac{\ln \rho }{\ln \alpha }}\), getting worse as \(\sqrt{\alpha }\) drifts away from \(\rho \). Such a phenomenon has a simple explanation. A too small \(\alpha \) forces the impulse response estimate to decay to zero even when the true impulse response coefficients are still significantly different from zero. This corresponds to a strong bias: a wrong amount of regularization is introduced in the estimation process, hence compromising the convergence rate. This is also graphically illustrated in Fig. 7.14, which plots the convergence rate \(\ln \rho / \ln \alpha \) as a function of \(\sqrt{\alpha }\) for five different values of \(\rho \).

The analysis thus shows how \(\alpha \) plays a fundamental role in controlling impulse response complexity and, hence, in establishing the properties of the regularized estimator. This is not surprising, also in view of the deep connection between the decay rate and the degrees of freedom of the model, illustrated in Fig. 5.6 of Sect. 5.5.1 using the class of DC kernels, which includes TC as a special case.

Fig. 7.14

Convergence rate \(\ln \rho / \ln \alpha \) of the stable spline estimator as a function of \(\sqrt{\alpha }\) for \(\sqrt{\alpha } <\rho \) with \(\rho \) in the set \(\{0.7,0.8,0.9,0.95,0.99\}\). When \(\sqrt{\alpha } <\rho \) the estimation error converges to zero as \(N^{-\frac{\ln \rho }{\ln \alpha }}\). Instead, if \(\sqrt{\alpha } \ge \rho \) the error decays as \(\sqrt{\frac{\ln N}{N}}\), making the stable spline estimator optimal in order when the measurement noise is Gaussian

7.6 Further Topics and Advanced Reading

The idea of handling linear system identification with regularization methods in the RKHS framework first appeared in [72]. As already mentioned, the representer theorems introduced in this chapter are special cases of the one involving linear and bounded functionals reported in the previous chapter, see Theorem 6.16. More general versions of representer theorems with, e.g., more general loss functions and/or regularization terms can be found, e.g., in [33]. Similarly to the spline smoothing problem studied in Sect. 6.6.7, it could be useful to enrich the regularized impulse response estimators described here with a parametric component. Of course, the corresponding regularized estimator will still have a closed-form finite-dimensional representation that depends on both the number of data N and the number of added parametric components, e.g., see [72, 90].

The stable spline kernel [72] and the diagonal correlated kernel [19] are the first two kernels introduced in the linear system identification literature. The notion of stability of a kernel (or, equivalently, of a RKHS) first appeared in [32, 73]. The stability of a kernel is equivalent to the \(\infty \)-boundedness of the kernel, which is a special case of the more general q-boundedness with \(1<q\le \infty \) in [16]. The proof in [16] for the sufficiency and necessity of the q-boundedness of a kernel is quite involved and abstract. Theorem 7.5 is also discussed in [24], see also [76] where the stability analysis exploits the output kernel. The optimal kernel that minimizes the mean squared error was studied in [19, 73]. As already discussed, unfortunately, the optimal kernel cannot be applied in practice because it depends on the true impulse response to be estimated. Nevertheless, it offers a guideline to design kernels for linear system identification and more general function estimation problems. Motivated by these findings, many stable kernels have been introduced over the years, e.g., [17, 21, 77, 80, 97]. In particular, [17] proposed linear multiple kernels to handle systems with complicated dynamics, e.g., with distinct time constants and distinct resonant frequencies, and [77] further extended this idea and proposed "integral" versions of the stable spline kernels. To design kernels that embed more general prior knowledge, e.g., overdamped/underdamped dynamics, common structure, etc., it is natural to divide the prior knowledge into different types and then develop systematic ways to design kernels accordingly, see [21, 80, 97]. In particular, the approaches proposed in [21] are based on machine learning and system theory perspectives, those in [80] rely on the maximum entropy principle, and the method proposed in [97] uses harmonic analysis.

Along with kernel design, much effort has also been spent on "kernel analysis". In particular, many kernels can be given maximum entropy interpretations, including the stable spline kernel, the diagonal correlated kernel and the more general simulation-induced kernel [14, 21, 23]. This can help to understand the prior knowledge embedded in the model. Many kernels have the Markov property, e.g., [83]. Examples are the diagonal correlated kernel and some carefully designed simulation-induced kernels [21]. Exploiting this property could help to design efficient implementations. As we have seen, the spectral analysis of kernels is often not available in closed form, even if it can be numerically recovered, but exceptions include the stable spline and the diagonal correlated kernel [20, 22, 72].

The hyperparameter tuning problem has been studied for a long time in the context of function estimation from noisy observations, e.g., [83, 90]. The marginal likelihood maximization method depends on the connection with the Bayesian estimation of Gaussian processes, which was first studied in [51] for spline regression, see also [41, 83, 90]. More discussions on its relation to Bayesian evidence and Occam's razor principle can be found, e.g., in [27, 60]. Stein's unbiased risk estimation method is also known as the \(C_p\) statistics [61]. The generalized cross-validation method was first proposed in [28] and found to be rotation invariant in [44]. The problem can also be tackled using full Bayes approaches relying on stochastic simulation techniques, e.g., Markov chain Monte Carlo [1, 39].

In the context of linear system identification, some theoretical results on the hyperparameter estimation problem have been derived. In particular, it was shown in [4] that the marginal likelihood maximization method is consistent for diagonal kernels in terms of the mean square error and asymptotically minimizes a weighted mean square error for nondiagonal kernels. In [78], the robustness of the marginal likelihood maximization is analysed with the help of the excess degrees of freedom. It is further shown in [63, 64, 66] that Stein's unbiased risk estimation as well as many cross-validation methods are asymptotically optimal in the sense of the mean square error. In [4, 17, 94], the optimal hyperparameter of the marginal likelihood maximization is shown to be sparse. By exploiting this property it is possible to handle various structure detection problems in system identification, like sparse dynamic network identification [17, 26]. Full Bayes approaches can be found, e.g., in [69].

As also recalled in the previous chapter, a straightforward implementation of the regularization method in the RKHS framework has computational complexity \(O(N^3)\) and is thus prohibitive when N is large. Many efficient approximation methods have been proposed in machine learning, e.g., [53, 81, 82]. In the context of linear system identification, there is another practical issue that must be noted in the implementation: the ill-conditioning possibly arising from the use of stable kernels, which is unavoidable due to the nature of stability. Hence, extra care has to be taken when developing efficient implementations. Some approximation methods have been proposed to reduce the computational complexity and avoid numerical issues. The first one is to truncate the IIR at a suitable finite order n. Then, computational complexity becomes \(O(n^3)\) and one can also use the approach proposed in [18] relying on some fundamental algebraic techniques and reliable matrix factorizations. The other one is to truncate the infinite expansion of a kernel at a finite order l. Then, computational complexity becomes \(O(l^3)\), see [15]. See also [36] for efficient kernel-based regularization implementation using the Alternating Direction Method of Multipliers (ADMM). Another practical issue is the difficulty caused by local minima. For kernels with a small number of hyperparameters, e.g., the stable spline kernel and the diagonal correlated kernel, this difficulty can be faced using different starting points or some grid methods. For systems with complicated dynamics, it is suggested to apply linear multiple kernels [17], since the corresponding marginal likelihood maximization is a difference of convex programming problem and a stationary point can be found efficiently using sequential convex optimization techniques, e.g., [48, 87].

In this chapter, we only considered single-input single-output linear systems with white measurement noise. For multiple-input single-output linear systems, it is natural to use multi-input impulse response models and then assume that the overall system has a block diagonal kernel [73]. The regularization method can also be extended to handle linear systems with colored noise, e.g., ARMAX models. One can exploit the fact that such systems can be approximated arbitrarily well by finite-order ARX models [57]. The problem thus becomes a special case of multiple-input single-output systems where the regressors also contain past outputs [71]. This will also be illustrated in Chap. 9.

In practice, the data could be contaminated by outliers due to a failure in the measurement or transmission equipment, e.g., [56, Chap. 15]. In the presence of outliers, robust statistics suggests using heavy-tailed distributions instead of the commonly used Gaussian distribution for the noise, e.g., [49]. For regularization methods in the RKHS framework, the key difficulty is that the hyperparameter estimation criteria and the regularized estimate may not have closed-form expressions. Several methods have been proposed to overcome this difficulty. In particular, an expectation maximization (EM) method was proposed in [10] and further improved in [55] by exploiting a variational expectation method.

Input design is an important issue for classical system identification and many results have been obtained, e.g., [38, 45, 47, 56]. For regularized system identification in RKHS, some results have been reported recently. The first result was given in [37], where the mutual information between the output and the impulse response was chosen as the input design criterion. Unfortunately, obtaining the optimal input involves the solution of a nonconvex optimization problem. Differently from [37], the work [65] adopts scalar measures of the Bayesian mean square error as input design criteria, proposing a two-step procedure to find the globally optimal input through convex optimization.

Concerning the construction of uncertainty regions around the dynamic system estimates, approaches are available which return bounds that, beyond being non-asymptotic, are also exact, i.e., with the desired inclusion probability. This requires some assumptions on data generation, like the introduction of prior distributions on the impulse response. An important example, already widely discussed in this book, is the use of a Bayesian framework that interprets regularization as Gaussian regression [83]. The posterior density becomes available in closed form and Bayes intervals can be easily obtained. Another approach to compute bounds for linear regression is the sign-perturbed sums (SPS) technique [30]. Following a randomization principle, it builds guaranteed uncertainty regions for deterministic parametric models in a quasi-distribution-free setup [11, 12]. Recently, there have been notable extensions to the class of models that SPS can handle. The first line of thought still sees the unknown parameters as deterministic but introduces regularization, see [29, 70, 89] and also [31], which is a first attempt to move beyond the strictly parametric nature of SPS. A second line of thought allows for the exploitation of some form of prior knowledge at a more fundamental probabilistic level [13, 70].

Finally, the interested reader is referred to the survey [73] for more references; see also [25, 58].