Abstract
Regularization has been intensively used in statistics and numerical analysis to stabilize the solution of ill-posed inverse problems. Its use in System Identification, instead, has been less systematic until very recently. This chapter provides an overview of the main motivations for using regularization in system identification from a “classical” (Mean Square Error) statistical perspective, also discussing how structural properties of dynamical models, such as stability, can be controlled via regularization. A Bayesian perspective is also provided, and the language of maximum entropy priors is exploited to connect different forms of regularization with time-domain and frequency-domain properties of dynamical systems. Some numerical examples illustrate the role of hyperparameters in controlling model complexity, quantified, for instance, by the notion of Degrees of Freedom. A brief outlook on more advanced topics, such as the connection with (orthogonal) basis expansions, McMillan degree and Hankel norms, is also provided. The chapter is concluded with a historical overview of the early developments of the use of regularization in System Identification.
5.1 Preliminaries
As we have discussed in the preceding chapters, system identification can be framed as an inverse problem which aims at finding a dynamical model \(\mathcal{M}\) from a set of measured input–output “training” data \(\mathcal{D}_{T}:=\{u(t),y(t)\}_{t=1,\ldots ,N}\). The field of inverse problems [5] has motivated the development of, and is pervaded by, regularization techniques; as such it is evident that regularization could and should play a major role also in the system identification arena.
Nevertheless, we believe it is fair to say that regularization did not have a pervasive impact on system identification until very recently. To introduce its use in this field, we will refer to linear models \(\mathscr {M}=\{\mathscr {M}(\theta )|\theta \in D_{\mathscr {M}}\}\) introduced in Chap. 2, Eq. (2.1). Note that this notation not only includes classical parametric structures, such as ARX, ARMAX, Box–Jenkins models, but also so-called nonparametric ones where the “parameter” \(\theta \) may be infinite dimensional, e.g., containing all the impulse response coefficients of the filters \(W_y(q)\) and \(W_u(q)\) which characterize the predictor
The transfer functions \(W_y(q)\) and \(W_u(q)\) are related to the input–output model
by the relation
see also (2.4).
For simplicity here, we consider the single-output case \(y(t) \in {\mathbb R}\). In the prediction error framework described in Chap. 2, the model fit is typically measured by the negative log likelihood
which in the Gaussian case is, up to constants, proportional to the sum of squared prediction errors
As discussed in Chap. 3, regularization can be added to make the inverse problem of estimating the model \(\mathcal{M}(\theta )\) from data well-posed, and therefore regularized estimators \(\hat{\theta }_R \) of the form
are considered. This framework has been extensively discussed in the previous chapter in the context of linear regression under the squared loss \( V_N(\theta ) =\) \( \Vert Y - \varPhi \theta \Vert _2^2,\) see e.g., Eq. (3.57).
The function \(J_\gamma (\theta )\) is usually referred to as the penalty function, and possibly depends on some (hyper-)parameter \(\gamma \). In the simplest case \(J_\gamma (\theta )\) takes the multiplicative form
and \(\gamma \) acts as a scaling factor which controls the “amount” of regularization. The most famous example is the so-called ridge regression problem, in which a quadratic loss \(V_N(\theta )\) is used and \(J(\theta ): = \Vert \theta \Vert ^2\) so that (see also (3.61a)):
However, ridge regression has not had a significant impact in the context of System Identification, i.e., when the vector \(\theta \) contains the impulse response coefficients of a (linear) dynamical system. To understand why, it is important to discuss the choice of \(J_\gamma (\theta )\). We will see that it plays a fundamental role and strongly influences the properties of the estimator \(\hat{\theta }_R\). In particular, we will see how \(J_\gamma (\theta )\) should be designed to encode properties of dynamical systems such as BIBO stability, smoothness in time domain and frequency domain, oscillatory behaviour and so on; this is a form of “inductive bias” well known and studied in the machine learning community, see e.g., [61].
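As a minimal numerical sketch of the ridge estimator, recall its standard closed form \(\hat{\theta }_R = (\varPhi ^T\varPhi + \gamma I)^{-1}\varPhi ^T Y\). The toy data, the value \(\gamma = 10\) and the function name `ridge` below are our own illustrative choices, not taken from the chapter; the point is the uniform shrinkage of all coefficients that ridge induces.

```python
import numpy as np

def ridge(Phi, Y, gamma):
    """Ridge-regression estimate: argmin ||Y - Phi theta||^2 + gamma ||theta||^2."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + gamma * np.eye(n), Phi.T @ Y)

# Toy problem: noisy linear measurements of a decaying impulse response.
rng = np.random.default_rng(0)
theta0 = 0.8 ** np.arange(1, 21)           # "true" parameters (our choice)
Phi = rng.standard_normal((50, 20))        # regressors
Y = Phi @ theta0 + 0.5 * rng.standard_normal(50)

theta_ls = ridge(Phi, Y, 0.0)              # gamma = 0 gives least squares
theta_r = ridge(Phi, Y, 10.0)              # regularized estimate
# Ridge shrinks the whole parameter vector toward zero, uniformly:
assert np.linalg.norm(theta_r) < np.linalg.norm(theta_ls)
```

Note that the penalty treats all coefficients identically; nothing in \(\Vert \theta \Vert ^2\) reflects the ordering of impulse response coefficients, which is precisely the issue discussed next.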
As argued in Chap. 4, regularization can be given a Bayesian interpretation. In fact, introducing a probabilistic prior on model parameters \(\theta \) of the form
and the Likelihood function:
the maximum a posteriori (MAP) estimator of \(\theta \) (see (4.2)), becomes
In what follows, we will therefore use interchangeably the “regularization” framework, and thus think of \(J_\gamma (\theta )\) as a penalty function, or the “Bayesian” framework, and thus think of \(p_\gamma (\theta )\) as a prior (with some caution in the infinite-dimensional case).
5.2 MSE and Regularization
The final goal of modelling is to perform some task, e.g., prediction or control, on future unseen data. As such the estimated model quality should be measured having the objective in mind. For simplicity, we will consider a prediction task, referring the reader to the literature discussed in Sect. 5.9 for extensions. To this purpose, in addition to the training data \(\mathcal{D}_{T},\) let us introduce testing data:
A model \(\hat{\mathcal{M}}:= \mathcal{M}(\hat{\theta })\) estimated using the training data \(\mathcal{D}_{T}\) should then predict the test data \(\mathcal{D}_{test}\) well. In particular, let \(\hat{y}(t|\hat{\theta })\) be the output prediction at instant t constructed using the estimated model. Then, we can measure the performance of \(\hat{\mathcal{M}}\) using the Mean Squared Error (MSE) on output (Y) prediction, assuming that data are generated by some “true”, yet unknown parameter vector \( \theta _0\). This is defined as
where, for simplicity, we have assumed stationary statistics for the couples \(u_{test}(t),y_{test}(t)\) in the last step. In this section, we will argue that using regularization in estimating \(\hat{\theta }\) can indeed help in obtaining a small \(MSE_{Y}(\hat{\mathscr {M}},\theta _0)\). Let us first assume that data are generated by an unknown “true” linear time-invariant (LTI) causal model:
where the “true” “parameter” \(\theta _0 = [g_1,g_2,g_3,\ldots ,g_n,\ldots ]\) is an infinite sequence in \(\ell ^1\), i.e.,
We now consider the model class \(\mathscr {M}(\theta )\) of Finite Impulse Response (FIR) Output Error (OE) models
where the parameter vector \(\theta \in {\mathbb R}^n\) contains the coefficients of an nth-order finite impulse response model. Under the assumption that the input process is unit variance white noise, independent of the measurement noise, and defining
the MSE (5.12) has the expression
This is nothing but the usual bias-variance trade-off discussed in Chap. 1: the model (\(\theta \) in this case) has to be rich enough (i.e., n large) to capture the “true” data generating mechanism (low bias) but also simple enough (i.e., n small) to be estimated using the available data with low variability (low variance). The squared loss
present on the right-hand side of (5.15), after the third equality, is called a compound loss on the (possibly infinite) vector \(\theta \) [60, 63] and defines the MSE.
Considering compound losses of this type allows us to connect with the discussion made in Chap. 1 on Stein’s effect. To simplify exposition, let us assume that the identification input is a discrete impulse \(u(t) = \delta (t)\) so that we can think of y(t) as direct noisy measurements of all the (nonzero) impulse response coefficients
Defining \(Y:=[y(1),\ldots ,y(n)]^T\) and \(E:=[e(1),\ldots ,e(n)]^T\) the measurement model (5.16) can be written in vector form
As we have seen in Chap. 1, the least squares estimator \(\hat{\theta }_{LS}\) for model (5.17) is dominated (for \(n>2)\) by the James–Stein estimator discussed in Sect. 1.1.1. As argued in Chap. 1, the James–Stein estimator (1.3) is a special case of a regularized estimator (5.3) where \(J_\gamma (\theta ) = \gamma \Vert \theta \Vert ^2\) and \(\gamma \) takes the data-dependent form (1.4)
Following this route, the James–Stein estimator favours “small” parameter values (the regularization term \(J_\gamma (\theta )= \gamma \Vert \theta \Vert ^2\) penalises large \(\Vert \theta \Vert \)); it is therefore to be expected that the gap w.r.t. the least squares estimator is larger when the “true” parameter vector is close to the origin; this has been illustrated in Fig. 1.1.
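A minimal Monte Carlo sketch of this effect is given below; the dimensions, noise level and the “small” true parameter vector are illustrative choices of ours, not taken from Chap. 1. The James–Stein shrinkage factor is the standard \(1 - (n-2)\sigma ^2/\Vert Y\Vert ^2\) for direct measurements \(Y = \theta _0 + E\).

```python
import numpy as np

def james_stein(Y, sigma2=1.0):
    """James-Stein shrinkage of direct noisy measurements Y of theta (n > 2)."""
    n = len(Y)
    shrink = 1.0 - (n - 2) * sigma2 / np.dot(Y, Y)
    return shrink * Y

rng = np.random.default_rng(1)
theta0 = np.full(10, 0.3)                  # "small" true parameters (our choice)
mse_ls, mse_js = 0.0, 0.0
for _ in range(2000):
    Y = theta0 + rng.standard_normal(10)   # Y = theta0 + E, unit-variance noise
    mse_ls += np.sum((Y - theta0) ** 2)    # least squares is just Y itself here
    mse_js += np.sum((james_stein(Y) - theta0) ** 2)

# Shrinking toward the origin pays off when theta0 is small:
assert mse_js < mse_ls
```

Repeating the experiment with a large \(\theta _0\) shrinks the gap, consistent with the discussion above.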
As pointed out in Sect. 1.1.2, there is actually nothing special in having chosen the origin as a reference. In fact, the penalty term can be replaced with \(J_\gamma (\theta ) = \gamma \Vert \theta -a\Vert ^2\) for any \(a\in {\mathbb R}^n\), yielding estimators which always dominate least squares provided \(\gamma \) is chosen as
This teaches us that under certain circumstances it is possible to steer the estimator, using a suitable penalty functional, towards certain regions of the parameter space (or, more generally, of the model space); most importantly, this can be done without any loss (actually with a gain) for any possible occurrence of the “true” yet unknown system. However, the reader should bear in mind that this only holds for the compound loss (5.15) and should not be seen as a panacea. For instance, James–Stein estimators may provide only marginal improvements over least squares in situations where the signal-to-noise ratio is highly non-uniform over the parameter space, a situation often encountered in system identification when input signals are not white and poor excitation may be present, e.g., in certain frequency bands. This has been illustrated in Example 1.2.
Therefore, as a take-home message from Chap. 1 and the discussion above, we should keep in mind that regularization has much to offer, yet its use in system identification is not straightforward. The main reasons are as follows:
1.
Often one cannot restrict attention to Output Error models (i.e., noise models should also be included) and the input process is neither impulsive nor white. Thus, the MSE (5.12) takes a different form than (5.15). This calls for extensions of James–Stein estimators to weighted losses and non-orthogonal designs; to some extent this has been pursued in the statistics literature, and the reader is referred to [4, 9, 43, 64] and references therein. See also [13, Sect. 6].
2.
While James–Stein estimators have been built with the purpose of showing that the least squares estimator is not admissible (see Sect. 1.1.1 for a formal definition), it may not necessarily be our primary goal to dominate least squares (or another estimator) uniformly over the parameter space. In order to cure the ill-conditioning phenomenon widely discussed in Chap. 3, it could be advantageous to tailor regularization to certain “dynamical-system”-oriented properties, thus gaining a lot in certain regions of the model space, while possibly incurring minor losses in other regions which are very unlikely.
The latter is one of the main goals of this book, i.e., to provide the reader with a thorough understanding of the role of regularization in estimating dynamical systems, so as to optimally design regularization methods depending on the intended use of the model. In the remaining part of the chapter, we will first introduce the concept of “optimal” prior and derive its expression. We will then connect the structure of the optimal prior to the notion of BIBO stability for linear dynamical systems, as well as to smoothness in the time and frequency domains. Connections with the Bayesian setting will also be provided. The chapter will be concluded with a historical overview of how the use of regularization in the context of estimation of dynamical systems has evolved, illustrating also the role played by time- and frequency-domain smoothness.
5.3 Optimal Regularization for FIR Models
Let us consider the problem of estimating the impulse response \(\{\theta _k\}_{k=1,\ldots ,n}\) of the FIR model (5.14) using data \(\{y(t)\}_{t=1,\ldots ,N}\). The FIR model can be compactly written as
where \(Y:=[y(1),\ldots ,y(N)]^T\), \(E:=[e(1),\ldots ,e(N)]^T\) and \(\varPhi \) contains the input samples, which are assumed to be available for all times needed, so as to avoid issues related to the initial condition. We will again use \(\theta _0\) to denote the “true” value that has generated the data.
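As a sketch of how such a regression matrix can be assembled from the input samples (the function name `fir_regressor` and the indexing convention are ours; as in the text, we assume inputs are available at all lags needed):

```python
import numpy as np

def fir_regressor(u_ext, n, N):
    """Regression matrix for y(t) = sum_{k=1}^n theta_k u(t-k), t = 1..N.
    u_ext[i] holds u(i - n + 1), i.e. the samples u(1-n), ..., u(N-1), so all
    lagged inputs are available (no initial-condition issue)."""
    Phi = np.empty((N, n))
    for t in range(1, N + 1):
        for k in range(1, n + 1):
            Phi[t - 1, k - 1] = u_ext[t - k + n - 1]   # entry u(t - k)
    return Phi

# Sanity check: a unit impulse at t = 0 makes y(t) = theta_t for t = 1..n.
n, N = 4, 6
u_ext = np.zeros(N + n - 1)
u_ext[n - 1] = 1.0                     # this entry is u(0)
Phi = fir_regressor(u_ext, n, N)
theta = np.array([1.0, 0.5, 0.25, 0.125])
y = Phi @ theta
assert np.allclose(y[:n], theta) and np.allclose(y[n:], 0.0)
```

The impulse-input check mirrors the measurement model (5.16) used earlier in the chapter.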
We now consider the class of regularized estimators
parametrized by the regularization matrix \(P = P^T > 0\). As shown in Chap. 3, see Eq. (3.60), the generalized ridge regression estimator \(\hat{\theta }^{R}\) can be extended to the case where P is singular, so that we can assume \(P = P^T \succeq 0\). As a matter of fact, in the Bayesian framework introduced in Chap. 4, \(\hat{\theta }^{R}\) can also be interpreted as the MAP estimator
obtained under the assumption that the noise E is Gaussian with zero mean and variance \(\sigma ^2 I\), and that \(\theta \) is independent of E and zero-mean Gaussian with (possibly singular) variance \(P = P^T \succeq 0\) (the singular case was described in (4.19)).
In this section, to emphasize the dependence of the estimator \( \hat{\theta }^{R}\) on \(P=P^T \succeq 0\), we will use the notation
Our objective now is to study the performance of the estimator \(\hat{\theta }^P\), in terms of MSE, as a function of \(P = P^T \succeq 0\), under the assumption that Y has been generated by a “true model” of the form (5.18) with a deterministic and unknown parameter \(\theta _0\). Thus, the only source of “randomness” is the noise vector E and the system input which is seen as a stochastic process (independent of E) in this section.
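Before proceeding, it may help to recall the two standard, algebraically equivalent forms of the regularized estimator: \(\hat{\theta }^P = (\varPhi ^T\varPhi + \sigma ^2 P^{-1})^{-1}\varPhi ^T Y\) for nonsingular P, and \(\hat{\theta }^P = P\varPhi ^T(\varPhi P\varPhi ^T + \sigma ^2 I)^{-1} Y\), which remains well defined when P is singular. A quick numerical check on toy data (all numerical values below are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, sigma2 = 8, 40, 0.1
Phi = rng.standard_normal((N, n))
theta0 = 0.7 ** np.arange(1, n + 1)        # decaying "true" impulse response
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)
P = np.diag(0.7 ** np.arange(1, n + 1))    # a (nonsingular) prior covariance

# Form valid for P > 0:
theta_a = np.linalg.solve(Phi.T @ Phi + sigma2 * np.linalg.inv(P), Phi.T @ Y)
# Equivalent form, valid also for singular P:
theta_b = P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + sigma2 * np.eye(N), Y)
assert np.allclose(theta_a, theta_b)
```

The second form is the one used in the singular-P case discussed around Eq. (4.19).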
We consider a test experiment with a new input \(u_{test}(t)\), independent of the input u(t) used for identification; for convenience of notation, we define the lagged test input vector
so that under (5.14) the test output is given by
Let us also define the covariance matrix
(note that stationarity assumptions enter here; indeed, \(W_u\) does not depend on the time t) and the MSE matrix
If we now consider the output mean squared error \(MSE_{Y}(\hat{\mathscr {M}},\theta _0) \) in (5.12) computed for the model \(\hat{\mathcal{M}},\) we obtain
where in the second-to-last equality we have used the assumption that the test input and noise are independent of the training input and noise in the identification data used for estimating \(\hat{\theta }^P\).
A direct consequence of this fact is that, given two prior covariance matrices P and \(P^*\), if \(M_{\theta _0}(P) \succeq M_{\theta _0}(P^*)\), then
i.e., the estimator \(\hat{\theta }^{P^*}\) outperforms \(\hat{\theta }^{P}\) in terms of output prediction for any possible choice of the test input covariance \(W_u\). Thus, if the modelling purpose is output prediction, it is of interest to minimize, w.r.t. all possible \(P=P^T \succeq 0\), the matrix \(M_{\theta _0}(P)\), i.e., to find
so that \(\hat{\theta }^{P^*}\) outperforms any other \(\hat{\theta }^{P}\) in terms of output error (5.15) for any choice of the (test) input covariance \(W_u\). Under the assumption that the true model generating the data is an FIR model of length n with impulse response
the solution \(P^*\) of the minimization problem in (5.20) has been derived in Proposition 3.1, and takes the form
where \(\theta _0\) is the “true” impulse response of the data-generating mechanism (5.14). An alternative proof of the optimal solution (5.21) to problem (5.20) can be found in Sect. 5.10.1. Since \(P^*\) depends on the unknown true system, this result is not of practical interest; however, if we think of the FIR model (5.14) as the approximation of a BIBO stable infinite impulse response model
the impulse response \(\theta _0\) should have finite \(\ell _1\) norm \(\Vert \theta _0\Vert _1\), i.e.,
and therefore \(\theta _{0,k}\) should decay as a function of the index k. As a result, the entries \([P^*]_{ij} = \theta _{0,i}\theta _{0,j}\) of the optimal kernel decay as functions of the row and column indices i and j. In Bayesian terms, it is thus expected that the elements \([P]_{ij}\) of any “good” candidate prior variance should do the same. As we will see later in this chapter, recent forms of regularization for system identification include a decay rate condition on the elements \([P]_{ij}\), so as to guarantee that the estimated system is BIBO stable. Therefore, we will often refer to conditions on the decay rate of P as “stability conditions”. While condition (5.23) is obviously satisfied when \(\theta \) is a finite-dimensional vector, this loose connection between the decay rate of the kernel and stability needs to be tightened. We will see in the next section that this can be properly formulated in a Bayesian framework.
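The rank-one structure \(P^* = \theta _0\theta _0^T\) has a simple consequence worth checking numerically: the resulting regularized estimate is always proportional to \(\theta _0\), i.e., the optimal regularizer encodes the direction of the true impulse response exactly. The toy data and noise level below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, sigma2 = 10, 60, 0.5
theta0 = 0.6 ** np.arange(1, n + 1)        # a BIBO-stable "true" impulse response
P_star = np.outer(theta0, theta0)          # optimal (rank-one) regularization matrix

Phi = rng.standard_normal((N, n))
Y = Phi @ theta0 + np.sqrt(sigma2) * rng.standard_normal(N)

# P* is singular, so we use the estimator form valid for P >= 0:
theta_hat = P_star @ Phi.T @ np.linalg.solve(Phi @ P_star @ Phi.T + sigma2 * np.eye(N), Y)

# The estimate lies in the span of theta0, inheriting its exponential decay:
assert np.linalg.matrix_rank(P_star) == 1
c = theta_hat[0] / theta0[0]
assert np.allclose(theta_hat, c * theta0)
```

Of course, as noted in the text, \(P^*\) is not implementable since \(\theta _0\) is unknown; the check only illustrates why good priors should mimic its decay.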
5.4 Bayesian Formulation and BIBO Stability
In the previous section, we have considered only FIR models which are reasonable approximations of any BIBO LTI system in most practical scenarios. However, it is of interest to formulate the estimation of LTI BIBO stable systems in full generality, without assuming the impulse response to be of finite support. This entails working with infinite dimensional impulse responses \(\{\theta _k\}_{k \in {\mathbb N}}\). In this chapter, we first consider the Bayesian framework, while regularization in infinite-dimensional Hilbert spaces will be addressed in Chap. 6. To start with, we model the unknown impulse response \(\{\theta _k\}_{k \in {\mathbb N}}\) as a stochastic process indexed over time k; this is the straightforward extension to the infinite-dimensional case of (5.18) where \(\theta \) was a finite-dimensional random vector. In this context, it is of interest to introduce the concept of “stable” priors:
Definition 5.1
(Stable priors) A prior on \(\{\theta _k\}_{k \in {\mathbb N}}\) is said to be stable if realizations are sequences almost surely in \(\ell _1\), i.e.,
In most of this book, mostly for computational reasons, we will also assume that \(\{\theta _k\}_{k \in {\mathbb N}}\) is Gaussian (i.e., that any finite collection of random variables \(\{\theta _k\}_{k \in I}\), \(I=\{i_1,\ldots ,i_\ell \}\), \(i_k \in {\mathbb N}\), \(\ell \in {\mathbb N}\), is jointly Gaussian). This is formalized in the following assumption.
Assumption 5.1
Under the Bayesian framework, we assume \(\{\theta _k\}_{k \in {\mathbb N}}\) to be a Gaussian stochastic process with mean \(\{m_k\}_{k \in {\mathbb N}}\) and covariance function K(t, s), \(t,s \in {\mathbb N}\).
\(\square \)
It is an interesting fact that, under additional assumptions on the mean and covariance functions, the prior is stable according to Definition 5.1, as formalized in the following lemma whose proof is in Sect. 5.10.2.
Lemma 5.1
Under Assumption 5.1 and if the following additional conditions hold
then the prior is stable as per Definition 5.1, i.e.,
In most of this book, we will also make the assumption that the a priori mean \(m_t\) is identically zero, so that only the condition on the covariance K(t, s) needs to be checked to ensure stability. We will now discuss different forms of prior covariance K encountered in the literature.
5.5 Smoothness and Contractivity: Time- and Frequency-Domain Interpretations
As seen in Sect. 5.3, the optimal regularizer should mimic the “true” impulse response, which is clearly infeasible since the impulse response is unknown. However, as already discussed in Sect. 5.4, we can use the prior to encode the qualitative behaviour of impulse responses of BIBO stable linear systems. In particular, we have seen in Lemma 5.1 that a certain decay condition on the prior mean and covariance guarantees that (almost surely) only BIBO stable linear systems are described. The simplest example of such a prior model is the following.
Example 5.2
(Diagonal (DI) prior) Assume the prior mean to be zero \(m_t = 0\), \(\forall t\in {\mathbb N}\) and the covariance function to be diagonal with exponentially decaying entries
The parameters \(\lambda \) (scale factor) and \(\alpha \) (decay rate) are treated as hyperparameters to be estimated from data, using e.g., marginal likelihood maximization, as described in Sect. 4.4. It is worth observing that the assumptions of Lemma 5.1 are satisfied, indeed
and hence this is a stable prior.
\(\square \)
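A short numerical sketch of the DI prior follows; the hyperparameter values are illustrative, and we read the stability requirement of Lemma 5.1 as summability of \(\sqrt{K(t,t)}\) (a geometric series here), which is the assumption checked in the last lines.

```python
import numpy as np

def di_kernel(n, lam, alpha):
    """Diagonal (DI) covariance: K(t,s) = lam * alpha**t if t == s, else 0."""
    return np.diag(lam * alpha ** np.arange(1, n + 1))

# Realizations are independent Gaussians with exponentially decaying variance.
rng = np.random.default_rng(4)
n, lam, alpha = 60, 1.0, 0.8
K = di_kernel(n, lam, alpha)
theta = rng.multivariate_normal(np.zeros(n), K, size=30)   # 30 sample responses

# Partial sums of sqrt(K(t,t)) stay below the geometric-series limit,
# the kind of decay that makes the prior stable (cf. Lemma 5.1):
partial = np.sqrt(np.diag(K)).sum()
assert partial < np.sqrt(lam) * np.sqrt(alpha) / (1 - np.sqrt(alpha))
```

Each row of `theta` is one sampled impulse response; plotting them reproduces the qualitative behaviour of the DI realizations shown in Fig. 5.3 (top).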
It is interesting to observe that a decay rate condition on the impulse response coefficients is equivalent to assuming a smoothness condition in the frequency domain. To see this, let us introduce the frequency response function
The \(L_2\)-norm of the first derivative \(\frac{dG(e^{j\omega })}{d\omega }\) can be considered
which using Parseval’s theorem can be expressed in time domain
Computing higher-order derivatives, and using again Parseval’s theorem, the \(L_2\)-norm of the mth-order derivative is given by
Hence, the condition that the \(\{\theta _k\}\) decay rapidly (and possibly exponentially, as postulated by the Diagonal kernel) with k implies a bound on the \(L_2\) norm of the mth-order derivatives, i.e., smoothness in the frequency domain of the model.
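In symbols, writing \(G(e^{j\omega }) = \sum _k \theta _k e^{-j\omega k}\), the computation reads (a reconstruction consistent with the definitions above; the \(1/2\pi \) normalization of the \(L_2\) norm is our convention):

```latex
\left\| \frac{d^m G(e^{j\omega})}{d\omega^m} \right\|_{L_2}^2
 = \frac{1}{2\pi}\int_{-\pi}^{\pi}
   \Big| \sum_{k} (-jk)^m\, \theta_k\, e^{-j\omega k} \Big|^2 \, d\omega
 = \sum_{k=1}^{\infty} k^{2m}\, \theta_k^2 ,
```

so fast decay of \(\theta _k\) keeps every term \(k^{2m}\theta _k^2\) summable, bounding all derivatives of the frequency response.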
As illustrated in Fig. 5.1, smoothness in the frequency domain decreases when \(\alpha \) increases. However, under this prior, the impulse response coefficients are modelled as independent (yet not identically distributed) random variables. Thus no smoothness in the time domain is encoded, as is instead typically done with priors based on random walks, which are the discrete-time counterpart of the spline models discussed in Sect. 4.9. A prior model that, in addition to stability, also includes a smoothness condition in the time domain, is the so-called TC-kernel:
Example 5.3
(Tuned-Correlated (TC) prior) Assume the prior mean is zero \(m_t = 0\), \(\forall t\in {\mathbb N}\) and the covariance function takes the form
As in the previous example, the parameters \(\lambda \) (scale factor) and \(\alpha \) (decay rate) are treated as hyperparameters to be estimated from data, using e.g., marginal likelihood maximization. It is worth observing that also in this case the assumptions of Lemma 5.1 are satisfied, indeed
and hence this is a stable prior. In addition, the TC prior now introduces correlation between impulse response coefficients, e.g., one has
Hence, the correlation is nonzero and decays exponentially with the lag. \(\square \)
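The TC covariance \(K(t,s) = \lambda \alpha ^{\max (t,s)}\) can be sketched as follows (hyperparameter values are illustrative); note that, unlike the DI prior, it correlates neighbouring coefficients, with correlation coefficient \(\alpha ^{|t-s|/2}\).

```python
import numpy as np

def tc_kernel(n, lam, alpha):
    """Tuned-Correlated (TC) covariance: K(t,s) = lam * alpha**max(t, s)."""
    t = np.arange(1, n + 1)
    return lam * alpha ** np.maximum.outer(t, t)

n, lam, alpha = 50, 1.0, 0.8
K = tc_kernel(n, lam, alpha)

# Correlation between coefficients decays as alpha**(|t - s| / 2):
t, s = 5, 9
corr = K[t - 1, s - 1] / np.sqrt(K[t - 1, t - 1] * K[s - 1, s - 1])
assert np.isclose(corr, alpha ** (abs(t - s) / 2))

# K is a valid (positive semidefinite) covariance:
assert np.min(np.linalg.eigvalsh(K)) > -1e-10
```

Sampling from this Gaussian prior, as done for the DI kernel, produces the smoother realizations displayed in Figs. 5.2 and 5.3 (bottom).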
Figure 5.2 shows two typical realizations from the TC prior, both in time domain and frequency domain, for \(\alpha = 0.4\) (top) and \(\alpha = 0.8\) (bottom), while Fig. 5.3 shows 30 sample realizations from the DI (top) and TC (bottom) priors, respectively.
Example 5.4
(Importance of stable priors) In order to illustrate the advantage of using stable priors, we now consider a simple example of identification of an output error model. In particular, we consider a system of the form
where the measured input u(t) and the noise e(t) are realizations from white Gaussian noise with zero mean and unit variance. The impulse response is
For the purpose of identification, we assume the input is available at all time instants needed. For illustration purposes, the impulse response has been truncated at \(k=50\), since it is in practice zero for \(k>50\). We also assume that output measurements y(t) are available for \(t=1,\ldots ,35\). The hyperparameters are all estimated using marginal likelihood maximization, see Sect. 4.4. The results are shown in Fig. 5.4. The reconstruction error is measured using the percentage root mean square (RMS) error:
As illustrated in Fig. 5.4, it is apparent that the results obtained using the stable priors, see panels (b) and (c), outperform those returned by the spline (random walk) prior, see panel (a), which does not include the stability constraint. The best relative error is obtained by the TC prior (\(\simeq \)10%) and goes up to as much as \(\simeq \)33% for the spline prior. It can also be observed that, while for the stable priors (b) and (c) the confidence intervals shrink as the time index k grows, the same does not hold for the spline prior. The same behaviour had been observed in Sect. 4.9, see Fig. 4.1. \(\square \)
In the next section, a class of stable priors, which includes TC as a special case, will be derived following a first-principle maximum entropy framework.
5.5.1 Maximum Entropy Priors for Smoothness and Stability: From Splines to Dynamical Systems
The class of Stable Spline priors introduced in the paper [49] extends the smoothness-prior ideas underlying the spline models introduced in Sect. 4.9, embedding exponential decay conditions in the impulse response prior. They ultimately lead to estimated models which are BIBO stable with probability 1.
In this section, we will introduce a simple construction of these stable spline priors in discrete time. In particular, we will exploit a very natural axiomatic derivation in the maximum entropy framework introduced in Chap. 4. For the sake of illustration, we will only consider the so-called stable spline prior of order one (also known as the TC prior, see Example 5.3) and its extension known as DC prior. Possible extensions will be discussed, but not developed in full detail.
The most natural construction, inspired by smoothing spline ideas, is based on the following two observations:
1.
Stability: the variance of \(\theta _k\) should decay “sufficiently fast” (see Lemma 5.1), possibly exponentially, with the lag k. Assuming a zero-mean process, this can be expressed using a condition on second-order moments of the form:
$$\begin{aligned} {\mathscr {E}}\left[ \theta _{k}^2\right] = \lambda _S \alpha ^{k} \quad k=1,\ldots ,n \quad 0< \alpha <1. \end{aligned}$$(5.28)For reasons that will become clear later on, imposing equality (as done above) rather than inequality constraints is convenient.
2.
Smoothness: the difference between adjacent coefficients should be constrained, e.g., as measured by the relative variance,
$$\begin{aligned} \frac{{\mathscr {E}}\left[ (\theta _{k-1}-\theta _k)^2\right] }{ {\mathscr {E}}\left[ \theta _{k-1}^2\right] } = \lambda _R \quad k=2,\ldots ,n. \end{aligned}$$(5.29)
Using the stability constraint and redefining the constant \(\lambda _R\), condition (5.29) can be rewritten as
The following theorem (whose proof is reported in Sect. 5.10.3) derives the class of maximum entropy priors under the constraints (5.28) and (5.29). Next, in Corollary 5.1 (whose proof is in Sect. 5.10.4), we will see that for special choices of \(\lambda _S\) and \(\lambda _R\) the well-known TC and DC priors [10, 52] are obtained.
Theorem 5.5
Let \(\{\theta _{k}\}_{k=1,\ldots ,n}\) be a zero mean, absolutely continuous random vector with density \(p_\theta (\theta )\), that satisfies the following constraints (with \(0< \alpha <1\)):
with \(\lambda _S\in {\mathbb R}\) and \(\lambda _R \in {\mathbb R}\) such that
Then, the solution \(p_{\theta ,ME}(\theta )\) of the maximum entropy problem
has the following form:
where the matrix \(\varSigma ^{-1}\) has the band structure:
The maximum entropy process admits the backward representation
with
and terminal condition
Last, the autocovariance of \(\theta _k\) satisfies the relation:
Corollary 5.1
Under the conditions of Theorem 5.5 and defining
the maximum entropy model in Theorem 5.5 corresponds to the so-called DC-kernel [10], i.e.,
In particular, for \(\lambda _R = \lambda _S(1-\alpha ) \), this reduces to the so-called TC kernel [10] with
while for \(\lambda _R = \lambda _S(1+\alpha ),\) we obtain the covariance of the “diagonal” kernel
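A common way to write the DC kernel is \(K(t,s) = c\, \alpha ^{(t+s)/2}\rho ^{|t-s|}\), where \(\rho \) plays the role of a correlation hyperparameter; the exact mapping between \((\lambda _S,\lambda _R,\alpha )\) and \((c,\rho )\) follows the definitions in the corollary and is not reproduced here. The sketch below (our own parametrization) checks the two special cases stated above: \(\rho = \sqrt{\alpha }\) recovers the TC kernel and \(\rho = 0\) the DI kernel.

```python
import numpy as np

def dc_kernel(n, c, alpha, rho):
    """DC covariance: K(t,s) = c * alpha**((t + s) / 2) * rho**|t - s|."""
    t = np.arange(1, n + 1)
    T, S = np.meshgrid(t, t, indexing="ij")
    return c * alpha ** ((T + S) / 2) * rho ** np.abs(T - S)

n, c, alpha = 40, 1.0, 0.8
t = np.arange(1, n + 1)

# rho = sqrt(alpha) recovers the TC kernel K(t,s) = c * alpha**max(t,s):
K_tc = dc_kernel(n, c, alpha, np.sqrt(alpha))
assert np.allclose(K_tc, c * alpha ** np.maximum.outer(t, t))

# rho = 0 recovers the DI kernel (0**0 = 1 on the diagonal in NumPy):
K_di = dc_kernel(n, c, alpha, 0.0)
assert np.allclose(K_di, np.diag(c * alpha ** t))
```

The first identity follows from \(\alpha ^{(t+s)/2}\alpha ^{|t-s|/2} = \alpha ^{\max (t,s)}\).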
Remark 5.1
In the maximum entropy kernel derived in Theorem 5.5, which includes DC, TC and DI as special cases as stressed in Corollary 5.1, the constant \(\lambda _S\) plays only the role of a scale factor while \(\alpha \) is a “decay rate”. Therefore, by fixing \(\lambda _S=1\) and \(\alpha =0.8\) we can study the behaviour as the “regularity” constant \(\lambda _R\) varies in the interval \(\lambda _S(1-\sqrt{\alpha })^2 = \lambda _{R,min}\le \lambda _R \le \lambda _{R,max} = \lambda _S(1+\sqrt{\alpha })^2\). This is entirely equivalent to studying the behaviour of the kernel as a function of the ratio \(\lambda _R/\lambda _S\). We thus consider a grid of 9 possible values \(\lambda _{R,min} = \lambda _{R,1}< \lambda _{R,2}< \dots < \lambda _{R,9} = \lambda _{R,max} \). Then, Fig. 5.5 plots 5 sample realizations for each of these values with panel (i) corresponding to the value \(\lambda _{R,i}\). In particular, \(\lambda _{R,4} = \lambda _S(1-\alpha )\) corresponds to the TC kernel and \(\lambda _{R,6} = \lambda _S(1+\alpha )\) induces the DI kernel. For each realization from the prior (solid line) also its best single-exponential fit is shown in order to highlight the “overall” decay rate which can be thought of as an envelope of the curves. In panel (1), with \(\lambda _{R}\) taking the smallest possible value, hence imposing the “maximum” amount of regularity, all realizations are pure exponentials. In panel (9), with \(\lambda _{R}\) taking its maximum value, all realizations are pure damped oscillations. In fact, in both cases, it can be checked that the corresponding kernel is singular.
Degrees of Freedom of the DC Kernels
Theorem 5.5 provides a class of kernels \(K_{\eta }\) parametrized by the hyperparameter vector \(\eta := [\lambda _S, \lambda _R, \alpha ]\). In Fig. 5.5, we have illustrated how realizations from the prior change as a function of the regularity parameter \(\lambda _R\) having fixed \(\lambda _S = 1\) (or, equivalently, as a function of the ratio \(\lambda _R/\lambda _S\)). As discussed in Chap. 4, choosing the prior is equivalent to describing the model class. In the linear system identification context, this then defines a penalty function on impulse responses. A way to measure the “size” of the model class is to use the concept of equivalent degrees of freedom, introduced in the Bayesian context in Sect. 4.8. Unfortunately, the degrees of freedom are defined in terms of the output predictor sensitivity and thus require specifying not only the model class but also the experimental conditions under which the model is estimated. Only in limiting cases (such as an improper prior on finitely and linearly parametrized model classes) do the degrees of freedom become independent of the experiment and coincide with the number of parameters. In this section, we thus consider the prototypical setup in Eq. (5.18):
We recall that the matrix \(\varPhi \) is a Hankel matrix built with the input samples \(\{u(t)\}\), so that \(\varPhi \theta _0\) implements the convolution of u with \(\theta _0\). The input \(\{u(t)\}\) is now assumed to be zero-mean, unit-variance white noise, and so is the noise \(\{e(t)\}\). We consider two scenarios in which the order of the system (the length n of \(\theta _0\)) is assumed to be either \(n=30\) or \(n=100\). Exploiting the derivation in Chap. 4 (see Definition 4.2 and Proposition 4.3), the degrees of freedom \(\mathrm {dof}(\eta )\), as a function of the hyperparameter vector \(\eta \), are given by
Assuming also here that \(\lambda _S = 1\), we study how \({\mathrm {dof}}(\eta )\) varies as a function of \(\lambda _R\) for three different values of \(\alpha \) (0.6, 0.8 and 0.95). The behaviour is illustrated in Fig. 5.6, where it is apparent that the maximum is achieved for the DI kernel, while the minimum (a bit smaller than 1) is attained at the extreme points, where the kernel has rank exactly equal to 1. It is interesting to observe the interplay between the value of \(\alpha \) (which controls the decay rate) and the length n of the FIR model. As the coefficient vector \(\theta _0\) changes from length \(n=30\) (left) to \(n=100\) (right), the effective “size” of the model does not change much for \(\alpha = 0.6\) and \(\alpha = 0.8\), while it does increase for \(\alpha = 0.95\). This confirms that the kernel, for fixed \(\alpha \), effectively controls the model complexity, so that the estimator becomes insensitive to the chosen length, provided n is “big enough” w.r.t. \(\alpha \). In particular, \(n=15\) would be sufficient for \(\alpha =0.6\) and \(n=30\) for \(\alpha =0.8\), while for \(\alpha =0.95\) the effective size is about \(n=100\).
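A plausible numerical reading of this computation is sketched below: we take \(\mathrm {dof}(\eta )\) as the trace of the “hat” matrix mapping Y to the predicted output \(\varPhi \hat{\theta }\) under unit noise variance, with the TC kernel standing in for the general DC case (all numerical values are our own illustrative choices).

```python
import numpy as np

def dof(Phi, K, sigma2=1.0):
    """Equivalent degrees of freedom: trace of the 'hat' matrix that maps Y
    to the predicted output Phi @ theta_hat (unit noise variance by default)."""
    N = Phi.shape[0]
    H = Phi @ K @ Phi.T @ np.linalg.inv(Phi @ K @ Phi.T + sigma2 * np.eye(N))
    return np.trace(H)

rng = np.random.default_rng(5)
n, N = 30, 200
Phi = rng.standard_normal((N, n))          # white-noise-input regressors
t = np.arange(1, n + 1)
K = 0.8 ** np.maximum.outer(t, t)          # TC kernel, lam = 1, alpha = 0.8

d = dof(Phi, K)
assert 0 < d < n                           # regularization shrinks the complexity
# With a very "flat" prior (scale -> infinity) dof approaches the number of
# parameters, recovering the unregularized least squares count:
d_flat = dof(Phi, 1e8 * K)
assert d < d_flat and d_flat > n - 0.5
```

Increasing the kernel scale interpolates between heavily regularized (small dof) and essentially unregularized (dof \(\approx n\)) estimators, matching the discussion of Fig. 5.6.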
Extension to Smoothness Conditions on Filtered Versions \(\star \)
So far, we have limited our attention to so-called “first-order” stable splines, which are derived by imposing conditions on “first-order” differences, leading to first-order, i.e., AR(1), realizations. Of course, these constructions can be generalized by replacing (5.31) with a higher-order constraint of the form
While the first constraint is a “standard” stability condition, the second constraint can be interpreted as a filtered frequency domain smoothness condition. In fact, defining the filter \(F(q): = 1 - \sum _{i=1}^p a_i q^i\), let us denote with \(\theta ^F_k\) the sequence obtained filtering \(\theta _k\) with F(q). The condition
implies that \(\theta ^F_k\) should decay “fast” enough (in mean square) and thus
should be small for any integer m. As a consequence, if
using Parseval’s theorem,
should be small as well, implying that \(\theta _k\) should concentrate most of its energy (variance) in frequency bands where the (absolute value of the) filter \(F(e^{j\omega })\) is small.
In principle, we regard developments of this type as a straightforward extension of the basic ideas used in this chapter to obtain DC kernels. In particular, the choice of the coefficients a in (5.44) is a design issue, which can be guided by prior knowledge on the candidate models, and its underlying principles and ideas are the same as those illustrated above. There are, however, additional complications due to the richer structure of the constraints, which may make it non-trivial to derive an analytic expression for the kernel.
5.6 Regularization and Basis Expansion \(\star \)
The \(\ell _2\) (ridge regression) regularized estimators that have been discussed in this chapter can also be framed in the context of basis expansion using the so-called Karhunen–Loève decomposition of the random process \(\theta \). For the sake of exposition, we will now consider the finite-dimensional case, i.e., we will study FIR models of length n of the form (5.14). Extension to the infinite-dimensional case will be discussed in the framework of Reproducing Kernel Hilbert Spaces illustrated in Chap. 6. Under this finite-dimensional assumption, we consider the covariance matrix \(\mathbf{K}\in {\mathbb R}^{n\times n}\) whose entries satisfy \([\mathbf{K}]_{(t,s)}:=K(t,s) = \mathrm{cov}(\theta _t,\theta _s)\). The matrix \(\mathbf{K}\) can be written in terms of its spectral decomposition (Singular Value Decomposition) in the form:
where
The set of vectors \(u_i \in {\mathbb R}^n\) provides an orthonormal basis of \({\mathbb R}^n\) so that any impulse response \(\theta \in {\mathbb R}^n\) can be written using the orthonormal basis expansion
where the coefficients \(\beta _i = \langle \theta ,u_i\rangle = u_i^T \theta \) are therefore zero-mean random variables with covariances
Clearly, the argument above can be reversed. Namely, starting from (a possibly orthonormal) basis \(u_i\), \(i=1,\ldots ,n\) the random basis expansion
induces a probability description of the candidate \(\theta \)’s, which turns out to be zero mean with covariance matrix as in (5.45). This interpretation provides a clear link between “standard” models described in terms of basis expansions, regularization, and the Bayesian view.
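The correspondence between the kernel and the random basis expansion can be sketched numerically; here the TC kernel \(K(t,s)=\alpha ^{\max (t,s)}\) is an illustrative covariance choice, and the sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
t = np.arange(1, n + 1)
K = 0.9 ** np.maximum.outer(t, t)    # illustrative TC kernel as covariance matrix

# Spectral (Karhunen-Loeve) decomposition K = U diag(xi) U^T
xi, U = np.linalg.eigh(K)
xi, U = xi[::-1], U[:, ::-1]         # sort eigenvalues in decreasing order

# Random basis expansion: theta = sum_i beta_i u_i with Var(beta_i) = xi_i
M = 20000
beta = np.sqrt(np.maximum(xi, 0))[:, None] * rng.standard_normal((n, M))
theta = U @ beta                     # M samples from N(0, K)

K_emp = theta @ theta.T / M          # empirical covariance, should approach K
print(np.max(np.abs(K_emp - K)))
```

Reading the construction backwards, projecting the samples onto the basis (\(\beta _i = u_i^T\theta \)) recovers coefficients whose empirical variances match the eigenvalues \(\xi _i\), as in (5.47).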
Remark 5.2
(Low-Rank Kernel Approximation) The spectral decomposition of the kernel (5.45) also suggests that, when some singular values \(\xi _i\) are “very small”, the kernel can easily be approximated by a low-rank matrix
This is equivalent to setting to zero the singular values \(\xi _i\) below a certain threshold. The threshold can be chosen by a standard SVD-truncation criterion, e.g., neglecting singular values below a certain fraction of the largest singular value \(\xi _1\), i.e., those that satisfy
In Fig. 5.7, the value \(R = 20\) has been chosen to plot the most relevant eigenfunctions. Low-rank kernel approximation can also be exploited to reduce the computational burden in computing the solutions.
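The truncation step can be sketched as follows; the TC kernel and the ranks tried below are illustrative assumptions of this example.

```python
import numpy as np

def lowrank(K, R):
    """Best rank-R approximation of a symmetric PSD K via eigenvalue truncation."""
    xi, U = np.linalg.eigh(K)
    xi, U = xi[::-1], U[:, ::-1]          # decreasing eigenvalue order
    return (U[:, :R] * xi[:R]) @ U[:, :R].T

n, alpha = 60, 0.9
t = np.arange(1, n + 1)
K = alpha ** np.maximum.outer(t, t)       # illustrative TC kernel

# Relative Frobenius error of the truncated kernel for a few ranks R
for R in (5, 10, 20):
    err = np.linalg.norm(K - lowrank(K, R)) / np.linalg.norm(K)
    print(R, err)
```

The error is non-increasing in R and vanishes at \(R = n\); in practice one picks the smallest R for which the retained singular values dominate, exactly as in the threshold criterion above.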
Figure 5.7 shows the eigenfunctions of the DC kernel for different choices of the hyperparameters. As already studied in the previous section, the “complexity” of the kernel, measured e.g., by the degrees of freedom as illustrated in Fig. 5.6, varies as the hyperparameters change. In the context of basis expansions, this is clear from Fig. 5.8, where the singular values of the kernel, i.e., the variances of the basis expansion coefficients \(\beta _i\) introduced in (5.47), vary as the hyperparameters change. For instance, when \(\lambda _R = \lambda _{R,min}\), see panel (1), and \(\lambda _R = \lambda _{R,max}\), see panel (9), the kernel has rank 1. The singular values instead decay more slowly for the DI kernel, see panel (5), which also has the largest number of degrees of freedom, see Fig. 5.6.
Even if this section is devoted to finite impulse response models (i.e., n finite, and therefore BIBO stable systems), it still makes sense to discuss what happens to the coefficients \(\theta _n\) when n becomes “large” and its relation with BIBO stability. In Lemma 5.1, we have seen that a sufficient condition for a.s. BIBO stability of realizations from the Gaussian prior is that the diagonal elements of K satisfy the summability condition
which requires a “sufficiently fast” decay rate of the diagonal K(t, t). A quite natural question is how the behaviour of K(t, t) is reflected in the basis vectors \(u_i\). The following lemma, whose proof is in Sect. 5.10.5, gives the answer.
Lemma 5.2
The basis vectors \(u_i\) introduced in (5.45), whose tth elements are denoted by \(u_{it}\), satisfy the inequality
Condition (5.48) also holds in the infinite-dimensional case, i.e., as \(n\rightarrow \infty \), provided K(t, s) admits the spectral decomposition
where the \(u_{i}\) are orthonormal sequences in \(\ell _2\) and the condition \(\sum _{t=1}^\infty K(t,t) = C <\infty \) is satisfied.
While this result is essentially trivial for n finite, it becomes important when \(n\rightarrow \infty \), since it provides a condition on the tail behaviour of the eigenvectors (eigenfunctions). For instance, if the diagonal entries (variances) of the kernel K(t, t) decay exponentially fast as a function of t, then so do the \(u_{it}\). The decay of the eigenfunctions can be visually inspected in Fig. 5.7.
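Lemma 5.2 can be checked numerically. The sketch below assumes the bound takes the form \(|u_{it}| \le \sqrt{K(t,t)/\xi _i}\), which is consistent with the identity \(K(t,t)=\sum _i \xi _i u_{it}^2\) used in the proof of Sect. 5.10.5; the TC kernel is again an illustrative choice.

```python
import numpy as np

n, alpha = 50, 0.8
t = np.arange(1, n + 1)
K = alpha ** np.maximum.outer(t, t)     # illustrative TC kernel

xi, U = np.linalg.eigh(K)               # columns of U are the basis vectors u_i

# Assumed form of the Lemma 5.2 bound: |u_it| <= sqrt(K(t, t) / xi_i)
bound = np.sqrt(np.diag(K)[:, None] / xi[None, :])
print(np.all(np.abs(U) <= bound + 1e-12))
```

Since \(K(t,t)\) decays like \(\alpha ^t\) here, the bound forces the tails of the eigenvectors associated with non-negligible \(\xi _i\) to decay as well, as stated in the text.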
5.7 Hankel Nuclear Norm Regularization
As discussed above, regularization can be used to enforce smoothness and stability of impulse responses. Yet this is just one way, and possibly not the most common in the field of dynamical systems, to control the “complexity” of model classes.
For instance, in the parametric approach to system identification, the complexity can be measured by the dimension of a minimal state-space realization of the unknown system. For ease of exposition, let us now only consider the single-input single-output output error case (i.e., \(H(z) = 1\)). In this case, the number of free parameters is \(2n+1\), where n is the degree of the denominator of the transfer function \(G_\theta (z)\); this also equals the dimension n of a minimal state-space realization of \(G_\theta (z)\), which is called the McMillan degree of \(G(z,\theta )\), as seen in Sect. 2.2.1.1. To fix notation, let us introduce a minimal state-space realization of \(G(z,\theta )\)
which is such that \(G(z,\theta )= C(zI-A)^{-1}B\). If \(\{g(k,\theta )\}_{k \in {\mathbb N}}\) is the impulse response sequence, parametrized by \(\theta \), then one has \(g(k,\theta ) = CA^{k-1}B\) \(\forall k>0\).
It is well known from realization theory that the McMillan degree has a close connection with the so-called Hankel matrix formed with the impulse response coefficients, i.e.,
with r block rows and c block columns. The following lemma holds.
Lemma 5.3
(based on [65]) The linear time-invariant system with impulse response \(\{g(k,\theta ) \}_{k\in {\mathbb N}}\) admits a minimal state-space realization of order n (i.e., has McMillan degree equal to n) if and only if, for some choice of r, c the following holds:
In practice, only a finite number of impulse response (Markov) parameters \(g(k,\theta ) \), \(k=1,\ldots ,p\), is available, and the problem of finding a state-space model of the form (5.49) such that \(g(k,\theta ) = CA^{k-1}B\) \(\forall \; k=1,\ldots ,p\) is known as the partial realization problem.
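As an illustration of Lemma 5.3 and of the partial realization setting, the sketch below builds the Hankel matrix from the Markov parameters of a made-up minimal second-order model and recovers its McMillan degree as the matrix rank.

```python
import numpy as np

# A made-up minimal (reachable and observable) second-order model in companion form;
# poles at 0.5 and 0.6, a zero at -0.5, so no pole-zero cancellation occurs.
A = np.array([[1.1, -0.3], [1.0, 0.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, 0.5]])

def impulse(A, B, C, p):
    """Markov parameters g(k) = C A^(k-1) B for k = 1..p."""
    g, x = [], B.copy()
    for _ in range(p):
        g.append((C @ x).item())
        x = A @ x
    return np.array(g)

def hankel(g, r, c):
    """Hankel matrix with (i, j) entry g(i + j + 1) (1-based Markov indexing)."""
    return np.array([[g[i + j] for j in range(c)] for i in range(r)])

g = impulse(A, B, C, 20)
H = hankel(g, 10, 10)
print(np.linalg.matrix_rank(H))   # equals the McMillan degree of the model
```

Here the rank of \(\mathcal{H}_{10,10}\) equals 2, the order of the minimal realization, in agreement with Lemma 5.3.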
This shows that, indeed, a notion of “complexity” can be attached to the dimension n of a minimal state-space realization (5.49); therefore the rank of the Hankel matrix \(\mathcal{H}_{c,r}(\theta )\) can be considered as a candidate for performing regularization. This leads to the choice of a penalty given by
for suitable values of the integers c, r. Unfortunately, similarly to what happens for the \(\ell _0\) quasi-norm \(\Vert x\Vert _0\) (defined as the number of non-zero entries of the vector x) discussed in Sect. 3.6.2.1, the rank functional is not convex; as a result, solving optimization problems involving penalties of the form (5.52) is problematic. The very same issue arises in a variety of rank-constrained optimization problems.
As seen in Chap. 3, to overcome this limitation, inspired by work on \(\ell _1\) regularization, researchers have suggested using the nuclear norm \(\Vert A \Vert _*\) of a matrix \(A\in {\mathbb R}^{m\times n}\), defined as
where \(\sigma _i(A)\) denotes the ith singular value of the matrix A, as a surrogate for the rank of the matrix A. The nuclear norm is also known as Ky–Fan n-norm or trace norm. This choice is motivated by the following lemma.
Lemma 5.4
(based on [20]) Given a matrix \(A \in {\mathbb R}^{m\times n}\) the nuclear norm of A is the convex envelope of the rank function on the set \(\mathcal{A}:=\{ A \in {\mathbb R}^{m\times n}, \; \Vert A\Vert \le 1\}\).
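A small sketch of this surrogate property: on the unit spectral-norm ball, the nuclear norm never exceeds the rank. The matrix sizes and seed below are arbitrary choices of the example.

```python
import numpy as np

def nuclear_norm(A):
    """||A||_* = sum of the singular values of A (trace norm)."""
    return np.sum(np.linalg.svd(A, compute_uv=False))

rng = np.random.default_rng(0)
# A rank-3 matrix, rescaled so that its spectral norm is 1 (i.e., A lies in the set of Lemma 5.4)
A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 8))
A /= np.linalg.norm(A, 2)

print(nuclear_norm(A), np.linalg.matrix_rank(A))
```

Since \(\Vert A\Vert _* = \sum _i \sigma _i(A) \le \mathrm{rank}(A)\,\sigma _1(A)\), the nuclear norm is a convex lower bound of the rank on \(\{\Vert A\Vert \le 1\}\), which is the content of Lemma 5.4.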
These considerations have led to a whole class of regularization methods which build upon the nuclear norm of the Hankel matrix
as a possible regularizer. Several extensions have also been considered, including weighted versions of the form
where \(W_c\) and \(W_r\) are, respectively, “column” and “row” weightings. The latter can possibly be adapted iteratively, in the framework of iteratively reweighted methods such as those commonly used in conjunction with \(\ell _1\) and/or \(\ell _2\) reweighted schemes, see e.g., [72].
The Hankel norm regularizer can also be studied from a Bayesian perspective, considering the prior
To gain some intuition on the structure of this prior, let \(g(k,\theta )=\theta _k\) and consider the following modified prior which penalizes the nuclear norm of the squared Hankel matrix, i.e.,
The reason for introducing \( \tilde{ \mathrm {p}}\) is twofold. The first is related to the fact that the prior (5.55) is equivalent to assuming that the entries \(\theta _k\) of the impulse response are independent zero mean Gaussians, as formalized in the following proposition.
Proposition 5.1
(based on [53]) Let \( \tilde{ \mathrm {p}}_{\mathcal{H},\gamma }(\theta )\) be as in (5.55) and let \(\theta \in {\mathbb R}^m \sim \tilde{\mathrm {p}}_{\mathcal{H},\gamma }(\theta ),\) where \(\mathcal{H}_{p,p}(\theta )\) is its \(p\times p\) Hankel matrix (with \(m=2p-1\)). Then the \(\theta _k\)’s are zero mean, independent and Gaussian. In particular:
As illustrated in Fig. 5.9, from (5.56) one sees that the variance of \(\theta _k\) does not decay with the lag k, and hence the prior \( \tilde{ \mathrm {p}}_{\mathcal{H},\gamma }(\theta )\) does not induce a BIBO stable hypothesis space.
Second, the prior \( \tilde{ \mathrm {p}}_{\mathcal{H},\gamma }(\theta )\) can be used as a proposal distribution for an MCMC scheme, as introduced in Sect. 4.10, to sample from the Hankel prior \(\mathrm {p}_{\mathcal{H},\gamma }(\theta ) \) in (5.54) with \(g(k,\theta )=\theta _k\). Samples from \(\mathrm {p}_{\mathcal{H},\gamma }(\theta )\) can then be used to approximate the variances \(\text{ Var }\{\theta _k\}\) and the correlations \(\text{ Corr }\{\theta _{k},\theta _{h}\}\). These are shown in Fig. 5.9. In particular, the solid line in the left panel shows \(\text{ Var }\{\theta _k\}\) as a function of k, while the right panel shows \(\text{ Corr }\{\theta _{k},\theta _{k+h}\}\) as a function of h for k fixed to 50. It is clear that, even though under \(\mathrm {p}_{\mathcal{H},\gamma }(\theta )\) the \(\theta _k\)’s are not Gaussian, the variances resemble those of \(\tilde{\mathrm {p}}_{\mathcal{H},\gamma }(\theta )\) (left panel, dashed line) and their correlations resemble those of independent variables. For the sake of comparison, the left panel also plots the profiles of the impulse response coefficients’ variances using the TC prior for two different decay rates (dashdot lines).
These observations suggest that, while the nuclear norm regularization (prior) accounts for system-theoretic notions of model complexity as defined by the McMillan degree, it fails to include decay rate and smoothness constraints. One would expect, therefore, that Hankel regularization alone may not give satisfactory results, as it is not able to properly bound the candidate set of models. It turns out that the maximum entropy framework discussed in Sect. 5.5.1 can be used to build prior distributions which account for stability and smoothness as well as “complexity”. The following theorem (whose proof is given in Sect. 5.10.6) gives the structure of the MaxEnt prior under a simple “TC”-like condition on the stability-smoothness constraint.
Theorem 5.6
Let \(\{\theta _{k}\}_{k=1,\ldots ,m}\) be a zero mean, absolutely continuous random vector with density \(p_\theta (\theta )\), which satisfies the following constraints:
Then, the solution \(\mathrm {p}_{\theta ,MEH}(\theta )\) of the maximum entropy problem
has the following form:
where the Lagrange multipliers \(\mu _H,\mu _1,\ldots ,\mu _m\) are determined so that the constraints (5.57) are satisfied.Footnote 1
The Hankel nuclear norm discussed in this chapter is only one possible way to favour “simple” (in the sense of having small McMillan degree) models. Indeed, it is by no means trivial to use priors of the form (5.59), which involve nuclear norm terms, in conjunction with marginal likelihood optimization to estimate hyperparameters. Several variations are possible and, indeed, matricial reweighting schemes such as those used in [55] can be used in a Bayesian context, leading to iteratively reweighted schemes reminiscent of \(\ell _1/\ell _2 \) reweighting [72].
5.8 Historical Overview
The framework discussed in this chapter has a long history that can be traced back, largely outside the control community, to the early 1970s. In this section, we will review these developments and point out similarities and differences with the theory developed in this chapter.
5.8.1 The Distributed Lag Estimator: Prior Means and Smoothing
To the best of our knowledge, Bayesian methods for estimating dynamical systems were first advocated in the early ’70s in the econometrics literature for FIR models of the form (5.14), which were referred to as distributed lag models. The length n of the FIR model was actually left unspecified, and possibly allowed to go to infinity.
In particular, [40, 62] were the first to talk about (and apply) Bayesian methods for system identification, arguing that “rigid parametric” structures may be inadequate, extending arguments which can be found in [66] for “static” linear regression models to the “dynamical” systems scenario. In the paper [40], having in mind that modes of linear time-invariant systems have an exponentially decaying behaviour of the type \(\alpha ^t\), it was suggested to describe the unknown impulse response \(\theta \) with a process having an exponentially decaying prior mean
Other possible response patterns were also considered, such as the “hump”, composed of the response build-up, its maximum, and its decay; see [40] for details and alternative patterns. The covariance function K(t, s) in [40] was taken so that the ratio
remains constant over time t. This was called the “proportionality principle” and can be achieved with the choice
so that the normalized standard deviation
is indeed constant if \(w_{ts}\) is so. This would imply that prior credible intervals have constant relative size w.r.t. their means, see p. 1065 of [40].
The choice (5.61) left the coefficients \(w_{ts}\) unspecified and, indeed, in [40] it was emphasized that “the selection of the values of the set of \(w_{ij}\) still remains a relatively difficult task”; one suggestion, inspired by work on smoothing [34], has been to take
leading to
which is exactly the DC kernel introduced in Corollary 5.1. It is also interesting to observe that [40] already suggested the use of marginal likelihood to choose the most suitable prior distribution in the class.
Of course, postulating a prior mean m introduces a remarkable prejudice into the estimation procedure and requires quite accurate knowledge of the expected \(\theta \). The paper [62], inspired by “smoothing priors” arguments, suggested instead that the prior mean should be zero, and that only smoothness conditions on the lags should be enforced; this leads to a zero mean prior, i.e., \(c=0\) in (5.60), with a dth degree smoothing covariance. For instance, for \(d=2\), the prior model can be expressed in terms of the second-order differences:
postulating \({\mathscr {E}}\beta \beta ^T = S {\mathscr {E}}\theta \theta ^T S^T =I\).
It is clear from Fig. 5.10 that this prior guarantees smoothness in the time domain (and therefore low-pass behaviour in the frequency domain) but gives no guarantee of stability.
5.8.2 Frequency-Domain Smoothing and Stability
The “time-domain” smoothing discussed in the previous section was criticized by Akaike [1], who questioned whether time-domain smoothness conditions would “be the most natural ones”. Akaike suggested instead that smoothness should be enforced in the frequency domain, i.e., considering the frequency response
To this purpose, the \(L_2\)-norm of the first derivative \(\frac{dG(e^{j\omega })}{d\omega }\) can be considered and we have already seen in (5.25) that one obtains
Large values of \(\left\| \frac{dG(e^{j\omega })}{d\omega }\right\| ^2\) can thus be discouraged by using the right-hand side of (5.64) as a penalty, which can be written in the form:
where
This is of course equivalent to assuming that the impulse response vector \(\theta \) has a zero-mean normal prior with covariance \(K_\gamma \).
Unfortunately, in the limit \(n\rightarrow \infty \), the covariance function (5.65) does not meet the (more stringent) sufficient conditions of Lemma 5.1; of course rather straightforward extensions include setting penalties on higher-order derivatives, which would result in a faster decay rate of the diagonal elements of (5.65). This is a manifestation of the well-known link between regularity in the frequency domain and decay rate of the impulse response already discussed in Sect. 5.5.
5.8.3 Exponential Stability and Stochastic Embedding
More recently, Gaussian priors for dynamical systems have been considered in the control literature; in particular, a zero-mean Gaussian prior with diagonal and exponentially decaying covariance
has been proposed in the so-called “stochastic embedding” framework [25, 26]. Let us now briefly introduce the problem: consider an Output Error model of the form
where \(g_k(\theta )\), \(\theta \in {\mathbb R}^n\) is a parametric description of the unknown impulse response \(\{g_k\}_{k=1,\ldots ,\infty }\) in the model class \(\mathcal{M}_n(\theta )\). Let \(\hat{\theta }\) be some parametric estimator of \(\theta \), e.g., the PEM estimator
Let now
be the corresponding estimator of the transfer function \(G(z,\theta )= \sum _{k=1}^\infty g_k(\theta ) z^{-k}\).
In the Model Error Modelling framework, it is assumed that the “true” transfer function G(z) is only partially captured by the chosen model class \(\mathcal{M}_n(\theta )\) so that
and \(\tilde{G}(z)\) represents a model error. The purpose of Model Error Modelling is to obtain a statistical description of the model error, say
which may be used, for instance, to estimate the model order, e.g., the dimension n of the parameter vector \(\theta \). This can be achieved by minimizing an estimate of the MSE
while accounting for the model error model \(\tilde{G}(z)\), see e.g., Eqs. (89)–(92) in [26].
The model error \(\tilde{G}(z)\) is estimated in [26] starting from the least squares residuals \(v_{\hat{\theta }}(t):=y(t) - G(z,\hat{\theta })u(t)\) which, under assumption (5.68), are expected to be described by the model
It is remarkable that [26] propose to estimate the parameters \(\alpha \) and \(\rho \) that characterize the covariance (5.66) resorting to marginal likelihood maximization
where \(V_{\hat{\eta }}:=[v_{\hat{\eta }}(1),\ldots ,v_{\hat{\eta }}(N)]\). It is also interesting to observe that the exponential decay of the covariance sequence (5.66) implies a smoothness condition on the frequency response function similar in spirit to that advocated in [1]. This is formalized in the following result, whose proof is in Sect. 5.10.7.
Lemma 5.5
Let \(\{g_{k,\alpha }\}_{k=0,\ldots ,\infty }\) be a zero-mean Gaussian process with covariance (5.66) and let
be its Fourier transform. Then the Lipschitz-like condition
holds.
5.9 Further Topics and Advanced Reading
Section 1.3 already reported a list of topics and readings on inverse problems, Stein estimators and their link with the Empirical Bayes framework.
The use of regularization and Bayesian priors can probably be dated back to the paper [71], where smoothing ideas were advocated for a denoising problem in the field of Actuarial Science. See also the much later reference [34]. Later developments are essentially impossible to survey in this short section, and we refer the reader to [66] for an early overview of the use of Bayes priors in the context of linear regression; the interested reader may also consult [22, 31, 32, 42, 59], where generalized ridge regression has been proposed to stabilize ill-conditioned inverse problems.
To the best of our knowledge, [40, 62] have been the first to use these ideas in the context of dynamical systems, named “distributed-lag” models in these early references. This work was subsequently taken up by Akaike [1] and later by Kitagawa and Gersh in a series of papers, see e.g., [35, 36], which culminated in the well-known book [37]. The seminal papers by Leamer and Shiller have also been continued by the econometrics community, starting with the work by Doan, Litterman and Sims, see e.g., [18] for an overview and further references. This has led to the so-called “Minnesota prior”, which has been discussed quite extensively in the econometrics literature; several variations and extensions can be found, see for instance [23, 41].
The econometrics literature has since then studied Bayesian procedures for system identification rather intensively, mostly under the acronym Bayesian VARs; the main driving motivation was that of handling high-dimensional time series (i.e., p large, called cross-sectional dimension in the econometrics literature) with possibly many explicative variables (m large), see for instance [2, 17, 23, 38].
The problem of tuning the regularization parameters (or, equivalently, the hyperparameters describing the prior in a Bayesian setting) has received relatively little attention in the econometrics literature: [40] already suggested the use of an Empirical Bayes procedure, while [2, 18] propose tuning the hyperparameters using out-of-sample and in-sample errors, respectively. The paper [38] and the more recent work [23] again adopt an Empirical Bayes approach using the marginal likelihood; [23] claims the superiority of this approach w.r.t. previous “ad hoc” techniques [2, 18].
Despite this long history, the use of Bayesian priors for system identification has gained popularity only in relatively recent times, e.g., see the survey [52]. We believe it is fair to say that the reason for this is that much more effort has recently been devoted to developing prior models tailored to estimating dynamical systems. In the remaining part of the book, these issues will be dealt with in some detail. The reader is referred to [10, 11, 49, 50, 55] for various classes of prior models and to [6, 7, 12, 55] for more details on Maximum Entropy derivations. Extensions include prior models to estimate sparse models for high-dimensional time series [14, 74] as well as classes of priors for nonlinear dynamical models [51], which will be thoroughly discussed in Chap. 8. In particular, the techniques described in this chapter can also be used to identify so-called dynamic networks, which consist of a large set of interconnected dynamic systems. Modelling such complex physical systems is important in several fields of science and engineering, including biomedicine and neuroscience [27, 30, 46, 56]. Estimation is difficult since these networks are often large scale and the network topology is typically unknown [14, 44, 67]. One typically postulates the existence of many connections and then has to understand from data which are really active. Since in real physical systems often only a small fraction of links is actually working, the estimation process needs to exploit sparsity regularizers such as those introduced in Chap. 3 and their stochastic interpretations like the Bayesian Lasso [47]. In the context of linear dynamic networks, where modules are defined by impulse responses, many approaches have recently been designed, e.g., relying on local multi-input single-output (MISO) models [16, 19, 45].
Contributions based on variational Bayesian inference and/or nonparametric regularization, deeply connected with the techniques discussed in this book, are in [14, 33, 58, 73]. Methods to infer the full network dynamics using (structured) multiple-input multiple-output (MIMO) models can instead be found in [21, 69], with consistency of the estimates analyzed in [57]. A contribution based on the combination of the stable spline kernel and the so-called horseshoe sparsity prior [8, 54, 68] has been developed in [48]. See also [3, 24, 29, 70] for insights on identifiability issues and [28], where compressed sensing is exploited.
5.10 Appendix
5.10.1 Optimal Kernel
Theorem 5.7
The solution \(P^*\) of problem (5.20) is given by
where \(\theta _0\) is the “true” impulse response of the data-generating mechanism (5.14).
Proof
The proof proceeds as follows: let us denote by \(\hat{\theta }^{P^*}\) the estimator obtained with \(P = P^*\) as in (5.71). Consider the error
which can be written as
We shall show that the following orthogonality property holds:
so that
and therefore:
which will prove the claim that \(P^*= \theta _0 \theta _0^T\) is the optimal solution to (5.20).
It now just remains to show that (5.72) holds. To do so, let us rewrite (4.7), assuming \(\mu _{\theta }\) null, using the matrix inversion lemma (3.145):
Therefore, the error \(\tilde{\theta }^P:= \theta _0 - \hat{\theta }^P\) can be written in the form:
Now, using (5.74), we have:
Now, let us compute
If we now use the identity
we obtain
so that, using (5.75),
which proves (5.72) and thus the theorem. \(\square \)
5.10.2 Proof of Lemma 5.1
Consider the following upper bound on the probability that the \(\ell _1\) norm of \(\theta \) is larger than a given threshold \(T_{\ell _1}\):
where we have used the equality \({\mathscr {E}}|X| =\sigma \sqrt{2/\pi } \) for \(X\sim \mathcal{N}(0,\sigma ^2)\). Using the hypothesis (5.24) we have that
and therefore
Taking the limit as \(T_{\ell _1}\rightarrow +\infty \) we have
which concludes the proof.
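The identity \({\mathscr {E}}|X| =\sigma \sqrt{2/\pi } \) used in the bound above is easy to confirm by simulation; the sample size, seed, and value of \(\sigma \) below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = sigma * rng.standard_normal(1_000_000)   # samples from N(0, sigma^2)

# Monte Carlo estimate of E|X| versus the closed form sigma * sqrt(2 / pi)
print(np.mean(np.abs(x)), sigma * np.sqrt(2 / np.pi))
```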
5.10.3 Proof of Theorem 5.5
The proof is based on the fact that the Maximum Entropy distribution \({\mathrm p}(\theta )\) under the constraints \({\mathscr {E}}f_k(\theta ) = F_k\) and \({\mathscr {E}}g_k(\theta ) = G_k\) has the “Gibbs” structure, i.e., it is the exponential of a weighted sum of the constraint functionals (see e.g., [15]):
In our case, we have \(f_k(\theta )=\theta _k^2\) and \(g_k(\theta ) = (\theta _{k-1}-\theta _k)^2\), and therefore the max-ent solution has the form
Using a well-known result in graphical models (see e.g., Lauritzen [39]), the variables \(\theta _k\) and \(\{\theta _{k+2},\ldots ,\theta _n\}\) are conditionally independent given \(\theta _{k+1}\), because \(\theta _{k+1}\) is the only neighbour of \(\theta _{k}\) in the graph representing \({\mathrm p}(\theta _k,\theta _{k+1},\ldots ,\theta _n)\) (or, equivalently, \(\theta _{k+1}\) separates \(\theta _k\) from \(\theta _{k+2},\theta _{k+3},\ldots ,\theta _n\)).
In our case, this conditional independence implies that the best linear estimator \(\hat{\theta }_{k-1}\) of \(\theta _{k-1}\) given \( \theta _k, \theta _{k+1},\ldots ,\theta _n\) depends only on \(\theta _{k}\) (i.e., \(\hat{\theta }_{k-1} = a_{B,k} \theta _k \)), so that the vector \(\theta \) admits the followingFootnote 2 representation:
with \(w_k:=\theta _{k-1} - \hat{\theta }_{k-1}=\theta _{k-1} - a_{B,k} \theta _k \) zero mean and uncorrelated with \( \theta _k, \theta _{k+1},\ldots ,\theta _n\). Let us define \(\sigma _k^2: = {\mathscr {E}}w_k^2\). In order to express \(a_{B,k}\) and \(\sigma _k^2\) as functions of \(\lambda _R,\lambda _S,\alpha \), we exploit the constraints (5.31) and the dynamical model (5.77). In particular we have
Subtracting (5.78) from (5.79) we obtain
which implies that
which is independent of k and is thus denoted by \(a_B\) as in (5.35). From (5.79) we also have that
where the last equality follows after a few manipulations and proves (5.36). Replacing
in the previous equation we have:
Of course \(\sigma _k^2\), and thus the right hand side, should be positive (for simplicity we exclude the singular case \(\sigma _k^2 =0\)):
which in turn is equivalent to
This happens if and only if
This is a second-degree polynomial in \(\lambda _R\) with two positive roots
and therefore our problem is feasible if and only if
thus proving (5.32). Now it remains to prove that (5.76) takes the form (5.34). First let us observe that the exponent of (5.76) is a quadratic form in \(\theta \), and therefore (5.76) can be written in the form
Last, since in (5.76) only products of the form \(\theta _k\theta _h\) with \(h\in \{k-1,k,k+1\}\) appear, the matrix \(\varPhi =\varPhi ^T\) has the following band structure:
In addition, for \(p_{\theta ,ME}(\theta )\) to be a density, \(\varPhi \) needs to be positive semidefinite (otherwise there would be directions in which the density grows indefinitely). Since \(\theta \) admits the backward AR representation (5.77) with \({\mathscr {E}}w_k^2 =\sigma ^2_k>0\), the covariance matrix \(\varSigma = {\mathscr {E}}\theta \theta ^T \) is positive definite and thus \(\varPhi = \varSigma ^{-1}\). To compute the autocovariance function \({\mathscr {E}}\theta _h \theta _k\) we consider the following cases: if \(k=h\) we have
If \(k>h\) we have
and iterating the relation we find
Analogously, if \(h>k\) we have
Combining the three cases we obtain
proving (5.38).
5.10.4 Proof of Corollary 5.1
Using the definition (5.39) in Eq. (5.38) we obtain:
In addition, if the matching condition \(\lambda _R = \lambda _S(1-\alpha ) \) is satisfied, then from (5.35) \( a_B = 1 \) and from (5.39) \(\rho = \sqrt{\alpha }\); substituting in (5.40) we obtain
i.e., the covariance sequence of the well known TC kernel.
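The reduction of the DC kernel to the TC kernel under the matching condition can be verified numerically. The sketch below assumes the DC form \(\lambda _S\,\alpha ^{(t+s)/2}\rho ^{|t-s|}\) for (5.40) and the TC form \(\lambda _S\,\alpha ^{\max (t,s)}\).

```python
import numpy as np

def dc(t, s, lam, alpha, rho):
    """Assumed DC kernel form: lam * alpha^((t+s)/2) * rho^|t-s|."""
    return lam * alpha ** ((t + s) / 2) * rho ** abs(t - s)

def tc(t, s, lam, alpha):
    """Assumed TC kernel form: lam * alpha^max(t,s)."""
    return lam * alpha ** max(t, s)

lam, alpha = 2.0, 0.8
rho = np.sqrt(alpha)   # matching condition: rho = sqrt(alpha)
vals = [(dc(t, s, lam, alpha, rho), tc(t, s, lam, alpha))
        for t in range(1, 20) for s in range(1, 20)]
print(max(abs(a - b) for a, b in vals))
```

The identity follows since \(\alpha ^{(t+s)/2}\alpha ^{|t-s|/2} = \alpha ^{((t+s)+|t-s|)/2} = \alpha ^{\max (t,s)}\), so the two kernels coincide up to floating-point error.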
5.10.5 Proof of Lemma 5.2
The proof of this lemma is a simple application of the Schwarz inequality. In particular, we have:
where the last inequality follows from the fact that each \(u_{i}\) has 2-norm equal to 1. The same condition clearly holds also in the infinite-dimensional case, i.e., as \(n\rightarrow \infty \), if K(t, s) admits the spectral decomposition
and the condition \(\sum _{t} K(t,t) = C <\infty \) holds. In particular this latter condition holds true if the more stringent condition \(\sum _{t} K^{1/2}(t,t) <\infty \) in Lemma 5.1 is satisfied.
5.10.6 Proof of Theorem 5.6
The proof follows from the fact that the Maximum Entropy distribution \({\mathrm p}(x)\) under the constraints \({\mathscr {E}}f_i(x) \le \gamma _i\) has the “Gibbs” structure, i.e., it is the exponential of a weighted sum of the constraint functionals (see e.g., [15]):
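As a sanity check of the Gibbs structure in the simplest case: under a single quadratic constraint \({\mathscr {E}}x^2 \le \gamma \) (hypothetical \(\gamma \)), the exponential of the constraint functional is a zero-mean Gaussian, which indeed dominates, in differential entropy, other zero-mean densities with the same variance. Closed-form entropies suffice:

```python
import math

gamma = 2.0  # hypothetical variance constraint E[x^2] <= gamma

# Differential entropies (nats) of three zero-mean densities of variance gamma
h_gauss = 0.5 * math.log(2 * math.pi * math.e * gamma)
b = math.sqrt(gamma / 2)                  # Laplace scale: variance = 2 b^2
h_laplace = 1 + math.log(2 * b)
a = math.sqrt(3 * gamma)                  # Uniform on [-a, a]: variance = a^2 / 3
h_uniform = math.log(2 * a)

# The Gaussian (the Gibbs form exp(-lam * x^2)) attains the maximum entropy
assert h_gauss > h_laplace > h_uniform
```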
5.10.7 Proof of Lemma 5.5
Since \(\{g_{k,\alpha }\}_{k=0,\ldots ,\infty }\) is zero mean, \(G_\alpha (e^{j\omega })\) is clearly zero mean as well, i.e., \({\mathscr {E}}G_\alpha (e^{j\omega }) =0\). If we now consider the difference
taking the expected value of the squared norm, and using the fact that \({\mathscr {E}}g_{k,\alpha } g_{h,\alpha } = c\alpha ^k \delta _{k-h}\), we have
Now, using
and the expression for the sum of the geometric series in \(\alpha ^k\), the claim follows.
5.10.8 Forward Representations of Stable-Splines Kernels \(\star \)
A major drawback of the backward construction is that it is not straightforward to extend it to an infinite interval, i.e., to let \(n\rightarrow \infty \) in order to consider infinitely long impulse response models \(\{\theta _{k}\}_{k\in {\mathbb N}}\). However, this difficulty can be circumvented by exploiting the “forward” representation of (5.77), which turns out to be again a time-varying AR(1) model (see Note 3). Theorem 5.8 derives the forward AR(1) representation of the maximum entropy process found in Theorem 5.5.
Theorem 5.8
The maximum entropy solution to (5.33) found in Theorem 5.5 admits the forward AR(1) representation
with zero-mean initial condition such that \( {\mathscr {E}}\theta _0^2 = \lambda _S\), and where
and \(w_k\) is a sequence of zero mean variables, uncorrelated with the initial condition \(\theta _0\) and such that
with \( \sigma _{F,k}^2 = \lambda _S \alpha ^{k+1}(1-\rho ^2)\).
Proof
First of all, let us observe that if \(\theta _k\) admits a forward AR(1) representation of the form (5.80) (with \(w_k\) satisfying (5.82)), then \(a_F\) must satisfy the relation
Using the expression (5.38), we obtain:
and recalling that \(\rho = a_B \alpha ^{1/2}\) we also obtain
In addition, denoting \( \sigma _{F,k}^2: = {\mathscr {E}}w_k^2\),
must hold. Therefore,
It is also straightforward to verify that, if \(\theta _k\) is generated by (5.80), then
which is exactly of the form
provided \(h = k + \tau \), \(\tau > 0\). This concludes the proof. \(\square \)
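The forward construction can be checked by deterministic moment propagation. The sketch below assumes (5.80) reads \(\theta _{k+1} = a_F\theta _k + w_k\) with constant \(a_F = \rho \sqrt{\alpha }\) (consistent with the stated initial condition and noise variances, and with \({\mathscr {E}}\theta _k^2 = \lambda _S\alpha ^k\)); it verifies that the resulting covariance matches the assumed form \(\lambda _S\,\alpha ^{(h+k)/2}\rho ^{|h-k|}\) of (5.38):

```python
import numpy as np

# Hypothetical hyperparameters
lam_S, alpha, rho = 1.0, 0.8, 0.6
a_F = rho * np.sqrt(alpha)    # assumed constant forward AR(1) coefficient
n = 12

# Variance propagation for theta_{k+1} = a_F * theta_k + w_k,
# with E[theta_0^2] = lam_S and E[w_k^2] = lam_S * alpha**(k+1) * (1 - rho**2)
var = np.empty(n)
var[0] = lam_S
for k in range(n - 1):
    var[k + 1] = a_F ** 2 * var[k] + lam_S * alpha ** (k + 1) * (1 - rho ** 2)

# Cross-covariances: E[theta_{k+tau} theta_k] = a_F**tau * E[theta_k^2]
Sigma = np.empty((n, n))
for h in range(n):
    for k in range(n):
        lo, hi = min(h, k), max(h, k)
        Sigma[h, k] = a_F ** (hi - lo) * var[lo]

# Target: assumed form of (5.38)
H, K = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
target = lam_S * alpha ** ((H + K) / 2) * rho ** np.abs(H - K)
assert np.allclose(var, lam_S * alpha ** np.arange(n))
assert np.allclose(Sigma, target)
```

Note that, unlike the backward model, the forward model runs naturally on \(k\in {\mathbb N}\), which is what permits the extension to infinitely long impulse responses.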
Notes
- 1.
Using the complementary slackness conditions, it follows that a multiplier may be nonzero only if the corresponding inequality in (5.57) holds with equality.
- 2.
We prefer here to work with backward representations since, as we will see, with this choice we will have \(a_{B,k} = a_B\), independent of k. Forward representations are discussed in Sect. 5.10.8.
- 3.
There are several ways to see this: perhaps the simplest is to recall that the inverse covariance matrix of an AR(1) process has a band (tridiagonal) structure, which implies that forward and backward models share the same conditional dependence structure.
References
Akaike H (1979) Smoothness priors and the distributed lag estimator. Technical report, Department of Statistics, Stanford University
Banbura M, Giannone D, Reichlin L (2010) Large Bayesian VARs. J Appl Econ 25(1):71–92
Bazanella AS, Gevers M, Hendrickx JM, Parraga A (2017) Identifiability of dynamical networks: which nodes need be measured? In: 2017 IEEE 56th annual conference on decision and control (CDC), pp 5870–5875
Berger JO (1982) Selecting a minimax estimator of a multivariate normal mean. Ann Stat 10:81–92
Bertero M (1989) Linear inverse and ill-posed problems. Adv Electron Electron Phys 75:1–120
Carli F (2014) On the maximum entropy property of the first-order stable spline kernel and its implications. In: Proceedings of the 2014 IEEE multi-conference on systems and control, pp 409–414
Carli FP, Chen T, Ljung L (2017) Maximum entropy kernels for system identification. IEEE Trans Autom Control 62(3):1471–1477
Carvalho C, Polson N, Scott J (2010) The horseshoe estimator for sparse signals. Biometrika 97(2):465–480
Casella G (1980) Minimax ridge regression estimation. Ann Stat 8:1036–1056
Chen T, Ohlsson H, Ljung L (2012) On the estimation of transfer functions, regularizations and Gaussian processes - revisited. Automatica 48:1525–1535
Chen T, Andersen MS, Ljung L, Chiuso A, Pillonetto G (2014) System identification via sparse multiple kernel-based regularization using sequential convex optimization techniques. IEEE Trans Autom Control 59(11):2933–2945
Chen T, Ardeshiri T, Carli FP, Chiuso A, Ljung L, Pillonetto G (2016) Maximum entropy properties of discrete-time first-order stable spline kernel. Automatica 66:34–38
Chiuso A (2016) Regularization and Bayesian learning in dynamical systems: past, present and future. Annu Rev Control 41:24–38
Chiuso A, Pillonetto G (2012) A Bayesian approach to sparse dynamic network identification. Automatica 48(8):1553–1565
Cover TM, Thomas JA (2006) Elements of information theory (Wiley series in telecommunications and signal processing). Wiley-Interscience, New York
Dankers AG, Van den Hof PMJ, Heuberger PSC, Bombois X (2016) Identification of dynamic models in complex networks with prediction error methods: predictor input selection. IEEE Trans Autom Control 61(4):937–952
De Mol C, Giannone D, Reichlin L (2008) Forecasting using a large number of predictors: is Bayesian shrinkage a valid alternative to principal components? J Econ 146(2):318–328
Doan T, Litterman R, Sims CA (1984) Forecasting and conditional projection using realistic prior distributions. Econ Rev 3:1–100
Everitt N, Galrinho M, Hjalmarsson H (2018) Open-loop asymptotically efficient model reduction with the Steiglitz-Mcbride method. Automatica 89:221–234
Fazel M, Hindi H, Boyd SP (2001) A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the American control conference, pp 4734–4739
Fonken SJM, Ferizbegovic M, Hjalmarsson H (2020) Consistent identification of dynamic networks subject to white noise using weighted null-space fitting. In: Proceedings of the 21st IFAC world congress, Berlin, Germany
Foster M (1961) An application of the Wiener-Kolmogorov smoothing theory to matrix inversion. J Soc Ind Appl Math 9(3):387–392
Giannone D, Lenza M, Primiceri GE (2015) Prior selection for vector auto regressions. Rev Econ Stat 97(2):436–451
Goncalves J, Warnick S (2008) Necessary and sufficient conditions for dynamical structure reconstruction of LTI networks. IEEE Trans Autom Control 53(7):1670–1674
Goodwin GC, Salgado M (1989) A stochastic embedding approach for quantifying uncertainty in estimation of restricted complexity models. Int J Adapt Control Signal Process 3:333–356
Goodwin GC, Gevers M, Ninness B (1992) Quantifying the error in estimated transfer functions with application to model order selection. IEEE Trans Autom Control 37(7):913–929
Hagmann P, Cammoun L, Gigandet X, Meuli R, Honey CJ, Wedeen VJ, Sporns O (2008) Mapping the structural core of human cerebral cortex. PLOS Biol 6(7):1–15
Hayden D, Hwan Chang Y, Goncalves J, Tomlin CJ (2016) Sparse network identifiability via compressed sensing. Automatica 68:9–17
Hendrickx JM, Gevers M, Bazanella AS (2019) Identifiability of dynamical networks with partial node measurements. IEEE Trans Autom Control 64(6):2240–2253
Hickman R, Van Verk MC, Van Dijken AJH, Mendes MP, Vroegop-Vos IA, Caarls L, Steenbergen M, Van der Nagel I, Wesselink GJ, Jironkin A, Talbot A, Rhodes J, De Vries M, Schuurink RC, Denby K, Pieterse CMJ, Van Wees SCM (2017) Architecture and dynamics of the jasmonic acid gene regulatory network. Plant Cell 29(9):2086–2105
Hoerl AE (1962) Application of ridge analysis to regression problems. Chem Eng Prog 58:54–59
Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67
Jin J, Yuan Y, Goncalves J (2020) High precision variational Bayesian inference of sparse linear networks. Automatica 118:109017
Kimeldorf GS (1965) Applications of Bayesian statistics to actuarial graduation. PhD dissertation, University of Michigan
Kitagawa G, Gersch W (1984) A smoothness priors-state space modeling of time series with trends and seasonalities. J Am Stat Assoc 79(386):378–389
Kitagawa G, Gersch W (1985) A smoothness priors long AR model methods for spectral estimation. IEEE Trans Autom Control 30(1):57–65
Kitagawa G, Gersch W (1996) Smoothness priors analysis of time series. IMA volumes in mathematics and its applications. Springer, New York
Knox T, Stock JH, Watson MW (2001) Empirical Bayes forecast of one time series using many predictors. Technical report, National Bureau of Economic Research
Lauritzen SL (1996) Graphical models. Oxford University Press, Oxford
Leamer E (1972) A class of informative priors and distributed lag analysis. Econometrica 40(6):1059–1081
Lütkepohl H (2007) New introduction to multiple time series analysis. Springer Publishing Company, Incorporated, New York
Marquardt DW, Snee RD (1975) Ridge regression in practice. Am Stat 29(1):3–20
Maruyama Y, Strawderman WE (2005) A new class of generalized Bayes minimax ridge regression estimators. Ann Stat 1753–1770
Materassi D, Innocenti G (2010) Topological identification in networks of dynamical systems. IEEE Trans Autom Control 55(8):1860–1871
Materassi D, Salapaka MV (2020) Signal selection for estimation and identification in networks of dynamic systems: a graphical model approach. IEEE Trans Autom Control 65(10):4138–4153
Pagani GA, Aiello M (2013) The power grid as a complex network: a survey. Phys A Stat Mech Appl 392(11):2688–2700
Park T, Casella G (2008) The Bayesian Lasso. J Am Stat Assoc 103(482):681–686
Pillonetto G (2021) Estimation of sparse linear dynamic networks using the stable spline horseshoe prior. arXiv:2107.11155
Pillonetto G, De Nicolao G (2010) A new kernel-based approach for linear system identification. Automatica 46(1):81–93
Pillonetto G, Chiuso A, De Nicolao G (2011) Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica 47(2):291–305
Pillonetto G, Quang MH, Chiuso A (2011) A new kernel-based approach for nonlinear system identification. IEEE Trans Autom Control 56(12):2825–2840
Pillonetto G, Dinuzzo F, Chen T, De Nicolao G, Ljung L (2014) Kernel methods in system identification, machine learning and function estimation: a survey. Automatica 50
Pillonetto G, Chen T, Chiuso A, De Nicolao G, Ljung L (2016) Regularized linear system identification using atomic, nuclear and kernel-based norms: the role of the stability constraint. Automatica 69:137–149
Polson NG, Scott JG (2012) On the half-Cauchy prior for a global scale parameter. Bayesian Anal 7(4):887–902
Prando G, Chiuso A, Pillonetto G (2017) Maximum entropy vector kernels for MIMO system identification. Automatica 79:326–339
Prando G, Zorzi M, Bertoldo A, Corbetta M, Zorzi M, Chiuso A (2020) Sparse DCM for whole-brain effective connectivity from resting-state FMRI data. NeuroImage 208:116367
Ramaswamy KR, Van den Hof PMJ (2021) A local direct method for module identification in dynamic networks with correlated noise. IEEE Trans Autom Control
Ramaswamy KR, Bottegal G, Van den Hof PMJ (2021) Learning linear models in a dynamic network using regularized kernel-based methods. Automatica 129:109591
Riley JD (1955) Solving systems of linear equations with a positive definite, symmetric, but possibly ill-conditioned matrix. Math Tables Other Aids Comput 9(51):96–101
Robbins H (1951) Asymptotically subminimax solutions of compound statistical decision problems. In: Berkeley symposium on mathematical statistics and probability, pp 131–149
Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press, Cambridge
Shiller RJ (1973) A distributed lag estimator derived from smoothness priors. Econometrica 41(4):775–788
Stein C (1956) Inadmissibility of the usual estimator for the mean of a multivariate distribution. In: Proceedings of the 3rd Berkeley symposium on mathematical statistics and probability, vol I. University of California Press, pp 197–206
Strawderman WE (1978) Minimax adaptive generalized ridge regression estimators. J Am Stat Assoc 73:623–627
Tether A (1970) Construction of minimal linear state-variable models from finite input-output data. IEEE Trans Autom Control 15(4):427–436
Tiao GC, Zellner A (1964) Bayes’s theorem and the use of prior knowledge in regression analysis. Biometrika 51(1/2):219–230
Van den Hof PMJ, Dankers AG, Heuberger PSC, Bombois X (2013) Identification of dynamic models in complex networks with prediction error methods: basic methods for consistent module estimates. Automatica 49(10):2994–3006
Van der Pas SL, Kleijn BJK, van der Vaart AW (2014) The horseshoe estimator: posterior concentration around nearly black vectors. Electron J Stat 8(2):2585–2618
Weerts HHM, Van den Hof PMJ, Dankers AG (2018) Prediction error identification of linear dynamic networks with rank-reduced noise. Automatica 98:256–268
Weerts HM, Van den Hof PMJ, Dankers AG (2018) Identifiability of linear dynamic networks. Automatica 89:247–258
Whittaker ET (1922) On a new method of graduation. Proc Edinb Math Soc 41:63–75
Wipf DP, Nagarajan SS (2010) Iterative reweighted \(\ell _1\) and \(\ell _2\) methods for finding sparse solutions. IEEE J Sel Top Signal Process 4(2):317–329
Yue Z, Thunberg J, Pan W, Ljung L, Goncalves J (2021) Dynamic network reconstruction from heterogeneous datasets. Automatica 123:109339
Zorzi M, Chiuso A (2017) Sparse plus low rank network identification: a nonparametric approach. Automatica 76:355–366
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
About this chapter
Cite this chapter
Pillonetto, G., Chen, T., Chiuso, A., De Nicolao, G., Ljung, L. (2022). Regularization for Linear System Identification. In: Regularized System Identification. Communications and Control Engineering. Springer, Cham. https://doi.org/10.1007/978-3-030-95860-2_5
Print ISBN: 978-3-030-95859-6
Online ISBN: 978-3-030-95860-2