1 Introduction

As Tienstra (1947) noted, “in our observations correlation is present, always and everywhere.” The importance of modeling auto-correlations in the context of regression models has been pointed out, for instance, by Kuhlmann (2001). In the context of time series measured by a sensor, auto-correlations can be expected, e.g., as a consequence of calibration corrections being applied to all of the measurements (cf. Lira and Wöger 2006). Due to their efficiency, flexibility and straightforward relationships with auto-covariance and spectral density functions, autoregressive (AR) models have been used in diverse fields of geodetic science (e.g., Xu 1988; Nassar et al. 2004; Amiri-Simkooei et al. 2007; Park and Gao 2008; Luo et al. 2011; Schuh et al. 2014; Didova et al. 2016; Zeng et al. 2018). In previous geodetic studies involving AR models, stochastic dependencies between different, contemporaneously observed time series have usually been neglected. Thus, the common practice of multivariate geodetic time series analysis consists of a fusion of time series at the level of a functional (usually spatial regression) model, with separate modeling of the correlation pattern, say, through multiple independent AR models (cf. Alkhatib et al. 2018).

However, it is known for certain multivariate geodetic observables that stochastic dependencies between them indeed occur. For instance, spatial correlations between GPS position time series at different sites have been reported by Williams et al. (2004). Amiri-Simkooei (2009) suggested modeling such correlations by means of a cross-covariance matrix involving adjustable variance–covariance components. Cross-covariance matrices have also been estimated by Koch et al. (2010) to analyze the correlations between coordinate triples of grid points measured by a terrestrial laser scanner. In analogy to the univariate case, multivariate (“vector”) AR (VAR) models may be an efficient and flexible alternative to cross-covariance matrices or functions. However, to the best of the authors' knowledge, VAR models have not been considered to describe the colored noise of multivariate geodetic observables that include a functional, spatial regression model. Instead, VAR processes have been applied in two main geodetic studies to model pure time series, which are not fused with a spatial model. In the first one, Niedzielski and Kosek (2012) used a VAR process for predictions of Universal Time (UT1-UTC) based on geodetic and geophysical data. In the second one, Zhang and Gui (2015) modeled double-differenced carrier phase observations by a VAR process to identify outliers. In climate research, VAR processes have been used frequently to investigate relationships between different datasets. Mosedale et al. (2006) fitted daily wintertime sea surface temperatures and North Atlantic Oscillation (NAO) time series by means of a bivariate VAR model. Strong et al. (2009) employed a VAR process to quantify feedback between the NAO and winter sea ice variability. Matthewman and Magnusdottir (2011) fitted weekly averaged observations of Bering Sea ice concentrations and the Western Pacific pattern (defined in Wallace and Gutzler 1981). Furthermore, Matthewman and Magnusdottir (2012) applied a VAR-based test of causality to relationships between geopotential height anomalies at different locations. Recently, Papagiannopoulou et al. (2017) explored climate-vegetation dynamics in the framework of VAR processes.

To extend the applicability of VAR models to time series consisting more generally of a deterministic regression model and a VAR (correlation) model, it is necessary to more systematically study both the underlying theory and computationally convenient inferential procedures. In doing so, it is prudent to incorporate the possibility of outliers, to which geodetic measurements are often susceptible. Outlier-resistant (“robust”) parameter estimation is one of the main strategies for dealing with outliers. In this category, iteratively reweighted least squares (IRLS) procedures have been used by geodesists for decades due to their conceptual simplicity and proven ability to reduce the effect of outliers by down-weighting outlier-afflicted observations. It should be mentioned that the robustness of these procedures is usually limited to adjustment problems which do not contain outliers that occur in leverage points (i.e., in poorly controlled observations characterized by very small partial redundancies) or as clusters. Such problems, which are not considered in the present paper, can be dealt with by employing a robust estimator with high breakdown point, e.g., the sign-constrained robust least squares method introduced by Xu (2005). Particularly popular robust IRLS procedures have been the L1-norm- and M-estimators (cf. Huber and Ronchetti 2009). The robustness of these estimators can be explained by the fact that random errors are modeled by means of a probability density function (pdf) f(x) that decreases rather slowly (often as a power of x, i.e., \(f(x) \sim c \cdot x^{-\alpha }\)) as |x| increases. Such a pdf is thus characterized by “heavy tails” (cf. Koch and Kargoll 2013; Alkhatib et al. 2017), which are reflected by high kurtosis (cf. Westfall 2014). Thus, outliers are more likely to occur under a heavy-tailed pdf than under a “light-tailed” pdf, such as a Gaussian pdf. Therefore, the former may be expected to define a more realistic stochastic model than the latter when numerous outliers are present. If outliers are thus modeled stochastically by means of a heavy-tailed pdf, then robust maximum likelihood (ML) estimators can be employed (cf. Wiśniewski 2014). Particularly attractive are procedures that include the widely used method of least squares alongside more robust procedures. Since the true nature of the random errors, including the stochastic outlier characteristics, is usually unknown, it makes sense to estimate the actual probability distribution of the random errors from the given measurements themselves, as already pointed out by Helmert (1907). One possibility for doing so is to adopt the family of scaled t-distributions (cf. Lange et al. 1989). The shape characteristics of the pdfs defining the members of this family are controlled by two parameters: the scale factor determines the spread, and the so-called degree of freedom (df) is related to the tail behavior. These pdfs have a number of beneficial mathematical properties, such as heavy tails (high kurtosis), so that t-distributions can be used for the purpose of robust estimation (cf. Schön et al. 2018; Alkhatib et al. 2017; Briegel and Tresp 1999; Lange et al. 1989). As this family contains the Gaussian distributions as limit cases (as the df tends to infinity), the former is a generalization of the latter.
Thus, the often-made assumption of a Gaussian random error law is not really abandoned, but becomes a testable hypothesis in light of the fact that the df of the t-distribution can be estimated by means of ML estimation (as demonstrated in Sect. 3). Such an estimator that adapts the df to the data itself was called a self-tuning robust estimator by Parzen (1979). The usage of Student distributions with a relatively low df \(\nu \in [3,4]\) has also been recommended by advocates of the Guide to the Expression of Uncertainty in Measurement (ISO/IEC 2008) in situations where input quantities with statistically determined (“type-A”) standard uncertainties affect the output quantities (i.e., the observables to be adjusted) (cf. Sommer and Siebert 2004).

Recently, Liu et al. (2017) used the t-distribution in a VAR process added to a deterministic model consisting solely of the unknown observation offsets (“intercepts”). While the focus of that study was on influence diagnostics for model perturbations, one purpose of our current contribution is to demonstrate effective estimation algorithms in the more general situation where the deterministic model is described by either a linear or a nonlinear regression. This is achieved by setting up an expectation maximization (EM) algorithm in the linear case and a generalized expectation maximization (GEM) algorithm in the nonlinear case. These are suitable approaches in view of their generally stable convergence and their capacity to reduce an ML estimation with t-distributed random errors to a computationally simple IRLS method (cf. McLachlan and Krishnan 2008; Koch and Kargoll 2013). The expectation step (“E-step”) corresponds to the computation of the weights in the current iteration, which are used to downweight outliers in the maximization step (“M-step”), in which the unknown model parameters are solved for. With an EM algorithm in the context of a linear deterministic model, the likelihood function is maximized globally within the M-step. However, when the functional model is nonlinear and therefore linearized, this cannot be achieved. In this situation, a GEM algorithm may be applied instead to compute, after the usual E-step, a solution that increases the likelihood function, but which does not necessarily maximize it (cf. Phillips 2002). Our proposed algorithm extends the GEM algorithm of Alkhatib et al. (2018) for adjusting nonlinear, multivariate regression time series where the random errors of each component follow a univariate AR process with univariate t-distributed white-noise components. Algorithms for the simpler case of linear, univariate time series with AR and t-distributed random errors were independently investigated by Kargoll et al. (2018a), Tuac et al. (2018) and Nduka et al. (2018).

The new GEM algorithm dealing more generally with a VAR process (derived in Sect. 3) will be established in two variants for two different stochastic models (defined in Sect. 2): The first stochastic model (“A”) is based on the assumption that the different white-noise series of the VAR process have independent, univariate t-distributions whose scale factors and dfs are time-constant but possibly different for all series. The second stochastic model (“B”) is defined by a multivariate t-distribution, which has only a single df applying to all series jointly, besides a cofactor matrix. Since these GEM algorithms maximize the original (log-)likelihood function, information criteria for model selection are readily available. In particular, a bias-corrected version of the Akaike information criterion (AIC) is defined. As a tool specifically for selecting an adequate VAR model order, the multivariate portmanteau test by Hosking (1980) is adapted to the current observation models. The convergence behavior of the suggested GEM algorithms is investigated through Monte Carlo (MC) simulations, the results of which are presented and analyzed in Sect. 4. In Sect. 5, various Global Positioning System (GPS) time series from the EUREF Permanent Network (EPN) are modeled and adjusted by means of the proposed methodology. In this numerical example, the issue of model selection is also discussed.

2 The observation models

Our goal is to adjust N groups \(\varvec{\mathcal {L}}_{1:}, \ldots , \varvec{\mathcal {L}}_{N:}\) of observables, where each group \(k \in \{ 1,\ldots ,N \}\) consists of n observables \(\varvec{\mathcal {L}}_{k:} = [{\mathcal {L}}_{k,1} \; \ldots \; {\mathcal {L}}_{k,n}]^T\), defined to be a regression time series

$$\begin{aligned} {\mathcal {L}}_{k,t} = h_{k,t}(\varvec{\xi }) + \mathcal {\MakeUppercase {E}}_{k,t} \;\;\;\;\; (t = 1,\ldots ,n). \end{aligned}$$
(1)

These equations specify then a vector time series \((\varvec{\mathcal {L}}_t \, | \, t \in \{ 1,\ldots ,n \})\) with \(\varvec{\mathcal {L}}_t = [{\mathcal {L}}_{1,t} \; \ldots \; {\mathcal {L}}_{N,t}]^T\). The measurements \(\ell _{1,1},\ldots ,\ell _{N,n}\) are used to form the total observation vector \(\varvec{\ell }\). The observation model on the right-hand side of (1) is defined by a deterministic model, through possibly nonlinear (differentiable) functions \(h_{k,t}\) of \(m < n\) unknown parameters \(\varvec{\xi } = [\xi _1,\ldots ,\xi _m]^T\), with random errors \(\varvec{\mathcal {E}}_{k:} = [\mathcal {\MakeUppercase {E}}_{k,1} \; \ldots \; \mathcal {\MakeUppercase {E}}_{k,n}]^T\) forming a vector time series with terms \(\varvec{\mathcal {E}}_t = [\mathcal {\MakeUppercase {E}}_{1,t} \; \ldots \; \mathcal {\MakeUppercase {E}}_{N,t}]^T\). Similarly, the vector-valued function \(\mathbf {h}_t\) can be defined from the scalar functions \(h_{1,t}\), \(\ldots \), \(h_{N,t}\), so that one can write

$$\begin{aligned} \varvec{\mathcal {L}}_t = \mathbf {h}_t(\varvec{\xi }) + \varvec{\mathcal {E}}_t. \end{aligned}$$
(2)

Clearly, observation equations (1) can also be stated in the form of “random error equations”

$$\begin{aligned} {\mathcal {E}}_{k,t} = {\mathcal {L}}_{k,t} - h_{k,t}(\varvec{\xi }) \;\;\;\;\; (t = 1,\ldots ,n) \end{aligned}$$
(3)

or in vector form

$$\begin{aligned} \varvec{\mathcal {E}}_t = \varvec{\mathcal {L}}_t - \mathbf {h}_t(\varvec{\xi }). \end{aligned}$$
(4)

Numerical realizations \(e_{k,t}\) of the random errors \({\mathcal {E}}_{k,t}\) are also referred to as “residuals” below. To allow for both auto-correlations within each time series and cross-correlations between the time series, the random errors \(\varvec{\mathcal {E}}_t\) are modeled jointly as a p-th order VAR process \(\varvec{\mathcal {E}}_t = \mathbf {A}^{(1)} \varvec{\mathcal {E}}_{t-1} + \cdots + \mathbf {A}^{(p)} \varvec{\mathcal {E}}_{t-p} + \varvec{\mathcal {U}}_t\) with components

$$\begin{aligned} \begin{bmatrix} \mathcal {\MakeUppercase {E}}_{1,t} \\ \vdots \\ \mathcal {\MakeUppercase {E}}_{N,t} \end{bmatrix}&= \begin{bmatrix} \alpha _{1;1,1} &{} \cdots &{} \alpha _{1;1,N} \\ \vdots &{} \ddots &{} \vdots \\ \alpha _{1;N,1} &{} \cdots &{} \alpha _{1;N,N} \end{bmatrix} \begin{bmatrix} \mathcal {\MakeUppercase {E}}_{1,t-1} \\ \vdots \\ \mathcal {\MakeUppercase {E}}_{N,t-1} \end{bmatrix} + \cdots \nonumber \\&\;\;\;\;\; + \begin{bmatrix} \alpha _{p;1,1} &{} \cdots &{} \alpha _{p;1,N} \\ \vdots &{} \ddots &{} \vdots \\ \alpha _{p;N,1} &{} \cdots &{} \alpha _{p;N,N} \end{bmatrix} \begin{bmatrix} \mathcal {\MakeUppercase {E}}_{1,t-p} \\ \vdots \\ \mathcal {\MakeUppercase {E}}_{N,t-p} \end{bmatrix} + \begin{bmatrix} \mathcal {\MakeUppercase {U}}_{1,t} \\ \vdots \\ \mathcal {\MakeUppercase {U}}_{N,t} \end{bmatrix} \end{aligned}$$
(5)

Here, the entries \(\alpha _{1;1,1}\), \(\ldots \), \(\alpha _{p;N,N}\) of the N-by-N matrices \(\mathbf {A}^{(1)}\), \(\ldots \), \(\mathbf {A}^{(p)}\) are unknown parameters and \(\mathcal {\MakeUppercase {U}}_{1,1}\), \(\ldots \), \(\mathcal {\MakeUppercase {U}}_{N,n}\) are, at least within every group, stochastically independent random variables, having zero mean and group-specific variances \(\sigma _{k,0}^2\). Since a computationally convenient method of parameter estimation is desirable, the random errors \(\varvec{\mathcal {E}}_0\), \(\ldots \), \(\varvec{\mathcal {E}}_{1-p}\) are all assumed to take the value 0. The conditions for asymptotic covariance-stationarity of VAR(p) model (5) can be found, e.g., in Hamilton (1994, p. 259). We now rewrite process equations (5) as

$$\begin{aligned} \varvec{\mathcal {U}}_t = \varvec{\mathcal {E}}_t - \mathbf {A}^{(1)} \varvec{\mathcal {E}}_{t-1} - \cdots - \mathbf {A}^{(p)} \varvec{\mathcal {E}}_{t-p} \;\;\;\;\; (t = 1,\ldots , n).\nonumber \\ \end{aligned}$$
(6)

Denoting by \(\mathbf {A}_k^{(j)}\) the kth row of the matrix \(\mathbf {A}^{(j)}\) with \(j \in \{ 1,\ldots ,p \}\) and by \(\mathbf {i}_k^T\) the kth row of the N-dimensional identity matrix \(\mathbf {I}\), one may write for the kth component of \(\varvec{\mathcal {U}}_t\)

$$\begin{aligned} \mathcal {\MakeUppercase {U}}_{k,t} = \mathbf {i}_k^T \varvec{\mathcal {E}}_t - \mathbf {A}_k^{(1)} \varvec{\mathcal {E}}_{t-1} - \cdots - \mathbf {A}_k^{(p)} \varvec{\mathcal {E}}_{t-p}. \end{aligned}$$
(7)

It is also assumed that the random variables \(\varvec{\mathcal {L}}_0\), \(\ldots \), \(\varvec{\mathcal {L}}_{1-p}\) and the functions \(h_{k,0}\), \(\ldots \), \(h_{k,1-p}\) occurring after substitution of lagged versions of (3) into (7) take values of 0. It will be convenient to collect all VAR coefficients contained in the kth rows \(\mathbf {A}_k^{(1)}\), \(\ldots \), \(\mathbf {A}_k^{(p)}\) within the single \((pN \times 1)\)-vector

$$\begin{aligned} \varvec{\alpha }_{k:} = [\mathbf {A}_k^{(1)} \; \ldots \; \mathbf {A}_k^{(p)}]^T, \end{aligned}$$
(8)

and to stack then all of these column vectors within the single \((pN^2 \times 1)\)-vector

$$\begin{aligned} \varvec{\alpha } = [\varvec{\alpha }_{1:}^T \; \ldots \; \varvec{\alpha }_{N:}^T]^T. \end{aligned}$$
(9)

To set up computationally efficient filter schemes, it is useful to introduce the notation \(L \mathbf {Z}_t := \mathbf {Z}_{t-1}\) and more generally \(L^j \mathbf {Z}_t := \mathbf {Z}_{t-j}\), where L is the so-called lag operator that shifts the time index t by \(j \in \{ 1,2,\ldots \}\) into the past (see Chap. 2 in Hamilton 1994, for details). As shown in Brockwell and Davis (2016, p. 243) this lag operator can be used to define so-called lag polynomials

$$\begin{aligned} \mathbf {A}(L)&= \mathbf {I} - \mathbf {A}^{(1)} L - \cdots - \mathbf {A}^{(p)} L^p, \end{aligned}$$
(10)
$$\begin{aligned} \mathbf {A}_k(L)&= \mathbf {i}_k^T - \mathbf {A}_k^{(1)} L - \cdots - \mathbf {A}_k^{(p)} L^p, \end{aligned}$$
(11)

which may then be multiplied with some element \(\mathbf {Z}_t\) of a given time series to obtain filtered quantities

$$\begin{aligned} \overline{\mathbf {Z}}_t&= \mathbf {A}(L) \mathbf {Z}_t \end{aligned}$$
(12)
$$\begin{aligned}&= \mathbf {I} \mathbf {Z}_t - \mathbf {A}^{(1)} L \mathbf {Z}_t - \cdots - \mathbf {A}^{(p)} L^p \mathbf {Z}_t \nonumber \\&= \mathbf {Z}_t - \mathbf {A}^{(1)} \mathbf {Z}_{t-1} - \cdots - \mathbf {A}^{(p)} \mathbf {Z}_{t-p} \end{aligned}$$
(13)

and

$$\begin{aligned} {\overline{Z}}_{k,t}&= \mathbf {A}_k(L) \mathbf {Z}_t \end{aligned}$$
(14)
$$\begin{aligned}&= \mathbf {i}_k^T \mathbf {Z}_t - \mathbf {A}_k^{(1)} L \mathbf {Z}_t - \cdots - \mathbf {A}_k^{(p)} L^p \mathbf {Z}_t \nonumber \\&= \mathbf {i}_k^T \mathbf {Z}_t - \mathbf {A}_k^{(1)} \mathbf {Z}_{t-1} - \cdots - \mathbf {A}_k^{(p)} \mathbf {Z}_{t-p}. \end{aligned}$$
(15)

Thus, (6) and (7) become with (4)

$$\begin{aligned} \varvec{\mathcal {U}}_t&= \overline{\varvec{\mathcal {E}}}_t = \mathbf {A}(L) \varvec{\mathcal {E}}_t = \mathbf {A}(L)(\varvec{\mathcal {L}}_t - \mathbf {h}_t(\varvec{\xi })), \end{aligned}$$
(16)
$$\begin{aligned} \mathcal {\MakeUppercase {U}}_{k,t}&= \overline{\mathcal {\MakeUppercase {E}}}_{k,t} = \mathbf {A}_k(L) \varvec{\mathcal {E}}_t = \mathbf {A}_k(L)(\varvec{\mathcal {L}}_t - \mathbf {h}_t(\varvec{\xi })). \end{aligned}$$
(17)

Here, \(\mathbf {A}(L)\) and the collection of \(\mathbf {A}_1(L)\), \(\ldots \), \(\mathbf {A}_N(L)\) may be viewed as digital filters, which transform the auto- and cross-correlated (“colored-noise”) time series \((\varvec{\mathcal {E}}_t \, | \, t \in \{ 1,\ldots ,n \})\) into the “white-noise” time series \((\varvec{\mathcal {U}}_t \, | \, t \in \{ 1,\ldots ,n \})\). The white-noise components \(\varvec{\mathcal {U}}_t\) and \(\mathcal {\MakeUppercase {U}}_{k,t}\) in (16)–(17) simplify to the right-hand sides of (4) and (3), respectively, if the VAR(p) process is “switched off” by setting \(\mathbf {A}(L) = \mathbf {I}\) and \(\mathbf {A}_k(L) = \mathbf {i}_k^T\). Numerical values \(u_{k,t}\) of the white-noise variables \({\mathcal {U}}_{k,t}\) will also be called “white-noise residuals” in the sequel. Having defined the functional and the correlation model, the stochastic model is specified next, for which at least two structurally different alternatives are conceivable.
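Before the stochastic models are specified, it may help to make the filtering operations (12)–(17) concrete. The following minimal Python/NumPy sketch applies the decorrelation filter \(\mathbf {A}(L)\) of (16) to a colored-noise series under the zero initial conditions assumed above; the function and variable names (`var_decorrelation_filter`, `A_mats`, `E`) are illustrative only, not part of the paper.

```python
import numpy as np

def var_decorrelation_filter(E, A_mats):
    """Apply the lag polynomial A(L) of Eq. (10) to a colored-noise series,
    i.e., compute u_t = e_t - A^(1) e_{t-1} - ... - A^(p) e_{t-p} as in
    Eq. (16), with e_0 = ... = e_{1-p} = 0 (zero initial conditions).

    E      : (n, N) array, rows are the vectors e_t
    A_mats : list of p arrays of shape (N, N), i.e. [A^(1), ..., A^(p)]
    returns: (n, N) array of white-noise residuals u_t
    """
    n, N = E.shape
    U = E.copy()
    for j, Aj in enumerate(A_mats, start=1):
        # subtract A^(j) e_{t-j}; terms with t - j < 1 are zero by assumption
        U[j:] -= E[:n - j] @ Aj.T
    return U

# toy example: simulate bivariate VAR(1) colored noise, then decorrelate it
rng = np.random.default_rng(0)
A1 = np.array([[0.5, 0.2], [0.1, 0.4]])
n, N = 1000, 2
U_true = rng.standard_normal((n, N))
E = np.zeros((n, N))
for t in range(n):
    E[t] = (A1 @ E[t - 1] if t > 0 else 0) + U_true[t]
U_rec = var_decorrelation_filter(E, [A1])
print(np.allclose(U_rec, U_true))  # True: the filter recovers the white noise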

2.1 Stochastic model A: group-specific univariate t-distributions

To allow for flexible modeling of random errors, which may involve heavy tailedness or multiple outliers of unknown forms, as well as different measurement precisions across the various observation groups, it may be assumed that the white-noise variables \((\mathcal {\MakeUppercase {U}}_{k,t}, k=1,\ldots ,N; t=1,\ldots ,n)\) are all independently t-distributed with mean 0, group-specific (generally unknown) df \(\nu _k\) and group-specific (generally unknown) variance component \(\sigma _k^2\), symbolically,

$$\begin{aligned} \mathcal {\MakeUppercase {U}}_{k,t} \, | \, \varvec{\theta } {\mathop {\sim }\limits ^{\mathrm {ind.}}} t_{\nu _k}(0,\sigma _k^2) \;\;\;\;\; (t = 1,\ldots , n). \end{aligned}$$
(18)

Here, \(\varvec{\theta } = [\varvec{\xi }^T \; \varvec{\alpha }^T \; (\varvec{\sigma ^2})^T \; \varvec{\nu }^T]^T\) is the complete parameter vector consisting of the \((m \times 1)\)-vector \(\varvec{\xi }\) of functional parameters, the \((pN^2 \times 1)\)-vector \(\varvec{\alpha }\) of VAR coefficients (8)–(9), the \((N \times 1)\)-vector \(\varvec{\sigma ^2} = [\sigma _1^2 \; \ldots \; \sigma _N^2]^T\) of variance components and the \((N \times 1)\)-vector \(\varvec{\nu } = [\nu _1 \; \ldots \; \nu _N]^T\) of dfs. The estimation of the model parameters \(\varvec{\theta }\) is explained in Sect. 3. Note that the widely used assumption of normally distributed random errors is approximated by this model as \(\nu _k \rightarrow \infty \). Practically, a close approximation of a normal distribution is reached already for \(\nu _k = 120\) (cf. Koch 2017). Note also that the white-noise variance \(\sigma _{k,0}^2\) is linked to the variance component \(\sigma _k^2\) via the relationship \(\sigma _{k,0}^2 = \frac{\nu _k}{\nu _k-2} \cdot \sigma _k^2\), which requires that \(\nu _k > 2\).

Due to the assumed stochastic independence of the white-noise variables, the joint pdf of \(\varvec{\mathcal {U}}\) factors into

$$\begin{aligned} f_A(\mathbf {u}|\varvec{\theta })&= f(u_{1,1},\ldots ,u_{N,n}|\varvec{\theta }) \nonumber \\&= \prod \limits _{k=1}^N \prod \limits _{t=1}^n \frac{{\Gamma }\left( \frac{\nu _k+1}{2} \right) }{\sqrt{\nu _k \pi } \, \sigma _k \, {\Gamma }\left( \frac{\nu _k}{2} \right) } \left[ 1 + \left( \frac{u_{k,t}}{\sigma _k} \right) ^2 / \nu _k \right] ^{-\frac{\nu _k+1}{2}}, \end{aligned}$$
(19)

where \({\Gamma }\) is the gamma function and where the realizations \(u_{k,t}\) of \(\mathcal {\MakeUppercase {U}}_{k,t}\) carry the complete information of the functional and the correlation model via (17). This extends, in a natural way, the pdf considered in Alkhatib et al. (2018) from a multivariate regression time series with independent, univariate autoregressive random errors to one with random errors following a fully multivariate VAR process. The log-likelihood function based on this pdf, as our basic observation model, is then given by

$$\begin{aligned} \log L_A(\varvec{\theta };\varvec{\ell })&= \log f(\mathbf {u}|\varvec{\theta }) = \sum _{k=1}^N \left( n \log \left[ \frac{{\Gamma }\left( \frac{\nu _k+1}{2} \right) }{\sqrt{\nu _k \pi } \sigma _k {\Gamma }\left( \frac{\nu _k}{2} \right) } \right] \right. \nonumber \\&\quad \left. -\frac{\nu _k+1}{2} \cdot \sum \limits _{t=1}^n \log \left[ 1 + \left( \frac{\mathbf {A}_k(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))}{\sigma _k} \right) ^2/\nu _k \right] \right) . \end{aligned}$$
(20)

Note that the vector-filtering operation \(\mathbf {A}_k(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))\) in (20) simplifies to the scalar-filtering operation \(\varvec{\alpha }_k(L)(\ell _{k,t} - h_{k,t}(\varvec{\xi }))\) used in the log-likelihood function involving multiple independent AR processes (see Alkhatib et al. 2018, Eq. (7)). As mentioned before, these filters make use of fixed boundary conditions for the unobservable past \(\varvec{\mathcal {L}}_0\), \(\ldots \), \(\varvec{\mathcal {L}}_{1-p}\), so that (20) constitutes a conditional log-likelihood function (see also Hamilton 1994, p. 291).
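As a computational illustration, the conditional log-likelihood (20) can be evaluated directly from the filtered (white-noise) residuals. The following hedged sketch assumes those residuals have already been obtained via (17), e.g., with a filter routine as sketched above, and uses the log-gamma function for numerical stability; all names are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def loglik_A(U, sigma2, nu):
    """Conditional log-likelihood (20) of stochastic model A.

    U      : (n, N) array of white-noise residuals u_{k,t}, column k already
             decorrelated via A_k(L), Eq. (17)
    sigma2 : (N,) variance components sigma_k^2
    nu     : (N,) degrees of freedom nu_k
    """
    n, N = U.shape
    logL = 0.0
    for k in range(N):
        s, v = np.sqrt(sigma2[k]), nu[k]
        # n-fold log of the t-pdf normalizing constant in (20)
        logL += n * (gammaln((v + 1) / 2) - gammaln(v / 2)
                     - 0.5 * np.log(v * np.pi) - np.log(s))
        # sum over t of the heavy-tail penalty term
        logL -= (v + 1) / 2 * np.sum(np.log1p((U[:, k] / s) ** 2 / v))
    return logL
```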

Before deriving the expectation maximization (EM) algorithm in analogy to the algorithms in Kargoll et al. (2018a) and Alkhatib et al. (2018), the Student distribution model is recast as a conditionally Gaussian variance inflation model. For this purpose, stochastically independent, gamma-distributed random variables

$$\begin{aligned} \mathcal {\MakeUppercase {W}}_{k,t} \, | \, \varvec{\theta } {\mathop {\sim }\limits ^{\mathrm {ind.}}} G \left( \frac{\nu _k}{2}, \frac{\nu _k}{2} \right) \;\;\;\;\; (k = 1,\ldots ,N; t = 1,\ldots , n)\nonumber \\ \end{aligned}$$
(21)

are introduced, which are viewed as unobservable (“latent”) variables or “missing data.” Under the conditions that the random variable \(\mathcal {\MakeUppercase {W}}_{k,t}\) takes the value \(w_{k,t}\) and that the parameter values \(\varvec{\theta }\) are also given, the white-noise components are assumed to be Gaussian random variables

$$\begin{aligned} \mathcal {\MakeUppercase {U}}_{k,t} \, | \, w_{k,t},\varvec{\theta } {\mathop {\sim }\limits ^{\mathrm {ind.}}} N(0,\sigma _k^2/w_{k,t}). \end{aligned}$$
(22)

Thus, the value \(w_{k,t}\) that the distribution of the white noise component \(\mathcal {\MakeUppercase {U}}_{k,t}\) is conditional on plays the role of an additional weight, a decrease of which causes an increase (“inflation”) of the variance of \(\mathcal {\MakeUppercase {U}}_{k,t}\). Since these weights cannot be observed directly, it will be the essential idea of the GEM algorithms developed in Sect. 3 to determine their values as conditional expectations of the latent variables \(\mathcal {\MakeUppercase {W}}_{k,t}\) given the observations \(\ell _{k,t}\) and an estimate of the parameters \(\varvec{\theta }\). The conditional independence in (22) means that each random variable \(\mathcal {\MakeUppercase {U}}_{k,t}\) is independent of the white-noise components \(\mathcal {\MakeUppercase {U}}_{\kappa ,\tau }\) and latent variables \(\mathcal {\MakeUppercase {W}}_{\kappa ,\tau }\) occurring within the series k at the other time instances (i.e., for \(\kappa = k\) and \(\tau = 1,\ldots ,t-1,t+1,\ldots ,n\)) and within the other series at all time instances (i.e., for \(\kappa = 1,\ldots ,k-1,k+1,\ldots ,N\) and \(\tau = 1,\ldots ,n\)), conditional on the value \(w_{k,t}\). This crucial assumption enables the simplification

$$\begin{aligned}&f(u_{k,t} | u_{k,1},w_{k,1},\ldots ,u_{k,t-1},w_{k,t-1},u_{k,t+1},w_{k,t+1},\ldots , \nonumber \\&\;\;\; u_{k,n},w_{k,n},w_{k,t},\varvec{\theta }) = f(u_{k,t} | w_{k,t},\varvec{\theta }). \end{aligned}$$
(23)
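The equivalence of (18) with the hierarchy (21)–(22) can be checked empirically: mixing conditionally Gaussian variables with gamma-distributed latent weights reproduces the scaled t-distribution. A small simulation sketch follows; the values for \(\nu \) and \(\sigma \) are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
nu, sigma, n = 4.0, 2.0, 200_000

# latent weights (21): G(nu/2, rate nu/2), i.e., NumPy scale = 2/nu
w = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)
# conditionally Gaussian white noise (22): N(0, sigma^2 / w)
u = rng.normal(0.0, sigma / np.sqrt(w))

# Kolmogorov-Smirnov comparison with the scaled t-distribution (18);
# a large p-value indicates agreement
print(stats.kstest(u, stats.t(df=nu, loc=0, scale=sigma).cdf))
```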

Since assumptions (21)–(22) are exactly the same as the ones applied by Alkhatib et al. (2018) in connection with group-specific (univariate) AR processes, their derivation of the log-likelihood function can be applied to the current model involving a VAR process with only one modification: \(u_{k,t}\) is not replaced by the quantity \(\varvec{\alpha }_k(L)(\ell _{k,t} - h_{k,t}(\varvec{\xi }))\) using decorrelation filters \(\varvec{\alpha }_k(L)\) based on group-specific AR processes, but by the quantity \(\mathbf {A}_k(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))\), where the decorrelation filter \(\mathbf {A}_k(L)\) is defined by a VAR process. With this modification, the complete-data counterpart of log-likelihood function (20) becomes

$$\begin{aligned} \log L_A(\varvec{\theta }; \varvec{\ell },\mathbf {w})&= -\frac{Nn}{2} \log (2\pi ) - \frac{n}{2} \sum _{k=1}^N \log (\sigma _k^2) \nonumber \\&\quad + \frac{n}{2} \sum _{k=1}^N \nu _k \log \left( \frac{\nu _k}{2} \right) - n \sum _{k=1}^N \log \left( {\Gamma } \left( \frac{\nu _k}{2} \right) \right) \nonumber \\&\quad - \sum \limits _{k=1}^N \sum \limits _{t=1}^n \frac{w_{k,t}}{2} \left[ \nu _k + \left( \frac{\mathbf {A}_k(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))}{\sigma _k} \right) ^2 \right] \nonumber \\&\quad + \sum \limits _{k=1}^N \sum \limits _{t=1}^n \frac{1}{2} (\nu _k - 1) \log (w_{k,t}). \end{aligned}$$
(24)

A certain disadvantage of this t-distribution model is that it does not allow for the modeling of stochastic dependence between the white-noise variables \(\mathcal {\MakeUppercase {U}}_{1,t}\), \(\ldots \), \(\mathcal {\MakeUppercase {U}}_{N,t}\) at a given time instance t, so that the stochastic dependence between the colored-noise variables \(\mathcal {\MakeUppercase {E}}_{1,t}\), \(\ldots \), \(\mathcal {\MakeUppercase {E}}_{N,t}\) arises merely from the past time instances \(t-1\), \(t-2\), \(\ldots \), as shown by (5). The following type of stochastic model overcomes this limitation.

2.2 Stochastic model B: multivariate t-distribution

The second stochastic model uses a multivariate scaled t-distribution (cf. Lange et al. 1989) in order to model correlations between the white-noise variables \(\mathcal {\MakeUppercase {U}}_{1,t}\), \(\ldots \), \(\mathcal {\MakeUppercase {U}}_{N,t}\). Instead of (18), the stochastic model assumption now is

$$\begin{aligned} \varvec{\mathcal {U}}_t \, | \, \varvec{\theta } {\mathop {\sim }\limits ^{\mathrm {ind.}}} t_{\nu }(\mathbf {0},\varvec{\varSigma }) \;\;\;\;\; (t = 1,\ldots , n), \end{aligned}$$
(25)

where \(\varvec{\varSigma }\) denotes an N-by-N cofactor matrix, being a factor of the covariance matrix \(\varvec{\varSigma }_0 = \frac{\nu }{\nu -2} \cdot \varvec{\varSigma }\) of each random vector \(\varvec{\mathcal {U}}_t\) for \(\nu > 2\). All unknown entries of the inverse matrix \(\varvec{\varSigma }^{-1}\) are treated as parameters to be estimated. Instead of this variance–covariance model, a variance-component model as described in Koch (2014) could also be used here. When the covariance matrix \(\varvec{\varSigma }_0\) of the random errors is given a priori, then the cofactor matrix \(\varvec{\varSigma }\) is computable via the equation \(\varvec{\varSigma } = \frac{\nu -2}{\nu } \cdot \varvec{\varSigma }_0\). In certain applications, such a covariance matrix might depend on time (e.g., in the analysis of GNSS time series preprocessed by means of standard software); one would then simply add the time index to the corresponding cofactor matrix (\(\varvec{\varSigma }_t\)) in (25) and in the sequel.

The total parameter vector reads \(\varvec{\theta } = [\varvec{\xi }^T \; \varvec{\alpha }^T \; (\varvec{\sigma ^{-1}})^T \; \nu ]^T\), where \(\varvec{\sigma ^{-1}}\) symbolizes the vectorized matrix \(\varvec{\varSigma }^{-1}\), that is, the \((N^2 \times 1)\)-vector obtained by stacking the columns of the matrix \(\varvec{\varSigma }^{-1}\). A certain drawback of the current model is that all observation groups share the same df \(\nu \) and thus the same tail or outlier characteristics, which might be an unrealistic assumption in some geodetic applications.

Due to the assumed stochastic independence of the random vectors \(\varvec{\mathcal {U}}_1\), \(\ldots \), \(\varvec{\mathcal {U}}_n\), their joint pdf factors into

$$\begin{aligned} f_B(\mathbf {u}|\varvec{\theta })&= \prod \limits _{t=1}^n f_B(\mathbf {u}_t|\varvec{\theta }) \nonumber \\&= \prod \limits _{t=1}^n \frac{{\Gamma }\left( \frac{\nu +N}{2} \right) }{(\sqrt{\nu \pi })^N \, \sqrt{\det \varvec{\varSigma }} \, {\Gamma }\left( \frac{\nu }{2} \right) } \left( 1 + \frac{\mathbf {u}_t^T \varvec{\varSigma }^{-1} \mathbf {u}_t}{\nu } \right) ^{-\frac{\nu +N}{2}} \end{aligned}$$
(26)

and leads to the log-likelihood function

$$\begin{aligned} \log L_B(\varvec{\theta };\varvec{\ell })&= n \log \left( \frac{{\Gamma }\left( \frac{\nu +N}{2} \right) }{(\sqrt{\nu \pi })^N \, \sqrt{\det \varvec{\varSigma }} \, {\Gamma }\left( \frac{\nu }{2} \right) } \right) \nonumber \\&\quad - \frac{\nu +N}{2} \cdot \sum _{t=1}^n \log \left( 1 + \frac{\mathbf {u}_t^T \varvec{\varSigma }^{-1} \mathbf {u}_t}{\nu } \right) , \end{aligned}$$
(27)

with \(\mathbf {u}_t = \mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))\) according to (16). The variance inflation model corresponding to this multivariate Student random error model is now defined by

$$\begin{aligned} \mathcal {\MakeUppercase {W}}_t \, | \, \varvec{\theta } {\mathop {\sim }\limits ^{\mathrm {ind.}}} G \left( \frac{\nu }{2}, \frac{\nu }{2} \right) \;\;\;\;\; (t = 1,\ldots , n) \end{aligned}$$
(28)

and

$$\begin{aligned} \varvec{\mathcal {U}}_t \, | \, w_t,\varvec{\theta } {\mathop {\sim }\limits ^{\mathrm {ind.}}} N \left( \mathbf {0},\frac{1}{w_t} \cdot \varvec{\varSigma } \right) , \end{aligned}$$
(29)

where the preceding conditional independence means that

$$\begin{aligned}&f(\mathbf {u}_t | \mathbf {u}_1,w_1,\ldots ,\mathbf {u}_{t-1},w_{t-1},\mathbf {u}_{t+1},w_{t+1},\ldots ,\mathbf {u}_n,w_n,w_t,\varvec{\theta }) \nonumber \\&\;\;\; = f(\mathbf {u}_t | w_t,\varvec{\theta }). \end{aligned}$$
(30)

In contrast to model (18), a single weight \(w_t\) is assigned to the entire time series term \(\varvec{\mathcal {U}}_t\); the smaller the weight, the more extreme is the location of the vector \(\mathbf {u}_t\) under the multidimensional tail of the multivariate distribution. The pdfs defining the preceding distributions are given by

$$\begin{aligned} f(w_t|\varvec{\theta }) = \left\{ \begin{array}{ccl} \frac{\left( \frac{\nu }{2} \right) ^{\frac{\nu }{2}}}{{\Gamma }\left( \frac{\nu }{2} \right) } \cdot w_t^{\frac{\nu }{2}-1} \cdot \exp \left( -\frac{\nu }{2} w_t \right) &{} \; \mathrm {for} \; &{} w_t > 0 \\ 0 &{} \; \mathrm {for} \; &{} w_t \le 0 \end{array} \right. \end{aligned}$$

and

$$\begin{aligned} f_B(\mathbf {u}_t | w_t,\varvec{\theta }) = \frac{1}{\sqrt{(2 \pi )^N \det (\varvec{\varSigma }/w_t)}} \, \exp \left( -\frac{w_t}{2} \mathbf {u}_t^T \varvec{\varSigma }^{-1} \mathbf {u}_t \right) . \end{aligned}$$

Then, the joint distribution of all the white-noise variables \(\varvec{\mathcal {U}}\) and all latent variables \(\varvec{\mathcal {W}}\) can be written as

$$\begin{aligned} f_B(\mathbf {u},\mathbf {w} | \varvec{\theta })&= \prod \limits _{t=1}^n f_B(\mathbf {u}_t,w_t|\varvec{\theta }) \nonumber \\&= \prod \limits _{t=1}^n f(w_t|\varvec{\theta }) \, f_B(\mathbf {u}_t|w_t,\varvec{\theta }) \nonumber \\&= \prod \limits _{t=1}^n \frac{\left( \frac{\nu }{2} \right) ^{\frac{\nu }{2}}}{{\Gamma }\left( \frac{\nu }{2} \right) } \, w_t^{\frac{\nu }{2}-1} \exp \left( -\frac{\nu }{2} w_t \right) \nonumber \\&\quad \times \frac{1}{\sqrt{(2 \pi )^N \det (\varvec{\varSigma }/w_t)}} \, \exp \left( -\frac{w_t}{2} \mathbf {u}_t^T \varvec{\varSigma }^{-1} \mathbf {u}_t \right) \end{aligned}$$
(31)

Since the unknown values \(\mathbf {u}\) of the white-noise variables depend on the given observations \(\varvec{\ell }\) as well as the unknown parameters \(\varvec{\xi }\) and \(\varvec{\alpha }\), this pdf may be used as a likelihood function, for which one obtains

$$\begin{aligned} \log L_B(\varvec{\theta }; \varvec{\ell },\mathbf {w})&= \log f_B(\mathbf {u},\mathbf {w} | \varvec{\theta }) \nonumber \\&= -\frac{Nn}{2} \log (2\pi ) -\frac{n}{2} \log \det \varvec{\varSigma } + \frac{n\nu }{2} \log \left( \frac{\nu }{2} \right) \nonumber \\&\quad - n \log {\Gamma } \left( \frac{\nu }{2} \right) + \left( \frac{N}{2} - 1 \right) \sum \limits _{t=1}^n \log w_t \nonumber \\&\quad - \frac{1}{2} \sum \limits _{t=1}^n w_t [\mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))]^T \varvec{\varSigma }^{-1} \nonumber \\&\quad \times [\mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))] + \frac{\nu }{2} \sum \limits _{t=1}^n (\log w_t - w_t). \end{aligned}$$
(32)

3 Adjustment of the observation models

In this section, ML estimates of the model parameters \(\varvec{\theta }\) under the stochastic models A and B are derived. For this purpose, the fact that the structure of log-likelihood functions (24) and (32) allows for the setup of GEM algorithms is exploited. These algorithms iteratively perform an expectation step (“E-step”) and a maximization step (“M-step”). Let the number s denote the iteration step and \(\widehat{\varvec{\theta }}^{(s)}\) the parameter estimates obtained in that step. Then, the \((s+1)\)th iteration of the E-step consists of the computation of the conditional expectation

$$\begin{aligned} Q_{A/B}(\varvec{\theta } | \widehat{\varvec{\theta }}^{(s)}) = E_{\varvec{\mathcal {W}}|\varvec{\ell };\widehat{\varvec{\theta }}^{(s)}} \left\{ \log L_{A/B} \left( \varvec{\theta };\varvec{\ell },\varvec{\mathcal {W}} \right) \right\} , \end{aligned}$$
(33)

which is called “Q-function” below; “A/B” indicates that either log-likelihood function (24) for stochastic model A or log-likelihood function (32) for stochastic model B is substituted. The E-step will be seen to provide values \(\mathbf {w}\) of the latent variables \(\varvec{\mathcal {W}}\). Next, one would solve within the M-step the maximization problem

$$\begin{aligned} \widehat{\varvec{\theta }}^{(s+1)} = \mathrm {arg} \, \max _{\varvec{\theta }} Q_{A/B}(\varvec{\theta } | \widehat{\varvec{\theta }}^{(s)}). \end{aligned}$$
(34)

3.1 Generalized expectation maximization algorithm for stochastic model A

Replacing the filtered quantity \(\varvec{\alpha }_k(L)(\ell _{k,t} - h_{k,t}(\varvec{\xi }))\) of Alkhatib et al. (2018) by \(\mathbf {A}_k(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))\), the Q-function of the current stochastic model A can be derived in a similar way, that is,

$$\begin{aligned} Q_A(\varvec{\theta } | \widehat{\varvec{\theta }}^{(s)})&= \mathrm {const.} - \frac{n}{2} \sum \limits _{k=1}^N \log (\sigma _k^2) \nonumber \\&\quad - \sum _{k=1}^N \frac{1}{2\sigma _k^2} \sum \limits _{t=1}^n {\widetilde{w}}_{k,t}^{(s)} \left[ \mathbf {A}_k(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi })) \right] ^2 \nonumber \\&\quad + \frac{n}{2} \sum \limits _{k=1}^N \nu _k \log \nu _k - n \sum _{k=1}^N \log {\Gamma } \left( \frac{\nu _k}{2} \right) \nonumber \\&\quad + \frac{n}{2} \sum \limits _{k=1}^N \nu _k \left[ \psi \left( \frac{{\widehat{\nu }}_k^{(s)}+1}{2} \right) - \log \left( {\widehat{\nu }}_k^{(s)}+1 \right) \right. \nonumber \\&\quad \left. + \frac{1}{n} \sum \limits _{t=1}^n \left( \log {\widetilde{w}}_{k,t}^{(s)} - {\widetilde{w}}_{k,t}^{(s)} \right) \right] , \end{aligned}$$
(35)

where

$$\begin{aligned} {\widetilde{w}}_{k,t}^{(s)}&= E_{\mathcal {\MakeUppercase {W}}_{k,t}|u_{k,t};\widehat{\varvec{\theta }}^{(s)}} \{ \mathcal {\MakeUppercase {W}}_{k,t} \} \nonumber \\&= \frac{{\widehat{\nu }}_k^{(s)}+1}{{\widehat{\nu }}_k^{(s)} + \left( \frac{\widehat{\mathbf {A}}_k^{(s)}(L)(\varvec{\ell }_t - \mathbf {h}_t(\widehat{\varvec{\xi }}^{(s)}))}{{\widehat{\sigma }}_k^{(s)}} \right) ^2}. \end{aligned}$$
(36)

The tilde symbol is used with these conditional expectations to distinguish these quantities more clearly from parameter estimates (which obtain a “hat”). The M-step is carried out in four sequential conditional maximization (CM) steps (in the sense of Meng and Rubin 1993) that correspond to the parameter groups \(\varvec{\xi }\), \(\varvec{\alpha }\), \(\varvec{\sigma }\) and \(\varvec{\nu }\) forming \(\varvec{\theta }\). The essential idea behind conditioning is to substitute the most recent available estimates for parameters belonging to groups other than the one to be determined by the current CM-step.

In the first CM-step, the functional parameters \(\varvec{\xi }\), which occur as variables of the nonlinear regression functions \(h_{k,t}\), are solved for. To resolve the nonlinearity, the functions are linearized by Taylor series expansion, which is possible within the framework of a GEM algorithm, as shown for instance in Phillips (2002). A natural choice for the Taylor point is the solution \(\widehat{\varvec{\xi }}^{(s)}\) from the preceding iteration step s. The GEM algorithm (Algorithm 1) is started with approximate initial parameter values \(\widehat{\varvec{\xi }}^{(0)}\), which may be known from a previous adjustment or determined as part of a preprocessing step. Setting the partial derivative of the preceding Q-function with respect to \(\xi _j\) equal to zero, as shown in Appendix B.2, one arrives at the normal equations

$$\begin{aligned} \mathbf {0}&= \sum \limits _{k=1}^N \frac{1}{\sigma _k^2} \begin{bmatrix} \mathbf {A}_k(L) \mathbf {X}_{1,:1} &{} \cdots &{} \mathbf {A}_k(L) \mathbf {X}_{n,:1} \\ \vdots &{} &{} \vdots \\ \mathbf {A}_k(L) \mathbf {X}_{1,:m} &{} \cdots &{} \mathbf {A}_k(L) \mathbf {X}_{n,:m} \end{bmatrix} \widetilde{\mathbf {W}}_k^{(s)} \nonumber \\&\quad \times \begin{bmatrix} \mathbf {A}_k(L) \left( \varvec{\varDelta \ell }_1 - \mathbf {X}_1 \varvec{\varDelta \xi } \right) \\ \vdots \\ \mathbf {A}_k(L) \left( \varvec{\varDelta \ell }_n - \mathbf {X}_n \varvec{\varDelta \xi } \right) \end{bmatrix} \end{aligned}$$
(37)

with the incremental observations

$$\begin{aligned} \varvec{\varDelta \ell }_t = \varvec{\ell }_t - \mathbf {h}_t(\widehat{\varvec{\xi }}^{(s)}), \end{aligned}$$
(38)

the incremental parameters

$$\begin{aligned} \varvec{\varDelta \xi } = \varvec{\xi } - \widehat{\varvec{\xi }}^{(s)} \end{aligned}$$
(39)

and \(\mathbf {X}_{t,:j}\) being the jth column of the Jacobi matrix

$$\begin{aligned} \mathbf {X}_t = \frac{\partial \mathbf {h}_t(\varvec{\xi })}{\partial \varvec{\xi }} \left| _{\varvec{\xi } = \widehat{\varvec{\xi }}^{(s)}} \right. = \begin{bmatrix} \frac{\partial h_{1,t}(\varvec{\xi })}{\partial \xi _1} &{} \cdots &{} \frac{\partial h_{1,t}(\varvec{\xi })}{\partial \xi _m} \\ \vdots &{} &{} \vdots \\ \frac{\partial h_{N,t}(\varvec{\xi })}{\partial \xi _1} &{} \cdots &{} \frac{\partial h_{N,t}(\varvec{\xi })}{\partial \xi _m} \end{bmatrix} \left| _{\varvec{\xi } = \widehat{\varvec{\xi }}^{(s)}} \right. .\nonumber \\ \end{aligned}$$
(40)

The step number s is omitted for brevity of expressions. Conditioning these equations on the estimated variance components \(\widehat{\varvec{\sigma }}^{(s)}\) and VAR coefficients \(\widehat{\varvec{\alpha }}^{(s)}\) from the preceding iteration step s, and also performing filtering operation (14) on the incremental observations and the Jacobi matrix, yields the filtered quantities

$$\begin{aligned} \overline{\varDelta \ell }_{k,t}&= \widehat{\mathbf {A}}_k^{(s)}(L) \varvec{\varDelta \ell }_t, \end{aligned}$$
(41)
$$\begin{aligned} {\overline{X}}_{t,k,j}&= \widehat{\mathbf {A}}_k^{(s)}(L) \mathbf {X}_{t,:j}, \end{aligned}$$
(42)
$$\begin{aligned} \overline{\mathbf {X}}_{t,k:}&= \widehat{\mathbf {A}}_k^{(s)}(L) \mathbf {X}_t, \end{aligned}$$
(43)

where \({\overline{X}}_{t,k,j}\) is the entry in the kth row and the jth column of the filtered Jacobi matrix at time t and \(\overline{\mathbf {X}}_{t,k:}\) is accordingly the kth row of that matrix. The solution \(\widehat{\varvec{\varDelta \xi }}^{(s+1)}\) satisfies

$$\begin{aligned} \mathbf {0}&= \sum \limits _{k=1}^N \frac{1}{({\widehat{\sigma }}_k^2)^{(s)}} \begin{bmatrix} {\overline{X}}_{1,k,1} &{} \cdots &{} {\overline{X}}_{n,k,1} \\ \vdots &{} &{} \vdots \\ {\overline{X}}_{1,k,m} &{} \cdots &{} {\overline{X}}_{n,k,m} \end{bmatrix} \widetilde{\mathbf {W}}_k^{(s)} \\&\quad \times \begin{bmatrix} \overline{\varDelta \ell }_{k,1} - \overline{\mathbf {X}}_{1,k:}\widehat{\varvec{\varDelta \xi }}^{(s+1)} \\ \vdots \\ \overline{\varDelta \ell }_{k,n} - \overline{\mathbf {X}}_{n,k:}\widehat{\varvec{\varDelta \xi }}^{(s+1)} \end{bmatrix} \\&= \sum \limits _{k=1}^N \frac{1}{({\widehat{\sigma }}_k^2)^{(s)}} \begin{bmatrix} {\overline{X}}_{1,k,1} &{} \cdots &{} {\overline{X}}_{n,k,1} \\ \vdots &{} &{} \vdots \\ {\overline{X}}_{1,k,m} &{} \cdots &{} {\overline{X}}_{n,k,m} \end{bmatrix} \widetilde{\mathbf {W}}_k^{(s)} \\&\quad \times \left( \begin{bmatrix} \overline{\varDelta \ell }_{k,1} \\ \vdots \\ \overline{\varDelta \ell }_{k,n} \end{bmatrix} - \begin{bmatrix} {\overline{X}}_{1,k,1} &{} \cdots &{} {\overline{X}}_{1,k,m} \\ \vdots &{} &{} \vdots \\ {\overline{X}}_{n,k,1} &{} \cdots &{} {\overline{X}}_{n,k,m} \end{bmatrix} \widehat{\varvec{\varDelta \xi }}^{(s+1)} \right) \\&=: \sum \limits _{k=1}^N \frac{1}{({\widehat{\sigma }}_k^2)^{(s)}} \overline{\mathbf {X}}_{k:}^T \widetilde{\mathbf {W}}_k^{(s)} \left( \overline{\varvec{\varDelta \ell }}_{k:} - \overline{\mathbf {X}}_{k:} \widehat{\varvec{\varDelta \xi }}^{(s+1)} \right) . \end{aligned}$$

If the matrices \(\overline{\mathbf {X}}_{1:}\), \(\ldots \), \(\overline{\mathbf {X}}_{N:}\) have full rank m, estimates of the incremental parameters can be computed by

$$\begin{aligned} \widehat{\varvec{\varDelta \xi }}^{(s+1)}&= \left( \sum \limits _{k=1}^N \frac{1}{({\widehat{\sigma }}_k^2)^{(s)}} \overline{\mathbf {X}}_{k:}^T \widetilde{\mathbf {W}}_k^{(s)} \overline{\mathbf {X}}_{k:} \right) ^{-1} \nonumber \\&\quad \times \sum \limits _{k=1}^N \frac{1}{({\widehat{\sigma }}_k^2)^{(s)}} \overline{\mathbf {X}}_{k:}^T \widetilde{\mathbf {W}}_k^{(s)} \overline{\varvec{\varDelta \ell }}_{k:}, \end{aligned}$$
(44)

which yield with (39) the new solution for the full parameters

$$\begin{aligned} \widehat{\varvec{\xi }}^{(s+1)} = \widehat{\varvec{\xi }}^{(s)} + \widehat{\varvec{\varDelta \xi }}^{(s+1)}. \end{aligned}$$
(45)

Notice that (44) involves additions of the normal equation matrices \(\overline{\mathbf {X}}_{k:}^T \widetilde{\mathbf {W}}_k^{(s)} \overline{\mathbf {X}}_{k:}\) and corresponding “right-hand sides” \(\overline{\mathbf {X}}_{k:}^T \widetilde{\mathbf {W}}_k^{(s)} \overline{\varvec{\varDelta \ell }}_{k:}\) across the N individual time series, because the white-noise time series were modeled independently by univariate (t-)distributions.
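A compact numerical sketch of this first CM-step is given below; it assumes that the filtered Jacobi matrices (43), the filtered incremental observations (41) and the weights (36) have already been formed, and all array names are illustrative.

```python
import numpy as np

def cm_step_xi(Xbar, dl_bar, w, sigma2):
    """First CM-step of model A: incremental parameter update, Eq. (44).

    Xbar   : (N, n, m) filtered Jacobi matrices, group k in Xbar[k]
    dl_bar : (N, n) filtered incremental observations (41)
    w      : (N, n) conditional weights (36)
    sigma2 : (N,) current variance components
    returns: (m,) increment Delta xi; the full update then follows (45)
    """
    N, n, m = Xbar.shape
    Nmat = np.zeros((m, m))   # normal-equation matrix summed over groups
    rhs = np.zeros(m)         # right-hand side summed over groups
    for k in range(N):
        WX = (w[k, :, None] * Xbar[k]) / sigma2[k]  # W_k Xbar_k / sigma_k^2
        Nmat += Xbar[k].T @ WX
        rhs += WX.T @ dl_bar[k]
    return np.linalg.solve(Nmat, rhs)
```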

Next, estimates of random errors (4), also called “colored-noise residuals” below, are immediately obtained through

$$\begin{aligned} \widehat{\mathbf {e}}_t^{(s+1)} = \varvec{\ell }_t - \mathbf {h}_t(\widehat{\varvec{\xi }}^{(s+1)}). \end{aligned}$$
(46)

To derive the new solution for the VAR coefficients \(\varvec{\alpha }\) within the second CM-step, the matrix

$$\begin{aligned} \widehat{\mathbf {E}}^{(s+1)} = \begin{bmatrix} \widehat{\mathbf {e}}_0^{(s+1)T} &{} \cdots &{} \widehat{\mathbf {e}}_{1-p}^{(s+1)T} \\ \vdots &{} &{} \vdots \\ \widehat{\mathbf {e}}_{n-1}^{(s+1)T} &{} \cdots &{} \widehat{\mathbf {e}}_{n-p}^{(s+1)T} \end{bmatrix} \end{aligned}$$
(47)

is defined. As the colored-noise residuals can be computed by (46) only for \(t=1,\ldots ,n\), the residuals \(\widehat{\mathbf {e}}_0^{(s+1)}\), \(\widehat{\mathbf {e}}_{-1}^{(s+1)}\), \(\ldots \) with nonpositive time index values occurring in matrix (47) are undefined. To account for these missing values, the initial conditions \(\widehat{\mathbf {e}}_0^{(s+1)} = \widehat{\mathbf {e}}_{-1}^{(s+1)} = \cdots = \mathbf {0}\) are used. Appendix B.3 shows that the first-order conditions for \(\varvec{\alpha }_{k:}\) result in

$$\begin{aligned} \widehat{\varvec{\alpha }}_{k:}^{(s+1)} = \left( \widehat{\mathbf {E}}^{(s+1)T} \widetilde{\mathbf {W}}_k^{(s)} \widehat{\mathbf {E}}^{(s+1)} \right) ^{-1} \widehat{\mathbf {E}}^{(s+1)T} \widetilde{\mathbf {W}}_k^{(s)} \widehat{\varvec{e}}_{k:}^{(s+1)},\nonumber \\ \end{aligned}$$
(48)

given that the matrix \(\widehat{\mathbf {E}}^{(s+1)}\) has full rank. Note that this matrix is the same for each group \(k \in \{ 1,\ldots ,N \}\). Stacking the vectors \(\widehat{\varvec{\alpha }}_{1:}^{(s+1)}\), \(\ldots \), \(\widehat{\varvec{\alpha }}_{N:}^{(s+1)}\) yields the vector \(\widehat{\varvec{\alpha }}^{(s+1)}\), which defines the updated VAR(p) model. Applying the decorrelation filter based on this model to correlated residuals (46), as indicated by (16), gives

$$\begin{aligned} \widehat{\mathbf {u}}_t^{(s+1)} = \widehat{\mathbf {A}}^{(s+1)}(L) \widehat{\mathbf {e}}_t^{(s+1)}. \end{aligned}$$
(49)
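The second CM-step can be coded along the same lines. The sketch below builds the design matrix (47) with zero initial conditions and solves the group-wise weighted normal equations (48); the helper name `cm_step_alpha` is illustrative. The decorrelation (49) can then be performed with a filter routine such as the one sketched in Sect. 2.

```python
import numpy as np

def cm_step_alpha(E_res, w, p):
    """Second CM-step of model A: VAR coefficients via (47)-(48).

    E_res : (n, N) colored-noise residuals e_t from (46)
    w     : (N, n) conditional weights (36)
    p     : VAR model order
    returns: list [A^(1), ..., A^(p)] of (N, N) coefficient matrices
    """
    n, N = E_res.shape
    # design matrix (47); rows for t <= p use e_0 = e_{-1} = ... = 0
    Ehat = np.zeros((n, p * N))
    for j in range(1, p + 1):
        Ehat[j:, (j - 1) * N:j * N] = E_res[:n - j]
    A_rows = np.empty((N, p * N))
    for k in range(N):            # group-wise weighted LS solution (48)
        WE = w[k][:, None] * Ehat
        A_rows[k] = np.linalg.solve(Ehat.T @ WE, WE.T @ E_res[:, k])
    # reassemble the matrices A^(j) from the stacked rows alpha_k, cf. (8)
    return [A_rows[:, (j - 1) * N:j * N] for j in range(1, p + 1)]
```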

Now, the variance components are estimated in the third CM-step group-by-group according to

$$\begin{aligned} ({\widehat{\sigma }}_k^2)^{(s+1)}&= \frac{1}{n} \sum \limits _{t=1}^n {\widetilde{w}}_{k,t}^{(s)} {\widehat{u}}_{k,t}^{(s+1)2} \end{aligned}$$
(50)
$$\begin{aligned}&= \frac{\widehat{\mathbf {u}}_{k:}^{(s+1)T} \widetilde{\mathbf {W}}_k^{(s)} \widehat{\mathbf {u}}_{k:}^{(s+1)}}{n}, \end{aligned}$$
(51)

as the result of the corresponding first-order condition (see Appendix B.4 for details). To increase the computational efficiency of the remaining estimation of the df, the first-order condition with respect to the original log-likelihood function \(L_A(\varvec{\theta };\varvec{\ell })\) is used. This turns the fourth CM-step into a conditional maximization either (CME) step (as suggested by Liu and Rubin 1994), which consists of the group-wise zero search of the equation (derived in Appendix B.5)

$$\begin{aligned} 0&= \log {\widehat{\nu }}_k^{(s+1)} + 1 - \psi \left( \frac{{\widehat{\nu }}_k^{(s+1)}}{2} \right) + \psi \left( \frac{{\widehat{\nu }}_k^{(s+1)}+1}{2} \right) \nonumber \\&\quad - \log \left( {\widehat{\nu }}_k^{(s+1)}+1 \right) + \frac{1}{n} \sum \limits _{t=1}^n \left( \log {\widetilde{w}}_{k,t}^{(s+1)} - {\widetilde{w}}_{k,t}^{(s+1)} \right) \end{aligned}$$
(52)

with

$$\begin{aligned} {\widetilde{w}}_{k,t}^{(s+1)} = \frac{{\widehat{\nu }}_k^{(s+1)}+1}{{\widehat{\nu }}_k^{(s+1)} + \left( \frac{\widehat{\mathbf {A}}_k^{(s+1)}(L)(\varvec{\ell }_t - \mathbf {h}_t(\widehat{\varvec{\xi }}^{(s+1)}))}{{\widehat{\sigma }}_k^{(s+1)}} \right) ^2}. \end{aligned}$$
(53)
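The zero search (52) is a one-dimensional root-finding problem that can be solved with standard routines, since the digamma function \(\psi \) is available in numerical libraries. The following hedged sketch uses SciPy; the bracketing interval is a heuristic choice, and with `dim=N` the same function also covers the multivariate counterpart (64) of Sect. 3.2.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

def cme_step_nu(w, dim=1):
    """Zero search for the degree of freedom, Eq. (52); with dim=N the same
    function solves the multivariate counterpart (64).

    w : (n,) conditional weights of one group, Eq. (53) resp. (65)
    """
    c = np.mean(np.log(w) - w)

    def f(nu):
        return (np.log(nu) + 1 - digamma(nu / 2)
                + digamma((nu + dim) / 2) - np.log(nu + dim) + c)

    # heuristic bracket; brentq raises if no sign change occurs in it
    return brentq(f, 0.1, 1000.0)
```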

The steps of the current GEM algorithm under stochastic model A can be implemented easily as shown in Algorithm 1 below, which includes the initialization and stop criterion.

[Algorithm 1: GEM algorithm for the adjustment of the multivariate regression time series under stochastic model A, including initialization and stop criterion]
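For orientation, a hedged sketch of how the E-step and the four CM-steps could be chained into the iteration of Algorithm 1 is given below. It assumes the helper functions sketched above (`var_decorrelation_filter`, `cm_step_xi`, `cm_step_alpha`, `cme_step_nu`) are in scope; the initialization values and the stop criterion on the parameter increments are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def gem_model_A(L_obs, h_and_jac, xi0, p, tol=1e-6, max_iter=100):
    """Sketch of Algorithm 1 (GEM, stochastic model A).

    L_obs     : (n, N) observations l_t
    h_and_jac : callable xi -> (H, X), H (n, N) with rows h_t(xi),
                X (n, N, m) with the Jacobi matrices (40)
    xi0       : (m,) approximate initial parameter values
    p         : VAR model order
    """
    n, N = L_obs.shape
    xi = xi0.astype(float)
    A_mats = [np.zeros((N, N)) for _ in range(p)]   # start without decorrelation
    sigma2 = np.ones(N)                             # heuristic initialization
    nu = 30.0 * np.ones(N)                          # moderate initial df
    w = np.ones((N, n))                             # initial weights: plain LS
    for _ in range(max_iter):
        xi_old = xi.copy()
        H, X = h_and_jac(xi)
        # filtered incremental observations (41) and Jacobians (42)-(43)
        dl_bar = var_decorrelation_filter(L_obs - H, A_mats).T   # (N, n)
        Xbar = np.stack([var_decorrelation_filter(X[:, :, j], A_mats)
                         for j in range(X.shape[2])], axis=2)    # (n, N, m)
        # CM-step 1: Eqs. (44)-(45)
        xi = xi + cm_step_xi(Xbar.transpose(1, 0, 2), dl_bar, w, sigma2)
        E_res = L_obs - h_and_jac(xi)[0]                         # (46)
        # CM-step 2: Eqs. (47)-(48), then decorrelation (49)
        A_mats = cm_step_alpha(E_res, w, p)
        U = var_decorrelation_filter(E_res, A_mats)
        for k in range(N):
            sigma2[k] = np.sum(w[k] * U[:, k] ** 2) / n          # (50)
            w[k] = (nu[k] + 1) / (nu[k] + U[:, k] ** 2 / sigma2[k])  # (53)
            nu[k] = cme_step_nu(w[k])                            # CME-step (52)
        if np.max(np.abs(xi - xi_old)) < tol:                    # stop criterion
            break
    return xi, A_mats, sigma2, nu
```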

3.2 Generalized expectation maximization algorithm for stochastic model B

The solution of maximization problem (34) for the log-likelihood function \(\log L_B(\varvec{\theta };\varvec{\ell },\mathbf {w})\) involves the multivariate t-distribution. To bring out clearly the modifications that have to be applied to the previously given derivation of EM algorithms in the univariate case (cf. Koch and Kargoll 2013; Kargoll et al. 2018a), the E-step is first elaborated in some detail.

3.2.1 The E-step

To derive the E-step analytically, it is convenient to separate within log-likelihood function (32) the terms involving \(w_t\) from those involving \(\log w_t\), that is,

$$\begin{aligned} \log L_B(\varvec{\theta };\varvec{\ell },\mathbf {w})&= -\frac{Nn}{2} \log (2\pi ) \, -\frac{n}{2} \log \det \varvec{\varSigma } \\&\quad + \frac{n\nu }{2} \log \left( \frac{\nu }{2} \right) - n \log {\Gamma }\left( \frac{\nu }{2} \right) \\&\quad - \sum \limits _{t=1}^n \frac{1}{2} \left( \nu + [\mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))]^T \varvec{\varSigma }^{-1} \right. \\&\quad \quad \left. \times [\mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))] \right) w_t \\&\quad + \sum \limits _{t=1}^n \frac{1}{2} (\nu +N-2) \log w_t. \end{aligned}$$

As mentioned in Appendix A, the previously considered univariate case is embedded in this multivariate case as the special case \(N=1\). Having defined the probabilistic model of the random variables in \(\varvec{\mathcal {U}}\), we condition directly on the values \(\mathbf {u}\), so that the Q-function becomes

$$\begin{aligned} Q_B(\varvec{\theta } | \widehat{\varvec{\theta }}^{(s)})&= E_{\varvec{\mathcal {W}}|\mathbf {u};\widehat{\varvec{\theta }}^{(s)}} \left\{ \log L_B \left( \varvec{\theta };\varvec{\ell },\varvec{\mathcal {W}} \right) \right\} = -\frac{Nn}{2} \log (2\pi ) \nonumber \\&\quad - \frac{n}{2} \log \det \varvec{\varSigma } \, + \frac{n\nu }{2} \log \left( \frac{\nu }{2} \right) - n \log {\Gamma }\left( \frac{\nu }{2} \right) \nonumber \\&\quad - \sum \limits _{t=1}^n \frac{1}{2} \left( \nu + [\mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))]^T \varvec{\varSigma }^{-1} \right. \nonumber \\&\quad \left. \times [\mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))] \right) E_{\varvec{\mathcal {W}}|\varvec{u};\widehat{\varvec{\theta }}^{(s)}} \{ \mathcal {\MakeUppercase {W}}_t \} \nonumber \\&\quad + \sum \limits _{t=1}^n \frac{1}{2} (\nu + N - 2) E_{\varvec{\mathcal {W}}|\mathbf {u};\widehat{\varvec{\theta }}^{(s)}} \{ \log \mathcal {\MakeUppercase {W}}_t \}. \end{aligned}$$
(54)

The conditional expectations are determined in analogy to those occurring in the Q-function for the univariate t-distribution model (see Kargoll et al. 2018a, Eqs. (39)–(41)). In the present case, the required conditional distribution of \(\mathcal {\MakeUppercase {W}}_t|\mathbf {u}_t;\widehat{\varvec{\theta }}^{(s)}\) can be shown to be a univariate gamma distribution with parameter values \(a = \frac{{\widehat{\nu }}^{(s)}+N}{2}\) and \(b = \frac{1}{2} ({\widehat{\nu }}^{(s)} + \mathbf {u}_t^T (\widehat{\varvec{\varSigma }}^{(s)})^{-1} \mathbf {u}_t)\), according to Appendix B.1. Using the property that the expected value of a gamma-distributed random variable is given by a/b and that of its natural logarithm by \(\psi (a) - \log b\) (see Appendix A.2 in Kargoll et al. 2018a) one finds

$$\begin{aligned} {\widetilde{w}}_t^{(s)}&= E_{\varvec{\mathcal {W}}|\mathbf {u};\widehat{\varvec{\theta }}^{(s)}} \{ \mathcal {\MakeUppercase {W}}_t \} = E_{\mathcal {\MakeUppercase {W}}_t|\mathbf {u}_t;\widehat{\varvec{\theta }}^{(s)}} \{ \mathcal {\MakeUppercase {W}}_t \} \nonumber \\&= \frac{{\widehat{\nu }}^{(s)}+N}{{\widehat{\nu }}^{(s)} + \mathbf {u}_t^T (\widehat{\varvec{\varSigma }}^{(s)})^{-1} \mathbf {u}_t} \end{aligned}$$
(55)

as well as

$$\begin{aligned} E_{\varvec{\mathcal {W}}|\mathbf {u};\widehat{\varvec{\theta }}^{(s)}} \{ \log \mathcal {\MakeUppercase {W}}_t \}&= E_{\mathcal {\MakeUppercase {W}}_t|\mathbf {u}_t;\widehat{\varvec{\theta }}^{(s)}} \{ \log \mathcal {\MakeUppercase {W}}_t \} = \psi \left( \frac{{\widehat{\nu }}^{(s)}+N}{2} \right) \nonumber \\&\quad - \log \left( \frac{1}{2} \left[ {\widehat{\nu }}^{(s)} + \mathbf {u}_t^T (\widehat{\varvec{\varSigma }}^{(s)})^{-1} \mathbf {u}_t \right] \right) \nonumber \\&= \log {\widetilde{w}}_t^{(s)} + \psi \left( \frac{{\widehat{\nu }}^{(s)}+N}{2} \right) \nonumber \\&\quad - \log \left( \frac{{\widehat{\nu }}^{(s)}+N}{2} \right) . \end{aligned}$$
(56)

Since (55) is conditioned on \(\mathbf {u}_t\) and \(\widehat{\varvec{\theta }}^{(s)}\), (16) becomes \(\mathbf {u}_t = \widehat{\mathbf {A}}^{(s)}(L)(\varvec{\ell }_t - \mathbf {h}_t(\widehat{\varvec{\xi }}^{(s)}))\), which is substituted into (55). Substituting these expressions into (54) and omitting all terms that do not involve any parameter in \(\varvec{\theta }\), the Q-function can be written as

$$\begin{aligned} Q_B(\varvec{\theta } | \widehat{\varvec{\theta }}^{(s)})&= -\frac{n}{2} \log \det \varvec{\varSigma } - \frac{1}{2} \sum \limits _{t=1}^n {\widetilde{w}}_t^{(s)} [\mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))]^T \nonumber \\&\quad \times \varvec{\varSigma }^{-1} [\mathbf {A}(L)(\varvec{\ell }_t - \mathbf {h}_t(\varvec{\xi }))] + \frac{n\nu }{2} \log \nu \nonumber \\&\quad - n \log {\Gamma }\left( \frac{\nu }{2} \right) + \frac{n\nu }{2} \left[ \psi \left( \frac{{\widehat{\nu }}^{(s)}+N}{2} \right) \right. \nonumber \\&\quad \left. - \log \left( {\widehat{\nu }}^{(s)}+N \right) + \frac{1}{n} \sum \limits _{t=1}^n \left( \log {\widetilde{w}}_t^{(s)} - {\widetilde{w}}_t^{(s)} \right) \right] . \end{aligned}$$
(57)

The omission of parameter-independent constants will not affect maximization with respect to the parameters \(\varvec{\theta }\), which is described next.
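In code, the E-step of model B reduces to a few array operations; the following sketch evaluates (55) and (56) from the current white-noise residuals, with illustrative names.

```python
import numpy as np
from scipy.special import digamma

def e_step_B(U, Sigma_inv, nu):
    """E-step of model B: conditional weights (55) and expected
    log-weights (56).

    U         : (n, N) array of white-noise residuals u_t, cf. (16)
    Sigma_inv : (N, N) current inverse cofactor matrix
    nu        : current df estimate
    """
    N = U.shape[1]
    # Mahalanobis-type quadratic form u_t^T Sigma^{-1} u_t for each t
    maha = np.einsum('ti,ij,tj->t', U, Sigma_inv, U)
    w = (nu + N) / (nu + maha)                                        # Eq. (55)
    Elogw = np.log(w) + digamma((nu + N) / 2) - np.log((nu + N) / 2)  # Eq. (56)
    return w, Elogw
```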

3.2.2 The M-step

To obtain estimates of the regression parameters in the first CM-step, the regression function is linearized exactly as before (see Sect. 3.1). The first-order conditions with respect to \(\varvec{\xi }\) can then be shown to lead to the normal equations

$$\begin{aligned} \mathbf {0} = \sum \limits _{t=1}^n [\mathbf {A}(L) \mathbf {X}_t]^T \left[ {\widetilde{w}}_t^{(s)} \varvec{\varSigma }^{-1} \right] [\mathbf {A}(L)(\varvec{\varDelta \ell }_t - \mathbf {X}_t \varvec{\varDelta \xi })] \end{aligned}$$
(58)

with the incremental observations \(\varvec{\varDelta \ell }_t\) computed by (38), incremental parameters \(\varvec{\varDelta \xi }\) by (39), and Jacobi matrix \(\mathbf {X}_t\) by (40). Conditioning these equations on the latest available estimates \(\widehat{\varvec{\varSigma }}^{(s)}\) and \(\widehat{\varvec{\alpha }}^{(s)}\), and filtering the incremental observations and the Jacobi matrix as in (12) to obtain

$$\begin{aligned} \overline{\varvec{\varDelta \ell }}_t&= \widehat{\mathbf {A}}^{(s)}(L) \varvec{\varDelta \ell }_t, \end{aligned}$$
(59)
$$\begin{aligned} \overline{\mathbf {X}}_t&= \widehat{\mathbf {A}}^{(s)}(L) \mathbf {X}_t, \end{aligned}$$
(60)

one can write the normal equations as

$$\begin{aligned} \mathbf {0} = \sum \limits _{t=1}^n \overline{\mathbf {X}}_t^T \left[ {\widetilde{w}}_t^{(s)} (\widehat{\varvec{\varSigma }}^{(s)})^{-1} \right] \left( \overline{\varvec{\varDelta \ell }}_t - \overline{\mathbf {X}}_t \widehat{\varvec{\varDelta \xi }}^{(s+1)} \right) . \end{aligned}$$

Given that the filtered Jacobi matrices have full rank, the solution of the normal equations is given by

$$\begin{aligned} \widehat{\varvec{\varDelta \xi }}^{(s+1)}&= \left( \sum \limits _{t=1}^n \overline{\mathbf {X}}_t^T \left[ {\widetilde{w}}_t^{(s)} (\widehat{\varvec{\varSigma }}^{(s)})^{-1} \right] \overline{\mathbf {X}}_t \right) ^{-1} \nonumber \\&\quad \times \sum \limits _{t=1}^n \overline{\mathbf {X}}_t^T \left[ {\widetilde{w}}_t^{(s)} (\widehat{\varvec{\varSigma }}^{(s)})^{-1} \right] \overline{\varvec{\varDelta \ell }}_t, \end{aligned}$$
(61)

resulting in the new solution \(\widehat{\varvec{\xi }}^{(s+1)}\) via (45). Since the white-noise time series are now generally correlated across the \(N\) groups through the multivariate t-distribution, the normal equation matrices \(\overline{\mathbf {X}}_t^T [{\widetilde{w}}_t^{(s)} (\widehat{\varvec{\varSigma }}^{(s)})^{-1}] \overline{\mathbf {X}}_t\) and right-hand sides \(\overline{\mathbf {X}}_t^T [{\widetilde{w}}_t^{(s)} (\widehat{\varvec{\varSigma }}^{(s)})^{-1}] \overline{\varvec{\varDelta \ell }}_t\) are formed jointly across all groups for each time instance \(t\). As the white-noise vectors are stochastically independent throughout time, these epoch-wise contributions can simply be summed, as shown in (61). Next, the residuals \(\widehat{\mathbf {e}}_t^{(s+1)}\) and the associated matrix \(\widehat{\mathbf {E}}^{(s+1)}\) are computed as in (46) and (47), respectively.
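As a sketch (with our own naming, and with the filtered quantities (59)–(60) assumed precomputed), the weighted normal equations (61) can be accumulated epoch-wise and solved as follows:

```python
import numpy as np

def cm_step_xi(Xbar, dlbar, w, Sigma):
    """First CM-step: solve the weighted normal equations (61).

    Xbar  : (n, N, m) filtered Jacobi matrices (60)
    dlbar : (n, N) filtered incremental observations (59)
    w     : (n,) current E-step weights
    Sigma : (N, N) current cofactor matrix estimate
    """
    Sinv = np.linalg.inv(Sigma)
    n, N, m = Xbar.shape
    normal_matrix = np.zeros((m, m))
    rhs = np.zeros(m)
    for t in range(n):                      # sum the epoch-wise contributions
        WS = w[t] * Sinv
        normal_matrix += Xbar[t].T @ WS @ Xbar[t]
        rhs += Xbar[t].T @ WS @ dlbar[t]
    return np.linalg.solve(normal_matrix, rhs)   # incremental parameter estimates
```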

Regarding the second CM-step, one obtains from the first-order conditions with respect to the VAR coefficient matrices (see Appendix B.3)

$$\begin{aligned} \begin{bmatrix} \widehat{\mathbf {A}}^{(1)(s+1)}&\cdots&\widehat{\mathbf {A}}^{(p)(s+1)} \end{bmatrix}&= \begin{bmatrix} \widehat{\mathbf {e}}_1^{(s+1)}&\cdots&\widehat{\mathbf {e}}_n^{(s+1)} \end{bmatrix} \widetilde{\mathbf {W}}^{(s)} \widehat{\mathbf {E}}^{(s+1)} \nonumber \\&\quad \times \left( \widehat{\mathbf {E}}^{(s+1)T} \widetilde{\mathbf {W}}^{(s)} \widehat{\mathbf {E}}^{(s+1)} \right) ^{-1}. \end{aligned}$$
(62)

The new filter \(\widehat{\mathbf {A}}^{(s+1)}(L)\) enables the estimation of the white-noise errors \(\widehat{\mathbf {u}}_t^{(s+1)}\) through (49). These enter, alongside the current weights, the third CM-step, in which the cofactor matrix \(\varvec{\varSigma }\) is estimated. Observe that the relevant first two terms of the Q-function (57) are similar to the log-likelihood function of the multivariate Gauss–Markov model with normally distributed observations, so that the first-order conditions with respect to \(\varvec{\varSigma }^{-1}\) can be shown to lead to (see Appendix B.4)

$$\begin{aligned} \widehat{\varvec{\varSigma }}^{(s+1)} = \frac{1}{n} \sum \limits _{t=1}^n {\widetilde{w}}_t^{(s)} \widehat{\mathbf {u}}_t^{(s+1)} \widehat{\mathbf {u}}_t^{(s+1)T}. \end{aligned}$$
(63)
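A compact sketch of the second and third CM-steps follows, under our own convention that the lagged-residual matrix \(\widehat{\mathbf {E}}\) stacks \([\widehat{\mathbf {e}}_{t-1}^T,\ldots ,\widehat{\mathbf {e}}_{t-p}^T]\) row-wise with zero initialization of the pre-sample residuals:

```python
import numpy as np

def cm_steps_var_sigma(E, w, p):
    """Second and third CM-steps: VAR coefficients (62) and cofactor matrix (63).

    E : (n, N) colored-noise residuals ê_t
    w : (n,) current E-step weights
    p : VAR model order
    """
    n, N = E.shape
    # lagged residual matrix: row t holds [ê_{t-1}', ..., ê_{t-p}']
    E_lag = np.zeros((n, N * p))
    for j in range(1, p + 1):
        E_lag[j:, (j - 1) * N:j * N] = E[:-j]
    W = np.diag(w)
    # (62): [A^(1) ... A^(p)] as a weighted least-squares solution
    A = E.T @ W @ E_lag @ np.linalg.inv(E_lag.T @ W @ E_lag)
    # decorrelation filtering, cf. (49): û_t = ê_t - sum_j A^(j) ê_{t-j}
    U = E - E_lag @ A.T
    # (63): reweighted sample covariance of the white-noise residuals
    Sigma = (w[:, None] * U).T @ U / n
    return A, U, Sigma
```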

Finally, the first-order condition of the log-likelihood function with respect to \(\nu \) yields the fourth CM-step (in the form of a CME step, derived similarly as for stochastic model A in Appendix B.5), consisting of the zero search

$$\begin{aligned} 0&= \log {\widehat{\nu }}^{(s+1)} + 1 - \psi \left( \frac{{\widehat{\nu }}^{(s+1)}}{2} \right) + \psi \left( \frac{{\widehat{\nu }}^{(s+1)}+N}{2} \right) \nonumber \\&\quad - \log \left( {\widehat{\nu }}^{(s+1)}+N \right) + \frac{1}{n} \sum \limits _{t=1}^n \left( \log {\widetilde{w}}_t^{(s+1)} - {\widetilde{w}}_t^{(s+1)} \right) \end{aligned}$$
(64)

where

$$\begin{aligned} {\widetilde{w}}_t^{(s+1)} = \frac{{\widehat{\nu }}^{(s+1)}+N}{{\widehat{\nu }}^{(s+1)} + \mathbf {u}_t^T (\widehat{\varvec{\varSigma }}^{(s+1)})^{-1} \mathbf {u}_t} \end{aligned}$$
(65)

and \(\mathbf {u}_t = \widehat{\mathbf {A}}^{(s+1)}(L)(\varvec{\ell }_t - \mathbf {h}_t(\widehat{\varvec{\xi }}^{(s+1)}))\). The following Algorithm 2 summarizes the computational steps of the E- and M-step under the current stochastic model B.

[Algorithm 2]
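The zero search (64)–(65) can be implemented, e.g., with a bracketing root finder; the following sketch (names and bracket are ours) uses SciPy:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def cme_step_nu(U, Sigma, bracket=(0.1, 200.0)):
    """Fourth CM-step (CME): solve the zero search (64) for the df.

    U      : (n, N) white-noise residuals û_t^{(s+1)}
    Sigma  : (N, N) cofactor matrix estimate from (63)
    bracket: search interval for the root; may need widening in practice
    """
    n, N = U.shape
    q = np.einsum('ti,ij,tj->t', U, np.linalg.inv(Sigma), U)

    def g(nu):
        w = (nu + N) / (nu + q)                  # weights (65)
        return (np.log(nu) + 1.0 - digamma(nu / 2.0)
                + digamma((nu + N) / 2.0) - np.log(nu + N)
                + np.mean(np.log(w) - w))        # left-hand side of (64)

    return brentq(g, *bracket)
```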

3.3 Model selection

In practical situations, it might not be clear at the outset which VAR model order p or which stochastic model (A or B) is most adequate for the given observations. Therefore, this section provides some techniques for identifying the model that is best in a statistical sense. First, a statistical white noise test is elaborated, which enables one to assess the capability of an estimated VAR model to fully capture the auto- and cross-correlation pattern. Subsequently, a standard information criterion adapted to the generic observation models of Sect. 2 is defined.

3.3.1 Multivariate portmanteau test

The portmanteau test by Box and Pierce (1970) may be used to test whether the residuals of an autoregressive moving average (ARMA) model are uncorrelated. Hosking (1980) suggested an extension of the portmanteau statistic to vector ARMA (VARMA) models, of which the VAR models considered in our current paper constitute special cases. This statistic has an approximate Chi-square distribution with df \(N^2(h-p)\) and may be written as

$$\begin{aligned} P = n \sum _{l=1}^h \text {vec}(\widehat{\varvec{\varSigma }}_l)^T \left( \widehat{\varvec{\varSigma }}_0^{-1} \otimes \widehat{\varvec{\varSigma }}_0^{-1} \right) \text {vec}(\widehat{\varvec{\varSigma }}_l), \end{aligned}$$
(66)

where h is the maximum lag to be considered and \(\widehat{\varvec{\varSigma }}_l\) the empirical covariance matrix at lag l with respect to the decorrelation-filtered residuals \(\widehat{\mathbf {u}}_t\), estimated by

$$\begin{aligned} \widehat{\varvec{\varSigma }}_l = \frac{1}{n} \sum \limits _{t=1}^{n-l} \widehat{\mathbf {u}}_t \widehat{\mathbf {u}}_{t+l}^T \quad (l = 0,\ldots ,h). \end{aligned}$$
(67)

In particular, the matrix \(\widehat{\varvec{\varSigma }}_0\) occurring in (66) represents the empirical covariance matrix at lag \(l=0\). Furthermore, “\(\text {vec}\)” is the vectorization operator that stacks all columns of a matrix within a single column vector, and \(\otimes \) is the Kronecker product. In view of the similarity of this lag-dependent covariance matrix estimate with the third CM-step (63), we propose to replace estimates (67) by the reweighted estimates

$$\begin{aligned} \widehat{\varvec{\varSigma }}_l = \frac{1}{n} \sum \limits _{t=1}^{n-l} {\widetilde{w}}_t \widehat{\mathbf {u}}_t \widehat{\mathbf {u}}_{t+l}^T \quad (l = 0,\ldots ,h) \end{aligned}$$
(68)

in connection with the multivariate t-distribution of stochastic model B. Similarly, the weights regarding the univariate t-distributions of stochastic model A can be taken into account by

$$\begin{aligned} \widehat{\varvec{\varSigma }}_l = \frac{1}{n} \sum \limits _{t=1}^{n-l} \sqrt{{\widetilde{w}}_t {\widetilde{w}}_{t+l}} \, \widehat{\mathbf {u}}_t \widehat{\mathbf {u}}_{t+l}^T \quad (l = 0,\ldots ,h). \end{aligned}$$
(69)

In Sect. 4.2 the effects of these modifications on the portmanteau test’s type-I error rate are quantified in a Monte Carlo simulation.
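A sketch of the (re)weighted portmanteau statistic, under the convention that \(\widehat{\varvec{\varSigma }}_l\) is formed from the \(n-l\) available residual pairs (function and variable names are ours):

```python
import numpy as np
from scipy.stats import chi2

def portmanteau(U, h, p, w=None, model='B'):
    """Portmanteau statistic (66) with covariances (67), (68) or (69).

    U : (n, N) decorrelation-filtered residuals û_t
    h : maximum lag, p : VAR model order
    w : optional (n,) weights; model 'B' selects (68), model 'A' selects (69)
    """
    n, N = U.shape

    def cov_lag(l):
        Ua, Ub = U[:n - l], U[l:]                 # pairs (û_t, û_{t+l})
        if w is None:
            wl = np.ones(n - l)                   # unweighted case (67)
        elif model == 'B':
            wl = w[:n - l]                        # reweighting (68)
        else:
            wl = np.sqrt(w[:n - l] * w[l:])       # reweighting (69)
        return (wl[:, None] * Ua).T @ Ub / n

    S0inv = np.linalg.inv(cov_lag(0))
    K = np.kron(S0inv, S0inv)
    P = 0.0
    for l in range(1, h + 1):
        v = cov_lag(l).flatten(order='F')         # vec: stack columns
        P += v @ K @ v
    P *= n
    return P, chi2.sf(P, N**2 * (h - p))          # statistic and p value
```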

3.3.2 Akaike information criterion

Various information criteria for selecting an adequate AR model order p have been proposed (cf. Brockwell and Davis 2016, Sect. 5.5). Although our focus is not on forecasting, we follow these authors’ recommendation to adopt a criterion that strongly penalizes, and thus avoids, overly large model orders. Since the assumption that the auto- and cross-correlation pattern of a measured geodetic time series is always generated by an AR process appears to be unrealistic (see Sect. 5), the Bayesian information criterion (BIC) is not considered in the sequel. Instead, the Akaike information criterion (AIC) and a bias-corrected version of the AIC, commonly termed AIC corrected (AICC), will be used for VAR model selection (cf. Brockwell and Davis 2016, Sects. 5.5.2 and 8.6). Since the usual assumption of normally distributed white noise is in general not justified for the stochastic models A and B, the general definitions of the AIC and AICC (cf. Burnham and Anderson 2002, Sects. 2.2 and 2.4) are considered here. Using log-likelihood functions (20) and (27) and letting

$$\begin{aligned} K_A&= m + N^2 \cdot p + 2N, \end{aligned}$$
(70)
$$\begin{aligned} K_B&= m + N^2 \cdot p + N^2 + 1 \end{aligned}$$
(71)

be the corresponding total numbers of unknown model parameters (where N is the number of groups of observations, m the number of parameters of the deterministic model, and p the order of the VAR process specified by the user) for stochastic models A and B, the corresponding AIC and AICC are then given by

$$\begin{aligned} \mathrm {AIC}_A&= -2\log L_A(\widehat{\varvec{\theta }};\varvec{\ell }) + 2K_A, \end{aligned}$$
(72)
$$\begin{aligned} \mathrm {AIC}_B&= -2\log L_B(\widehat{\varvec{\theta }};\varvec{\ell }) + 2K_B, \end{aligned}$$
(73)
$$\begin{aligned} \mathrm {AICC}_A&= -2\log L_A(\widehat{\varvec{\theta }};\varvec{\ell }) + 2K_A + \frac{2K_A(K_A + 1)}{N \cdot n - K_A - 1}, \end{aligned}$$
(74)
$$\begin{aligned} \mathrm {AICC}_B&= -2\log L_B(\widehat{\varvec{\theta }};\varvec{\ell }) + 2K_B + \frac{2K_B(K_B + 1)}{N \cdot n - K_B - 1}. \end{aligned}$$
(75)

The required log-likelihood values are computable using the outputs of GEM Algorithms 1 and 2, whereas the scalar quantities \(K_A\) and \(K_B\) determined by (70)–(71) are computed in the setup of the adjustment problem.
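For instance, given the maximized log-likelihood value returned by the respective GEM algorithm, (70)–(75) amount to a few lines (a sketch with our own names):

```python
def information_criteria(logL, m, N, p, n, model='A'):
    """AIC and AICC (72)-(75) with parameter counts (70)-(71).

    logL : maximized log-likelihood, m : deterministic-model parameters,
    N : number of groups, p : VAR order, n : observations per time series
    """
    K = m + N**2 * p + (2 * N if model == 'A' else N**2 + 1)
    aic = -2.0 * logL + 2 * K
    aicc = aic + 2.0 * K * (K + 1) / (N * n - K - 1)
    return aic, aicc
```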

4 Monte Carlo results

4.1 Parameter estimation

In the first part of this section, the performance of Algorithms 1 and 2 in producing correct estimates within a Monte Carlo simulation is studied. For this purpose, generated measurements are approximated by a 3D circle defined by the (nonlinear) equations

$$\begin{aligned} x_t&= \ell _{1,t} = h_{1,t}(c_x,c_y,c_z,r,\varphi ,\omega ) \nonumber \\&= -r\cos (T_t)\cos (\varphi ) + c_x \end{aligned}$$
(76)
$$\begin{aligned} y_t&= \ell _{2,t} = h_{2,t}(c_x,c_y,c_z,r,\varphi ,\omega ) \nonumber \\&= r\cos (T_t)\sin (\varphi )\sin (\omega ) + r\sin (T_t)\cos (\omega ) + c_y \end{aligned}$$
(77)
$$\begin{aligned} z_t&= \ell _{3,t} = h_{3,t}(c_x,c_y,c_z,r,\varphi ,\omega ) \nonumber \\&= -r\cos (T_t)\sin (\varphi )\cos (\omega ) + r\sin (T_t)\sin (\omega ) + c_z, \end{aligned}$$
(78)

where the center coordinates \(c_x,c_y,c_z\), the radius r and the rotation angles \(\varphi ,\omega \) about the x- and y-axis, respectively, are treated as unknown parameters to be estimated. The (simulated) true values are specified by

$$\begin{aligned} \varvec{\xi } = \begin{bmatrix} c_x \\ c_y \\ c_z \\ r \\ \varphi \\ \omega \end{bmatrix} = \begin{bmatrix} -1663.1 \\ 1223.4 \\ 1.6 \\ 29.7 \\ 0 \\ 0 \end{bmatrix}. \end{aligned}$$

This model approximately describes the trajectory of two GNSS receivers mounted on a terrestrial laser scanner (TLS), which rotates about the z-axis. The GNSS data are observed for the purpose of geo-referencing the TLS point clouds by using 3D coordinates in a coordinate frame of superordinate precision (see Paffenholz 2012, for details of this multi-sensor system and the geo-referencing methodology). In our current simulation study, the time instances are simply determined by the angles \(T_t = (t-1) \cdot \frac{2\pi }{n}\) for \(t \in \{ 1,\ldots ,n \}\), where \(n = 1000\), \(n = 10{,}000\) or \(n = 100{,}000\) is the total number of modeled 3D points \(\varvec{\ell }_t = [x_t,y_t,z_t]^T\). Thus, the measurements are equidistantly distributed along the circle. The numerical realizations \(\mathbf {e}_t = [e_{1,t},e_{2,t},e_{3,t}]^T\) of the corresponding random errors are assumed to be determined by the VAR process

$$\begin{aligned} \begin{bmatrix} e_{1,t} \\ e_{2,t} \\ e_{3,t} \end{bmatrix} = \begin{bmatrix} 0.5653 & -0.0066 & -0.0197 \\ 0.0150 & 0.6657 & 0.0102 \\ -0.0431 & 0.0207 & 0.7577 \end{bmatrix} \begin{bmatrix} e_{1,t-1} \\ e_{2,t-1} \\ e_{3,t-1} \end{bmatrix} + \begin{bmatrix} u_{1,t} \\ u_{2,t} \\ u_{3,t} \end{bmatrix}, \end{aligned}$$

where random numbers \(\mathbf {u}_t = [u_{1,t},u_{2,t},u_{3,t}]^T\) for the white-noise components are generated based on the following instances of stochastic models A and B (a simulation sketch follows the list):

  1. (A1)

    stochastic model A with an approximate normal distribution (\(\nu = 120\)) and different variance components:

    $$\begin{aligned} \mathcal {U}_{1,t}&{\mathop {\sim }\limits ^{\mathrm {ind.}}} t_{120}(0,\sigma _0^2), \\ \mathcal {U}_{2,t}&{\mathop {\sim }\limits ^{\mathrm {ind.}}} t_{120}(0,2\sigma _0^2), \\ \mathcal {U}_{3,t}&{\mathop {\sim }\limits ^{\mathrm {ind.}}} t_{120}(0,4\sigma _0^2), \end{aligned}$$

    with \(\sigma _0^2 = 0.001^2\).

  2. (B1)

    stochastic model B with an approximate normal distribution (\(\nu = 120\)) and a fully populated cofactor matrix:

    $$\begin{aligned} \begin{bmatrix} \mathcal {U}_{1,t} \\ \mathcal {U}_{2,t} \\ \mathcal {U}_{3,t} \end{bmatrix} {\mathop {\sim }\limits ^{\mathrm {ind.}}} t_{120}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \; \sigma _0^2 \cdot \begin{bmatrix} 1 & 0.98 & 1.4 \\ 0.98 & 2 & 1.96 \\ 1.4 & 1.96 & 4 \end{bmatrix} \right) , \end{aligned}$$

    where the given cofactor matrix reflects the assumption that the entries of \(\varvec{\mathcal {U}}_t\) are pairwise correlated with correlation coefficient \(\rho = 0.7\).

  3. (A2)

    stochastic model A with variance components as in model A1 and different small dfs:

    $$\begin{aligned} \mathcal {U}_{1,t}&{\mathop {\sim }\limits ^{\mathrm {ind.}}} t_3(0,\sigma _0^2), \\ \mathcal {U}_{2,t}&{\mathop {\sim }\limits ^{\mathrm {ind.}}} t_4(0,2\sigma _0^2), \\ \mathcal {U}_{3,t}&{\mathop {\sim }\limits ^{\mathrm {ind.}}} t_5(0,4\sigma _0^2). \end{aligned}$$

  4. (B2)

    stochastic model B with cofactor matrix as in model B1 and small df:

    $$\begin{aligned} \begin{bmatrix} \mathcal {U}_{1,t} \\ \mathcal {U}_{2,t} \\ \mathcal {U}_{3,t} \end{bmatrix} {\mathop {\sim }\limits ^{\mathrm {ind.}}} t_3 \left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \; \sigma _0^2 \cdot \begin{bmatrix} 1 & 0.98 & 1.4 \\ 0.98 & 2 & 1.96 \\ 1.4 & 1.96 & 4 \end{bmatrix} \right) . \end{aligned}$$
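In Python (instead of MATLAB’s trnd/mvtrnd), one MC sample for scenario A2 could be generated as follows; the seed and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
T = np.arange(n) * 2.0 * np.pi / n          # equidistant points on the circle
cx, cy, cz, r, phi, om = -1663.1, 1223.4, 1.6, 29.7, 0.0, 0.0

# true 3D coordinates according to (76)-(78)
x = -r * np.cos(T) * np.cos(phi) + cx
y = r * np.cos(T) * np.sin(phi) * np.sin(om) + r * np.sin(T) * np.cos(om) + cy
z = -r * np.cos(T) * np.sin(phi) * np.cos(om) + r * np.sin(T) * np.sin(om) + cz

# white noise of scenario A2: independent scaled t variables
s0 = 0.001
U = np.column_stack([s0 * rng.standard_t(3, n),
                     np.sqrt(2) * s0 * rng.standard_t(4, n),
                     2 * s0 * rng.standard_t(5, n)])

# colored noise via the VAR(1) recursion, then the simulated observations
A1 = np.array([[0.5653, -0.0066, -0.0197],
               [0.0150,  0.6657,  0.0102],
               [-0.0431,  0.0207,  0.7577]])
E = np.zeros((n, 3))
for t in range(1, n):
    E[t] = A1 @ E[t - 1] + U[t]
obs = np.column_stack([x, y, z]) + E
```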

In total, 1000 Monte Carlo (MC) samples of the white-noise components for the stochastic models A1–B2 were randomly generated using MATLAB’s trnd and mvtrnd routines. The resulting simulated observations in models A1 and A2 were then adjusted by means of GEM Algorithm 1, whereas the samples in models B1 and B2 were adjusted using Algorithm 2. The maximum number of iterations was set to \(\mathrm {itermax} = 100\), and the convergence thresholds to \(\varepsilon = 10^{-8}\) and \(\varepsilon _{\nu } = 0.0001\). For each MC run i the root-mean-square error (RMSE) was computed for different parameters. For example, the RMSE for the center of the circle was obtained by

$$\begin{aligned} \mathrm {RMSE}_c^{(i)} = \left[ \left( c_{x} - {\hat{c}}_x^{(i)} \right) ^2 + \left( c_{y} - {\hat{c}}_y^{(i)} \right) ^2 + \left( c_{z} - {\hat{c}}_z^{(i)} \right) ^2 \right] ^{1/2}, \end{aligned}$$

where \(c_{x},c_{y},c_{z}\) and \({\hat{c}}_x^{(i)},{\hat{c}}_y^{(i)},{\hat{c}}_z^{(i)}\), respectively, are the true values and the estimates. The RMSE for the other parameters shown in Table 1 is defined similarly (the RMSE with respect to the VAR coefficients was also determined collectively). Only the results for the Student random error models A2 and B2 are given since the RMSE values for the Gaussian random error models A1 and B1 were similar. To compute the interval estimates, the empirical 2.5 and 97.5 percentiles were determined from the sorted MC estimates; e.g., \({\widehat{r}}_{0.025}\) and \({\widehat{r}}_{0.975}\) for the radius estimates \({\widehat{r}}^{(i)}\), so that

$$\begin{aligned} P({\widehat{r}}_{0.025}< r < {\widehat{r}}_{0.975}) = 0.95. \end{aligned}$$

As it is intended to analyze the degree to which the bias is reduced by increasing the number of observations, the values are listed in scientific notation, i.e., with only one digit before the decimal point of the mantissa.
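In code, the per-run RMSE of the center and the empirical percentile bounds reduce to a few lines (a sketch; `est` denotes our hypothetical \(1000 \times 3\) array of MC center estimates):

```python
import numpy as np

def mc_summary(est, truth):
    """Per-run RMSE of the circle center and empirical 95% interval bounds.

    est   : (n_mc, 3) MC estimates of (c_x, c_y, c_z)
    truth : (3,) true center coordinates
    """
    rmse = np.linalg.norm(est - truth, axis=1)         # formula above
    lo, hi = np.percentile(est, [2.5, 97.5], axis=0)   # 2.5 and 97.5 percentiles
    return rmse, lo, hi
```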

Table 1 Monte Carlo RMSE statistics for the selected parameter estimates \({\widehat{c}}_{x},{\widehat{c}}_{y},{\widehat{c}}_{z}\) (circle center coordinates), \({\widehat{r}}\) (circle radius), \({\widehat{\varphi }}\) (one of the rotation angles), \(\widehat{\varvec{\alpha }}\) (all VAR coefficients), \({\widehat{\sigma }}_1^2\) (scale factor for the first time series), and \({\widehat{\nu }}_1\) (df for the first time series with model A2) or \({\widehat{\nu }}\) (df with model B2)

In the majority of cases, the mean, minimum and maximum RMSE decrease by one order of magnitude when the number of observations per time series is increased from 1000 to 100,000 for the multivariate Student random error model B2. With model A2 the improvement of the RMSE is notable but usually less than one order of magnitude. The largest RMSE values occur for the df and the VAR coefficients. It may therefore be concluded that these parameters are the most difficult to estimate. This problem is most obvious for the model involving univariate t-distributions. A similar finding was obtained through the MC study in Kargoll et al. (2018a) for the df and the (univariate) AR coefficients. We therefore conclude that the estimation of the df of the multivariate Student random error model is much more accurate than that of the dfs of the univariate Student random error models. The improvement of the accuracy of the estimated df obtained for the former model when the number of observations per time series is increased from 1000 to 100,000 is clearly visible in histogram plots (Fig. 1): the histogram for \(n = 100{,}000\) is much narrower and more centered about the true value \(\nu = 3\) than the histograms for \(n = 1000\) and \(n = 10{,}000\). The fact that these histograms are skewed is a phenomenon well known in the context of estimating the df of a Student distribution (e.g., Koop 2003, Sect. 6.4.3).

4.2 White noise test

Reweighted and unweighted multivariate portmanteau statistics (66) based on (67)–(69) were evaluated for 1000 MC runs with respect to all four simulation models A1–B2, generating either \(n = 1000\), \(n = 10{,}000\) or \(n = 100{,}000\) observations per time series. For each of these scenarios, the number of times the white-noise hypothesis was rejected by the test at significance level 0.05 was counted, and the obtained numbers were divided by the number of MC samples. Ideally, the resulting rejection rates (see Table 2) equal the significance level, since the actual frequency of falsely rejecting a true null hypothesis should correspond to the prescribed probability of a type-I error under the test distribution used. As the MC simulation is a closed-loop simulation, in the sense that the same model is used both for the generation of the observations and their subsequent adjustment, large deviations of the rejection rates from the significance level indicate that the model is not estimated well by the EM algorithm or that the test distribution does not apply exactly. To differentiate such systematic deviations from random fluctuations, the rejection rates were validated by an additional MC simulation using newly generated samples. The resulting differences between the rejection rates of the two MC simulations turned out to be a few per mille. Based on these results, it can be concluded that the differences between the four rejection rates for model A1 are not significant. Furthermore, for model B1 there is no substantial difference in rate between the reweighted and unweighted applications of the portmanteau test, whereas increasing the number n of observations per time series from 1000 to 100,000 reduces the deviations from the significance level notably (giving even the exact value 0.05 for the reweighted test). With the Student models A2 and B2, the reweighted test performs slightly better than the unweighted test, whereas an increase in the number of observations does not bring about a general improvement. The reweighted test produced rejection rates between 0.036 and 0.076, which do not grossly deviate from the specified significance level of 0.05. We therefore conclude that the EM algorithm and the approximate test distribution both yield satisfactory estimation and testing results.

5 Analysis of Global Positioning System time series from the EUREF Permanent Network

5.1 Introduction

The monitoring of geophysical phenomena such as tectonic velocity can be achieved by analyzing the site velocities of Global Positioning System (GPS) permanent stations. For the proper interpretation of the results and the deduction of physically relevant processes, knowledge of the noise characteristics of the GPS coordinate time series is required. This includes a description of their temporal correlations, the omission of which affects the uncertainty of the estimated quantities, leading to an unrealistic estimation of precision (Mao et al. 1999; Altamimi et al. 2002; Dong et al. 2006; Santamaria-Gomez et al. 2011; Klos et al. 2018). Cross-correlations between time series, sometimes called spatial correlations, should also be considered, particularly for regional networks (Williams et al. 2004). This kind of correlation is due, e.g., to atmospheric effects and the mismodeling of satellite orbits or Earth Orientation Parameters (Klos et al. 2015a). A reduction of their impact can be achieved by the use of stacking filtering techniques (Wdowinski et al. 1997). Another approach is to model the temporal and spatial correlations within the framework of a multivariate regression model. Amiri-Simkooei (2009) made use of least-squares variance component estimation (LS-VCE) to assess the noise characteristics of daily GPS time series of permanent stations, restricting attention to the estimation of a combination of white and flicker noise. Whereas many empirical studies confirmed this modeling approach (e.g., Amiri-Simkooei et al. 2007; Teferle et al. 2008; Kenyeres and Bruyninx 2009), other studies found the noise to be a combination of different power-law noises: white, flicker and/or random walk, or fractional Brownian motion (see, e.g., Mao et al. 1999; Klos et al. 2015b). Spectral indices estimated by MLE under the assumption of normally distributed noise (Langbein and Johnson 1997; Williams et al. 2004) were shown to depend on the station or the component under consideration (North, East, Up, abbreviated in the following by N, E and U, respectively).

Fig. 1 Histograms of the estimated df of model B2 for \(n = 1000\) (blue), \(n = 10{,}000\) (orange) and \(n = 100{,}000\) (yellow)

Table 2 Rejection rate of the null hypothesis that the decorrelation-filtered residuals \(\hat{{\mathbf {u}}}_1,\ldots ,\hat{{\mathbf {u}}}_n\) are uncorrelated both throughout time and across the three time series

Nearly all previous studies that fitted an AR model to GPS coordinate time series neglected the potential cross-correlations and used a simplified AR model (cf. Didova et al. 2016). The strength of our methodology, as developed in the previous sections, is that it allows specifically for their estimation. We thus follow the results of previous noise analyses and model the noise of GPS coordinate time series as a combination of white and correlated noise. To take potential cross-correlations into account, a VAR model is used, whose coefficients are estimated alongside functional parameters such as the intercept, trend and amplitudes of harmonic functions of the GPS position time series. As the outlier characteristics are expected to be heterogeneous, stochastic model A is applied, which allows for individual variance components and dfs of the univariate t-distributions associated with the different time series. As we do not estimate a fractal dimension by setting up an autoregressive fractionally integrated moving average (ARFIMA) model (see Hosking 1981), which is beyond the scope of the present paper, a direct comparison between the results obtained with the VAR and power-law approaches to describing the noise structure can hardly be made. We believe, however, that our method allows for an alternative, complementary determination of the noise structure with respect to the usual method based on spectral analysis, while simultaneously achieving a robust estimation of the parameters of the deterministic observation model. To achieve a robust adjustment of the GPS data, t-distributions are used for two reasons: Firstly, the t-distributions are used as outlier distributions to deal with outliers in the GPS time series. To the best knowledge of the authors, there currently is no other modeling framework that incorporates a general deterministic model, multiple outliers as well as a VAR (auto- and cross-correlation) model. Secondly, there exists an intimate connection between t-distributions and the well-known Matérn covariance model, which has been successfully employed in the stochastic modeling of GPS time series (e.g., Kermarrec and Schön 2014). Using the concept of scale invariance, Schön et al. (2018) show the relationship between the t-distribution model and processes having a power spectral density, such as correlated processes from a Matérn model. Both models share the property of being heavy-tailed. Moreover, whereas most previous studies concentrate on the analysis of the noise of GPS coordinate time series of a single station (univariate model), we make use of a multivariate regression. Auto-correlations as well as cross-correlations between GPS coordinate time series at different stations can thus be estimated in one step and, if needed, accounted for in further calculations. Specific statistical tests for multivariate observations are further used to judge the goodness of the decorrelation.

Fig. 2 Location of the four stations under consideration with their mutual distances

5.2 Description of the dataset

The multivariate regression is applied to the GPS daily coordinate time series provided by the EUREF Permanent Network (EPN) (cf. Bruyninx et al. 2012). For the sake of simplicity and without loss of generality, we restrict our analysis to the N component. The aim of this case study is not to provide a detailed and general description of the noise structure of GPS position time series, but to highlight a possible usage of the proposed algorithm. Further contributions will analyze more systematically how the correlated noise of the N, E and U components or the velocity can be modeled by means of a VAR process. Four stations are considered in the following; their respective locations are shown in Fig. 2. We intentionally selected three stations close to each other (CPAR, GOPE and CTAB, class A stations located in the Czech Republic), thereby creating a small network with a maximum distance between stations of approximately 100 km. The station KRAW (a class B station of the EPN) is 400 km away from the other stations and allows us to analyze medium-distance cross-correlations. A total of 171.85 weeks (i.e., approximately 3.3 years) of solutions was downloaded, starting at GPS week 1824 and ending at GPS week 1995.857. We thus followed the recommendation of Bos et al. (2010) to avoid absorbing correlated noise content into the estimated trends of the time series and obtaining unreliable seasonal signals (cf. Blewitt and Lavallée 2002). In order to test the feasibility of the proposed adjustment technique, neither were outliers removed nor data gaps interpolated. Instead, missing solutions were filled with random numbers, which we generated using the MATLAB function trnd as following a t-distribution with df \(\nu = 5\) and scaling \(\sigma = 1\) mm. The latter value is larger than the estimated standard deviation (0.57 mm) of the white-noise components of the time series. Although it may seem crude to insert (usually undesired) outliers instead of interpolating the missing values, our robust multivariate regression approach downweights the outlying observations within the IRLS algorithm and thus yields parameter estimates that are affected little by irregular observations. It could further be shown that outliers with higher or lower standard deviation (10 mm and 0.5 mm were tested) did not change the VAR coefficients by more than 1 percent, indicating a high robustness of the algorithm. The corresponding results are not presented for the sake of brevity.

5.3 Functional model

We describe the deterministic part of the observations as a combination of a mean, a linear trend and periodic effects. The vector for the \(N = 4\) groups of observations is expressed as \(\varvec{\ell }_t = [\ell _{\mathrm {GOPE},t},\ell _{\mathrm {KRAW},t},\ell _{\mathrm {CPAR},t},\ell _{\mathrm {CTAB},t}]^T\). Following Klos et al. (2015a), we estimate an annual and a semiannual period, i.e., 365 and 365/2 days. Since the estimation of lower periodicities (cf. Amiri-Simkooei et al. 2007) did not influence the coefficients of the VAR process by more than 2 percent, they will not be considered in this contribution. We acknowledge that further investigations with more observations are needed to decide whether additional frequency parameters should be estimated and to adapt their determination to station specificities. This is, however, beyond the scope of the present paper on robust multivariate regression. Interested readers should see Klos et al. (2018) for an analysis of the impact of misspecifying the functional model on velocity estimates, as well as Santamaria-Gomez et al. (2011) for a detailed description of the frequencies present in GPS coordinate time series. Consequently, we simplify the model chosen by Amiri-Simkooei (2009) and express the functional model for each time series as

$$\begin{aligned} h_{k,t}(\varvec{\xi }) = a_0 + a_1 t + \sum _{j=1}^2 \left[ A_j \cos \omega _j t + B_j \sin \omega _j t \right] \end{aligned}$$
(79)

where \(a_0\) is the intercept and \(a_1\) the slope or trend, which can be interpreted as the permanent GPS station velocity (Klos et al. 2015b). \(A_j\) and \(B_j\) are the coefficients with respect to frequency \(f_j = \omega _j / (2\pi )\), where \(\omega _j\) is the angular velocity. As mentioned previously, two periods were considered as having a significant amplitude, corresponding to \(\omega _1 = \frac{2\pi }{365}\) and \(\omega _2 = \frac{2\pi }{365/2}\). Thus, the chosen functional model contains six parameters per time series to estimate. Here, we apply stochastic model A (Sect. 2.1), where the white-noise components are considered to be independently t-distributed, which is justified by the assumption that the noise is station specific.
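Since (79) is linear in its six parameters, the per-station design (Jacobi) matrix can be assembled directly; a minimal sketch with the epochs in days (names are ours):

```python
import numpy as np

def design_matrix(t):
    """Design matrix of (79) w.r.t. (a0, a1, A1, B1, A2, B2).

    t : (n,) observation epochs in days
    """
    w1 = 2.0 * np.pi / 365.0        # annual angular velocity
    w2 = 2.0 * np.pi / (365.0 / 2)  # semiannual angular velocity
    return np.column_stack([np.ones_like(t), t,
                            np.cos(w1 * t), np.sin(w1 * t),
                            np.cos(w2 * t), np.sin(w2 * t)])
```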

5.4 Determination of the vector-autoregressive model order

5.4.1 Application of Akaike’s Information Criterion

The order p of the VAR model to be estimated was determined by means of the Akaike information criterion based on the log-likelihood function \(L_A\) with respect to the univariate t-distributions, as presented in Sect. 3.3.2. For this purpose, the order was varied from \(p = 1\) up to \(p = 38\) in steps of 1, and the AIC as well as the AICC value was computed for each model. The results are presented in Fig. 3 (top) only for the AIC since the AICC gave very similar values. Using the proposed functional model (79), the order \(p = 8\) is clearly identified as the optimal one since the AIC attains its minimum there. It should be mentioned that the identified order may vary depending on the modeling of the periodic components, in particular on the number and values of the modeled frequencies. A similar interaction between a Fourier model and the characteristics of a fitted univariate AR model was observed in Kargoll et al. (2018a, Sect. 5). The reason for this is the AR model’s capacity to capture certain systematic effects not included in the deterministic model.

Fig. 3 Results of the AIC and portmanteau test (p value) obtained for VAR model orders \(p = 1,\ldots ,38\)

Fig. 4 Original observations (top), colored-noise residuals (middle) and white-noise residuals (bottom); the four segments of these time series are sorted per station versus time in days

5.4.2 Application of the portmanteau test

The multivariate portmanteau test as presented in Sect. 3.3.1 was applied to the adjustment results obtained for the \(N=4\) stations and VAR model orders \(p = 1,\ldots ,38\), as was done for the AIC. The significance level was defined as \(\alpha = 0.05\), and the maximum lag to be considered was set to \(h=20\). Due to the lack of practical recommendations in the literature, these somewhat arbitrary choices were adopted from MATLAB’s Ljung–Box test, which corresponds to the univariate portmanteau test. Then, to obtain more detailed information, the p values resulting from the applications of the multivariate portmanteau test were computed. Since the test is one-sided, in the sense that the null hypothesis of white noise is rejected if the value of test statistic (66) exceeds the critical value of the Chi-square distribution with df \(N^2(h-p)\), the p value (denoted by pv below) is defined as the probability that the portmanteau test statistic exceeds the computed value of the test statistic. This probability can be computed by

$$\begin{aligned} pv = 1 - F_{\chi ^2(4^2 \cdot (20-p))}(P), \end{aligned}$$
(80)

where P is the value of test statistic (66) and \(F_{\chi ^2(4^2 \cdot (20-p))}\) the cumulative distribution function of the Chi-square distribution with df \(4^2 \cdot (20-p)\). A p value below \(\alpha \) indicates that the test statistic exceeds the critical value (and vice versa). Thus, the p value pv indicates the strength of evidence in favor of the white-noise hypothesis, which is rejected in case of \(pv < \alpha \; (= 0.05)\). Figure 3 (bottom) shows that the p value pv increases sharply between the VAR model orders \(p = 14\) and \(p = 15\), taking values \(pv < 0.05\) for \(p < 15\) and \(pv > 0.05\) for \(p \ge 15\). Thus, \(p=15\) constitutes the smallest VAR model order for which the portmanteau test accepts the white-noise hypothesis. A p value \(pv < 0.015\) is obtained for the optimal model order of 8 found by means of the AIC, so that the two model selection procedures prefer quite different orders. This difference could be linked to an inaccuracy of the functional model. Indeed, the need for the greater number of coefficients may reflect an overfitting of the VAR model and indicate the presence of nonperiodic signals and/or amplitude-varying harmonic components, as displayed by the estimated white-noise time series in Fig. 4. As low-order models should be preferred for the description of VAR processes, we consider a VAR model order of \(p=8\) as more adequate to describe the noise structure.

5.5 Results

5.5.1 Vector-autoregressive coefficients

Figure 5 depicts the 32 coefficients of the fitted VAR(8) model which describe the correlation structure of the time series of the station GOPE as well as the cross-correlations between GOPE and the three other time series (stations CPAR, CTAB and KRAW, respectively). The plotted coefficients, which are sorted by distance, generally decrease in value with increasing lag \(j = 1,\ldots ,8\). However, no clear physically explainable dependency on distance can be deduced from Fig. 5. The coefficients tend to become smaller as the distance between stations grows, although the sixth coefficient for GOPE–KRAW is surprisingly larger than the others. This pattern was confirmed for the other stations (not presented here for the sake of brevity). This large coefficient may be the reason why an optimal VAR model order of \(p = 8\) was found with the AIC. The first VAR coefficient is greater than the others (distance 0), i.e., current values depend more strongly on the immediately preceding values than on values at higher time lags. This dependency does not hold true for the cross-correlation coefficients. Figure 6 shows the standardized coefficients, i.e., the coefficients divided by their estimated standard deviations. The latter were extracted, after execution of the final EM iteration step, from the inverted normal equation matrix in (48) multiplied by the corresponding variance component estimate. The standardized coefficients clearly have greater values for the auto-correlation of the GOPE time series (distance 0), which are determined with a lower variance than the cross-correlation coefficients.

Fig. 5 Estimated VAR coefficients, sorted per distance, for the time series of station GOPE, represented by stars with different colors

Fig. 6 Normalized VAR coefficients (log plot)

5.5.2 Degree of freedom and weights

The df of the t-distribution was found to reach 30 for all time series, indicating a low level of heavy-tailedness or outlier contamination. The histogram of the adjusted weights is presented in Fig. 7. As found in Koch and Kargoll (2013) and Kargoll et al. (2018a), the weights increase rather smoothly from smaller to larger values. The white-noise components associated with small weights are located in the tails of the t-distribution, so that the corresponding observations could be labeled as outliers.

Fig. 7 Histogram plot of the weights

5.5.3 Interpretation of the residuals

The colored-noise residuals \(\widehat{\mathbf {e}}_1,\ldots ,\widehat{\mathbf {e}}_n\) and the white-noise residuals \(\widehat{\mathbf {u}}_1,\ldots ,\widehat{\mathbf {u}}_n\) are plotted together with the original observations in Fig. 4. As already mentioned in Sect. 5.4.2, some oscillations which do not correspond to physically plausible ones remain in these residuals (see, e.g., Santamaria-Gomez et al. 2011). These harmonics may stem from periodic loading or multipath effects that (i) cannot be modeled with the chosen sine and cosine functions or (ii) have varying amplitudes. The residuals of KRAW, classified as a “B station” by the EPN, clearly show that the low quality of the observations is transferred into the residuals. In such a case, the VAR coefficients take the role of “waste containers” in the sense that they express residual deterministic effects rather than the actual auto- and cross-correlations, so that the coefficients lose their physical meaning as a noise description. As an example, Stralkowski et al. (1970) show how certain periodic signals can be modeled by an AR(2) process. Alternatively, a periodic autoregressive moving average (ARMA) model could be used, or an ARFIMA model to take the fractal dimension additionally into consideration. This problem may be the subject of further research, although such fine differences are mostly interesting for the purpose of forecasting, which is not a goal of our study. Time-varying (V)AR coefficients to account for the time variability of the correlations (cf. Kargoll et al. 2018b) are another option to investigate, particularly for longer coordinate time series.

5.5.4 Ratio white noise to correlated noise

The ratios of the standard deviation of the white-noise residuals to the standard deviation of the correlated-noise residuals are given in Table 3. For station GOPE, this ratio is slightly larger than for the other stations. This result matches the conclusions obtained by Amiri-Simkooei (2009) for the ratio of flicker-noise to white-noise amplitude. The power-law description of the noise and its AR modeling both have advantages and shortcomings: Whereas the power laws may be difficult to estimate accurately, the AR coefficients by themselves are hardly physically interpretable. We believe that the two approaches are complementary and benefit from each other for a better comprehension of the correlation structure. Further analysis should be performed in that direction.

Table 3 Ratio of standard deviations of the white and correlated noise for the four time series analyzed

6 Conclusions and outlook

Two structurally different stochastic outlier models for multivariate time series involving generally nonlinear deterministic model functions and a VAR model have been introduced. The first stochastic model is based on univariate-scaled t-distributions for the different time series (groups) and allows for group-specific df (i.e., outlier characteristics), but not for correlations between white-noise variables at the same time instance. Thus, it is recommended in situations where time series from independent, heterogeneous sensors are combined. The second model is based on a multivariate-scaled t-distribution and allows for the modeling of correlations between white-noise variables at the same time instance, but not for group-specific df. This model in turn may be useful when time series from different, but inter-connected sensor components are fused. These two stochastic frameworks are therefore viewed as complementary. Two corresponding GEM algorithms were derived, which show that the estimation of the VAR model is based on normal equations that are structurally similar to those for the estimation of a single or multiple independent AR model(s). The extension of the univariate to the multivariate scaled t-distribution does not complicate the algorithm significantly, either.

The MC closed-loop simulation demonstrated a fast and stable convergence of the proposed GEM algorithms for a nonlinear observation model. Furthermore, it was shown that the bias decreased for the majority of estimated parameters when the number of observations was increased. The accurate estimation of the df of the model involving univariate t-distributions remains a challenging task to be investigated in future research. Moreover, it was demonstrated that the empirical size of the standard multivariate portmanteau test using the Chi-square distribution is reasonably close to the specified significance level. As this test allows for an evaluation of the fitted VAR model’s capacity to fully de-correlate the colored-noise residuals, it may also serve as a tool for selecting an adequate VAR model order, as an alternative to Akaike’s information criterion derived for the two stochastic models considered.

The GEM algorithm derived for the stochastic model based on univariate-scaled t-distributions and group-specific df was tested on observations from the EPN for a particular case study with four stations. We applied the AIC, which is based directly on the maximum likelihood estimates obtained by means of the GEM algorithm, to identify the relatively most adequate order of the VAR model describing the auto- and cross-correlation structure of the GPS coordinate time series. We also employed the multivariate portmanteau test to evaluate the VAR noise models of different orders in an absolute sense, by testing whether the decorrelation of the estimated colored-noise residuals by means of the inverted VAR models resulted in white noise or not. Thus, a VAR model of order 8 was found to be relatively optimal by the AIC, whereas an order of 15 was preferred by the portmanteau test. These findings highlight that an AR(1) model, as often assumed, is suboptimal for describing the correlation structure. However, no physically explainable dependency of the coefficient values on the distance between stations could be found, a finding we link to the respective quality of the observations. The presented approach is an alternative to the usual power-law noise description. It would, therefore, be very valuable to compare the performance of these different noise models in detail. Such comparisons were beyond the scope of our present contribution since it is focused on the methodology of stochastic Student-VAR modeling, which is general enough to be applicable to different types of geodetic time series.

Further investigations to improve the modeling of the noise structure could include, e.g., ARFIMA or periodic ARMA models. Additionally, the impact of the stochastic model chosen on velocity estimates should be investigated. Since these time series have gaps, it would appear to be a useful further step to extend the proposed GEM algorithms to handle such missing observations as well.